DevOps Classroom Series – 02/Aug/2020 (Evening)

Site Reliability Engineering (SRE)

  • Refer Here for slides
  • Refer Here if you are interested in books
  • SRE is an approach adopted, practised and promoted by Google for running applications in production.
  • In simple terms, SRE is a way of doing DevOps (the way Google does it).
  • Principles of DevOps
    1. Accept Failure as Normal
    2. Implement Gradual changes
    3. Leverage Tooling and Automation
    4. Measure Everything
  • DevOps gives you an abstract idea, whereas SRE is about the practical implementation.
  • This is often summed up in the phrase "class SRE implements DevOps".

Service Level Objectives

  • We define 3 key terms

    1. Service Level Indicator (SLI)
    2. Service Level Objective (SLO)
    3. Service Level Agreement (SLA)
  • Service Level Indicator:

    • A carefully defined quantitative measure of some aspect of the level of service provided by your application.
    • Most services consider request latency (how long it takes to get a response to a request from your application/service).
    • Another interesting indicator is the error rate.
    • Examples:
      1. My application will have a request latency of less than 100 ms for 99.99% of the requests in the last 10 minutes
      2. Our identity service will have an error rate of less than 0.5% for the last 10000 requests
  • Service Level Objective:

    • A target value or range of values for a service that is measured by an SLI
    • We derive SLO from SLI.
    • Generally, the short-window SLI targets are more aggressive than the SLOs.
    • An SLO is a value derived from the SLI over a considerable period of time.
    • Examples:
      1. My application will respond with a latency less than 100 ms for 99.9% of requests over a year.
      2. Our identity service will have an error rate of less than 0.6% over a period of 3 months
  • Service Level Agreement:

    • This is an implicit or explicit contract with the user for meeting the SLOs.
    • The consequence of not meeting an SLA could be compensation to the user.
    • Examples:
      1. My application will respond with a latency less than 100 ms for 99.5% of requests over a year
      2. Our identity service will have an error rate of less than 0.75% over a period of 3 months
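
The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is only an illustration, not a monitoring implementation: the `Request` record, the function names and the sample data are all hypothetical, and the thresholds (100 ms, 99.9%, 0.6%) are borrowed from the SLO examples above.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float  # time taken to answer the request
    ok: bool           # False if the request ended in an error


def latency_sli(requests, threshold_ms=100.0):
    """SLI: percentage of requests answered in under threshold_ms."""
    if not requests:
        return 100.0  # assumption: no traffic counts as fully within target
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def error_rate_sli(requests):
    """SLI: percentage of requests that failed."""
    if not requests:
        return 0.0
    failed = sum(1 for r in requests if not r.ok)
    return 100.0 * failed / len(requests)


def meets_slo(requests, latency_target_pct=99.9, max_error_pct=0.6):
    """SLO check: compare the measured SLIs against the target values."""
    return (latency_sli(requests) >= latency_target_pct
            and error_rate_sli(requests) <= max_error_pct)


# Hypothetical sample: 1000 requests, of which 2 are slow and 3 failed
sample = ([Request(40.0, True)] * 995
          + [Request(250.0, True)] * 2
          + [Request(30.0, False)] * 3)

print(latency_sli(sample))     # share of requests under 100 ms, in percent
print(error_rate_sli(sample))  # share of failed requests, in percent
print(meets_slo(sample))       # whether both targets are met
```

The SLI functions only measure; the SLO is the target the measurement is compared against. An SLA would be the same kind of check with a looser target and a consequence attached.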

Risks and Error Budget

  • Error Budget: the amount of time (or number of requests) for which your application is allowed to fail without breaking its objective
  • Scenario 1:
    • My website is available 99.99% of the time in a year.
    • So my website may be down 0.01% of the time.
    • A year has 525,600 minutes, so my website may be down for about 52.6 minutes in a year.

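  • Scenario 2:
    • My application will respond with a latency less than 100 ms for 99.5% of requests over a year.
    • So up to 0.5% of the requests in that year may take 100 ms or longer; that is the error budget.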
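
The arithmetic behind both scenarios can be sketched in Python. The function names are made up for this illustration, a year is assumed to be 365 days, and the request volume of one million is a hypothetical figure, not something from the class.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Time-based error budget: minutes of allowed downtime in the period."""
    return period_minutes * (100.0 - availability_pct) / 100.0


def request_budget(target_pct, total_requests):
    """Request-based error budget: requests allowed to miss the target.

    The result is truncated to a whole number of requests.
    """
    return int(total_requests * (100.0 - target_pct) / 100.0)


# Scenario 1: 99.99% availability over a year
print(round(downtime_budget_minutes(99.99), 2))

# Scenario 2: 99.5% of requests under 100 ms, assuming 1,000,000 requests
print(request_budget(99.5, 1_000_000))
```

Once the budget is known, every outage or slow request spends part of it, and what is left tells the team how much risk it can still take in that period.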

Toil

  • Toil is manual, repetitive work, like restarting servers or taking backups of databases before production upgrades (if you are doing them manually).
  • Toil is the kind of work tied to running a production service that tends to be manual, repetitive and automatable, and that grows linearly as the service grows.
  • The solution to toil is to automate the activity, but we automate only when there is a return on investment (ROI).
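
One rough way to judge the ROI is to compare the hours the manual task would cost over some horizon with the hours needed to automate it. The sketch below is only that comparison; the function name and all the numbers are hypothetical, and it ignores things like the cost of maintaining the automation.

```python
def automation_payoff_hours(minutes_per_run, runs_per_month, months, build_hours):
    """Net hours saved by automating a task over the given horizon.

    A negative result means the automation costs more than it saves.
    """
    manual_hours = minutes_per_run * runs_per_month * months / 60.0
    return manual_hours - build_hours


# A 15-minute restart done 20 times a month, automated in 8 hours
print(automation_payoff_hours(15, 20, 12, 8))

# A 5-minute task done once a month, automated in 8 hours
print(automation_payoff_hours(5, 1, 12, 8))
```

The first task pays back the effort within the year, the second does not, so by this measure only the first is worth automating.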

Elastic Cloud
