DevOps Classroom Series – 02/Aug/2020 (Evening)

Site Reliability Engineering (SRE)

  • Refer Here for slides
  • Refer Here if you are interested in books
  • SRE is an approach adopted, practised and promoted by Google for handling applications in production.
  • In simple terms, SRE is a way of doing DevOps (the way Google does it).
  • Principles of DevOps
    1. Accept Failure as Normal
    2. Implement Gradual changes
    3. Leverage Tooling and Automation
    4. Measure Everything
  • DevOps gives you an abstract idea, whereas SRE is about practical implementation.
  • This is often summed up by the phrase "class SRE implements DevOps".
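The phrase borrows from programming: DevOps plays the role of an interface and SRE is one concrete class that implements it. A minimal sketch of that metaphor follows. The method names are taken from the four principles listed above, while the returned strings are illustrative assumptions, not an official mapping.

```python
from abc import ABC, abstractmethod


class DevOps(ABC):
    """The abstract idea: says what should be done, not how."""

    @abstractmethod
    def accept_failure_as_normal(self) -> str: ...

    @abstractmethod
    def implement_gradual_changes(self) -> str: ...

    @abstractmethod
    def leverage_tooling_and_automation(self) -> str: ...

    @abstractmethod
    def measure_everything(self) -> str: ...


class SRE(DevOps):
    """One concrete way of doing DevOps (practices below are illustrative)."""

    def accept_failure_as_normal(self) -> str:
        return "agree on an error budget instead of expecting 100% uptime"

    def implement_gradual_changes(self) -> str:
        return "ship small changes so each failure stays cheap to fix"

    def leverage_tooling_and_automation(self) -> str:
        return "automate toil when the return justifies it"

    def measure_everything(self) -> str:
        return "track SLIs and compare them against SLOs"


sre = SRE()
for practice in (
    sre.accept_failure_as_normal,
    sre.implement_gradual_changes,
    sre.leverage_tooling_and_automation,
    sre.measure_everything,
):
    print(practice())
```

Trying to instantiate `DevOps` directly raises a `TypeError`, which mirrors the point that DevOps alone does not tell you what to do in practice.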

Service Level Objectives

  • We define 3 key terms:

    1. Service Level Indicator (SLI)
    2. Service Level Objective (SLO)
    3. Service Level Agreement (SLA)
  • Service Level Indicator:

    • A carefully defined quantitative measure of some aspect of the level of service provided by your application.
    • Most services consider request latency (how long it takes to get a response to a request from your application/service).
    • Another interesting indicator is error rate.
    • Example objectives measured over a short window:
      1. My application will have a request latency of less than 100 ms for 99.99% of the requests in the last 10 minutes.
      2. Our identity service will have an error rate of less than 0.5% for the last 10,000 requests.
  • Service Level Objective:

    • A target value or range of values for a service that is measured by an SLI
    • We derive SLO from SLI.
    • Generally, the short-term targets set on SLIs are more aggressive than the SLOs.
    • An SLO is a value derived from an SLI over a considerable period of time.
    • Example:
      1. My application will respond with a latency of less than 100 ms for 99.9% of requests over a year.
      2. Our identity service will have an error rate of less than 0.6% over a period of 3 months.
  • Service Level Agreement:

    • This is an implicit or explicit contract with the user for meeting the SLOs.
    • The consequence of not meeting an SLA is usually some form of compensation to the user (for example, a refund or service credits).
    • Examples:
      1. My application will respond with a latency of less than 100 ms for 99.5% of requests over a year.
      2. Our identity service will have an error rate of less than 0.75% over a period of 3 months.
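To make the relationship concrete, the sketch below computes two SLIs (share of fast requests and error rate) from a batch of requests and compares them with the SLO and SLA thresholds used in the examples above. The `Request` type, the function names and the sample data are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool  # False when the request ended in an error


def latency_sli(requests, threshold_ms=100.0):
    """Percentage of requests answered in less than threshold_ms."""
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return 100.0 * fast / len(requests)


def error_rate_sli(requests):
    """Percentage of requests that failed."""
    failed = sum(1 for r in requests if not r.ok)
    return 100.0 * failed / len(requests)


# Hypothetical sample: 1000 requests, of which 2 are slow and 4 failed.
sample = (
    [Request(40.0, True)] * 994
    + [Request(250.0, True)] * 2
    + [Request(30.0, False)] * 4
)

latency_pct = latency_sli(sample)   # the measured SLI values
error_pct = error_rate_sli(sample)

# Thresholds taken from the SLO and SLA examples in the notes.
LATENCY_SLO_PCT, LATENCY_SLA_PCT = 99.9, 99.5
ERROR_SLO_PCT, ERROR_SLA_PCT = 0.6, 0.75

latency_meets_slo = latency_pct >= LATENCY_SLO_PCT
latency_meets_sla = latency_pct >= LATENCY_SLA_PCT
error_meets_slo = error_pct < ERROR_SLO_PCT
error_meets_sla = error_pct < ERROR_SLA_PCT

print(f"latency SLI {latency_pct:.2f}%: SLO met={latency_meets_slo}, SLA met={latency_meets_sla}")
print(f"error SLI {error_pct:.2f}%: SLO met={error_meets_slo}, SLA met={error_meets_sla}")
```

In this sample the latency SLI misses the internal SLO while still satisfying the looser SLA, which is the reason for keeping the SLO stricter than the SLA: the team gets a warning before the contract with the user is broken.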

Risks and Error Budget

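  • Error budget: the amount of time (or share of requests) for which your application is allowed to fail without breaching its objective.
  • Scenario 1:
    • My website is available 99.99% of the time in a year.
    • My website may be down 0.01% of the time.
    • That allows about 52 minutes of downtime in a year (0.01% of 525,600 minutes).

  • Scenario 2:
    • My application will respond with a latency of less than 100 ms for 99.5% of requests over a year.
    • Up to 0.5% of requests in that year may take 100 ms or longer; that 0.5% is the error budget.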
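The arithmetic behind both scenarios can be sketched as follows. The function names, the one million requests per year and the 30 minutes of consumed downtime are assumptions made up for the example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed by an availability target."""
    return (100.0 - availability_pct) / 100.0 * period_minutes


def request_budget(target_pct, total_requests):
    """Number of requests allowed to miss a request-based target."""
    return (100.0 - target_pct) / 100.0 * total_requests


# Scenario 1: 99.99% availability over a year.
yearly_budget = downtime_budget_minutes(99.99)
print(f"Allowed downtime: {yearly_budget:.2f} minutes per year")

# Scenario 2: 99.5% of requests under 100 ms.
# Assumption: the service handles 1,000,000 requests in the year.
slow_allowed = request_budget(99.5, 1_000_000)
print(f"Requests allowed to be slow: {slow_allowed:.0f}")

# Assumption: outages have already used 30 minutes of the yearly budget.
remaining = yearly_budget - 30
print(f"Budget remaining: {remaining:.2f} minutes")
```

Once the remaining budget approaches zero, the usual reaction is to slow down risky releases until the budget recovers.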

Toil

  • Toil is manual, repetitive work, such as restarting servers or taking database backups before production upgrades (if you do them manually).
  • Toil is the kind of work tied to running a production service that tends to be manual, repetitive and automatable, and that grows linearly as the service grows.
  • The solution to toil is to automate the activity, but we automate only when there is a return on investment (ROI).
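One rough way to reason about ROI is a break-even count: how many times the task must run before the hours saved exceed the hours spent building the automation. The sketch below is a simplified model under stated assumptions. It ignores maintenance cost and the risk of manual mistakes, and all names and numbers are hypothetical.

```python
import math


def break_even_runs(build_hours, manual_hours_per_run, automated_hours_per_run=0.0):
    """Runs needed before the automation has paid for itself.

    Returns None when automation saves no time per run.
    """
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return None
    return math.ceil(build_hours / saved_per_run)


def worth_automating(build_hours, manual_hours_per_run, expected_runs,
                     automated_hours_per_run=0.0):
    """True when the expected number of runs reaches the break-even point."""
    runs = break_even_runs(build_hours, manual_hours_per_run, automated_hours_per_run)
    return runs is not None and expected_runs >= runs


# Assumption: a manual database backup takes 30 minutes and
# scripting it would take two working days (16 hours).
print(break_even_runs(16, 0.5))                     # runs needed to break even
print(worth_automating(16, 0.5, expected_runs=52))  # weekly backups for a year
print(worth_automating(16, 0.5, expected_runs=12))  # monthly backups for a year
```

Because toil grows linearly with the service, the expected number of runs tends to rise over time, so a task that is not worth automating today may cross the break-even point later.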

Elastic Cloud
