DevOps Classroom Series – 02/Aug/2020 (Evening) – Direct DevOps from Quality Thought

Refer Here for slides
Refer Here if you are interested in books
SRE (Site Reliability Engineering) is an approach adopted, practised and advocated by Google for running applications in production.
In simple terms, SRE is Google's way of doing DevOps.
Principles of DevOps
1. Accept Failure as Normal
2. Implement Gradual changes
3. Leverage Tooling and Automation
4. Measure Everything
DevOps gives you an abstract idea, whereas SRE is about practical implementation.
This is often summed up in the phrase "class SRE implements DevOps".

We define 3 key terms:
1. Service Level Indicator (SLI)
2. Service Level Objective (SLO)
3. Service Level Agreement (SLA)
Service Level Indicator:
- A defined quantitative measure of some aspect of the level of service provided by your application.
- Most services consider request latency (how long it takes to get a response to a request from your application/service).
- Another interesting indicator is the error rate.
- Examples:
  1. My application will have a request latency of less than 100 ms for 99.99% of the requests in the last 10 minutes
  2. Our identity service will have an error rate of less than 0.5% for the last 10,000 requests
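As a rough illustration of how the two indicators above could be computed from raw measurements, here is a minimal Python sketch. The function names and the sample data are hypothetical, and treating HTTP 5xx responses as errors is an assumption, not something fixed by the definition of an SLI.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """Percentage of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the measurement window")
    fast = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return 100.0 * fast / len(latencies_ms)


def error_rate_sli(status_codes):
    """Percentage of failed requests (assumption: HTTP 5xx counts as an error)."""
    if not status_codes:
        raise ValueError("no requests in the measurement window")
    errors = sum(1 for code in status_codes if code >= 500)
    return 100.0 * errors / len(status_codes)


# Hypothetical measurement window: 4 requests, one slow and one failed.
window_latencies = [12.0, 48.5, 99.0, 130.2]
window_statuses = [200, 200, 503, 200]

print(latency_sli(window_latencies))    # 75.0
print(error_rate_sli(window_statuses))  # 25.0
```

In practice the window would hold the requests from the last 10 minutes or the last 10,000 requests, as in the examples above.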
Service Level Objective:
- A target value or range of values for a service that is measured by an SLI
- We derive the SLO from the SLI.
- Generally, the short-window SLI figures are more aggressive than the SLO targets (99.99% versus 99.9% in the latency examples here).
- An SLO is a value derived from the SLI over a considerable period of time.
- Example:
  1. My application will respond with a latency of less than 100 ms for 99.9% of requests over a year.
  2. Our identity service will have an error rate of less than 0.6% over a period of 3 months
Service Level Agreement:
- This is an implicit or explicit contract with the user for meeting the SLOs.
- The consequence of not meeting an SLA is usually some form of compensation to the user (for example refunds or service credits).
- Examples:
  1. My application will respond with a latency less than 100 ms for 99.5% of requests over a year
  2. Our identity service will have an error rate of less than 0.75% over a period of 3 months
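To see how the SLO and the SLA relate, the sketch below compares a measured error rate for the identity service with the two limits used in the examples (0.6% for the SLO, 0.75% for the SLA). The function name and the status strings are hypothetical; this is only one simple way to express the idea.

```python
def evaluate_error_rate(measured_pct, slo_pct=0.6, sla_pct=0.75):
    """Compare a measured error rate (in percent) with the SLO and SLA limits."""
    if measured_pct < slo_pct:
        return "within SLO"
    if measured_pct < sla_pct:
        # The internal target is missed, but the promise to the user still holds.
        return "SLO missed, SLA still met"
    return "SLA breached"


# Hypothetical measurements over a 3 month period.
for measured in (0.4, 0.7, 0.9):
    print(measured, "->", evaluate_error_rate(measured))
```

The gap between the SLO and the SLA gives the team a chance to react before the agreement with the user is broken.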

If my website is available 99.99% of the time in a year,
it will be down 0.01% of the time,
which is about 52.6 minutes of downtime in a year (0.01% of 525,600 minutes).
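The same arithmetic can be repeated for other availability targets. The following is a small sketch, assuming a 365 day year; the function name is made up for this example.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_pct, period_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime permitted by an availability target over a period."""
    return period_minutes * (100.0 - availability_pct) / 100.0


for target in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes per year")
```

Each extra "nine" cuts the allowed downtime by a factor of ten.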

Toil is manual, repetitive work, such as restarting servers or taking database backups by hand before production upgrades.
More precisely, toil is the kind of work tied to running a production service that tends to be manual, repetitive and automatable, and that grows linearly as the service grows.
The solution to toil is to automate the activity, but we automate only when there is a return on investment (ROI).
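One simple way to judge the ROI is a break-even estimate: how many times must the task run before the time spent on automating it is recovered? The sketch below assumes effort is measured in hours; the function name and the numbers are hypothetical.

```python
import math


def automation_break_even_runs(build_hours, manual_hours_per_run,
                               automated_hours_per_run=0.0):
    """Number of runs after which the automation effort has paid for itself."""
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return None  # automating saves nothing per run, so it never pays off
    return math.ceil(build_hours / saved_per_run)


# Hypothetical case: 16 hours to script a database backup that takes
# 30 minutes each time it is done by hand.
print(automation_break_even_runs(build_hours=16, manual_hours_per_run=0.5))  # 32
```

If the task is expected to run far more often than the break-even count, automating it is likely worth the effort.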