Primary Reference
SLI (Service Level Indicator)
- A quantitative measure of one aspect of your application/service, such as availability, latency, or error rate
- Sample Indicators:
- Percentage of home page requests over the last 5 minutes served in less than 300 ms
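A latency indicator of this kind can be sketched in a few lines of Python. This is only an illustration: the function name `latency_sli` and the sample latencies are hypothetical, and a real system would read these values from its monitoring data rather than from a list.

```python
def latency_sli(latencies_ms, threshold_ms=300.0):
    """Return the fraction of requests served faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    # A request counts as "good" when it is strictly under the threshold.
    good = sum(1 for latency in latencies_ms if latency < threshold_ms)
    return good / len(latencies_ms)


# Hypothetical home page latencies (ms) observed in one 5-minute window.
window = [120.0, 95.0, 310.0, 250.0, 180.0]
sli = latency_sli(window)
print(f"SLI: {sli:.1%} of requests under 300 ms")
```

With these made-up numbers, four of the five requests are under 300 ms, so the indicator is 80%.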
SLO (Service Level Objective)
- A target value for an SLI, measured over a period of time
- Sample:
- Latency of the home page over a year will be less than 300 ms for 99.9% of requests
SLA (Service Level Agreements)
- Sample:
- Customer will be offered free credits if fewer than 99.5% of requests over a year achieve a latency of less than 300 ms
Problem
- We build systems and they fail at some point.
- What is the SRE approach to failures?
Risks
- You can make aggressive deployments as long as you are within the Error Budget.
- If the Error Budget is exceeded, no more deployments are allowed.
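The deployment rule above can be expressed as a simple check. This is a minimal sketch under assumed inputs: `can_deploy`, `budget_minutes`, and `consumed_minutes` are hypothetical names, and the numbers are invented for illustration.

```python
def remaining_budget(budget_minutes, consumed_minutes):
    """Minutes of allowed failure still unused (negative if exceeded)."""
    return budget_minutes - consumed_minutes


def can_deploy(budget_minutes, consumed_minutes):
    """Allow deployments only while some Error Budget is left."""
    return remaining_budget(budget_minutes, consumed_minutes) > 0


# Hypothetical yearly budget of 525.6 minutes.
print(can_deploy(525.6, 120.0))  # budget left, deployments allowed
print(can_deploy(525.6, 600.0))  # budget exceeded, deployments stop
```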
Error Budget
- Allowed amount of failure, in minutes or hours, over the SLO period.
- Sample:
- SLO: Latency will be less than 300 ms for 99.9% of requests over the year
- Error Budget is what is left of the total time after removing the SLO: (100 - 99.9) * 365 * 24 * 60 / 100 = 525.6 minutes/year
- SLO: Latency will be less than 300 ms for 99.99% of requests over the year
- Error Budget: (100 - 99.99) * 365 * 24 * 60 / 100 = 52.56 minutes/year
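The arithmetic in the samples above can be reproduced with a short Python sketch. The function name `error_budget_minutes` is hypothetical, and the calculation assumes a 365-day year, as the samples do.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def error_budget_minutes(slo_percent, period_minutes=MINUTES_PER_YEAR):
    """Minutes of allowed failure for an SLO given as a percentage."""
    return (100.0 - slo_percent) / 100.0 * period_minutes


for slo in (99.9, 99.99):
    budget = error_budget_minutes(slo)
    print(f"SLO {slo}% -> {budget:.2f} minutes/year")
```

Each extra nine in the SLO divides the budget by ten, which is why 99.99% leaves roughly 52.56 minutes per year compared with 525.6 minutes for 99.9%.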
Error Budget Burndown
- Error Budget used
- Fast Burndowns
- Slow Burndowns
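One common way to tell fast and slow burndowns apart is a burn rate: the observed failure fraction divided by the failure fraction the SLO allows. The sketch below assumes that definition. The names `burn_rate` and `classify` and the `fast_threshold` value of 10 are hypothetical choices for illustration, not values taken from these notes.

```python
def burn_rate(bad_requests, total_requests, slo_percent):
    """Observed failure fraction relative to the fraction the SLO allows.

    A rate of 1.0 would use up the Error Budget exactly at the end of
    the SLO period; higher rates exhaust it sooner.
    """
    if total_requests == 0:
        return 0.0
    observed = bad_requests / total_requests
    allowed = (100.0 - slo_percent) / 100.0
    return observed / allowed


def classify(rate, fast_threshold=10.0):
    """Label a burn rate; fast_threshold is an assumed cut-off."""
    if rate >= fast_threshold:
        return "fast burn"
    if rate > 1.0:
        return "slow burn"
    return "within budget"


# Hypothetical counts of slow requests against a 99.9% SLO.
for bad in (0, 2, 20):
    rate = burn_rate(bad, 1000, 99.9)
    print(f"{bad} bad of 1000 -> {classify(rate)}")
```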
Toil
- Repetitive manual work that can be automated
- Prioritize automating toil that occurs frequently over toil that occurs rarely