SLI, SLO and SLAs
- SLIs are used to define SLOs, which in turn help in coming up with SLAs
- An SLA is an SLO with consequences
SLI: The home page loads within 3 seconds, measured over a period of 10 minutes.
SLO: The home page loads within 3 seconds for 99.99% of requests over a period of one month.
SLA: If the home page does not load within 3 seconds for 99.95% of requests over a period of one month, the customer receives redeemable points.
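The relationship between the three can be sketched in a few lines of Python. This is only an illustration: the function names and the sample latencies are made up for the example, and a real SLI would come from a monitoring system rather than a list.

```python
def latency_sli(latencies_s, threshold_s=3.0):
    """Fraction of requests served within threshold_s seconds (the SLI)."""
    if not latencies_s:
        raise ValueError("no requests observed")
    good = sum(1 for t in latencies_s if t <= threshold_s)
    return good / len(latencies_s)


def meets_target(sli, target_percent):
    """True if the measured SLI satisfies a target such as 99.99."""
    return sli * 100 >= target_percent


# Hypothetical sample: 10,000 requests, 2 of them slower than 3 seconds.
sample = [1.2] * 9998 + [4.5, 6.0]
sli = latency_sli(sample)

print(f"SLI: {sli:.4%}")
print("Meets SLO (99.99%):", meets_target(sli, 99.99))
print("Meets SLA (99.95%):", meets_target(sli, 99.95))
```

With this sample the internal SLO is missed while the SLA is still met, which is why the SLO is usually set tighter than the SLA.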
Who defines What?
- Error budget: the amount of time for which the application is allowed to fail
SLA: 99% uptime for a period of one month (30 days)
Error budget = 30 days * 24 hours * (100 - 99) / 100 = 7.2 hours
The error budget has to be divided evenly across multiple known areas.
If the error budget is burnt down, SREs can impose restrictions on any new features during that period.
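The error budget arithmetic can be written as a small Python sketch. The function names, the 30-day month, and the 5 hours of downtime are assumptions chosen for illustration, not part of any standard tool.

```python
def error_budget_hours(slo_percent, days=30):
    """Allowed downtime in hours for a given uptime target over `days` days."""
    total_hours = days * 24
    return total_hours * (100 - slo_percent) / 100


def budget_remaining_hours(slo_percent, downtime_hours, days=30):
    """Error budget left after subtracting the downtime already incurred."""
    return error_budget_hours(slo_percent, days) - downtime_hours


budget = error_budget_hours(99)             # 99% uptime over 30 days
remaining = budget_remaining_hours(99, 5.0) # hypothetical: 5 hours of downtime so far
feature_freeze = remaining <= 0             # budget burnt down: restrict new features

print(f"Budget: {budget:.1f} h, remaining: {remaining:.1f} h, freeze: {feature_freeze}")
```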
- In the actual environment where the application is deployed, there may be risks that impact our SLAs
- If the service provider's availability is less than 99%, the application's SLA cannot be greater than 99%
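One way to reason about this ceiling is to multiply the availabilities of everything the application depends on. The sketch below assumes every dependency is a hard dependency (the application is down whenever any one of them is down) and that failures are independent. The function name and the numbers are illustrative.

```python
from math import prod


def availability_ceiling(dependency_availabilities):
    """Upper bound on application availability given hard, independent dependencies."""
    return prod(dependency_availabilities)


# Hypothetical stack: a cloud provider at 99% and a database service at 99.9%.
ceiling = availability_ceiling([0.99, 0.999])
print(f"Best achievable availability: {ceiling:.3%}")
```

Under these assumptions the application cannot promise more than the weakest provider offers, and each extra dependency lowers the ceiling further.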
- Toil is an activity which
- This toil can be an ideal candidate for automation, but we will not automate anything that doesn't add value
Scenario 1: Every year during Christmas, an engineer needs to restart 10 servers, which takes 20 minutes. Automating this would take 20 hours of SRE work. In this case, don't automate.
Scenario 2: Every week, all the logs from the web server need to be exported to blob storage. This activity takes 20 minutes of an SRE's time. Automating it would take 10 hours. In this case, we would automate.
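The two scenarios can be compared with a rough break-even calculation: how long it takes for the time saved to repay the time spent automating. This is a simplified sketch that ignores maintenance cost and error reduction, and the function names are made up for the example.

```python
def break_even_runs(automation_hours, manual_minutes_per_run):
    """Number of manual runs whose total time equals the automation effort."""
    return automation_hours * 60 / manual_minutes_per_run


def break_even_years(automation_hours, manual_minutes_per_run, runs_per_year):
    """Years until the automation effort is repaid by the time saved."""
    return break_even_runs(automation_hours, manual_minutes_per_run) / runs_per_year


# Scenario 1: 20 minutes once a year, 20 hours to automate.
scenario_1 = break_even_years(20, 20, runs_per_year=1)
# Scenario 2: 20 minutes every week, 10 hours to automate.
scenario_2 = break_even_years(10, 20, runs_per_year=52)

print(f"Scenario 1 pays back after {scenario_1:.0f} years")
print(f"Scenario 2 pays back after {scenario_2 * 52:.0f} weeks")
```

Scenario 1 would take decades to pay back, while Scenario 2 pays back within the first year, which matches the decisions above.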