Site Reliability Engineering (SRE)
- Refer Here for slides
- Refer Here if you are interested in books
- SRE is an approach adopted, practised and promoted by Google for running applications in production.
- In simple terms, SRE is Google's way of doing DevOps.
- Principles of DevOps:
  - Accept failure as normal
  - Implement gradual changes
  - Leverage tooling and automation
  - Measure everything
- DevOps gives you an abstract idea, whereas SRE is a practical implementation of it.
- This is often summed up as "class SRE implements DevOps".
Service Level Objectives
We define three key terms:
- Service Level Indicator (SLI)
- Service Level Objective (SLO)
- Service Level Agreement (SLA)
Service Level Indicator:
- A carefully defined quantitative measure of some aspect of the level of service provided by your application.
- Most services use request latency (how long it takes to get a response to a request sent to your application/service).
- Another useful indicator is error rate.
- Example indicators, measured over a short window:
  - In the last 10 minutes, 99.99% of requests to my application had a latency of less than 100 ms.
  - Over the last 10,000 requests, our identity service had an error rate of less than 0.5%.
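The two indicators above can be computed from raw request data. The following is a minimal sketch, not a production implementation: the function names are hypothetical, the 100 ms threshold comes from the example above, and treating only HTTP 5xx responses as errors is an assumption.

```python
from typing import Sequence


def latency_sli(latencies_ms: Sequence[float], threshold_ms: float = 100.0) -> float:
    """Fraction of requests in the window answered in under threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests in the window")
    fast = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return fast / len(latencies_ms)


def error_rate_sli(status_codes: Sequence[int]) -> float:
    """Fraction of requests in the window that failed (assumed: HTTP 5xx)."""
    if not status_codes:
        raise ValueError("no requests in the window")
    failed = sum(1 for code in status_codes if code >= 500)
    return failed / len(status_codes)


# Made-up sample window of four requests.
window_latencies = [12.0, 48.5, 97.0, 130.2]
window_statuses = [200, 200, 503, 200]

print(latency_sli(window_latencies))    # 0.75
print(error_rate_sli(window_statuses))  # 0.25
```

In practice these numbers usually come from a monitoring system rather than in-memory lists; the lists here only keep the sketch self-contained.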
Service Level Objective:
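- A target value or range of values for a service level that is measured by an SLI.
- We derive the SLO from the SLI.
- An SLI is the measurement; the SLO is the target that the measurement must meet. Short-window SLI readings are normally better than the long-term SLO, which leaves room for occasional bad periods.
- An SLO is evaluated against the SLI over a considerable period of time.
- Example objectives:
  - My application will respond with a latency of less than 100 ms for 99.9% of requests over a year.
  - Our identity service will have an error rate of less than 0.6% over a period of 3 months.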
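Because an SLO is judged over a long period, one way to evaluate it is to add up short-window counts and compare the overall ratio with the target. This is a rough sketch under stated assumptions: the function names are hypothetical, each window is a `(good_requests, total_requests)` pair, and the 99.9% default target is taken from the latency example above.

```python
from typing import Iterable, Tuple


def slo_compliance(windows: Iterable[Tuple[int, int]]) -> float:
    """Overall fraction of good requests across all measurement windows."""
    good = 0
    total = 0
    for good_in_window, total_in_window in windows:
        good += good_in_window
        total += total_in_window
    if total == 0:
        raise ValueError("no requests recorded in the period")
    return good / total


def slo_met(windows: Iterable[Tuple[int, int]], target: float = 0.999) -> bool:
    """True when the long-period compliance reaches the SLO target."""
    return slo_compliance(windows) >= target


# Three made-up windows; the middle one had two slow requests.
period = [(1000, 1000), (998, 1000), (1000, 1000)]
print(slo_met(period))  # True: 2998 of 3000 requests were good
```

Summing counts instead of averaging per-window percentages keeps busy windows from being under-weighted.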
Service Level Agreement:
- An implicit or explicit contract with the user about meeting the SLOs.
- The consequence of not meeting an SLA is usually some form of compensation to the user.
- Example agreements:
  - My application will respond with a latency of less than 100 ms for 99.5% of requests over a year.
  - Our identity service will have an error rate of less than 0.75% over a period of 3 months.
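The identity-service examples give a 0.6% error-rate objective and a looser 0.75% agreement, so a measured error rate falls into one of three bands. The sketch below illustrates that idea; the function name and the returned labels are hypothetical, and the default thresholds are simply the example values from these notes.

```python
def classify_error_rate(measured: float, slo: float = 0.006, sla: float = 0.0075) -> str:
    """Place a measured error rate relative to the SLO and SLA thresholds."""
    if measured < slo:
        return "within SLO"
    if measured < sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


print(classify_error_rate(0.004))  # within SLO
print(classify_error_rate(0.007))  # SLO missed, SLA still met
print(classify_error_rate(0.009))  # SLA breached
```

The gap between the two thresholds is what gives a team time to react before the contract with the user is broken.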
Risks and Error Budget
- Error budget: the amount of time (or share of requests) for which your application is allowed to fail within the SLO period, i.e. 100% minus the SLO.
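- Scenario 1: my website is available 99.99% of the time in a year.
  - The website may be down for 0.01% of the time.
  - 0.01% of a year (525,600 minutes) is about 52 minutes of allowed downtime per year.
- Scenario 2: my application will respond with a latency of less than 100 ms for 99.5% of requests over a year.
  - Up to 0.5% of the requests in that year may take 100 ms or longer.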
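The arithmetic behind an availability error budget can be sketched as follows. The function names are hypothetical and a 365-day year is assumed.

```python
def error_budget_minutes(slo: float, period_days: float = 365.0) -> float:
    """Allowed downtime in minutes for an availability SLO over the period."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    return (1.0 - slo) * period_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, period_days: float = 365.0) -> float:
    """Minutes of error budget left after the downtime already incurred."""
    return error_budget_minutes(slo, period_days) - downtime_minutes


print(round(error_budget_minutes(0.9999), 2))    # 52.56
print(round(budget_remaining(0.9999, 20.0), 2))  # 32.56
```

A negative result from `budget_remaining` would mean the budget for the period is already overspent.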
Toil
- Toil is manual, repetitive work, such as restarting servers or taking database backups by hand before production upgrades.
- Toil is the kind of work tied to running a production service that tends to be manual, repetitive and automatable, and that grows linearly as the service grows.
- The solution to toil is to automate the activity, but we automate only when there is a return on investment (ROI).
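One simple way to reason about the ROI of automating a task is a break-even estimate: how many runs does it take before the hours saved exceed the hours spent building the automation? The sketch below assumes ROI is measured purely in engineer-hours. All names and numbers are hypothetical.

```python
def break_even_runs(build_hours: float, manual_hours_per_run: float,
                    automated_hours_per_run: float = 0.0) -> float:
    """Number of runs after which the automation has paid for itself."""
    saved_per_run = manual_hours_per_run - automated_hours_per_run
    if saved_per_run <= 0:
        return float("inf")  # automation never pays off
    return build_hours / saved_per_run


def worth_automating(build_hours: float, manual_hours_per_run: float,
                     expected_runs: int, automated_hours_per_run: float = 0.0) -> bool:
    """True when the expected number of runs exceeds the break-even point."""
    return expected_runs > break_even_runs(
        build_hours, manual_hours_per_run, automated_hours_per_run)


# Made-up figures: 8 hours to script a backup that takes 30 minutes by hand.
print(break_even_runs(8, 0.5))       # 16.0
print(worth_automating(8, 0.5, 52))  # True for a weekly task over a year
print(worth_automating(8, 0.5, 10))  # False for a task run only 10 times
```

This ignores benefits that are harder to count in hours, such as fewer manual mistakes, so it is best treated as a lower bound on the value of automation.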