Site Reliability Engineering (SRE)
- Refer Here for the Site Reliability Engineering notes
- SRE is a set of principles based on how Google runs production systems
- Engineering approach to operations
- Basic problem statement
- Functions of Site Reliability Engineering
- Reducing Toil
- Managing Risk
- Handling Failures
- For any application there are four golden signals
- Latency: This is the time taken to send a request and receive a response
- Traffic: This is measured as the number of requests flowing across the network
- Errors: Errors can tell us about misconfigurations in infrastructure, bugs in application code or broken dependencies
- Saturation: This defines the load on your network and server resources
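As a rough illustration of the first three signals, the sketch below derives traffic, error rate and mean latency from a small request log. The log format (`<http_status> <latency_seconds>` per line) and the sample values are assumptions made up for this example, not output from a real system.

```shell
# Hypothetical request log: one line per request, "<http_status> <latency_seconds>".
sample_log='200 0.12
200 0.30
500 1.50
200 0.08
404 0.20'

# Traffic = number of requests, Errors = share of 5xx responses,
# Latency = mean response time.
# Saturation needs host metrics (CPU, memory, network) and cannot be
# derived from a request log, so it is left out here.
signals=$(printf '%s\n' "$sample_log" | awk '
  { total++; sum += $2; if ($1 >= 500) errors++ }
  END { printf "traffic=%d errors=%.1f%% avg_latency=%.2fs", total, 100 * errors / total, sum / total }')

echo "$signals"
```

Counting only 5xx responses as errors is a design choice for this sketch; some teams also count selected 4xx responses.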
- Service Level Indicator
- Success Rate: for every 5000 requests sent to the server, 4800 are successful
- SLI: 96% of requests successful
- Latency: of the last 5000 requests, 4000 completed in under 0.5 seconds, a further 600 within 2 seconds, and a further 300 within 5 seconds
- SLI: 80% of requests within 0.5 s, 92% within 2 s, 98% within 5 s
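The SLI arithmetic can be sketched in a few lines of shell. The request counts are the ones from the notes; the variable names and the `pct` helper are hypothetical. The latency buckets are treated as cumulative, so each threshold includes all faster requests.

```shell
total=5000
successful=4800
# Latency buckets: number of requests that finished in each band.
under_half_s=4000
half_to_2s=600
two_to_5s=300

# pct NUM DEN: print NUM/DEN as a percentage with one decimal place.
pct() { awk -v n="$1" -v d="$2" 'BEGIN { printf "%.1f", 100 * n / d }'; }

success_sli=$(pct "$successful" "$total")
lat_half_sli=$(pct "$under_half_s" "$total")
lat_2s_sli=$(pct $((under_half_s + half_to_2s)) "$total")
lat_5s_sli=$(pct $((under_half_s + half_to_2s + two_to_5s)) "$total")

echo "success rate SLI:   ${success_sli}%"
echo "latency <= 0.5 s:   ${lat_half_sli}%"
echo "latency <= 2 s:     ${lat_2s_sli}%"
echo "latency <= 5 s:     ${lat_5s_sli}%"
```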
- Service Level Objective:
- The application will be up and available 99.5% of the time over a year
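To see what a 99.5% availability objective allows in practice, the sketch below converts it into permitted downtime per year. It assumes a 365-day year and availability measured in wall-clock minutes; the variable names are hypothetical.

```shell
# 99.5% expressed in tenths of a percent, so the math stays in integers.
slo_tenths_pct=995
minutes_per_year=$((365 * 24 * 60))

# Downtime allowance = the share of the year NOT covered by the SLO.
budget_minutes=$((minutes_per_year * (1000 - slo_tenths_pct) / 1000))
budget_hours=$(awk -v m="$budget_minutes" 'BEGIN { printf "%.1f", m / 60 }')

echo "Allowed downtime: ${budget_minutes} minutes per year (about ${budget_hours} hours)"
```

This allowance is often called the error budget: the amount of unavailability the service can spend before it misses its objective.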
Observability in Elastic Stack
- Create an Elastic Stack Cloud account Refer Here
- After setup, run the Spring PetClinic application with the Elastic APM agent attached
- Exercise:
- Send metrics and logs, and enable tracing, for the Spring PetClinic application
- To Install
- Create an Ubuntu Linux machine
    sudo apt update
    sudo apt install openjdk-11-jdk -y
    wget https://storage.googleapis.com/qtreferenceapplications/spring-petclinic-2.4.2.jar
    wget https://repo1.maven.org/maven2/co/elastic/apm/elastic-apm-agent/1.23.0/elastic-apm-agent-1.23.0.jar
- Now run the Spring PetClinic application
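Running the application with the agent attached might look like the sketch below. The server URL and secret token are placeholders (assumptions) that must be replaced with the values from your own Elastic Cloud deployment, and the service name `spring-petclinic` is an arbitrary choice. The script only prints the command unless `RUN_APP=1` is set; `RUN_APP` is a hypothetical switch added for this example.

```shell
# Placeholders: override these with the APM endpoint and token of your deployment.
APM_SERVER_URL="${APM_SERVER_URL:-https://apm.example.invalid:443}"
APM_SECRET_TOKEN="${APM_SECRET_TOKEN:-changeme}"

# Build the java command line: attach the agent downloaded above and pass
# its settings as -Delastic.apm.* system properties.
set -- java \
  -javaagent:elastic-apm-agent-1.23.0.jar \
  -Delastic.apm.service_name=spring-petclinic \
  -Delastic.apm.server_urls="$APM_SERVER_URL" \
  -Delastic.apm.secret_token="$APM_SECRET_TOKEN" \
  -Delastic.apm.application_packages=org.springframework.samples.petclinic \
  -jar spring-petclinic-2.4.2.jar

apm_cmd="$*"
echo "Command: $apm_cmd"

# Dry run by default; set RUN_APP=1 to actually start the application.
if [ "${RUN_APP:-0}" = "1" ]; then
  "$@"
fi
```

Once the application is running and receiving requests, the service should show up under APM in Kibana, assuming the endpoint and token are correct.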
- Configure Heartbeat and Metricbeat to ship data to Elastic Cloud so it can be viewed in Kibana
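A minimal Heartbeat configuration for this exercise could be generated as sketched below. The Cloud ID and credentials are placeholders, the monitor id and name are made up, and the URL assumes PetClinic is listening on the Spring Boot default port 8080 on the same machine. Treat it as a starting point and check it against the Heartbeat version you install.

```shell
# Placeholders: copy the real values from the Elastic Cloud console.
CLOUD_ID="${CLOUD_ID:-<cloud-id-from-elastic-cloud-console>}"
CLOUD_AUTH="${CLOUD_AUTH:-elastic:<password>}"
CONFIG_FILE="${CONFIG_FILE:-heartbeat.yml}"

# Write one HTTP uptime monitor plus the Elastic Cloud connection settings.
cat > "$CONFIG_FILE" <<EOF
heartbeat.monitors:
  - type: http
    id: petclinic-http
    name: Spring PetClinic
    hosts: ["http://localhost:8080"]
    schedule: '@every 10s'

cloud.id: "${CLOUD_ID}"
cloud.auth: "${CLOUD_AUTH}"
EOF

echo "Wrote $CONFIG_FILE"
```

The same `cloud.id` and `cloud.auth` settings go into `metricbeat.yml`; host metrics can then be collected by enabling the system module with `metricbeat modules enable system`.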
