Metrics to measure Reliability
- MTTF (Mean Time To Failure):
- Time from the begining of deployment till the first failure is reported
- MTBF (Mean Time Between Failures):
- Time between Two Failures
- MTTR (Mean Time To Repair):
- Time Taken to repair (resolve the incident)
Post-Mortem
- Blameless
- Every Person involved in Incident Respose will have to prepare this document
- Initial Draft of this document will/can be from Incident Commander
- In Post-mortem unlike RCA (Root Cause Analysis), document can speak about multiple factors contribute the Error.
- All the actions taken, what can be done to eliminate the same problem from reocurrance/
- Sample Postmortem Doc
Ansible and Chef
- Enable Debug Logs
- Integrate Ansible/Chef Logs to Monitoring System
- Production Deployments should be Canary
Docker, Kubernetes
- Enable Logging Drivers
- Change K8s deployment strategy to be Canary
Monitoring
- Make monitoring Systems Observations