DevOps Classroom Series – 25/Oct/2019

Metrics to measure Reliability

  • MTTF (Mean Time To Failure):
    • Time from the beginning of a deployment until the first failure is reported
  • MTBF (Mean Time Between Failures):
    • Time between two consecutive failures
  • MTTR (Mean Time To Repair):
    • Time taken to repair (resolve the incident)
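
As a rough illustration of the three metrics above, here is a minimal Python sketch that derives them from a list of incident timestamps. The function name `reliability_metrics`, the tuple layout, and the sample dates are all assumptions made for this example and are not part of the class material. With a single deployment the first value is just the time to the first failure; averaging that value over several deployments would give MTTF.

```python
from datetime import datetime, timedelta
from statistics import mean


def reliability_metrics(deployed_at, incidents):
    """Compute reliability metrics for one deployment.

    incidents: list of (failure_reported_at, resolved_at) datetime tuples.
    Returns (time_to_first_failure, mtbf, mttr) as timedelta values;
    mtbf is None when fewer than two failures were reported.
    """
    if not incidents:
        raise ValueError("at least one incident is required")
    incidents = sorted(incidents)
    reported = [start for start, _ in incidents]

    # Time from the beginning of the deployment until the first reported failure.
    time_to_first_failure = reported[0] - deployed_at

    # Mean gap between consecutive failure reports.
    gaps = [(b - a).total_seconds() for a, b in zip(reported, reported[1:])]
    mtbf = timedelta(seconds=mean(gaps)) if gaps else None

    # Mean time taken to resolve an incident.
    repair_times = [(end - start).total_seconds() for start, end in incidents]
    mttr = timedelta(seconds=mean(repair_times))

    return time_to_first_failure, mtbf, mttr


# Hypothetical sample data: one deployment and three incidents.
deployed = datetime(2019, 10, 1)
sample_incidents = [
    (datetime(2019, 10, 5, 0, 0), datetime(2019, 10, 5, 2, 0)),    # 2 h to repair
    (datetime(2019, 10, 11, 0, 0), datetime(2019, 10, 11, 4, 0)),  # 4 h to repair
    (datetime(2019, 10, 15, 0, 0), datetime(2019, 10, 15, 3, 0)),  # 3 h to repair
]
ttf, mtbf, mttr = reliability_metrics(deployed, sample_incidents)
print("Time to first failure:", ttf)
print("MTBF:", mtbf)
print("MTTR:", mttr)
```

In this sample the gaps between failure reports are 6 and 4 days, so the MTBF works out to 5 days, and the repair times of 2, 4 and 3 hours average to an MTTR of 3 hours.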


  • The postmortem document is blameless
  • Every person involved in the incident response has to help prepare this document
  • The initial draft of this document can come from the Incident Commander
  • Unlike an RCA (Root Cause Analysis), a postmortem can discuss multiple factors that contributed to the error
  • It records all the actions taken and what can be done to keep the same problem from recurring
  • Sample Postmortem Doc

Ansible and Chef

  • Enable Debug Logs
  • Integrate Ansible/Chef logs into the monitoring system
  • Production deployments should be canary deployments

Docker, Kubernetes

  • Enable Logging Drivers
  • Change the K8s deployment strategy to canary
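
A canary deployment sends a small share of traffic to the new version and promotes it only if it looks healthy. The Python sketch below shows one possible promotion check based on error rates. The function name `canary_verdict`, the thresholds, and the request counts are hypothetical values chosen for illustration. They are not taken from Ansible, Chef, Docker or Kubernetes, and a real setup would read these numbers from the monitoring system.

```python
def canary_verdict(baseline_errors, baseline_requests,
                   canary_errors, canary_requests,
                   tolerance=0.01, min_requests=100):
    """Decide what to do with a canary release.

    Returns "wait" until the canary has served enough requests,
    "promote" if its error rate stays within `tolerance` of the
    baseline error rate, and "rollback" otherwise.
    """
    if canary_requests < min_requests:
        return "wait"  # not enough traffic yet to judge the canary

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    if canary_rate <= baseline_rate + tolerance:
        return "promote"
    return "rollback"


# Hypothetical counts, as they might be read from a monitoring system.
healthy = canary_verdict(10, 10000, 1, 500)     # canary 0.2% vs. baseline 0.1%
unhealthy = canary_verdict(10, 10000, 50, 500)  # canary 10% vs. baseline 0.1%
too_early = canary_verdict(10, 10000, 0, 20)    # only 20 canary requests so far
print(healthy, unhealthy, too_early)
```

Comparing against the baseline rather than a fixed threshold keeps the check meaningful when the stable version already has a non-zero error rate.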


  • Make monitoring Systems Observations

