DevOps Classroom Series – 25/Oct/2019

Metrics to measure Reliability

  • MTTF (Mean Time To Failure):
    • Average time from the beginning of a deployment until the first failure is reported
  • MTBF (Mean Time Between Failures):
    • Average time between two consecutive failures
  • MTTR (Mean Time To Repair):
    • Average time taken to repair (resolve the incident)
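
The three metrics above can be computed from a simple incident log. The following is a minimal sketch, not a standard implementation: the function name `reliability_metrics`, the tuple layout `(failed_at, resolved_at)`, and the sample timestamps are all assumptions made for illustration. It follows the definitions in these notes, so the gap between failures is measured from one failure report to the next.

```python
from datetime import datetime
from statistics import mean


def reliability_metrics(deployed_at, incidents):
    """Return reliability metrics in hours for one deployment.

    deployed_at: datetime when the deployment began.
    incidents:   list of (failed_at, resolved_at) datetime tuples
                 (hypothetical layout, chosen for this sketch).
    """
    if not incidents:
        raise ValueError("at least one incident is required")

    incidents = sorted(incidents)  # order by failure time
    failures = [failed_at for failed_at, _ in incidents]

    # Time from the beginning of the deployment to the first reported failure.
    # Averaging this value over many deployments gives the MTTF.
    time_to_first_failure = (failures[0] - deployed_at).total_seconds()

    # Gaps between consecutive failure reports.
    gaps = [(b - a).total_seconds() for a, b in zip(failures, failures[1:])]

    # Time taken to resolve each incident.
    repairs = [(resolved - failed).total_seconds() for failed, resolved in incidents]

    return {
        "time_to_first_failure_h": time_to_first_failure / 3600,
        "mtbf_h": mean(gaps) / 3600 if gaps else None,  # needs two or more failures
        "mttr_h": mean(repairs) / 3600,
    }


# Made-up sample data for illustration only.
deployed = datetime(2019, 10, 1, 0, 0)
sample_incidents = [
    (datetime(2019, 10, 3, 0, 0), datetime(2019, 10, 3, 2, 0)),
    (datetime(2019, 10, 7, 0, 0), datetime(2019, 10, 7, 1, 0)),
    (datetime(2019, 10, 9, 0, 0), datetime(2019, 10, 9, 3, 0)),
]
metrics = reliability_metrics(deployed, sample_incidents)
print(metrics)
```

With the made-up sample data, the first failure comes 48 hours after deployment, the mean gap between failures is 72 hours, and the mean repair time is 2 hours.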

Post-Mortem

  • Blameless
  • Every person involved in the incident response will have to help prepare this document
  • The initial draft of this document will/can come from the Incident Commander
  • Unlike an RCA (Root Cause Analysis), a post-mortem document can discuss multiple factors that contributed to the error
  • It records all the actions taken and what can be done to prevent the same problem from recurring
  • Sample Postmortem Doc

Ansible and Chef

  • Enable Debug Logs
  • Integrate Ansible/Chef logs with the monitoring system
  • Production deployments should be canary deployments

Docker, Kubernetes

  • Enable Logging Drivers
  • Change K8s deployment strategy to be Canary
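
A canary deployment sends a small share of traffic to the new version and promotes it only if it behaves no worse than the current one. The sketch below shows one possible gating rule, comparing error rates. It is an illustration under stated assumptions, not an Ansible, Chef, or Kubernetes API: the function name `canary_decision`, the `(errors, requests)` tuples, and the `tolerance` and `min_requests` defaults are all hypothetical.

```python
def canary_decision(baseline, canary, tolerance=0.01, min_requests=100):
    """Decide what to do with a canary release.

    baseline, canary: (errors, requests) tuples for the current and new
                      versions (hypothetical layout for this sketch).
    tolerance:        extra error rate the canary is allowed over baseline.
    min_requests:     minimum canary traffic before any decision is made.

    Returns "wait", "promote", or "rollback".
    """
    baseline_errors, baseline_requests = baseline
    canary_errors, canary_requests = canary

    # Not enough canary traffic yet to judge the new version.
    if canary_requests < min_requests:
        return "wait"

    baseline_rate = baseline_errors / baseline_requests if baseline_requests else 0.0
    canary_rate = canary_errors / canary_requests

    # Roll back when the canary is clearly worse than the current version.
    if canary_rate > baseline_rate + tolerance:
        return "rollback"
    return "promote"


# Made-up traffic numbers for illustration only.
print(canary_decision(baseline=(5, 1000), canary=(10, 200)))
```

In practice the error counts would come from the monitoring system that the deployment logs feed into, and the thresholds would be tuned per service.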

Monitoring

  • Make monitoring systems observable

By continuous learner

devops & cloud enthusiastic learner
