DevOps Classroom Series – 25/Oct/2019

Metrics to measure Reliability

MTTF (Mean Time To Failure):
- Time from the begining of deployment till the first failure is reported
MTBF (Mean Time Between Failures):
- Time between Two Failures
MTTR (Mean Time To Repair):
- Time Taken to repair (resolve the incident)

Post-Mortem

Blameless
Every Person involved in Incident Respose will have to prepare this document
Initial Draft of this document will/can be from Incident Commander
In Post-mortem unlike RCA (Root Cause Analysis), document can speak about multiple factors contribute the Error.
All the actions taken, what can be done to eliminate the same problem from reocurrance/
Sample Postmortem Doc

Ansible and Chef

Enable Debug Logs
Integrate Ansible/Chef Logs to Monitoring System
Production Deployments should be Canary

Docker, Kubernetes

Enable Logging Drivers
Change K8s deployment strategy to be Canary

Monitoring

Make monitoring Systems Observations

By continuous learner

devops & cloud enthusiastic learner

View all of continuous learner's posts.

Leave a ReplyCancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Please turn AdBlock off

Floating Social Media Icons by Acurax Wordpress Designers