Tale of a Food Delivery Application
- Architecture

- Possible failures:
  - Connectivity issues
  - Hardware issues
  - OS issues
  - Web server/app server failures
  - Applications
  - Performance bottlenecks
    - CPU
    - RAM
    - Disk
- Our goal is a low MTTR (mean time to recover) and a high MTBF (mean time between failures) and MTTF (mean time to failure).
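As a rough illustration of how MTTR and MTBF could be derived from an outage history, here is a minimal Python sketch. The function name `reliability_metrics` and the sample outage times are hypothetical. It assumes MTBF is measured as operating time per failure (uptime divided by failure count), which is one common convention; some teams instead count from the start of one failure to the start of the next.

```python
from datetime import datetime, timedelta


def reliability_metrics(window_start, window_end, outages):
    """Return (mttr, mtbf, availability) for one observation window.

    outages: list of (down_at, up_at) datetime pairs that lie inside the
    window and do not overlap. This input shape is an assumption made for
    the sketch, not a format taken from any monitoring tool.
    """
    failures = len(outages)
    if failures == 0:
        raise ValueError("need at least one outage to compute MTTR/MTBF")

    window = window_end - window_start
    downtime = sum((up - down for down, up in outages), timedelta())
    uptime = window - downtime

    mttr = downtime / failures      # average time spent recovering
    mtbf = uptime / failures        # average operating time per failure
    availability = uptime / window  # fraction of the window spent up
    return mttr, mtbf, availability


# Hypothetical 10-day window with two outages (1 hour and 3 hours).
sample_outages = [
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 3, 11, 0)),
    (datetime(2024, 1, 7, 2, 0), datetime(2024, 1, 7, 5, 0)),
]
mttr, mtbf, availability = reliability_metrics(
    datetime(2024, 1, 1), datetime(2024, 1, 11), sample_outages
)
```

With this sample data the window is 240 hours with 4 hours of downtime, so MTTR is 2 hours and MTBF is 118 hours. Shorter recoveries lower MTTR, and fewer failures raise MTBF.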
- To do this we need to monitor:
  - Heartbeat (is it alive?)
    - Servers
    - Applications
  - Performance
    - CPU utilization
    - Free disk space
    - Free RAM
  - Logs
  - Traces / profiling
- Centralized monitoring: all metrics, logs, and traces should be collected in one place, so that failures can be analysed or predicted without logging in to each server.
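To make the heartbeat and free-disk-space checks concrete, here is a minimal sketch using only the Python standard library. The function names, the 10% threshold, and the host and port in the usage line are hypothetical. A real setup would normally rely on an agent from one of the monitoring tools rather than a hand-written script.

```python
import shutil
import socket


def is_alive(host, port, timeout=2.0):
    """Heartbeat check: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def free_disk_percent(path="/"):
    """Free disk space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def run_checks(host, port, path="/", min_free_pct=10.0):
    """Return a list of alert strings; an empty list means all checks passed."""
    alerts = []
    if not is_alive(host, port):
        alerts.append(f"{host}:{port} is not responding")
    free_pct = free_disk_percent(path)
    if free_pct < min_free_pct:
        alerts.append(f"only {free_pct:.1f}% disk free on {path}")
    return alerts
```

A call such as `run_checks("app-server.example", 8080)` could run on a schedule and forward any alerts to the central system. Free RAM is left out because the standard library has no portable call for it.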
Monitoring Types
- There are two types of monitoring:
  - Server monitoring
  - Application monitoring (logs, traces, metrics)
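Application logs are easier to centralize when every line is machine-readable and says where it came from. Below is a minimal sketch of that idea using Python's standard `logging` module. The class name `JsonFormatter`, the field names, and the `orders` service name are assumptions chosen for illustration, not a schema required by any particular tool.

```python
import json
import logging
import socket


class JsonFormatter(logging.Formatter):
    """Format each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service            # hypothetical service label
        self.host = socket.gethostname()  # lets a central store filter by server

    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            "host": self.host,
            "message": record.getMessage(),
        })


# Usage: attach the formatter to a handler; a log shipper can then forward
# the resulting lines to the central system.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="orders"))
logger = logging.getLogger("orders")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.error("payment failed for order %s", "A123")
```

Because each line carries the host and service, one query in the central system can find all errors for a service across servers.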
- Software that can help:
  - Server monitoring:
    - Nagios
    - Zabbix
  - Application monitoring:
    - New Relic
    - AppDynamics
    - Splunk
    - Prometheus
  - Cloud monitoring:
    - AWS CloudWatch
    - Azure Monitor
- Tools we will cover:
  - Elastic Stack
  - Prometheus and Grafana (Kubernetes)
- Technologies to watch:
  - eBPF
  - OpenTelemetry
Methodologies
- ITIL
- Continuous Monitoring
- SRE
