Site Reliability Engineering
- Refer Here for the SRE books published by Google.
- SRE is the engineering discipline that describes how Google runs its production systems.
- These engineering ideas were widely adopted by Google's customers and by other enterprises, and SRE is now an established job role.
- Refer Here for a presentation on SRE.
Observability
- Observability relies on collecting three major kinds of information about applications:
  - metrics: A numerical value measured over time (CPU, memory, latency, error rate)
  - logs: A text record of an event
    - levels:
      - information
      - warning
      - error
      - debug (verbosity levels)
  - traces: A record of the path a single request takes through the application and its services
- We integrate the above with an actionable alerting system.
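
The three signals and an alert rule can be sketched with the Python standard library alone. This is a minimal illustration, not a real monitoring stack: the `Telemetry` class, the `checkout` service name, the `handle_request` function, and the 25% error-rate threshold are all assumptions made up for the example.

```python
import logging
import time
import uuid
from dataclasses import dataclass, field

# Logs: text records tagged with a level (debug, info, warning, error).
logging.basicConfig(level=logging.DEBUG, format="%(levelname)s %(message)s")
log = logging.getLogger("checkout")  # hypothetical service name


@dataclass
class Telemetry:
    """In-memory stand-in for a metrics and tracing backend (illustrative only)."""
    requests: int = 0                          # metric: total requests
    errors: int = 0                            # metric: failed requests
    spans: list = field(default_factory=list)  # traces: one span per request

    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def handle_request(telemetry: Telemetry, trace_id: str, fail: bool) -> None:
    """Simulate one request and record a metric, a log line, and a trace span."""
    start = time.perf_counter()
    telemetry.requests += 1
    log.debug("trace=%s starting request", trace_id)
    if fail:
        telemetry.errors += 1
        log.error("trace=%s request failed", trace_id)
    else:
        log.info("trace=%s request succeeded", trace_id)
    telemetry.spans.append({
        "trace_id": trace_id,
        "name": "handle_request",
        "duration_s": time.perf_counter() - start,
    })


def should_alert(telemetry: Telemetry, threshold: float = 0.25) -> bool:
    """Actionable alert rule: fire only when the error rate exceeds the threshold."""
    return telemetry.error_rate() > threshold


telemetry = Telemetry()
for failed in (False, False, True, True):
    handle_request(telemetry, uuid.uuid4().hex, fail=failed)

if should_alert(telemetry):
    log.warning("error rate %.0f%% exceeds threshold", telemetry.error_rate() * 100)
```

In practice each signal would be shipped to a dedicated backend rather than held in memory; the trace ID written into every log line is what lets a log entry be correlated with its trace.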
- Centralized log aggregation tools:
  - Elasticsearch (with Logstash and Beats)
  - Splunk
  - Fluentd
  - Datadog (SaaS product)
- Metrics:
  - New Relic
  - Metricbeat => Elasticsearch
  - Nagios & Zabbix
- Tracing (APM):
  - AppDynamics
  - Elastic APM
- How to achieve observability:
  - Fluentd (log collection and forwarding)
  - Prometheus (metrics collection and alerting)
  - Grafana (dashboards and visualization)
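
Prometheus collects metrics by scraping an HTTP endpoint that serves them in a plain-text exposition format, and Grafana can then query Prometheus to draw dashboards. As a rough sketch of what an application exposes, the helper below renders samples in that text format. The function `render_prometheus` and the metric name `http_requests_total` with its labels are assumptions for illustration; a real application would normally use an official Prometheus client library instead of formatting the text by hand.

```python
def render_prometheus(name, help_text, metric_type, samples):
    """Render samples in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    Note: label values are not escaped here, so this sketch only suits
    simple values without quotes, backslashes, or newlines.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} {metric_type}"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counter with two label combinations.
output = render_prometheus(
    "http_requests_total",
    "Total HTTP requests handled.",
    "counter",
    [
        ({"method": "GET", "status": "200"}, 42),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(output, end="")
```

Serving this text at a path such as `/metrics` is the usual convention; Prometheus is then configured with the target's address and pulls the values on a fixed interval.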
