Monitoring and Observability
Monitoring vs Observability
Monitoring: Checking whether your system is working, using checks you define in advance (is the server up?).
Observability: Understanding why your system behaves as it does, using the logs, metrics, and traces it emits.
Three Pillars of Observability
Logs
Detailed records of events in your system.
2025-05-01 10:23:45 ERROR Database connection failed: timeout
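A plain-text line like the one above is easy to read but hard to query. As a minimal sketch, assuming only Python's standard `logging` and `json` modules, the same event could be emitted as one JSON object per line. The `JsonFormatter` class, the `context` attribute, and the `db_host` value are hypothetical names chosen for illustration.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"context": {...}}.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload)


def build_logger(name: str = "app") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.error(
        "Database connection failed",
        extra={"context": {"reason": "timeout", "db_host": "db.example.internal"}},
    )
```

Keeping the reason and host as separate fields lets a log store filter on them directly instead of pattern-matching the message text.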
Metrics
Numeric measurements over time.
cpu_usage: 75%
memory_usage: 2048MB
request_latency_p99: 250ms
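A value such as `request_latency_p99` is derived from many individual samples. The sketch below shows one common way to compute it, the nearest-rank method; real metrics systems often use histograms or interpolation instead, so treat the `percentile` helper as an illustrative assumption rather than how any particular tool works.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Return the nearest-rank percentile of samples (0 < pct <= 100)."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    latencies_ms = [12.0, 15.0, 11.0, 250.0, 14.0, 13.0, 18.0, 16.0, 17.0, 19.0]
    print("p50:", percentile(latencies_ms, 50))
    print("p99:", percentile(latencies_ms, 99))
```

Tail percentiles expose slow requests that an average would hide: one 250ms outlier barely moves the mean but defines the p99.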
Traces
Follow a request through your system.
Request → Service A (10ms) → Service B (20ms) → Database (15ms)
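To make the idea concrete, here is a minimal sketch of a tracer that times nested steps and tags them with a shared trace ID. The `Tracer` and `Span` classes are hypothetical and are not the API of any tracing library; each duration here includes the time spent in child spans.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent: Optional[str]
    duration_ms: float = 0.0


class Tracer:
    """Collects timed spans that all share one trace ID."""

    def __init__(self, clock: Callable[[], float] = time.perf_counter) -> None:
        self.trace_id = uuid.uuid4().hex
        self.spans: List[Span] = []
        self._stack: List[str] = []
        self._clock = clock

    @contextmanager
    def span(self, name: str) -> Iterator[Span]:
        parent = self._stack[-1] if self._stack else None
        record = Span(name=name, trace_id=self.trace_id, parent=parent)
        self._stack.append(name)
        start = self._clock()
        try:
            yield record
        finally:
            # Record the duration even if the wrapped work raised.
            record.duration_ms = (self._clock() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


if __name__ == "__main__":
    tracer = Tracer()
    with tracer.span("service_a"):
        with tracer.span("service_b"):
            with tracer.span("database"):
                time.sleep(0.015)  # stand-in for a query
    for s in tracer.spans:
        print(s.trace_id[:8], s.name, s.parent, round(s.duration_ms, 1))
```

In a distributed system the trace ID would also travel with the request between services, typically in a header, so that spans recorded by different processes can be joined later.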
Key Metrics to Monitor
Application Metrics
- Request rate, latency, error rate
- Database query performance
- Cache hit rates
- Active connections
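Several of these application metrics are ratios or rates derived from counters that only ever increase. A minimal sketch, assuming two counter snapshots taken some seconds apart (the `CounterSnapshot` fields and `summarize` function are hypothetical names):

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CounterSnapshot:
    timestamp: float  # seconds since some fixed reference point
    requests: int
    errors: int
    cache_hits: int
    cache_misses: int


def summarize(earlier: CounterSnapshot, later: CounterSnapshot) -> Dict[str, float]:
    """Derive request rate, error rate, and cache hit rate from two snapshots."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("later snapshot must be newer than earlier snapshot")
    requests = later.requests - earlier.requests
    errors = later.errors - earlier.errors
    hits = later.cache_hits - earlier.cache_hits
    lookups = hits + (later.cache_misses - earlier.cache_misses)
    return {
        "request_rate_per_s": requests / elapsed,
        # Guard against idle intervals with no traffic.
        "error_rate": errors / requests if requests else 0.0,
        "cache_hit_rate": hits / lookups if lookups else 0.0,
    }
```

Working from differences between snapshots, rather than raw totals, keeps the numbers meaningful over any interval; this sketch does not handle counters that reset when a process restarts.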
System Metrics
- CPU usage, memory, disk space
- Network bandwidth
- I/O operations
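Some system metrics can be sampled with the standard library alone. The sketch below reads disk usage and, where the platform supports it, the one-minute load average; the function names are illustrative, and a production setup would more likely rely on a dedicated agent or exporter.

```python
import os
import shutil
from typing import Optional


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.used / usage.total * 100


def load_average_1m() -> Optional[float]:
    """Return the 1-minute load average, or None where it is unavailable."""
    try:
        # os.getloadavg exists only on Unix-like platforms.
        return os.getloadavg()[0]
    except (AttributeError, OSError):
        return None


if __name__ == "__main__":
    print(f"disk used: {disk_usage_percent('.'):.1f}%")
    print("load (1m):", load_average_1m())
```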
Business Metrics
- Revenue, transactions
- User engagement
- Feature adoption
Monitoring Tools
- Prometheus: Metrics collection and alerting
- Grafana: Visualization
- ELK Stack: Log collection, search, and visualization (Elasticsearch, Logstash, Kibana)
- New Relic: Full-stack monitoring
- Datadog: Comprehensive observability
Alerting Strategies
Alert on Symptoms, Not Causes
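Alert on what users experience, such as a high error rate or slow responses, rather than on a possible cause such as a CPU spike. A CPU spike that does not affect users may need no action, while a rising error rate always does.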
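A symptom-based rule can be sketched as a check on the error rate over recent time windows. The 5% threshold, the minimum of 100 requests, and the three-window requirement are assumptions for illustration, not recommended values.

```python
from typing import Sequence, Tuple


def should_page(
    windows: Sequence[Tuple[int, int]],
    threshold: float = 0.05,
    min_requests: int = 100,
    sustained: int = 3,
) -> bool:
    """Decide whether to page, given (total, failed) counts per window, oldest first.

    Pages only if the error rate exceeded the threshold in each of the last
    `sustained` windows and each window carried enough traffic to be meaningful.
    """
    if sustained <= 0 or len(windows) < sustained:
        return False
    recent = windows[-sustained:]
    return all(
        total > 0 and total >= min_requests and failed / total > threshold
        for total, failed in recent
    )


if __name__ == "__main__":
    history = [(1000, 10), (1000, 80), (1000, 90), (1000, 70)]
    print(should_page(history))
```

Requiring several consecutive bad windows trades a little detection delay for far fewer pages on brief blips.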
Avoid Alert Fatigue
Too many alerts, especially ones that require no action, train operators to ignore them, so real incidents get missed.
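One simple way to cut noise is to suppress repeats of the same alert for a cooldown period. The `AlertThrottle` class and its five-minute default below are hypothetical; alerting tools generally offer their own grouping and silencing features that serve the same purpose.

```python
from typing import Dict


class AlertThrottle:
    """Suppress repeat notifications for the same alert key within a cooldown."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        self.cooldown_seconds = cooldown_seconds
        self._last_sent: Dict[str, float] = {}

    def should_send(self, alert_key: str, now: float) -> bool:
        """Return True if a notification for alert_key should go out at time now."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window
        self._last_sent[alert_key] = now
        return True


if __name__ == "__main__":
    throttle = AlertThrottle(cooldown_seconds=300.0)
    for t in (0.0, 60.0, 120.0, 300.0):
        print(t, throttle.should_send("checkout:error_rate", now=t))
```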
Clear Alert Messages
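Include a runbook link and enough context (what fired, where, and the current value against its threshold) for the responder to act without searching.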
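As a rough sketch of what such a message could contain, the example below formats an alert with its context and runbook link. The `Alert` fields and the example URL are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    title: str
    service: str
    severity: str
    current_value: str
    threshold: str
    runbook_url: str  # hypothetical link to the response procedure


def format_alert(alert: Alert) -> str:
    """Build a notification that says what fired, where, and what to do next."""
    return "\n".join(
        [
            f"[{alert.severity.upper()}] {alert.title}",
            f"Service: {alert.service}",
            f"Current: {alert.current_value} (threshold: {alert.threshold})",
            f"Runbook: {alert.runbook_url}",
        ]
    )


if __name__ == "__main__":
    print(
        format_alert(
            Alert(
                title="High error rate",
                service="checkout",
                severity="critical",
                current_value="8.2%",
                threshold="5%",
                runbook_url="https://runbooks.example.com/checkout/error-rate",
            )
        )
    )
```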
Best Practices
- Set baselines and alert on deviations
- Use structured logging
- Correlate logs with metrics and traces
- Implement alert routing by severity
- Review and tune alerts regularly
- Store historical data for analysis
- Practice incident response procedures
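Two of these practices, alerting on deviations from a baseline and routing by severity, can be sketched in a few lines. The three-sigma cutoff, the severity names, and the destination names are assumptions for illustration; suitable values depend on the system and the team.

```python
import statistics
from typing import Dict, Sequence

# Hypothetical mapping from severity to notification channel.
ROUTES: Dict[str, str] = {
    "critical": "pager",
    "warning": "chat",
    "info": "email",
}


def deviates_from_baseline(
    history: Sequence[float], current: float, max_sigma: float = 3.0
) -> bool:
    """Return True if current lies more than max_sigma standard deviations
    from the mean of the historical samples."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        # A perfectly flat baseline: any change counts as a deviation.
        return current != mean
    return abs(current - mean) / stdev > max_sigma


def route_alert(severity: str) -> str:
    """Pick a destination for an alert, defaulting to the least disruptive one."""
    return ROUTES.get(severity.lower(), "email")


if __name__ == "__main__":
    latency_history_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
    if deviates_from_baseline(latency_history_ms, current=250.0):
        print("route to:", route_alert("critical"))
```

A standard-deviation test assumes the metric is fairly stable; metrics with strong daily or weekly cycles usually need a baseline taken from the same time of day or week.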
