Syntax Candy

Monitoring and Observability

Understand the difference and implement both to maintain system health

Monitoring vs Observability

Monitoring: Checking whether your system is working (for example, is the server up?).

Observability: Understanding why your system behaves the way it does, based on the data it emits.

Three Pillars of Observability

Logs

Detailed records of events in your system.

2025-05-01 10:23:45 ERROR Database connection failed: timeout
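A line of this shape can be produced with Python's standard `logging` module. The following is a minimal sketch, not a recommended production setup: the logger name `app.db` and the helper `make_logger` are hypothetical, and the output is written to an in-memory buffer only so the result is easy to inspect.

```python
import io
import logging


def make_logger(stream):
    """Return a logger that writes 'timestamp LEVEL message' lines to stream."""
    logger = logging.getLogger("app.db")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()   # avoid duplicate handlers if called twice
    logger.propagate = False  # keep output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        fmt="%(asctime)s %(levelname)s %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = make_logger(buffer)
log.error("Database connection failed: %s", "timeout")

line = buffer.getvalue().strip()
print(line)
```

The timestamp reflects the moment the call runs; the level and message match the example above.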

Metrics

Numeric measurements over time.

cpu_usage: 75%
memory_usage: 2048MB
request_latency_p99: 250ms
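A value such as `request_latency_p99` is a percentile over many individual measurements. As a rough illustration, the sketch below computes it with the nearest-rank method; the function name `percentile` and the sample data are assumptions made for this example, and real metrics systems often use histogram-based approximations instead.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100
latencies_ms = list(range(1, 101))
p99 = percentile(latencies_ms, 99)
p50 = percentile(latencies_ms, 50)
print(f"request_latency_p99: {p99}ms")
```

A p99 of 250ms means that 99% of requests finished in 250ms or less, which is why tail percentiles are usually more telling than averages.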

Traces

Follow a request through your system.

Request → Service A → Service B → Database
    ↓         10ms      20ms      15ms
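To make the idea concrete, here is a toy tracer that records how long each hop takes and which hop called it. It is a sketch under simplifying assumptions (single process, no sampling, no export); the `Tracer` class and span names are hypothetical, and a real system would more likely use a tracing library than hand-rolled code.

```python
import time
import uuid
from contextlib import contextmanager


class Tracer:
    """Records nested spans that all share one trace ID."""

    def __init__(self, clock=time.perf_counter):
        self.trace_id = uuid.uuid4().hex
        self.clock = clock
        self.spans = []    # finished spans, innermost first
        self._stack = []   # names of currently open spans

    @contextmanager
    def span(self, name):
        parent = self._stack[-1] if self._stack else None
        self._stack.append(name)
        start = self.clock()
        try:
            yield
        finally:
            duration_ms = (self.clock() - start) * 1000
            self._stack.pop()
            self.spans.append({
                "trace_id": self.trace_id,
                "name": name,
                "parent": parent,
                "duration_ms": duration_ms,
            })


tracer = Tracer()
with tracer.span("service_a"):
    with tracer.span("service_b"):
        with tracer.span("database"):
            time.sleep(0.001)  # stand-in for a real query

for s in tracer.spans:
    print(f'{s["name"]} (parent={s["parent"]}): {s["duration_ms"]:.2f}ms')
```

Because every span carries the same `trace_id`, the spans can be reassembled into the request path even if they were recorded by different services.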

Key Metrics to Monitor

Application Metrics

  • Request rate, latency, error rate
  • Database query performance
  • Cache hit rates
  • Active connections

System Metrics

  • CPU usage, memory, disk space
  • Network bandwidth
  • I/O operations
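A few of these system metrics can be read with the Python standard library alone. The sketch below is deliberately limited: `collect_system_metrics` is a hypothetical helper, and it covers only disk space and CPU count. Live CPU, memory, and I/O figures generally need platform-specific sources or a third-party library such as `psutil`.

```python
import os
import shutil


def collect_system_metrics(path="."):
    """Collect a small snapshot of system metrics for the given path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_pct = 100 * usage.used / usage.total if usage.total else 0.0
    return {
        "cpu_count": os.cpu_count(),
        "disk_total_bytes": usage.total,
        "disk_free_bytes": usage.free,
        "disk_used_percent": round(used_pct, 1),
    }


metrics = collect_system_metrics()
for name, value in metrics.items():
    print(f"{name}: {value}")
```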

Business Metrics

  • Revenue, transactions
  • User engagement
  • Feature adoption

Monitoring Tools

  • Prometheus: Metrics collection and alerting
  • Grafana: Visualization
  • ELK Stack (Elasticsearch, Logstash, Kibana): Log collection, search, and visualization
  • New Relic: Full-stack monitoring
  • Datadog: Comprehensive observability

Alerting Strategies

Alert on Symptoms, Not Causes

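Alert on a high error rate, which users actually experience, rather than on a CPU spike, which is only one possible cause and may not affect users at all.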
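A symptom-based check can be as simple as tracking the share of failed requests in a sliding window. The sketch below assumes that HTTP status codes of 500 and above count as errors; the class name `ErrorRateAlert`, the window size, and the threshold are illustrative choices, not values taken from any particular tool.

```python
from collections import deque


class ErrorRateAlert:
    """Fires when the error fraction over the last `window` requests
    exceeds `threshold`."""

    def __init__(self, window=100, threshold=0.05):
        self.outcomes = deque(maxlen=window)  # True means the request failed
        self.threshold = threshold

    def record(self, status_code):
        self.outcomes.append(status_code >= 500)

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_alert(self):
        return self.error_rate() > self.threshold


alert = ErrorRateAlert(window=10, threshold=0.2)
for code in [200] * 8 + [500] * 2:
    alert.record(code)
at_threshold = alert.should_alert()    # 2 of 10 failed: not above 0.2

alert.record(503)                      # oldest success drops out of the window
above_threshold = alert.should_alert() # 3 of 10 failed: above 0.2
```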

Avoid Alert Fatigue

Too many alerts train operators to ignore them, so alert only on conditions that need someone to act.
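One common way to cut noise is to suppress repeats of the same alert for a cooldown period. This is only one possible mitigation, shown as a sketch; the class `AlertThrottle`, the alert keys, and the 300-second cooldown are assumptions made for illustration.

```python
class AlertThrottle:
    """Allows each alert key through at most once per cooldown period."""

    def __init__(self, cooldown_seconds=300):
        self.cooldown = cooldown_seconds
        self.last_sent = {}  # alert key -> time the alert was last sent

    def should_send(self, alert_key, now):
        last = self.last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown:
            return False  # same alert fired recently, stay quiet
        self.last_sent[alert_key] = now
        return True


throttle = AlertThrottle(cooldown_seconds=300)
first = throttle.should_send("api:high_error_rate", now=0)      # sent
repeat = throttle.should_send("api:high_error_rate", now=100)   # suppressed
other = throttle.should_send("db:slow_queries", now=100)        # different key, sent
later = throttle.should_send("api:high_error_rate", now=300)    # cooldown over, sent
```

Passing `now` explicitly keeps the logic easy to test; in a running service it would typically come from `time.time()`.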

Clear Alert Messages

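Include a runbook link and enough context, such as the affected service, the observed value, and the threshold, for the responder to act on the alert.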
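As an illustration, an alert message could be assembled from a small record like the one below. The field names, the service name, and the runbook URL are all hypothetical placeholders, not part of any specific alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    title: str
    service: str
    observed: str
    threshold: str
    runbook_url: str

    def render(self):
        """Return a message a responder can act on without extra lookups."""
        return (
            f"[{self.service}] {self.title}\n"
            f"Observed: {self.observed} (threshold: {self.threshold})\n"
            f"Runbook: {self.runbook_url}"
        )


message = Alert(
    title="High error rate",
    service="checkout-api",                     # hypothetical service
    observed="7.2% of requests failing",
    threshold="5%",
    runbook_url="https://runbooks.example.com/high-error-rate",  # placeholder
).render()
print(message)
```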

Best Practices

  • Set baselines and alert on deviations
  • Use structured logging
  • Correlate logs with metrics and traces
  • Implement alert routing by severity
  • Review and tune alerts regularly
  • Store historical data for analysis
  • Practice incident response procedures
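Structured logging and correlation often go together: if every log event is a JSON object that carries a trace ID, logs can be joined with traces for the same request. The sketch below uses Python's standard `logging` module; the `JsonFormatter` class, the `trace_id` field name, the logger name, and the sample ID are assumptions for this example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Formats each log record as a single JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "message": record.getMessage(),
            # trace_id is supplied by the caller through `extra`
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # in-memory sink so the output is easy to inspect
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.handlers.clear()
logger.addHandler(handler)
logger.propagate = False

logger.error("Database connection failed: timeout",
             extra={"trace_id": "abc123"})

event = json.loads(stream.getvalue())
print(event["level"], event["trace_id"], event["message"])
```

Because each event is machine-readable, a log backend can filter on `trace_id` directly instead of parsing free text.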
