Observability Strategy: From Monitoring to Understanding Systems
Observability is not monitoring with a new name. Monitoring tells you when something is broken; observability helps you understand why. As distributed systems grow in complexity, a deliberate observability strategy is the difference between firefighting and engineering resilience.
The Four Pillars
The classic "three pillars" (metrics, logs, traces) have expanded: continuous profiling is now a first-class signal.
Observability Signals
├── Metrics (numeric, aggregated, time-series)
│   ├── RED: Rate, Errors, Duration (services)
│   └── USE: Utilization, Saturation, Errors (resources)
├── Logs (discrete events, structured or unstructured)
│   ├── Application logs
│   ├── Audit logs
│   └── Access logs
├── Traces (distributed request paths)
│   ├── Spans
│   ├── Context propagation
│   └── Sampling strategies
└── Profiles (continuous profiling)
    ├── CPU flame graphs
    ├── Memory allocation
    └── Lock contention
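The RED method above does not require any particular metrics library to reason about. A minimal stdlib-only sketch of an in-process RED recorder (the class and method names are hypothetical; production services would export these via a Prometheus client library or an OTel SDK):

```python
from collections import defaultdict

class REDRecorder:
    """Hypothetical in-process RED (Rate, Errors, Duration) recorder.
    A sketch only; real systems export to Prometheus/OTel instead."""

    def __init__(self):
        self.requests = defaultdict(int)    # total requests per endpoint
        self.errors = defaultdict(int)      # failed requests per endpoint
        self.durations = defaultdict(list)  # observed latencies (seconds)

    def observe(self, endpoint, duration_s, ok=True):
        self.requests[endpoint] += 1
        if not ok:
            self.errors[endpoint] += 1
        self.durations[endpoint].append(duration_s)

    def snapshot(self, endpoint, window_s):
        """RED summary for one endpooint over an assumed time window."""
        n = self.requests[endpoint]
        lat = sorted(self.durations[endpoint])
        p99 = lat[min(len(lat) - 1, int(len(lat) * 0.99))] if lat else 0.0
        return {
            "rate_rps": n / window_s,                            # Rate
            "error_ratio": self.errors[endpoint] / n if n else 0.0,  # Errors
            "p99_s": p99,                                        # Duration
        }

# Simulate traffic: every 25th request fails, every 50th is slow.
red = REDRecorder()
for i in range(100):
    red.observe("/checkout",
                duration_s=0.05 + (0.5 if i % 50 == 0 else 0.0),
                ok=(i % 25 != 0))
```

The point of the sketch is the shape of the data: three counters per endpoint are enough to answer "how busy, how broken, how slow" without predefining dashboards.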
Observability vs Monitoring
| Dimension | Monitoring | Observability |
|---|---|---|
| Question | "Is it working?" | "Why is it not working?" |
| Data model | Predefined dashboards, thresholds | High-cardinality, explorable data |
| Approach | Known-unknowns (alerts for expected failures) | Unknown-unknowns (ad-hoc investigation) |
| Instrumentation | Agent-based, metrics-centric | SDK-based, traces + structured events |
| Cost driver | Number of hosts/services | Data volume (events, spans, cardinality) |
| Culture | Ops-owned | Engineering-wide ownership |
Tool Landscape
| Capability | Datadog | Grafana Stack | New Relic | Dynatrace | Open Source Only |
|---|---|---|---|---|---|
| Metrics | Native | Prometheus + Mimir | Native | Native | Prometheus / VictoriaMetrics |
| Logs | Native | Loki | Native | Native | Loki / OpenSearch |
| Traces | Native | Tempo | Native | Native | Jaeger / Tempo |
| Profiling | Continuous Profiler | Pyroscope | Codestream | PurePath | Pyroscope |
| OTel support | Collector + SDK | Native | Native | OneAgent + OTel | Native |
| Pricing model | Per host + per GB | Self-hosted free / Grafana Cloud | Per-user + GB | Per-host (full-stack) | Infrastructure cost |
| Strength | Unified UX, integrations | Flexibility, no vendor lock-in | Free tier, AI insights | Auto-instrumentation, AI | Full control, no license |
| Weakness | Cost at scale | Operational overhead | Cardinality limits | Opaque pricing | Requires expertise |
OpenTelemetry Architecture
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Service A  │     │  Service B  │     │  Service C  │
│ (OTel SDK)  │     │ (OTel SDK)  │     │ (OTel SDK)  │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       └─────────┬─────────┴─────────┬─────────┘
                 │    OTLP (gRPC)    │
          ┌──────▼──────┐     ┌──────▼──────┐
          │    OTel     │     │    OTel     │
          │  Collector  │     │  Collector  │
          │   (Agent)   │     │  (Gateway)  │
          └──────┬──────┘     └──────┬──────┘
                 │                   │
                 └─────────┬─────────┘
                           │
    ┌──────────────────────┼──────────────────────┐
    ▼                      ▼                      ▼
Prometheus               Loki               Tempo/Jaeger
(Metrics)               (Logs)                (Traces)
    │                      │                      │
    └──────────────────────┼──────────────────────┘
                           ▼
             Grafana / Datadog / New Relic
             (Visualization & Alerting)
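The gateway side of this topology maps directly onto collector configuration: one OTLP receiver, one pipeline per signal. A sketch (endpoints and exporter choices are illustrative; exact component names depend on your collector distribution and version):

```yaml
# Gateway collector sketch: receive OTLP, fan out per signal.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch: {}

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/write
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  otlp/tempo:
    endpoint: tempo:4317

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
```

Because each backend sits behind its own exporter, swapping Tempo for Jaeger (or a vendor) is a config change, not an application change.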
Observability Maturity Model
| Level | Name | Characteristics |
|---|---|---|
| 0 | Reactive | No monitoring. SSH into boxes to check logs. |
| 1 | Basic Monitoring | Uptime checks, CPU/memory alerts, unstructured logs. |
| 2 | Structured Monitoring | Centralized logging, APM, basic dashboards, runbooks. |
| 3 | Observability | Distributed tracing, structured logs, SLOs, on-call. |
| 4 | Proactive Observability | Continuous profiling, anomaly detection, chaos engineering. |
| 5 | Adaptive | AIOps, automated remediation, cost-aware telemetry pipelines. |
Strategic Decisions
Vendor vs OSS: A Grafana stack (Prometheus + Loki + Tempo + Pyroscope) gives you zero vendor lock-in but requires dedicated platform engineering. Datadog or Dynatrace reduces operational burden but creates significant cost exposure at scale.
Instrumentation strategy: Adopt OpenTelemetry as your single instrumentation layer. OTel is CNCF graduated, vendor-neutral, and supported by every major backend. This decouples your application code from your observability vendor.
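The decoupling works because OTel standardizes wire formats, not just SDKs. Trace context, for instance, crosses service boundaries as a W3C `traceparent` HTTP header. A stdlib-only sketch of what the SDK handles for you (the helper names here are hypothetical, not OTel API):

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = span_id or secrets.token_hex(8)     # 8 bytes -> 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Extract (trace_id, parent_span_id, sampled) from an incoming header."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        return None
    return m.group(1), m.group(2), m.group(3) == "01"

# A downstream service continues the same trace under a new span,
# preserving the trace ID and the sampling decision.
incoming = make_traceparent()
trace_id, parent_span, sampled = parse_traceparent(incoming)
outgoing = make_traceparent(trace_id=trace_id, sampled=sampled)
```

Every OTel SDK and every major backend understands this header, which is what makes the instrumentation layer vendor-neutral in practice.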
Sampling and cost control: At scale, you cannot store every trace. Implement tail-based sampling in the OTel Collector gateway to capture 100% of errors and slow requests while sampling normal traffic.
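A sketch of this policy using the contrib `tail_sampling` processor in the gateway collector (the thresholds and sampling percentage are illustrative, not recommendations):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s          # buffer spans until the full trace can be judged
    policies:
      - name: keep-errors       # keep 100% of traces containing an error
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow         # keep 100% of traces slower than 500 ms
        type: latency
        latency:
          threshold_ms: 500
      - name: baseline          # sample 5% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
```

Policies are evaluated per trace and OR-ed together, so a trace matching any policy is kept; this is why tail sampling must run in a gateway that sees every span of a trace.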
SLOs over alerts: Define service-level objectives (SLOs) with error budgets. Alert on burn rate, not raw thresholds. This reduces alert fatigue and focuses engineering effort on user-facing impact.
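Burn rate is the ratio of the observed error rate to the rate that would exactly exhaust the error budget over the SLO window. A sketch of the arithmetic behind a multi-window burn-rate alert (the 14.4 threshold follows the common Google SRE Workbook pattern; the window pairing is illustrative):

```python
def burn_rate(errors, total, slo=0.999):
    """How fast the error budget is being consumed.
    1.0 means the budget lasts exactly the SLO window;
    14.4 means a 30-day budget is gone in about 2 days."""
    if total == 0:
        return 0.0
    budget = 1.0 - slo                 # allowed error ratio, e.g. 0.001
    return (errors / total) / budget

def should_page(fast_window, slow_window, slo=0.999, threshold=14.4):
    """Page only if both a short and a long window burn fast.
    Requiring both reduces flapping on brief error spikes."""
    return (burn_rate(*fast_window, slo) >= threshold
            and burn_rate(*slow_window, slo) >= threshold)

# 20 errors in 1000 requests against a 99.9% SLO:
# error ratio 0.02 vs budget 0.001 -> burn rate 20x.
rate = burn_rate(20, 1000)
```

The key shift from threshold alerting: a 2% error ratio is not inherently page-worthy; it pages because it consumes a month of budget in two days.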
Resources
- OpenTelemetry Documentation
- Google SRE Book - Monitoring Distributed Systems
- Grafana LGTM Stack Guide
- Charity Majors, Liz Fong-Jones & George Miranda - Observability Engineering (O'Reilly)