Operate systems with signals, not guesses
A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, and SLO/SLI/error budgets, with a short lab to instrument an app and set basic alerts.
Agenda and outcomes
Agenda
By the end, learners should be able to:
- Differentiate metrics, logs, and traces and when to use each
- Start monitoring with the Golden Signals (P50/P95/P99, not averages)
- Define an SLI + SLO and compute an error budget
- Adopt structured JSON logging and avoid PII leakage
- Explain common stacks (Prometheus/Grafana, EFK, Loki, OpenTelemetry)
The three pillars of observability
Metrics
- Numeric measurements over time
- Cheap to store and query
- Best for dashboards and alerting
Logs
- Timestamped events
- High detail for debugging
- Cost grows fast with volume
Traces
- Request journey across services
- Explains where latency is spent
- Critical for microservices
Start with metrics for alerting, add structured logs for debugging, and add tracing when you need to debug latency across distributed systems.
The Four Golden Signals (Google SRE)
What to monitor first
- Latency: P50/P95/P99 (avoid averages)
- Traffic: requests per second (RPS)
- Errors: 5xx/total, timeouts, failed jobs
- Saturation: CPU, memory, queue depth, connections
Latency: p99_http_request_duration_seconds
Traffic: http_requests_total
Errors: rate(http_requests_total{status=~"5.."}[5m])
Saturation: node_cpu_seconds_total, queue_depth
Names shown are typical Prometheus-style metrics. The idea matters more than the exact names.
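To see why percentiles are preferred over averages for latency, here is a minimal sketch using only the Python standard library. The latency values are made up for illustration, and the nearest-rank method is one of several common percentile definitions.

```python
import math
import statistics


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 97 fast requests and 3 very slow ones.
latencies_ms = [20] * 97 + [2000] * 3

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f}ms p50={p50_ms}ms p95={p95_ms}ms p99={p99_ms}ms")
```

With this data the mean is 79.4 ms, a value no request actually experienced, while P50 and P95 stay at 20 ms and P99 exposes the 2000 ms tail.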
SLIs, SLOs, SLAs, and error budgets
Definitions
- SLI: a measurement (e.g., P99 latency, availability)
- SLO: a target for an SLI (e.g., P99 < 200ms, 99.9% monthly)
- SLA: contractual commitment with penalties
- Error budget: 100% - SLO
Example
SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes allowed downtime
30.4 days avg: ~43.8 minutes
When the error budget is consumed, freeze risky deploys until it refills.
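The arithmetic above can be sketched as a small helper. The function names are hypothetical, and the calculation assumes an availability SLO measured in wall-clock time over the window.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window of days."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return budget_fraction * window_days * 24 * 60


def budget_remaining(slo_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1.0 - downtime_minutes / budget


print(round(error_budget_minutes(99.9, 30), 1))    # 30-day month
print(round(error_budget_minutes(99.9, 30.4), 1))  # average month
print(round(budget_remaining(99.9, 30, 21.6), 2))  # after 21.6 minutes of downtime
```

A remaining fraction at or below zero is the signal to pause risky deploys.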
Structured logs beat readable logs in production
Unstructured
2024-01-15 ERROR User login failed for john@example.com
Hard to query. Parsing is fragile.
Structured (JSON)
{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
"event":"login_failed","user":"john@example.com",
"ip":"1.2.3.4","request_id":"abc-123"}
Queryable and filterable. Add request_id everywhere. In real logs, mask or hash fields such as user and ip, since they are PII.
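One way to emit logs in this shape is a custom formatter on top of the standard logging module. This is a minimal sketch: the logger name, the "fields" convention for extra data, and the example field values are assumptions made for illustration, not part of any particular library.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Hypothetical convention: callers pass extra={"fields": {...}} to add keys.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


logger = logging.getLogger("auth")  # hypothetical logger name
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "login_failed",
    extra={"fields": {"request_id": "abc-123", "reason": "bad_password"}},
)
```

Because every line is valid JSON, the aggregation layer can filter on level, event, or request_id without regex parsing.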
Log aggregation: EFK and Loki
EFK (Kubernetes)
- Elasticsearch: storage + search
- Fluent Bit/Fluentd: collectors
- Kibana: query + dashboards
Collectors
- Fluent Bit is lightweight
- Typically runs as a DaemonSet
- Enrich with labels: app, ns, pod
Loki (alternative)
- Label-based log indexing
- Lower cost profile than ES in many cases
- Often paired with Grafana
Logs are expensive. Keep DEBUG off in production; log high-signal events and make errors rich with context.
High-signal logging checklist
Do log
- Requests: method, path, status, latency
- Correlation: request_id, trace_id (if present)
- Errors: type, stack trace, context
- Business events: order_created, payment_processed
Don't log
- Passwords, tokens, credit cards
- Raw PII; mask/hash if required
- Everything at DEBUG in production
- Duplicate noisy logs without value
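The "don't log" rules can be enforced in code before a record is written. The sketch below drops secrets and replaces PII with a keyed hash so that events for the same user can still be correlated. The key sets and the function name are hypothetical; a real service would define its own lists and load the key from a secret store.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to the fields your service actually logs.
SECRET_KEYS = {"password", "token", "authorization", "card_number"}
PII_KEYS = {"email", "user"}


def scrub(fields, secret_key):
    """Return a copy of fields with secrets removed and PII replaced by a keyed hash."""
    clean = {}
    for key, value in fields.items():
        name = key.lower()
        if name in SECRET_KEYS:
            continue  # never log secrets, not even hashed
        if name in PII_KEYS:
            digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
            clean[key] = "hmac:" + digest.hexdigest()[:16]
        else:
            clean[key] = value
    return clean


print(scrub({"user": "john@example.com", "password": "hunter2", "status": 401}, b"demo-key"))
```

A keyed hash (HMAC) is used instead of a plain hash so that common values such as email addresses cannot be recovered by hashing guesses without the key.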
When to add tracing (and how)
When traces pay off
- You have multiple services per request
- You need to find which hop is slow
- You need dependency visibility
- You want end-to-end latency breakdown
Typical approach
- Use OpenTelemetry SDKs
- Sample traces (not 100%)
- Export to Jaeger/Tempo/OTel Collector
- Correlate: trace_id into logs
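The "sample traces" step can be illustrated with a ratio-based decision derived from the trace ID. This is a standard-library sketch of the idea, not the OpenTelemetry SDK API; the function names are hypothetical. In practice the SDK's built-in samplers would do this for you.

```python
import secrets


def new_trace_id():
    """Random 128-bit trace ID as 32 hex characters (the size used by W3C Trace Context)."""
    return secrets.token_hex(16)


def should_sample(trace_id, ratio):
    """Keep roughly `ratio` of all traces, decided from the trace ID alone.

    Any service that sees the same trace_id reaches the same decision,
    so a trace is either kept on every hop or dropped on every hop.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < ratio * 2**64


trace_id = new_trace_id()
if should_sample(trace_id, 0.10):
    print(f"recording trace {trace_id}")
else:
    print(f"skipping trace {trace_id}")
```

Whether or not a trace is sampled, writing the trace_id into the request log line keeps logs and traces joinable.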
Lab: instrument and observe one service
Lab objectives
- Expose a metrics endpoint
- Emit structured logs with request_id
- Build a Golden Signals dashboard
- Create one alert on error rate
Step 1: add metrics (Prometheus style)
Expose /metrics
- requests_total{route,status}
- request_duration_seconds{route}
Prefer histograms for latency
Use p95/p99 from histograms
Validate
- Hit /metrics and confirm counters increase
- Ensure labels do not explode cardinality
- Record latency as histogram, not raw values
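To show why a histogram is enough to derive p95/p99, here is a simplified standard-library sketch of a bucketed histogram with a quantile estimate by linear interpolation. It mimics the general idea behind Prometheus histograms but is not the Prometheus client API; the class, bucket bounds, and sample values are assumptions. In the lab, a real client library would provide the histogram type.

```python
import bisect


class Histogram:
    """Bucketed latency histogram with an interpolated quantile estimate."""

    def __init__(self, upper_bounds):
        self.bounds = sorted(upper_bounds)          # finite bucket upper bounds, in seconds
        self.counts = [0] * (len(self.bounds) + 1)  # per-bucket counts; last slot is +Inf
        self.total = 0
        self.sum = 0.0

    def observe(self, seconds):
        # bisect_left picks the first bucket whose upper bound is >= the value.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1
        self.total += 1
        self.sum += seconds

    def quantile(self, q):
        """Estimate quantile q (0..1) by interpolating inside the matching bucket."""
        if self.total == 0:
            return float("nan")
        target = q * self.total
        seen = 0
        for i, count in enumerate(self.counts):
            if count and seen + count >= target:
                if i == len(self.bounds):
                    return self.bounds[-1]  # +Inf bucket: report the highest finite bound
                lower = self.bounds[i - 1] if i > 0 else 0.0
                upper = self.bounds[i]
                return lower + (upper - lower) * (target - seen) / count
            seen += count
        return self.bounds[-1]


hist = Histogram([0.05, 0.1, 0.25, 0.5, 1.0])
for value in [0.03] * 90 + [0.2] * 9 + [0.8]:
    hist.observe(value)

print(f"p50~{hist.quantile(0.50):.3f}s p95~{hist.quantile(0.95):.3f}s")
```

The service stores only a handful of counters per route instead of every raw latency, and the estimate's accuracy depends on how well the bucket bounds match the real latency range.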
Step 2: add structured logs + correlation IDs
Log once per request
{"ts":...,"level":"INFO","route":"/pay",
"status":200,"latency_ms":18,
"request_id":"...","trace_id":"..."}
Validate
- Every log line includes request_id
- Errors include stack traces + context
- No secrets or PII in logs
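One way to get request_id onto every log line without passing it through every function is a context variable plus a logging filter. This is a minimal standard-library sketch; handle_request and the other names are hypothetical stand-ins for whatever middleware hook your framework offers.

```python
import contextvars
import logging
import uuid

# Holds the current request's ID; "-" means "outside any request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request_id to every record that passes through."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def handle_request(handler, incoming_request_id=None):
    """Bind a request_id for one request, reusing the caller's ID when one is supplied."""
    token = request_id_var.set(incoming_request_id or str(uuid.uuid4()))
    try:
        return handler()
    finally:
        request_id_var.reset(token)


logger = logging.getLogger("pay")  # hypothetical logger name
stream = logging.StreamHandler()
stream.addFilter(RequestIdFilter())
stream.setFormatter(logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s"))
logger.addHandler(stream)
logger.setLevel(logging.INFO)

handle_request(lambda: logger.info("payment_processed"), incoming_request_id="abc-123")
```

Reusing an incoming ID (for example from an upstream request header) is what lets one request be followed across services.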
Step 3: dashboard + one alert
Dashboard panels:
- RPS
- Error rate
- p95/p99 latency
- CPU/memory saturation
Alert example:
error_rate > 1% for 5m
Validate
- Force a 500 to see logs + error rate move
- Ensure alerts are actionable (avoid noise)
- Use paging only for user-impacting issues
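The "for 5m" part of the alert matters: the condition must hold continuously, so a single bad scrape does not page anyone. The sketch below shows that logic in plain Python with a hypothetical function name and made-up samples; in the lab, the alerting system evaluates this for you.

```python
def alert_firing(samples, threshold=0.01, for_seconds=300):
    """Decide whether an alert should fire.

    samples: list of (timestamp_seconds, error_rate) tuples sorted by time.
    Fires only if error_rate has stayed above threshold, without a gap,
    for at least for_seconds up to the latest sample.
    """
    breach_start = None
    for timestamp, error_rate in samples:
        if error_rate > threshold:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None  # condition cleared; the timer restarts
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds


sustained = [(t, 0.02) for t in range(0, 301, 60)]
blip = [(0, 0.02), (60, 0.0), (120, 0.02), (180, 0.02), (240, 0.02), (300, 0.02)]

print(alert_firing(sustained))  # breach has lasted the full 5 minutes
print(alert_firing(blip))       # breach restarted at t=120, only 3 minutes so far
```

Raising the duration trades detection speed for fewer false pages, which is one lever for keeping alerts actionable.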
Common observability mistakes
Mistake 1
Alerting on every symptom without an SLO to define what matters, which causes alert fatigue.
Mistake 2
Logging too much (high cost) or too little (no context), with no request correlation.
Mistake 3
High-cardinality metric labels (user_id, request_id) exploding storage and query time.
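A quick way to see the problem is to multiply the number of distinct values per label, which gives an upper bound on the time series a single metric can create. The label counts below are made-up numbers for illustration.

```python
import math


def max_series(label_cardinalities):
    """Upper bound on time series for one metric: the product of distinct values per label."""
    return math.prod(label_cardinalities.values())


bounded = {"route": 20, "status": 5}
unbounded = {"route": 20, "status": 5, "user_id": 100_000}

print(max_series(bounded))    # 100
print(max_series(unbounded))  # 10000000
```

For a histogram the figure is multiplied again by the number of buckets, so identifiers such as user_id and request_id belong in logs and traces rather than in metric labels.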
Minimum viable observability plan
Start here
- Golden Signals dashboard for every user-facing service
- One SLO per service (availability or latency)
- Structured logs with request_id everywhere
- One paging alert tied to user impact
Then iterate
- Add tracing for multi-hop latency
- Refine alerts using error budgets
- Add log-based alerts for specific failure modes
- Review dashboards monthly with incidents