Monitoring & Observability

Operate systems with signals, not guesses

01 / 15

A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, and SLO/SLI/error budgets, with a short lab to instrument an app and set basic alerts.

60 minutes total
45 minutes theory + walkthrough
10 minutes lab runbook
5 minutes recap + Q&A
Session Map

Agenda and outcomes

02 / 15

Agenda

0-5 min: Why observability matters
5-20 min: The 3 pillars: metrics, logs, traces
20-30 min: Golden Signals: latency, traffic, errors, saturation
30-40 min: SLIs, SLOs, SLAs, error budgets
40-55 min: Logging strategy: structured logs, EFK/Loki
55-60 min: Lab + recap

By the end, learners should be able to

  • Differentiate metrics, logs, and traces and when to use each
  • Start monitoring with the Golden Signals (P50/P95/P99, not averages)
  • Define an SLI + SLO and compute an error budget
  • Adopt structured JSON logging and avoid PII leakage
  • Explain common stacks (Prometheus/Grafana, EFK, Loki, OpenTelemetry)
Pillars

The three pillars of observability

03 / 15

Metrics

  • Numeric measurements over time
  • Cheap to store and query
  • Best for dashboards and alerting
Examples: CPU %, RPS, error rate

Logs

  • Timestamped events
  • High detail for debugging
  • Cost grows fast with volume
Examples: request_id, error context

Traces

  • Request journey across services
  • Explains where latency is spent
  • Critical for microservices
Examples: span, trace_id, hop latency

Start with metrics for alerting, add structured logs for debugging, and add tracing when you need to debug latency across distributed systems.

Signals

The Four Golden Signals (Google SRE)

04 / 15

What to monitor first

  • Latency: P50/P95/P99 (avoid averages)
  • Traffic: requests per second (RPS)
  • Errors: 5xx/total, timeouts, failed jobs
  • Saturation: CPU, memory, queue depth, connections
Latency:    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Traffic:    sum(rate(http_requests_total[5m]))
Errors:     sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Saturation: node_cpu_seconds_total, queue_depth

Metric names and queries shown are typical Prometheus style. The idea matters more than the exact names.
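To see why the Latency bullet says to avoid averages, here is a minimal sketch that compares the mean with nearest-rank percentiles. The sample latencies are made up for illustration, and `percentile` is a hypothetical helper, not a library function.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]

# Hypothetical sample: mostly fast requests, a few slow ones, two very slow ones.
latencies_ms = [20] * 93 + [300] * 5 + [5000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

# The mean (133.6 ms) describes no real request: most take 20 ms, the worst take 5000 ms.
print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

The mean sits between the typical request and the tail, so it hides both. The percentiles show the typical case (P50) and the tail (P95/P99) separately.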

Targets

SLIs, SLOs, SLAs, and error budgets

05 / 15

Definitions

  • SLI: a measurement (e.g., P99 latency, availability)
  • SLO: a target for an SLI (e.g., P99 < 200ms, 99.9% monthly)
  • SLA: contractual commitment with penalties
  • Error budget: 100% - SLO

Example

SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes allowed downtime
30.4 days avg: ~43.8 minutes

When the error budget is consumed, freeze risky deploys until the budget recovers (for example, when the measurement window rolls forward).
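The arithmetic in the example can be checked with a short sketch. The function names are hypothetical, and the 30 minutes of consumed downtime is an invented figure.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    budget_fraction = (100 - slo_percent) / 100
    return budget_fraction * window_days * 24 * 60

def budget_remaining_minutes(slo_percent, window_days, downtime_minutes):
    """Budget left after subtracting downtime already consumed (negative means overspent)."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes

print(round(error_budget_minutes(99.9, 30), 1))    # 30-day month
print(round(error_budget_minutes(99.9, 30.4), 1))  # average month
print(round(budget_remaining_minutes(99.9, 30, 30), 1))  # after 30 minutes of downtime
```

A negative remaining budget is the signal to pause risky deploys.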

Logging

Structured logs beat readable logs in production

06 / 15

Unstructured

2024-01-15 ERROR User login failed for john@example.com

Hard to query. Parsing is fragile.

Structured (JSON)

{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
 "event":"login_failed","user":"john@example.com",
 "ip":"1.2.3.4","request_id":"abc-123"}

Queryable and filterable. Add request_id everywhere.

  • Log JSON in production
  • Never log passwords, tokens, or credit card numbers
  • Mask or hash PII
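A minimal sketch of why JSON lines are easier to query, using only the Python standard library. `format_event` and the event names are assumptions made for illustration, not part of any logging framework.

```python
import json
from datetime import datetime, timezone

def format_event(level, event, **fields):
    """Serialize one log event as a single JSON line."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "level": level,
        "event": event,
        **fields,
    }
    return json.dumps(record)

lines = [
    format_event("INFO", "login_ok", request_id="abc-122"),
    format_event("ERROR", "login_failed", request_id="abc-123"),
]

# Because every line is JSON, filtering is a field lookup rather than a fragile regex.
parsed = [json.loads(line) for line in lines]
errors = [rec for rec in parsed if rec["level"] == "ERROR"]
for rec in errors:
    print(rec["event"], rec["request_id"])
```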
Stacks

Log aggregation: EFK and Loki

07 / 15

EFK (Kubernetes)

  • Elasticsearch: storage + search
  • Fluent Bit/Fluentd: collectors
  • Kibana: query + dashboards

Collectors

  • Fluent Bit is lightweight
  • Typically runs as a DaemonSet
  • Enrich with labels: app, ns, pod

Loki (alternative)

  • Label-based log indexing
  • Lower cost profile than ES in many cases
  • Often paired with Grafana

Logs are expensive. Keep DEBUG off in production; log high-signal events and make errors rich with context.

What To Log

High-signal logging checklist

08 / 15

Do log

  • Requests: method, path, status, latency
  • Correlation: request_id, trace_id (if present)
  • Errors: type, stack trace, context
  • Business events: order_created, payment_processed

Don't log

  • Passwords, tokens, credit cards
  • Raw PII; mask/hash if required
  • Everything at DEBUG in production
  • Duplicate noisy logs without value
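One way to apply the "don't log" list is to scrub fields before they reach the logger. This is a sketch under stated assumptions: the key names, the `scrub` helper, and the placeholder secret are all hypothetical, and a keyed hash (HMAC) is just one possible pseudonymization choice.

```python
import hashlib
import hmac

# Hypothetical field names; adjust to the fields your service actually emits.
SECRET_KEYS = {"password", "token", "credit_card"}
PII_KEYS = {"email", "user"}

def scrub(fields, hash_key=b"placeholder-key-load-from-secret-store"):
    """Drop secrets entirely and replace PII values with a short keyed hash."""
    clean = {}
    for key, value in fields.items():
        if key in SECRET_KEYS:
            continue  # never log secrets, not even masked
        if key in PII_KEYS:
            digest = hmac.new(hash_key, str(value).encode(), hashlib.sha256).hexdigest()
            clean[key + "_hash"] = digest[:16]
        else:
            clean[key] = value
    return clean

safe = scrub({"email": "john@example.com", "password": "hunter2", "status": 401})
print(safe)
```

The same input always produces the same hash, so log lines for one user can still be grouped without exposing the address.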
Traces

When to add tracing (and how)

09 / 15

When traces pay off

  • You have multiple services per request
  • You need to find which hop is slow
  • You need dependency visibility
  • You want end-to-end latency breakdown

Typical approach

  • Use OpenTelemetry SDKs
  • Sample traces (not 100%)
  • Export to Jaeger/Tempo/OTel Collector
  • Correlate: trace_id into logs
Request: client -> gateway
Spans: auth, db, cache
Trace: total latency + hops
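The "sample traces" and "trace_id into logs" points can be sketched without any tracing library. This is an illustration of ratio-based head sampling, not the OpenTelemetry SDK: `new_trace_id` and `should_sample` are hypothetical helpers, and the only outside fact relied on is that a W3C trace ID is 16 bytes (32 hex characters).

```python
import json
import secrets

def new_trace_id():
    """Random 128-bit trace ID as 32 hex characters."""
    return secrets.token_hex(16)

def should_sample(trace_id, ratio):
    """Deterministic decision from the lower 64 bits of the trace ID.

    Every service that sees the same trace_id reaches the same decision,
    so a trace is kept or dropped as a whole.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[16:], 16) < bound

trace_id = new_trace_id()
sampled = should_sample(trace_id, ratio=0.1)  # keep roughly 10% of traces

# Correlate: include trace_id in the log line whether or not the trace is sampled.
print(json.dumps({"level": "INFO", "route": "/pay", "trace_id": trace_id, "sampled": sampled}))
```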
Lab

Lab: instrument and observe one service

10 / 15
Service: HTTP endpoint
Metrics: request count + latency
Logs: JSON + request_id
Dashboard: Golden Signals
Alert: p99 latency / error rate
Debug: filter logs by request_id

Lab objectives

  • Expose a metrics endpoint
  • Emit structured logs with request_id
  • Build a Golden Signals dashboard
  • Create one alert on error rate
Lab Runbook

Step 1: add metrics (Prometheus style)

11 / 15
Expose /metrics
- requests_total{route,status}
- request_duration_seconds{route}

Prefer histograms for latency
Use p95/p99 from histograms

Validate

  • Hit /metrics and confirm counters increase
  • Ensure labels do not explode cardinality
  • Record latency as histogram, not raw values
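To show what a counter and a latency histogram look like behind /metrics, here is a standard-library sketch. In the lab you would more likely use a Prometheus client library; the `Metrics` class and the bucket boundaries below are assumptions chosen for illustration.

```python
import bisect
from collections import Counter

# Hypothetical bucket upper bounds in seconds.
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0)

class Metrics:
    def __init__(self):
        self.requests_total = Counter()  # keyed by (route, status)
        self.bucket_counts = {}          # route -> per-bucket counts, +Inf bucket last
        self.duration_sum = Counter()    # route -> total observed seconds

    def observe(self, route, status, seconds):
        self.requests_total[(route, status)] += 1
        counts = self.bucket_counts.setdefault(route, [0] * (len(BUCKETS) + 1))
        # First bucket whose upper bound is >= the observation; overflow goes to +Inf.
        counts[bisect.bisect_left(BUCKETS, seconds)] += 1
        self.duration_sum[route] += seconds

    def render(self):
        """Render in a Prometheus-style text layout with cumulative buckets."""
        lines = []
        for (route, status), n in sorted(self.requests_total.items()):
            lines.append(f'requests_total{{route="{route}",status="{status}"}} {n}')
        for route, counts in sorted(self.bucket_counts.items()):
            cumulative = 0
            for bound, n in zip(BUCKETS + ("+Inf",), counts):
                cumulative += n
                lines.append(
                    f'request_duration_seconds_bucket{{route="{route}",le="{bound}"}} {cumulative}'
                )
            lines.append(f'request_duration_seconds_sum{{route="{route}"}} {self.duration_sum[route]}')
            lines.append(f'request_duration_seconds_count{{route="{route}"}} {cumulative}')
        return "\n".join(lines) + "\n"

metrics = Metrics()
metrics.observe("/pay", 200, 0.018)
metrics.observe("/pay", 200, 0.3)
metrics.observe("/pay", 500, 2.0)
output = metrics.render()
print(output)
```

Storing bucket counts instead of raw values keeps storage bounded, and p95/p99 can be estimated from the buckets at query time. Labels are limited to route and status so cardinality stays small.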
Lab Runbook

Step 2: add structured logs + correlation IDs

12 / 15
Log once per request
{"ts":...,"level":"INFO","route":"/pay",
 "status":200,"latency_ms":18,
 "request_id":"...","trace_id":"..."}

Validate

  • Every log line includes request_id
  • Errors include stack traces + context
  • No secrets or PII in logs
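A possible way to get request_id onto every log line is a context variable plus a JSON formatter, using only the standard library. The logger name "lab", the `handle_request` function, and the "fields" key passed through `extra` are assumptions for this sketch. Output goes to an in-memory buffer here so it can be inspected; a real service would write to stdout.

```python
import contextvars
import io
import json
import logging
import time
import uuid

request_id_var = contextvars.ContextVar("request_id", default="-")

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "ts": self.formatTime(record, "%Y-%m-%dT%H:%M:%S%z"),
            "level": record.levelname,
            "msg": record.getMessage(),
            "request_id": request_id_var.get(),  # attached to every line automatically
        }
        payload.update(getattr(record, "fields", {}))
        if record.exc_info:
            payload["stack"] = self.formatException(record.exc_info)
        return json.dumps(payload)

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("lab")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

def handle_request(route):
    """Hypothetical handler: set a request_id, do the work, log once at the end."""
    request_id_var.set(str(uuid.uuid4()))
    start = time.perf_counter()
    status = 200  # real work would happen here
    latency_ms = round((time.perf_counter() - start) * 1000, 3)
    logger.info("request", extra={"fields": {"route": route, "status": status, "latency_ms": latency_ms}})

handle_request("/pay")
handle_request("/pay")
records = [json.loads(line) for line in stream.getvalue().splitlines()]
print(records)
```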
Lab Runbook

Step 3: dashboard + one alert

13 / 15
Dashboard panels:
- RPS
- Error rate
- p95/p99 latency
- CPU/memory saturation

Alert example:
error_rate > 1% for 5m

Validate

  • Force a 500 to see logs + error rate move
  • Ensure alerts are actionable (avoid noise)
  • Use paging only for user-impacting issues
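The "for 5m" part of the alert example is what keeps a brief spike from paging anyone. The sketch below loosely mirrors that idea in plain Python; `error_rate` and `alert_firing` are hypothetical helpers, and the sample data is invented.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there is no traffic."""
    return errors / total if total else 0.0

def alert_firing(samples, threshold=0.01, for_seconds=300):
    """samples: time-ordered list of (timestamp_seconds, error_rate).

    Fires only if the rate has stayed above the threshold continuously
    for at least for_seconds, as of the last sample.
    """
    breach_start = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts
        else:
            breach_start = None  # condition cleared, restart the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds

# One sample per minute.
spike = [(0, 0.0), (60, 0.05), (120, 0.05), (180, 0.0)]
sustained = [(t, error_rate(2, 100)) for t in range(0, 361, 60)]

print(alert_firing(spike))      # brief spike that recovered
print(alert_firing(sustained))  # 2% errors for 6 minutes
```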
Watch Outs

Common observability mistakes

14 / 15

Mistake 1

Alerting on every symptom without tying alerts to SLOs, which causes alert fatigue.

Mistake 2

Logging too much (high cost) or too little (no context) and no request correlation.

Mistake 3

High-cardinality metric labels (user_id, request_id) exploding storage and query time.
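A rough way to see the blow-up: the worst-case number of time series for one metric is the product of the distinct values of each label. The label counts below are invented, and the estimate assumes every combination actually occurs, so treat it as an upper bound.

```python
import math

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return math.prod(label_cardinalities.values())

# Hypothetical label cardinalities.
bounded = {"route": 20, "status": 5, "method": 4}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))
print(max_series(with_user_id))
```

Identifiers such as user_id and request_id belong in logs and traces, where each event is stored once, rather than in metric labels.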

Close

Minimum viable observability plan

15 / 15

Start here

  • Golden Signals dashboard for every user-facing service
  • One SLO per service (availability or latency)
  • Structured logs with request_id everywhere
  • One paging alert tied to user impact

Then iterate

  • Add tracing for multi-hop latency
  • Refine alerts using error budgets
  • Add log-based alerts for specific failure modes
  • Review dashboards monthly with incidents