Monitoring & Observability

Operate systems with signals, not guesses

A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, and SLO/SLI/error budgets, with a short lab to instrument an app and set basic alerts.

60 minutes total

45 minutes theory + walkthrough

10 minutes lab runbook

5 minutes recap + Q&A

Session Map

Agenda and outcomes

Agenda

0-5 min: Why observability matters
5-20 min: The 3 pillars (metrics, logs, traces)
20-30 min: Golden Signals (latency, traffic, errors, saturation)
30-40 min: SLIs, SLOs, SLAs, error budgets
40-55 min: Logging strategy (structured logs, EFK/Loki)
55-60 min: Lab + recap

By the end, learners should be able to

Differentiate metrics, logs, and traces and when to use each
Start monitoring with the Golden Signals (P50/P95/P99, not averages)
Define an SLI + SLO and compute an error budget
Adopt structured JSON logging and avoid PII leakage
Explain common stacks (Prometheus/Grafana, EFK, Loki, OpenTelemetry)

Pillars

The three pillars of observability

Metrics

Numeric measurements over time
Cheap to store and query
Best for dashboards and alerting

Examples: CPU %, RPS, error rate

Logs

Timestamped events
High detail for debugging
Cost grows fast with volume

Examples: request_id, error context

Traces

Request journey across services
Explains where latency is spent
Critical for microservices

Examples: span, trace_id, hop latency

Start with metrics for alerting, add structured logs for debugging, and add tracing when you need to debug latency across distributed systems.

Signals

The Four Golden Signals (Google SRE)

What to monitor first

Latency: P50/P95/P99 (avoid averages)
Traffic: requests per second (RPS)
Errors: 5xx/total, timeouts, failed jobs
Saturation: CPU, memory, queue depth, connections

Latency:    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Traffic:    sum(rate(http_requests_total[5m]))
Errors:     sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Saturation: node_cpu_seconds_total, queue_depth

Names and queries shown are typical Prometheus-style examples. The idea matters more than the exact names.
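
Why percentiles rather than averages: a minimal Python sketch, using made-up latencies, in which a small slow tail barely moves the mean but dominates P99. The `percentile` helper is a hypothetical nearest-rank implementation written for illustration; a metrics backend would normally compute this from histogram buckets.

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Smallest rank such that at least p% of samples are at or below it.
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical data: 98 fast requests (20 ms) and 2 very slow ones (2000 ms).
latencies_ms = [20] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

With this data the mean is 59.6 ms, which looks healthy, while P99 is 2000 ms: one request in fifty is taking two seconds, and only the tail percentile shows it.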

Targets

SLIs, SLOs, SLAs, and error budgets

Definitions

SLI: a measurement (e.g., P99 latency, availability)
SLO: a target for an SLI (e.g., P99 < 200 ms, 99.9% monthly availability)
SLA: contractual commitment with penalties
Error budget: 100% - SLO

Example

SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes allowed downtime
30.4 days avg: ~43.8 minutes

When the error budget is consumed, freeze risky deploys until it refills.
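
The arithmetic in the example can be reproduced with a short Python sketch. The function names and the 30 minutes of downtime in the usage lines are assumptions for illustration, not part of any SLO tooling.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    budget_fraction = (100.0 - slo_percent) / 100.0  # error budget = 100% - SLO
    return budget_fraction * window_days * 24 * 60

def budget_remaining_minutes(slo_percent, window_days, downtime_minutes):
    """Minutes of budget left; a negative value means the budget is overspent."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes

print(round(error_budget_minutes(99.9, 30), 1))    # 30-day month
print(round(error_budget_minutes(99.9, 30.4), 1))  # average month

# Hypothetical: 30 minutes of downtime already used this month.
print(round(budget_remaining_minutes(99.9, 30, 30), 1))
```

The first two lines print 43.2 and 43.8, matching the figures above. A remaining budget near or below zero is the signal to pause risky deploys.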

Logging

Structured logs beat readable logs in production

Unstructured

2024-01-15 ERROR User login failed for john@example.com

Hard to query. Parsing is fragile.

Structured (JSON)

{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
 "event":"login_failed","user":"john@example.com",
 "ip":"1.2.3.4","request_id":"abc-123"}

Queryable and filterable. Add request_id everywhere.

Log JSON in prod. Never log passwords, tokens, or credit card numbers. Mask or hash PII.
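
A minimal sketch of JSON logging with a masked identifier, using only the Python standard library. `JsonFormatter`, `mask_email`, and the `fields` key passed through `extra` are hypothetical names chosen for this example; an unsalted hash is shown for brevity, and a keyed hash (HMAC) would be a stronger choice for real PII.

```python
import hashlib
import json
import logging
from datetime import datetime, timezone

def mask_email(email):
    """Replace an email with a short digest so events correlate without raw PII."""
    digest = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    return "sha256:" + digest[:12]

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Structured fields arrive via logger.<level>(..., extra={"fields": {...}}).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("auth")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "login_failed",
    extra={"fields": {"user": mask_email("john@example.com"), "request_id": "abc-123"}},
)
```

The emitted line stays queryable by `event`, `level`, and `request_id`, and the same user always maps to the same digest, so failed logins can still be grouped per user.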

Stacks

Log aggregation: EFK and Loki

EFK (Kubernetes)

Elasticsearch: storage + search
Fluent Bit/Fluentd: collectors
Kibana: query + dashboards

Collectors

Fluent Bit is lightweight
Typically runs as a DaemonSet
Enrich with labels: app, ns, pod

Loki (alternative)

Label-based log indexing
Lower cost profile than ES in many cases
Often paired with Grafana

Logs are expensive. Keep DEBUG off in production; log high-signal events and make errors rich with context.

What To Log

High-signal logging checklist

Do log

Requests: method, path, status, latency
Correlation: request_id, trace_id (if present)
Errors: type, stack trace, context
Business events: order_created, payment_processed

Don't log

Passwords, tokens, credit cards
Raw PII; mask/hash if required
Everything at DEBUG in production
Duplicate noisy logs without value

Traces

When to add tracing (and how)

When traces pay off

You have multiple services per request
You need to find which hop is slow
You need dependency visibility
You want end-to-end latency breakdown

Typical approach

Use OpenTelemetry SDKs
Sample traces (not 100%)
Export to Jaeger/Tempo/OTel Collector
Correlate: trace_id into logs

Request: client -> gateway

Spans: auth, db, cache

Trace: total latency + hops
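
To make "sample traces" and "trace_id into logs" concrete, here is a standard-library sketch of ratio-based head sampling. It is not the OpenTelemetry API; in practice the SDK's sampler makes this decision. `SAMPLE_RATIO`, `should_sample`, and `log_fields` are hypothetical names, and the 10% ratio is an assumption.

```python
import secrets

SAMPLE_RATIO = 0.10  # assumed: keep roughly 10% of traces

def new_trace_id():
    """Random 128-bit trace ID as 32 hex characters (the W3C Trace Context size)."""
    return secrets.token_hex(16)

def should_sample(trace_id, ratio=SAMPLE_RATIO):
    """Deterministic decision: the same trace ID gives the same answer in every service."""
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(ratio * (1 << 64))

def log_fields(trace_id, **fields):
    """Fields to merge into a structured log line so logs and traces correlate."""
    return {"trace_id": trace_id, "sampled": should_sample(trace_id), **fields}

trace_id = new_trace_id()
print(log_fields(trace_id, route="/pay", status=200))
```

Deriving the decision from the trace ID, rather than rolling a fresh random number per hop, means a trace is either recorded by every service on the path or by none of them.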

Lab

Lab: instrument and observe one service

Service: HTTP endpoint

Metrics: request count + latency

Logs: JSON + request_id

Dashboard: Golden Signals

Alert: p99 latency / error rate

Debug: filter logs by request_id

Lab objectives

Expose a metrics endpoint
Emit structured logs with request_id
Build a Golden Signals dashboard
Create one alert on error rate

Lab Runbook

Step 1: add metrics (Prometheus style)

Expose /metrics
- requests_total{route,status}
- request_duration_seconds{route}

Prefer histograms for latency
Use p95/p99 from histograms

Validate

Hit /metrics and confirm counters increase
Ensure labels do not explode cardinality
Record latency as histogram, not raw values
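
A standard-library sketch of what Step 1 produces: a labeled counter plus a latency histogram with cumulative buckets, rendered in Prometheus-style text. In the lab you would normally use a Prometheus client library instead; the `Metrics` class and the bucket bounds are assumptions, and the `_sum` and `_count` series of a real histogram are omitted for brevity.

```python
import bisect
from collections import defaultdict

# Assumed bucket upper bounds in seconds; the last bucket catches everything.
BUCKETS = [0.05, 0.1, 0.25, 0.5, 1.0, float("inf")]

class Metrics:
    def __init__(self):
        self.requests_total = defaultdict(int)  # (route, status) -> count
        self.duration_buckets = defaultdict(lambda: [0] * len(BUCKETS))  # route -> counts

    def observe(self, route, status, seconds):
        """Record one finished request."""
        self.requests_total[(route, status)] += 1
        # First bucket whose upper bound is >= the observed duration.
        idx = bisect.bisect_left(BUCKETS, seconds)
        self.duration_buckets[route][idx] += 1

    def render(self):
        """Text body for /metrics; bucket values are cumulative, as 'le' implies."""
        lines = []
        for (route, status), n in sorted(self.requests_total.items()):
            lines.append(f'requests_total{{route="{route}",status="{status}"}} {n}')
        for route, counts in sorted(self.duration_buckets.items()):
            running = 0
            for bound, count in zip(BUCKETS, counts):
                running += count
                le = "+Inf" if bound == float("inf") else str(bound)
                lines.append(
                    f'request_duration_seconds_bucket{{route="{route}",le="{le}"}} {running}'
                )
        return "\n".join(lines) + "\n"

metrics = Metrics()
metrics.observe("/pay", 200, 0.018)
metrics.observe("/pay", 500, 2.0)
print(metrics.render())
```

Labels here are limited to `route` and `status`, both of which have a small, bounded set of values. Storing bucket counts instead of raw durations keeps memory constant no matter how many requests arrive.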

Lab Runbook

Step 2: add structured logs + correlation IDs

Log once per request
{"ts":...,"level":"INFO","route":"/pay",
 "status":200,"latency_ms":18,
 "request_id":"...","trace_id":"..."}

Validate

Every log line includes request_id
Errors include stack traces + context
No secrets or PII in logs
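
One way to get a `request_id` onto every line is to set it once per request in a context variable and read it wherever a log line is built. This is a framework-free sketch; `handle_request`, `access_log_line`, and the `emit` parameter are hypothetical names, and a real service would hook this into its web framework's middleware.

```python
import contextvars
import json
import time
import uuid
from datetime import datetime, timezone

# Holds the current request's ID; "-" means "outside any request".
request_id_var = contextvars.ContextVar("request_id", default="-")

def access_log_line(route, status, latency_ms):
    """Build the single per-request log line as JSON."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "level": "ERROR" if status >= 500 else "INFO",
        "route": route,
        "status": status,
        "latency_ms": latency_ms,
        "request_id": request_id_var.get(),
    })

def handle_request(route, handler, emit=print):
    """Run handler() and log once, whether it returns a status or raises."""
    token = request_id_var.set(str(uuid.uuid4()))
    start = time.perf_counter()
    status = 500  # assume failure until the handler returns a status
    try:
        status = handler()
        return status
    finally:
        latency_ms = round((time.perf_counter() - start) * 1000, 1)
        emit(access_log_line(route, status, latency_ms))
        request_id_var.reset(token)

handle_request("/pay", lambda: 200)
```

Any other log call made while the handler runs can read `request_id_var.get()` too, which is what makes filtering by `request_id` in the lab's debug step work.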

Lab Runbook

Step 3: dashboard + one alert

Dashboard panels:
- RPS
- Error rate
- p95/p99 latency
- CPU/memory saturation

Alert example:
error_rate > 1% for 5m

Validate

Force a 500 to see logs + error rate move
Ensure alerts are actionable (avoid noise)
Use paging only for user-impacting issues
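
The "for 5m" part of the alert is what filters out brief spikes. A minimal sketch of that logic, assuming error-rate samples arrive as `(timestamp_seconds, rate)` pairs in time order; `alert_fires` is a hypothetical helper, not how any particular alerting system is implemented.

```python
def alert_fires(samples, threshold=0.01, for_seconds=300):
    """True if the rate has stayed above threshold for at least for_seconds,
    judged at the time of the most recent sample."""
    breach_start = None
    for ts, rate in samples:
        if rate > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current unbroken breach
        else:
            breach_start = None    # any healthy sample resets the clock
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds

# Hypothetical samples taken every 60 seconds.
spike = [(0, 0.0), (60, 0.05), (120, 0.0), (180, 0.0), (240, 0.0), (300, 0.0)]
sustained = [(t, 0.02) for t in range(0, 361, 60)]

print(alert_fires(spike))      # False: one bad minute, then recovery
print(alert_fires(sustained))  # True: above 1% for six minutes
```

Requiring the condition to hold continuously trades a few minutes of detection delay for far fewer pages on transient blips.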

Watch Outs

Common observability mistakes

Mistake 1

Alerting on every symptom with no SLO to anchor thresholds, which causes alert fatigue.

Mistake 2

Logging too much (high cost) or too little (no context), with no request correlation.

Mistake 3

High-cardinality metric labels (user_id, request_id) exploding storage and query time.
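
A back-of-the-envelope sketch of why this matters: the worst-case number of time series for one metric is the product of the distinct values of each label. The label counts below are made-up numbers for illustration.

```python
import math

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct label values."""
    return math.prod(label_cardinalities.values())

# Hypothetical service: 20 routes, 5 status codes.
bounded = {"route": 20, "status": 5}
# Same metric after someone adds a user_id label with 50,000 users.
exploded = {"route": 20, "status": 5, "user_id": 50_000}

print(max_series(bounded))
print(max_series(exploded))
```

The bounded label set tops out at 100 series; adding `user_id` raises the ceiling to 5,000,000. Per-user and per-request detail belongs in logs and traces, not in metric labels.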

Close

Minimum viable observability plan

Start here

Golden Signals dashboard for every user-facing service
One SLO per service (availability or latency)
Structured logs with request_id everywhere
One paging alert tied to user impact

Then iterate

Add tracing for multi-hop latency
Refine alerts using error budgets
Add log-based alerts for specific failure modes
Review dashboards monthly with incidents
