Monitoring & Observability

Operate systems with signals, not guesses

A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, SLO/SLI/error budgets, and an end-to-end lab that connects Git, Docker, CI/CD, Kubernetes (Minikube), and observability.

60 minutes total
30 minutes theory
25 minutes end-to-end lab
5 minutes recap + Q&A

Session Map

Agenda and outcomes

Agenda

0-5 min      Why observability matters
5-15 min     3 pillars: metrics, logs, traces
15-22 min    Golden Signals and percentiles
22-30 min    SLIs, SLOs, error budgets
30-55 min    End-to-end lab: Git -> Docker -> CI/CD -> Minikube -> Observability
55-60 min    Recap + Q&A

By the end, learners should be able to

  • Pick the right signal: metrics vs logs vs traces
  • Start monitoring with Golden Signals (P95/P99, not averages)
  • Define an SLO from an SLI and manage an error budget
  • Ship structured logs and avoid secrets/PII
  • Explain an end-to-end path from code to dashboard

Pillars

The three pillars of observability

Metrics

  • Numeric measurements over time
  • Aggregated, efficient
  • Best for alerting
Examples: CPU %, RPS, error rate

Logs

  • Timestamped events
  • High detail
  • Best for debugging
Examples: request_id, error context

Traces

  • Request journey across services
  • Latency per hop
  • Best for microservices
Examples: trace_id, span, hop latency

Start with metrics for alerting, add JSON logs for debugging, add traces when you need end-to-end latency visibility.

Signals

The Four Golden Signals (Google SRE)

What to monitor first

  • Latency: P50/P95/P99 (avoid averages)
  • Traffic: RPS / throughput
  • Errors: 5xx/total, timeouts
  • Saturation: CPU, memory, queues
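Latency:    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Traffic:    sum(rate(http_requests_total[5m]))
Errors:     sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Saturation: node_cpu_seconds_total, queue_depth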

Metric names vary; the categories do not.
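
To see why the Latency bullet says to avoid averages, here is a minimal Python sketch. The latency samples are made up for illustration, and the nearest-rank `percentile` helper is an assumption of this sketch; Prometheus estimates quantiles from histogram buckets instead.

```python
import math
from statistics import mean

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in milliseconds: 94 fast requests and 6 slow ones.
latencies_ms = [20] * 94 + [2000] * 6

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={avg:.1f}ms p50={p50}ms p95={p95}ms p99={p99}ms")
```

The mean lands between the two groups and matches no real request, while p50 shows the typical case and p95/p99 expose the slow tail that users actually feel.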

Targets

SLIs, SLOs, SLAs, and error budgets

Definitions

  • SLI: the measured metric (availability, latency, error rate)
  • SLO: target for the SLI over a window
  • SLA: contractual commitment
  • Error budget: 100% - SLO
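
The relationship between an SLI and an SLO can be sketched in a few lines of Python. The function names and the request counts are hypothetical; a real SLI would come from your metrics backend.

```python
def availability_sli(good_requests, total_requests):
    """SLI as the percentage of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100 * good_requests / total_requests

def meets_slo(sli_percent, slo_percent):
    """True when the measured SLI is at or above the SLO target."""
    return sli_percent >= slo_percent

# Hypothetical month: 1,000,000 requests, 700 of them failed.
sli = availability_sli(good_requests=999_300, total_requests=1_000_000)
print(f"SLI={sli:.2f}% meets 99.9% SLO: {meets_slo(sli, 99.9)}")
```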

Example

SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes downtime
30.4 days avg: ~43.8 minutes

If the budget is gone: freeze risky deploys.
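
The arithmetic in the example can be checked with a short Python sketch. The helper names are hypothetical, and the 10.8 minutes of downtime is an invented figure used only to show how the remaining budget is tracked.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100

def budget_remaining(slo_percent, window_days, downtime_minutes):
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - downtime_minutes) / budget

print(f"{error_budget_minutes(99.9, 30):.1f} min")    # 30-day month
print(f"{error_budget_minutes(99.9, 30.4):.1f} min")  # average month
print(f"{budget_remaining(99.9, 30, 10.8):.0%} of budget left")
```

A negative result from `budget_remaining` is the signal to freeze risky deploys.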

Logging

Structured JSON logs in production

Unstructured

2024-01-15 ERROR User login failed for john@example.com

Structured

{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
 "event":"login_failed","user":"john@example.com",
 "ip":"1.2.3.4","request_id":"abc-123"}
  • Always include request_id
  • Never log passwords or tokens
  • Mask or hash PII; in real logs the user and ip values shown above would be masked
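
A minimal sketch of these rules using only the Python standard library. The logger name, the `fields` convention passed through `extra`, and the `mask_pii` helper are assumptions of this sketch, not part of the sample app; a plain truncated SHA-256 is shown for brevity, and a keyed hash would likely be a better fit for real PII.

```python
import hashlib
import json
import logging
import sys
import uuid
from datetime import datetime, timezone

def mask_pii(value):
    """Replace a PII value with a short, stable digest so events stay correlatable."""
    return "sha256:" + hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]

class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""
    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra structured fields are passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

def build_logger(stream=sys.stdout):
    """Return a logger that writes JSON lines to the given stream."""
    logger = logging.getLogger("demo-obs")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger

logger = build_logger()
request_id = str(uuid.uuid4())
logger.error(
    "login_failed",
    extra={"fields": {"user": mask_pii("john@example.com"), "request_id": request_id}},
)
```

Because every line is a JSON object with a request_id, a log backend can filter and join events without regex parsing.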

Stacks

EFK and Loki (logs), Prometheus/Grafana (metrics)

Metrics

  • Prometheus scrapes /metrics
  • Grafana dashboards and alerts
  • Alertmanager routes pages

EFK

  • Fluent Bit collector (DaemonSet)
  • Elasticsearch storage + search
  • Kibana UI

Loki

  • Label-based log indexing
  • Often cheaper to run than Elasticsearch because it indexes labels, not the full log text
  • Grafana for queries

Lab

End-to-end DevOps lab (local)

Git: repo + commits
Docker: build image
CI: GitHub Actions
CD: self-hosted runner -> Minikube
K8s: Deployment + Service
Obs: Prometheus/Grafana + Loki

What you will ship

  • App with /health and /metrics
  • JSON logs with request_id
  • Docker image pushed to GHCR
  • GitHub Actions CI + CD deploy to Minikube
  • Dashboards + one alert + log queries

Jenkins (optional)

Same pipeline can be expressed as a Jenkinsfile; use it if your org standardizes on Jenkins.

Lab Download

Download prerequisite: demo app bundle

Download

Get the complete sample app (code, Dockerfile, Helm chart, and CI workflows) as a single zip.

Included: /health, /metrics, JSON logs + request_id

Download demo-obs-app.zip

Extract

unzip demo-obs-app.zip
cd demo-obs-app

Use this folder as the lab repo. The slides assume you run commands from inside demo-obs-app.

Lab Step 0

Prerequisites (local machine)

# tools
git --version
docker --version
kubectl version --client
minikube version
helm version

# start cluster
minikube start
kubectl get nodes

Notes

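  • Minikube often runs on the Docker driver, but it keeps its own image store: images built with your local Docker are not available in the cluster automatically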
  • If you cannot pull GHCR from Minikube, configure registry auth or use minikube image load
  • Keep namespaces separate: app and observability

Lab Step 1

Git repo + minimal app signals

# use the provided sample app (recommended)
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app
git init

# app requirements (checklist, not commands)
# /health  -> 200 OK
# /metrics -> Prometheus format
# logs     -> JSON per request (request_id)

# commit

git add .
git commit -m "feat: initial app with metrics and logs"

Minimum payload

  • Metrics: request count + latency histogram
  • Logs: method, path, status, latency_ms, request_id
  • Tracing: optional in this lab; add later with OpenTelemetry
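
To make the metrics payload concrete, here is a stdlib-only Python sketch that tracks a request counter and a latency histogram and renders them in the Prometheus text format. The class name, bucket bounds, and metric names are assumptions for illustration; a real app would more likely use an official Prometheus client library than hand-roll this.

```python
from bisect import bisect_left

# Histogram upper bounds in seconds (hypothetical choice for this sketch).
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0)

class RequestMetrics:
    def __init__(self):
        self.requests = {}                               # (method, status) -> count
        self.bucket_counts = [0] * (len(BUCKETS) + 1)    # last slot is +Inf
        self.duration_sum = 0.0

    def observe(self, method, status, seconds):
        """Record one finished request."""
        key = (method, str(status))
        self.requests[key] = self.requests.get(key, 0) + 1
        # bisect_left puts a value equal to a bound into that bound's bucket (le is inclusive).
        self.bucket_counts[bisect_left(BUCKETS, seconds)] += 1
        self.duration_sum += seconds

    def render(self):
        """Return the body a /metrics endpoint would serve."""
        lines = ["# TYPE http_requests_total counter"]
        for (method, status), count in sorted(self.requests.items()):
            lines.append(f'http_requests_total{{method="{method}",status="{status}"}} {count}')
        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count  # histogram buckets are cumulative
            lines.append(f'http_request_duration_seconds_bucket{{le="{bound}"}} {cumulative}')
        cumulative += self.bucket_counts[-1]
        lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {cumulative}')
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {cumulative}")
        return "\n".join(lines) + "\n"

metrics = RequestMetrics()
metrics.observe("GET", 200, 0.03)
metrics.observe("GET", 200, 0.1)
metrics.observe("GET", 500, 2.0)
print(metrics.render())
```

Labels here are limited to method and status, which keeps the number of series small and bounded.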

Lab Step 2

Dockerize + run locally

# Dockerfile
# present in the sample app repo:
# demo-obs-app/Dockerfile

# build & run (detached, so the same terminal can run curl)
docker build -t demo-obs:local .
docker run --rm -d --name demo-obs -p 8080:8080 demo-obs:local
curl -s localhost:8080/health
curl -s localhost:8080/metrics | head
docker stop demo-obs

Why this matters

  • CI builds the same image you run locally
  • Debug signals before Kubernetes adds more moving parts

Lab Step 3

Deploy to Minikube (Kubernetes)

# namespace
kubectl create ns app

# manifests (sketch, not commands)
# Deployment: image, port 8080
# Service: ClusterIP 8080
# ServiceMonitor (optional): scrape /metrics

# apply
kubectl -n app apply -f k8s/
kubectl -n app get pods,svc
# port-forward runs in the foreground; use a second terminal for curl
kubectl -n app port-forward svc/demo-obs 8080:8080

Verify signals

  • Hit app via port-forward; verify /metrics
  • Check logs: kubectl -n app logs deploy/demo-obs
  • Do not put request_id as a metrics label (cardinality)
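
The cardinality warning can be made concrete with a rough Python estimate. The label names and value counts below are hypothetical; the product is an upper bound on the number of time series for one metric.

```python
from math import prod

def series_count(label_values):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Hypothetical bounded labels for an HTTP request counter.
bounded = {"method": 5, "path": 20, "status": 6}

# Adding request_id introduces a new label value for every request.
with_request_id = {**bounded, "request_id": 1_000_000}

print(series_count(bounded))          # 600
print(series_count(with_request_id))  # 600000000
```

Even if the real count stays well below the bound, every new request_id creates at least one new series, so the total never stops growing. Keep request_id in logs and traces instead.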

Lab Step 4

Deploy the app with Helm (instead of raw YAML)

# use the chart included with the sample app
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app

kubectl create ns app || true

# GITHUB_SHA is set by GitHub Actions; for a local run, set it yourself:
# export GITHUB_SHA=$(git rev-parse HEAD)

# install (first time); replace ORG with your GitHub user or org
helm install demo-obs ./helm/demo-obs-chart -n app \
  --set image.repository=ghcr.io/ORG/demo-obs \
  --set image.tag=${GITHUB_SHA}

# upgrade (new image tag)
helm upgrade demo-obs ./helm/demo-obs-chart -n app \
  --set image.tag=${GITHUB_SHA}

# rollback (previous revision)
helm history demo-obs -n app
helm rollback demo-obs 1 -n app

What to notice

  • Helm installs a versioned release you can upgrade/rollback
  • Values change behavior without editing templates per environment
  • Use release-aware naming to avoid conflicts
Chart files: templates/Deployment, templates/Service, values.yaml

Lab Step 5

GitHub Actions CI: build and push to GHCR

# .github/workflows/ci.yml
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/ORG/demo-obs:${{ github.sha }}

Notes

  • Use GHCR for a clean demo; any registry works
  • Tag with commit SHA for traceability
  • Promotion can be done with a second tag (e.g., prod)

Lab Step 6

CD to local Minikube (self-hosted runner)

# Why self-hosted?
GitHub-hosted runners cannot reach your laptop's Minikube.

# Option A (recommended for local):
Use a self-hosted GitHub Actions runner on the same machine
where Minikube runs.

# deploy job (sketch)
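jobs:
  deploy:
    runs-on: self-hosted
    # if this job lives in the same workflow as the build job, add: needs: build
    steps:
      - uses: actions/checkout@v4
      - run: helm version
      # block scalar (|) keeps the line continuations intact for the shell
      - run: |
          helm upgrade --install demo-obs ./helm/demo-obs-chart -n app \
            --set image.repository=ghcr.io/ORG/demo-obs \
            --set image.tag=${{ github.sha }}
      - run: kubectl -n app rollout status deploy/demo-obs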

Jenkins alternative

// Jenkinsfile (sketch)
pipeline {
  agent any
  stages {
    stage('Build') { steps { sh 'docker build -t ghcr.io/ORG/demo-obs:$GIT_COMMIT .' } }
    stage('Push')  { steps { sh 'docker push ...' } }
    stage('Deploy'){ steps { sh 'helm upgrade --install demo-obs ./helm/demo-obs-chart -n app --set image.tag=$GIT_COMMIT' } }
  }
}

Lab Step 7

Install observability stack (Helm)

# namespace
kubectl create ns observability

# metrics: kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kps prometheus-community/kube-prometheus-stack -n observability \
  --set grafana.adminPassword=admin

# logs: Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack -n observability

Wire it up

  • Scrape app metrics: add a ServiceMonitor (if using kube-prometheus-stack); by default the chart only selects ServiceMonitors labeled release: kps (the Helm release name)
  • Dashboards: Golden Signals + node saturation
  • Logs: query by namespace/app labels; filter by request_id
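
In the lab, filtering by request_id happens in the log backend. The same idea can be tried locally on raw JSON log lines with this Python sketch; the function name and the sample lines are hypothetical.

```python
import json

def filter_by_request_id(lines, request_id):
    """Return parsed JSON log events whose request_id matches."""
    matches = []
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as startup banners
        if isinstance(event, dict) and event.get("request_id") == request_id:
            matches.append(event)
    return matches

# Hypothetical log output, for example captured from kubectl logs.
sample_lines = [
    "server listening on :8080",
    '{"level":"INFO","event":"request","status":200,"request_id":"abc-123"}',
    '{"level":"ERROR","event":"login_failed","request_id":"abc-123"}',
    '{"level":"INFO","event":"request","status":200,"request_id":"def-456"}',
]

for event in filter_by_request_id(sample_lines, "abc-123"):
    print(event["level"], event["event"])
```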

Grafana login for this lab: user admin, password admin (set above). Change it in real environments.

If you did not set a password, read it from the secret:

kubectl -n observability get secret kps-grafana -o jsonpath='{.data.admin-password}' | base64 -d

Lab Step 8

Dashboards, alert, and failure drill

# Grafana (port-forward)
kubectl -n observability port-forward svc/kps-grafana 3000:80

# Login
user: admin
pass: admin

# Alert idea (concept)
error_rate > 1% for 5m
p99 latency > 500ms for 10m

# Drill
- force 500s
- observe: error rate panel + logs
- rollback deployment if needed

Acceptance criteria

  • RPS and error rate move when you generate traffic
  • Latency histogram produces p95/p99
  • Logs are queryable and correlated via request_id
  • One alert is actionable and not noisy
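
One way to keep an alert from being noisy is to require the condition to hold for a sustained period, as in the "error_rate > 1% for 5m" idea. The Python sketch below illustrates that logic on made-up samples; `should_fire` is a hypothetical helper and a simplification, not how any particular alerting engine is implemented.

```python
def should_fire(samples, threshold, for_seconds):
    """Decide whether an alert fires as of the last sample.

    samples: list of (timestamp_seconds, value) pairs sorted by time.
    Fires only if the value has stayed above threshold continuously
    for at least for_seconds.
    """
    breach_start = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # start of the current breach
        else:
            breach_start = None    # any recovery resets the timer
    if breach_start is None:
        return False
    return samples[-1][0] - breach_start >= for_seconds

# Hypothetical error-rate samples taken every 60 seconds.
sustained = [(0, 0.002), (60, 0.02), (120, 0.03), (180, 0.02),
             (240, 0.015), (300, 0.02), (360, 0.02)]
blip = [(0, 0.02), (60, 0.02), (120, 0.001), (180, 0.02)]

print(should_fire(sustained, threshold=0.01, for_seconds=300))  # True
print(should_fire(blip, threshold=0.01, for_seconds=300))       # False
```

The short blip never pages anyone, while the sustained breach does, which is the behavior the failure drill should confirm.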

Watch Outs

Common mistakes in end-to-end setups

Mistake 1

Trying to CD from a cloud runner to local Minikube without a self-hosted runner.

Mistake 2

High-cardinality labels in metrics (user_id/request_id), which multiply the number of time series and inflate Prometheus memory use and query cost.

Mistake 3

Logging secrets/PII or running DEBUG logs in production.

Close

Minimum viable DevOps observability baseline

Baseline

  • CI builds + pushes immutable images
  • CD deploys and verifies rollout status
  • Golden Signals dashboard per service
  • One paging alert tied to user impact

Next upgrades

  • Add tracing (OpenTelemetry + Tempo/Jaeger)
  • Define one SLO per service and track error budgets
  • Add canary releases (Argo Rollouts / Flagger)
  • Automate env promotion with approvals