Monitoring & Observability

Operate systems with signals, not guesses

01 / 20

A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, SLO/SLI/error budgets, and an end-to-end lab that connects Git, Docker, CI/CD, Kubernetes (Minikube), and observability.

60 minutes total

30 minutes theory

25 minutes end-to-end lab

5 minutes recap + Q&A

Session Map

Agenda and outcomes

02 / 20

Agenda

0-5

Why observability matters

5-15

3 pillars: metrics, logs, traces

15-22

Golden Signals and percentiles

22-30

SLIs, SLOs, error budgets

30-55

End-to-end lab: Git -> Docker -> CI/CD -> Minikube -> Observability

55-60

Recap + Q&A

By the end, learners should be able to

Pick the right signal: metrics vs logs vs traces
Start monitoring with Golden Signals (P95/P99, not averages)
Define an SLO from an SLI and manage an error budget
Ship structured logs and avoid secrets/PII
Explain an end-to-end path from code to dashboard

Pillars

The three pillars of observability

03 / 20

Metrics

Numeric measurements over time
Aggregated, efficient
Best for alerting

CPU %, RPS, error rate

Logs

Timestamped events
High detail
Best for debugging

request_id, error context

Traces

Request journey across services
Latency per hop
Best for microservices

trace_id, span, hop latency

Start with metrics for alerting, add JSON logs for debugging, add traces when you need end-to-end latency visibility.

Signals

The Four Golden Signals (Google SRE)

04 / 20

What to monitor first

Latency: P50/P95/P99 (avoid averages)
Traffic: RPS / throughput
Errors: 5xx/total, timeouts
Saturation: CPU, memory, queues

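Latency:    histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
Traffic:    sum(rate(http_requests_total[5m]))
Errors:     sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Saturation: node_cpu_seconds_total, queue_depth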

Metric names vary; the categories do not.
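
Why percentiles beat averages can be shown with a small sketch. The latency values below are made up for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not a function from any monitoring library.

```python
from statistics import mean


def percentile(samples: list[float], p: int) -> float:
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Integer ceil(p / 100 * n), kept in integers to avoid float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Hypothetical latencies in milliseconds: 90 fast requests and 10 slow ones.
latencies_ms = [100.0] * 90 + [2000.0] * 10

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
print(summary)
```

The mean lands between the two groups and describes no real request, while P50 shows the typical case and P95/P99 expose the slow tail that users actually feel.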

Targets

SLIs, SLOs, SLAs, and error budgets

05 / 20

Definitions

SLI: the measured metric (availability, latency, error rate)
SLO: target for the SLI over a window
SLA: contractual commitment
Error budget: 100% - SLO

Example

SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes downtime
30.4 days avg: ~43.8 minutes

If the budget is gone: freeze risky deploys.
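
The arithmetic behind the example can be sketched as follows. The function names are hypothetical, and the sketch assumes a purely time-based availability SLI (downtime minutes over window minutes).

```python
def error_budget_minutes(slo_percent: float, window_days: float) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent: float, window_days: float, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return 1.0 - downtime_minutes / budget


print(round(error_budget_minutes(99.9, 30), 1))    # 30-day month
print(round(error_budget_minutes(99.9, 30.4), 1))  # average month
print(round(budget_remaining(99.9, 30, 21.6), 2))  # after 21.6 minutes of downtime
```

A deploy-freeze policy would then key off `budget_remaining` reaching zero rather than off any single incident.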

Logging

Structured JSON logs in production

06 / 20

Unstructured

2024-01-15 ERROR User login failed for john@example.com

Structured

{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
 "event":"login_failed","user":"j***@example.com",
 "ip":"1.2.3.4","request_id":"abc-123"}

Always include request_id. Never log passwords/tokens. Mask/hash PII.
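
A minimal sketch of these three rules, using only the Python standard library. The helper names (`mask_email`, `log_line`) and the `SENSITIVE_KEYS` list are assumptions for illustration; a real service would more likely configure its logging framework to do the same thing.

```python
import json
from datetime import datetime, timezone

# Hypothetical deny-list: fields that must never reach the log stream.
SENSITIVE_KEYS = {"password", "token", "authorization"}


def mask_email(email: str) -> str:
    """Keep the first character and the domain, hide the rest of the local part."""
    local, sep, domain = email.partition("@")
    if not sep or not local:
        return "***"
    return f"{local[0]}***@{domain}"


def log_line(level: str, event: str, request_id: str, **fields) -> str:
    """Build one JSON log line; request_id is a required argument by design."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "level": level,
        "event": event,
        "request_id": request_id,
    }
    for key, value in fields.items():
        if key.lower() in SENSITIVE_KEYS:
            continue  # drop secrets instead of logging them
        if key == "user" and isinstance(value, str):
            value = mask_email(value)  # mask PII before it is written
        record[key] = value
    return json.dumps(record)


print(log_line("ERROR", "login_failed", "abc-123",
               user="john@example.com", password="hunter2"))
```

Making `request_id` a positional parameter means a log call without it fails loudly during development instead of silently producing uncorrelatable lines.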

Stacks

EFK and Loki (logs), Prometheus/Grafana (metrics)

07 / 20

Metrics

Prometheus scrapes /metrics
Grafana dashboards and alerts
Alertmanager routes pages

EFK

Fluent Bit collector (DaemonSet)
Elasticsearch storage + search
Kibana UI

Loki

Label-based log indexing
Often cheaper to run than Elasticsearch
Grafana for queries

Lab

End-to-end DevOps lab (local)

08 / 20

Git: repo + commits

Docker: build image

CI: GitHub Actions

CD: self-hosted runner -> Minikube

K8s: Deployment + Service

Obs: Prom/Grafana + Loki

What you will ship

App with /health and /metrics
JSON logs with request_id
Docker image pushed to GHCR
GitHub Actions CI + CD deploy to Minikube
Dashboards + one alert + log queries

Jenkins (optional)

The same pipeline can be expressed as a Jenkinsfile; use it if your org standardizes on Jenkins.

Lab Download

Download prerequisite: demo app bundle

09 / 20

Download

Get the complete sample app (code, Dockerfile, Helm chart, and CI workflows) as a single zip.

/health, /metrics, JSON logs + request_id

Download demo-obs-app.zip

Extract

unzip demo-obs-app.zip
cd demo-obs-app

Use this folder as the lab repo. The slides assume you run commands from inside demo-obs-app.

Lab Step 0

Prerequisites (local machine)

10 / 20

# tools
git --version
docker --version
kubectl version --client
minikube version
helm version

# start cluster
minikube start
kubectl get nodes

Notes

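Minikube often runs on the Docker driver, but the cluster keeps its own image store: images built on the host are not visible to pods until you push them to a registry or run minikube image load
If Minikube cannot pull from GHCR, configure registry auth or use minikube image load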
Keep namespaces separate: app and observability

Lab Step 1

Git repo + minimal app signals

11 / 20

# use the provided sample app (recommended)
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app
git init

# app requirements
/health  -> 200 OK
/metrics -> Prometheus format
logs     -> JSON per request (request_id)

# commit

git add .
git commit -m "feat: initial app with metrics and logs"

Minimum payload

Metrics: request count + latency histogram
Logs: method, path, status, latency_ms, request_id
Tracing: optional in this lab; add later with OpenTelemetry

Lab Step 2

Dockerize + run locally

12 / 20

# Dockerfile
# present in the sample app repo:
# demo-obs-app/Dockerfile

# build & run
docker build -t demo-obs:local .
docker run --rm -p 8080:8080 demo-obs:local
curl -s localhost:8080/health
curl -s localhost:8080/metrics | head

Why this matters

CI builds the same image you run locally
Debug signals before Kubernetes adds more moving parts

Lab Step 3

Deploy to Minikube (Kubernetes)

13 / 20

# namespace
kubectl create ns app

# manifests (sketch)
Deployment: image, port 8080
Service: ClusterIP 8080
ServiceMonitor (optional): scrape /metrics

# apply
kubectl -n app apply -f k8s/
kubectl -n app get pods,svc
kubectl -n app port-forward svc/demo-obs 8080:8080

Verify signals

Hit app via port-forward; verify /metrics
Check logs: kubectl -n app logs deploy/demo-obs
Do not use request_id as a metric label (high cardinality: every request creates a new time series)
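
The cardinality point can be made concrete with a rough count. This sketch assumes the worst case where every label combination occurs; `series_count` and the label values are hypothetical.

```python
def series_count(label_values: dict[str, list[str]]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total


# Bounded labels: a handful of paths and status codes.
bounded = {
    "path": ["/health", "/metrics", "/"],
    "status": ["200", "404", "500"],
}

# Same labels plus one unbounded label: a unique request_id per request.
unbounded = dict(bounded, request_id=[f"req-{i}" for i in range(10_000)])

print(series_count(bounded))
print(series_count(unbounded))
```

The bounded set stays at nine series no matter how much traffic arrives, while the request_id label grows with every request, which is why identifiers belong in logs and traces instead.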

Lab Step 4

Deploy the app with Helm (instead of raw YAML)

14 / 20

# use the chart included with the sample app
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app

kubectl create ns app || true

# GITHUB_SHA is only set inside GitHub Actions; set it yourself locally
export GITHUB_SHA=$(git rev-parse HEAD)

# install (first time)
helm install demo-obs ./helm/demo-obs-chart -n app \
  --set image.repository=ghcr.io/ORG/demo-obs \
  --set image.tag=${GITHUB_SHA}

# upgrade (new image tag)
helm upgrade demo-obs ./helm/demo-obs-chart -n app \
  --set image.tag=${GITHUB_SHA}

# rollback (previous revision)
helm history demo-obs -n app
helm rollback demo-obs 1 -n app

What to notice

Helm installs a versioned release you can upgrade/rollback
Values change behavior without editing templates per environment
Use release-aware naming to avoid conflicts

templates/Deployment, templates/Service, values.yaml

Lab Step 5

GitHub Actions CI: build and push to GHCR

15 / 20

# .github/workflows/ci.yml
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/ORG/demo-obs:${{ github.sha }}

Notes

Use GHCR for a clean demo; any registry works
Tag with commit SHA for traceability
Promotion can be done with a second tag (e.g., prod)

Lab Step 6

CD to local Minikube (self-hosted runner)

16 / 20

# Why self-hosted?
GitHub-hosted runners cannot reach your laptop's Minikube.

# Option A (recommended for local):
Use a self-hosted GitHub Actions runner on the same machine
where Minikube runs.

# deploy job (sketch)
jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: helm version
      # block scalar (|) keeps the line breaks so the shell sees valid continuations
      - run: |
          helm upgrade --install demo-obs ./helm/demo-obs-chart -n app \
            --set image.repository=ghcr.io/ORG/demo-obs \
            --set image.tag=${{ github.sha }}
      - run: kubectl -n app rollout status deploy/demo-obs

Jenkins alternative

// Jenkinsfile (sketch)
pipeline {
  agent any
  stages {
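    // tag the image with the commit so Build and Deploy refer to the same artifact
    stage('Build') { steps { sh 'docker build -t ghcr.io/ORG/demo-obs:$GIT_COMMIT .' } }
    stage('Push')  { steps { sh 'docker push ...' } }
    stage('Deploy'){ steps { sh 'helm upgrade --install demo-obs ./helm/demo-obs-chart -n app --set image.repository=ghcr.io/ORG/demo-obs --set image.tag=$GIT_COMMIT' } }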
  }
}

Lab Step 7

Install observability stack (Helm)

17 / 20

# namespace
kubectl create ns observability

# metrics: kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kps prometheus-community/kube-prometheus-stack -n observability \
  --set grafana.adminPassword=admin

# logs: Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack -n observability

Wire it up

Scrape app metrics: add ServiceMonitor (if using kube-prometheus-stack)
Dashboards: Golden Signals + node saturation
Logs: query by namespace/app labels; filter by request_id

Grafana login for this lab: user admin, password admin (set above). Change it in real environments.

If you did not set a password, read it from the secret: kubectl -n observability get secret kps-grafana -o jsonpath='{.data.admin-password}' | base64 -d

Lab Step 8

Dashboards, alert, and failure drill

18 / 20

# Grafana (port-forward)
kubectl -n observability port-forward svc/kps-grafana 3000:80

# Login
user: admin
pass: admin

# Alert idea (concept)
error_rate > 1% for 5m
p99 latency > 500ms for 10m

# Drill
- force 500s
- observe: error rate panel + logs
- rollback deployment if needed
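
The "for 5m" part of the alert idea is what keeps a single spike from paging anyone. A simplified sketch of that behavior, assuming one evaluation per minute; `error_ratio` and `firing` are hypothetical helpers, not the Prometheus rule engine.

```python
def error_ratio(errors_delta: float, total_delta: float) -> float:
    """Share of failed requests in one evaluation interval."""
    return errors_delta / total_delta if total_delta > 0 else 0.0


def firing(ratios: list[float], threshold: float, for_evals: int) -> bool:
    """True only when the last `for_evals` evaluations all breach the threshold."""
    if for_evals <= 0 or len(ratios) < for_evals:
        return False
    return all(r > threshold for r in ratios[-for_evals:])


# One value per minute (assumed), so "for 5m" means 5 consecutive breaches.
spike = [0.0, 0.0, 0.20, 0.0, 0.0, 0.0]          # one bad minute
sustained = [0.0, 0.02, 0.03, 0.02, 0.05, 0.04]  # five bad minutes in a row

print(firing(spike, threshold=0.01, for_evals=5))
print(firing(sustained, threshold=0.01, for_evals=5))
```

During the drill, forcing 500s for less than the "for" window should leave the alert silent, while a sustained failure should fire it; that difference is a quick check that the alert is actionable and not noisy.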

Acceptance criteria

RPS and error rate move when you generate traffic
Latency histogram produces p95/p99
Logs are queryable and correlated via request_id
One alert is actionable and not noisy
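
How a latency histogram turns into p95/p99 can be sketched as linear interpolation inside the bucket that contains the target rank. This is a simplified approximation of what Prometheus's `histogram_quantile` does for classic histograms; the function below is a hypothetical stand-in with made-up bucket data, and it ignores several edge cases the real implementation handles.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                return lower_bound  # rank falls in +Inf: report the highest finite bound
            if count == lower_count:
                return lower_bound  # empty bucket, nothing to interpolate
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Hypothetical cumulative counts: 50 requests <= 0.1s, 90 <= 0.5s, 100 <= 1.0s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]

print(histogram_quantile(0.50, buckets))
print(histogram_quantile(0.95, buckets))
```

Because the result is interpolated, its accuracy depends on bucket boundaries; placing a bucket bound near the latency target (for example 0.5s for a 500ms alert) keeps the estimate meaningful.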

Watch Outs

Common mistakes in end-to-end setups

19 / 20

Mistake 1

Trying to CD from a cloud runner to local Minikube without a self-hosted runner.

Mistake 2

High-cardinality labels in metrics (user_id/request_id) causing Prometheus pain.

Mistake 3

Logging secrets/PII or running DEBUG logs in production.

Close

Minimum viable DevOps observability baseline

20 / 20

Baseline

CI builds + pushes immutable images
CD deploys and verifies rollout status
Golden Signals dashboard per service
One paging alert tied to user impact

Next upgrades

Add tracing (OpenTelemetry + Tempo/Jaeger)
Define one SLO per service and track error budgets
Add canary releases (Argo Rollouts / Flagger)
Automate env promotion with approvals