Operate systems with signals, not guesses
A 60-minute session on the three pillars (metrics, logs, traces), Golden Signals, SLO/SLI/error budgets, and an end-to-end lab that connects Git, Docker, CI/CD, Kubernetes (Minikube), and observability.
Agenda and outcomes
Agenda
By the end, learners should be able to
- Pick the right signal: metrics vs logs vs traces
- Start monitoring with Golden Signals (P95/P99, not averages)
- Define an SLO from an SLI and manage an error budget
- Ship structured logs and avoid secrets/PII
- Explain an end-to-end path from code to dashboard
The three pillars of observability
Metrics
- Numeric measurements over time
- Aggregated, efficient
- Best for alerting
Logs
- Timestamped events
- High detail
- Best for debugging
Traces
- Request journey across services
- Latency per hop
- Best for microservices
Start with metrics for alerting, add JSON logs for debugging, add traces when you need end-to-end latency visibility.
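To make "latency per hop" concrete, here is a minimal sketch of a trace as a list of spans. The `Span` fields, the service names, and the timings are all hypothetical, and the calculation assumes that child spans do not overlap and that each service appears once in the trace.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time_by_service(spans):
    """Return each span's duration minus the time spent in its direct children."""
    child_time = {}
    for span in spans:
        if span.parent_id is not None:
            child_time[span.parent_id] = child_time.get(span.parent_id, 0.0) + span.duration_ms
    return {s.service: s.duration_ms - child_time.get(s.span_id, 0.0) for s in spans}


# Hypothetical trace: gateway -> orders -> db
trace = [
    Span("a", None, "gateway", 0, 120),
    Span("b", "a", "orders", 10, 110),
    Span("c", "b", "db", 20, 90),
]
hops = self_time_by_service(trace)
slowest = max(hops, key=hops.get)  # the hop that contributes the most time
```

A metric would only report that the request took 120 ms; the spans show which hop the time was spent in.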
The Four Golden Signals (Google SRE)
What to monitor first
- Latency: P50/P95/P99 (avoid averages)
- Traffic: RPS / throughput
- Errors: 5xx/total, timeouts
- Saturation: CPU, memory, queues
Latency: p99_http_request_duration_seconds
Traffic: http_requests_total
Errors: rate(http_requests_total{status=~"5.."}[5m])
Saturation: node_cpu_seconds_total, queue_depth
Metric names vary; the categories do not.
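Why percentiles instead of averages: a small sketch with made-up latencies, using a nearest-rank percentile. The sample values and the `percentile` helper are illustrative assumptions; Prometheus estimates quantiles from histogram buckets rather than from raw samples.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in ms: 90 fast requests and 10 slow ones
latencies_ms = [20] * 90 + [2000] * 10

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)
```

Here the mean is 218 ms, a latency no request actually had, while P50 (20 ms) and P95/P99 (2000 ms) describe what typical and unlucky users experienced.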
SLIs, SLOs, SLAs, and error budgets
Definitions
- SLI: the measured metric (availability, latency, error rate)
- SLO: target for the SLI over a window
- SLA: contractual commitment
- Error budget: 100% - SLO
Example
SLO: 99.9% monthly availability
Error budget: 0.1%
30 days: 43.2 minutes downtime
30.4 days avg: ~43.8 minutes
If the budget is gone: freeze risky deploys.
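The arithmetic behind the example, as a minimal sketch. The function names are hypothetical, and the calculation assumes a time-based availability SLO.

```python
def error_budget_minutes(slo_percent, window_days):
    """Allowed downtime in minutes for an availability SLO over a window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window_days * 24 * 60 * budget_fraction


def budget_remaining_minutes(slo_percent, window_days, downtime_minutes):
    """Budget left after the downtime already spent; zero or less means it is gone."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


budget_30d = error_budget_minutes(99.9, 30)        # about 43.2 minutes
budget_avg_month = error_budget_minutes(99.9, 30.4)  # about 43.8 minutes
remaining = budget_remaining_minutes(99.9, 30, downtime_minutes=50)
freeze_risky_deploys = remaining <= 0
```

With 50 minutes of downtime already spent against a 43.2-minute budget, `freeze_risky_deploys` is true.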
Structured JSON logs in production
Unstructured
2024-01-15 ERROR User login failed for john@example.com
Structured
{"ts":"2024-01-15T10:00:00Z","level":"ERROR",
"event":"login_failed","user":"john@example.com",
"ip":"1.2.3.4","request_id":"abc-123"}
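One way to emit logs in this shape while keeping secrets and PII out is a custom formatter on the standard `logging` module. This is a sketch, not the sample app's implementation: the `JsonFormatter` class, the `fields` convention, and the `SENSITIVE_KEYS` deny-list are assumptions made for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone

# Hypothetical deny-list; extend it to match your own data classification.
SENSITIVE_KEYS = {"password", "token", "authorization", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line, redacting sensitive keys."""

    def format(self, record):
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        # Extra fields arrive via logger.<level>(..., extra={"fields": {...}}).
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo-obs")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "login_failed",
    extra={"fields": {"user_id": "u-42", "email": "john@example.com", "request_id": "abc-123"}},
)
```

Logging an opaque `user_id` instead of the email keeps the event searchable without storing PII; the redaction is only a safety net for fields that slip through.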
EFK and Loki (logs), Prometheus/Grafana (metrics)
Metrics
- Prometheus scrapes /metrics
- Grafana dashboards and alerts
- Alertmanager routes pages
EFK
- Fluent Bit collector (DaemonSet)
- Elasticsearch storage + search
- Kibana UI
Loki
- Label-based log indexing
- Lower cost profile vs ES in many cases
- Grafana for queries
End-to-end DevOps lab (local)
What you will ship
- App with /health and /metrics
- JSON logs with request_id
- Docker image pushed to GHCR
- GitHub Actions CI + CD deploy to Minikube
- Dashboards + one alert + log queries
Jenkins (optional)
The same pipeline can be expressed as a Jenkinsfile; use it if your org standardizes on Jenkins.
Download prerequisite: demo app bundle
Download
Get the complete sample app (code, Dockerfile, Helm chart, and CI workflows) as a single zip.
Extract
unzip demo-obs-app.zip
cd demo-obs-app
Use this folder as the lab repo. The slides assume you run commands from inside demo-obs-app.
Prerequisites (local machine)
# tools
git --version
docker --version
kubectl version --client
minikube version
helm version
# start cluster
minikube start
kubectl get nodes
Notes
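- With the Docker driver, Minikube runs inside your local Docker but keeps its own image store, so locally built images are not visible to the cluster until you load or push them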
- If you cannot pull GHCR from Minikube, configure registry auth or use minikube image load
- Keep namespaces separate: app and observability
Git repo + minimal app signals
# use the provided sample app (recommended)
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app
git init
# app requirements
/health -> 200 OK
/metrics -> Prometheus format
logs -> JSON per request (request_id)
# commit
git add .
git commit -m "feat: initial app with metrics and logs"
Minimum payload
- Metrics: request count + latency histogram
- Logs: method, path, status, latency_ms, request_id
- Tracing: optional in this lab; add later with OpenTelemetry
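The metrics half of the minimum payload can be sketched with the standard library alone: a request counter plus a latency histogram rendered in the Prometheus text format. The `RequestMetrics` class and the bucket bounds are hypothetical; a real app would more likely use an official Prometheus client library than hand-rolled rendering.

```python
import bisect

# Hypothetical bucket upper bounds in seconds
BUCKETS = (0.05, 0.1, 0.25, 0.5, 1.0)


class RequestMetrics:
    def __init__(self):
        self.requests = {}  # (method, status) -> count
        self.bucket_counts = [0] * (len(BUCKETS) + 1)  # last slot is +Inf
        self.duration_sum = 0.0

    def observe(self, method, status, seconds):
        key = (method, str(status))
        self.requests[key] = self.requests.get(key, 0) + 1
        # Index of the smallest bound >= seconds; len(BUCKETS) means +Inf.
        self.bucket_counts[bisect.bisect_left(BUCKETS, seconds)] += 1
        self.duration_sum += seconds

    def render(self):
        lines = ["# TYPE http_requests_total counter"]
        for (method, status), count in sorted(self.requests.items()):
            lines.append(f'http_requests_total{{method="{method}",status="{status}"}} {count}')
        lines.append("# TYPE http_request_duration_seconds histogram")
        cumulative = 0  # histogram buckets are cumulative
        for bound, count in zip(BUCKETS, self.bucket_counts):
            cumulative += count
            lines.append(f'http_request_duration_seconds_bucket{{le="{bound}"}} {cumulative}')
        total = sum(self.bucket_counts)
        lines.append(f'http_request_duration_seconds_bucket{{le="+Inf"}} {total}')
        lines.append(f"http_request_duration_seconds_sum {self.duration_sum}")
        lines.append(f"http_request_duration_seconds_count {total}")
        return "\n".join(lines) + "\n"


metrics = RequestMetrics()
metrics.observe("GET", 200, 0.03)
metrics.observe("GET", 200, 0.2)
metrics.observe("GET", 500, 2.0)
body = metrics.render()  # what a /metrics handler would return
```

The labels are limited to method and status on purpose; per-request values such as request_id belong in logs.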
Dockerize + run locally
# Dockerfile
# present in the sample app repo:
# demo-obs-app/Dockerfile
# build & run
docker build -t demo-obs:local .
docker run --rm -p 8080:8080 demo-obs:local
curl -s localhost:8080/health
curl -s localhost:8080/metrics | head
Why this matters
- CI builds the same image you run locally
- Debug signals before Kubernetes adds more moving parts
Deploy to Minikube (Kubernetes)
# namespace
kubectl create ns app
# manifests (sketch)
Deployment: image, port 8080
Service: ClusterIP 8080
ServiceMonitor (optional): scrape /metrics
# apply
kubectl -n app apply -f k8s/
kubectl -n app get pods,svc
kubectl -n app port-forward svc/demo-obs 8080:8080
Verify signals
- Hit app via port-forward; verify /metrics
- Check logs: kubectl -n app logs deploy/demo-obs
- Do not put request_id as a metrics label (cardinality)
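The cardinality warning comes down to multiplication: the worst-case number of time series for one metric is the product of the distinct values of each label. A rough sketch with invented label values:

```python
import math


def series_upper_bound(label_values):
    """Worst-case series count for one metric: product of distinct values per label."""
    return math.prod(len(set(values)) for values in label_values.values())


# Hypothetical bounded labels
bounded = {
    "method": ["GET", "POST", "PUT"],
    "path": ["/health", "/metrics", "/orders", "/login"],
    "status": ["200", "400", "404", "500", "503"],
}
bounded_series = series_upper_bound(bounded)

# Adding a per-request label multiplies the count by the number of requests seen
with_request_id = dict(bounded, request_id=[f"req-{i}" for i in range(100_000)])
unbounded_series = series_upper_bound(with_request_id)
```

Sixty series is cheap; six million for the same metric is not, which is why request_id stays in logs and traces.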
Deploy the app with Helm (instead of raw YAML)
# use the chart included with the sample app
# (skip the cd if you are already inside demo-obs-app)
cd demo-obs-app
kubectl create ns app || true
# install (first time)
# GITHUB_SHA is set by GitHub Actions; locally, export it first,
# e.g. export GITHUB_SHA=$(git rev-parse HEAD)
helm install demo-obs ./helm/demo-obs-chart -n app \
  --set image.repository=ghcr.io/ORG/demo-obs \
  --set image.tag=${GITHUB_SHA}
# upgrade (new image tag)
helm upgrade demo-obs ./helm/demo-obs-chart -n app \
  --set image.tag=${GITHUB_SHA}
# rollback (pick the target revision from the history output; 1 is the first install)
helm history demo-obs -n app
helm rollback demo-obs 1 -n app
What to notice
- Helm installs a versioned release you can upgrade/rollback
- Values change behavior without editing templates per environment
- Use release-aware naming to avoid conflicts
GitHub Actions CI: build and push to GHCR
# .github/workflows/ci.yml
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: ghcr.io/ORG/demo-obs:${{ github.sha }}
Notes
- Use GHCR for a clean demo; any registry works
- Tag with commit SHA for traceability
- Promotion can be done with a second tag (e.g., prod)
CD to local Minikube (self-hosted runner)
# Why self-hosted?
GitHub-hosted runners cannot reach your laptop's Minikube.
# Option A (recommended for local):
Use a self-hosted GitHub Actions runner on the same machine
where Minikube runs.
# deploy job (sketch)
jobs:
  deploy:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v4
      - run: helm version
      # a block scalar keeps the line continuations intact
      - run: |
          helm upgrade --install demo-obs ./helm/demo-obs-chart -n app \
            --set image.repository=ghcr.io/ORG/demo-obs \
            --set image.tag=${{ github.sha }}
      - run: kubectl -n app rollout status deploy/demo-obs
Jenkins alternative
// Jenkinsfile (sketch)
pipeline {
  agent any
  stages {
    stage('Build') { steps { sh 'docker build -t demo-obs:ci .' } }
    stage('Push') { steps { sh 'docker push ...' } }
    stage('Deploy') { steps { sh 'helm upgrade --install demo-obs ./helm/demo-obs-chart -n app --set image.tag=$GIT_COMMIT' } }
  }
}
Install observability stack (Helm)
# namespace
kubectl create ns observability
# metrics: kube-prometheus-stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kps prometheus-community/kube-prometheus-stack -n observability \
  --set grafana.adminPassword=admin
# logs: Loki + Promtail
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
helm install loki grafana/loki-stack -n observability
Wire it up
- Scrape app metrics: add ServiceMonitor (if using kube-prometheus-stack)
- Dashboards: Golden Signals + node saturation
- Logs: query by namespace/app labels; filter by request_id
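Filtering by request_id works because every log line is a JSON object carrying that field. The sketch below shows the same correlation offline on a handful of made-up lines; the `events_for_request` helper and the sample events are hypothetical, and in the lab the filtering happens in the log backend rather than in application code.

```python
import json


def events_for_request(log_lines, request_id):
    """Return parsed JSON events whose request_id matches, skipping non-JSON lines."""
    matches = []
    for line in log_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray unstructured lines
        if isinstance(event, dict) and event.get("request_id") == request_id:
            matches.append(event)
    return matches


# Hypothetical log lines from two requests plus one unstructured line
sample_lines = [
    '{"level":"INFO","event":"request_started","path":"/orders","request_id":"abc-123"}',
    '{"level":"INFO","event":"request_started","path":"/health","request_id":"def-456"}',
    "plain text line from a third-party library",
    '{"level":"ERROR","event":"request_failed","status":500,"request_id":"abc-123"}',
]
timeline = events_for_request(sample_lines, "abc-123")
event_names = [event["event"] for event in timeline]
```

The two events for `abc-123` come back in order, which is the timeline you want when chasing a single failed request.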
Grafana login for this lab: user admin, password admin (set above). Change it in real environments.
If you did not set a password, read it from the secret:
kubectl -n observability get secret kps-grafana -o jsonpath='{.data.admin-password}' | base64 -d
Dashboards, alert, and failure drill
# Grafana (port-forward)
kubectl -n observability port-forward svc/kps-grafana 3000:80
# Login
user: admin
pass: admin
# Alert idea (concept)
error_rate > 1% for 5m
p99 latency > 500ms for 10m
# Drill
- force 500s
- observe: error rate panel + logs
- rollback deployment if needed
Acceptance criteria
- RPS and error rate move when you generate traffic
- Latency histogram produces p95/p99
- Logs are queryable and correlated via request_id
- One alert is actionable and not noisy
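The "for 5m" part of the alert idea is what keeps it from being noisy: the condition must hold for the whole duration before anyone is paged. A minimal sketch of that logic, assuming one evaluation per minute so that 5m equals five samples; the function names and sample ratios are hypothetical.

```python
def error_ratio(errors, total):
    """Fraction of failed requests; zero traffic counts as zero errors."""
    return errors / total if total else 0.0


def alert_firing(ratios, threshold, for_samples):
    """True only if the most recent for_samples evaluations all exceeded the threshold."""
    if len(ratios) < for_samples:
        return False
    return all(ratio > threshold for ratio in ratios[-for_samples:])


# One evaluation per minute (assumption); threshold is 1% for 5 minutes
brief_spike = [0.0, 0.0, 0.05, 0.0, 0.0]
sustained = [0.0, 0.02, 0.02, 0.03, 0.02, 0.02]

spike_pages = alert_firing(brief_spike, threshold=0.01, for_samples=5)
sustained_pages = alert_firing(sustained, threshold=0.01, for_samples=5)
```

The one-minute spike does not page; five consecutive minutes above 1% does, which is the behavior to confirm during the failure drill.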
Common mistakes in end-to-end setups
Mistake 1
Trying to CD from a cloud runner to local Minikube without a self-hosted runner.
Mistake 2
High-cardinality labels in metrics (user_id/request_id), which multiply the number of time series and drive up Prometheus memory use and query time.
Mistake 3
Logging secrets/PII or running DEBUG logs in production.
Minimum viable DevOps observability baseline
Baseline
- CI builds + pushes immutable images
- CD deploys and verifies rollout status
- Golden Signals dashboard per service
- One paging alert tied to user impact
Next upgrades
- Add tracing (OpenTelemetry + Tempo/Jaeger)
- Define one SLO per service and track error budgets
- Add canary releases (Argo Rollouts / Flagger)
- Automate env promotion with approvals