Skillweave - Incident Learning Platform

Overview

Reading time: ~45 minutes

What is Kubernetes (K8s)?

Simple definition: An orchestration platform that automatically deploys, scales, and manages containerized applications.

Why "K8s"? Kubernetes → K-ubernete-s (8 letters between K and s) → K8s

Think of it as: An operating system for a cluster of computers that makes them act like one big computer.

Why Kubernetes for Byte Edge?

The Problem Without K8s

Imagine deploying to 60,000 stores manually:

Deploy POS backend v2.1 → SSH to each store, copy files, restart service
Service crashes → No automatic restart
Need 3 instances for high availability → Manually start 3 copies and load balance
Store hardware fails → Manually move workloads to backup hardware

Untenable at scale.

The Solution With K8s

Declare desired state: "Run 3 copies of POS backend v2.1"
K8s ensures it happens: Automatically deploys, monitors, restarts if crashed
Rolling updates: Deploy v2.2 gradually, roll back if issues appear
Self-healing: If a container crashes, K8s restarts it automatically
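
The desired-state idea above can be sketched as a tiny reconciliation step. This is a simplified Python model for intuition only, not how Kubernetes is implemented; the reconcile function, the pod dictionaries, and the generated pod names are hypothetical and exist only for this illustration.

```python
def reconcile(desired_replicas, pods):
    """Return the actions that move the observed pods toward the desired count.

    `pods` is a list of dicts such as {"name": "pos-backend-a", "healthy": True}.
    """
    keep, surplus = pods[:desired_replicas], pods[desired_replicas:]
    # Too many copies: stop the extras.
    actions = [("stop", p["name"]) for p in surplus]
    # Self-healing: restart any copy we keep that is unhealthy.
    actions += [("restart", p["name"]) for p in keep if not p["healthy"]]
    # Too few copies: start new ones (names are made up for the example).
    shortfall = desired_replicas - len(keep)
    actions += [("start", f"pos-backend-new-{i}") for i in range(1, shortfall + 1)]
    return actions


observed = [
    {"name": "pos-backend-a", "healthy": True},
    {"name": "pos-backend-b", "healthy": False},
]
print(reconcile(3, observed))
# [('restart', 'pos-backend-b'), ('start', 'pos-backend-new-1')]
```

The point for IM: you state the target ("3 copies"), and the platform keeps comparing reality to that target and correcting the difference.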

Core Kubernetes Concepts

1. Container

What: A lightweight, standalone package of software with everything it needs to run.

Analogy: A shipping container that works on any truck, ship, or train.

# Example: Container image for a POS backend
FROM node:18
COPY app.js /app/
CMD ["node", "/app/app.js"]

Key point for IM: When investigating issues, you're looking at logs from containers.

2. Pod

What: The smallest deployable unit in K8s. Usually contains one container (sometimes multiple related containers).

Analogy: A pod is like a small logical host for your container(s): containers in the same pod share an IP address and storage volumes.

# Example: POS backend pod
apiVersion: v1
kind: Pod
metadata:
  name: pos-backend-abc123
  labels:
    app: pos-backend
spec:
  containers:
  - name: pos-backend
    image: yum/pos-backend:2.1
    ports:
    - containerPort: 8080

Key point for IM: During an incident, you'll often check if pods are running or crashing.

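Common pod statuses (the STATUS column of kubectl get pods):

Running: Containers have started; also check the READY column, since a Running pod can still be failing its readiness checks
CrashLoopBackOff: Container keeps crashing and K8s waits longer between each restart (red flag!)
Pending: Accepted by the cluster but not yet running, typically waiting to be scheduled onto a node
ImagePullBackOff: Can't download the container image (network issue, wrong image name, or missing registry credentials)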
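
As a rough illustration of how these statuses can be triaged, the sketch below scans text shaped like single-namespace kubectl get pods output and lists every pod that is not Running and fully ready. The sample output and the unhealthy_pods helper are made up for this example; output from --all-namespaces has an extra leading NAMESPACE column and would need different parsing.

```python
# Hypothetical sample, shaped like `kubectl get pods -n pos` output.
SAMPLE_OUTPUT = """\
NAME                 READY   STATUS             RESTARTS   AGE
pos-backend-abc123   1/1     Running            0          2d
pos-backend-def456   0/1     CrashLoopBackOff   12         2d
pos-frontend-ghi789  0/1     ImagePullBackOff   0          5m
"""


def unhealthy_pods(kubectl_output):
    """Return (name, status) for every pod that is not Running and fully ready."""
    problems = []
    for line in kubectl_output.strip().splitlines()[1:]:  # skip the header row
        name, ready, status = line.split()[:3]
        ready_count, total = ready.split("/")
        if status != "Running" or ready_count != total:
            problems.append((name, status))
    return problems


print(unhealthy_pods(SAMPLE_OUTPUT))
# [('pos-backend-def456', 'CrashLoopBackOff'), ('pos-frontend-ghi789', 'ImagePullBackOff')]
```

Note that this simple rule would also flag pods from finished Jobs (status Completed), so treat the result as a starting list to review, not a verdict.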

3. Deployment

What: Manages a set of identical pods and ensures the desired number are always running.

Analogy: A manager that ensures you always have 3 cashiers working (if one quits, hire another).

# Example: POS backend deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pos-backend
spec:
  replicas: 3  # Always run 3 copies
  selector:
    matchLabels:
      app: pos-backend
  template:
    metadata:
      labels:
        app: pos-backend
    spec:
      containers:
      - name: pos-backend
        image: yum/pos-backend:2.1

Key point for IM: If a service is "down," check if the deployment has the right number of replicas running.

4. Service

What: A stable network endpoint that load balances traffic to pods.

Analogy: A phone number for a company that routes calls to available agents.

# Example: POS backend service
apiVersion: v1
kind: Service
metadata:
  name: pos-backend-service
spec:
  selector:
    app: pos-backend
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Key point for IM: If the POS can't reach the backend, check if the service is properly routing traffic.

5. Namespace

What: Virtual clusters within a physical cluster. Isolates resources.

Analogy: Folders on a computer to organize files.

Common namespaces in Byte Edge:

default: Default namespace
pos: POS-related services
payment: Payment processing services
observability: ClickStack components (ClickHouse, HyperDX UI, OpenTelemetry Collector)
kube-system: K8s internal services (don't touch!)

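Key point for IM: Always specify the namespace when investigating. Without -n, kubectl only looks in the current context's namespace (usually default). Namespaces separate resource names and access rules; they do not block network traffic between services unless a NetworkPolicy says so.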

6. ConfigMap & Secret

What: Store configuration data and sensitive data (passwords, API keys).

ConfigMap: Plain-text configuration
Secret: Base64-encoded sensitive data

# ConfigMap example
apiVersion: v1
kind: ConfigMap
metadata:
  name: pos-config
data:
  api_endpoint: "https://cloud.yum.com/api"
  log_level: "info"

---
# Secret example
apiVersion: v1
kind: Secret
metadata:
  name: payment-secret
type: Opaque
data:
  api_key: c29tZS1zZWNyZXQta2V5  # base64 encoded

Key point for IM: Configuration changes (new ConfigMap/Secret) can cause issues. Check recent changes during incidents.
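
One practical consequence of the Secret format is worth seeing once: base64 is an encoding, not encryption, so anyone who can read the Secret can recover the value. A minimal Python sketch using the api_key value from the example above:

```python
import base64

encoded = "c29tZS1zZWNyZXQta2V5"  # api_key value from the Secret example
decoded = base64.b64decode(encoded).decode("utf-8")
print(decoded)  # some-secret-key
```

For IM this means Secret contents should never be pasted into incident channels or tickets, even in their encoded form.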

Kubernetes Architecture (Simplified)

┌──────────────────────────────────────────────────────────┐
│                    Kubernetes Cluster                    │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Control Plane (Brain)                              │  │
│  │  - API Server: Receives commands                   │  │
│  │  - Scheduler: Decides where to run pods            │  │
│  │  - Controller: Ensures desired state               │  │
│  └────────────────────────────────────────────────────┘  │
│                           │                              │
│          ┌────────────────┼────────────────┐             │
│          │                │                │             │
│  ┌───────▼──────┐ ┌───────▼──────┐ ┌───────▼──────┐      │
│  │   Node 1     │ │   Node 2     │ │   Node 3     │      │
│  │ (Worker)     │ │ (Worker)     │ │ (Worker)     │      │
│  │              │ │              │ │              │      │
│  │ [Pod] [Pod]  │ │ [Pod] [Pod]  │ │ [Pod] [Pod]  │      │
│  │ [Pod]        │ │ [Pod]        │ │              │      │
│  └──────────────┘ └──────────────┘ └──────────────┘      │
└──────────────────────────────────────────────────────────┘

For Byte Edge: Typically 1-3 nodes per restaurant (physical servers/mini-PCs).

Essential kubectl Commands for IM

kubectl is the CLI tool to interact with K8s clusters.

Check Cluster Health

# View all nodes
kubectl get nodes

# View all pods across all namespaces
kubectl get pods --all-namespaces

# View pods in specific namespace
kubectl get pods -n pos

Investigate Pod Issues

# Get detailed info about a pod
kubectl describe pod pos-backend-abc123 -n pos

# View recent events (crashes, restarts, errors)
kubectl get events -n pos --sort-by='.lastTimestamp'

# Check pod logs (most common IM command!)
kubectl logs pos-backend-abc123 -n pos

# Follow logs in real-time
kubectl logs -f pos-backend-abc123 -n pos

# View logs from previous crashed container
kubectl logs pos-backend-abc123 -n pos --previous

Check Service Health

# List services
kubectl get svc -n pos

# Describe service (see which pods it routes to)
kubectl describe svc pos-backend-service -n pos

# Check endpoints (actual pod IPs behind the service)
kubectl get endpoints pos-backend-service -n pos

Check Deployments

# View deployments
kubectl get deployments -n pos

# Check if desired replicas match actual
kubectl get deployment pos-backend -n pos
# Example output:
# NAME          READY   UP-TO-DATE   AVAILABLE   AGE
# pos-backend   2/3     3            2           10m
# ^ Red flag: Only 2/3 pods are ready!

# View deployment history (recent updates)
kubectl rollout history deployment/pos-backend -n pos

# Rollback to previous version
kubectl rollout undo deployment/pos-backend -n pos
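
The READY column in the example output above (2/3) can be read mechanically. The helper below is a hypothetical sketch that turns a READY value into the number of missing replicas; it is not part of kubectl.

```python
def replica_shortfall(ready_column):
    """Parse a READY value such as '2/3' and return how many replicas are missing."""
    ready, desired = (int(part) for part in ready_column.split("/"))
    return max(desired - ready, 0)


print(replica_shortfall("2/3"))  # 1
print(replica_shortfall("3/3"))  # 0
```

A non-zero shortfall that persists for more than a few minutes is usually worth a look at the individual pods.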

Debugging Commands

# Open a shell inside a running pod (use /bin/sh if the image has no bash)
kubectl exec -it pos-backend-abc123 -n pos -- /bin/bash

# Check resource usage (CPU/memory); requires metrics-server in the cluster
kubectl top pods -n pos

# View ConfigMaps/Secrets
kubectl get configmap -n pos
kubectl get secret -n pos

Common K8s Issues & IM Response

Issue 1: CrashLoopBackOff

What it means: Pod keeps starting and crashing repeatedly.

IM Response:

Check logs: kubectl logs <pod> -n <namespace> --previous
Describe pod: kubectl describe pod <pod> -n <namespace>
Look for: Application errors, missing dependencies, resource limits exceeded

Common causes:

Application code bug
Missing ConfigMap/Secret
Database connection failure
Out of memory (container is OOMKilled)

Issue 2: ImagePullBackOff

What it means: Can't download container image.

IM Response:

Describe pod: kubectl describe pod <pod> -n <namespace>
Check the Events section of the describe output for the image name and the pull error
Verify: Internet connectivity, image registry access, image exists

Common causes:

Internet/network outage at the edge
Container registry is down
Typo in image name/tag
Missing credentials for private registry

Issue 3: Service Unreachable

What it means: Clients can't connect to a service.

IM Response:

Check if pods are running: kubectl get pods -l app=<service> -n <namespace>
Check service endpoints: kubectl get endpoints <service> -n <namespace>
Verify pod labels match service selector

Common causes:

All pods crashed
Service selector doesn't match pod labels
Network policy blocking traffic
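
The "selector doesn't match pod labels" cause is easier to spot once the matching rule is explicit: a Service routes to a pod only when every key/value pair in its selector appears in the pod's labels. A minimal Python sketch of that rule, where the pod names and labels are invented for illustration:

```python
def selector_matches(selector, pod_labels):
    """True when every selector key/value pair is present in the pod's labels."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


service_selector = {"app": "pos-backend"}  # from the Service example earlier
pods = {
    "pos-backend-abc123": {"app": "pos-backend", "version": "2.1"},
    "pos-backend-typo": {"app": "pos-backnd"},  # label typo: will not match
    "payment-xyz": {"app": "payment-processor"},
}

routed = [name for name, labels in pods.items() if selector_matches(service_selector, labels)]
print(routed)  # ['pos-backend-abc123']
```

Extra labels on a pod (such as version here) do not prevent a match; a missing or misspelled selector label does, and the Service then shows no endpoints for that pod.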

Issue 4: High CPU/Memory

What it means: Pod consuming excessive resources.

IM Response:

Check resource usage: kubectl top pods -n <namespace>
View logs for errors: kubectl logs <pod> -n <namespace>
Check if resource limits are set: kubectl describe pod <pod> -n <namespace>

Common causes:

Memory leak
Infinite loop
High traffic/load
Resource limits too low
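
To judge whether a pod is close to its limit, compare the usage reported by kubectl top pods with the limit shown by kubectl describe pod. The sketch below converts common quantity strings and applies a threshold. The 90% threshold and the function names are assumptions made for this example, and only the Ki/Mi/Gi memory suffixes are handled.

```python
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity):
    """Convert a quantity such as '480Mi' to bytes (Ki/Mi/Gi suffixes only)."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already plain bytes


def parse_cpu(quantity):
    """Convert '250m' (millicores) or '2' (whole cores) to a number of cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000
    return float(quantity)


def memory_pressure(usage, limit, threshold=0.9):
    """True when usage is at or above `threshold` of the limit."""
    return parse_memory(usage) / parse_memory(limit) >= threshold


print(parse_cpu("250m"))                   # 0.25
print(memory_pressure("480Mi", "512Mi"))   # True  (480/512 = 0.9375)
print(memory_pressure("200Mi", "512Mi"))   # False
```

A pod that sits near its memory limit is a likely candidate for the "Out of memory" crashes described under Issue 1.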

Byte Edge K8s Architecture

Restaurant Edge Server (Byte Edge K8s Cluster)
├── Namespace: pos
│   ├── Deployment: pos-backend (3 replicas)
│   ├── Deployment: pos-frontend (2 replicas)
│   └── Service: pos-backend-service
├── Namespace: payment
│   ├── Deployment: payment-processor (2 replicas)
│   └── Service: payment-service
├── Namespace: order
│   ├── Deployment: order-service (2 replicas)
│   └── Service: order-service
├── Namespace: observability
│   ├── Deployment: hyperdx (1 replica)
│   ├── Deployment: clickhouse (1 replica)
│   └── Service: hyperdx-service
└── Namespace: kube-system
    └── K8s internal services

What You DON'T Need to Know (Yet)

As an IM responder focused on incident investigation, you can skip:

Writing Kubernetes YAML manifests from scratch
Understanding K8s networking in depth (CNI, NetworkPolicies)
Setting up K8s clusters
K8s security (RBAC, Pod Security Standards)
Advanced concepts (StatefulSets, DaemonSets, Jobs, CRDs)

Focus on: Reading pod logs, checking pod status, understanding deployments, and correlating K8s events with incidents.

Practice Scenarios

Before moving to Module 3, practice these scenarios mentally:

Scenario 1: "POS is down at Store #4523"

Your investigation flow:

kubectl get pods -n pos → Check if POS pods are running
If pods are CrashLoopBackOff → kubectl logs <pod> -n pos --previous
If pods are Running but service is down → kubectl get endpoints pos-backend-service -n pos
Check recent events → kubectl get events -n pos --sort-by='.lastTimestamp'
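
The flow above can be written down as a small decision function, which some responders find easier to remember. This is only a sketch of the steps listed in this scenario; the next_step function and its inputs are hypothetical, and real incidents rarely branch this cleanly.

```python
def next_step(pod_statuses, service_reachable):
    """Suggest the next command for the 'POS is down' flow in this scenario."""
    # Crashing pods: read the logs of the previous (crashed) container first.
    if "CrashLoopBackOff" in pod_statuses:
        return "kubectl logs <pod> -n pos --previous"
    # Pods look fine but the service is down: check what the service routes to.
    if pod_statuses and all(s == "Running" for s in pod_statuses) and not service_reachable:
        return "kubectl get endpoints pos-backend-service -n pos"
    # Anything else: fall back to recent events for clues.
    return "kubectl get events -n pos --sort-by='.lastTimestamp'"


print(next_step(["Running", "CrashLoopBackOff"], service_reachable=False))
# kubectl logs <pod> -n pos --previous
```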

Scenario 2: "Payment processing is slow"

Your investigation flow:

kubectl top pods -n payment → Check CPU/memory usage
kubectl logs <payment-pod> -n payment → Look for errors or slow queries
kubectl describe pod <payment-pod> -n payment → Check resource limits
Check if multiple replicas are healthy → kubectl get deployment payment-processor -n payment

Key Takeaways

Kubernetes = Container orchestration platform (auto-deploy, scale, heal)
Pod = Running container(s), where your app lives
Deployment = Manages multiple pods, ensures desired state
Service = Stable network endpoint, load balances to pods
kubectl = CLI tool to inspect and debug K8s
CrashLoopBackOff = Pod keeps crashing (check logs!)
Logs are your friend: kubectl logs <pod> -n <namespace>

Next Steps

✅ Complete Module 1: Edge Computing
✅ Complete Module 2: Kubernetes Overview
⬜ Read Module 3: Observability & Telemetry
⬜ Install kubectl locally (preparation for Module 5)

Estimated time to next module: 1 day (practice kubectl commands if possible)


Kubernetes for Incident Managers
