Byte Edge | Reading Module

Kubernetes for Incident Managers

Overview

Reading time: ~45 minutes


What is Kubernetes (K8s)?

Simple definition: An orchestration platform that automatically deploys, scales, and manages containerized applications.

Why "K8s"? Kubernetes → K-ubernete-s (8 letters between K and s) → K8s

Think of it as: An operating system for a cluster of computers that makes them act like one big computer.


Why Kubernetes for Byte Edge?

The Problem Without K8s

Imagine deploying to 60,000 stores manually:

  • Deploy POS backend v2.1 → SSH to each store, copy files, restart service
  • Service crashes → No automatic restart
  • Need 3 instances for high availability → Manually start 3 copies and load balance
  • Store hardware fails → Manually move workloads to backup hardware

Untenable at scale.

The Solution With K8s

  • Declare desired state: "Run 3 copies of POS backend v2.1"
  • K8s ensures it happens: Automatically deploys, monitors, restarts if crashed
  • Rolling updates: Deploy v2.2 gradually, roll back if issues appear
  • Self-healing: If a container crashes, K8s restarts it automatically
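
The "declare desired state" idea can be pictured as a comparison that runs over and over: what was asked for versus what is actually running. The sketch below is a toy illustration only. reconcile is a hypothetical helper, not a kubectl command, and real K8s controllers do far more than count copies.

```shell
# Toy sketch of a reconciliation step (hypothetical helper, not part of K8s).
# Usage: reconcile DESIRED_COPIES ACTUAL_COPIES
reconcile() {
  desired=$1
  actual=$2
  if [ "$actual" -lt "$desired" ]; then
    # Fewer copies than declared: start the missing ones
    echo "start $((desired - actual))"
  elif [ "$actual" -gt "$desired" ]; then
    # More copies than declared: stop the extras
    echo "stop $((actual - desired))"
  else
    echo "nothing to do"
  fi
}

reconcile 3 2   # one of three POS backend copies has crashed
reconcile 3 3   # steady state
```

For an incident manager the takeaway is that nobody "restarts" a crashed copy by hand. The gap between desired and actual is what triggers the restart.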

Core Kubernetes Concepts

1. Container

What: A lightweight, standalone package of software with everything it needs to run.

Analogy: A shipping container that works on any truck, ship, or train.

# Example: Container image for a POS backend
FROM node:18
COPY app.js /app/
CMD ["node", "/app/app.js"]

Key point for IM: When investigating issues, you're looking at logs from containers.


2. Pod

What: The smallest deployable unit in K8s. Usually contains one container (sometimes multiple related containers).

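Analogy: A pod is like a small logical host for your container. The containers inside it share one IP address and can share storage, but unlike a virtual machine a pod is disposable and gets replaced rather than repaired.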

# Example: POS backend pod
apiVersion: v1
kind: Pod
metadata:
  name: pos-backend-abc123
  labels:
    app: pos-backend
spec:
  containers:
  - name: pos-backend
    image: yum/pos-backend:2.1
    ports:
    - containerPort: 8080

Key point for IM: During an incident, you'll often check if pods are running or crashing.

Common pod states:

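  • Running: At least one container has started; check the READY column to confirm the pod is actually serving
  • CrashLoopBackOff: Container keeps crashing and K8s waits longer before each restart (red flag!)
  • Pending: Accepted by the cluster but not yet running (waiting to be scheduled or to pull its image)
  • ImagePullBackOff: Can't download container image (network issue, wrong image name/tag, or missing registry credentials)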
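
As a rough sketch of how these states can be triaged quickly, the helper below reads kubectl get pods output on standard input and prints only the pods that need attention. The function names and the sample pod list are hypothetical, and the sketch assumes the default single-namespace column layout (NAME, READY, STATUS, ...).

```shell
# Hypothetical sample of `kubectl get pods -n pos` output, used so the
# sketch runs without a cluster.
sample_pods() {
  cat <<'EOF'
NAME                 READY   STATUS             RESTARTS   AGE
pos-backend-abc123   1/1     Running            0          2d
pos-backend-def456   0/1     CrashLoopBackOff   12         40m
pos-backend-ghi789   0/1     Pending            0          5m
EOF
}

# Print pods that are not Running, or that are Running but not fully ready.
unhealthy_pods() {
  awk 'NR > 1 {
    split($2, ready, "/")            # READY column, e.g. "0/1"
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $3
  }'
}

sample_pods | unhealthy_pods
```

Against a real cluster the same filter would be fed with kubectl get pods -n pos | unhealthy_pods. Pods from finished Jobs (status Completed) would also be listed, so treat the output as a starting point rather than a verdict.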

3. Deployment

What: Manages a set of identical pods and ensures the desired number is always running.

Analogy: A manager that ensures you always have 3 cashiers working (if one quits, hire another).

# Example: POS backend deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pos-backend
spec:
  replicas: 3  # Always run 3 copies
  selector:
    matchLabels:
      app: pos-backend
  template:
    metadata:
      labels:
        app: pos-backend
    spec:
      containers:
      - name: pos-backend
        image: yum/pos-backend:2.1

Key point for IM: If a service is "down," check if the deployment has the right number of replicas running.


4. Service

What: A stable network endpoint that load balances traffic to pods.

Analogy: A phone number for a company that routes calls to available agents.

# Example: POS backend service
apiVersion: v1
kind: Service
metadata:
  name: pos-backend-service
spec:
  selector:
    app: pos-backend
  ports:
  - port: 80
    targetPort: 8080
  type: ClusterIP

Key point for IM: If the POS can't reach the backend, check if the service is properly routing traffic.


5. Namespace

What: Virtual clusters within a physical cluster. Isolates resources.

Analogy: Folders on a computer to organize files.

Common namespaces in Byte Edge:

  • default: Default namespace
  • pos: POS-related services
  • payment: Payment processing services
  • observability: ClickStack components (ClickHouse, HyperDX UI, OpenTelemetry Collector)
  • kube-system: K8s internal services (don't touch!)

Key point for IM: Always specify the namespace when investigating. kubectl only looks in the default namespace unless you pass -n, and the same resource name can exist in several namespaces.


6. ConfigMap & Secret

What: Store configuration data and sensitive data (passwords, API keys).

ConfigMap: Plain text configuration

Secret: Sensitive data, stored base64-encoded (an encoding, not encryption)

# ConfigMap example
apiVersion: v1
kind: ConfigMap
metadata:
  name: pos-config
data:
  api_endpoint: "https://cloud.yum.com/api"
  log_level: "info"

---
# Secret example
apiVersion: v1
kind: Secret
metadata:
  name: payment-secret
type: Opaque
data:
  api_key: c29tZS1zZWNyZXQta2V5  # base64 encoded

Key point for IM: Configuration changes (new ConfigMap/Secret) can cause issues. Check recent changes during incidents.
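
To make the base64 point concrete, the sketch below decodes the api_key value from the Secret example. decode_secret_value is a hypothetical helper name; base64 --decode is the standard coreutils tool. The point is that anyone who can read the Secret can read the value, so decoded output should be handled as sensitive.

```shell
# Decode a base64 value copied from a Secret manifest.
# decode_secret_value is a hypothetical helper for illustration.
decode_secret_value() {
  printf '%s' "$1" | base64 --decode
}

# The api_key value from the payment-secret example
decode_secret_value 'c29tZS1zZWNyZXQta2V5'
echo   # add a trailing newline for readability
```

On a live cluster the encoded value could be fetched with kubectl get secret payment-secret -n <namespace> -o jsonpath='{.data.api_key}' and piped through the same decoder. The namespace is left as a placeholder because the example manifest does not state it.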


Kubernetes Architecture (Simplified)

┌───────────────────────────────────────────────────────────┐
│                   Kubernetes Cluster                       │
│                                                            │
│  ┌────────────────────────────────────────────────────┐  │
│  │ Control Plane (Brain)                              │  │
│  │  - API Server: Receives commands                   │  │
│  │  - Scheduler: Decides where to run pods            │  │
│  │  - Controller: Ensures desired state               │  │
│  └────────────────────────────────────────────────────┘  │
│                           │                                │
│          ┌────────────────┼────────────────┐              │
│          │                │                │              │
│  ┌───────▼──────┐ ┌──────▼──────┐ ┌───────▼──────┐      │
│  │   Node 1     │ │   Node 2    │ │   Node 3     │      │
│  │ (Worker)     │ │ (Worker)    │ │ (Worker)     │      │
│  │              │ │             │ │              │      │
│  │ [Pod] [Pod]  │ │ [Pod] [Pod] │ │ [Pod] [Pod]  │      │
│  │ [Pod]        │ │ [Pod]       │ │              │      │
│  └──────────────┘ └─────────────┘ └──────────────┘      │
└───────────────────────────────────────────────────────────┘

For Byte Edge: Typically 1-3 nodes per restaurant (physical servers/mini-PCs).


Essential kubectl Commands for IM

kubectl is the CLI tool to interact with K8s clusters.

Check Cluster Health

# View all nodes
kubectl get nodes

# View all pods across all namespaces
kubectl get pods --all-namespaces

# View pods in specific namespace
kubectl get pods -n pos

Investigate Pod Issues

# Get detailed info about a pod
kubectl describe pod pos-backend-abc123 -n pos

# View recent events (crashes, restarts, errors)
kubectl get events -n pos --sort-by='.lastTimestamp'

# Check pod logs (most common IM command!)
kubectl logs pos-backend-abc123 -n pos

# Follow logs in real-time
kubectl logs -f pos-backend-abc123 -n pos

# View logs from previous crashed container
kubectl logs pos-backend-abc123 -n pos --previous

Check Service Health

# List services
kubectl get svc -n pos

# Describe service (see which pods it routes to)
kubectl describe svc pos-backend-service -n pos

# Check endpoints (actual pod IPs behind the service)
kubectl get endpoints pos-backend-service -n pos

Check Deployments

# View deployments
kubectl get deployments -n pos

# Check if desired replicas match actual
kubectl get deployment pos-backend -n pos
# Example output:
# NAME          READY   UP-TO-DATE   AVAILABLE   AGE
# pos-backend   2/3     3            2           10m
# ^ Red flag: Only 2/3 pods are ready!

# View deployment history (recent updates)
kubectl rollout history deployment/pos-backend -n pos

# Rollback to previous version
kubectl rollout undo deployment/pos-backend -n pos
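
The READY check above can be automated when a namespace has many deployments. The following is a minimal sketch, assuming the default column layout of kubectl get deployments for a single namespace. The helper names and the sample rows are hypothetical.

```shell
# Hypothetical sample of `kubectl get deployments -n pos` output.
sample_deployments() {
  cat <<'EOF'
NAME           READY   UP-TO-DATE   AVAILABLE   AGE
pos-backend    2/3     3            2           10m
pos-frontend   2/2     2            2           10m
EOF
}

# Print deployments whose ready count is below the desired count.
degraded_deployments() {
  awk 'NR > 1 {
    split($2, r, "/")                # READY column, e.g. "2/3"
    if (r[1] + 0 < r[2] + 0) print $1 ": " r[1] " of " r[2] " ready"
  }'
}

sample_deployments | degraded_deployments
```

With a cluster available, kubectl get deployments -n pos | degraded_deployments would apply the same check. With --all-namespaces the columns shift by one, so the helper would need adjusting.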

Debugging Commands

# Open a shell inside a running pod (try /bin/sh if the image has no bash)
kubectl exec -it pos-backend-abc123 -n pos -- /bin/bash

# Check resource usage (CPU/memory)
kubectl top pods -n pos

# View ConfigMaps/Secrets
kubectl get configmap -n pos
kubectl get secret -n pos

Common K8s Issues & IM Response

Issue 1: CrashLoopBackOff

What it means: Pod keeps starting and crashing repeatedly.

IM Response:

  1. Check logs: kubectl logs <pod> -n <namespace> --previous
  2. Describe pod: kubectl describe pod <pod> -n <namespace>
  3. Look for: Application errors, missing dependencies, resource limits exceeded

Common causes:

  • Application code bug
  • Missing ConfigMap/Secret
  • Database connection failure
  • Out of memory

Issue 2: ImagePullBackOff

What it means: Can't download container image.

IM Response:

  1. Describe pod: kubectl describe pod <pod> -n <namespace>
  2. Check the Events section of the output for the image name and the pull error
  3. Verify: Internet connectivity, image registry access, image exists

Common causes:

  • Internet/network outage at the edge
  • Container registry is down
  • Typo in image name/tag
  • Missing credentials for private registry

Issue 3: Service Unreachable

What it means: Clients can't connect to a service.

IM Response:

  1. Check if pods are running: kubectl get pods -l app=<service> -n <namespace>
  2. Check service endpoints: kubectl get endpoints <service> -n <namespace>
  3. Verify pod labels match service selector

Common causes:

  • All pods crashed
  • Service selector doesn't match pod labels
  • Network policy blocking traffic
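
The label check in step 3 is easy to get wrong by eye, so here is a small sketch of the comparison itself. selector_matches is a hypothetical helper that takes a service selector and a pod's labels, both as comma-separated key=value lists, which is the form shown by kubectl get svc -o wide (SELECTOR column) and kubectl get pods --show-labels (LABELS column).

```shell
# Return success if every key=value pair in SELECTOR appears in LABELS.
# Usage: selector_matches SELECTOR LABELS   (hypothetical helper)
selector_matches() {
  selector=$1
  labels=",$2,"                      # pad so each pair is comma-delimited
  old_ifs=$IFS
  IFS=,
  for pair in $selector; do
    case $labels in
      *",$pair,"*) ;;                # this pair is present, keep checking
      *) IFS=$old_ifs; return 1 ;;   # missing pair: pod is not selected
    esac
  done
  IFS=$old_ifs
  return 0
}

if selector_matches "app=pos-backend" "app=pos-backend,pod-template-hash=abc123"; then
  echo "match"
fi
if ! selector_matches "app=pos-backend" "app=pos-backnd"; then
  echo "no match: service will have no endpoints for this pod"
fi
```

A single typo in a label is enough to leave a service with zero endpoints even though every pod reports Running, which is why the endpoints check comes before the label check in the flow above.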

Issue 4: High CPU/Memory

What it means: Pod consuming excessive resources.

IM Response:

  1. Check resource usage: kubectl top pods -n <namespace>
  2. View logs for errors: kubectl logs <pod> -n <namespace>
  3. Check if resource limits are set: kubectl describe pod <pod> -n <namespace>

Common causes:

  • Memory leak
  • Infinite loop
  • High traffic/load
  • Resource limits too low
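
To spot the heavy consumers in a long kubectl top pods listing, a filter along these lines may help. It is a sketch under assumptions: CPU is reported in millicores with an "m" suffix and memory in "Mi", and the thresholds, pod names, and helper names are all hypothetical.

```shell
# Hypothetical sample of `kubectl top pods -n pos` output.
sample_top() {
  cat <<'EOF'
NAME                  CPU(cores)   MEMORY(bytes)
pos-backend-abc123    120m         210Mi
pos-backend-def456    950m         480Mi
pos-frontend-xyz789   40m          90Mi
EOF
}

# Print pods above either threshold.
# Usage: heavy_pods CPU_MILLICORES MEMORY_MI
heavy_pods() {
  awk -v cpu_limit="$1" -v mem_limit="$2" 'NR > 1 {
    cpu = $2; sub(/m$/, "", cpu)     # "950m"  -> 950
    mem = $3; sub(/Mi$/, "", mem)    # "480Mi" -> 480
    if (cpu + 0 > cpu_limit + 0 || mem + 0 > mem_limit + 0) print $1, $2, $3
  }'
}

sample_top | heavy_pods 500 400
```

The thresholds only make sense relative to the limits set on the pod, so the numbers passed in should come from the Limits section of kubectl describe pod rather than from a fixed rule of thumb.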

Byte Edge K8s Architecture

Restaurant Edge Server (Byte Edge K8s Cluster)
├── Namespace: pos
│   ├── Deployment: pos-backend (3 replicas)
│   ├── Deployment: pos-frontend (2 replicas)
│   └── Service: pos-backend-service
├── Namespace: payment
│   ├── Deployment: payment-processor (2 replicas)
│   └── Service: payment-service
├── Namespace: order
│   ├── Deployment: order-service (2 replicas)
│   └── Service: order-service
├── Namespace: observability
│   ├── Deployment: hyperdx (1 replica)
│   ├── Deployment: clickhouse (1 replica)
│   └── Service: hyperdx-service
└── Namespace: kube-system
    └── K8s internal services

What You DON'T Need to Know (Yet)

As an IM responder focused on incident investigation, you can skip:

  • Writing Kubernetes YAML manifests from scratch
  • Understanding K8s networking in depth (CNI, NetworkPolicies)
  • Setting up K8s clusters
  • K8s security (RBAC, Pod Security Standards)
  • Advanced concepts (StatefulSets, DaemonSets, Jobs, CRDs)

Focus on: Reading pod logs, checking pod status, understanding deployments, and correlating K8s events with incidents.


Practice Scenarios

Before moving to Module 3, practice these scenarios mentally:

Scenario 1: "POS is down at Store #4523"

Your investigation flow:

  1. kubectl get pods -n pos → Check if POS pods are running
  2. If pods are in CrashLoopBackOff → kubectl logs <pod> -n pos --previous
  3. If pods are Running but service is down → kubectl get endpoints pos-backend-service -n pos
  4. Check recent events → kubectl get events -n pos --sort-by='.lastTimestamp'
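
The flow above is essentially a lookup from pod status to next command. As a memory aid, it can be sketched as a small function. next_step is a hypothetical helper that only prints the command to run next and does not contact any cluster.

```shell
# Map the STATUS seen in `kubectl get pods -n pos` to the next
# investigation command from the flow above (hypothetical helper).
next_step() {
  case $1 in
    CrashLoopBackOff)
      echo "kubectl logs <pod> -n pos --previous" ;;
    Running)
      # Pods look healthy, so check whether the service routes to them
      echo "kubectl get endpoints pos-backend-service -n pos" ;;
    *)
      # Anything else: recent events usually explain the state
      echo "kubectl get events -n pos --sort-by='.lastTimestamp'" ;;
  esac
}

next_step CrashLoopBackOff
next_step Running
next_step Pending
```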

Scenario 2: "Payment processing is slow"

Your investigation flow:

  1. kubectl top pods -n payment → Check CPU/memory usage
  2. kubectl logs <payment-pod> -n payment → Look for errors or slow queries
  3. kubectl describe pod <payment-pod> -n payment → Check resource limits
  4. Check if multiple replicas are healthy → kubectl get deployment payment-processor -n payment

Key Takeaways

  1. Kubernetes = Container orchestration platform (auto-deploy, scale, heal)
  2. Pod = Running container(s), where your app lives
  3. Deployment = Manages multiple pods, ensures desired state
  4. Service = Stable network endpoint, load balances to pods
  5. kubectl = CLI tool to inspect and debug K8s
  6. CrashLoopBackOff = Pod keeps crashing (check logs!)
  7. Logs are your friend: kubectl logs <pod> -n <namespace>

Next Steps

  ✅ Complete Module 1: Edge Computing
  ✅ Complete Module 2: Kubernetes Overview
  ⬜ Read Module 3: Observability & Telemetry
  ⬜ Install kubectl locally (preparation for Module 5)

Estimated time to next module: 1 day (practice kubectl commands if possible)
