
Byte Edge | Reading Module

Hands-On Edge Exercise Playbook



Overview

Time to complete: ~4-6 hours (spread over multiple sessions)

These exercises simulate real incident scenarios you'll encounter as a Byte Edge SME. Each exercise builds on the previous one, gradually increasing in complexity.

Prerequisites:

  • Completed Module 5 (local demo environment running)
  • ClickStack UI (HyperDX) accessible at http://localhost:8080
  • Sample POS app deployed with OpenTelemetry instrumentation

Exercise 1: Basic Log Investigation

Difficulty: ⭐ Easy Time: 20 minutes

Scenario

A store manager reports: "The POS is showing errors occasionally. Can you investigate?"

Your Mission

  1. Find all ERROR logs from the past hour
  2. Count how many errors occurred
  3. Identify which service is generating errors
  4. Determine the most common error message

Steps

1. Generate Test Data

# Generate mixed traffic (80% success, 20% errors)
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
for i in {1..50}; do
  if [ $((i % 5)) -eq 0 ]; then
    curl -s http://localhost:8081/api/invalid-endpoint > /dev/null
  else
    curl -s http://localhost:8081/api/orders -d '{"order_id": "'$i'"}' > /dev/null
  fi
  sleep 0.3
done
pkill -f "port-forward"

2. Investigate in ClickStack UI (HyperDX)

  • Open ClickStack UI at http://localhost:8080
  • Navigate to Logs
  • Set time range: Last 1 hour

Try both query modes:

Lucene syntax (quick search):

severity:ERROR service:pos-backend

SQL syntax (analytical query):

SELECT
  timestamp,
  service_name,
  body,
  attributes.http_status_code
FROM logs
WHERE severity_text = 'ERROR'
  AND service_name = 'pos-backend'
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC;

Answer these questions:

  1. How many ERROR logs do you see?
  2. What is the error message?
  3. What HTTP status code is returned?
  4. Which endpoint is failing?
  5. Which query mode (Lucene vs SQL) did you find easier for this task?
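If you want to check your answer to question 2 outside the UI, the most common error message can also be pulled straight from ClickHouse. This is a sketch that assumes the same default.otel_logs table and columns used in the export step of this exercise:

```shell
# Hedged sketch: group ERROR logs by message and rank by frequency,
# assuming the default.otel_logs schema used elsewhere in this module.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     body AS error_message,
     count() AS occurrences
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   GROUP BY body
   ORDER BY occurrences DESC
   LIMIT 5
   FORMAT Pretty"
```

The top row is your "most common error message" answer; with only one failing endpoint you should see a single dominant message.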

3. Export to "Cloud"

# Query ClickHouse to export error logs
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     formatDateTime(timestamp, '%Y-%m-%d %H:%i:%s') AS time,
     service_name,
     body,
     attributes['http.status_code'] AS status_code
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND timestamp > now() - INTERVAL 1 HOUR
   ORDER BY timestamp DESC
   LIMIT 20
   FORMAT Pretty"

Expected Findings

  • ~10 ERROR logs (20% of 50 requests)
  • Error: "404 Not Found" or similar
  • Endpoint: /api/invalid-endpoint
  • Root cause: Application receiving requests to non-existent endpoint
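To verify the "~10 ERROR logs" expectation directly, a quick count query (again assuming the same default.otel_logs table as above) avoids counting rows by hand in the UI:

```shell
# Sanity check: 1 in 5 of the 50 requests hits the invalid endpoint,
# so this count should land near 10 (assumes default.otel_logs schema).
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT count() AS error_count
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND timestamp > now() - INTERVAL 1 HOUR"
```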

Exercise 2: Trace Analysis for Slow Requests

Difficulty: ⭐⭐ Medium Time: 30 minutes

Scenario

"Orders are taking longer than usual to process. Some customers are complaining about delays."

Your Mission

  1. Find slow traces (>1 second duration)
  2. Identify which service is causing the delay
  3. Determine the P99 latency
  4. Correlate slow traces with error logs

Steps

1. Generate Slow Traffic

Create generate-slow-traffic.sh:

#!/bin/bash

kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
PORT_FORWARD_PID=$!
sleep 3

for i in {1..30}; do
  # Simulate slow requests (1 in 3 is slow)
  if [ $((i % 3)) -eq 0 ]; then
    echo "Slow request $i"
    curl -s "http://localhost:8081/api/orders?delay=2000" \
      -d '{"order_id": "'$i'"}' > /dev/null &
  else
    curl -s http://localhost:8081/api/orders \
      -d '{"order_id": "'$i'"}' > /dev/null &
  fi
  sleep 0.5
done

wait
kill $PORT_FORWARD_PID
echo "Done!"

Make it executable and run it:

chmod +x generate-slow-traffic.sh
./generate-slow-traffic.sh

2. Investigate in HyperDX

A. Find Slow Traces:

  • Navigate to Traces
  • Set time range: Last 1 hour
  • Filter: service_name = pos-backend
  • Sort by: Duration (descending)

B. Analyze Trace Details:

  • Click on the slowest trace
  • View the waterfall diagram
  • Identify which span is taking the most time

C. Calculate P99 Latency: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     quantile(0.99)(duration_ns / 1000000) AS p99_latency_ms,
     quantile(0.50)(duration_ns / 1000000) AS p50_latency_ms,
     avg(duration_ns / 1000000) AS avg_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   FORMAT Pretty"

D. Correlate with Logs:

  • Copy a trace_id from a slow trace
  • Go to Logs
  • Search by trace_id = <paste_trace_id>
  • Look for WARN or ERROR logs associated with slow traces
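Picking a trace_id by hand can be tedious; a query like the following lists the slowest traces with their IDs ready to paste into the Logs search. It is a sketch assuming the same default.otel_traces columns (trace_id, span_name, duration_ns) used in the P99 query above:

```shell
# Hedged sketch: list traces slower than 1 second (1,000,000,000 ns),
# newest-slowest first, assuming the default.otel_traces schema.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     trace_id,
     span_name,
     duration_ns / 1000000 AS duration_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   AND duration_ns > 1000000000
   ORDER BY duration_ns DESC
   LIMIT 10
   FORMAT Pretty"
```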

Questions to Answer

  1. What is the P99 latency? P50 latency?
  2. What percentage of requests are slow (>1s)?
  3. Which span/operation is slow? (Hint: look for database query, external API call, etc.)
  4. Are slow traces also generating error logs?

Expected Findings

  • P99 latency: ~2000ms
  • P50 latency: ~200ms
  • ~33% of requests are slow (1 in 3)
  • Slow span: Likely a "process order" or "database query" operation
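The "~33% of requests are slow" expectation can be checked numerically rather than by eyeballing the trace list. This sketch assumes the same default.otel_traces columns as the latency queries above:

```shell
# Hedged check: fraction of traces over 1s, using countIf and alias reuse
# (both supported by ClickHouse in the same SELECT).
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     countIf(duration_ns > 1000000000) AS slow_count,
     count() AS total,
     round(slow_count / total * 100, 1) AS slow_pct
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   FORMAT Pretty"
```

With the 1-in-3 delay pattern from the traffic script, slow_pct should land near 33.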

Exercise 3: Service Dependency Analysis

Difficulty: ⭐⭐ Medium Time: 30 minutes

Scenario

"Payment processing is failing. We need to understand the flow from POS to payment service."

Your Mission

  1. Map the service dependency chain
  2. Identify where payments are failing
  3. Determine if it's a POS issue, payment service issue, or external API issue

Steps

1. Deploy Payment Service

Create payment-service.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: pos
spec:
  replicas: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: ghcr.io/open-telemetry/demo:latest
        ports:
        - containerPort: 8080
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://hyperdx-collector.observability.svc.cluster.local:4318"
        - name: OTEL_SERVICE_NAME
          value: "payment-service"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "store.id=demo-4523,environment=local"

---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: pos
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
Apply and verify:

kubectl apply -f payment-service.yaml
kubectl get pods -n pos  # Verify both pos-backend and payment-service are running

2. Generate Payment Traffic

# Port-forward payment service
kubectl port-forward -n pos svc/payment-service 8082:80 &

# Simulate payments (2 in 3 succeed, 1 in 3 fails)
for i in {1..50}; do
  if [ $((i % 3)) -eq 0 ]; then
    curl -s http://localhost:8082/api/payments/fail > /dev/null
  else
    curl -s http://localhost:8082/api/payments \
      -d '{"amount": 42.50, "order_id": "'$i'"}' > /dev/null
  fi
  sleep 0.3
done

pkill -f "port-forward"

3. Analyze Service Map

A. View Traces Across Services:

  • Go to Traces in HyperDX
  • Filter: Time range = Last 1 hour
  • Look for traces that span multiple services

B. Build Service Dependency Map: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     parent.service_name AS caller,
     child.service_name AS callee,
     count() AS call_count
   FROM default.otel_traces AS parent
   JOIN default.otel_traces AS child
     ON parent.trace_id = child.trace_id
     AND parent.span_id = child.parent_span_id
   WHERE parent.timestamp > now() - INTERVAL 1 HOUR
   GROUP BY caller, callee
   ORDER BY call_count DESC
   FORMAT Pretty"

C. Identify Failure Points:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     service_name,
     status_code,
     count() AS count
   FROM default.otel_traces
   WHERE timestamp > now() - INTERVAL 1 HOUR
   GROUP BY service_name, status_code
   ORDER BY service_name, count DESC
   FORMAT Pretty"
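The raw counts above answer question 2 indirectly; turning them into a failure-rate percentage per service makes the comparison immediate. A hedged variant, assuming the same default.otel_traces columns:

```shell
# Failure rate per service as a percentage, ranked worst-first.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     service_name,
     countIf(status_code = 'ERROR') AS errors,
     count() AS total,
     round(errors / total * 100, 1) AS error_rate_pct
   FROM default.otel_traces
   WHERE timestamp > now() - INTERVAL 1 HOUR
   GROUP BY service_name
   ORDER BY error_rate_pct DESC
   FORMAT Pretty"
```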

Questions to Answer

  1. Draw the service dependency map (pos-backend → payment-service → ???)
  2. What is the failure rate for payment-service?
  3. Which service has the highest error rate?
  4. Are failures isolated to payment-service or cascading from pos-backend?

Expected Findings

  • Dependency: pos-backend → payment-service
  • Payment service failure rate: ~33% (1 in 3)
  • Failures are isolated to payment-service (not cascading)

Exercise 4: Resource Exhaustion Incident

Difficulty: ⭐⭐⭐ Hard Time: 45 minutes

Scenario

"Store #4523 is experiencing intermittent slowness. Sometimes it's fast, sometimes it's very slow. No clear error messages."

Your Mission

  1. Identify if it's a resource issue (CPU, memory, disk)
  2. Correlate resource metrics with performance degradation
  3. Determine if it's affecting all services or just one

Steps

1. Simulate Resource Pressure

Create stress-test.sh:

#!/bin/bash

# Deploy a resource-intensive pod
kubectl run stress-test -n pos \
  --image=polinux/stress \
  --restart=Never \
  -- stress --cpu 2 --timeout 120s &

# Generate traffic during stress
sleep 5
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
PORT_FORWARD_PID=$!
sleep 3

for i in {1..60}; do
  START=$(date +%s%3N)
  curl -s http://localhost:8081/api/orders \
    -d '{"order_id": "'$i'"}' > /dev/null
  END=$(date +%s%3N)
  DURATION=$((END - START))
  echo "Request $i: ${DURATION}ms"
  sleep 1
done

kill $PORT_FORWARD_PID
kubectl delete pod stress-test -n pos

echo "Stress test complete!"

Make it executable and run it:

chmod +x stress-test.sh
./stress-test.sh

2. Investigate Resource Metrics

A. Check Pod Resource Usage:

# Real-time resource monitoring
kubectl top pods -n pos --watch

# Or, query historical data from HyperDX (if metrics are available)
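kubectl top only shows a live snapshot, which is hard to line up against the per-minute latency query in the next step. A small sampler (a sketch; it assumes metrics-server is installed, which kubectl top requires) can record usage to a CSV during the stress test:

```shell
#!/bin/bash
# Sample pod CPU/memory every 10 seconds for 2 minutes and append to a CSV,
# so resource usage can be correlated with latency afterwards.
# Assumes metrics-server is available (kubectl top depends on it).
OUT=pod-usage.csv
echo "time,pod,cpu,memory" > "$OUT"
for _ in {1..12}; do
  NOW=$(date +%H:%M:%S)
  kubectl top pods -n pos --no-headers | \
    awk -v t="$NOW" '{print t "," $1 "," $2 "," $3}' >> "$OUT"
  sleep 10
done
echo "Wrote $(wc -l < "$OUT") lines to $OUT"
```

Run it in a second terminal while stress-test.sh executes, then compare the CSV timestamps with the per-minute latency buckets.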

B. Correlate with Latency: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toStartOfMinute(timestamp) AS minute,
     avg(duration_ns / 1000000) AS avg_latency_ms,
     quantile(0.99)(duration_ns / 1000000) AS p99_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 2 HOUR
   GROUP BY minute
   ORDER BY minute ASC
   FORMAT Pretty"

C. Check for OOMKilled or CrashLoopBackOff:

kubectl get events -n pos --sort-by='.lastTimestamp' | grep -i "oom\|crash"
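Events age out quickly, so it also helps to check restart counts and why a container last exited. A hedged follow-up using standard kubectl output options:

```shell
# Restart counts per pod (custom-columns reads straight from pod status).
kubectl get pods -n pos \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# For a pod with restarts, inspect the previous container's exit reason.
# This grabs the first pod for illustration; substitute the one you care about.
POD=$(kubectl get pods -n pos -o name | head -1)
kubectl describe "$POD" -n pos | grep -A5 "Last State"
```

"Last State: Terminated / Reason: OOMKilled" here is the definitive memory-exhaustion signal.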

3. Identify Root Cause

Questions:

  1. Did CPU usage spike during the stress test?
  2. Did request latency increase during high CPU usage?
  3. Were any pods restarted or OOMKilled?
  4. What is the correlation between resource usage and performance?

Expected Findings

  • CPU usage spiked to 80-100%
  • Latency increased 2-5x during high CPU
  • No OOMKilled (unless memory limits are very low)
  • Clear correlation: High CPU → High latency

Exercise 5: Cross-Service Incident Investigation

Difficulty: ⭐⭐⭐ Hard Time: 60 minutes

Scenario

"Customers are reporting that orders are failing. The error message says 'Payment authorization failed,' but payments were working fine yesterday. What changed?"

Your Mission

  1. Investigate recent changes (deployments, config changes)
  2. Trace the full request flow: POS → Payment → External API
  3. Identify the root cause
  4. Recommend a fix

Steps

1. Simulate a Breaking Change

# Deploy a "broken" payment service configuration
kubectl set env deployment/payment-service -n pos \
  PAYMENT_API_TIMEOUT=100  # Too short, will cause timeouts

# Wait for rollout
kubectl rollout status deployment/payment-service -n pos

2. Generate Traffic

kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
for i in {1..30}; do
  curl -s http://localhost:8081/api/checkout \
    -d '{"order_id": "'$i'", "amount": 42.50}' > /dev/null
  sleep 1
done
pkill -f "port-forward"

3. Investigate

A. Check Recent Changes:

# View deployment history
kubectl rollout history deployment/payment-service -n pos

# View recent events
kubectl get events -n pos --sort-by='.lastTimestamp' | tail -20

# Check if ConfigMaps/Secrets changed
kubectl describe deployment payment-service -n pos | grep -A10 "Environment"

B. Trace Failed Requests:

  • Go to HyperDX Traces
  • Filter: status_code = ERROR and time range = Last 1 hour
  • Click on a failed trace
  • Examine each span:
      • POS backend (successful?)
      • Payment service (successful or failed?)
      • External API call (timed out?)

C. Correlate with Logs:

  • Copy trace_id from failed trace
  • Go to Logs, search by trace_id
  • Look for error messages: "timeout," "connection refused," "API error"

D. Identify the Breaking Change:

# Compare current config with previous version
kubectl get deployment payment-service -n pos -o yaml | grep -A5 "env:"

4. Fix the Issue

# Rollback to previous version
kubectl rollout undo deployment/payment-service -n pos

# Or, fix the config
kubectl set env deployment/payment-service -n pos \
  PAYMENT_API_TIMEOUT=5000  # 5 seconds (reasonable)

# Verify fix
kubectl rollout status deployment/payment-service -n pos

# Generate traffic to confirm
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
sleep 3
curl -sf http://localhost:8081/api/checkout -d '{"order_id": "test"}' && echo "✅ Success"  # -f makes curl fail on HTTP errors, so "Success" means a 2xx
pkill -f "port-forward"

Questions to Answer

  1. What was the recent change that broke payments?
  2. Which span in the trace shows the failure?
  3. What was the error message?
  4. How did you identify the root cause?
  5. What was your fix?

Expected Findings

  • Recent change: PAYMENT_API_TIMEOUT reduced from 5000ms to 100ms
  • Failed span: External API call (timeout)
  • Error: "Request timeout after 100ms"
  • Root cause: Timeout too short for external payment API
  • Fix: Increase timeout to 5000ms or rollback deployment

Exercise 6: Conditional Export Workflow

Difficulty: ⭐⭐⭐ Advanced Time: 45 minutes

Scenario

"We need to export edge telemetry to Datadog for a post-incident review. Export only ERROR logs and failed traces from the past 2 hours for the payment-service."

Your Mission

  1. Query edge telemetry (HyperDX/ClickHouse)
  2. Filter data based on criteria (severity, service, time range)
  3. Export to a format suitable for Datadog
  4. Document the export process

Steps

1. Define Export Criteria

# Export specification
time_range: last 2 hours
services: [payment-service]
log_severity: [ERROR, WARN]
traces: [status_code = ERROR]
destination: datadog (simulated as JSON files)

2. Export Logs

# Export ERROR and WARN logs from payment-service
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toUnixTimestamp(timestamp) * 1000 AS timestamp_ms,
     service_name,
     severity_text,
     body AS message,
     trace_id,
     span_id,
     attributes
   FROM default.otel_logs
   WHERE service_name = 'payment-service'
   AND severity_text IN ('ERROR', 'WARN')
   AND timestamp > now() - INTERVAL 2 HOUR
   ORDER BY timestamp DESC
   FORMAT JSONEachRow" > payment-service-logs-export.json

# View exported logs
cat payment-service-logs-export.json | jq '.' | head -50

3. Export Failed Traces

# Export traces with status = ERROR
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     trace_id,
     span_id,
     parent_span_id,
     span_name,
     service_name,
     duration_ns,
     status_code,
     attributes
   FROM default.otel_traces
   WHERE service_name = 'payment-service'
   AND status_code = 'ERROR'
   AND timestamp > now() - INTERVAL 2 HOUR
   ORDER BY timestamp DESC
   FORMAT JSONEachRow" > payment-service-traces-export.json

# View exported traces
cat payment-service-traces-export.json | jq '.' | head -50

4. Export Metrics (Aggregated)

# Export error rate and latency metrics
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toStartOfMinute(timestamp) AS minute,
     service_name,
     countIf(status_code = 'ERROR') AS error_count,
     count() AS total_count,
     (error_count / total_count) * 100 AS error_rate_pct,
     avg(duration_ns / 1000000) AS avg_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'payment-service'
   AND timestamp > now() - INTERVAL 2 HOUR
   GROUP BY minute, service_name
   ORDER BY minute ASC
   FORMAT JSONEachRow" > payment-service-metrics-export.json

cat payment-service-metrics-export.json | jq '.' | head -20

5. Create Export Summary

Create export-summary.md:

# Telemetry Export Summary

**Export Date**: 2026-02-19
**Store ID**: demo-4523
**Time Range**: Last 2 hours
**Services**: payment-service

## Export Contents

### Logs
- **File**: payment-service-logs-export.json
- **Count**: [run: `wc -l payment-service-logs-export.json`]
- **Severity**: ERROR, WARN
- **Format**: JSON (Datadog-compatible)

### Traces
- **File**: payment-service-traces-export.json
- **Count**: [run: `wc -l payment-service-traces-export.json`]
- **Status**: ERROR only
- **Format**: JSON (OpenTelemetry format)

### Metrics
- **File**: payment-service-metrics-export.json
- **Aggregation**: Per minute
- **Metrics**: error_count, error_rate_pct, avg_latency_ms

## Findings
[Document your findings here after reviewing the exported data]

## Next Steps
1. Upload to Datadog (via API or UI)
2. Share links with incident response team
3. Use for post-incident review (PIR)

Questions to Answer

  1. How many ERROR logs were exported?
  2. How many failed traces?
  3. What is the average error rate per minute?
  4. What is the total size of exported data?
  5. How would you automate this export process?
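For question 5, the manual steps above can be parameterized into a single script. This is a hedged sketch, not a production exporter: it assumes the same default.otel_logs and default.otel_traces tables and the clickhouse-0 pod used throughout this exercise, and the script name export-telemetry.sh is illustrative:

```shell
#!/bin/bash
# export-telemetry.sh (hypothetical name): parameterized version of the
# manual export steps. Usage: ./export-telemetry.sh [service] [hours]
SERVICE="${1:-payment-service}"
HOURS="${2:-2}"
STAMP=$(date +%Y%m%d-%H%M)

run_query() {
  kubectl exec -n observability clickhouse-0 -- \
    clickhouse-client --query "$1"
}

# ERROR/WARN logs for the window, one JSON object per line
run_query "SELECT * FROM default.otel_logs
           WHERE service_name = '${SERVICE}'
           AND severity_text IN ('ERROR', 'WARN')
           AND timestamp > now() - INTERVAL ${HOURS} HOUR
           FORMAT JSONEachRow" > "${SERVICE}-logs-${STAMP}.json"

# Failed traces for the same window
run_query "SELECT * FROM default.otel_traces
           WHERE service_name = '${SERVICE}'
           AND status_code = 'ERROR'
           AND timestamp > now() - INTERVAL ${HOURS} HOUR
           FORMAT JSONEachRow" > "${SERVICE}-traces-${STAMP}.json"

echo "Exported $(wc -l < "${SERVICE}-logs-${STAMP}.json") log lines and" \
     "$(wc -l < "${SERVICE}-traces-${STAMP}.json") trace rows"
```

From there, a cron job or an alert webhook could invoke it with the affected service name, turning the manual export into an on-demand workflow.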

Final Challenge: Full Incident Simulation

Difficulty: ⭐⭐⭐⭐ Expert Time: 90 minutes

Scenario

"Multiple stores are reporting issues. Some say orders are slow, others say payments are failing. Investigate and provide a root cause analysis."

Your Mission

Combine everything you've learned:

  1. Investigate logs, traces, and metrics
  2. Identify multiple failure modes
  3. Correlate issues across services
  4. Provide a timeline of events
  5. Recommend fixes
  6. Export relevant telemetry to Datadog

Steps

I'll leave this one open-ended for you to design and execute!

Hints:

  • Simulate multiple issues simultaneously (slow service + high error rate)
  • Use different services (pos-backend, payment-service)
  • Introduce resource pressure
  • Make a configuration change
  • Generate traffic with mixed success/failure rates

Deliverable: Write a mock Post-Incident Review (PIR) document:

  • Timeline: What happened and when?
  • Root cause: What caused the issue(s)?
  • Impact: How many requests failed? How long were services degraded?
  • Resolution: What fixed the issue?
  • Prevention: How to prevent this in the future?

Certification Checklist

You've completed the upskilling program when you can:

  • [ ] Explain edge computing concepts to a non-technical stakeholder
  • [ ] Navigate Kubernetes cluster using kubectl
  • [ ] Search and filter logs in HyperDX
  • [ ] Analyze distributed traces and identify bottlenecks
  • [ ] Query ClickHouse database for telemetry data
  • [ ] Correlate logs, traces, and metrics during an incident
  • [ ] Identify resource exhaustion issues (CPU, memory)
  • [ ] Trace recent changes (deployments, config updates)
  • [ ] Export edge telemetry conditionally
  • [ ] Write a basic incident summary with root cause

Next Steps: Real-World Application

Week 4 Goals:

  1. Access KFC US lab environment
     • Request access from Byte Edge engineer
     • Deploy a similar setup in the lab (HyperDX + ClickHouse)
  2. Shadow a real incident
     • Join an on-call shift (observer mode)
     • Use HyperDX to investigate real issues
     • Compare with Datadog investigation
  3. Document IM requirements
     • What telemetry does IM need for investigations?
     • What's missing from the current HyperDX deployment?
     • What export triggers make sense?
  4. Train the team
     • Share learnings with the Byte IM team
     • Demo HyperDX capabilities
     • Create an edge investigation runbook

Resources

  • Local demo: Refer back to Module 5 for setup
  • ClickHouse queries: https://clickhouse.com/docs/en/sql-reference/
  • HyperDX docs: https://www.hyperdx.io/docs
  • OpenTelemetry: https://opentelemetry.io/docs/

Congratulations on completing the Byte Edge SME upskilling program! 🎉

You're now equipped to:

  • Investigate edge incidents using HyperDX
  • Understand Kubernetes and edge architecture
  • Export telemetry conditionally to Datadog
  • Collaborate with Byte Edge team on future enhancements

Keep practicing, and reach out to Christian or the Byte Edge engineer with questions!
