
Byte Edge | Reading Module

Hands-On Edge Exercise Playbook



Overview

Time to complete: ~4-6 hours (spread over multiple sessions)

These exercises simulate real incident scenarios you'll encounter as a Byte Edge SME. Each exercise builds on the previous one, gradually increasing in complexity.

Prerequisites:

  • Completed Module 5 (local demo environment running)
  • ClickStack UI (HyperDX) accessible at http://localhost:8080
  • Sample POS app deployed with OpenTelemetry instrumentation

Exercise 1: Basic Log Investigation

Difficulty: ⭐ Easy Time: 20 minutes

Scenario

A store manager reports: "The POS is showing errors occasionally. Can you investigate?"

Your Mission

  1. Find all ERROR logs from the past hour
  2. Count how many errors occurred
  3. Identify which service is generating errors
  4. Determine the most common error message

Steps

1. Generate Test Data

# Generate mixed traffic (80% success, 20% errors)
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
for i in {1..50}; do
  if [ $((i % 5)) -eq 0 ]; then
    curl -s http://localhost:8081/api/invalid-endpoint > /dev/null
  else
    curl -s http://localhost:8081/api/orders -d '{"order_id": "'$i'"}' > /dev/null
  fi
  sleep 0.3
done
pkill -f "port-forward"

2. Investigate in ClickStack UI (HyperDX)

  • Open ClickStack UI at http://localhost:8080
  • Navigate to Logs
  • Set time range: Last 1 hour

Try both query modes:

Lucene syntax (quick search):

severity:ERROR service:pos-backend

SQL syntax (analytical query):

SELECT
  timestamp,
  service_name,
  body,
  attributes.http_status_code
FROM logs
WHERE severity_text = 'ERROR'
  AND service_name = 'pos-backend'
  AND timestamp >= now() - INTERVAL 1 HOUR
ORDER BY timestamp DESC;

Answer these questions:

  1. How many ERROR logs do you see?
  2. What is the error message?
  3. What HTTP status code is returned?
  4. Which endpoint is failing?
  5. Which query mode (Lucene vs SQL) did you find easier for this task?
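If you want to check your answer to question 2 outside the UI, the most common error message can also be pulled straight from ClickHouse. This is a sketch that assumes the same default.otel_logs table and columns used in the export step of this exercise:

```shell
# Hedged sketch: group ERROR logs by message and rank by frequency,
# assuming the default.otel_logs schema used elsewhere in this module.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     body AS error_message,
     count() AS occurrences
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   GROUP BY body
   ORDER BY occurrences DESC
   LIMIT 5
   FORMAT Pretty"
```

The top row is your "most common error message" answer; with only one failing endpoint you should see a single dominant message.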

3. Export to "Cloud"

# Query ClickHouse to export error logs
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     formatDateTime(timestamp, '%Y-%m-%d %H:%i:%s') AS time,
     service_name,
     body,
     attributes['http.status_code'] AS status_code
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND timestamp > now() - INTERVAL 1 HOUR
   ORDER BY timestamp DESC
   LIMIT 20
   FORMAT Pretty"

Expected Findings

  • ~10 ERROR logs (20% of 50 requests)
  • Error: "404 Not Found" or similar
  • Endpoint: /api/invalid-endpoint
  • Root cause: Application receiving requests to non-existent endpoint
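To verify the "~10 ERROR logs" expectation directly, a quick count query (again assuming the same default.otel_logs table as above) avoids counting rows by hand in the UI:

```shell
# Sanity check: 1 in 5 of the 50 requests hits the invalid endpoint,
# so this count should land near 10 (assumes default.otel_logs schema).
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT count() AS error_count
   FROM default.otel_logs
   WHERE severity_text = 'ERROR'
   AND timestamp > now() - INTERVAL 1 HOUR"
```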

Exercise 2: Trace Analysis for Slow Requests

Difficulty: ⭐⭐ Medium Time: 30 minutes

Scenario

"Orders are taking longer than usual to process. Some customers are complaining about delays."

Your Mission

  1. Find slow traces (>1 second duration)
  2. Identify which service is causing the delay
  3. Determine the P99 latency
  4. Correlate slow traces with error logs

Steps

1. Generate Slow Traffic

Create generate-slow-traffic.sh:

#!/bin/bash

kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
PORT_FORWARD_PID=$!
sleep 3

for i in {1..30}; do
  # Simulate slow requests (1 in 3 is slow)
  if [ $((i % 3)) -eq 0 ]; then
    echo "Slow request $i"
    curl -s "http://localhost:8081/api/orders?delay=2000" \
      -d '{"order_id": "'$i'"}' > /dev/null &
  else
    curl -s http://localhost:8081/api/orders \
      -d '{"order_id": "'$i'"}' > /dev/null &
  fi
  sleep 0.5
done

wait
kill $PORT_FORWARD_PID
echo "Done!"

Make it executable and run it:

chmod +x generate-slow-traffic.sh
./generate-slow-traffic.sh

2. Investigate in HyperDX

A. Find Slow Traces:

  • Navigate to Traces
  • Set time range: Last 1 hour
  • Filter: service_name = pos-backend
  • Sort by: Duration (descending)

B. Analyze Trace Details:

  • Click on the slowest trace
  • View the waterfall diagram
  • Identify which span is taking the most time

C. Calculate P99 Latency: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     quantile(0.99)(duration_ns / 1000000) AS p99_latency_ms,
     quantile(0.50)(duration_ns / 1000000) AS p50_latency_ms,
     avg(duration_ns / 1000000) AS avg_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   FORMAT Pretty"

D. Correlate with Logs:

  • Copy a trace_id from a slow trace
  • Go to Logs
  • Search by trace_id = <paste_trace_id>
  • Look for WARN or ERROR logs associated with slow traces
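Picking a trace_id by hand can be tedious; a query like the following lists the slowest traces with their IDs ready to paste into the Logs search. It is a sketch assuming the same default.otel_traces columns (trace_id, span_name, duration_ns) used in the P99 query above:

```shell
# Hedged sketch: list traces slower than 1 second (1,000,000,000 ns),
# newest-slowest first, assuming the default.otel_traces schema.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     trace_id,
     span_name,
     duration_ns / 1000000 AS duration_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   AND duration_ns > 1000000000
   ORDER BY duration_ns DESC
   LIMIT 10
   FORMAT Pretty"
```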

Questions to Answer

  1. What is the P99 latency? P50 latency?
  2. What percentage of requests are slow (>1s)?
  3. Which span/operation is slow? (Hint: look for database query, external API call, etc.)
  4. Are slow traces also generating error logs?

Expected Findings

  • P99 latency: ~2000ms
  • P50 latency: ~200ms
  • ~33% of requests are slow (1 in 3)
  • Slow span: Likely a "process order" or "database query" operation
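The "~33% of requests are slow" expectation can be checked numerically rather than by eyeballing the trace list. This sketch assumes the same default.otel_traces columns as the latency queries above:

```shell
# Hedged check: fraction of traces over 1s, using countIf and alias reuse
# (both supported by ClickHouse in the same SELECT).
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     countIf(duration_ns > 1000000000) AS slow_count,
     count() AS total,
     round(slow_count / total * 100, 1) AS slow_pct
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 1 HOUR
   FORMAT Pretty"
```

With the 1-in-3 delay pattern from the traffic script, slow_pct should land near 33.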

Exercise 3: Service Dependency Analysis

Difficulty: ⭐⭐ Medium Time: 30 minutes

Scenario

"Payment processing is failing. We need to understand the flow from POS to payment service."

Your Mission

  1. Map the service dependency chain
  2. Identify where payments are failing
  3. Determine if it's a POS issue, payment service issue, or external API issue

Steps

1. Deploy Payment Service

Create payment-service.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
  namespace: pos
spec:
  replicas: 1
  selector:
    matchLabels:
      app: payment-service
  template:
    metadata:
      labels:
        app: payment-service
    spec:
      containers:
      - name: payment-service
        image: ghcr.io/open-telemetry/demo:latest
        ports:
        - containerPort: 8080
        env:
        - name: OTEL_EXPORTER_OTLP_ENDPOINT
          value: "http://hyperdx-collector.observability.svc.cluster.local:4318"
        - name: OTEL_SERVICE_NAME
          value: "payment-service"
        - name: OTEL_RESOURCE_ATTRIBUTES
          value: "store.id=demo-4523,environment=local"

---
apiVersion: v1
kind: Service
metadata:
  name: payment-service
  namespace: pos
spec:
  selector:
    app: payment-service
  ports:
  - port: 80
    targetPort: 8080
Apply and verify:

kubectl apply -f payment-service.yaml
kubectl get pods -n pos  # Verify both pos-backend and payment-service are running

2. Generate Payment Traffic

# Port-forward payment service
kubectl port-forward -n pos svc/payment-service 8082:80 &

# Simulate payments (2 in 3 succeed, 1 in 3 fails)
for i in {1..50}; do
  if [ $((i % 3)) -eq 0 ]; then
    curl -s http://localhost:8082/api/payments/fail > /dev/null
  else
    curl -s http://localhost:8082/api/payments \
      -d '{"amount": 42.50, "order_id": "'$i'"}' > /dev/null
  fi
  sleep 0.3
done

pkill -f "port-forward"

3. Analyze Service Map

A. View Traces Across Services:

  • Go to Traces in HyperDX
  • Filter: Time range = Last 1 hour
  • Look for traces that span multiple services

B. Build Service Dependency Map: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     parent.service_name AS caller,
     child.service_name AS callee,
     count() AS call_count
   FROM default.otel_traces AS parent
   JOIN default.otel_traces AS child
     ON parent.trace_id = child.trace_id
     AND parent.span_id = child.parent_span_id
   WHERE parent.timestamp > now() - INTERVAL 1 HOUR
   GROUP BY caller, callee
   ORDER BY call_count DESC
   FORMAT Pretty"

C. Identify Failure Points:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     service_name,
     status_code,
     count() AS count
   FROM default.otel_traces
   WHERE timestamp > now() - INTERVAL 1 HOUR
   GROUP BY service_name, status_code
   ORDER BY service_name, count DESC
   FORMAT Pretty"
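The raw counts above answer question 2 indirectly; turning them into a failure-rate percentage per service makes the comparison immediate. A hedged variant, assuming the same default.otel_traces columns:

```shell
# Failure rate per service as a percentage, ranked worst-first.
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     service_name,
     countIf(status_code = 'ERROR') AS errors,
     count() AS total,
     round(errors / total * 100, 1) AS error_rate_pct
   FROM default.otel_traces
   WHERE timestamp > now() - INTERVAL 1 HOUR
   GROUP BY service_name
   ORDER BY error_rate_pct DESC
   FORMAT Pretty"
```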

Questions to Answer

  1. Draw the service dependency map (pos-backend → payment-service → ???)
  2. What is the failure rate for payment-service?
  3. Which service has the highest error rate?
  4. Are failures isolated to payment-service or cascading from pos-backend?

Expected Findings

  • Dependency: pos-backend → payment-service
  • Payment service failure rate: ~33% (1 in 3)
  • Failures are isolated to payment-service (not cascading)

Exercise 4: Resource Exhaustion Incident

Difficulty: ⭐⭐⭐ Hard Time: 45 minutes

Scenario

"Store #4523 is experiencing intermittent slowness. Sometimes it's fast, sometimes it's very slow. No clear error messages."

Your Mission

  1. Identify if it's a resource issue (CPU, memory, disk)
  2. Correlate resource metrics with performance degradation
  3. Determine if it's affecting all services or just one

Steps

1. Simulate Resource Pressure

Create stress-test.sh:

#!/bin/bash

# Deploy a resource-intensive pod
kubectl run stress-test -n pos \
  --image=polinux/stress \
  --restart=Never \
  -- stress --cpu 2 --timeout 120s &

# Generate traffic during stress
sleep 5
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
PORT_FORWARD_PID=$!
sleep 3

for i in {1..60}; do
  START=$(date +%s%3N)
  curl -s http://localhost:8081/api/orders \
    -d '{"order_id": "'$i'"}' > /dev/null
  END=$(date +%s%3N)
  DURATION=$((END - START))
  echo "Request $i: ${DURATION}ms"
  sleep 1
done

kill $PORT_FORWARD_PID
kubectl delete pod stress-test -n pos

echo "Stress test complete!"

Make it executable and run it:

chmod +x stress-test.sh
./stress-test.sh

2. Investigate Resource Metrics

A. Check Pod Resource Usage:

# Real-time resource monitoring
kubectl top pods -n pos --watch

# Or, query historical data from HyperDX (if metrics are available)
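kubectl top only shows a live snapshot, which is hard to line up against the per-minute latency query in the next step. A small sampler (a sketch; it assumes metrics-server is installed, which kubectl top requires) can record usage to a CSV during the stress test:

```shell
#!/bin/bash
# Sample pod CPU/memory every 10 seconds for 2 minutes and append to a CSV,
# so resource usage can be correlated with latency afterwards.
# Assumes metrics-server is available (kubectl top depends on it).
OUT=pod-usage.csv
echo "time,pod,cpu,memory" > "$OUT"
for _ in {1..12}; do
  NOW=$(date +%H:%M:%S)
  kubectl top pods -n pos --no-headers | \
    awk -v t="$NOW" '{print t "," $1 "," $2 "," $3}' >> "$OUT"
  sleep 10
done
echo "Wrote $(wc -l < "$OUT") lines to $OUT"
```

Run it in a second terminal while stress-test.sh executes, then compare the CSV timestamps with the per-minute latency buckets.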

B. Correlate with Latency: Query ClickHouse:

kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toStartOfMinute(timestamp) AS minute,
     avg(duration_ns / 1000000) AS avg_latency_ms,
     quantile(0.99)(duration_ns / 1000000) AS p99_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'pos-backend'
   AND timestamp > now() - INTERVAL 2 HOUR
   GROUP BY minute
   ORDER BY minute ASC
   FORMAT Pretty"

C. Check for OOMKilled or CrashLoopBackOff:

kubectl get events -n pos --sort-by='.lastTimestamp' | grep -i "oom\|crash"
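Events age out quickly, so it also helps to check restart counts and why a container last exited. A hedged follow-up using standard kubectl output options:

```shell
# Restart counts per pod (custom-columns reads straight from pod status).
kubectl get pods -n pos \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount

# For a pod with restarts, inspect the previous container's exit reason.
# This grabs the first pod for illustration; substitute the one you care about.
POD=$(kubectl get pods -n pos -o name | head -1)
kubectl describe "$POD" -n pos | grep -A5 "Last State"
```

"Last State: Terminated / Reason: OOMKilled" here is the definitive memory-exhaustion signal.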

3. Identify Root Cause

Questions:

  1. Did CPU usage spike during the stress test?
  2. Did request latency increase during high CPU usage?
  3. Were any pods restarted or OOMKilled?
  4. What is the correlation between resource usage and performance?

Expected Findings

  • CPU usage spiked to 80-100%
  • Latency increased 2-5x during high CPU
  • No OOMKilled (unless memory limits are very low)
  • Clear correlation: High CPU → High latency

Exercise 5: Cross-Service Incident Investigation

Difficulty: ⭐⭐⭐ Hard Time: 60 minutes

Scenario

"Customers are reporting that orders are failing. The error message says 'Payment authorization failed,' but payments were working fine yesterday. What changed?"

Your Mission

  1. Investigate recent changes (deployments, config changes)
  2. Trace the full request flow: POS → Payment → External API
  3. Identify the root cause
  4. Recommend a fix

Steps

1. Simulate a Breaking Change

# Deploy a "broken" payment service configuration
kubectl set env deployment/payment-service -n pos \
  PAYMENT_API_TIMEOUT=100  # Too short, will cause timeouts

# Wait for rollout
kubectl rollout status deployment/payment-service -n pos

2. Generate Traffic

kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
for i in {1..30}; do
  curl -s http://localhost:8081/api/checkout \
    -d '{"order_id": "'$i'", "amount": 42.50}' > /dev/null
  sleep 1
done
pkill -f "port-forward"

3. Investigate

A. Check Recent Changes:

# View deployment history
kubectl rollout history deployment/payment-service -n pos

# View recent events
kubectl get events -n pos --sort-by='.lastTimestamp' | tail -20

# Check if ConfigMaps/Secrets changed
kubectl describe deployment payment-service -n pos | grep -A10 "Environment"

B. Trace Failed Requests:

  • Go to HyperDX Traces
  • Filter: status_code = ERROR and time range = Last 1 hour
  • Click on a failed trace
  • Examine each span:
      • POS backend (successful?)
      • Payment service (successful or failed?)
      • External API call (timed out?)

C. Correlate with Logs:

  • Copy trace_id from failed trace
  • Go to Logs, search by trace_id
  • Look for error messages: "timeout," "connection refused," "API error"

D. Identify the Breaking Change:

# Compare current config with previous version
kubectl get deployment payment-service -n pos -o yaml | grep -A5 "env:"

4. Fix the Issue

# Rollback to previous version
kubectl rollout undo deployment/payment-service -n pos

# Or, fix the config
kubectl set env deployment/payment-service -n pos \
  PAYMENT_API_TIMEOUT=5000  # 5 seconds (reasonable)

# Verify fix
kubectl rollout status deployment/payment-service -n pos

# Generate traffic to confirm
kubectl port-forward -n pos svc/pos-backend-service 8081:80 &
sleep 3
curl -sf http://localhost:8081/api/checkout -d '{"order_id": "test"}' && echo "✅ Success"  # -f makes curl fail on HTTP errors, so "Success" means a 2xx
pkill -f "port-forward"

Questions to Answer

  1. What was the recent change that broke payments?
  2. Which span in the trace shows the failure?
  3. What was the error message?
  4. How did you identify the root cause?
  5. What was your fix?

Expected Findings

  • Recent change: PAYMENT_API_TIMEOUT reduced from 5000ms to 100ms
  • Failed span: External API call (timeout)
  • Error: "Request timeout after 100ms"
  • Root cause: Timeout too short for external payment API
  • Fix: Increase timeout to 5000ms or rollback deployment

Exercise 6: Conditional Export Workflow

Difficulty: ⭐⭐⭐ Advanced Time: 45 minutes

Scenario

"We need to export edge telemetry to Datadog for a post-incident review. Export only ERROR logs and failed traces from the past 2 hours for the payment-service."

Your Mission

  1. Query edge telemetry (HyperDX/ClickHouse)
  2. Filter data based on criteria (severity, service, time range)
  3. Export to a format suitable for Datadog
  4. Document the export process

Steps

1. Define Export Criteria

# Export specification
time_range: last 2 hours
services: [payment-service]
log_severity: [ERROR, WARN]
traces: [status_code = ERROR]
destination: datadog (simulated as JSON files)

2. Export Logs

# Export ERROR and WARN logs from payment-service
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toUnixTimestamp(timestamp) * 1000 AS timestamp_ms,
     service_name,
     severity_text,
     body AS message,
     trace_id,
     span_id,
     attributes
   FROM default.otel_logs
   WHERE service_name = 'payment-service'
   AND severity_text IN ('ERROR', 'WARN')
   AND timestamp > now() - INTERVAL 2 HOUR
   ORDER BY timestamp DESC
   FORMAT JSONEachRow" > payment-service-logs-export.json

# View exported logs
cat payment-service-logs-export.json | jq '.' | head -50

3. Export Failed Traces

# Export traces with status = ERROR
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     trace_id,
     span_id,
     parent_span_id,
     span_name,
     service_name,
     duration_ns,
     status_code,
     attributes
   FROM default.otel_traces
   WHERE service_name = 'payment-service'
   AND status_code = 'ERROR'
   AND timestamp > now() - INTERVAL 2 HOUR
   ORDER BY timestamp DESC
   FORMAT JSONEachRow" > payment-service-traces-export.json

# View exported traces
cat payment-service-traces-export.json | jq '.' | head -50

4. Export Metrics (Aggregated)

# Export error rate and latency metrics
kubectl exec -it -n observability clickhouse-0 -- \
  clickhouse-client --query \
  "SELECT
     toStartOfMinute(timestamp) AS minute,
     service_name,
     countIf(status_code = 'ERROR') AS error_count,
     count() AS total_count,
     (error_count / total_count) * 100 AS error_rate_pct,
     avg(duration_ns / 1000000) AS avg_latency_ms
   FROM default.otel_traces
   WHERE service_name = 'payment-service'
   AND timestamp > now() - INTERVAL 2 HOUR
   GROUP BY minute, service_name
   ORDER BY minute ASC
   FORMAT JSONEachRow" > payment-service-metrics-export.json

cat payment-service-metrics-export.json | jq '.' | head -20

5. Create Export Summary

Create export-summary.md:

# Telemetry Export Summary

**Export Date**: 2026-02-19
**Store ID**: demo-4523
**Time Range**: Last 2 hours
**Services**: payment-service

## Export Contents

### Logs
- **File**: payment-service-logs-export.json
- **Count**: [run: `wc -l payment-service-logs-export.json`]
- **Severity**: ERROR, WARN
- **Format**: JSON (Datadog-compatible)

### Traces
- **File**: payment-service-traces-export.json
- **Count**: [run: `wc -l payment-service-traces-export.json`]
- **Status**: ERROR only
- **Format**: JSON (OpenTelemetry format)

### Metrics
- **File**: payment-service-metrics-export.json
- **Aggregation**: Per minute
- **Metrics**: error_count, error_rate_pct, avg_latency_ms

## Findings
[Document your findings here after reviewing the exported data]

## Next Steps
1. Upload to Datadog (via API or UI)
2. Share links with incident response team
3. Use for post-incident review (PIR)

Questions to Answer

  1. How many ERROR logs were exported?
  2. How many failed traces?
  3. What is the average error rate per minute?
  4. What is the total size of exported data?
  5. How would you automate this export process?
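For question 5, the manual steps above can be parameterized into a single script. This is a hedged sketch, not a production exporter: it assumes the same default.otel_logs and default.otel_traces tables and the clickhouse-0 pod used throughout this exercise, and the script name export-telemetry.sh is illustrative:

```shell
#!/bin/bash
# export-telemetry.sh (hypothetical name): parameterized version of the
# manual export steps. Usage: ./export-telemetry.sh [service] [hours]
SERVICE="${1:-payment-service}"
HOURS="${2:-2}"
STAMP=$(date +%Y%m%d-%H%M)

run_query() {
  kubectl exec -n observability clickhouse-0 -- \
    clickhouse-client --query "$1"
}

# ERROR/WARN logs for the window, one JSON object per line
run_query "SELECT * FROM default.otel_logs
           WHERE service_name = '${SERVICE}'
           AND severity_text IN ('ERROR', 'WARN')
           AND timestamp > now() - INTERVAL ${HOURS} HOUR
           FORMAT JSONEachRow" > "${SERVICE}-logs-${STAMP}.json"

# Failed traces for the same window
run_query "SELECT * FROM default.otel_traces
           WHERE service_name = '${SERVICE}'
           AND status_code = 'ERROR'
           AND timestamp > now() - INTERVAL ${HOURS} HOUR
           FORMAT JSONEachRow" > "${SERVICE}-traces-${STAMP}.json"

echo "Exported $(wc -l < "${SERVICE}-logs-${STAMP}.json") log lines and" \
     "$(wc -l < "${SERVICE}-traces-${STAMP}.json") trace rows"
```

From there, a cron job or an alert webhook could invoke it with the affected service name, turning the manual export into an on-demand workflow.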

Final Challenge: Full Incident Simulation

Difficulty: ⭐⭐⭐⭐ Expert Time: 90 minutes

Scenario

"Multiple stores are reporting issues. Some say orders are slow, others say payments are failing. Investigate and provide a root cause analysis."

Your Mission

Combine everything you've learned:

  1. Investigate logs, traces, and metrics
  2. Identify multiple failure modes
  3. Correlate issues across services
  4. Provide a timeline of events
  5. Recommend fixes
  6. Export relevant telemetry to Datadog

Steps

I'll leave this one open-ended for you to design and execute!

Hints:

  • Simulate multiple issues simultaneously (slow service + high error rate)
  • Use different services (pos-backend, payment-service)
  • Introduce resource pressure
  • Make a configuration change
  • Generate traffic with mixed success/failure rates

Deliverable: Write a mock Post-Incident Review (PIR) document:

  • Timeline: What happened and when?
  • Root cause: What caused the issue(s)?
  • Impact: How many requests failed? How long were services degraded?
  • Resolution: What fixed the issue?
  • Prevention: How to prevent this in the future?

Certification Checklist

You've completed the upskilling program when you can:

  • [ ] Explain edge computing concepts to a non-technical stakeholder
  • [ ] Navigate Kubernetes cluster using kubectl
  • [ ] Search and filter logs in HyperDX
  • [ ] Analyze distributed traces and identify bottlenecks
  • [ ] Query ClickHouse database for telemetry data
  • [ ] Correlate logs, traces, and metrics during an incident
  • [ ] Identify resource exhaustion issues (CPU, memory)
  • [ ] Trace recent changes (deployments, config updates)
  • [ ] Export edge telemetry conditionally
  • [ ] Write a basic incident summary with root cause

Next Steps: Real-World Application

Week 4 Goals:

  1. Access KFC US lab environment
     • Request access from Byte Edge engineer
     • Deploy a similar setup in the lab (HyperDX + ClickHouse)
  2. Shadow a real incident
     • Join an on-call shift (observer mode)
     • Use HyperDX to investigate real issues
     • Compare with Datadog investigation
  3. Document IM requirements
     • What telemetry does IM need for investigations?
     • What's missing from the current HyperDX deployment?
     • What export triggers make sense?
  4. Train the team
     • Share learnings with the Byte IM team
     • Demo HyperDX capabilities
     • Create an edge investigation runbook

Resources

  • Local demo: Refer back to Module 5 for setup
  • ClickHouse queries: https://clickhouse.com/docs/en/sql-reference/
  • HyperDX docs: https://www.hyperdx.io/docs
  • OpenTelemetry: https://opentelemetry.io/docs/

Congratulations on completing the Byte Edge SME upskilling program! 🎉

You're now equipped to:

  • Investigate edge incidents using HyperDX
  • Understand Kubernetes and edge architecture
  • Export telemetry conditionally to Datadog
  • Collaborate with Byte Edge team on future enhancements

Keep practicing, and reach out to Christian or the Byte Edge engineer with questions!
