Byte Edge | Reading Module

Edge Telemetry Fundamentals

Learning Guidance

Objectives

  • Define observability and its three pillars: logs, metrics, and traces
  • Explain why edge environments break the assumptions of cloud-only telemetry (connectivity, bandwidth, cost, latency)
  • Describe the ClickStack edge architecture (ClickHouse + HyperDX + OpenTelemetry Collector) and how it complements Datadog
  • Apply the "flight recorder" pattern: store verbose telemetry locally, export to the cloud conditionally

Overview

Reading time: ~40 minutes


What is Observability?

Simple definition: The ability to understand what's happening inside a system by examining its outputs.

Analogy: Your car dashboard

  • Speedometer (metric): Current speed
  • Check engine light (alert): Something is wrong
  • OBD scanner (logs): Detailed error codes and diagnostics

In software: Observability helps answer "Why is the system behaving this way?"


The Three Pillars of Observability

1. Logs

What: Time-stamped records of discrete events.

Example:

2026-02-19T14:32:15Z [INFO] Order #12345 received
2026-02-19T14:32:16Z [INFO] Payment authorized: $42.50
2026-02-19T14:32:17Z [ERROR] Failed to send order to kitchen: Connection timeout
2026-02-19T14:32:18Z [WARN] Retrying order submission (attempt 2/3)

When to use: Debugging specific events, tracing user actions, root cause analysis

IM use case: "What happened when Store #4523 reported order failures at 2:30 PM?"


2. Metrics

What: Numerical measurements aggregated over time.

Example:

http_requests_total{service="pos-backend", status="200"} 45234
http_requests_total{service="pos-backend", status="500"} 12
cpu_usage_percent{host="store-4523"} 87.3
memory_usage_mb{service="payment-processor"} 2048
order_processing_duration_ms{quantile="0.99"} 450

When to use: Tracking trends, setting alerts, capacity planning, performance monitoring

IM use case: "Is the error rate spiking? Is CPU/memory exhausted?"
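Counters like `http_requests_total` above are incremented with labels and rendered in the Prometheus exposition format. A minimal pure-Python sketch of that mechanic (no real metrics library is assumed; label names mirror the example):

```python
from collections import Counter

class CounterMetric:
    """Minimal labeled counter rendered in Prometheus exposition format."""
    def __init__(self, name):
        self.name = name
        self.values = Counter()

    def inc(self, **labels):
        # Store labels as a sorted, hashable key so identical label sets merge.
        self.values[tuple(sorted(labels.items()))] += 1

    def render(self):
        lines = []
        for key, value in sorted(self.values.items()):
            labels = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)

requests = CounterMetric("http_requests_total")
for _ in range(3):
    requests.inc(service="pos-backend", status="200")
requests.inc(service="pos-backend", status="500")
print(requests.render())
```

In production this is what libraries such as the Prometheus client do for you; the point here is that a metric is just a name plus a label set plus a number, aggregated over time.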


3. Traces

What: End-to-end journey of a request through multiple services.

Example:

Trace ID: 7d5d747b-e160-e280-5049-099d984bcfe0

1. [pos-frontend] HTTP POST /order (10ms)
   └─> 2. [pos-backend] Process order (150ms)
       ├─> 3. [payment-service] Authorize payment (300ms)
       │   └─> 4. [cybersource-api] External API call (280ms) ⚠️ SLOW
       └─> 5. [order-service] Create order (50ms)
           └─> 6. [kitchen-display] Send to kitchen (20ms)

Total: 530ms (slow because Cybersource took 280ms)

When to use: Identifying bottlenecks, understanding service dependencies, diagnosing latency

IM use case: "Why is order submission taking 30 seconds? Which service is the bottleneck?"
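The span tree above can be sketched with a few lines of Python: each span records its parent, its duration, and a shared trace ID. This is a toy illustration of the concept, not the OpenTelemetry API; service names are taken from the example.

```python
import time
import uuid
from contextlib import contextmanager

spans = []                      # finished spans, appended as they close
trace_id = str(uuid.uuid4())    # one trace ID shared by every span
_stack = []                     # currently open spans (parent tracking)

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("pos-backend: process order"):
    with span("payment-service: authorize"):
        time.sleep(0.01)  # stand-in for the slow external Cybersource call
    with span("order-service: create order"):
        pass

for s in spans:
    print(f'{s["name"]} (parent={s["parent"]}) {s["duration_ms"]:.1f}ms')
```

Because every span carries the same `trace_id` and a parent pointer, a backend can reassemble the tree and show exactly where the time went.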


Why Edge Telemetry is Different

Traditional Cloud Observability

Setup: All services stream logs/metrics/traces to centralized Datadog.

┌────────────┐
│  Service A │──┐
└────────────┘  │
                │
┌────────────┐  │    ┌──────────────┐
│  Service B │──┼───>│   Datadog    │
└────────────┘  │    │  (Cloud SaaS)│
                │    └──────────────┘
┌────────────┐  │
│  Service C │──┘
└────────────┘

Assumptions:

  • Always-on internet connectivity ✅
  • Centralized infrastructure ✅
  • Cost scales with data volume (acceptable for cloud services) ✅

Edge Observability Challenges

Problem 1: Intermittent Connectivity

  • Restaurant internet goes down (storm, ISP outage, construction)
  • Can't stream telemetry to Datadog
  • Solution: Store telemetry locally at the edge

Problem 2: Bandwidth Costs

  • 60,000 stores streaming logs 24/7 = massive bandwidth
  • Especially problematic in international markets (Australia, India, etc.)
  • Solution: Store locally, export only when needed

Problem 3: Data Volume & Cost

  • Edge services generate high-volume logs (every POS transaction, every button click)
  • Sending all data to Datadog = $$$$ (could be 10x-100x current spend)
  • Solution: Store locally with longer retention, export selectively

Problem 4: Latency

  • Investigating an issue at Store #4523
  • Querying Datadog involves: Edge → Cloud → Datadog → Cloud → Edge (round-trip)
  • Solution: Query telemetry directly at the edge (sub-second response)

Edge Telemetry Architecture

High-Level Design

┌─────────────────────────────────────────────────────────┐
│         Cloud (AWS)                                      │
│  ┌──────────────┐         ┌──────────────┐             │
│  │   Datadog    │◄────────│  Export Job  │             │
│  │  (selective) │         │ (conditional)│             │
│  └──────────────┘         └──────┬───────┘             │
└─────────────────────────────────┼─────────────────────┘
                                   │ (on-demand export)
                                   │
┌──────────────────────────────────▼─────────────────────┐
│  Restaurant Edge Server (Store #4523)                   │
│                                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Applications (generate telemetry)               │  │
│  │  - pos-backend                                   │  │
│  │  - payment-service                               │  │
│  │  - order-service                                 │  │
│  └───┬──────────────────────────────────────────┬───┘  │
│      │ (logs, metrics, traces)                  │      │
│      │                                           │      │
│  ┌───▼────────────────────────────────────────────────┐ │
│  │              ClickStack Platform                   │ │
│  │                                                     │ │
│  │  ┌─────────────┐   ┌──────────────┐   ┌────────┐ │ │
│  │  │  HyperDX    │◄──│  ClickHouse  │◄──│  OTLP  │ │ │
│  │  │  (UI/API)   │   │  (Storage)   │   │Collector│ │ │
│  │  │             │   │              │   │         │ │ │
│  │  │ - SQL query │   │ - Logs table │   │ - OTLP  │ │ │
│  │  │ - Lucene    │   │ - Traces tbl │   │ - Enrich│ │ │
│  │  │ - Unified   │   │ - Metrics tbl│   │ - Route │ │ │
│  │  └─────────────┘   └──────────────┘   └────────┘ │ │
│  └─────────────────────────────────────────────────┘ │
│                                                          │
│  [Local Storage: 30-90 days retention]                  │
└──────────────────────────────────────────────────────────┘

Key Components

  1. Applications: Generate telemetry (logs, metrics, traces) using OpenTelemetry SDKs
  2. ClickStack: Complete observability platform consisting of:
     • ClickHouse: Columnar database with native JSON support for high-performance storage
     • HyperDX: Unified UI/API layer with dual query syntax (SQL + Lucene)
     • OpenTelemetry Collector: Standard OTLP ingestion pipeline
  3. Export Job: Conditional logic to send edge telemetry to Datadog when needed

ClickStack vs Datadog

| Feature | Datadog (Cloud) | ClickStack (Edge) |
| --- | --- | --- |
| Deployment | SaaS (cloud-hosted) | Self-hosted (on edge server) |
| Data storage | Datadog's infrastructure | ClickHouse (local, native JSON) |
| Retention | 15-30 days (expensive for more) | 30-90 days (cost = local disk) |
| Access | Internet required | Local network (works offline) |
| Query speed | 1-5 seconds | Sub-second (local queries) |
| Query syntax | Proprietary | SQL + Lucene (dual mode) |
| Cost model | Per GB ingested + retention | Fixed (hardware cost only) |
| High cardinality | Limited by pricing | Billions of labels supported |
| Use case | Global overview, trends | Store-specific deep dives |

Key insight: They complement each other!

  • Datadog: Fleet-wide metrics, cross-store trends, alerting, ML-based anomaly detection
  • ClickStack: Granular store-level investigation, verbose logs, offline access, SQL analytics

Incident Investigation: Cloud vs Edge

Scenario: "Orders failing at Store #4523"

Traditional approach (Cloud-only):

  1. Check Datadog for store #4523
  2. See limited logs (only critical errors sent to cloud)
  3. Can't see detailed local context
  4. Escalate to field ops → 2+ hour response

Edge-enabled approach:

  1. Check Datadog for high-level overview
  2. Access HyperDX for store #4523 directly
  3. Query detailed logs: every order attempt, every API call, every error
  4. Identify root cause in minutes
  5. If needed, export relevant logs to Datadog for correlation with cloud services

Time saved: 2 hours → 10 minutes


Conditional Export: The "Flight Recorder"

Concept: Store verbose telemetry locally, export to Datadog only when needed.

Analogy: Airplane black boxes record everything, but data is only retrieved after an incident.

Export Triggers

When to export edge telemetry to Datadog:

  1. Incident declared: Store reports an issue → export last 2 hours of logs
  2. Alert threshold hit: Error rate >5% → export affected service logs
  3. Manual request: IM engineer investigating → export specific time range
  4. Post-mortem: Export telemetry for historical analysis

Export Configuration

# Example: Conditional export rules
export_rules:
  - trigger: incident_declared
    lookback: 2h
    services: [pos-backend, payment-service]
    log_level: [ERROR, WARN, INFO]
    destination: datadog

  - trigger: error_rate_threshold
    threshold: 5%
    lookback: 30m
    services: [affected_service]
    log_level: [ERROR, WARN]
    destination: datadog

  - trigger: manual_request
    lookback: custom
    services: custom
    log_level: custom
    destination: datadog

Benefits:

  • Cost control: Only pay for telemetry that's actually needed in Datadog
  • Comprehensive data: All telemetry stored locally, nothing is lost
  • Offline resilience: Investigate locally even without internet
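The export rules could be evaluated by a small dispatcher on the edge server. The sketch below mirrors the field names from the YAML example; the rule set, trigger payloads, and function names are hypothetical, not an actual export-job API.

```python
# Hypothetical in-memory form of the export rules shown above.
EXPORT_RULES = [
    {"trigger": "incident_declared", "lookback_min": 120,
     "services": ["pos-backend", "payment-service"],
     "log_levels": {"ERROR", "WARN", "INFO"}},
    {"trigger": "error_rate_threshold", "threshold": 0.05,
     "lookback_min": 30, "services": None,   # None = affected service only
     "log_levels": {"ERROR", "WARN"}},
]

def plan_export(trigger, error_rate=0.0, affected_service=None):
    """Return an export plan for a trigger event, or None if nothing fires."""
    for rule in EXPORT_RULES:
        if rule["trigger"] != trigger:
            continue
        if trigger == "error_rate_threshold" and error_rate < rule["threshold"]:
            return None  # below threshold: keep the data at the edge
        services = rule["services"] or [affected_service]
        return {"services": services,
                "lookback_min": rule["lookback_min"],
                "log_levels": rule["log_levels"],
                "destination": "datadog"}
    return None

plan = plan_export("error_rate_threshold", error_rate=0.07,
                   affected_service="payment-service")
print(plan)
```

The key design point is that the default answer is "don't export": telemetry leaves the edge only when a rule explicitly matches.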

Telemetry Best Practices

1. Structured Logging

Bad:

Order failed

Good:

{
  "timestamp": "2026-02-19T14:32:17Z",
  "level": "ERROR",
  "service": "pos-backend",
  "store_id": "4523",
  "order_id": "12345",
  "error": "Connection timeout",
  "error_code": "TIMEOUT_ERR_001",
  "context": {
    "customer_id": "987654",
    "total": 42.50,
    "payment_method": "credit_card"
  }
}

Why: Structured logs are searchable, filterable, and correlatable.
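Emitting a record like the one above is straightforward: build a dict and serialize it as one JSON object per line. A minimal sketch (field names follow the example; the `log_event` helper is illustrative, not a standard API):

```python
import json
from datetime import datetime, timezone

def log_event(level, service, message, **fields):
    """Emit one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "level": level,
        "service": service,
        "message": message,
        **fields,  # arbitrary structured context (store_id, order_id, ...)
    }
    print(json.dumps(record))
    return record

rec = log_event("ERROR", "pos-backend", "Failed to send order to kitchen",
                store_id="4523", order_id="12345",
                error="Connection timeout", error_code="TIMEOUT_ERR_001")
```

Because every field is a key/value pair, a query like `store_id:4523 AND level:ERROR` works without regex-parsing free text.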


2. Correlation IDs

What: Unique identifier that follows a request through all services.

Example:

Trace ID: 7d5d747b-e160-e280-5049-099d984bcfe0

[pos-frontend] trace_id=7d5d747b → Order received
[pos-backend]  trace_id=7d5d747b → Processing order
[payment-svc]  trace_id=7d5d747b → Authorizing payment
[order-svc]    trace_id=7d5d747b → Creating order

Why: Easily trace a single transaction across multiple services.
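In Python, context variables are one common way to carry a correlation ID without threading it through every function signature. A sketch under that assumption (service names from the example above):

```python
import uuid
from contextvars import ContextVar

# The trace ID is set once at the entry point and read everywhere else.
trace_id_var = ContextVar("trace_id", default="unset")
lines = []  # collected log lines, for illustration

def log(service, message):
    line = f"[{service}] trace_id={trace_id_var.get()} → {message}"
    lines.append(line)
    print(line)

def handle_order():
    trace_id_var.set(str(uuid.uuid4()))  # generated once per request
    log("pos-frontend", "Order received")
    process_payment()

def process_payment():
    # No ID is passed explicitly; it is read from the ambient context.
    log("payment-svc", "Authorizing payment")

handle_order()
```

Across service boundaries the same idea applies, except the ID travels in a request header (e.g. a W3C `traceparent` header) instead of a context variable.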


3. Metric Cardinality

Problem: Too many unique metric labels = high cardinality = performance issues

Bad (high cardinality):

order_count{customer_id="12345", order_id="98765", ...}
# Millions of unique combinations!

Good (low cardinality):

order_count{store_id="4523", status="success"}
# Dozens of stores × 3 statuses = manageable

Why: High cardinality metrics are expensive to store and query.
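The difference is easy to quantify: the number of time series is the product of each label's distinct values. A back-of-envelope comparison (the counts below are illustrative assumptions, not measurements):

```python
# Worst-case series counts for the two labeling schemes above.
customers, orders = 1_000_000, 10_000_000   # assumed distinct values
stores, statuses = 60_000, 3

high_cardinality = customers * orders   # one series per (customer, order) pair
low_cardinality = stores * statuses     # one series per (store, status) pair

print(f"high: {high_cardinality:,} potential series")
print(f"low:  {low_cardinality:,} series")
```

Per-entity identifiers like `customer_id` or `order_id` belong in logs and traces, where high cardinality is expected, not in metric labels.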


4. Sampling Traces

Problem: Tracing every request = massive data volume

Solution: Sample traces intelligently

  • 100% of errors (always trace failed requests)
  • 100% of slow requests (>1s)
  • 1% of successful requests (statistical sample)

Why: Get comprehensive error visibility while controlling costs.
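The policy above can be expressed as a small head-sampling decision function. A sketch using the thresholds stated in the text (the function name and signature are illustrative):

```python
import random

def should_sample(status_code, duration_ms, rate=0.01, rng=random.random):
    """Keep all errors, all slow requests (>1s), ~1% of the rest."""
    if status_code >= 500:   # always trace failed requests
        return True
    if duration_ms > 1000:   # always trace slow requests
        return True
    return rng() < rate      # statistical sample of successful requests

# Errors and slow requests are always kept:
print(should_sample(500, 120))   # True
print(should_sample(200, 1500))  # True
```

Real tracing backends also support tail-based sampling, where the keep/drop decision is made after the whole trace completes; the principle of biasing toward errors and latency outliers is the same.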


Edge Telemetry Retention Strategy

| Telemetry Type | Edge Retention | Cloud Retention | Export Policy |
| --- | --- | --- | --- |
| Error logs | 90 days | 30 days | Always export |
| Warn logs | 60 days | 15 days | Export on incident |
| Info logs | 30 days | 7 days | Export on demand |
| Debug logs | 7 days | Never | Export on manual request |
| Metrics | 90 days | 15 months | Export aggregates only |
| Traces (errors) | 30 days | 15 days | Always export |
| Traces (success) | 7 days | 1 day (sampled) | Sample 1% |

Strategy: Keep verbose data at the edge, send only high-signal data to cloud.
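A retention policy like this ultimately becomes a lookup that a cleanup job on the edge server consults. A sketch (the dictionary keys and helper are hypothetical; the day counts come from the edge-retention column above):

```python
# Edge-side retention windows, in days, per telemetry type.
EDGE_RETENTION_DAYS = {
    "error_logs": 90, "warn_logs": 60, "info_logs": 30, "debug_logs": 7,
    "metrics": 90, "traces_errors": 30, "traces_success": 7,
}

def is_expired(telemetry_type, age_days):
    """True if a record of this type should be purged from edge storage."""
    return age_days > EDGE_RETENTION_DAYS[telemetry_type]

print(is_expired("debug_logs", 10))   # True: past the 7-day window
print(is_expired("error_logs", 45))   # False: within 90 days
```

In ClickHouse specifically, the same effect is usually achieved declaratively with per-table TTL settings rather than an external cleanup job.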


IM Workflow with Edge Telemetry

Step 1: Alert Received

  • Check Datadog for high-level overview
  • Identify affected store(s)

Step 2: Access Edge Telemetry

  • Open ClickStack UI for specific store
  • Query logs for time window around incident (use Lucene for quick search, SQL for analytics)
  • View metrics (CPU, memory, error rates)
  • Trace requests to identify bottlenecks
  • Leverage unified correlation (logs → traces → metrics)

Step 3: Root Cause Analysis

  • Correlate edge telemetry with cloud telemetry
  • Identify: Is it edge, network, or cloud issue?
  • Document findings

Step 4: Export to Datadog (if needed)

  • Trigger export of relevant logs/traces
  • Share Datadog links with team for collaboration
  • Use for post-incident review

Step 5: Resolution & Documentation

  • Resolve incident
  • Update runbook with learnings
  • Track metrics (MTTA, MTTR)

Key Takeaways

  1. Three pillars: Logs (events), Metrics (numbers), Traces (journeys)
  2. Edge telemetry challenge: Can't stream everything to Datadog (cost, bandwidth, connectivity)
  3. Solution: Store locally with ClickStack (ClickHouse + HyperDX + OpenTelemetry), export conditionally
  4. Flight Recorder concept: Verbose local storage, selective cloud export
  5. Complementary tools: Datadog (fleet-wide), ClickStack (store-specific, high-volume)
  6. IM benefit: Faster diagnosis with granular edge visibility and dual query modes (SQL + Lucene)
  7. Best practices: Structured logs, correlation IDs, controlled cardinality, intelligent sampling
  8. ClickStack advantages: Native JSON columns, unified signal correlation, sub-second queries

Discussion Questions

Before moving to Module 4, think about:

  1. What types of logs should ALWAYS be exported to Datadog?
  2. How would you balance local storage capacity vs retention duration?
  3. What edge telemetry would help investigate payment failures?
  4. How would you handle telemetry for stores that are offline for days?

Next Steps

✅ Complete Module 1: Edge Computing
✅ Complete Module 2: Kubernetes Overview
✅ Complete Module 3: Observability & Telemetry
⬜ Read Module 4: ClickStack Deep Dive (ClickHouse + HyperDX + OpenTelemetry)
⬜ Prepare for hands-on local demo (Module 5)

Estimated time to next module: 1 day
