Byte Edge | Reading Module

Edge Telemetry Fundamentals

Learning Guidance

Objectives

  • Define observability and its three pillars: logs, metrics, and traces
  • Explain why edge environments break the assumptions of cloud-only telemetry (connectivity, bandwidth, cost, latency)
  • Describe the ClickStack edge architecture (ClickHouse + HyperDX + OpenTelemetry Collector) and how it complements Datadog
  • Apply the "flight recorder" pattern: store verbose telemetry locally, export to the cloud conditionally

Overview

Reading time: ~40 minutes


What is Observability?

Simple definition: The ability to understand what's happening inside a system by examining its outputs.

Analogy: Your car dashboard

  • Speedometer (metric): Current speed
  • Check engine light (alert): Something is wrong
  • OBD scanner (logs): Detailed error codes and diagnostics

In software: Observability helps answer "Why is the system behaving this way?"


The Three Pillars of Observability

1. Logs

What: Time-stamped records of discrete events.

Example:

2026-02-19T14:32:15Z [INFO] Order #12345 received
2026-02-19T14:32:16Z [INFO] Payment authorized: $42.50
2026-02-19T14:32:17Z [ERROR] Failed to send order to kitchen: Connection timeout
2026-02-19T14:32:18Z [WARN] Retrying order submission (attempt 2/3)

When to use: Debugging specific events, tracing user actions, root cause analysis

IM use case: "What happened when Store #4523 reported order failures at 2:30 PM?"


2. Metrics

What: Numerical measurements aggregated over time.

Example:

http_requests_total{service="pos-backend", status="200"} 45234
http_requests_total{service="pos-backend", status="500"} 12
cpu_usage_percent{host="store-4523"} 87.3
memory_usage_mb{service="payment-processor"} 2048
order_processing_duration_ms{quantile="0.99"} 450

When to use: Tracking trends, setting alerts, capacity planning, performance monitoring

IM use case: "Is the error rate spiking? Is CPU/memory exhausted?"
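Counters like `http_requests_total` above are incremented with labels and rendered in the Prometheus exposition format. A minimal pure-Python sketch of that mechanic (no real metrics library is assumed; label names mirror the example):

```python
from collections import Counter

class CounterMetric:
    """Minimal labeled counter rendered in Prometheus exposition format."""
    def __init__(self, name):
        self.name = name
        self.values = Counter()

    def inc(self, **labels):
        # Store labels as a sorted, hashable key so identical label sets merge.
        self.values[tuple(sorted(labels.items()))] += 1

    def render(self):
        lines = []
        for key, value in sorted(self.values.items()):
            labels = ",".join(f'{k}="{v}"' for k, v in key)
            lines.append(f"{self.name}{{{labels}}} {value}")
        return "\n".join(lines)

requests = CounterMetric("http_requests_total")
for _ in range(3):
    requests.inc(service="pos-backend", status="200")
requests.inc(service="pos-backend", status="500")
print(requests.render())
```

In production this is what libraries such as the Prometheus client do for you; the point here is that a metric is just a name plus a label set plus a number, aggregated over time.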


3. Traces

What: End-to-end journey of a request through multiple services.

Example:

Trace ID: 7d5d747b-e160-e280-5049-099d984bcfe0

1. [pos-frontend] HTTP POST /order (10ms)
   └─> 2. [pos-backend] Process order (150ms)
       ├─> 3. [payment-service] Authorize payment (300ms)
       │   └─> 4. [cybersource-api] External API call (280ms) ⚠️ SLOW
       └─> 5. [order-service] Create order (50ms)
           └─> 6. [kitchen-display] Send to kitchen (20ms)

Total: 530ms (slow because Cybersource took 280ms)

When to use: Identifying bottlenecks, understanding service dependencies, diagnosing latency

IM use case: "Why is order submission taking 30 seconds? Which service is the bottleneck?"
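The span tree above can be sketched with a few lines of Python: each span records its parent, its duration, and a shared trace ID. This is a toy illustration of the concept, not the OpenTelemetry API; service names are taken from the example.

```python
import time
import uuid
from contextlib import contextmanager

spans = []                      # finished spans, appended as they close
trace_id = str(uuid.uuid4())    # one trace ID shared by every span
_stack = []                     # currently open spans (parent tracking)

@contextmanager
def span(name):
    parent = _stack[-1] if _stack else None
    _stack.append(name)
    start = time.perf_counter()
    try:
        yield
    finally:
        _stack.pop()
        spans.append({
            "trace_id": trace_id,
            "name": name,
            "parent": parent,
            "duration_ms": (time.perf_counter() - start) * 1000,
        })

with span("pos-backend: process order"):
    with span("payment-service: authorize"):
        time.sleep(0.01)  # stand-in for the slow external Cybersource call
    with span("order-service: create order"):
        pass

for s in spans:
    print(f'{s["name"]} (parent={s["parent"]}) {s["duration_ms"]:.1f}ms')
```

Because every span carries the same `trace_id` and a parent pointer, a backend can reassemble the tree and show exactly where the time went.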


Why Edge Telemetry is Different

Traditional Cloud Observability

Setup: All services stream logs/metrics/traces to centralized Datadog.

┌────────────┐
│  Service A │──┐
└────────────┘  │
                │
┌────────────┐  │    ┌──────────────┐
│  Service B │──┼───>│   Datadog    │
└────────────┘  │    │  (Cloud SaaS)│
                │    └──────────────┘
┌────────────┐  │
│  Service C │──┘
└────────────┘

Assumptions:

  • Always-on internet connectivity ✅
  • Centralized infrastructure ✅
  • Cost scales with data volume (acceptable for cloud services) ✅

Edge Observability Challenges

Problem 1: Intermittent Connectivity

  • Restaurant internet goes down (storm, ISP outage, construction)
  • Can't stream telemetry to Datadog
  • Solution: Store telemetry locally at the edge

Problem 2: Bandwidth Costs

  • 60,000 stores streaming logs 24/7 = massive bandwidth
  • Especially problematic in international markets (Australia, India, etc.)
  • Solution: Store locally, export only when needed

Problem 3: Data Volume & Cost

  • Edge services generate high-volume logs (every POS transaction, every button click)
  • Sending all data to Datadog = $$$$ (could be 10x-100x current spend)
  • Solution: Store locally with longer retention, export selectively

Problem 4: Latency

  • Investigating an issue at Store #4523
  • Querying Datadog involves: Edge → Cloud → Datadog → Cloud → Edge (round-trip)
  • Solution: Query telemetry directly at the edge (sub-second response)

Edge Telemetry Architecture

High-Level Design

┌─────────────────────────────────────────────────────────┐
│         Cloud (AWS)                                      │
│  ┌──────────────┐         ┌──────────────┐             │
│  │   Datadog    │◄────────│  Export Job  │             │
│  │  (selective) │         │ (conditional)│             │
│  └──────────────┘         └──────┬───────┘             │
└─────────────────────────────────┼─────────────────────┘
                                   │ (on-demand export)
                                   │
┌──────────────────────────────────▼─────────────────────┐
│  Restaurant Edge Server (Store #4523)                   │
│                                                          │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Applications (generate telemetry)               │  │
│  │  - pos-backend                                   │  │
│  │  - payment-service                               │  │
│  │  - order-service                                 │  │
│  └───┬──────────────────────────────────────────┬───┘  │
│      │ (logs, metrics, traces)                  │      │
│      │                                           │      │
│  ┌───▼────────────────────────────────────────────────┐ │
│  │              ClickStack Platform                   │ │
│  │                                                     │ │
│  │  ┌─────────────┐   ┌──────────────┐   ┌────────┐ │ │
│  │  │  HyperDX    │◄──│  ClickHouse  │◄──│  OTLP  │ │ │
│  │  │  (UI/API)   │   │  (Storage)   │   │Collector│ │ │
│  │  │             │   │              │   │         │ │ │
│  │  │ - SQL query │   │ - Logs table │   │ - OTLP  │ │ │
│  │  │ - Lucene    │   │ - Traces tbl │   │ - Enrich│ │ │
│  │  │ - Unified   │   │ - Metrics tbl│   │ - Route │ │ │
│  │  └─────────────┘   └──────────────┘   └────────┘ │ │
│  └─────────────────────────────────────────────────┘ │
│                                                          │
│  [Local Storage: 30-90 days retention]                  │
└──────────────────────────────────────────────────────────┘

Key Components

  1. Applications: Generate telemetry (logs, metrics, traces) using OpenTelemetry SDKs
  2. ClickStack: Complete observability platform consisting of:
     • ClickHouse: Columnar database with native JSON support for high-performance storage
     • HyperDX: Unified UI/API layer with dual query syntax (SQL + Lucene)
     • OpenTelemetry Collector: Standard OTLP ingestion pipeline
  3. Export Job: Conditional logic to send edge telemetry to Datadog when needed

ClickStack vs Datadog

| Feature | Datadog (Cloud) | ClickStack (Edge) |
| --- | --- | --- |
| Deployment | SaaS (cloud-hosted) | Self-hosted (on edge server) |
| Data storage | Datadog's infrastructure | ClickHouse (local, native JSON) |
| Retention | 15-30 days (expensive for more) | 30-90 days (cost = local disk) |
| Access | Internet required | Local network (works offline) |
| Query speed | 1-5 seconds | Sub-second (local queries) |
| Query syntax | Proprietary | SQL + Lucene (dual mode) |
| Cost model | Per GB ingested + retention | Fixed (hardware cost only) |
| High cardinality | Limited by pricing | Billions of labels supported |
| Use case | Global overview, trends | Store-specific deep dives |

Key insight: They complement each other!

  • Datadog: Fleet-wide metrics, cross-store trends, alerting, ML-based anomaly detection
  • ClickStack: Granular store-level investigation, verbose logs, offline access, SQL analytics

Incident Investigation: Cloud vs Edge

Scenario: "Orders failing at Store #4523"

Traditional approach (Cloud-only):

  1. Check Datadog for store #4523
  2. See limited logs (only critical errors sent to cloud)
  3. Can't see detailed local context
  4. Escalate to field ops → 2+ hour response

Edge-enabled approach:

  1. Check Datadog for high-level overview
  2. Access HyperDX for store #4523 directly
  3. Query detailed logs: every order attempt, every API call, every error
  4. Identify root cause in minutes
  5. If needed, export relevant logs to Datadog for correlation with cloud services

Time saved: 2 hours → 10 minutes


Conditional Export: The "Flight Recorder"

Concept: Store verbose telemetry locally, export to Datadog only when needed.

Analogy: Airplane black boxes record everything, but data is only retrieved after an incident.

Export Triggers

When to export edge telemetry to Datadog:

  1. Incident declared: Store reports an issue → export last 2 hours of logs
  2. Alert threshold hit: Error rate >5% → export affected service logs
  3. Manual request: IM engineer investigating → export specific time range
  4. Post-mortem: Export telemetry for historical analysis

Export Configuration

# Example: Conditional export rules
export_rules:
  - trigger: incident_declared
    lookback: 2h
    services: [pos-backend, payment-service]
    log_level: [ERROR, WARN, INFO]
    destination: datadog

  - trigger: error_rate_threshold
    threshold: 5%
    lookback: 30m
    services: [affected_service]
    log_level: [ERROR, WARN]
    destination: datadog

  - trigger: manual_request
    lookback: custom
    services: custom
    log_level: custom
    destination: datadog

Benefits:

  • Cost control: Only pay for telemetry that's actually needed in Datadog
  • Comprehensive data: All telemetry stored locally, nothing is lost
  • Offline resilience: Investigate locally even without internet
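The export rules could be evaluated by a small dispatcher on the edge server. The sketch below mirrors the field names from the YAML example; the rule set, trigger payloads, and function names are hypothetical, not an actual export-job API.

```python
# Hypothetical in-memory form of the export rules shown above.
EXPORT_RULES = [
    {"trigger": "incident_declared", "lookback_min": 120,
     "services": ["pos-backend", "payment-service"],
     "log_levels": {"ERROR", "WARN", "INFO"}},
    {"trigger": "error_rate_threshold", "threshold": 0.05,
     "lookback_min": 30, "services": None,   # None = affected service only
     "log_levels": {"ERROR", "WARN"}},
]

def plan_export(trigger, error_rate=0.0, affected_service=None):
    """Return an export plan for a trigger event, or None if nothing fires."""
    for rule in EXPORT_RULES:
        if rule["trigger"] != trigger:
            continue
        if trigger == "error_rate_threshold" and error_rate < rule["threshold"]:
            return None  # below threshold: keep the data at the edge
        services = rule["services"] or [affected_service]
        return {"services": services,
                "lookback_min": rule["lookback_min"],
                "log_levels": rule["log_levels"],
                "destination": "datadog"}
    return None

plan = plan_export("error_rate_threshold", error_rate=0.07,
                   affected_service="payment-service")
print(plan)
```

The key design point is that the default answer is "don't export": telemetry leaves the edge only when a rule explicitly matches.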

Telemetry Best Practices

1. Structured Logging

Bad:

Order failed

Good:

{
  "timestamp": "2026-02-19T14:32:17Z",
  "level": "ERROR",
  "service": "pos-backend",
  "store_id": "4523",
  "order_id": "12345",
  "error": "Connection timeout",
  "error_code": "TIMEOUT_ERR_001",
  "context": {
    "customer_id": "987654",
    "total": 42.50,
    "payment_method": "credit_card"
  }
}

Why: Structured logs are searchable, filterable, and correlatable.
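Emitting a record like the one above is straightforward: build a dict and serialize it as one JSON object per line. A minimal sketch (field names follow the example; the `log_event` helper is illustrative, not a standard API):

```python
import json
from datetime import datetime, timezone

def log_event(level, service, message, **fields):
    """Emit one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "level": level,
        "service": service,
        "message": message,
        **fields,  # arbitrary structured context (store_id, order_id, ...)
    }
    print(json.dumps(record))
    return record

rec = log_event("ERROR", "pos-backend", "Failed to send order to kitchen",
                store_id="4523", order_id="12345",
                error="Connection timeout", error_code="TIMEOUT_ERR_001")
```

Because every field is a key/value pair, a query like `store_id:4523 AND level:ERROR` works without regex-parsing free text.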


2. Correlation IDs

What: Unique identifier that follows a request through all services.

Example:

Trace ID: 7d5d747b-e160-e280-5049-099d984bcfe0

[pos-frontend] trace_id=7d5d747b → Order received
[pos-backend]  trace_id=7d5d747b → Processing order
[payment-svc]  trace_id=7d5d747b → Authorizing payment
[order-svc]    trace_id=7d5d747b → Creating order

Why: Easily trace a single transaction across multiple services.
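In Python, context variables are one common way to carry a correlation ID without threading it through every function signature. A sketch under that assumption (service names from the example above):

```python
import uuid
from contextvars import ContextVar

# The trace ID is set once at the entry point and read everywhere else.
trace_id_var = ContextVar("trace_id", default="unset")
lines = []  # collected log lines, for illustration

def log(service, message):
    line = f"[{service}] trace_id={trace_id_var.get()} → {message}"
    lines.append(line)
    print(line)

def handle_order():
    trace_id_var.set(str(uuid.uuid4()))  # generated once per request
    log("pos-frontend", "Order received")
    process_payment()

def process_payment():
    # No ID is passed explicitly; it is read from the ambient context.
    log("payment-svc", "Authorizing payment")

handle_order()
```

Across service boundaries the same idea applies, except the ID travels in a request header (e.g. a W3C `traceparent` header) instead of a context variable.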


3. Metric Cardinality

Problem: Too many unique metric labels = high cardinality = performance issues

Bad (high cardinality):

order_count{customer_id="12345", order_id="98765", ...}
# Millions of unique combinations!

Good (low cardinality):

order_count{store_id="4523", status="success"}
# Dozens of stores × 3 statuses = manageable

Why: High cardinality metrics are expensive to store and query.
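The difference is easy to quantify: the number of time series is the product of each label's distinct values. A back-of-envelope comparison (the counts below are illustrative assumptions, not measurements):

```python
# Worst-case series counts for the two labeling schemes above.
customers, orders = 1_000_000, 10_000_000   # assumed distinct values
stores, statuses = 60_000, 3

high_cardinality = customers * orders   # one series per (customer, order) pair
low_cardinality = stores * statuses     # one series per (store, status) pair

print(f"high: {high_cardinality:,} potential series")
print(f"low:  {low_cardinality:,} series")
```

Per-entity identifiers like `customer_id` or `order_id` belong in logs and traces, where high cardinality is expected, not in metric labels.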


4. Sampling Traces

Problem: Tracing every request = massive data volume

Solution: Sample traces intelligently

  • 100% of errors (always trace failed requests)
  • 100% of slow requests (>1s)
  • 1% of successful requests (statistical sample)

Why: Get comprehensive error visibility while controlling costs.
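The policy above can be expressed as a small head-sampling decision function. A sketch using the thresholds stated in the text (the function name and signature are illustrative):

```python
import random

def should_sample(status_code, duration_ms, rate=0.01, rng=random.random):
    """Keep all errors, all slow requests (>1s), ~1% of the rest."""
    if status_code >= 500:   # always trace failed requests
        return True
    if duration_ms > 1000:   # always trace slow requests
        return True
    return rng() < rate      # statistical sample of successful requests

# Errors and slow requests are always kept:
print(should_sample(500, 120))   # True
print(should_sample(200, 1500))  # True
```

Real tracing backends also support tail-based sampling, where the keep/drop decision is made after the whole trace completes; the principle of biasing toward errors and latency outliers is the same.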


Edge Telemetry Retention Strategy

| Telemetry Type | Edge Retention | Cloud Retention | Export Policy |
| --- | --- | --- | --- |
| Error logs | 90 days | 30 days | Always export |
| Warn logs | 60 days | 15 days | Export on incident |
| Info logs | 30 days | 7 days | Export on demand |
| Debug logs | 7 days | Never | Export on manual request |
| Metrics | 90 days | 15 months | Export aggregates only |
| Traces (errors) | 30 days | 15 days | Always export |
| Traces (success) | 7 days | 1 day (sampled) | Sample 1% |

Strategy: Keep verbose data at the edge, send only high-signal data to cloud.
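A retention policy like this ultimately becomes a lookup that a cleanup job on the edge server consults. A sketch (the dictionary keys and helper are hypothetical; the day counts come from the edge-retention column above):

```python
# Edge-side retention windows, in days, per telemetry type.
EDGE_RETENTION_DAYS = {
    "error_logs": 90, "warn_logs": 60, "info_logs": 30, "debug_logs": 7,
    "metrics": 90, "traces_errors": 30, "traces_success": 7,
}

def is_expired(telemetry_type, age_days):
    """True if a record of this type should be purged from edge storage."""
    return age_days > EDGE_RETENTION_DAYS[telemetry_type]

print(is_expired("debug_logs", 10))   # True: past the 7-day window
print(is_expired("error_logs", 45))   # False: within 90 days
```

In ClickHouse specifically, the same effect is usually achieved declaratively with per-table TTL settings rather than an external cleanup job.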


IM Workflow with Edge Telemetry

Step 1: Alert Received

  • Check Datadog for high-level overview
  • Identify affected store(s)

Step 2: Access Edge Telemetry

  • Open ClickStack UI for specific store
  • Query logs for time window around incident (use Lucene for quick search, SQL for analytics)
  • View metrics (CPU, memory, error rates)
  • Trace requests to identify bottlenecks
  • Leverage unified correlation (logs → traces → metrics)

Step 3: Root Cause Analysis

  • Correlate edge telemetry with cloud telemetry
  • Identify: Is it edge, network, or cloud issue?
  • Document findings

Step 4: Export to Datadog (if needed)

  • Trigger export of relevant logs/traces
  • Share Datadog links with team for collaboration
  • Use for post-incident review

Step 5: Resolution & Documentation

  • Resolve incident
  • Update runbook with learnings
  • Track metrics (MTTA, MTTR)

Key Takeaways

  1. Three pillars: Logs (events), Metrics (numbers), Traces (journeys)
  2. Edge telemetry challenge: Can't stream everything to Datadog (cost, bandwidth, connectivity)
  3. Solution: Store locally with ClickStack (ClickHouse + HyperDX + OpenTelemetry), export conditionally
  4. Flight Recorder concept: Verbose local storage, selective cloud export
  5. Complementary tools: Datadog (fleet-wide), ClickStack (store-specific, high-volume)
  6. IM benefit: Faster diagnosis with granular edge visibility and dual query modes (SQL + Lucene)
  7. Best practices: Structured logs, correlation IDs, controlled cardinality, intelligent sampling
  8. ClickStack advantages: Native JSON columns, unified signal correlation, sub-second queries

Discussion Questions

Before moving to Module 4, think about:

  1. What types of logs should ALWAYS be exported to Datadog?
  2. How would you balance local storage capacity vs retention duration?
  3. What edge telemetry would help investigate payment failures?
  4. How would you handle telemetry for stores that are offline for days?

Next Steps

✅ Complete Module 1: Edge Computing
✅ Complete Module 2: Kubernetes Overview
✅ Complete Module 3: Observability & Telemetry
⬜ Read Module 4: ClickStack Deep Dive (ClickHouse + HyperDX + OpenTelemetry)
⬜ Prepare for hands-on local demo (Module 5)

Estimated time to next module: 1 day
