Byte Edge | Reading Module

Edge Computing Foundations

Overview

Reading time: ~30 minutes


What is Edge Computing?

Simple definition: Running compute workloads physically close to where data is created and consumed, rather than in a centralized cloud datacenter.

Traditional cloud model:

[Restaurant Store] --internet--> [AWS Cloud] --internet--> [Restaurant Store]
     POS sends order              Processes order          Receives confirmation

Problem: If the internet connection fails, the store is down. Even when it works, every transaction pays the latency of a round trip to the cloud.

Edge model:

[Restaurant Store with Edge Server]
     POS --local network--> Edge Server --local network--> POS
                               ↓
                         (sync to cloud when available)

Benefit: Store continues operating during internet outages.
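
The "process locally, sync when available" behavior in the diagram above can be sketched as a store-and-forward buffer. This is a minimal illustration, not the Byte Edge implementation: the `EdgeOrderBuffer` class, its fields, and the `cloud_reachable` flag are hypothetical names introduced here, and the `synced` list merely stands in for the cloud side.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EdgeOrderBuffer:
    """Hypothetical edge-side buffer: confirm locally, forward later."""
    pending: deque = field(default_factory=deque)  # orders not yet synced
    synced: list = field(default_factory=list)     # stand-in for the cloud side

    def accept(self, order: dict) -> str:
        # The POS gets its confirmation from the edge server, not the cloud.
        self.pending.append(order)
        return f"confirmed:{order['id']}"

    def sync(self, cloud_reachable: bool) -> int:
        # Forward queued orders in arrival order; do nothing while offline.
        if not cloud_reachable:
            return 0
        sent = 0
        while self.pending:
            self.synced.append(self.pending.popleft())
            sent += 1
        return sent


buffer = EdgeOrderBuffer()
buffer.accept({"id": "A1", "total": 18.50})
buffer.accept({"id": "A2", "total": 7.25})
offline_sent = buffer.sync(cloud_reachable=False)  # internet is down
online_sent = buffer.sync(cloud_reachable=True)    # internet is back
```

Both orders are confirmed to the POS while the link is down, and they reach the cloud in their original order once connectivity returns.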

Why Edge Computing for Restaurants?

Problem: Cloud Dependency Risk

Scenario: A Pizza Hut store loses its internet connection

  • Cloud-only: All POS, kiosks, kitchen displays go offline. Store can't take orders.
  • Edge-enabled: Orders continue processing locally. Sync to cloud when internet returns.

Real Business Impact

  • Revenue protection: Avoids the ~$200-500/hour a store loses during an outage
  • Customer experience: No "sorry, our system is down"
  • Operational continuity: Kitchen keeps receiving orders
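
To make the revenue figure concrete, the arithmetic can be sketched as below. Only the $200-500/hour range comes from this module; the function name, the outage duration, and the store count are illustrative assumptions.

```python
def outage_revenue_at_risk(hours, stores=1, low_per_hour=200, high_per_hour=500):
    """Return the (low, high) revenue at risk in dollars for an outage.

    The per-hour range defaults to the ~$200-500/hour figure quoted above;
    hours and stores are whatever scenario you want to estimate.
    """
    store_hours = hours * stores
    return (store_hours * low_per_hour, store_hours * high_per_hour)


# Hypothetical scenario: one store offline for three hours.
single_store = outage_revenue_at_risk(hours=3)

# Hypothetical scenario: the same outage hitting ten stores at once.
ten_stores = outage_revenue_at_risk(hours=3, stores=10)
```

Under these assumptions a three-hour outage puts roughly $600 to $1,500 at risk for one store, and ten times that for ten stores.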

Technical Requirements

  1. Low latency: Sub-100ms response times for POS transactions
  2. Resilience: Operate during internet/cloud outages
  3. Scale: 60,000+ stores globally
  4. Cost: Bandwidth costs for streaming all data to cloud are prohibitive

Byte Edge: Yum's Edge Platform

What it is: Kubernetes-based platform that runs standardized workloads in restaurants.

What it does:

  • Runs containerized applications (POS backend, order processing, payment processing)
  • Stores data locally (orders, menu, customer data)
  • Syncs with cloud services when connectivity is available
  • Provides local observability and telemetry

Architecture:

┌─────────────────────────────────────────────────────────┐
│                    Yum Cloud (AWS)                       │
│  - Datadog                                              │
│  - Central services                                      │
│  - Analytics/BI                                          │
└────────────────────┬────────────────────────────────────┘
                     │ (internet - unreliable)
                     │
┌────────────────────▼────────────────────────────────────┐
│            Restaurant Edge Server (Byte Edge)            │
│  ┌──────────────────────────────────────────────────┐  │
│  │  Kubernetes Cluster (K8s)                        │  │
│  │  - POS Backend Service                           │  │
│  │  - Order Processing Service                      │  │
│  │  - Payment Service                               │  │
│  │  - Menu Service                                  │  │
│  │  - HyperDX (observability)                       │  │
│  │  - ClickHouse (telemetry database)               │  │
│  └──────────────────────────────────────────────────┘  │
│                                                          │
│  [Local Storage: Orders, Logs, Metrics]                 │
└───┬─────────────┬─────────────┬────────────────────────┘
    │             │             │
    │             │             │
┌───▼───┐   ┌────▼────┐   ┌────▼────┐
│  POS  │   │  Kiosk  │   │ Kitchen │
│       │   │         │   │ Display │
└───────┘   └─────────┘   └─────────┘

Key Differences: Cloud vs Edge

Aspect         Cloud Model                    Edge Model
-------------  -----------------------------  ---------------------------------------
Latency        50-200ms (varies)              1-10ms (local network)
Reliability    Depends on internet            Works offline
Bandwidth      High (stream all data)         Low (sync deltas)
Observability  Stream to Datadog              Store locally, export conditionally
Deployment     Deploy once, runs everywhere   Deploy to 60k+ individual edge servers
Debugging      Centralized logs               Distributed - logs at each edge
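
The "sync deltas" entry in the Bandwidth row means sending only what changed since the last successful sync. A minimal sketch, assuming state is a simple key-to-value mapping (the `compute_delta` helper and the order keys are hypothetical, not part of Byte Edge):

```python
def compute_delta(last_synced: dict, current: dict) -> dict:
    """Return only the entries that changed since the last successful sync."""
    changed = {
        key: value
        for key, value in current.items()
        if key not in last_synced or last_synced[key] != value
    }
    removed = [key for key in last_synced if key not in current]
    return {"changed": changed, "removed": removed}


# What the cloud already knows versus what the edge holds now.
last_synced = {"order-1": "paid", "order-2": "open"}
current = {"order-1": "paid", "order-2": "paid", "order-3": "open"}

delta = compute_delta(last_synced, current)
```

Here only `order-2` and `order-3` cross the wire; the unchanged `order-1` does not.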

Edge Computing Challenges

1. Deployment Complexity

  • Problem: How do you deploy updates to 60,000 stores?
  • Solution: Kubernetes orchestration + automated rollouts
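
One common way to automate a rollout across a large fleet is to ship in waves that start small and grow, so a bad release is caught before it reaches most stores. The sketch below illustrates that idea only; the wave sizes and the `plan_rollout_waves` helper are assumptions, and this module does not describe how Byte Edge actually sequences its rollouts.

```python
def plan_rollout_waves(store_ids, canary_size=10, growth=10):
    """Split stores into waves that grow geometrically from a small canary."""
    waves = []
    start = 0
    size = canary_size
    while start < len(store_ids):
        waves.append(store_ids[start:start + size])
        start += size
        size *= growth  # each wave is larger than the one before
    return waves


store_ids = list(range(1, 60001))  # 60,000 hypothetical store numbers
waves = plan_rollout_waves(store_ids)
wave_sizes = [len(wave) for wave in waves]
```

With these defaults the fleet is covered in five waves of 10, 100, 1,000, 10,000, and the remaining 48,890 stores, with a chance to halt between waves.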

2. Observability

  • Problem: Can't stream all logs/metrics to Datadog (cost, bandwidth)
  • Solution: Store telemetry locally (ClickHouse, queried through HyperDX), export only on demand
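
On-demand export can be pictured as a filter over locally stored records: nothing leaves the store until someone asks for a specific store and time window. The record shape, `ExportRequest`, and `select_for_export` below are hypothetical and do not reflect the real HyperDX or Datadog APIs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    store_id: str
    timestamp: int  # seconds since epoch
    level: str
    message: str


@dataclass(frozen=True)
class ExportRequest:
    store_id: str
    start: int  # inclusive
    end: int    # inclusive


def select_for_export(records, request):
    """Pick only the records that match the requested store and time window."""
    return [
        record
        for record in records
        if record.store_id == request.store_id
        and request.start <= record.timestamp <= request.end
    ]


records = [
    LogRecord("4523", 1000, "WARN", "disk usage 98%"),
    LogRecord("4523", 5000, "INFO", "order confirmed"),
    LogRecord("0042", 1200, "ERROR", "printer offline"),
]
to_export = select_for_export(records, ExportRequest("4523", 900, 2000))
```

Only the one record inside the requested window for store 4523 would be shipped; everything else stays at the edge.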

3. State Management

  • Problem: Each store has local state that must sync with cloud
  • Solution: Event-driven architecture, eventual consistency
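
Eventual consistency depends on events being safe to replay and safe to receive out of order. One simple way to get that, sketched here under the assumption that every event carries a per-order sequence number, is to apply an event only if it is newer than what is already recorded. The event fields and `apply_event` are illustrative, not the platform's actual schema.

```python
def apply_event(state: dict, event: dict) -> bool:
    """Apply an order event unless an equal or newer one was already applied."""
    current = state.get(event["order_id"])
    if current is not None and current["seq"] >= event["seq"]:
        return False  # duplicate or stale event: ignore it
    state[event["order_id"]] = {"seq": event["seq"], "status": event["status"]}
    return True


placed = {"order_id": "17", "seq": 1, "status": "placed"}
paid = {"order_id": "17", "seq": 2, "status": "paid"}

# Events arrive out of order and one is retried after a reconnect.
cloud_state = {}
results = [apply_event(cloud_state, e) for e in (paid, placed, paid)]

# A replica that received them in order ends up in the same state.
in_order_state = {}
for e in (placed, paid):
    apply_event(in_order_state, e)
```

Both replicas converge on order 17 being paid, regardless of delivery order or retries.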

4. Security

  • Problem: Physical access to edge hardware in restaurants
  • Solution: Secrets management, encryption, tamper detection
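
Tamper detection can take many forms. One basic building block is a keyed hash (HMAC) over a file or message, so that any modification by someone without the key is detectable. The sketch below uses Python's standard `hmac` module; the key and payload are placeholders, and in practice the key would come from a secrets manager rather than source code.

```python
import hashlib
import hmac


def sign(key: bytes, payload: bytes) -> str:
    """Compute an HMAC-SHA256 signature for the payload."""
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def is_untampered(key: bytes, payload: bytes, expected_signature: str) -> bool:
    """Check the payload against a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(key, payload), expected_signature)


key = b"placeholder-key-for-illustration-only"
original = b"menu_version=42"
signature = sign(key, original)

intact = is_untampered(key, original, signature)
modified = is_untampered(key, b"menu_version=43", signature)
```

The unmodified payload verifies, while the altered one fails the check.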

5. Incident Management

  • Problem: How do you debug issues at a specific store?
  • Solution: Edge-local telemetry + conditional export to Datadog

Why This Matters for Incident Management

Current State (Cloud-Only)

When a store reports an issue:

  1. Check Datadog for cloud service health
  2. If cloud is healthy → probably edge/network issue
  3. Limited visibility into what's happening at the edge
  4. Escalate to store tech support or field ops

Future State (Byte Edge + HyperDX)

When a store reports an issue:

  1. Access edge telemetry via HyperDX (direct to that store's edge server)
  2. See exact local logs, metrics, traces
  3. Correlate with cloud telemetry in Datadog
  4. Identify if issue is edge, network, or cloud
  5. Trigger export of relevant edge logs to Datadog for deeper analysis

Result: Faster diagnosis, better root cause analysis, fewer escalations.
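
The "edge, network, or cloud" decision in step 4 can be sketched as a small triage function. The three boolean inputs are assumptions about what the responder has already checked (edge telemetry, cloud telemetry, and connectivity between the two); the function is an illustration, not an official runbook.

```python
def classify_incident(edge_healthy: bool, cloud_healthy: bool,
                      edge_can_reach_cloud: bool) -> str:
    """Return a first guess at where the fault lies."""
    if not edge_healthy:
        return "edge"      # local telemetry shows a problem at the store
    if not cloud_healthy:
        return "cloud"     # cloud telemetry shows a problem upstream
    if not edge_can_reach_cloud:
        return "network"   # both sides look fine but cannot talk
    return "unknown"       # nothing obvious: keep investigating


slow_pos = classify_incident(edge_healthy=False, cloud_healthy=True,
                             edge_can_reach_cloud=True)
link_down = classify_incident(edge_healthy=True, cloud_healthy=True,
                              edge_can_reach_cloud=False)
```

Checking the edge first reflects the point of this section: with local telemetry available, a healthy cloud no longer ends the investigation at "probably a local issue".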


Real-World Example

Scenario: Store #4523 reports "POS is slow, orders taking 30+ seconds to confirm"

Without Edge Telemetry:

  • Check Datadog: Cloud services look healthy
  • Conclusion: "Must be a local issue"
  • Escalate to field ops (2+ hour response time)
  • Field tech reboots POS, issue persists
  • Eventually discover: Edge server's disk is 98% full, slowing local database

With Edge Telemetry (HyperDX):

  • Access HyperDX for store #4523
  • Immediately see: Disk utilization 98%, database query latency spiking
  • Root cause identified in 5 minutes
  • Trigger log export to Datadog for historical analysis
  • Remote remediation: clear old logs, resize partition
  • Issue resolved without field visit

Time to resolution: 2+ hours → 15 minutes
Cost saved: $150+ field visit avoided
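
The disk condition in this example is the kind of signal a simple threshold check would surface. A minimal sketch using Python's standard `shutil.disk_usage`; the 85% and 95% thresholds and the `disk_alert` helper are assumptions chosen for illustration.

```python
import shutil


def disk_alert(used_bytes: int, total_bytes: int,
               warn_at: float = 0.85, critical_at: float = 0.95) -> str:
    """Classify disk utilization as ok, warning, or critical."""
    ratio = used_bytes / total_bytes
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


# The 98% utilization from the scenario above.
scenario_status = disk_alert(used_bytes=98, total_bytes=100)

# The same check against the disk this script is running on.
usage = shutil.disk_usage(".")
local_status = disk_alert(usage.used, usage.total)
```

With these thresholds the store in the scenario would have been flagged as critical before orders started slowing down.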


Key Concepts to Remember

  1. Edge computing = Running workloads close to data sources (restaurants)
  2. Byte Edge = Yum's Kubernetes-based edge platform
  3. Resilience = Primary driver (operate during cloud/internet outages)
  4. Local telemetry = ClickStack (ClickHouse + HyperDX + OpenTelemetry) stores logs/metrics at the edge
  5. Conditional export = Send edge telemetry to Datadog only when needed
  6. IM benefit = Better incident diagnosis with store-specific visibility and dual query modes (SQL + Lucene)

Discussion Questions

Before moving to Module 2, think about:

  1. What types of incidents would benefit most from edge telemetry?
  2. When would you want to export edge logs to Datadog vs investigate locally?
  3. How does edge computing change our on-call response playbook?
  4. What new failure modes does edge computing introduce?

Next Steps

✅ Complete this module
⬜ Read Module 2: Kubernetes Overview
⬜ Schedule shadow call with Byte Edge engineer

Estimated time to next module: 1 day (let concepts sink in)
