
Byte Edge | Reading Module

ClickStack Deep Dive for IM



Overview

Reading time: ~50 minutes

ClickStack is ClickHouse's official open-source observability platform that provides a complete edge observability solution.

What is ClickStack?

ClickStack = ClickHouse + HyperDX + OpenTelemetry

| Component     | Purpose                                              |
|---------------|------------------------------------------------------|
| ClickHouse    | Columnar database engine (stores telemetry data)     |
| HyperDX       | Intelligent UI and API layer (unified query experience) |
| OpenTelemetry | Standard ingestion pipeline (OTLP protocol)          |

Key Insight: ClickStack is not just ClickHouse the database - it's a complete, production-ready observability platform optimized for high-volume telemetry workloads.

Industry Adoption:

  • OpenAI: ChatGPT infrastructure monitoring
  • Anthropic: Claude infrastructure observability
  • Tesla: 1 billion events/second processing
  • Shopify: In-house observability platform

Analogy:

  • ClickHouse = PostgreSQL (database)
  • HyperDX = pgAdmin (UI to query the database)
  • ClickStack = The full integrated platform

Why ClickStack for Edge Observability?

Traditional Observability Challenges

Problem: Most observability platforms (Datadog, New Relic, Splunk) are designed for cloud environments with:

  • Always-on connectivity ✅
  • Centralized infrastructure ✅
  • Cost scales with data volume (acceptable for cloud) ✅

Edge Reality: Restaurants, stores, remote locations have:

  • Intermittent connectivity ❌
  • Distributed infrastructure ❌
  • High data volume = prohibitive cloud costs ❌

ClickStack's Edge Advantages

  1. Self-hosted: Runs entirely on edge infrastructure (no cloud dependency)
  2. High performance: Sub-second queries on billions of events
  3. Cost efficient: Storage cost = local disk (not per-GB ingestion fees)
  4. Offline capable: Works without internet connectivity
  5. Unified querying: SQL + Lucene syntax for both power users and beginners
  6. JSON columns: Dynamic schema support without pre-defining fields
  7. High cardinality: Handle millions of unique label combinations efficiently

What is ClickHouse?

Elevator Pitch

ClickHouse is an open-source columnar database optimized for analytics and time-series data.

Why Columnar?

Row-based databases (PostgreSQL, MySQL):

| id  | timestamp           | level | message               |
|-----|---------------------|-------|-----------------------|
| 1   | 2026-02-19 14:00:00 | INFO  | Order created         |
| 2   | 2026-02-19 14:00:01 | ERROR | Payment failed        |
| 3   | 2026-02-19 14:00:02 | INFO  | Order confirmed       |

Storage: Row 1 [1, 2026-02-19 14:00:00, INFO, Order created]
         Row 2 [2, 2026-02-19 14:00:01, ERROR, Payment failed]
         Row 3 [3, 2026-02-19 14:00:02, INFO, Order confirmed]

Columnar databases (ClickHouse):

Storage: Column id        [1, 2, 3]
         Column timestamp [2026-02-19 14:00:00, 2026-02-19 14:00:01, ...]
         Column level     [INFO, ERROR, INFO]
         Column message   [Order created, Payment failed, ...]

Why This Matters for Telemetry

Query: "Count ERROR logs in the last hour"

Row-based: Read ALL columns for ALL rows, filter by level

  • Must scan: id, timestamp, level, message for millions of rows
  • Slow ❌

Columnar: Read only the level and timestamp columns

  • Scan only relevant columns
  • 10-100x faster ✅

Key insight: Telemetry queries typically filter/aggregate on a few columns (timestamp, level, service) → columnar is perfect.
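
The scan difference can be sketched in a few lines of Python. This is a toy model, not ClickHouse internals: the same rows stored both ways, counting how many values each layout must touch to answer the ERROR-count query.

```python
from datetime import datetime, timedelta

# Toy model (not ClickHouse internals): the same 1,000 log rows stored
# row-wise and column-wise, and how many values a filter must touch.
rows = [
    {"id": i,
     "timestamp": datetime(2026, 2, 19, 14, 0, 0) + timedelta(seconds=i),
     "level": "ERROR" if i % 10 == 0 else "INFO",
     "message": f"event {i}"}
    for i in range(1000)
]

# Columnar layout: one list per column.
columns = {key: [r[key] for r in rows] for key in rows[0]}

# Query: count ERROR logs. A row store reads every field of every row...
row_fields_read = sum(len(r) for r in rows)          # 4 fields x 1000 rows
row_count = sum(1 for r in rows if r["level"] == "ERROR")

# ...while a column store reads only the `level` column.
col_fields_read = len(columns["level"])              # 1 field x 1000 rows
col_count = sum(1 for v in columns["level"] if v == "ERROR")

assert row_count == col_count == 100
print(f"row store touched {row_fields_read} values, "
      f"column store touched {col_fields_read}")     # 4000 vs 1000
```

Here the column store reads a quarter of the data; with wide telemetry rows (dozens of attributes) the gap is far larger, which is where the 10-100x figure comes from.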


ClickHouse Features for Telemetry

1. Blazing Fast Queries

  • Compression: Similar values in a column compress well (e.g., "INFO" repeated 1M times)
  • Parallelization: Queries use all CPU cores
  • Vectorized execution: Process thousands of rows per CPU instruction

Result: Query billions of log entries in seconds.
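
The compression point can be made concrete with a toy run-length encoding. ClickHouse actually uses general-purpose codecs such as LZ4 and ZSTD rather than this sketch, but the intuition is the same: a column of repetitive values collapses dramatically.

```python
from itertools import groupby

# Toy sketch: why a repetitive column compresses well. ClickHouse uses
# LZ4/ZSTD codecs, not this RLE toy, but the intuition carries over.
level_column = ["INFO"] * 900 + ["ERROR"] * 100

# Run-length encode: store (value, run_length) pairs instead of every value.
encoded = [(value, len(list(run))) for value, run in groupby(level_column)]

print(encoded)                                   # [('INFO', 900), ('ERROR', 100)]
print(len(level_column), "values ->", len(encoded), "runs")
```

A row store interleaves `level` with ids, timestamps, and messages, so these runs never form; the columnar layout is what makes the compression possible.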


2. Time-Series Optimized

Partitioning by date:

CREATE TABLE logs (
  timestamp DateTime,
  level String,
  message String,
  service String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)  -- One partition per day
ORDER BY (timestamp, service);

Benefit: When querying "last 2 hours," ClickHouse only scans relevant partitions, ignoring the rest.
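
Partition pruning can be sketched as a dictionary of per-day buckets (the row lists are placeholders). A "last 2 hours" query only opens the partitions whose day overlaps the time range; all older partitions are never read.

```python
from datetime import date, datetime, timedelta

# Toy sketch of partition pruning: rows bucketed one partition per day,
# matching PARTITION BY toYYYYMMDD(timestamp). Row contents are placeholders.
partitions = {
    date(2026, 2, 17): ["...rows for the 17th..."],
    date(2026, 2, 18): ["...rows for the 18th..."],
    date(2026, 2, 19): ["...rows for the 19th..."],
}

now = datetime(2026, 2, 19, 1, 0)        # 01:00 on the 19th
start = now - timedelta(hours=2)         # 23:00 on the 18th

# Only partitions overlapping [start, now] are scanned; the 17th is skipped.
scanned = [day for day in partitions
           if start.date() <= day <= now.date()]
print(scanned)
```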


3. TTL (Time-To-Live) for Auto-Cleanup

ALTER TABLE logs MODIFY TTL timestamp + INTERVAL 90 DAY;

Benefit: Automatically delete logs older than 90 days → no manual cleanup, storage stays manageable.


4. Materialized Views for Pre-Aggregation

-- Pre-compute error counts per service per hour
CREATE MATERIALIZED VIEW error_counts_hourly
ENGINE = SummingMergeTree()
ORDER BY (service, hour)
AS SELECT
  service,
  toStartOfHour(timestamp) AS hour,
  countIf(level = 'ERROR') AS error_count
FROM logs
GROUP BY service, hour;

Benefit: Instead of scanning millions of logs to count errors, query the pre-aggregated view (instant results).
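
What a summing materialized view does can be sketched as a counter that is updated on every insert, so reads become a key lookup instead of a scan. This is an illustration of the idea, not ClickHouse's implementation.

```python
from collections import Counter
from datetime import datetime

# Toy sketch of a SummingMergeTree materialized view: maintain running
# per-(service, hour) error counts as rows arrive, so reading the count
# is a lookup rather than a scan over raw logs. Illustration only.
error_counts_hourly = Counter()

def ingest(service: str, timestamp: datetime, level: str) -> None:
    """Insert a log row and incrementally update the pre-aggregated view."""
    if level == "ERROR":
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        error_counts_hourly[(service, hour)] += 1

for minute in range(60):
    ingest("payment-service", datetime(2026, 2, 19, 14, minute), "ERROR")
    ingest("pos-backend", datetime(2026, 2, 19, 14, minute), "INFO")

# Reading the "view" is one lookup per key, regardless of raw log volume.
hour = datetime(2026, 2, 19, 14, 0)
print(error_counts_hourly[("payment-service", hour)])   # 60
print(error_counts_hourly[("pos-backend", hour)])       # 0
```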


What is HyperDX?

Elevator Pitch

HyperDX is the frontend layer of ClickStack: an open-source observability UI (comparable to Datadog) that provides unified querying across all telemetry signals.

Acquired by ClickHouse Inc. in early 2025, HyperDX is now the official UI for ClickStack.

Key Philosophy: No Signal Silos

Traditional observability: Separate tabs for Logs, Metrics, Traces

  • ❌ Forces you to switch between different tools
  • ❌ Manual correlation between signals
  • ❌ Fragmented investigation workflow

HyperDX approach: Unified search across all signals

  • ✅ Single query syntax for logs, metrics, and traces
  • ✅ Automatic correlation (click log → see related trace)
  • ✅ Symptom-to-root-cause workflow

Key Features

  1. Unified Search Experience
     • Query logs, metrics, and traces with one syntax
     • Both SQL (powerful analytics) and Lucene (simple text search) supported
     • Automatic correlation between signals
  2. Dual Query Syntax
     • Lucene: error payment (simple, fast)
     • SQL: SELECT * FROM logs WHERE severity='ERROR' AND body LIKE '%payment%' (powerful, flexible)
     • Choose based on your expertise and use case
  3. Log Search & Filtering
     • Full-text search
     • Structured field filtering
     • Time range selection
     • Pattern detection (clustering similar logs)
  4. Metrics Dashboards
     • Custom dashboards
     • Visualization (line charts, bar charts, heatmaps)
     • Alerting (trigger on thresholds)
  5. Distributed Tracing
     • Trace visualization (waterfall diagrams)
     • Service dependency maps
     • Latency analysis
  6. Correlation
     • Jump from logs → traces → metrics
     • Unified view of all telemetry
     • Client-side + backend telemetry in one view

ClickStack Architecture

┌─────────────────────────────────────────────────────────┐
│  Applications (Edge K8s cluster)                         │
│  - pos-backend                                           │
│  - payment-service                                       │
│  - order-service                                         │
└─────┬───────────────────────────────────────────────────┘
      │ (logs, metrics, traces via OpenTelemetry SDK)
      │
┌─────▼───────────────────────────────────────────────────┐
│  OpenTelemetry Collector (ClickStack Component)          │
│  - Receives telemetry via OTLP protocol (standard)       │
│  - Enriches data (adds store_id, environment tags)       │
│  - Processors: batch, filter, transform                  │
│  - Routes to ClickHouse using native exporter            │
└─────┬───────────────────────────────────────────────────┘
      │
┌─────▼───────────────────────────────────────────────────┐
│  ClickHouse Database (ClickStack Component)              │
│  - logs table (log entries with JSON columns)            │
│  - traces table (spans)                                  │
│  - metrics table (time-series data)                      │
│  - Native JSON type for dynamic fields                   │
│  - High cardinality support (billions of labels)         │
└─────┬───────────────────────────────────────────────────┘
      │
┌─────▼───────────────────────────────────────────────────┐
│  HyperDX API + UI (ClickStack Component)                 │
│  - Web UI: http://clickstack.store-4523.local:8080       │
│  - REST API: Query logs/metrics/traces programmatically  │
│  - Query engine: Dual syntax (SQL + Lucene)              │
│  - Unified search: Logs + Metrics + Traces in one view   │
└──────────────────────────────────────────────────────────┘

Note: ClickStack is the integrated platform. The three components work together seamlessly.


ClickStack's Unique Features

1. Native JSON Column Type

Problem: Traditional observability requires pre-defining every field

  • ❌ "Add a new field? Update the schema first"
  • ❌ Dynamic fields stored as strings = slow queries
  • ❌ Nested JSON requires complex parsing

ClickStack Solution: Native JSON columns

-- Each path in JSON automatically becomes its own column
attributes JSON  -- Dynamically expands to: attributes.order_id, attributes.customer_id, etc.

Performance Gains:

  • 10x faster searches (only read relevant fields)
  • 100x less data scanned (skip irrelevant columns)
  • No manual column management

Real Example:

-- Old approach: String column, slow scan
SELECT * FROM logs WHERE JSONExtractString(attributes, 'order_id') = '12345';

-- ClickStack approach: Native column, fast lookup
SELECT * FROM logs WHERE attributes.order_id = '12345';

2. Dual Query Syntax: SQL + Lucene

HyperDX provides two query modes:

Lucene Syntax (Simple & Fast)

error payment
service:payment-service status:ERROR
store_id:4523 AND (payment OR authorization)

When to use: Quick searches, finding specific events, exploring data

SQL Syntax (Powerful & Analytical)

SELECT service_name, count(*) as error_count
FROM logs
WHERE severity_text = 'ERROR'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY service_name
ORDER BY error_count DESC;

When to use: Aggregations, complex filtering, analytical queries

Key Insight: You can start with Lucene, then switch to SQL for deeper analysis.
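
How a simple `field:value` query maps onto a structured filter can be sketched as follows. This is a deliberately simplified matcher: real Lucene syntax also supports AND/OR grouping, quoting, and ranges, none of which are handled here.

```python
# Hedged sketch: map a bare-bones Lucene-style query onto a structured
# filter. Handles only whitespace-separated terms and field:value pairs;
# real Lucene adds AND/OR, parentheses, quoting, and range queries.
def matches(query: str, log: dict) -> bool:
    for term in query.split():
        if ":" in term:
            # field:value term -> exact match on that structured field
            field, value = term.split(":", 1)
            if str(log.get(field)) != value:
                return False
        else:
            # bare term -> substring match against the log body
            if term not in log.get("body", ""):
                return False
    return True

log = {"service": "payment-service", "status": "ERROR",
       "body": "payment authorization error"}

print(matches("service:payment-service status:ERROR", log))  # True
print(matches("error payment", log))                         # True
print(matches("service:pos-backend", log))                   # False
```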


3. High Cardinality Support

Problem: Traditional time-series databases (Prometheus, InfluxDB) struggle with high cardinality

  • ❌ Millions of unique label combinations = performance degradation
  • ❌ Must sample or limit labels

ClickStack Solution: Everything in one big table

-- Even with billions of unique combinations, no problem
SELECT * FROM metrics
WHERE labels.customer_id = '12345'
  AND labels.order_id = '98765'
  AND labels.payment_method = 'credit_card';

Real-World Scale:

  • Tesla: 1 billion events/second, 1 quintillion rows
  • OpenAI: ChatGPT infrastructure monitoring
  • Anthropic: Claude infrastructure (high cardinality labels)

4. Unified Signal Correlation

HyperDX automatically correlates:

  • Logs with trace_id → Shows related trace
  • Traces with spans → Shows all logs for that request
  • Metrics with labels → Shows related logs and traces

Workflow:

  1. See error spike in metrics dashboard
  2. Click spike → Jump to logs filtered to that time
  3. Click log → See full distributed trace
  4. Identify bottleneck in trace → Jump back to logs for that service

This is the power of ClickStack's unified approach.
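
Under the hood, correlation is just shared keys across tables. The sketch below (toy data, illustrative field names) shows the core move: from one error log, fetch every span and every other log sharing its trace_id, then rank spans by duration to find the bottleneck.

```python
# Toy sketch of signal correlation via trace_id. HyperDX automates these
# joins; conceptually they are lookups on a shared key across tables.
logs = [
    {"trace_id": "t1", "service": "pos-backend", "body": "order received"},
    {"trace_id": "t1", "service": "payment-service", "body": "payment failed"},
    {"trace_id": "t2", "service": "pos-backend", "body": "order received"},
]
traces = [
    {"trace_id": "t1", "span_name": "POST /order", "duration_ms": 1240},
    {"trace_id": "t1", "span_name": "charge_card", "duration_ms": 1180},
]

# Start from the error log, then pivot to everything sharing its trace.
error_log = logs[1]
related_spans = [s for s in traces if s["trace_id"] == error_log["trace_id"]]
related_logs = [l for l in logs if l["trace_id"] == error_log["trace_id"]]

# The slowest span points at the bottleneck operation.
bottleneck = max(related_spans, key=lambda s: s["duration_ms"])
print(bottleneck["span_name"])   # POST /order
print(len(related_logs))         # 2
```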


Telemetry Schema in ClickHouse

Logs Table

CREATE TABLE logs (
  timestamp DateTime64(3),         -- Millisecond precision
  trace_id String,                 -- Correlation ID (links to traces)
  span_id String,                  -- Span ID (links to specific trace span)
  severity_text String,            -- INFO, WARN, ERROR, DEBUG
  severity_number Int8,            -- Numeric severity (for sorting)
  service_name String,             -- pos-backend, payment-service, etc.
  body String,                     -- Log message

  -- Resource attributes (describe the source)
  resource_store_id String,        -- Store #4523
  resource_environment String,     -- production, staging
  resource_k8s_pod_name String,    -- pos-backend-abc123
  resource_k8s_namespace String,   -- pos

  -- Log attributes (structured data from application)
  attributes Map(String, String),  -- Key-value pairs (e.g., order_id, customer_id)

  INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, service_name, severity_number);

Traces Table

CREATE TABLE traces (
  timestamp DateTime64(3),
  trace_id String,                 -- Unique trace ID
  span_id String,                  -- Unique span ID
  parent_span_id String,           -- Parent span (for hierarchy)
  span_name String,                -- Operation name (e.g., "POST /order")
  span_kind String,                -- SERVER, CLIENT, INTERNAL
  service_name String,
  duration_ns UInt64,              -- Span duration in nanoseconds
  status_code String,              -- OK, ERROR

  -- Span attributes
  attributes Map(String, String),  -- http.method, http.status_code, etc.

  -- Resource attributes
  resource_store_id String,
  resource_environment String,

  INDEX idx_trace_id trace_id TYPE bloom_filter GRANULARITY 1
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, trace_id, span_id);

Metrics Table

CREATE TABLE metrics (
  timestamp DateTime64(3),
  metric_name String,              -- cpu_usage_percent, order_count, etc.
  value Float64,                   -- Metric value

  -- Metric attributes (labels)
  attributes Map(String, String),  -- service, host, status, etc.

  -- Resource attributes
  resource_store_id String,
  resource_environment String
) ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (timestamp, metric_name);

Common Query Patterns for IM

Pattern 1: Recent Error Logs

SELECT
  timestamp,
  service_name,
  body,
  attributes['order_id'] AS order_id,
  attributes['customer_id'] AS customer_id
FROM logs
WHERE
  timestamp >= now() - INTERVAL 2 HOUR
  AND severity_text = 'ERROR'
  AND resource_store_id = '4523'
ORDER BY timestamp DESC
LIMIT 100;

Use case: "Show me all errors at Store #4523 in the last 2 hours"


Pattern 2: Error Rate by Service

SELECT
  service_name,
  countIf(severity_text = 'ERROR') AS error_count,
  count() AS total_count,
  (error_count / total_count) * 100 AS error_rate_pct
FROM logs
WHERE
  timestamp >= now() - INTERVAL 1 HOUR
  AND resource_store_id = '4523'
GROUP BY service_name
ORDER BY error_rate_pct DESC;

Use case: "Which service has the highest error rate?"


Pattern 3: Trace Lookup by ID

SELECT
  span_id,
  parent_span_id,
  span_name,
  service_name,
  duration_ns / 1000000 AS duration_ms,
  status_code,
  attributes
FROM traces
WHERE
  trace_id = '7d5d747b-e160-e280-5049-099d984bcfe0'
ORDER BY timestamp ASC;

Use case: "Show me the full trace for this order"


Pattern 4: Slow Traces (P99 Latency)

SELECT
  trace_id,
  span_name,
  service_name,
  max(duration_ns) / 1000000 AS max_duration_ms
FROM traces
WHERE
  timestamp >= now() - INTERVAL 1 HOUR
  AND resource_store_id = '4523'
  AND span_kind = 'SERVER'
GROUP BY trace_id, span_name, service_name
HAVING max_duration_ms > 1000  -- Slower than 1 second
ORDER BY max_duration_ms DESC
LIMIT 20;

Use case: "What are the slowest requests in the last hour?"


Pattern 5: Correlated Logs for a Trace

SELECT
  timestamp,
  service_name,
  severity_text,
  body
FROM logs
WHERE
  trace_id = '7d5d747b-e160-e280-5049-099d984bcfe0'
ORDER BY timestamp ASC;

Use case: "Show me all logs related to this trace"


Pattern 6: Metric Trend (CPU Usage)

SELECT
  toStartOfMinute(timestamp) AS minute,
  avg(value) AS avg_cpu
FROM metrics
WHERE
  metric_name = 'cpu_usage_percent'
  AND resource_store_id = '4523'
  AND attributes['service'] = 'pos-backend'
  AND timestamp >= now() - INTERVAL 1 HOUR
GROUP BY minute
ORDER BY minute ASC;

Use case: "Graph CPU usage for pos-backend over the last hour"


HyperDX UI Walkthrough

1. Log Search

UI Location: HyperDX → Logs

Features:

  • Full-text search: Search across all log messages
  • Field filters: Filter by service, level, store_id, etc.
  • Time picker: Last 15m, 1h, 4h, custom range
  • Live tail: Stream logs in real-time

Example:

Query: "payment failed"
Filters:
  - service_name = payment-service
  - severity_text = ERROR
  - resource_store_id = 4523
Time range: Last 2 hours

Result: All error logs containing "payment failed" from the payment service at store #4523.


2. Trace View

UI Location: HyperDX → Traces

Features:

  • Waterfall diagram: Visualize span hierarchy and timing
  • Service map: See service dependencies
  • Filter by duration: Find slow traces
  • Filter by status: Find failed traces

Example:

Filters:
  - duration > 1000ms
  - status = ERROR
  - service_name = pos-backend
Time range: Last 1 hour

Result: All failed traces from pos-backend that took longer than 1 second.


3. Metrics Dashboard

UI Location: HyperDX → Metrics

Features:

  • Custom dashboards: Create charts for key metrics
  • Visualization types: Line chart, bar chart, heatmap, gauge
  • Alerts: Set thresholds and get notified

Example Dashboard:

  • Panel 1: CPU usage (line chart)
  • Panel 2: Error rate per service (bar chart)
  • Panel 3: Request latency P50/P99 (line chart)
  • Panel 4: Active orders (gauge)

4. Correlation: Logs ↔ Traces ↔ Metrics

Workflow:

  1. See spike in error rate (Metrics Dashboard)
  2. Click spike → Jump to Logs filtered to that time range
  3. Click error log → See associated trace_id
  4. Click trace_id → View full trace waterfall
  5. Identify slow span → See which service caused delay

This is the power of unified observability!


ClickStack vs Datadog: Feature Comparison

| Feature             | ClickStack (Edge)                 | Datadog (Cloud)                          |
|---------------------|-----------------------------------|------------------------------------------|
| Log search          | ✅ Unified (SQL + Lucene)         | ✅ Full-text + structured                |
| Distributed tracing | ✅ Waterfall, service map         | ✅ Waterfall, service map, flame graphs  |
| Metrics dashboards  | ✅ Custom dashboards              | ✅ Custom dashboards + anomaly detection |
| Alerting            | ✅ Basic threshold alerts         | ✅ Advanced ML-based alerts              |
| APM                 | ✅ Basic (via OpenTelemetry)      | ✅ Full APM (profiling, code hotspots)   |
| Log patterns        | ✅ Pattern detection + clustering | ✅ Pattern detection + clustering        |
| Query syntax        | ✅ SQL + Lucene (dual mode)       | ⚠️ Proprietary syntax only               |
| High cardinality    | ✅ Billions of labels             | ⚠️ Limited by pricing                    |
| JSON support        | ✅ Native JSON columns            | ⚠️ Parsed at query time                  |
| Deployment          | Self-hosted (edge)                | SaaS (cloud)                             |
| Cost                | Fixed (hardware)                  | Variable (per GB ingested)               |
| Internet required   | ❌ No (works offline)             | ✅ Yes                                   |
| Retention           | 30-90 days (disk space)           | 15-30 days (default)                     |
| Query speed         | ⚡ Sub-second (local)             | ~1-5 seconds (network latency)           |
| Industry adoption   | Tesla, OpenAI, Anthropic          | Most Fortune 500                         |

Key takeaway: ClickStack excels at high-volume, edge deployment scenarios. Datadog is better for fleet-wide analysis and advanced features like ML-based anomaly detection.


When to Use ClickStack vs Datadog

Use ClickStack (Edge) When:

  1. Store-specific investigation: Debugging issues at a specific location
  2. High-volume verbose logs: Full transaction logs, debug traces (not sent to Datadog)
  3. Offline scenarios: Store internet is down, need local observability
  4. Historical deep dives: Need data beyond Datadog's retention window
  5. SQL power users: Complex analytics queries on telemetry data
  6. High cardinality queries: Millions of unique label combinations
  7. Cost optimization: Avoid per-GB Datadog ingestion fees

Use Datadog (Cloud) When:

  1. Fleet-wide analysis: Query across all stores simultaneously
  2. Cross-store correlation: Is this affecting multiple locations?
  3. Long-term trend analysis: Months of aggregated data
  4. Advanced features: ML anomaly detection, forecasting, APM profiling
  5. Collaboration: Share links with team (cloud-based access)
  6. Alerting: Sophisticated alert routing and escalation

Use Both When:

  1. Initial investigation: Start in ClickStack (fast, granular, local)
  2. Fleet correlation: Export relevant data to Datadog for cross-store analysis
  3. Edge + cloud debugging: Correlate edge telemetry with cloud service telemetry
  4. Post-incident review: Combine edge + cloud data for comprehensive analysis
  5. Compliance: Keep full audit trail at edge, send summaries to cloud

Conditional Export: ClickStack → Datadog

Export Mechanism

Option 1: API-based export (ClickStack native)

# Export logs from ClickStack to Datadog
clickstack export \
  --source-store 4523 \
  --time-range "2026-02-19T14:00:00Z/2026-02-19T16:00:00Z" \
  --service pos-backend \
  --severity ERROR,WARN \
  --destination datadog \
  --format otlp

Option 2: OpenTelemetry Collector (dual export)

# Configure the OTel Collector to send to both ClickHouse and Datadog.
# Exporters have no `enabled` flag; an exporter runs only when listed
# in a pipeline, so "enabling" Datadog export means adding it below.
exporters:
  clickhouse:
    endpoint: tcp://localhost:9000

  datadog:
    api:
      key: ${DD_API_KEY}

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [clickhouse]  # add `datadog` here only when export is needed

Option 3: Datadog Agent on Edge

  • Run Datadog Agent on edge server (disabled by default)
  • Enable agent only when export is needed
  • Configure filters to send specific logs/metrics

Option 4: Batch Export Job

  • Scheduled job runs daily
  • Export aggregated metrics (error counts, P99 latency, etc.)
  • Keep verbose logs at the edge, send summaries to cloud
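
The batch-export idea can be sketched as a small aggregation step: raw logs stay on local disk, and only a compact per-service summary crosses the WAN. Field names here are illustrative, not a real ClickStack or Datadog schema.

```python
from collections import defaultdict

# Hedged sketch of option 4: keep raw logs at the edge and ship only a
# daily summary (counts per service) to the cloud. Field names are
# illustrative, not a real ClickStack or Datadog schema.
raw_logs = [
    {"service": "pos-backend", "severity": "ERROR"},
    {"service": "pos-backend", "severity": "INFO"},
    {"service": "payment-service", "severity": "ERROR"},
    {"service": "payment-service", "severity": "ERROR"},
]

def daily_summary(logs):
    """Collapse raw logs into per-service totals and error counts."""
    counts = defaultdict(lambda: {"total": 0, "errors": 0})
    for log in logs:
        entry = counts[log["service"]]
        entry["total"] += 1
        if log["severity"] == "ERROR":
            entry["errors"] += 1
    return dict(counts)

summary = daily_summary(raw_logs)
print(summary["payment-service"])   # {'total': 2, 'errors': 2}
print(summary["pos-backend"])       # {'total': 2, 'errors': 1}
```

The summary is what a scheduled job would POST to the cloud backend; the per-GB ingestion cost then scales with the number of services, not with raw log volume.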

Export Triggers (Byte Edge Implementation)

Trigger 1: Incident Declared

trigger: incident_declared
store_id: 4523
lookback: 2h
export:
  logs: [ERROR, WARN]
  traces: [status=ERROR]
  metrics: [cpu_usage, memory_usage, error_rate]

Trigger 2: Alert Threshold

trigger: error_rate > 5%
store_id: 4523
lookback: 30m
export:
  logs: [ERROR]
  traces: [status=ERROR, duration>1s]

Trigger 3: Manual Request

# IM engineer manually triggers export
hyperdx-export --store 4523 --time-range "last 2h" --all

Installation & Setup (Overview)

Prerequisites

  • Kubernetes cluster (edge server)
  • Persistent storage (local disk or NFS)
  • 8GB+ RAM for ClickHouse
  • 4GB+ RAM for HyperDX UI
  • 2GB+ RAM for OpenTelemetry Collector

Deployment (Helm Chart)

# Add ClickStack Helm repo
helm repo add clickstack https://clickhouse.com/clickstack

# Install complete ClickStack (ClickHouse + HyperDX + OTLP Collector)
helm install clickstack clickstack/clickstack \
  --namespace observability \
  --set clickhouse.persistence.size=100Gi \
  --set clickhouse.retention.logs=90d \
  --set clickhouse.jsonColumns.enabled=true \
  --set hyperdx.ingress.enabled=true \
  --set hyperdx.auth.enabled=true \
  --set otel.collector.enabled=true \
  --set otel.collector.endpoint=0.0.0.0:4318

Application Instrumentation

// Example: Instrument Node.js app with OpenTelemetry (standard OTLP)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(
    new OTLPTraceExporter({
      url: 'http://clickstack-collector:4318/v1/traces'  // ClickStack OTLP endpoint
    })
  )
);
provider.register();

Key point for IM: You won't be deploying this, but understanding the architecture helps with troubleshooting.

ClickStack vs Individual Components: ClickStack provides a single Helm chart that deploys all three components (ClickHouse, HyperDX, OpenTelemetry) with optimized configurations.


Key Takeaways

  1. ClickStack: Complete observability platform = ClickHouse + HyperDX + OpenTelemetry
  2. ClickHouse: Columnar database, optimized for time-series telemetry data with native JSON support
  3. HyperDX: Unified observability UI with dual query syntax (SQL + Lucene)
  4. OpenTelemetry: Standard ingestion pipeline (OTLP protocol)
  5. Columnar advantage: 10-100x faster queries for analytics workloads
  6. Native JSON columns: Dynamic schema, 10x faster searches, 100x less data scanned
  7. High cardinality: Handle billions of unique label combinations (Tesla: 1B events/sec)
  8. Edge deployment: Runs locally on edge servers, works offline
  9. Dual query syntax: Simple Lucene for exploration, powerful SQL for analytics
  10. Conditional export: Store locally, export to Datadog when needed
  11. Industry adoption: OpenAI (ChatGPT), Anthropic (Claude), Tesla, Shopify
  12. IM workflow: Investigate in ClickStack (fast, local), export to Datadog (collaborate, fleet-wide)

Discussion Questions

Before moving to Module 5, think about:

  1. What would you do if ClickStack itself is down at a store?
  2. How would you handle a store running out of disk space for telemetry?
  3. What telemetry would you export to Datadog after a payment outage?
  4. How would you troubleshoot if ClickHouse queries are slow?
  5. When would you use Lucene syntax vs SQL for querying?
  6. How does ClickStack's JSON column type improve query performance?
  7. What's the benefit of OpenTelemetry's OTLP protocol for edge deployments?

Next Steps

✅ Complete Module 1: Edge Computing
✅ Complete Module 2: Kubernetes Overview
✅ Complete Module 3: Observability & Telemetry
✅ Complete Module 4: ClickStack Deep Dive
⬜ Read Module 5: Local Demo Setup
⬜ Complete Module 6: Hands-On Exercises

Estimated time to next module: 1 day (prepare local environment)


Additional Resources

  • ClickStack Docs: https://clickhouse.com/docs/en/observability/clickstack
  • ClickStack GitHub: https://github.com/clickhouse/clickstack
  • HyperDX GitHub: https://github.com/hyperdxio/hyperdx
  • OpenTelemetry Docs: https://opentelemetry.io/docs/
  • ClickStack Open House: https://www.youtube.com/watch?v=clickstack-openhouse (Tesla, OpenAI, Anthropic use cases)
