
Commerce | Reading Module

Global Investigation Workflow

Status: Not Started | Pass threshold: 100% | Points: 90


Learning Guidance

Objectives

  • Run a consistent end-to-end workflow from trigger to closure.
  • Identify alert windows and baseline comparison intervals.
  • Use infrastructure-first checks to avoid false root-cause assumptions.

Evidence To Capture

  • Alert window with trigger and recovery timestamps.
  • Impact statement with market/store scope.
  • Escalation packet including owner and next action.

Source Artifacts

Internal source references are available for maintainers but are not exposed in deployed environments.


Module Content



Overview

This document outlines the standard investigation workflow for Datadog monitor alerts.


Step 1: Read the Event Details (Your Head Start)

When a monitor triggers, the Event Details provide immediate context:

Field | What It Tells You
Org | Which market is affected (e.g., kfc_us, tb_us, ph_uk)
Env | Which stack/environment (production, prod-hopper, prod-curie)
Service | Which service triggered the alert

This is your starting point - focus your investigation here first, then expand if needed.


Step 2: Check Monitor Status & Query

Before diving into logs, understand what the monitor is measuring:

Current Status

Status | Meaning
OK (Green) | Recovered - but still investigate if recently triggered
Warn (Yellow) | Crossed warning threshold - needs attention
Alert (Red) | Crossed critical threshold - urgent action needed

Understand the Query

Read the monitor query to understand:

  • What metric/logs it's measuring
  • The calculation (count, rate, percentage, etc.)
  • Thresholds for warn vs alert
  • Evaluation window (last 5m, 15m, etc.)

Example: (failures / total) * 100 > 40 means alert when failure rate exceeds 40%.
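The threshold logic behind such a query can be sketched in Python. The 40% alert threshold comes from the example above; the 25% warn threshold and the function name are illustrative assumptions:

```python
def alert_status(failures: int, total: int,
                 warn_pct: float = 25.0, alert_pct: float = 40.0) -> str:
    """Classify a failure rate against warn/alert thresholds.

    The 40% alert cutoff matches the example query; the 25% warn
    cutoff is an illustrative assumption.
    """
    if total == 0:
        return "no-data"  # nothing to measure in the evaluation window
    rate = (failures / total) * 100
    if rate > alert_pct:
        return "alert"
    if rate > warn_pct:
        return "warn"
    return "ok"

print(alert_status(45, 100))  # 45% failure rate exceeds the 40% threshold: alert
```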

Check Trend Across Windows

Always check multiple time windows to understand trajectory:

Window | Purpose
5 min | Current state - is it getting better or worse?
15 min | Recent trend - matches monitor eval window
30 min | Broader context - was it worse earlier?

Trend matters: A monitor at 41% but trending down from 50% is recovering. A monitor at 39% but trending up from 30% may soon trigger.
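Combining level and direction, as a small sketch (the 40% threshold follows the earlier example; the labels are illustrative):

```python
def classify_trend(earlier_pct: float, current_pct: float,
                   alert_threshold: float = 40.0) -> str:
    """Combine level and direction: a recovering monitor above threshold
    needs different handling than a worsening one below it."""
    direction = "improving" if current_pct < earlier_pct else "worsening"
    level = "above" if current_pct > alert_threshold else "below"
    return f"{level} threshold, {direction}"

print(classify_trend(50, 41))  # above threshold, improving (recovering)
print(classify_trend(30, 39))  # below threshold, worsening (may soon trigger)
```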


Step 3: Find the Alert Window

Critical concept: The event timestamp tells you WHEN the alert triggered. Your investigation focuses on the alert window - from trigger to recovery.

Time Window Search Strategy

When searching for alert events, start tight and expand if needed:

Attempt | Window | When to Use
1 | Last 1 day | Default - alerts are usually recent
2 | Last 2 days | If nothing found
3 | Last 5 days | Expand further
4 | Last 10 days | Older incidents
5 | Last 20-30 days | Historical analysis

Important: Keep search ranges specific and tight during deep analysis to improve performance and relevance.
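The expanding-window strategy can be sketched as a loop. The window list approximates the table above (20-30 days collapsed to 30), and `search_fn` is a placeholder for whatever client call actually runs the event search:

```python
# Expanding lookback windows, in days, mirroring the table above.
SEARCH_WINDOWS_DAYS = [1, 2, 5, 10, 30]

def find_alert_events(search_fn):
    """Try progressively wider lookback windows until events are found.

    `search_fn` is an assumed callable: it takes a lookback in days
    and returns a list of matching events (empty if none).
    """
    for days in SEARCH_WINDOWS_DAYS:
        events = search_fn(days)
        if events:
            return days, events  # stop at the tightest window that matches
    return None, []

# Toy search: events only appear once the lookback reaches 5 days.
days, events = find_alert_events(
    lambda d: ["Triggered TB Captures"] if d >= 5 else [])
print(days)  # 5
```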

Timestamp Format

Always use EST/ET with date for all timestamps in outputs:

  • Correct: Jan 24, 2026 3:47 PM ET
  • Avoid: 20:47:04 UTC (convert to ET)
  • Avoid: 3:47 PM (missing date)
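A minimal conversion helper, assuming the host has IANA time zone data available to the standard-library `zoneinfo` module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; needs IANA tz data

ET = ZoneInfo("America/New_York")

def format_et(ts: datetime) -> str:
    """Render a timezone-aware timestamp as 'Mon D, YYYY H:MM AM/PM ET'."""
    local = ts.astimezone(ET)
    hour12 = local.hour % 12 or 12  # 12-hour clock without a leading zero
    return f"{local:%b} {local.day}, {local:%Y} {hour12}:{local:%M} {local:%p} ET"

print(format_et(datetime(2026, 1, 24, 20, 47, 4, tzinfo=timezone.utc)))
# Jan 24, 2026 3:47 PM ET
```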

Don't Just Look at "Now"

Investigations may come hours, days, or weeks after the alert. You need to analyze the historical snapshot, not current state.

Time Period | What to Analyze
Alert window | trigger_time → recovery_time (the incident)
Before alert | Baseline - what does "normal" look like?
After recovery | Did we return to normal?
Now | Current state - is it still healthy?

Finding the Window

  1. Check the event timeline for the monitor
  2. Find when it triggered (Warn/Alert)
  3. Find when it recovered (OK)
  4. That window is your investigation focus

IMPORTANT: Searching for Alert Events

When the monitor is currently OK (green), you MUST search for the actual alert events.

Many monitors have scheduled downtimes (overnight closures). Searching monitor_id:XXXXX often only returns downtime events, NOT the actual alerts.

What works:

# Search for the alert text directly
Triggered [monitor name keywords]

# Examples:
Triggered TB Captures
Triggered OTP Failure
Triggered Payment Decline

This finds events with titles like [Triggered on {...}] and [Recovered on {...}].

Why this matters:

  • Most monitors have overnight downtimes (business closed)
  • monitor_id:X query returns downtime start/end events
  • Actual alert events require searching by alert title text
  • Always look back to find the most recent Triggered/Recovered pair
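Finding the most recent Triggered/Recovered pair can be sketched as a scan over event titles, newest first. The event shape (timestamp, title) is an assumption for illustration:

```python
def latest_alert_window(events):
    """Find the most recent alert window from events listed newest-first.

    `events` is a list of (timestamp, title) pairs. Returns
    (trigger_time, recovery_time); recovery_time is None if the
    alert is still ongoing.
    """
    recovered_at = None
    for ts, title in events:
        if title.startswith("[Recovered") and recovered_at is None:
            recovered_at = ts
        elif title.startswith("[Triggered"):
            return ts, recovered_at
    return None, None

events = [
    ("14:30", "[Recovered on {org:tb_us}] TB Captures"),
    ("13:09", "[Triggered on {org:tb_us}] TB Captures"),
    ("09:00", "[Recovered on {org:tb_us}] TB Captures"),
]
print(latest_alert_window(events))  # ('13:09', '14:30')
```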

Example Timeline Analysis

08:00 UTC - Baseline: 150 events/min (normal)
09:00 UTC - Scheduled downtime starts
12:30 UTC - Downtime ends
12:50 UTC - Volume at 10/min (still low)
13:09 UTC - ALERT TRIGGERED (below threshold)
14:00 UTC - Volume at 40/min (recovering)
14:30 UTC - OK (recovered)

Investigation window: 12:30-14:30 UTC
Compare to baseline: 08:00-09:00 UTC


Step 4: Read the Runbook (Human Source)

Before investigating independently:

  1. Check if the monitor has an attached runbook (in the notification message)
  2. Use the Atlassian MCP to fetch the Confluence page content
  3. Extract escalation paths, contacts, and actions
  4. Note any steps that are unclear or outdated

The runbook represents accumulated team knowledge for humans - escalation paths, who to contact, what actions to take.


Step 5: Check the Knowledge Base (Your Learned Patterns)

Before diving into logs, check if you have prior learnings for this monitor/service.

File | What to Check
knowledge-base/common-monitors.md | Have you investigated this monitor before?
knowledge-base/services/{service}.yaml | Do you have investigation patterns for this service?

If you have prior learnings:

  • Use the search queries you documented
  • Apply the metadata field knowledge
  • Follow the investigation patterns that worked before

If this is a new monitor/service:

  • You'll discover patterns during investigation
  • Update the knowledge base after so next time is faster

The knowledge base is YOUR brain - it compounds with every investigation.


Step 6: Investigate with Context

Start with the trigger context (Org, Env, Service) but stay open to:

  • Other organizations affected (multitenant stacks share infrastructure)
  • Upstream/downstream service issues
  • External vendor problems

Key Questions to Answer

  1. Scope: Is this isolated to one org/market or widespread?
  2. Cause: User-caused errors vs system failures?
  3. Timing: When did it start? Correlate with deployments/changes?
  4. Impact: What's the business impact? Order flow affected?

Step 7: Infrastructure Correlation (Before Assuming Root Cause)

Critical concept: When you see errors like "can't reach database" or "connection timeout", don't assume the target is down. The SOURCE (pod/node) might be the problem.

7a. Group Errors by Infrastructure Dimension

Before deep-diving into error content, check WHERE errors are coming from:

-- Group by Kubernetes node
SELECT kube_node, count(*) as errors
FROM logs
GROUP BY kube_node
ORDER BY errors DESC

-- Group by pod
SELECT pod_name, count(*) as errors
FROM logs
GROUP BY pod_name
ORDER BY errors DESC

Pattern | Likely Cause | Next Step
Errors concentrated on 1-2 nodes | Node issue (unhealthy, network, resources) | Check node health
Errors concentrated on specific pods | Pod issue (OOM, crash loop, bad deployment) | Check pod health
Errors distributed across all nodes/pods | Application or downstream issue (database, vendor) | Check downstream health
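Detecting concentration can be sketched from per-node error counts. The 80% share cutoff is an illustrative assumption, not a product threshold:

```python
from collections import Counter

def error_concentration(node_counts: dict, share: float = 0.8) -> list:
    """Return the smallest set of nodes that together carry `share`
    of all errors. A short list means errors are concentrated; a list
    covering most nodes means they are distributed."""
    total = sum(node_counts.values())
    running, hot_nodes = 0, []
    for node, count in Counter(node_counts).most_common():
        hot_nodes.append(node)
        running += count
        if running / total >= share:
            break
    return hot_nodes

# One node carries nearly all errors -> investigate that node first.
counts = {"ip-10-10-27-170": 940, "ip-10-10-26-138": 40, "ip-10-10-25-12": 20}
print(error_concentration(counts))  # ['ip-10-10-27-170']
```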

7b. Infrastructure Tags Reference

These tags are available on application logs - use them for correlation:

Tag | Example | Use For
kube_node | ip-10-10-26-138.ec2.internal | Node-level correlation
pod_name | platform-router-storefront-5d74b797f4-65mpq | Pod-level correlation
kube_namespace | graph-core-prod-curie-use1 | Namespace scoping
kube_cluster_name | prod-curie | Cluster-level correlation
availability-zone | us-east-1a | AZ-level issues
eks_nodegroup-name | yce-curie-prod-eksstack-e45e-green-node-group | Node group issues
instance-type | c7i.8xlarge | Instance-type specific issues
container_name | platform-router-storefront | Container identification

7c. Kubernetes Health Check

Metric for unhealthy pods:

kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded}

Scope by namespace: kube_namespace:storemenu-prod-curie-use1

Dashboard: Kubernetes Pods Overview

  • Filter by namespace to see pod states
  • Look for: pods not ready, restarts, OOM, CrashLoopBackOff

7d. Database/RDS Health Check

If errors indicate database connectivity issues AND errors are distributed (not node-concentrated):

Dashboards:

What to check:

  • Connection count (maxed out?)
  • CPU/memory utilization
  • Read/write latency spikes
  • Recent failover events

7e. Known Failure Signatures

Signature | Likely Cause
Pods in bad state (not ready) | Node issue, resource exhaustion, deployment problem
OOM (Out of Memory) | Memory limits too low, memory leak
CrashLoopBackOff | Application crash on startup, config issue
Synthetic failures | Monitoring/health check failures
"Can't reach [X]" from specific pods | Check the SOURCE pod/node, not just target

7f. Decision Tree

1. Query errors, group by kube_node
   ↓
2. Errors concentrated on specific node(s)?
   │
   ├─ YES → Check Kubernetes Pods Overview for that node
   │        → Look for pods not ready, OOM, CrashLoopBackOff
   │        → Check if other services on same node affected
   │        → Likely action: Cordon node, roll pods
   │
   └─ NO (distributed) → Check downstream systems
                         → RDS dashboards if database errors
                         → Vendor status if external service errors
                         → Application logs for specific error content
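The decision tree above can be sketched as a routing function. The "concentrated" rule (hot nodes are a small fraction of the fleet) and all names are illustrative assumptions:

```python
def triage_route(hot_nodes: list, total_nodes: int,
                 db_errors: bool, vendor_errors: bool) -> str:
    """Route the investigation per the decision tree above.

    `hot_nodes` is the output of an error-concentration check; the
    "small fraction of the fleet" rule is an illustrative assumption.
    """
    if hot_nodes and len(hot_nodes) <= max(2, total_nodes // 4):
        nodes = ", ".join(hot_nodes)
        return f"check node health: {nodes} (consider cordon + roll pods)"
    if db_errors:
        return "check RDS dashboards"
    if vendor_errors:
        return "check vendor status page"
    return "inspect application logs for specific error content"

print(triage_route(["ip-10-10-27-170"], total_nodes=12,
                   db_errors=True, vendor_errors=False))
# check node health: ip-10-10-27-170 (consider cordon + roll pods)
```

Note the ordering: a concentrated-node signal wins even when the error text mentions the database, which is exactly the lesson of the store-menu example below.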

Example: Store-Menu Investigation (Jan 26, 2026)

What happened:

  • Alert: Router Subgraph Errors for store-menu in prod-curie
  • Error message: "Can't reach database server"
  • Initial assumption: RDS database issue

What we should have done:

SELECT kube_node, count(*) as errors
FROM logs
WHERE env:prod-curie AND service:platform-router AND @metadata.subgraph_name:store-menu
GROUP BY kube_node

What this would have shown:

  • Errors concentrated on node ip-10-10-27-170.ec2.internal
  • Other services on that node also failing

Actual root cause: Unhealthy Kubernetes node, not RDS
Resolution: Cordon node, roll pods to healthy nodes


Step 8: Volume & Rate Comparison

For volume-based monitors, compare across time periods:

Volume Comparison Table

Period | Time | Volume | Status
Pre-alert baseline | (before issue) | X/min | Normal
Alert window | (during issue) | Y/min | ⚠️ Low
Post-recovery | (after fix) | Z/min | ✅ Recovered
Current | now | W/min | Current state

Calculating Impact

Drop percentage = ((baseline - alert_window) / baseline) * 100

Example:
Baseline: 150/min
Alert window: 10/min
Drop: ((150 - 10) / 150) * 100 = 93% drop
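The same arithmetic as a one-line helper, reproducing the worked example:

```python
def drop_percentage(baseline: float, alert_window: float) -> float:
    """Percentage drop from baseline volume to alert-window volume."""
    return (baseline - alert_window) / baseline * 100

print(round(drop_percentage(150, 10)))  # 93
```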

SQL Query for Volume by Minute

SELECT DATE_TRUNC('minute', timestamp) as minute, count(*) as volume
FROM logs
GROUP BY DATE_TRUNC('minute', timestamp)
ORDER BY minute

This separates call takers from analysts - understanding the data, not just reading it.
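If the raw timestamps are in hand rather than queryable, the same per-minute bucketing can be sketched with the standard library:

```python
from collections import Counter
from datetime import datetime

def volume_by_minute(timestamps: list) -> dict:
    """Bucket log timestamps to the minute, like the DATE_TRUNC query above."""
    truncated = (ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(Counter(truncated).items()))

ts = [datetime(2026, 1, 24, 13, 9, s) for s in (5, 20, 40)] + \
     [datetime(2026, 1, 24, 13, 10, 1)]
print({k.strftime("%H:%M"): v for k, v in volume_by_minute(ts).items()})
# {'13:09': 3, '13:10': 1}
```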


Step 9: Business Analysis Output

When presenting findings, include:

Summary Table (by dimension)

  • By Organization: Which markets are affected and counts
  • By Error Reason: What types of errors (helps identify root cause)
  • By Client Platform: iOS vs Android vs Web (spot bad app releases)

Key Fields to Extract from Logs

Field | Location | Why It Matters
organization | Top level | Which market
client.id | client.id | Platform (e.g., kfc_us_ios)
reason | Top level or event.reason | Error type
metadata.error_cause | metadata.error_cause | user vs system - critical for triage
metadata.flow_type | metadata.flow_type | What operation was attempted
dd.version / image_tag | Tags | Service version (spot bad deployments)
user_agent.original | user_agent.original | App version for mobile issues

Error Cause Classification

  • user: User-caused errors (wrong password, expired OTP, etc.) - typically not actionable
  • system: System failures - requires investigation and likely action

Step 10: Customer & IP Analysis (Top Offenders)

Understanding WHO is affected helps identify patterns and determine if traffic is organic or suspicious.

Universal Key Attributes

These attributes are available across most services - use them for deep analysis:

Attribute | Purpose | Query Example
@customer.id | Registered customer | @customer.id:uuid-here
@user.id | User identifier | @user.id:uuid-here
@client.ip | Client IP address | @client.ip:4.204.72.*
@client.id | Platform/client | @client.id:ph_ca_ios
@event.action | Action attempted | @event.action:oauth-token-post
@organization | Market | @organization:ph_ca
@metadata.grant_type | IDP auth flow (IDP only) | @metadata.grant_type:refresh_token

Top Offenders Analysis

Find who's generating the volume/errors:

-- Top IPs
SELECT "@client.ip" as ip, count(*) as requests
FROM logs GROUP BY "@client.ip" ORDER BY requests DESC LIMIT 20

-- Top customers
SELECT "@customer.id" as customer, count(*) as requests
FROM logs GROUP BY "@customer.id" ORDER BY requests DESC LIMIT 20

-- By platform
SELECT "@client.id" as platform, count(*) as requests
FROM logs GROUP BY "@client.id" ORDER BY requests DESC

Distribution Assessment

Pattern | Indicates | Action
Few IPs, high volume each | Bot/attack or proxy | Investigate further
Many unique IPs, low volume each | Organic traffic | Likely legitimate
Single IP range (e.g., 4.204.72.x) | CDN/proxy infrastructure | Check success rate
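A rough classifier over per-IP request counts, as a sketch. The cutoffs (5 IPs, 50 requests) are illustrative assumptions, not product thresholds:

```python
def assess_distribution(ip_counts: dict,
                        few_ips: int = 5, high_volume: int = 50) -> str:
    """Rough organic-vs-suspicious call from per-IP request counts.

    Cutoffs are illustrative assumptions; tune them per service.
    """
    heavy = [ip for ip, n in ip_counts.items() if n >= high_volume]
    if heavy and len(ip_counts) <= few_ips:
        return "suspicious: few IPs with high volume each"
    if heavy:
        return "mixed: some heavy IPs among organic traffic"
    return "likely organic: many unique IPs, low volume each"

counts = {"4.204.72.10": 800, "4.204.72.11": 650}
print(assess_distribution(counts))  # suspicious: few IPs with high volume each
```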

Key Identifiers

Field | Purpose
@customer.id | Registered customer identifier
@user.id | User identifier (same as customer.id, varies by log)
@client.ip | Client IP address (may be proxied)
@transaction | Unique transaction ID for tracing full request flow

IP Analysis

Check for patterns:

  • Single IP with many failures → Possible bot/attack or frustrated user
  • 127.0.0.1 or internal IPs → Traffic proxied through brand infrastructure (e.g., KFC proxies all requests)
  • Distributed IPs with low count each → Normal user behavior

Bot/Attack Assessment

Indicator | Normal | Suspicious
Max failures per IP | 3-10 | 50+
User agents | Real app versions | Scripted/missing
IP distribution | Many unique IPs | Few IPs, high volume
Error pattern | Mixed reasons | Single reason repeated

Transaction Tracing

Use @transaction ID to trace full request flow:

  1. Find the failure log with transaction ID
  2. Query all logs with that transaction
  3. See the full flow: entry → start → processing → end (success/failure)
  4. Early logs show original client.ip before proxying
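The tracing steps above can be sketched as a filter-and-sort over log records. The field names assume the `@transaction` attribute shown earlier and a numeric `timestamp`; both are assumptions for illustration:

```python
def trace_transaction(logs: list, txn_id: str) -> list:
    """Pull every log line for one transaction, ordered by time, to
    reconstruct entry -> processing -> end for that request."""
    flow = [rec for rec in logs if rec.get("@transaction") == txn_id]
    return sorted(flow, key=lambda rec: rec["timestamp"])

logs = [
    {"timestamp": 3, "@transaction": "t1", "message": "end: failure"},
    {"timestamp": 1, "@transaction": "t1", "message": "entry",
     "client.ip": "4.204.72.9"},  # earliest log keeps the pre-proxy IP
    {"timestamp": 2, "@transaction": "t2", "message": "entry"},
]
flow = trace_transaction(logs, "t1")
print([rec["message"] for rec in flow])  # ['entry', 'end: failure']
```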

Step 11: Cross-Reference to Orders (The "So What" Test)

A spike or anomaly means nothing until you verify business impact. Did users complete orders?

User Journey Query

Search across services using OR to catch both user.id and customer.id:

(@user.id:uuid-here OR @customer.id:uuid-here)

What to Look For

Service | Log Message | Meaning
cart-workflowservice | Get cart activity succeeded | User accessed cart
go-cart-dgs | POST /graphql 200 | Cart API success
order-dgs | orders request | Order query
order-dgs | submit order successful | ✅ Order placed
workflowservice | Successfully published order created event | ✅ Order confirmed

Order Success Rate During Alert

SELECT
  CASE WHEN message LIKE '%successful%' THEN 'success'
       WHEN message LIKE '%failed%' THEN 'failed'
       ELSE 'other' END as outcome,
  count(*) as total
FROM logs
WHERE service = 'order-dgs' AND message LIKE '%submit order%'
GROUP BY outcome

Assessment Framework

Auth Success | Order Success | Assessment
High (>85%) | High (>90%) | Healthy - legitimate traffic
High (>85%) | Low (<70%) | Investigate order flow
Low (<70%) | N/A | Auth issue - check IDP
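The framework rows can be sketched as a mapping function. The "inconclusive" fallback for rates between the stated bands is an illustrative assumption:

```python
from typing import Optional

def assess(auth_success_pct: float,
           order_success_pct: Optional[float]) -> str:
    """Map auth/order success rates to the assessment table above.

    Rates between the table's bands fall through to an assumed
    'inconclusive' bucket.
    """
    if auth_success_pct < 70:
        return "auth issue - check IDP"
    if auth_success_pct > 85 and order_success_pct is not None:
        if order_success_pct > 90:
            return "healthy - legitimate traffic"
        if order_success_pct < 70:
            return "investigate order flow"
    return "inconclusive - gather more data"

print(assess(92, 95))   # healthy - legitimate traffic
print(assess(60, None)) # auth issue - check IDP
```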

Step 13: Update the Knowledge Base

This is critical. After every investigation, update your knowledge base so next time is faster.

What You Learned | Where to Update
New service investigation patterns | Create/update knowledge-base/services/{service}.yaml
New monitor investigated | Add to knowledge-base/common-monitors.md
New search queries, metadata fields | Add to service YAML file
New escalation paths discovered | Add to service YAML file

What to Document

For service YAML files (knowledge-base/services/{service}.yaml):

  • Key log messages and event actions
  • Important metadata fields and their meanings
  • Search queries that work
  • Investigation patterns (what to check, in what order)
  • Common root causes and escalation paths
  • Historical issues with ticket references

For common-monitors.md:

  • Monitor ID, name, type, thresholds
  • Quick investigation steps
  • Volume references (what's normal)
  • Common root causes
  • Escalation paths
  • Example investigations

The Learning Loop

Investigation 1 → Discover patterns → Update knowledge base
Investigation 2 → Check knowledge base + Discover more → Update
Investigation 3 → Rich context, faster investigation → Update
...
Investigation N → Expert-level context, rapid triage

The knowledge base compounds with every investigation.


Step 14: Create Jira Ticket (If Actionable)

If the investigation yields actionable recommendations, create a ticket to track the work.

When to Create a Ticket

Create a ticket for:

  • Config changes (thresholds, downtime)
  • Process improvements (runbook updates)
  • System bugs needing fix
  • Vendor coordination

Skip the ticket for:

  • User-caused errors
  • Self-recovered, no action
  • Known noise patterns

Allowed Projects

Project | Key | Use For
Byte Incident Management | BIM | Incident follow-ups
ReOps SRE | REOP | Monitor tuning, SRE ops

Workflow

  1. Identify actionable recommendation from investigation
  2. Ask Claude: "Create a ticket from this investigation"
  3. Choose project (BIM or REOP)
  4. Review draft → approve → ticket created
  5. Update knowledge base with ticket reference

Example

Investigation: TB Payment Captures (monitor 71163811)
Finding: Downtime ends at 7:30 AM ET but volume stays low until ~9 AM ET
Ticket: BIM-163 - Extend downtime to 8:45 AM ET

See atlassian/ATLASSIAN_GUARDRAILS.md for full ticket template and workflow.


Environment Reference

Datadog Env Tag | Stack Name | Region | Type
prod-curie | Prod-Curie | us-east-1 | Single tenant (PH_US)
production | Prod-Turing | us-east-1 | Multitenant (US)
prod-hopper | Prod-Hopper | eu-west-1 | Multitenant (EU)

See knowledge-base/stacks/ for full stack details. See knowledge-base/markets/ for market-to-stack mappings.


Common Mistakes to Avoid

  1. Only looking at "now" - Find the alert window and analyze that timeframe
  2. Jumping straight to logs - Check monitor status and query first
  3. Only looking at one time window - Check 5m, 15m, 30m trends
  4. Not understanding the query - Know what's being measured and thresholds
  5. Investigating all orgs when one triggered - Start focused, expand if needed
  6. Skipping the runbook - It exists for a reason
  7. Ignoring error_cause field - user errors are usually not actionable
  8. Not checking event history - Context matters
  9. Missing the trend - A recovering monitor needs different action than a worsening one
  10. Not comparing to baseline - What's "normal"? How bad was the drop?
  11. Not providing deep links - Make it easy for others to verify your findings
  12. Using wide time ranges for deep analysis - Keep search ranges tight and specific for better performance
  13. Not cross-referencing to orders - A spike means nothing until you verify business impact
  14. Missing timestamps in output - Always include date + time in ET format
  15. Not checking the knowledge base first - Check if you have prior learnings before investigating
  16. Not updating the knowledge base after - Each investigation should make the next one better
  17. Missing @ prefix on custom attributes - Datadog log queries require @ prefix for custom attributes (e.g., -@device.station_name:POS99 not -device.station_name:POS99)
  18. Not checking infrastructure correlation first - Before assuming database/vendor issue, group errors by kube_node to see if they're concentrated on specific infrastructure. "Can't reach X" might mean the SOURCE can't reach anything, not that the TARGET is down.

Update Log

Date | Change
2026-01-26 | Added Step 7: Infrastructure Correlation - group by kube_node before assuming root cause
2026-01-26 | Added infrastructure tags reference, K8s/RDS dashboard links, failure signatures
2026-01-26 | Added mistake 18: not checking infrastructure correlation first
2026-01-26 | Renumbered Steps 8-14 (previously 7-13)
2026-01-24 | Added Step 5: Check the Knowledge Base (your learned patterns)
2026-01-24 | Expanded Step 12: Update the Knowledge Base with detailed guidance
2026-01-24 | Added mistakes 15-16: knowledge base check and update
2026-01-24 | Added Step 9: Cross-Reference to Orders ("So What" test)
2026-01-24 | Added universal key attributes for deep analysis
2026-01-24 | Added top offenders analysis and distribution assessment
2026-01-24 | Added time window search strategy (1d → 2d → 5d → 10d → 20d → 30d)
2026-01-24 | Added timestamp format rule: always EST/ET with date
2026-01-24 | Added Step 12: Create Jira Ticket (If Actionable) with BIM/REOP workflow
2026-01-24 | Added alert window analysis, volume comparison, "call takers vs analysts" concept
2026-01-24 | Added monitor status/query analysis, trend windows, customer/IP analysis, deep links
2026-01-24 | Initial investigation workflow document
