
Commerce | Reading Module

Global Investigation Workflow

Status: Not Started | Pass threshold: 100% | Points: 90


Learning Guidance

Objectives

  • Run a consistent end-to-end workflow from trigger to closure.
  • Identify alert windows and baseline comparison intervals.
  • Use infrastructure-first checks to avoid false root-cause assumptions.

Evidence To Capture

  • Alert window with trigger and recovery timestamps.
  • Impact statement with market/store scope.
  • Escalation packet including owner and next action.

Source Artifacts

Internal source references are available for maintainers but are not exposed in deployed environments.


Module Content



Overview

This document outlines the standard investigation workflow for Datadog monitor alerts.


Step 1: Read the Event Details (Your Head Start)

When a monitor triggers, the Event Details provide immediate context:

Field | What It Tells You
Org | Which market is affected (e.g., kfc_us, tb_us, ph_uk)
Env | Which stack/environment (production, prod-hopper, prod-curie)
Service | Which service triggered the alert

This is your starting point - focus your investigation here first, then expand if needed.


Step 2: Check Monitor Status & Query

Before diving into logs, understand what the monitor is measuring:

Current Status

Status | Meaning
OK (Green) | Recovered - but still investigate if recently triggered
Warn (Yellow) | Crossed warning threshold - needs attention
Alert (Red) | Crossed critical threshold - urgent action needed

Understand the Query

Read the monitor query to understand:

  • What metric/logs it's measuring
  • The calculation (count, rate, percentage, etc.)
  • Thresholds for warn vs alert
  • Evaluation window (last 5m, 15m, etc.)

Example: (failures / total) * 100 > 40 means alert when failure rate exceeds 40%.
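The threshold logic behind such a query can be sketched in Python. The 40% alert threshold comes from the example above; the 25% warn threshold and the function name are illustrative assumptions:

```python
def alert_status(failures: int, total: int,
                 warn_pct: float = 25.0, alert_pct: float = 40.0) -> str:
    """Classify a failure rate against warn/alert thresholds.

    The 40% alert cutoff matches the example query; the 25% warn
    cutoff is an illustrative assumption.
    """
    if total == 0:
        return "no-data"  # nothing to measure in the evaluation window
    rate = (failures / total) * 100
    if rate > alert_pct:
        return "alert"
    if rate > warn_pct:
        return "warn"
    return "ok"

print(alert_status(45, 100))  # 45% failure rate exceeds the 40% threshold: alert
```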

Check Trend Across Windows

Always check multiple time windows to understand trajectory:

Window | Purpose
5 min | Current state - is it getting better or worse?
15 min | Recent trend - matches monitor eval window
30 min | Broader context - was it worse earlier?

Trend matters: A monitor at 41% but trending down from 50% is recovering. A monitor at 39% but trending up from 30% may soon trigger.
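Combining level and direction, as a small sketch (the 40% threshold follows the earlier example; the labels are illustrative):

```python
def classify_trend(earlier_pct: float, current_pct: float,
                   alert_threshold: float = 40.0) -> str:
    """Combine level and direction: a recovering monitor above threshold
    needs different handling than a worsening one below it."""
    direction = "improving" if current_pct < earlier_pct else "worsening"
    level = "above" if current_pct > alert_threshold else "below"
    return f"{level} threshold, {direction}"

print(classify_trend(50, 41))  # above threshold, improving (recovering)
print(classify_trend(30, 39))  # below threshold, worsening (may soon trigger)
```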


Step 3: Find the Alert Window

Critical concept: The event timestamp tells you WHEN the alert triggered. Your investigation focuses on the alert window - from trigger to recovery.

Time Window Search Strategy

When searching for alert events, start tight and expand if needed:

Attempt | Window | When to Use
1 | Last 1 day | Default - alerts are usually recent
2 | Last 2 days | If nothing found
3 | Last 5 days | Expand further
4 | Last 10 days | Older incidents
5 | Last 20-30 days | Historical analysis

Important: Keep search ranges specific and tight during deep analysis to improve performance and relevance.
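The expanding-window strategy can be sketched as a loop. The window list approximates the table above (20-30 days collapsed to 30), and `search_fn` is a placeholder for whatever client call actually runs the event search:

```python
# Expanding lookback windows, in days, mirroring the table above.
SEARCH_WINDOWS_DAYS = [1, 2, 5, 10, 30]

def find_alert_events(search_fn):
    """Try progressively wider lookback windows until events are found.

    `search_fn` is an assumed callable: it takes a lookback in days
    and returns a list of matching events (empty if none).
    """
    for days in SEARCH_WINDOWS_DAYS:
        events = search_fn(days)
        if events:
            return days, events  # stop at the tightest window that matches
    return None, []

# Toy search: events only appear once the lookback reaches 5 days.
days, events = find_alert_events(
    lambda d: ["Triggered TB Captures"] if d >= 5 else [])
print(days)  # 5
```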

Timestamp Format

Always use EST/ET with date for all timestamps in outputs:

  • Correct: Jan 24, 2026 3:47 PM ET
  • Avoid: 20:47:04 UTC (convert to ET)
  • Avoid: 3:47 PM (missing date)
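A minimal conversion helper, assuming the host has IANA time zone data available to the standard-library `zoneinfo` module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib since Python 3.9; needs IANA tz data

ET = ZoneInfo("America/New_York")

def format_et(ts: datetime) -> str:
    """Render a timezone-aware timestamp as 'Mon D, YYYY H:MM AM/PM ET'."""
    local = ts.astimezone(ET)
    hour12 = local.hour % 12 or 12  # 12-hour clock without a leading zero
    return f"{local:%b} {local.day}, {local:%Y} {hour12}:{local:%M} {local:%p} ET"

print(format_et(datetime(2026, 1, 24, 20, 47, 4, tzinfo=timezone.utc)))
# Jan 24, 2026 3:47 PM ET
```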

Don't Just Look at "Now"

Investigations may come hours, days, or weeks after the alert. You need to analyze the historical snapshot, not current state.

Time Period | What to Analyze
Alert window | trigger_time → recovery_time (the incident)
Before alert | Baseline - what does "normal" look like?
After recovery | Did we return to normal?
Now | Current state - is it still healthy?

Finding the Window

  1. Check the event timeline for the monitor
  2. Find when it triggered (Warn/Alert)
  3. Find when it recovered (OK)
  4. That window is your investigation focus

IMPORTANT: Searching for Alert Events

When the monitor is currently OK (green), you MUST search for the actual alert events.

Many monitors have scheduled downtimes (overnight closures). Searching monitor_id:XXXXX often only returns downtime events, NOT the actual alerts.

What works:

# Search for the alert text directly
Triggered [monitor name keywords]

# Examples:
Triggered TB Captures
Triggered OTP Failure
Triggered Payment Decline

This finds events with titles like [Triggered on {...}] and [Recovered on {...}].

Why this matters:

  • Most monitors have overnight downtimes (business closed)
  • monitor_id:X query returns downtime start/end events
  • Actual alert events require searching by alert title text
  • Always look back to find the most recent Triggered/Recovered pair
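Finding the most recent Triggered/Recovered pair can be sketched as a scan over event titles, newest first. The event shape (timestamp, title) is an assumption for illustration:

```python
def latest_alert_window(events):
    """Find the most recent alert window from events listed newest-first.

    `events` is a list of (timestamp, title) pairs. Returns
    (trigger_time, recovery_time); recovery_time is None if the
    alert is still ongoing.
    """
    recovered_at = None
    for ts, title in events:
        if title.startswith("[Recovered") and recovered_at is None:
            recovered_at = ts
        elif title.startswith("[Triggered"):
            return ts, recovered_at
    return None, None

events = [
    ("14:30", "[Recovered on {org:tb_us}] TB Captures"),
    ("13:09", "[Triggered on {org:tb_us}] TB Captures"),
    ("09:00", "[Recovered on {org:tb_us}] TB Captures"),
]
print(latest_alert_window(events))  # ('13:09', '14:30')
```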

Example Timeline Analysis

08:00 UTC - Baseline: 150 events/min (normal)
09:00 UTC - Scheduled downtime starts
12:30 UTC - Downtime ends
12:50 UTC - Volume at 10/min (still low)
13:09 UTC - ALERT TRIGGERED (below threshold)
14:00 UTC - Volume at 40/min (recovering)
14:30 UTC - OK (recovered)

Investigation window: 12:30-14:30 UTC
Compare to baseline: 08:00-09:00 UTC


Step 4: Read the Runbook (Human Source)

Before investigating independently:

  1. Check if the monitor has an attached runbook (in the notification message)
  2. Use the Atlassian MCP to fetch the Confluence page content
  3. Extract escalation paths, contacts, and actions
  4. Note any steps that are unclear or outdated

The runbook represents accumulated team knowledge for humans - escalation paths, who to contact, what actions to take.


Step 5: Check the Knowledge Base (Your Learned Patterns)

Before diving into logs, check if you have prior learnings for this monitor/service.

File | What to Check
knowledge-base/common-monitors.md | Have you investigated this monitor before?
knowledge-base/services/{service}.yaml | Do you have investigation patterns for this service?

If you have prior learnings:

  • Use the search queries you documented
  • Apply the metadata field knowledge
  • Follow the investigation patterns that worked before

If this is a new monitor/service:

  • You'll discover patterns during investigation
  • Update the knowledge base after so next time is faster

The knowledge base is YOUR brain - it compounds with every investigation.


Step 6: Investigate with Context

Start with the trigger context (Org, Env, Service) but stay open to:

  • Other organizations affected (multitenant stacks share infrastructure)
  • Upstream/downstream service issues
  • External vendor problems

Key Questions to Answer

  1. Scope: Is this isolated to one org/market or widespread?
  2. Cause: User-caused errors vs system failures?
  3. Timing: When did it start? Correlate with deployments/changes?
  4. Impact: What's the business impact? Order flow affected?

Step 7: Infrastructure Correlation (Before Assuming Root Cause)

Critical concept: When you see errors like "can't reach database" or "connection timeout", don't assume the target is down. The SOURCE (pod/node) might be the problem.

7a. Group Errors by Infrastructure Dimension

Before deep-diving into error content, check WHERE errors are coming from:

-- Group by Kubernetes node
SELECT kube_node, count(*) as errors
FROM logs
GROUP BY kube_node
ORDER BY errors DESC

-- Group by pod
SELECT pod_name, count(*) as errors
FROM logs
GROUP BY pod_name
ORDER BY errors DESC

Pattern | Likely Cause | Next Step
Errors concentrated on 1-2 nodes | Node issue (unhealthy, network, resources) | Check node health
Errors concentrated on specific pods | Pod issue (OOM, crash loop, bad deployment) | Check pod health
Errors distributed across all nodes/pods | Application or downstream issue (database, vendor) | Check downstream health
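Detecting concentration can be sketched from per-node error counts. The 80% share cutoff is an illustrative assumption, not a product threshold:

```python
from collections import Counter

def error_concentration(node_counts: dict, share: float = 0.8) -> list:
    """Return the smallest set of nodes that together carry `share`
    of all errors. A short list means errors are concentrated; a list
    covering most nodes means they are distributed."""
    total = sum(node_counts.values())
    running, hot_nodes = 0, []
    for node, count in Counter(node_counts).most_common():
        hot_nodes.append(node)
        running += count
        if running / total >= share:
            break
    return hot_nodes

# One node carries nearly all errors -> investigate that node first.
counts = {"ip-10-10-27-170": 940, "ip-10-10-26-138": 40, "ip-10-10-25-12": 20}
print(error_concentration(counts))  # ['ip-10-10-27-170']
```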

7b. Infrastructure Tags Reference

These tags are available on application logs - use them for correlation:

Tag | Example | Use For
kube_node | ip-10-10-26-138.ec2.internal | Node-level correlation
pod_name | platform-router-storefront-5d74b797f4-65mpq | Pod-level correlation
kube_namespace | graph-core-prod-curie-use1 | Namespace scoping
kube_cluster_name | prod-curie | Cluster-level correlation
availability-zone | us-east-1a | AZ-level issues
eks_nodegroup-name | yce-curie-prod-eksstack-e45e-green-node-group | Node group issues
instance-type | c7i.8xlarge | Instance-type specific issues
container_name | platform-router-storefront | Container identification

7c. Kubernetes Health Check

Metric for unhealthy pods:

kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded}

Scope by namespace: kube_namespace:storemenu-prod-curie-use1

Dashboard: Kubernetes Pods Overview

  • Filter by namespace to see pod states
  • Look for: pods not ready, restarts, OOM, CrashLoopBackOff

7d. Database/RDS Health Check

If errors indicate database connectivity issues AND errors are distributed (not node-concentrated):

Dashboards:

What to check:

  • Connection count (maxed out?)
  • CPU/memory utilization
  • Read/write latency spikes
  • Recent failover events

7e. Known Failure Signatures

Signature | Likely Cause
Pods in bad state (not ready) | Node issue, resource exhaustion, deployment problem
OOM (Out of Memory) | Memory limits too low, memory leak
CrashLoopBackOff | Application crash on startup, config issue
Synthetic failures | Monitoring/health check failures
"Can't reach [X]" from specific pods | Check the SOURCE pod/node, not just target

7f. Decision Tree

1. Query errors, group by kube_node
   ↓
2. Errors concentrated on specific node(s)?
   │
   ├─ YES → Check Kubernetes Pods Overview for that node
   │        → Look for pods not ready, OOM, CrashLoopBackOff
   │        → Check if other services on same node affected
   │        → Likely action: Cordon node, roll pods
   │
   └─ NO (distributed) → Check downstream systems
                         → RDS dashboards if database errors
                         → Vendor status if external service errors
                         → Application logs for specific error content
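The decision tree above can be sketched as a routing function. The "concentrated" rule (hot nodes are a small fraction of the fleet) and all names are illustrative assumptions:

```python
def triage_route(hot_nodes: list, total_nodes: int,
                 db_errors: bool, vendor_errors: bool) -> str:
    """Route the investigation per the decision tree above.

    `hot_nodes` is the output of an error-concentration check; the
    "small fraction of the fleet" rule is an illustrative assumption.
    """
    if hot_nodes and len(hot_nodes) <= max(2, total_nodes // 4):
        nodes = ", ".join(hot_nodes)
        return f"check node health: {nodes} (consider cordon + roll pods)"
    if db_errors:
        return "check RDS dashboards"
    if vendor_errors:
        return "check vendor status page"
    return "inspect application logs for specific error content"

print(triage_route(["ip-10-10-27-170"], total_nodes=12,
                   db_errors=True, vendor_errors=False))
# check node health: ip-10-10-27-170 (consider cordon + roll pods)
```

Note the ordering: a concentrated-node signal wins even when the error text mentions the database, which is exactly the lesson of the store-menu example below.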

Example: Store-Menu Investigation (Jan 26, 2026)

What happened:

  • Alert: Router Subgraph Errors for store-menu in prod-curie
  • Error message: "Can't reach database server"
  • Initial assumption: RDS database issue

What we should have done:

SELECT kube_node, count(*) as errors
FROM logs
WHERE env:prod-curie AND service:platform-router AND @metadata.subgraph_name:store-menu
GROUP BY kube_node

What this would have shown:

  • Errors concentrated on node ip-10-10-27-170.ec2.internal
  • Other services on that node also failing

Actual root cause: Unhealthy Kubernetes node, not RDS
Resolution: Cordon node, roll pods to healthy nodes


Step 8: Volume & Rate Comparison

For volume-based monitors, compare across time periods:

Volume Comparison Table

Period | Time | Volume | Status
Pre-alert baseline | (before issue) | X/min | Normal
Alert window | (during issue) | Y/min | ⚠️ Low
Post-recovery | (after fix) | Z/min | ✅ Recovered
Current | now | W/min | Current state

Calculating Impact

Drop percentage = ((baseline - alert_window) / baseline) * 100

Example:
Baseline: 150/min
Alert window: 10/min
Drop: ((150 - 10) / 150) * 100 = 93% drop
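The same arithmetic as a one-line helper, reproducing the worked example:

```python
def drop_percentage(baseline: float, alert_window: float) -> float:
    """Percentage drop from baseline volume to alert-window volume."""
    return (baseline - alert_window) / baseline * 100

print(round(drop_percentage(150, 10)))  # 93
```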

SQL Query for Volume by Minute

SELECT DATE_TRUNC('minute', timestamp) as minute, count(*) as volume
FROM logs
GROUP BY DATE_TRUNC('minute', timestamp)
ORDER BY minute

This separates call takers from analysts - understanding the data, not just reading it.
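If the raw timestamps are in hand rather than queryable, the same per-minute bucketing can be sketched with the standard library:

```python
from collections import Counter
from datetime import datetime

def volume_by_minute(timestamps: list) -> dict:
    """Bucket log timestamps to the minute, like the DATE_TRUNC query above."""
    truncated = (ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(Counter(truncated).items()))

ts = [datetime(2026, 1, 24, 13, 9, s) for s in (5, 20, 40)] + \
     [datetime(2026, 1, 24, 13, 10, 1)]
print({k.strftime("%H:%M"): v for k, v in volume_by_minute(ts).items()})
# {'13:09': 3, '13:10': 1}
```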


Step 9: Business Analysis Output

When presenting findings, include:

Summary Table (by dimension)

  • By Organization: Which markets are affected and counts
  • By Error Reason: What types of errors (helps identify root cause)
  • By Client Platform: iOS vs Android vs Web (spot bad app releases)

Key Fields to Extract from Logs

Field | Location | Why It Matters
organization | Top level | Which market
client.id | client.id | Platform (e.g., kfc_us_ios)
reason | Top level or event.reason | Error type
metadata.error_cause | metadata.error_cause | user vs system - critical for triage
metadata.flow_type | metadata.flow_type | What operation was attempted
dd.version / image_tag | Tags | Service version (spot bad deployments)
user_agent.original | user_agent.original | App version for mobile issues

Error Cause Classification

  • user: User-caused errors (wrong password, expired OTP, etc.) - typically not actionable
  • system: System failures - requires investigation and likely action

Step 10: Customer & IP Analysis (Top Offenders)

Understanding WHO is affected helps identify patterns and determine if traffic is organic or suspicious.

Universal Key Attributes

These attributes are available across most services - use them for deep analysis:

Attribute | Purpose | Query Example
@customer.id | Registered customer | @customer.id:uuid-here
@user.id | User identifier | @user.id:uuid-here
@client.ip | Client IP address | @client.ip:4.204.72.*
@client.id | Platform/client | @client.id:ph_ca_ios
@event.action | Action attempted | @event.action:oauth-token-post
@organization | Market | @organization:ph_ca
@metadata.grant_type | IDP auth flow (IDP only) | @metadata.grant_type:refresh_token

Top Offenders Analysis

Find who's generating the volume/errors:

-- Top IPs
SELECT "@client.ip" as ip, count(*) as requests
FROM logs GROUP BY "@client.ip" ORDER BY requests DESC LIMIT 20

-- Top customers
SELECT "@customer.id" as customer, count(*) as requests
FROM logs GROUP BY "@customer.id" ORDER BY requests DESC LIMIT 20

-- By platform
SELECT "@client.id" as platform, count(*) as requests
FROM logs GROUP BY "@client.id" ORDER BY requests DESC

Distribution Assessment

Pattern | Indicates | Action
Few IPs, high volume each | Bot/attack or proxy | Investigate further
Many unique IPs, low volume each | Organic traffic | Likely legitimate
Single IP range (e.g., 4.204.72.x) | CDN/proxy infrastructure | Check success rate
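A rough classifier over per-IP request counts, as a sketch. The cutoffs (5 IPs, 50 requests) are illustrative assumptions, not product thresholds:

```python
def assess_distribution(ip_counts: dict,
                        few_ips: int = 5, high_volume: int = 50) -> str:
    """Rough organic-vs-suspicious call from per-IP request counts.

    Cutoffs are illustrative assumptions; tune them per service.
    """
    heavy = [ip for ip, n in ip_counts.items() if n >= high_volume]
    if heavy and len(ip_counts) <= few_ips:
        return "suspicious: few IPs with high volume each"
    if heavy:
        return "mixed: some heavy IPs among organic traffic"
    return "likely organic: many unique IPs, low volume each"

counts = {"4.204.72.10": 800, "4.204.72.11": 650}
print(assess_distribution(counts))  # suspicious: few IPs with high volume each
```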

Key Identifiers

Field | Purpose
@customer.id | Registered customer identifier
@user.id | User identifier (same as customer.id, varies by log)
@client.ip | Client IP address (may be proxied)
@transaction | Unique transaction ID for tracing full request flow

IP Analysis

Check for patterns:

  • Single IP with many failures → Possible bot/attack or frustrated user
  • 127.0.0.1 or internal IPs → Traffic proxied through brand infrastructure (e.g., KFC proxies all requests)
  • Distributed IPs with low count each → Normal user behavior

Bot/Attack Assessment

Indicator | Normal | Suspicious
Max failures per IP | 3-10 | 50+
User agents | Real app versions | Scripted/missing
IP distribution | Many unique IPs | Few IPs, high volume
Error pattern | Mixed reasons | Single reason repeated

Transaction Tracing

Use @transaction ID to trace full request flow:

  1. Find the failure log with transaction ID
  2. Query all logs with that transaction
  3. See the full flow: entry → start → processing → end (success/failure)
  4. Early logs show original client.ip before proxying
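The tracing steps above can be sketched as a filter-and-sort over log records. The field names assume the `@transaction` attribute shown earlier and a numeric `timestamp`; both are assumptions for illustration:

```python
def trace_transaction(logs: list, txn_id: str) -> list:
    """Pull every log line for one transaction, ordered by time, to
    reconstruct entry -> processing -> end for that request."""
    flow = [rec for rec in logs if rec.get("@transaction") == txn_id]
    return sorted(flow, key=lambda rec: rec["timestamp"])

logs = [
    {"timestamp": 3, "@transaction": "t1", "message": "end: failure"},
    {"timestamp": 1, "@transaction": "t1", "message": "entry",
     "client.ip": "4.204.72.9"},  # earliest log keeps the pre-proxy IP
    {"timestamp": 2, "@transaction": "t2", "message": "entry"},
]
flow = trace_transaction(logs, "t1")
print([rec["message"] for rec in flow])  # ['entry', 'end: failure']
```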

Step 11: Cross-Reference to Orders (The "So What" Test)

A spike or anomaly means nothing until you verify business impact. Did users complete orders?

User Journey Query

Search across services using OR to catch both user.id and customer.id:

(@user.id:uuid-here OR @customer.id:uuid-here)

What to Look For

Service | Log Message | Meaning
cart-workflowservice | Get cart activity succeeded | User accessed cart
go-cart-dgs | POST /graphql 200 | Cart API success
order-dgs | orders request | Order query
order-dgs | submit order successful | ✅ Order placed
workflowservice | Successfully published order created event | ✅ Order confirmed

Order Success Rate During Alert

SELECT
  CASE WHEN message LIKE '%successful%' THEN 'success'
       WHEN message LIKE '%failed%' THEN 'failed'
       ELSE 'other' END as outcome,
  count(*) as total
FROM logs
WHERE service = 'order-dgs' AND message LIKE '%submit order%'
GROUP BY outcome

Assessment Framework

Auth Success | Order Success | Assessment
High (>85%) | High (>90%) | Healthy - legitimate traffic
High (>85%) | Low (<70%) | Investigate order flow
Low (<70%) | N/A | Auth issue - check IDP
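The framework rows can be sketched as a mapping function. The "inconclusive" fallback for rates between the stated bands is an illustrative assumption:

```python
from typing import Optional

def assess(auth_success_pct: float,
           order_success_pct: Optional[float]) -> str:
    """Map auth/order success rates to the assessment table above.

    Rates between the table's bands fall through to an assumed
    'inconclusive' bucket.
    """
    if auth_success_pct < 70:
        return "auth issue - check IDP"
    if auth_success_pct > 85 and order_success_pct is not None:
        if order_success_pct > 90:
            return "healthy - legitimate traffic"
        if order_success_pct < 70:
            return "investigate order flow"
    return "inconclusive - gather more data"

print(assess(92, 95))   # healthy - legitimate traffic
print(assess(60, None)) # auth issue - check IDP
```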

Step 13: Update the Knowledge Base

This is critical. After every investigation, update your knowledge base so next time is faster.

What You Learned | Where to Update
New service investigation patterns | Create/update knowledge-base/services/{service}.yaml
New monitor investigated | Add to knowledge-base/common-monitors.md
New search queries, metadata fields | Add to service YAML file
New escalation paths discovered | Add to service YAML file

What to Document

For service YAML files (knowledge-base/services/{service}.yaml):

  • Key log messages and event actions
  • Important metadata fields and their meanings
  • Search queries that work
  • Investigation patterns (what to check, in what order)
  • Common root causes and escalation paths
  • Historical issues with ticket references

For common-monitors.md:

  • Monitor ID, name, type, thresholds
  • Quick investigation steps
  • Volume references (what's normal)
  • Common root causes
  • Escalation paths
  • Example investigations

The Learning Loop

Investigation 1 → Discover patterns → Update knowledge base
Investigation 2 → Check knowledge base + Discover more → Update
Investigation 3 → Rich context, faster investigation → Update
...
Investigation N → Expert-level context, rapid triage

The knowledge base compounds with every investigation.


Step 14: Create Jira Ticket (If Actionable)

If the investigation yields actionable recommendations, create a ticket to track the work.

When to Create a Ticket

Create a ticket for:

  • Config changes (thresholds, downtime)
  • Process improvements (runbook updates)
  • System bugs needing fix
  • Vendor coordination

Skip the ticket for:

  • User-caused errors
  • Self-recovered, no action
  • Known noise patterns

Allowed Projects

Project | Key | Use For
Byte Incident Management | BIM | Incident follow-ups
ReOps SRE | REOP | Monitor tuning, SRE ops

Workflow

  1. Identify actionable recommendation from investigation
  2. Ask Claude: "Create a ticket from this investigation"
  3. Choose project (BIM or REOP)
  4. Review draft → approve → ticket created
  5. Update knowledge base with ticket reference

Example

Investigation: TB Payment Captures (monitor 71163811)
Finding: Downtime ends at 7:30 AM ET but volume stays low until ~9 AM ET
Ticket: BIM-163 - Extend downtime to 8:45 AM ET

See atlassian/ATLASSIAN_GUARDRAILS.md for full ticket template and workflow.


Environment Reference

Datadog Env Tag | Stack Name | Region | Type
prod-curie | Prod-Curie | us-east-1 | Single tenant (PH_US)
production | Prod-Turing | us-east-1 | Multitenant (US)
prod-hopper | Prod-Hopper | eu-west-1 | Multitenant (EU)

See knowledge-base/stacks/ for full stack details. See knowledge-base/markets/ for market-to-stack mappings.


Common Mistakes to Avoid

  1. Only looking at "now" - Find the alert window and analyze that timeframe
  2. Jumping straight to logs - Check monitor status and query first
  3. Only looking at one time window - Check 5m, 15m, 30m trends
  4. Not understanding the query - Know what's being measured and thresholds
  5. Investigating all orgs when one triggered - Start focused, expand if needed
  6. Skipping the runbook - It exists for a reason
  7. Ignoring error_cause field - user errors are usually not actionable
  8. Not checking event history - Context matters
  9. Missing the trend - A recovering monitor needs different action than a worsening one
  10. Not comparing to baseline - What's "normal"? How bad was the drop?
  11. Not providing deep links - Make it easy for others to verify your findings
  12. Using wide time ranges for deep analysis - Keep search ranges tight and specific for better performance
  13. Not cross-referencing to orders - A spike means nothing until you verify business impact
  14. Missing timestamps in output - Always include date + time in ET format
  15. Not checking the knowledge base first - Check if you have prior learnings before investigating
  16. Not updating the knowledge base after - Each investigation should make the next one better
  17. Missing @ prefix on custom attributes - Datadog log queries require @ prefix for custom attributes (e.g., -@device.station_name:POS99 not -device.station_name:POS99)
  18. Not checking infrastructure correlation first - Before assuming database/vendor issue, group errors by kube_node to see if they're concentrated on specific infrastructure. "Can't reach X" might mean the SOURCE can't reach anything, not that the TARGET is down.

Update Log

Date | Change
2026-01-26 | Added Step 7: Infrastructure Correlation - group by kube_node before assuming root cause
2026-01-26 | Added infrastructure tags reference, K8s/RDS dashboard links, failure signatures
2026-01-26 | Added mistake 18: not checking infrastructure correlation first
2026-01-26 | Renumbered Steps 8-14 (previously 7-13)
2026-01-24 | Added Step 5: Check the Knowledge Base (your learned patterns)
2026-01-24 | Expanded Step 12: Update the Knowledge Base with detailed guidance
2026-01-24 | Added mistakes 15-16: knowledge base check and update
2026-01-24 | Added Step 9: Cross-Reference to Orders ("So What" test)
2026-01-24 | Added universal key attributes for deep analysis
2026-01-24 | Added top offenders analysis and distribution assessment
2026-01-24 | Added time window search strategy (1d → 2d → 5d → 10d → 20d → 30d)
2026-01-24 | Added timestamp format rule: always EST/ET with date
2026-01-24 | Added Step 12: Create Jira Ticket (If Actionable) with BIM/REOP workflow
2026-01-24 | Added alert window analysis, volume comparison, "call takers vs analysts" concept
2026-01-24 | Added monitor status/query analysis, trend windows, customer/IP analysis, deep links
2026-01-24 | Initial investigation workflow document
