Coach | Reading Module

Datadog Review and Monitor Hygiene


Learning Guidance

Objectives

  • Decide when an alert should enter review backlog.
  • Draft runbook improvements with escalation-safe formatting.
  • Route action items to BIM workflow with ownership clarity.

Evidence To Capture

  • Monitor issue description and proposed fix.
  • Confluence tracking row with owner and status.
  • Jira linkage decision (BIM or incident-only).

Source Artifacts

Internal source references are available for maintainers but are not exposed in deployed environments.


Module Content


Key Takeaways

  • After an incident or noisy alert, decide whether the monitor itself needs work: thresholds, runbook, query, routing, or retirement.
  • Track every review item on the current month's Confluence page and review it at the weekly Tuesday meeting.
  • A BIM ticket is optional; tracking in incident.io alone is fine.
  • Never update a monitor directly: draft the runbook, print it for review, and create a BIM ticket for the user to apply.

Overview

This document describes the Datadog Review process - a continuous improvement workflow for monitor hygiene and observability quality.


Purpose

Catch P3s before they become P1s. Reduce alert fatigue. Keep monitoring right and tight.

After incidents or alerts, monitors often need improvement:

  • Trigger happy (thresholds too sensitive)
  • Missing or outdated runbooks
  • Query needs adjustment
  • Monitor should be retired
  • New monitoring needed that was missed

The Datadog Review process tracks this work across all teams.


Weekly Meeting

When: Every Tuesday at 2:00 PM ET
Purpose: Review weekly updates, additions, and closures


Confluence Tracking

All Datadog Review items are tracked in Confluence:

Parent Page: Datadog Monitor Contacts & Cleanup

Structure:

Datadog Monitor Contacts & Cleanup
├── Monitor Contacts & Cleanup 2026
│   ├── Q1 2026
│   │   ├── Jan 2026 - Alert Cleanup and Action Items
│   │   ├── Feb 2026 - Alert Cleanup and Action Items
│   │   └── Mar 2026 - Alert Cleanup and Action Items
│   ├── Q2 2026
│   └── ...

Monthly Page Columns:

| Column | Description |
|--------|-------------|
| Issue Description | What monitor is problematic and why |
| Slack Links | Where it was discussed |
| Proposed Solutions/Improvements | Threshold changes, runbook updates, etc. |
| DD Links | Datadog monitor and log links |
| Ownership | Who's responsible for the fix |
| BIM Ticket | Jira ticket (optional - see below) |
| Date Discussed | When it was reviewed |
| Escalation to Dev | If dev team involvement needed |
| Status | TO-DO, DONE, Monitoring |

Workflow

After Investigation or Incident

Alert/Incident fires
    ↓
Investigate & resolve
    ↓
Monitor needs improvement?
    ↓
YES → "Add to Datadog Review"
    ↓
Add to current month's Confluence page
    ↓
Create BIM ticket? (OPTIONAL)
    ├── Yes → Create BIM ticket, link to Confluence
    └── No → Track in incident.io instead
    ↓
Weekly Tuesday review
    ↓
Work completed → Mark DONE
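
The routing decision in the middle of that flow can be sketched as a small helper. The function name and return shape are illustrative only; this is not an existing tool:

```python
# Sketch of the post-incident routing decision (illustrative names only).

def route_monitor_followup(needs_improvement: bool, create_bim_ticket: bool) -> list[str]:
    """Return the follow-up steps for a monitor after an incident."""
    if not needs_improvement:
        return []  # nothing to do; no review item is opened
    steps = ["Add to current month's Confluence page"]
    if create_bim_ticket:
        steps.append("Create BIM ticket and link it to the Confluence row")
    else:
        steps.append("Track in incident.io instead")
    steps.append("Review at weekly Tuesday meeting")
    return steps
```

Either branch ends at the weekly Tuesday review; the only fork is where the work item is tracked.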

When to Add a Monitor to Review

| Reason | Example |
|--------|---------|
| Trigger happy | Monitor fires 10x/day but no action needed |
| Threshold needs adjustment | Traffic increased, old thresholds too sensitive |
| Runbook missing/outdated | Responder didn't know what to do |
| Query incorrect | Monitor measuring wrong thing |
| Should be retired | Service deprecated, monitor no longer relevant |
| New monitoring needed | Incident revealed gap in coverage |
| Wrong channel/routing | Alert going to wrong team |
| Priority incorrect | P2 should be P3, etc. |
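
The "trigger happy" row above is the most mechanical to spot. A rough check, with illustrative thresholds that are not policy:

```python
# Rough "trigger happy" check: the monitor fires often but the fires almost
# never lead to action. The 10/day cutoff mirrors the example in the table
# above and is illustrative, not a policy value.

def looks_trigger_happy(fires_per_day: float, actioned_fires_per_day: float) -> bool:
    """Flag a monitor whose alerts are frequent but rarely actionable."""
    return fires_per_day >= 10 and actioned_fires_per_day == 0
```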

Runbook Update Process

When a monitor's runbook is missing or outdated (per the table above), follow this workflow to improve it.

Workflow

User identifies monitor with poor runbook
    ↓
Read current monitor configuration
    ↓
Identify specific runbook problems
    ↓
Draft improved runbook (PRINT for review, never update directly)
    ↓
User reviews and approves
    ↓
Create BIM ticket with recommendations
    ↓
User manually updates monitor in Datadog

Step-by-Step Process

  1. Read the Monitor
     • Get monitor ID or URL from user
     • Use Datadog MCP to fetch monitor details
     • Review current runbook message
  2. Identify Problems
     • Missing required sections?
     • Unclear action steps?
     • Notification tags in wrong spots?
     • No structured troubleshooting steps?
     • Missing Datadog deep links?
  3. Draft Improved Runbook
     • Follow standard format (see below)
     • Print for user review (NEVER update directly)
     • Iterate based on feedback
  4. Create BIM Ticket
     • Project: BIM (Byte Incident Management)
     • Include full runbook text in description
     • List specific problems being fixed
     • User will manually update in Datadog
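
The "Identify Problems" step can be partly automated as a lint over the monitor's message. The section and variable names below come from the standard runbook format in this document; the function itself is a hypothetical sketch, not an existing tool:

```python
# Minimal runbook lint. Section names match this document's standard runbook
# format; the helper and its return shape are illustrative.

REQUIRED_SECTIONS = [
    "### Top level variables",
    "### Key information",
    "### Troubleshooting steps",
]
REQUIRED_VARIABLES = ["Org:", "Env:", "Service:"]

def lint_runbook(message: str) -> list[str]:
    """Return a list of problems found in a monitor's runbook message."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in message:
            problems.append(f"Missing section: {section}")
    for var in REQUIRED_VARIABLES:
        if var not in message:
            problems.append(f"Missing top level variable: {var}")
    if "app.datadoghq.com" not in message:
        problems.append("No Datadog deep links found")
    return problems
```

An empty result means the draft at least has the required skeleton; it says nothing about whether the troubleshooting steps are actually clear, which still needs human review.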

Standard Runbook Format

All monitor runbooks should follow this structure:

## [Monitor Name] - [Brief Description]

### Top level variables:
- Org: {{log.attributes.organization}}
- Env: {{log.tags.env}}
- Service: {{log.tags.service}}
- [Optional: Index, DB, Pod, etc. depending on monitor type]

---

### Key information:

**What This Monitor Watches**:
[Description of what's being monitored]

**Thresholds**:
- **Warning**: [Threshold value and meaning]
- **Critical**: [Threshold value and meaning]
- **Auto-Reset**: [If applicable]

**What Happens at Each Level**:
[Explain the impact at different threshold levels]

**Common Scenarios That Trigger This**:
1. [Scenario 1]
2. [Scenario 2]
3. [Scenario 3]

---

### Troubleshooting steps:

#### STEP 1: [Action Name] ([Estimated Time])
[Why this step first]

1. **[Sub-step]**:
   - [Datadog deep link]
   - [What to look for]

2. **[Sub-step]**:
   - [Action to take]

---

#### STEP 2: [Next Action]

[Continue with numbered steps and deep links]

---

### Quick Reference Links (optional)

| Resource | Link |
|----------|------|
| [Dashboard Name] | [Link] |
| [Config Page] | [Link] |

---

### Escalation

[Notification tags in proper scopes - preserve exactly from original monitor]

Critical Checklist

Before creating BIM ticket, verify:

  • [ ] Top level variables section present with bullet points (dashes)
  • [ ] Org, Env, Service are ALWAYS included (required)
  • [ ] Optional variables (Index, DB, Pod, etc.) added if relevant
  • [ ] Key information section explains what monitor watches, thresholds, common scenarios
  • [ ] Troubleshooting steps are numbered, chronological, with Datadog deep links
  • [ ] Notification tags preserved in exact same spots (conditional scopes, end, etc.)
  • [ ] @webhook-reops-incident-io replaced with @webhook-byte-incident-io (if byte not already present)
  • [ ] @webhook-byte-incident-io included for Byte team escalation
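
The webhook items on this checklist follow a simple rule: swap the deprecated reops webhook for the Byte one, unless the Byte webhook is already present. A sketch of that rule (the helper is hypothetical; the webhook handles are from this document):

```python
# Webhook swap from the checklist: replace the deprecated reops webhook with
# the Byte one, but leave the message untouched if Byte is already present.

OLD_WEBHOOK = "@webhook-reops-incident-io"
NEW_WEBHOOK = "@webhook-byte-incident-io"

def update_webhooks(message: str) -> str:
    """Apply the deprecated-webhook replacement rule to a runbook message."""
    if NEW_WEBHOOK in message:
        return message  # Byte escalation already wired up; change nothing
    return message.replace(OLD_WEBHOOK, NEW_WEBHOOK)
```

Note this only handles the webhook mention itself; per the rules below, all other notification tags must stay exactly where the original monitor put them.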

Important Rules

  1. NEVER update monitors directly - Always create BIM ticket for user to manually update
  2. Preserve notification tags - Don't accidentally drop or move notification mentions
  3. Ask about tag placement - If unsure where tags should go, ask the user
  4. Replace old webhooks - webhook-reops-incident-io is deprecated, use webhook-byte-incident-io
  5. Print for review first - Always show user the full runbook before creating ticket

Example Session

User: "This monitor's runbook was too vague, help me improve it"
      https://app.datadoghq.com/monitors/155218447

AI: [Reads monitor, identifies problems]
    "Current runbook has these issues:
    1. Action buried in conditional logic
    2. No numbered steps
    3. Formula unclear

    Here's the improved runbook: [prints full text]"

User: "Looks good, create ticket"

AI: [Creates BIM ticket with recommendations]
    "Created BIM-167: https://yumbrands.atlassian.net/browse/BIM-167"

Jira Ticket Rules

BIM Tickets (We Create)

Create a BIM ticket when:

  • Work is something we can do (threshold adjustment, runbook update, monitor config)
  • Need to track work that doesn't live in incident.io

BIM ticket is OPTIONAL - if tracking in incident.io, no Jira needed.

Other Team's Projects (They Create)

If the fix requires work in another team's codebase/project:

  1. DO NOT create a ticket in their project
  2. Note the owner in Confluence
  3. Ask them to create the ticket in their project
  4. They link it to the Confluence page

Prompt After Investigation

When a monitor needs review, prompt the user:

This monitor could use a review. Want me to:

1. Add to Datadog Review Confluence page
2. Add to Confluence + create BIM ticket
3. Skip for now

Adding to Confluence

When adding a monitor to review, include:

| Issue Description | Slack Links | Proposed Solutions | DD Links | Ownership | BIM Ticket | Date | Escalation | Status |
|-------------------|-------------|-------------------|----------|-----------|------------|------|------------|--------|
| [Monitor Name] - [Why it needs review] | [Slack thread] | [What should change] | [Monitor URL] | [Owner] | [BIM-XXX or N/A] | [Date] | [If needed] | TO-DO |
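
Assembling that row can be sketched as a small builder; the field order mirrors the table above, and the function itself is hypothetical:

```python
# Build one Datadog Review tracking row as a markdown table line.
# Column order mirrors the Confluence table; the helper is illustrative.

def make_review_row(monitor_name, issue, slack_link, solution, dd_link, owner,
                    bim_ticket="N/A", date="", escalation="", status="TO-DO"):
    """Return a markdown table row for the monthly Confluence page."""
    cells = [f"{monitor_name} - {issue}", slack_link, solution, dd_link,
             owner, bim_ticket, date, escalation, status]  # empty cells allowed
    return "| " + " | ".join(cells) + " |"
```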

Current Quarter Page

Q1 2026: Find or create the current month's page under: Datadog Monitor Contacts & Cleanup 2026 > Q1 2026 > [Month] 2026 - Alert Cleanup and Action Items


Integration Points

| System | Current State | Future |
|--------|---------------|--------|
| Datadog MCP | Connected - can read monitors | - |
| Atlassian MCP | Connected - can read/write Confluence, create BIM tickets | - |
| incident.io | Manual | MCP integration coming soon |

Update Log

| Date | Change |
|------|--------|
| 2026-01-24 | Initial Datadog Review workflow document |
