Coach | Reading Module

Datadog Review and Monitor Hygiene


Learning Guidance

Objectives

  • Decide when an alert should enter review backlog.
  • Draft runbook improvements with escalation-safe formatting.
  • Route action items to BIM workflow with ownership clarity.

Evidence To Capture

  • Monitor issue description and proposed fix.
  • Confluence tracking row with owner and status.
  • Jira linkage decision (BIM or incident-only).

Source Artifacts

Internal source references are available for maintainers but are not exposed in deployed environments.


Module Content


Key Takeaways

  • After an incident or noisy alert, decide whether the monitor itself needs work: thresholds, runbook, query, routing, or retirement.
  • Track every review item on the current month's Confluence page and review it at the weekly Tuesday meeting.
  • A BIM ticket is optional; tracking in incident.io alone is fine.
  • Never update a monitor directly: draft the runbook, print it for review, and create a BIM ticket for the user to apply.

Overview

This document describes the Datadog Review process - a continuous improvement workflow for monitor hygiene and observability quality.


Purpose

Catch P3s before they become P1s. Reduce alert fatigue. Keep monitoring right and tight.

After incidents or alerts, monitors often need improvement:

  • Trigger happy (thresholds too sensitive)
  • Missing or outdated runbooks
  • Query needs adjustment
  • Monitor should be retired
  • New monitoring needed that was missed

The Datadog Review process tracks this work across all teams.


Weekly Meeting

When: Every Tuesday at 2:00 PM ET
Purpose: Review weekly updates, additions, and closures


Confluence Tracking

All Datadog Review items are tracked in Confluence:

Parent Page: Datadog Monitor Contacts & Cleanup

Structure:

Datadog Monitor Contacts & Cleanup
├── Monitor Contacts & Cleanup 2026
│   ├── Q1 2026
│   │   ├── Jan 2026 - Alert Cleanup and Action Items
│   │   ├── Feb 2026 - Alert Cleanup and Action Items
│   │   └── Mar 2026 - Alert Cleanup and Action Items
│   ├── Q2 2026
│   └── ...

Monthly Page Columns:

| Column | Description |
|--------|-------------|
| Issue Description | What monitor is problematic and why |
| Slack Links | Where it was discussed |
| Proposed Solutions/Improvements | Threshold changes, runbook updates, etc. |
| DD Links | Datadog monitor and log links |
| Ownership | Who's responsible for the fix |
| BIM Ticket | Jira ticket (optional - see below) |
| Date Discussed | When it was reviewed |
| Escalation to Dev | If dev team involvement needed |
| Status | TO-DO, DONE, Monitoring |

Workflow

After Investigation or Incident

Alert/Incident fires
    ↓
Investigate & resolve
    ↓
Monitor needs improvement?
    ↓
YES → "Add to Datadog Review"
    ↓
Add to current month's Confluence page
    ↓
Create BIM ticket? (OPTIONAL)
    ├── Yes → Create BIM ticket, link to Confluence
    └── No → Track in incident.io instead
    ↓
Weekly Tuesday review
    ↓
Work completed → Mark DONE
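
The routing decision in the middle of that flow can be sketched as a small helper. The function name and return shape are illustrative only; this is not an existing tool:

```python
# Sketch of the post-incident routing decision (illustrative names only).

def route_monitor_followup(needs_improvement: bool, create_bim_ticket: bool) -> list[str]:
    """Return the follow-up steps for a monitor after an incident."""
    if not needs_improvement:
        return []  # nothing to do; no review item is opened
    steps = ["Add to current month's Confluence page"]
    if create_bim_ticket:
        steps.append("Create BIM ticket and link it to the Confluence row")
    else:
        steps.append("Track in incident.io instead")
    steps.append("Review at weekly Tuesday meeting")
    return steps
```

Either branch ends at the weekly Tuesday review; the only fork is where the work item is tracked.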

When to Add a Monitor to Review

| Reason | Example |
|--------|---------|
| Trigger happy | Monitor fires 10x/day but no action needed |
| Threshold needs adjustment | Traffic increased, old thresholds too sensitive |
| Runbook missing/outdated | Responder didn't know what to do |
| Query incorrect | Monitor measuring wrong thing |
| Should be retired | Service deprecated, monitor no longer relevant |
| New monitoring needed | Incident revealed gap in coverage |
| Wrong channel/routing | Alert going to wrong team |
| Priority incorrect | P2 should be P3, etc. |
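
The "trigger happy" row above is the most mechanical to spot. A rough check, with illustrative thresholds that are not policy:

```python
# Rough "trigger happy" check: the monitor fires often but the fires almost
# never lead to action. The 10/day cutoff mirrors the example in the table
# above and is illustrative, not a policy value.

def looks_trigger_happy(fires_per_day: float, actioned_fires_per_day: float) -> bool:
    """Flag a monitor whose alerts are frequent but rarely actionable."""
    return fires_per_day >= 10 and actioned_fires_per_day == 0
```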

Runbook Update Process

When a monitor's runbook is missing or outdated (per the table above), follow this workflow to improve it.

Workflow

User identifies monitor with poor runbook
    ↓
Read current monitor configuration
    ↓
Identify specific runbook problems
    ↓
Draft improved runbook (PRINT for review, never update directly)
    ↓
User reviews and approves
    ↓
Create BIM ticket with recommendations
    ↓
User manually updates monitor in Datadog

Step-by-Step Process

  1. Read the Monitor
     • Get monitor ID or URL from user
     • Use Datadog MCP to fetch monitor details
     • Review current runbook message
  2. Identify Problems
     • Missing required sections?
     • Unclear action steps?
     • Notification tags in wrong spots?
     • No structured troubleshooting steps?
     • Missing Datadog deep links?
  3. Draft Improved Runbook
     • Follow standard format (see below)
     • Print for user review (NEVER update directly)
     • Iterate based on feedback
  4. Create BIM Ticket
     • Project: BIM (Byte Incident Management)
     • Include full runbook text in description
     • List specific problems being fixed
     • User will manually update in Datadog
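
The "Identify Problems" step can be partly automated as a lint over the monitor's message. The section and variable names below come from the standard runbook format in this document; the function itself is a hypothetical sketch, not an existing tool:

```python
# Minimal runbook lint. Section names match this document's standard runbook
# format; the helper and its return shape are illustrative.

REQUIRED_SECTIONS = [
    "### Top level variables",
    "### Key information",
    "### Troubleshooting steps",
]
REQUIRED_VARIABLES = ["Org:", "Env:", "Service:"]

def lint_runbook(message: str) -> list[str]:
    """Return a list of problems found in a monitor's runbook message."""
    problems = []
    for section in REQUIRED_SECTIONS:
        if section not in message:
            problems.append(f"Missing section: {section}")
    for var in REQUIRED_VARIABLES:
        if var not in message:
            problems.append(f"Missing top level variable: {var}")
    if "app.datadoghq.com" not in message:
        problems.append("No Datadog deep links found")
    return problems
```

An empty result means the draft at least has the required skeleton; it says nothing about whether the troubleshooting steps are actually clear, which still needs human review.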

Standard Runbook Format

All monitor runbooks should follow this structure:

## [Monitor Name] - [Brief Description]

### Top level variables:
- Org: {{log.attributes.organization}}
- Env: {{log.tags.env}}
- Service: {{log.tags.service}}
- [Optional: Index, DB, Pod, etc. depending on monitor type]

---

### Key information:

**What This Monitor Watches**:
[Description of what's being monitored]

**Thresholds**:
- **Warning**: [Threshold value and meaning]
- **Critical**: [Threshold value and meaning]
- **Auto-Reset**: [If applicable]

**What Happens at Each Level**:
[Explain the impact at different threshold levels]

**Common Scenarios That Trigger This**:
1. [Scenario 1]
2. [Scenario 2]
3. [Scenario 3]

---

### Troubleshooting steps:

#### STEP 1: [Action Name] ([Estimated Time])
[Why this step first]

1. **[Sub-step]**:
   - [Datadog deep link]
   - [What to look for]

2. **[Sub-step]**:
   - [Action to take]

---

#### STEP 2: [Next Action]

[Continue with numbered steps and deep links]

---

### Quick Reference Links (optional)

| Resource | Link |
|----------|------|
| [Dashboard Name] | [Link] |
| [Config Page] | [Link] |

---

### Escalation

[Notification tags in proper scopes - preserve exactly from original monitor]

Critical Checklist

Before creating BIM ticket, verify:

  • [ ] Top level variables section present with bullet points (dashes)
  • [ ] Org, Env, Service are ALWAYS included (required)
  • [ ] Optional variables (Index, DB, Pod, etc.) added if relevant
  • [ ] Key information section explains what monitor watches, thresholds, common scenarios
  • [ ] Troubleshooting steps are numbered, chronological, with Datadog deep links
  • [ ] Notification tags preserved in exact same spots (conditional scopes, end, etc.)
  • [ ] @webhook-reops-incident-io replaced with @webhook-byte-incident-io (if byte not already present)
  • [ ] @webhook-byte-incident-io included for Byte team escalation
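
The webhook items on this checklist follow a simple rule: swap the deprecated reops webhook for the Byte one, unless the Byte webhook is already present. A sketch of that rule (the helper is hypothetical; the webhook handles are from this document):

```python
# Webhook swap from the checklist: replace the deprecated reops webhook with
# the Byte one, but leave the message untouched if Byte is already present.

OLD_WEBHOOK = "@webhook-reops-incident-io"
NEW_WEBHOOK = "@webhook-byte-incident-io"

def update_webhooks(message: str) -> str:
    """Apply the deprecated-webhook replacement rule to a runbook message."""
    if NEW_WEBHOOK in message:
        return message  # Byte escalation already wired up; change nothing
    return message.replace(OLD_WEBHOOK, NEW_WEBHOOK)
```

Note this only handles the webhook mention itself; per the rules below, all other notification tags must stay exactly where the original monitor put them.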

Important Rules

  1. NEVER update monitors directly - Always create BIM ticket for user to manually update
  2. Preserve notification tags - Don't accidentally drop or move notification mentions
  3. Ask about tag placement - If unsure where tags should go, ask the user
  4. Replace old webhooks - webhook-reops-incident-io is deprecated, use webhook-byte-incident-io
  5. Print for review first - Always show user the full runbook before creating ticket

Example Session

User: "This monitor's runbook was too vague, help me improve it"
      https://app.datadoghq.com/monitors/155218447

AI: [Reads monitor, identifies problems]
    "Current runbook has these issues:
    1. Action buried in conditional logic
    2. No numbered steps
    3. Formula unclear

    Here's the improved runbook: [prints full text]"

User: "Looks good, create ticket"

AI: [Creates BIM ticket with recommendations]
    "Created BIM-167: https://yumbrands.atlassian.net/browse/BIM-167"

Jira Ticket Rules

BIM Tickets (We Create)

Create a BIM ticket when:

  • Work is something we can do (threshold adjustment, runbook update, monitor config)
  • Need to track work that doesn't live in incident.io

BIM ticket is OPTIONAL - if tracking in incident.io, no Jira needed.

Other Team's Projects (They Create)

If the fix requires work in another team's codebase/project:

  1. DO NOT create a ticket in their project
  2. Note the owner in Confluence
  3. Ask them to create the ticket in their project
  4. They link it to the Confluence page

Prompt After Investigation

When a monitor needs review, prompt the user:

This monitor could use a review. Want me to:

1. Add to Datadog Review Confluence page
2. Add to Confluence + create BIM ticket
3. Skip for now

Adding to Confluence

When adding a monitor to review, include:

| Issue Description | Slack Links | Proposed Solutions | DD Links | Ownership | BIM Ticket | Date | Escalation | Status |
|-------------------|-------------|-------------------|----------|-----------|------------|------|------------|--------|
| [Monitor Name] - [Why it needs review] | [Slack thread] | [What should change] | [Monitor URL] | [Owner] | [BIM-XXX or N/A] | [Date] | [If needed] | TO-DO |
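
Assembling that row can be sketched as a small builder; the field order mirrors the table above, and the function itself is hypothetical:

```python
# Build one Datadog Review tracking row as a markdown table line.
# Column order mirrors the Confluence table; the helper is illustrative.

def make_review_row(monitor_name, issue, slack_link, solution, dd_link, owner,
                    bim_ticket="N/A", date="", escalation="", status="TO-DO"):
    """Return a markdown table row for the monthly Confluence page."""
    cells = [f"{monitor_name} - {issue}", slack_link, solution, dd_link,
             owner, bim_ticket, date, escalation, status]  # empty cells allowed
    return "| " + " | ".join(cells) + " |"
```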

Current Quarter Page

Q1 2026: Find or create the current month's page under: Datadog Monitor Contacts & Cleanup 2026 > Q1 2026 > [Month] 2026 - Alert Cleanup and Action Items


Integration Points

| System | Current State | Future |
|--------|---------------|--------|
| Datadog MCP | Connected - can read monitors | - |
| Atlassian MCP | Connected - can read/write Confluence, create BIM tickets | - |
| incident.io | Manual | MCP integration coming soon |

Update Log

| Date | Change |
|------|--------|
| 2026-01-24 | Initial Datadog Review workflow document |
