When a monitor's runbook is missing or outdated (per the table above), follow this workflow to improve it.
Workflow
User identifies monitor with poor runbook
↓
Read current monitor configuration
↓
Identify specific runbook problems
↓
Draft improved runbook (PRINT for review, never update directly)
↓
User reviews and approves
↓
Create BIM ticket with recommendations
↓
User manually updates monitor in Datadog
Step-by-Step Process
1. Read the Monitor
   - Get monitor ID or URL from user
   - Use Datadog MCP to fetch monitor details
   - Review current runbook message
2. Identify Problems
   - Missing required sections?
   - Unclear action steps?
   - Notification tags in wrong spots?
   - No structured troubleshooting steps?
   - Missing Datadog deep links?
3. Draft Improved Runbook
   - Follow standard format (see below)
   - Print for user review (NEVER update directly)
   - Iterate based on feedback
4. Create BIM Ticket
   - Project: BIM (Byte Incident Management)
   - Include full runbook text in description
   - List specific problems being fixed
   - User will manually update in Datadog
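The first two steps can be partly mechanized. The sketch below is illustrative only: `extract_monitor_id` and `find_runbook_problems` are hypothetical helper names, the section headings are copied from the standard format in this document, and the deep-link check assumes links contain `datadoghq.`, as the example monitor URL does. It flags structural gaps only. Unclear action steps and misplaced notification tags still need a human read.

```python
import re
from urllib.parse import urlparse

# Headings copied from the standard runbook format described in this document.
REQUIRED_SECTIONS = [
    "### Top level variables:",
    "### Key information:",
    "### Troubleshooting steps:",
    "### Escalation",
]


def extract_monitor_id(url_or_id: str) -> int:
    """Accept a bare monitor ID or a Datadog monitor URL and return the ID."""
    text = url_or_id.strip()
    if text.isdigit():
        return int(text)
    match = re.search(r"/monitors/(\d+)", urlparse(text).path)
    if match is None:
        raise ValueError(f"no monitor ID found in {url_or_id!r}")
    return int(match.group(1))


def find_runbook_problems(message: str) -> list[str]:
    """Return structural problems found in a monitor's runbook message."""
    problems = []
    lowered = message.lower()
    for heading in REQUIRED_SECTIONS:
        if heading.lower() not in lowered:
            problems.append(f"Missing required section: {heading}")
    # The standard format numbers its steps as "#### STEP 1: ...".
    if not re.search(r"^#### STEP \d+:", message, flags=re.MULTILINE):
        problems.append("No structured troubleshooting steps")
    # Assumption: any Datadog deep link contains "datadoghq." in its host.
    if "datadoghq." not in message:
        problems.append("Missing Datadog deep links")
    return problems
```

The returned list can be pasted into the review printout and later into the BIM ticket as the list of problems being fixed.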
Standard Runbook Format
All monitor runbooks should follow this structure:
## [Monitor Name] - [Brief Description]
### Top level variables:
- Org: {{log.attributes.organization}}
- Env: {{log.tags.env}}
- Service: {{log.tags.service}}
- [Optional: Index, DB, Pod, etc. depending on monitor type]
---
### Key information:
**What This Monitor Watches**:
[Description of what's being monitored]
**Thresholds**:
- **Warning**: [Threshold value and meaning]
- **Critical**: [Threshold value and meaning]
- **Auto-Reset**: [If applicable]
**What Happens at Each Level**:
[Explain the impact at different threshold levels]
**Common Scenarios That Trigger This**:
1. [Scenario 1]
2. [Scenario 2]
3. [Scenario 3]
---
### Troubleshooting steps:
#### STEP 1: [Action Name] ([Estimated Time])
[Why this step first]
1. **[Sub-step]**:
- [Datadog deep link]
- [What to look for]
2. **[Sub-step]**:
- [Action to take]
---
#### STEP 2: [Next Action]
[Continue with numbered steps and deep links]
---
### Quick Reference Links (optional)
| Resource | Link |
|----------|------|
| [Dashboard Name] | [Link] |
| [Config Page] | [Link] |
---
### Escalation
[Notification tags in proper scopes - preserve exactly from original monitor]
Critical Checklist
Before creating BIM ticket, verify:
- [ ] Top level variables section present with bullet points (dashes)
- [ ] Org, Env, Service are ALWAYS included (required)
- [ ] Optional variables (Index, DB, Pod, etc.) added if relevant
- [ ] Key information section explains what monitor watches, thresholds, common scenarios
- [ ] Troubleshooting steps are numbered, chronological, with Datadog deep links
- [ ] Notification tags preserved in exact same spots (conditional scopes, end, etc.)
- [ ] @webhook-reops-incident-io replaced with @webhook-byte-incident-io (if byte not already present)
- [ ] @webhook-byte-incident-io included for Byte team escalation
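The notification-tag items in the checklist are the easiest to get wrong by eye, so a small comparison helps. The sketch below is a rough aid, not a full template parser: `scoped_handles` and `check_notification_tags` are hypothetical names, the handle pattern is a simplification, and conditional scopes are assumed to be written as `{{#name}}` or `{{^name}}` blocks closed by `{{/name}}`. It does a one-for-one webhook swap, so the "if byte not already present" case still needs a manual decision.

```python
import re

DEPRECATED = "@webhook-reops-incident-io"
REPLACEMENT = "@webhook-byte-incident-io"

# Matches either a conditional open/close tag or an @-handle.
# Simplified pattern: real messages may contain handles this does not cover.
TOKEN_RE = re.compile(r"\{\{([#^/])(\w+)[^}]*\}\}|(@[\w-]+(?:\.[\w-]+)*)")


def scoped_handles(message: str) -> list[tuple[str, tuple[str, ...]]]:
    """Return each @-handle with the stack of conditional blocks enclosing it."""
    stack: list[str] = []
    found = []
    for match in TOKEN_RE.finditer(message):
        sigil, name, handle = match.groups()
        if handle:
            found.append((handle, tuple(stack)))
        elif sigil == "/":
            # Close the innermost block only if it matches the closing tag.
            if stack and stack[-1][1:] == name:
                stack.pop()
        else:
            stack.append(sigil + name)
    return found


def check_notification_tags(original: str, draft: str) -> list[str]:
    """Compare draft handles against the original, allowing the webhook swap."""
    failures = []
    draft_handles = scoped_handles(draft)
    names = [handle for handle, _ in draft_handles]
    if DEPRECATED in names:
        failures.append(f"{DEPRECATED} is still present")
    if REPLACEMENT not in names:
        failures.append(f"{REPLACEMENT} is missing")
    expected = [
        (REPLACEMENT if handle == DEPRECATED else handle, scope)
        for handle, scope in scoped_handles(original)
    ]
    if draft_handles != expected:
        failures.append("Notification tags were moved, dropped, or added")
    return failures
```

An empty result means every handle sits in the same conditional scope and order as in the original monitor. Any failure is a prompt to ask the user about tag placement, not a reason to rearrange tags automatically.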
Important Rules
- NEVER update monitors directly - Always create BIM ticket for user to manually update
- Preserve notification tags - Don't accidentally drop or move notification mentions
- Ask about tag placement - If unsure where tags should go, ask the user
- Replace old webhooks - webhook-reops-incident-io is deprecated, use webhook-byte-incident-io
- Print for review first - Always show user the full runbook before creating ticket
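Because monitors are never updated directly, the BIM ticket is the only artifact this workflow produces. A minimal sketch of assembling its contents follows. `build_bim_ticket` and the dictionary keys are hypothetical and do not correspond to any particular ticketing API. Only the project key BIM and the required contents (full runbook text, list of problems) come from this document.

```python
def build_bim_ticket(monitor_url: str, problems: list[str], runbook_text: str) -> dict:
    """Assemble ticket fields for a runbook the user has already approved."""
    problem_lines = "\n".join(f"- {problem}" for problem in problems)
    description = (
        f"Monitor: {monitor_url}\n\n"
        "Problems being fixed:\n"
        f"{problem_lines}\n\n"
        "Proposed runbook (to be applied manually in Datadog):\n\n"
        f"{runbook_text}"
    )
    return {
        "project": "BIM",  # Byte Incident Management
        "summary": f"Improve runbook for monitor {monitor_url}",
        "description": description,
    }
```

Calling this only after the user has reviewed the printed runbook keeps the print-for-review rule intact: the function formats text and never touches the monitor.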
Example Session
User: "This monitor's runbook was too vague, help me improve it"
https://app.datadoghq.com/monitors/155218447
AI: [Reads monitor, identifies problems]
"Current runbook has these issues:
1. Action buried in conditional logic
2. No numbered steps
3. Formula unclear
Here's the improved runbook: [prints full text]"
User: "Looks good, create ticket"
AI: [Creates BIM ticket with recommendations]
"Created BIM-167: https://yumbrands.atlassian.net/browse/BIM-167"