Commerce | Reading Module

Commerce Monitor Signals and Alert Quality

Learning Guidance

Objectives

  • Link monitor IDs to service ownership and escalation defaults.
  • Differentiate noisy alerts from customer-impacting patterns.
  • Build first-pass investigation checkpoints per monitor family.

Evidence To Capture

  • Primary monitor and related monitors identified.
  • Initial owner-routing hypothesis documented.

Module Content

Key Takeaways

  • Payments incidents: payments team first, then SRE if infra symptoms present.
  • Order flow incidents: order-delivery first, then payments/cart based on failed step.
  • Authentication incidents: customer-auth first, security review if bot patterns detected.
  • RDS or cluster health incidents: SRE first, then impacted product teams.

Overview

The priority monitor set for onboarding is drawn from observability-audit/monitors/common-monitors.md and the detailed monitor YAML files.

P0/P1 Learning Monitors

| Monitor ID | Name | Service | Team | Notes |
|---|---|---|---|---|
| 251308017 | Payment Decline Rate | payment-dgs | payments | direct checkout risk |
| 163999810 | Gift Card Fiserv Decline | payment-dgs | payments | store activation vs vendor issue |
| 71163811 | TB Payment Captures | payment-workflowservice | payments | low-volume heartbeat pattern |
| 164192487 | Order Confirmation Anomaly | order-dgs/workflowservice | order-delivery | confirmation path degradation |
| 137981342 | IDP OTP Failure Rate | idpservice | customer-auth | login flow degradation |
| 140529304 | IDP Refresh Token Anomaly | idpservice | customer-auth | session continuity risk |
| 181521332 | High Order RDS CPU | rds | sre | shared dependency saturation |
| 213501682 | Container Restarts | k8s | sre | infra correlation and service instability |
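
As a study aid, the table above can be held as a small lookup that answers "who owns this monitor?" and suggests candidates for the "related monitors" evidence item. This is a minimal sketch, not an existing tool: the `Monitor` class and the `owner_for` and `related_monitors` helpers are hypothetical names, and treating "same team" as "related" is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Monitor:
    monitor_id: int
    name: str
    service: str
    team: str


# Rows copied from the P0/P1 Learning Monitors table.
_ROWS = [
    Monitor(251308017, "Payment Decline Rate", "payment-dgs", "payments"),
    Monitor(163999810, "Gift Card Fiserv Decline", "payment-dgs", "payments"),
    Monitor(71163811, "TB Payment Captures", "payment-workflowservice", "payments"),
    Monitor(164192487, "Order Confirmation Anomaly", "order-dgs/workflowservice", "order-delivery"),
    Monitor(137981342, "IDP OTP Failure Rate", "idpservice", "customer-auth"),
    Monitor(140529304, "IDP Refresh Token Anomaly", "idpservice", "customer-auth"),
    Monitor(181521332, "High Order RDS CPU", "rds", "sre"),
    Monitor(213501682, "Container Restarts", "k8s", "sre"),
]
MONITORS: Dict[int, Monitor] = {m.monitor_id: m for m in _ROWS}


def owner_for(monitor_id: int) -> Optional[str]:
    """Return the owning team, or None if the ID is not in the table."""
    monitor = MONITORS.get(monitor_id)
    return monitor.team if monitor else None


def related_monitors(monitor_id: int) -> List[int]:
    """Assumed heuristic: monitors owned by the same team count as related."""
    primary = MONITORS.get(monitor_id)
    if primary is None:
        return []
    return sorted(
        m.monitor_id
        for m in MONITORS.values()
        if m.team == primary.team and m.monitor_id != monitor_id
    )
```

For example, `owner_for(137981342)` yields the customer-auth team, and `related_monitors(137981342)` points at the refresh-token monitor owned by the same team.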

Routing Defaults

  • Payments incidents: payments team first, then SRE if infra symptoms present.
  • Order flow incidents: order-delivery first, then payments/cart based on failed step.
  • Authentication incidents: customer-auth first, security review if bot patterns detected.
  • RDS or cluster health incidents: SRE first, then impacted product teams.
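
The routing defaults above can be sketched as a function that returns an ordered escalation path. This is an illustrative sketch only: the family labels (`"payments"`, `"order-flow"`, `"authentication"`, `"infrastructure"`), the parameter names, and the `"security review"` label are assumptions introduced here, not identifiers from the source material.

```python
from typing import List, Optional, Sequence


def escalation_path(
    family: str,
    *,
    infra_symptoms: bool = False,
    failed_step: Optional[str] = None,
    bot_patterns: bool = False,
    impacted_teams: Sequence[str] = (),
) -> List[str]:
    """Return teams in the order they should be engaged (hypothetical helper)."""
    if family == "payments":
        # Payments team first, then SRE only if infra symptoms are present.
        path = ["payments"]
        if infra_symptoms:
            path.append("sre")
    elif family == "order-flow":
        # order-delivery first, then payments or cart depending on the failed step.
        path = ["order-delivery"]
        if failed_step in ("payments", "cart"):
            path.append(failed_step)
    elif family == "authentication":
        # customer-auth first, plus a security review if bot patterns are detected.
        path = ["customer-auth"]
        if bot_patterns:
            path.append("security review")
    elif family == "infrastructure":
        # RDS or cluster health: SRE first, then whichever product teams are impacted.
        path = ["sre", *impacted_teams]
    else:
        raise ValueError(f"unknown incident family: {family!r}")
    return path
```

Keeping the secondary hop conditional mirrors the wording of each default: the first owner is always engaged, and the follow-on team is added only when its trigger is observed.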

Alert-to-Runbook Mapping

| Monitor ID | Primary Runbook |
|---|---|
| 251308017 | runbooks/payment-decline-rate.md |
| 163999810 | runbooks/payment-decline-rate.md |
| 164192487 | runbooks/order-confirmation-drop.md |
| 137981342 | runbooks/otp-failure-spike.md |
| 71163811 | ../pos/runbooks/tb-payment-captures-drop.md |
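
The mapping above lends itself to a simple dictionary lookup. The sketch below is a hypothetical helper (`runbook_for` and `monitors_by_runbook` are names invented here); it also makes visible that only five monitor IDs have a listed primary runbook, so any other ID returns `None` rather than a guessed path.

```python
from collections import defaultdict
from typing import Dict, List, Optional

# Entries copied from the Alert-to-Runbook Mapping table.
RUNBOOKS: Dict[int, str] = {
    251308017: "runbooks/payment-decline-rate.md",
    163999810: "runbooks/payment-decline-rate.md",
    164192487: "runbooks/order-confirmation-drop.md",
    137981342: "runbooks/otp-failure-spike.md",
    71163811: "../pos/runbooks/tb-payment-captures-drop.md",
}


def runbook_for(monitor_id: int) -> Optional[str]:
    """Return the primary runbook path, or None if the table lists none."""
    return RUNBOOKS.get(monitor_id)


def monitors_by_runbook() -> Dict[str, List[int]]:
    """Group monitor IDs by runbook to show which alerts share a procedure."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for monitor_id, path in RUNBOOKS.items():
        grouped[path].append(monitor_id)
    return {path: sorted(ids) for path, ids in grouped.items()}
```

Grouping by runbook shows that both payment decline monitors (251308017 and 163999810) lead to the same procedure, which is a useful first-pass check when the two fire together.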

Monitor Profile Highlights

| Monitor | Service | Team | Why It Matters |
|---|---|---|---|
| 251308017 Payment Decline Rate | payment-dgs | payments | Threshold: >50% decline rate; Typical Pattern: Can spike due to fraud prevention, card network issues; Action: Check Fiser… |
| 163999810 Gift Card Fiserv Decline | payment-dgs | payments | Type: Log alert (count threshold); Threshold: >= 50 in 30 min; Grouped By: org, decline_reason, store; Service File: knowle… |
| 71163811 TB Payment Captures | payment-workflowservice | payments | Type: Volume monitor (low count alert / heartbeat); Query: logs("processor capture request complete" @organization:tb_us) |
| 137981342 IDP OTP Failure Rate | idpservice | customer-auth | Query: (failures / total) * 100 over 15 minutes; Thresholds: Warn >40%, Alert >80%; Grouped By: cluster_name (environment) |
| 140529304 IDP Refresh Token Grant Anomaly | idpservice | customer-auth | Type: Anomaly detection (metric-based); Metric: commerce.idpservice.refresh_token_all; Query: Anomaly detection with weekl… |
| 181521332 High Order RDS CPU | order-rds | sre | Threshold: >50% CPU; Typical Pattern: Spikes during peak dinner hours (17:00-21:00 EST); Auto-Recovery: Usually within 30-… |
| 213501682 High Total Container Restarts in Production | kubernetes | sre | Type: Query alert (metric threshold); Thresholds: Warn >75, Alert >100 total restarts; Grouped By: kube_cluster_name, kube… |
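
To make the warn/alert thresholds in the profiles concrete, here is a small sketch of how a value would be classified. It is not the monitoring platform's own evaluation logic: `failure_rate_percent` and `classify` are hypothetical helpers, the strict "greater than" comparison follows the ">" written in the profiles, and treating a window with zero traffic as 0% is an assumption.

```python
def failure_rate_percent(failures: int, total: int) -> float:
    """(failures / total) * 100, as in the IDP OTP Failure Rate query."""
    if total <= 0:
        # Assumption: no traffic in the window is reported as 0%, not an error.
        return 0.0
    # Multiplying first is mathematically the same and avoids float rounding
    # on exact boundary values.
    return 100.0 * failures / total


def classify(value: float, warn: float, alert: float) -> str:
    """Return 'alert', 'warn', or 'ok' using strict > comparisons."""
    if value > alert:
        return "alert"
    if value > warn:
        return "warn"
    return "ok"


# Thresholds taken from the profile table.
OTP_WARN, OTP_ALERT = 40.0, 80.0          # IDP OTP Failure Rate (percent)
RESTART_WARN, RESTART_ALERT = 75, 100     # total container restarts

otp_state = classify(failure_rate_percent(45, 100), OTP_WARN, OTP_ALERT)
restart_state = classify(101, RESTART_WARN, RESTART_ALERT)
```

With 45 failures out of 100 attempts the OTP monitor would sit in the warn band, while 101 total restarts crosses the alert line for the container monitor. A value exactly on a threshold (for example 75 restarts) stays below it under the strict comparison.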
