Critical concept: When you see errors like "can't reach database" or "connection timeout", don't assume the target is down. The SOURCE (pod/node) might be the problem.
### 7a. Group Errors by Infrastructure Dimension
Before deep-diving into error content, check WHERE errors are coming from:

```sql
-- Group by Kubernetes node
SELECT kube_node, count(*) AS errors
FROM logs
GROUP BY kube_node
ORDER BY errors DESC
```

```sql
-- Group by pod
SELECT pod_name, count(*) AS errors
FROM logs
GROUP BY pod_name
ORDER BY errors DESC
```
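The grouping above can be sketched in plain Python over raw log records, which is useful when you are post-processing API results rather than running a query. This is a minimal sketch; the records, field names, and `group_errors` helper are hypothetical, chosen to mirror the tags used in this runbook.

```python
from collections import Counter

# Hypothetical error-log records; the kube_node / pod_name fields
# mirror the infrastructure tags described in this section.
logs = [
    {"kube_node": "node-a", "pod_name": "svc-1-abc"},
    {"kube_node": "node-a", "pod_name": "svc-1-def"},
    {"kube_node": "node-b", "pod_name": "svc-2-xyz"},
]

def group_errors(logs, dimension):
    """Count error logs per value of one infrastructure dimension,
    sorted descending -- the equivalent of GROUP BY ... ORDER BY errors DESC."""
    return Counter(record[dimension] for record in logs).most_common()

print(group_errors(logs, "kube_node"))  # [('node-a', 2), ('node-b', 1)]
```

The same helper works for any dimension in the tags table below (pod, namespace, AZ) just by changing the `dimension` argument.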
| Pattern | Likely Cause | Next Step |
|---|---|---|
| Errors concentrated on 1-2 nodes | Node issue (unhealthy, network, resources) | Check node health |
| Errors concentrated on specific pods | Pod issue (OOM, crash loop, bad deployment) | Check pod health |
| Errors distributed across all nodes/pods | Application or downstream issue (database, vendor) | Check downstream health |
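The "concentrated vs. distributed" judgment in the table can be made mechanical. Below is a minimal sketch; the 80% threshold and 2-node cap are illustrative heuristics I'm assuming for the example, not standard values from this runbook.

```python
from collections import Counter

def classify(node_counts, threshold=0.8, max_nodes=2):
    """Classify an error distribution as node-concentrated or distributed.

    node_counts: mapping of kube_node -> error count.
    Heuristic (assumed, tune per service): if the top 1-2 nodes hold
    >= 80% of all errors, suspect those nodes rather than a downstream system.
    """
    total = sum(node_counts.values())
    top = Counter(node_counts).most_common(max_nodes)
    top_share = sum(count for _, count in top) / total
    return "node-concentrated" if top_share >= threshold else "distributed"

print(classify({"node-a": 950, "node-b": 30, "node-c": 20}))   # node-concentrated
print(classify({"node-a": 340, "node-b": 330, "node-c": 330})) # distributed
```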
### 7b. Infrastructure Tags Reference
These tags are available on application logs - use them for correlation:

| Tag | Example | Use For |
|---|---|---|
| `kube_node` | ip-10-10-26-138.ec2.internal | Node-level correlation |
| `pod_name` | platform-router-storefront-5d74b797f4-65mpq | Pod-level correlation |
| `kube_namespace` | graph-core-prod-curie-use1 | Namespace scoping |
| `kube_cluster_name` | prod-curie | Cluster-level correlation |
| `availability-zone` | us-east-1a | AZ-level issues |
| `eks_nodegroup-name` | yce-curie-prod-eksstack-e45e-green-node-group | Node group issues |
| `instance-type` | c7i.8xlarge | Instance-type specific issues |
| `container_name` | platform-router-storefront | Container identification |
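Correlating across more than one of these dimensions helps distinguish failure scopes: many failing nodes inside one AZ points at the AZ, while one failing node in an otherwise healthy AZ points at the node. A minimal sketch, with hypothetical log records carrying the tags from the table:

```python
from collections import Counter

# Hypothetical error logs tagged with kube_node and availability-zone.
errors = [
    {"kube_node": "ip-10-0-1-5", "availability-zone": "us-east-1a"},
    {"kube_node": "ip-10-0-1-5", "availability-zone": "us-east-1a"},
    {"kube_node": "ip-10-0-2-7", "availability-zone": "us-east-1a"},
    {"kube_node": "ip-10-0-3-9", "availability-zone": "us-east-1b"},
]

def correlate(errors, tag):
    """Rank values of one tag by error count, most affected first."""
    return Counter(record[tag] for record in errors).most_common()

# Two distinct nodes failing in us-east-1a suggests an AZ-level issue,
# not a single bad node.
print(correlate(errors, "availability-zone"))  # [('us-east-1a', 3), ('us-east-1b', 1)]
print(correlate(errors, "kube_node"))
```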
### 7c. Kubernetes Health Check
Metric for unhealthy pods:

```
kubernetes_state.pod.ready{condition:false,!pod_phase:succeeded}
```

Scope by namespace: `kube_namespace:storemenu-prod-curie-use1`

Dashboard: Kubernetes Pods Overview
- Filter by namespace to see pod states
- Look for: pods not ready, restarts, OOM, CrashLoopBackOff
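The "pods not ready, excluding Succeeded" logic behind that metric can be reproduced against pod status objects. A minimal sketch, assuming a heavily trimmed version of the `kubectl get pods -o json` shape (real objects carry many more fields):

```python
# Hypothetical, trimmed pod list mimicking `kubectl get pods -o json`.
pods = {
    "items": [
        {"metadata": {"name": "svc-1-abc"},
         "status": {"phase": "Running",
                    "conditions": [{"type": "Ready", "status": "True"}]}},
        {"metadata": {"name": "svc-2-xyz"},
         "status": {"phase": "Running",
                    "conditions": [{"type": "Ready", "status": "False"}]}},
        {"metadata": {"name": "job-done"},
         "status": {"phase": "Succeeded", "conditions": []}},
    ]
}

def unready_pods(pods):
    """Name pods whose Ready condition is not True, skipping Succeeded pods
    (mirroring the !pod_phase:succeeded exclusion in the metric query)."""
    bad = []
    for pod in pods["items"]:
        if pod["status"]["phase"] == "Succeeded":
            continue
        ready = any(cond["type"] == "Ready" and cond["status"] == "True"
                    for cond in pod["status"]["conditions"])
        if not ready:
            bad.append(pod["metadata"]["name"])
    return bad

print(unready_pods(pods))  # ['svc-2-xyz']
```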
### 7d. Database/RDS Health Check
If errors indicate database connectivity issues AND errors are distributed (not node-concentrated):
Dashboards:
What to check:
- Connection count (maxed out?)
- CPU/memory utilization
- Read/write latency spikes
- Recent failover events
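The checklist above can be folded into one triage helper that flags suspicious metrics. A minimal sketch; the stat names and thresholds (90% of max connections, 50 ms latency, 90% CPU) are assumptions for illustration, not recommended values:

```python
def db_health_flags(stats, conn_threshold=0.9, latency_ms=50, cpu_pct=90):
    """Return human-readable flags for the RDS symptoms listed above.
    Thresholds are illustrative; tune per database."""
    flags = []
    if stats["connections"] >= conn_threshold * stats["max_connections"]:
        flags.append("connections near max")
    if stats["read_latency_ms"] > latency_ms or stats["write_latency_ms"] > latency_ms:
        flags.append("latency spike")
    if stats["cpu_pct"] > cpu_pct:
        flags.append("high CPU")
    return flags

print(db_health_flags({"connections": 98, "max_connections": 100,
                       "read_latency_ms": 4, "write_latency_ms": 120,
                       "cpu_pct": 40}))  # ['connections near max', 'latency spike']
```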
### 7e. Known Failure Signatures

| Signature | Likely Cause |
|---|---|
| Pods in bad state (not ready) | Node issue, resource exhaustion, deployment problem |
| OOM (Out of Memory) | Memory limits too low, memory leak |
| CrashLoopBackOff | Application crash on startup, config issue |
| Synthetic failures | Monitoring/health check failures |
| "Can't reach [X]" from specific pods | Check the SOURCE pod/node, not just target |
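A signature table like this maps naturally to a first-pass log classifier. A minimal sketch, assuming simple substring matching; the needle strings and fallback message are illustrative, not an exhaustive ruleset:

```python
# Illustrative substring -> likely-cause pairs based on the table above.
# Ordered: more specific signatures first.
SIGNATURES = [
    ("OOMKilled", "memory limits too low or memory leak"),
    ("CrashLoopBackOff", "application crash on startup or config issue"),
    ("can't reach", "check the SOURCE pod/node, not just the target"),
]

def likely_cause(message):
    """Return the likely cause for the first matching signature,
    or a prompt to group by infrastructure dimension first."""
    msg = message.lower()
    for needle, cause in SIGNATURES:
        if needle.lower() in msg:
            return cause
    return "unknown; group by node/pod first"

print(likely_cause("Can't reach database server"))
```

Note that the "can't reach" rule deliberately points back at the source, matching the critical concept at the top of this section.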
### 7f. Decision Tree

```
1. Query errors, group by kube_node
   ↓
2. Errors concentrated on specific node(s)?
   │
   ├─ YES → Check Kubernetes Pods Overview for that node
   │        → Look for pods not ready, OOM, CrashLoopBackOff
   │        → Check if other services on same node affected
   │        → Likely action: Cordon node, roll pods
   │
   └─ NO (distributed) → Check downstream systems
            → RDS dashboards if database errors
            → Vendor status if external service errors
            → Application logs for specific error content
```

### Example: Store-Menu Investigation (Jan 26, 2026)
What happened:
- Alert: Router Subgraph Errors for store-menu in prod-curie
- Error message: "Can't reach database server"
- Initial assumption: RDS database issue
What we should have done:
```sql
SELECT kube_node, count(*) AS errors
FROM logs
WHERE env:prod-curie AND service:platform-router AND @metadata.subgraph_name:store-menu
GROUP BY kube_node
```
What this would have shown:
- Errors concentrated on node ip-10-10-27-170.ec2.internal
- Other services on that node also failing

Actual root cause: Unhealthy Kubernetes node, not RDS

Resolution: Cordon node, roll pods to healthy nodes