# Incident Response Pattern

## Purpose

Step-by-step procedures for diagnosing and resolving production issues in NGE service modules. This is a pattern (how-to), not a rule (must-do).

See the error-handling rules for exception types, and the observability rules for metrics and alarms.
## Triage Decision Tree

    Alarm fires
    │
    ├── DLQ depth > 0?
    │       → Go to "DLQ Investigation"
    │
    ├── Lambda error rate spike?
    │       → Go to "Lambda Error Diagnosis"
    │
    ├── SQS age of oldest message rising?
    │       → Go to "Backpressure Response"
    │
    ├── Lambda throttled?
    │       → Go to "Concurrency Exhaustion"
    │
    └── Aurora connection errors?
            → Go to "Database Connection Issues"
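The triage tree can be sketched as a function that maps alarm signals to the runbook section to follow, in the same priority order. This is an illustrative sketch only — the parameter names and the 300-second age threshold are assumptions, not the actual CloudWatch alarm identifiers or thresholds used by the NGE modules.

```python
def triage(dlq_depth: int, error_rate: float, oldest_msg_age_s: int,
           throttles: int, aurora_conn_errors: int) -> str:
    """Return the runbook section to follow, checking branches in tree order."""
    if dlq_depth > 0:
        return "DLQ Investigation"
    if error_rate > 0.05:          # 5% threshold from "Lambda Error Diagnosis"
        return "Lambda Error Diagnosis"
    if oldest_msg_age_s > 300:     # assumed age-of-oldest-message alarm threshold
        return "Backpressure Response"
    if throttles > 0:
        return "Concurrency Exhaustion"
    if aurora_conn_errors > 0:
        return "Database Connection Issues"
    return "no matching branch — inspect alarms manually"
```

Branch order matters: a DLQ with messages is checked first because it usually explains the other symptoms.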
## DLQ Investigation

**When:** DLQ depth alarm fires (any messages in DLQ).

1. Read DLQ messages — check `error_type` and `error_message` in the message attributes.
2. Classify the failure type:

   | Error Pattern | Likely Cause | Action |
   |---|---|---|
   | `PermanentFailureException` | Bad input data, logic error | Fix code or data, then redrive |
   | `RecoverableException` after max retries | Transient issue resolved itself | Redrive — should succeed |
   | `OperationalError` (MySQL 1213/1205) | Deadlock under load | Wait for load to subside, then redrive |
   | `ConnectionError` / timeout | Downstream service outage | Verify the service is healthy, then redrive |

3. Redrive procedure:
   - Use `sqs_ops.redrive_dlq()` — sends messages back to the source queue.
   - DLQ messages get a maximum of 2 redrives before a final `SilentSuccessException`.
   - Monitor the source queue after the redrive — messages should process successfully.
   - If the redrive fails again with the same error → escalate to a code fix.
4. After redrive:
   - Verify DLQ depth returns to 0.
   - Verify processing completes (check CloudWatch metrics).
   - If batch-related, verify `NgeCaseTrackerJob` picks up completion events.
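The classification step can be sketched as a lookup keyed on the `error_type` message attribute. The SQS attribute shape (`MessageAttributes` → `StringValue`) is standard boto3 `receive_message` output; the helper itself is illustrative, not the real `sqs_ops` implementation.

```python
# Maps error_type from the table above to the recommended action.
REDRIVE_ACTIONS = {
    "PermanentFailureException": "fix code or data, then redrive",
    "RecoverableException": "redrive — should succeed",
    "OperationalError": "wait for load to subside, then redrive",
    "ConnectionError": "verify downstream service is healthy, then redrive",
}

def classify_dlq_message(msg: dict) -> str:
    """Classify one received DLQ message by its error_type attribute."""
    attrs = msg.get("MessageAttributes", {})
    error_type = attrs.get("error_type", {}).get("StringValue", "")
    return REDRIVE_ACTIONS.get(
        error_type, "unknown error type — inspect error_message manually"
    )
```

Running this over a batch of received DLQ messages gives a quick tally of which redrive path applies before anything is re-sent.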
## Lambda Error Diagnosis

**When:** Lambda error rate exceeds the 5% threshold.

1. Check CloudWatch Logs — filter by `level = "ERROR"` in the module's log group.
2. Identify the error:

   | Log Pattern | Diagnosis |
   |---|---|
   | Import errors at cold start | Dependency issue — check the Lambda layer |
   | `RecoverableException` spikes | Downstream service degradation |
   | `OperationalError` spikes | Database contention — check Aurora metrics |
   | `TimeoutError` | Lambda timeout too short, or downstream service slow |
   | Memory exceeded | Lambda memory needs an increase — check `MaxMemoryUsed` |

3. Resolution:
   - Transient downstream issue → wait and monitor (SQS retries handle this).
   - Code bug → deploy the fix, then redrive DLQ messages accumulated during the outage.
   - Resource constraint → adjust Lambda memory/timeout/concurrency via CDK.
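Step 1 can be expressed as a CloudWatch Logs Insights query. A small builder is sketched below; the field names `level` and `error_type` are assumptions about the module's structured log format, so adjust them to match the actual log schema.

```python
def error_query(limit: int = 50) -> str:
    """Build a Logs Insights query for ERROR-level entries, newest first."""
    return (
        "fields @timestamp, level, error_type, @message\n"
        '| filter level = "ERROR"\n'
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )
```

The resulting string can be pasted into the Logs Insights console against the module's log group, or passed to boto3's `logs.start_query`.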
## Backpressure Response

**When:** `ApproximateAgeOfOldestMessage` alarm fires (messages aging in queue).

1. Check queue depth — is it growing or stable?
   - Growing: messages are arriving faster than they are processed → need more concurrency or slower ingest.
   - Stable but old: processing is stuck → check Lambda errors.
2. Check the `AgeOfFirstAttempt` metric — it measures the lag between SQS send and handler pickup.
3. Check Lambda concurrent executions — is it at `MaximumConcurrency`?
   - If at max → temporarily increase it (verify Aurora can handle more connections).
   - If below max → Lambda is erroring, not slow — go to "Lambda Error Diagnosis".
4. Per-case backpressure: if one case's large import is degrading others, the bottleneck is Aurora writer contention, not Lambda. Response:
   - Monitor `DatabaseConnections` — if near the limit, do NOT increase Lambda concurrency.
   - Wait for the large import to complete — SQS buffers absorb the delay.
## Concurrency Exhaustion

**When:** Lambda throttle alarm fires.

1. Identify which module — check the `Throttles` metric per Lambda function.
2. Check whether it is the account-level limit — AWS has per-region concurrent execution limits.
3. Resolution:
   - Single-module spike → increase `MaximumConcurrency` on the event source mapping.
   - Account-level → request a limit increase via AWS Support.
   - Before increasing concurrency, verify Aurora connection headroom (see performance-scaling rules).
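For the single-module case, `MaximumConcurrency` lives in the `ScalingConfig` of the SQS event source mapping and can be raised via `lambda.update_event_source_mapping`. A minimal sketch of the call arguments, assuming the mapping UUID is known; in practice the change belongs in CDK, with a direct API call reserved for temporary relief during an incident.

```python
def raise_concurrency_args(mapping_uuid: str, max_concurrency: int) -> dict:
    """Build kwargs for lambda.update_event_source_mapping.

    SQS event-source MaximumConcurrency must be between 2 and 1000.
    """
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be in [2, 1000]")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

# Usage (hypothetical UUID):
#   import boto3
#   lambda_client = boto3.client("lambda")
#   lambda_client.update_event_source_mapping(**raise_concurrency_args("<uuid>", 20))
```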
## Database Connection Issues

**When:** Lambda logs show connection failures or `OperationalError` outside of deadlocks.

1. Check RDS Proxy metrics — `DatabaseConnections`, `QueryDatabaseResponseLatency`.
2. Check Aurora metrics — `CPUUtilization`, `FreeableMemory`, `DatabaseConnections`.
3. Common causes:

   | Symptom | Cause | Fix |
   |---|---|---|
   | `max_connections` exceeded | Too many concurrent Lambdas | Reduce `MaximumConcurrency` |
   | RDS Proxy pinning exhaustion | Per-case schema switching pins connections | Reduce concurrent case processing |
   | Aurora writer CPU > 90% | Large imports or bulk operations | Wait for completion or throttle ingest |
   | Deadlock storms | Missing jitter on retries | Deploy jitter fix (see BACKLOG) |
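The "jitter fix" for deadlock storms is typically full-jitter exponential backoff: when every Lambda retries a deadlocked transaction (MySQL 1213) or lock-wait timeout (1205) after the same fixed delay, they collide again in lockstep; randomizing the delay breaks the synchronization. A minimal sketch, with illustrative base/cap values rather than the values used in the codebase:

```python
import random

def backoff_with_jitter(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: sleep a uniform random time up to an exponential cap.

    attempt 0 → up to 0.1s, attempt 1 → up to 0.2s, ... capped at 5s.
    """
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

The caller sleeps for the returned duration before retrying the transaction; because each Lambda draws an independent delay, retries spread out instead of storming the writer together.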
## Checkpoint Pipeline Stuck

**When:** A document batch stops progressing (no new checkpoint transitions).

1. Query checkpoint state — find documents stuck at a specific checkpoint step.
2. Check whether the handler is processing — look for recent log entries for the stuck documents.
3. Common causes:
   - Message consumed but handler timed out before the checkpoint update → the message returns to the queue after the visibility timeout.
   - Checkpoint update failed (`CheckpointUpdateError`) → document stuck at the previous step.
   - Upstream dependency unavailable → `RecoverableException` cycling.
4. Resolution:
   - If the message is back in the queue: it will retry automatically.
   - If the message is in the DLQ: redrive after fixing the root cause.
   - If no message exists (lost): manually re-publish the event to restart processing from the last checkpoint.
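Step 1 amounts to flagging documents whose last checkpoint transition is older than a staleness threshold. A sketch over an assumed, simplified record shape (`doc_id`, `step`, `updated_at`) rather than the real checkpoint table schema; the 30-minute threshold is likewise an assumption to tune per pipeline.

```python
from datetime import datetime, timedelta, timezone

def stuck_documents(checkpoints: list[dict],
                    max_age: timedelta = timedelta(minutes=30)) -> list[dict]:
    """Return checkpoint records whose last transition is older than max_age."""
    now = datetime.now(timezone.utc)
    return [c for c in checkpoints if now - c["updated_at"] > max_age]
```

Grouping the result by `step` then shows which checkpoint the batch is stalled at, which narrows the common causes listed above.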
## Post-Incident Checklist

After resolving any production incident:
- DLQ depth is 0 for all affected modules
- CloudWatch alarms are all in OK state
- Queue depth is draining (not growing)
- Lambda error rate is below threshold
- Affected batches completed successfully (check Athena events)
- Root cause identified and documented
- If code fix needed: PR opened with test covering the failure scenario
- If operational: BACKLOG.md updated with prevention item