
Incident Response Pattern

Purpose

Step-by-step procedures for diagnosing and resolving production issues in NGE service modules. This is a pattern (how-to), not a rule (must-do).

See error-handling rules for exception types and observability rules for metrics and alarms.

Triage Decision Tree

Alarm fires
    ├── DLQ depth > 0?
    │       → Go to "DLQ Investigation"
    ├── Lambda error rate spike?
    │       → Go to "Lambda Error Diagnosis"
    ├── SQS age of oldest message rising?
    │       → Go to "Backpressure Response"
    ├── Lambda throttled?
    │       → Go to "Concurrency Exhaustion"
    └── Aurora connection errors?
            → Go to "Database Connection Issues"

DLQ Investigation

When: DLQ depth alarm fires (any messages in DLQ).

  1. Read DLQ messages — check error_type and error_message in message attributes
  2. Classify the failure type:

| Error Pattern | Likely Cause | Action |
| --- | --- | --- |
| PermanentFailureException | Bad input data, logic error | Fix code or data, then redrive |
| RecoverableException after max retries | Transient issue resolved itself | Redrive — should succeed |
| OperationalError (MySQL 1213/1205) | Deadlock under load | Wait for load to subside, then redrive |
| ConnectionError / timeout | Downstream service outage | Verify service is healthy, then redrive |

  3. Redrive procedure:
     • Use sqs_ops.redrive_dlq() — sends messages back to the source queue
     • DLQ messages get a maximum of 2 redrives before a final SilentSuccessException
     • Monitor the source queue after redrive — messages should process successfully
     • If the redrive fails again with the same error → escalate to a code fix
  4. After redrive:
     • Verify DLQ depth returns to 0
     • Verify processing completes (check CloudWatch metrics)
     • If batch-related, verify NgeCaseTrackerJob picks up completion events
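The classification table above can be sketched as a small triage helper. The error_type values come from the table; the exact attribute names and return strings are illustrative assumptions, not the real NGE API.

```python
# Sketch only: map the error_type message attribute (per the table above)
# to a triage action. Names and strings here are assumptions from this doc.
def classify_dlq_action(error_type: str) -> str:
    if error_type == "PermanentFailureException":
        return "fix code or data, then redrive"
    if error_type == "RecoverableException":
        return "redrive"  # transient issue should have resolved itself
    if error_type == "OperationalError":
        return "wait for load to subside, then redrive"
    if error_type in ("ConnectionError", "TimeoutError"):
        return "verify downstream is healthy, then redrive"
    return "escalate"  # unknown pattern: needs a human
```

A helper like this keeps the table and the on-call tooling from drifting apart.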

Lambda Error Diagnosis

When: Lambda error rate exceeds 5% threshold.

  1. Check CloudWatch Logs — filter by level = "ERROR" in the module's log group
  2. Identify the error:

| Log Pattern | Diagnosis |
| --- | --- |
| Import errors at cold start | Dependency issue — check Lambda layer |
| RecoverableException spikes | Downstream service degradation |
| OperationalError spikes | Database contention — check Aurora metrics |
| TimeoutError | Lambda timeout too short, or downstream service slow |
| Memory exceeded | Lambda memory needs increase — check MaxMemoryUsed |

  3. Resolution:
     • Transient downstream issue → wait and monitor (SQS retries handle this)
     • Code bug → deploy a fix, then redrive DLQ messages accumulated during the outage
     • Resource constraint → adjust Lambda memory/timeout/concurrency via CDK
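The log-pattern table above can be turned into a quick tally over pulled ERROR lines. A minimal sketch — the regexes are illustrative guesses at the strings your handlers emit, not confirmed log formats:

```python
import re
from collections import Counter

# Illustrative patterns keyed to the diagnoses in the table above;
# adjust the regexes to match what your handlers actually log.
PATTERNS = {
    "dependency issue": re.compile(r"ImportError|ModuleNotFoundError"),
    "downstream degradation": re.compile(r"RecoverableException"),
    "database contention": re.compile(r"OperationalError"),
    "timeout": re.compile(r"Task timed out|TimeoutError"),
}

def tally_error_logs(lines: list) -> Counter:
    """Count how many ERROR lines match each diagnosis bucket."""
    counts = Counter()
    for line in lines:
        for diagnosis, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[diagnosis] += 1
    return counts
```

Feed it the output of a CloudWatch Logs Insights query (or `filter_log_events`) to see which diagnosis dominates before picking a resolution.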

Backpressure Response

When: ApproximateAgeOfOldestMessage alarm fires (messages aging in queue).

  1. Check queue depth — is it growing or stable?
     • Growing: messages arriving faster than they are processed → need more concurrency or slower ingest
     • Stable but old: processing is stuck → check Lambda errors
  2. Check the AgeOfFirstAttempt metric — measures lag between SQS send and handler pickup
  3. Check Lambda concurrent executions — is it at MaximumConcurrency?
     • If at max → temporarily increase (verify Aurora can handle more connections)
     • If below max → Lambda is erroring, not slow — go to "Lambda Error Diagnosis"
  4. Per-case backpressure: if one case's large import is degrading others, the bottleneck is Aurora writer contention, not Lambda. Response:
     • Monitor DatabaseConnections — if near the limit, do NOT increase Lambda concurrency
     • Wait for the large import to complete — SQS buffers absorb the delay
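The decision steps above reduce to three boolean inputs, all readable from CloudWatch. A sketch of the decision (the action strings are shorthand for the steps above, not real tooling):

```python
def backpressure_action(depth_growing: bool,
                        at_max_concurrency: bool,
                        aurora_has_headroom: bool) -> str:
    """Condense the runbook's backpressure triage into one decision."""
    if not depth_growing:
        # Stable but old: processing is stuck, not slow.
        return "check Lambda errors"
    if not at_max_concurrency:
        # Falling behind while below the concurrency cap: Lambda is erroring.
        return "go to Lambda Error Diagnosis"
    if aurora_has_headroom:
        return "temporarily raise MaximumConcurrency"
    # Aurora writer contention: adding Lambdas makes it worse.
    return "wait for large import to complete"
```

Encoding the tree this way makes the "do NOT increase concurrency without Aurora headroom" guard explicit rather than tribal knowledge.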

Concurrency Exhaustion

When: Lambda throttle alarm fires.

  1. Identify which module — check the Throttles metric per Lambda function
  2. Check whether it is an account-level limit — AWS has per-region concurrent execution limits
  3. Resolution:
     • Single-module spike → increase MaximumConcurrency on the event source mapping
     • Account-level → request a limit increase via AWS Support
     • Before increasing concurrency, verify Aurora connection headroom (see performance-scaling rules)
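For the single-module case, the event source mapping's MaximumConcurrency can be bumped ad hoc while the durable change goes through CDK. A hedged sketch — the mapping UUID is whatever `list_event_source_mappings` returns for the module's queue:

```python
def scaling_update_args(mapping_uuid: str, max_concurrency: int) -> dict:
    """Build the update_event_source_mapping kwargs for an SQS-triggered Lambda."""
    # SQS event source mappings require MaximumConcurrency of at least 2.
    if max_concurrency < 2:
        raise ValueError("MaximumConcurrency must be at least 2")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

# Emergency application (commented out; the durable change belongs in CDK):
# import boto3
# boto3.client("lambda").update_event_source_mapping(
#     **scaling_update_args("<mapping-uuid>", 50))
```

Keeping the argument-building pure makes the guard rail (minimum of 2) testable without touching AWS.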

Database Connection Issues

When: Lambda logs show connection failures or OperationalError outside of deadlocks.

  1. Check RDS Proxy metrics — DatabaseConnections, QueryDatabaseResponseLatency
  2. Check Aurora metrics — CPUUtilization, FreeableMemory, DatabaseConnections
  3. Common causes:

| Symptom | Cause | Fix |
| --- | --- | --- |
| max_connections exceeded | Too many concurrent Lambdas | Reduce MaximumConcurrency |
| RDS Proxy pinning exhaustion | Per-case schema switching pins connections | Reduce concurrent case processing |
| Aurora writer CPU > 90% | Large imports or bulk operations | Wait for completion or throttle ingest |
| Deadlock storms | Missing jitter on retries | Deploy jitter fix (see BACKLOG) |
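The "missing jitter on retries" fix in the last row can be sketched as exponential backoff with full jitter, so colliding writers desynchronize instead of retrying in lockstep. A minimal sketch — in real code the except clause would target the driver's OperationalError (e.g. pymysql.err.OperationalError) rather than bare Exception:

```python
import random
import time

def with_deadlock_retry(op, attempts: int = 3, base: float = 0.1, cap: float = 2.0):
    """Retry `op` on MySQL deadlock (1213) / lock-wait timeout (1205),
    sleeping a random interval (full jitter) that doubles each attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            code = exc.args[0] if exc.args else None
            if code not in (1213, 1205) or attempt == attempts - 1:
                raise  # non-deadlock error, or out of retries
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The full-jitter shape (uniform over [0, backoff]) is what breaks up the retry storms, since fixed backoffs re-collide.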

Checkpoint Pipeline Stuck

When: A document batch stops progressing (no new checkpoint transitions).

  1. Query checkpoint state — find documents stuck at a specific checkpoint step
  2. Check whether the handler is processing — look for recent log entries for the stuck documents
  3. Common causes:
     • Message consumed but handler timed out before the checkpoint update → message returns to the queue after the visibility timeout
     • Checkpoint update failed (CheckpointUpdateError) → document stuck at the previous step
     • Upstream dependency unavailable → RecoverableException cycling
  4. Resolution:
     • If the message is back in the queue: it will retry automatically
     • If the message is in the DLQ: redrive after fixing the root cause
     • If no message exists (lost): manually re-publish the event to restart processing from the last checkpoint
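For the lost-message case, re-publishing means reconstructing the event payload by hand. A hypothetical sketch — the field names, event_type value, and attribute shape are illustrative, not the real NGE schema; copy the actual shape from a healthy message before sending:

```python
import json

def rebuild_checkpoint_event(document_id: str, case_id: str,
                             last_checkpoint: str) -> dict:
    """Build send_message kwargs to resume a document from its last
    recorded checkpoint. All field names here are assumptions."""
    body = {
        "document_id": document_id,
        "case_id": case_id,
        "resume_from": last_checkpoint,
    }
    return {
        "MessageBody": json.dumps(body),
        "MessageAttributes": {
            "event_type": {"DataType": "String",
                           "StringValue": "checkpoint.resume"},
        },
    }

# Re-publish to the SOURCE queue (not the DLQ):
# sqs.send_message(QueueUrl=source_queue_url, **rebuild_checkpoint_event(...))
```

Sending to the source queue keeps the normal retry/DLQ path intact for the replayed message.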

Post-Incident Checklist

After resolving any production incident:

  • DLQ depth is 0 for all affected modules
  • CloudWatch alarms are all in OK state
  • Queue depth is draining (not growing)
  • Lambda error rate is below threshold
  • Affected batches completed successfully (check Athena events)
  • Root cause identified and documented
  • If code fix needed: PR opened with test covering the failure scenario
  • If operational: BACKLOG.md updated with prevention item