
Incident Response Pattern

Purpose

Step-by-step procedures for diagnosing and resolving production issues in NGE service modules. This is a pattern (how-to), not a rule (must-do).

See error-handling rules for exception types and observability rules for metrics and alarms.

Triage Decision Tree

Alarm fires
    ├── DLQ depth > 0?
    │       → Go to "DLQ Investigation"
    ├── Lambda error rate spike?
    │       → Go to "Lambda Error Diagnosis"
    ├── SQS age of oldest message rising?
    │       → Go to "Backpressure Response"
    ├── Lambda throttled?
    │       → Go to "Concurrency Exhaustion"
    └── Aurora connection errors?
            → Go to "Database Connection Issues"

DLQ Investigation

When: DLQ depth alarm fires (any messages in DLQ).

  1. Read DLQ messages — check error_type and error_message in message attributes
  2. Classify the failure type:

| Error Pattern | Likely Cause | Action |
| --- | --- | --- |
| PermanentFailureException | Bad input data, logic error | Fix code or data, then redrive |
| RecoverableException after max retries | Transient issue resolved itself | Redrive — should succeed |
| OperationalError (MySQL 1213/1205) | Deadlock under load | Wait for load to subside, then redrive |
| ConnectionError / timeout | Downstream service outage | Verify service is healthy, then redrive |

  3. Redrive procedure:
     • Use sqs_ops.redrive_dlq() — sends messages back to the source queue
     • DLQ messages get a maximum of 2 redrives before a final SilentSuccessException
     • Monitor the source queue after redrive — messages should process successfully
     • If the redrive fails again with the same error → escalate to a code fix
  4. After redrive:
     • Verify DLQ depth returns to 0
     • Verify processing completes (check CloudWatch metrics)
     • If batch-related, verify NgeCaseTrackerJob picks up completion events
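The classification table above can be sketched as a small triage helper. The error_type values come from the table; the exact attribute names and return strings are illustrative assumptions, not the real NGE API.

```python
# Sketch only: map the error_type message attribute (per the table above)
# to a triage action. Names and strings here are assumptions from this doc.
def classify_dlq_action(error_type: str) -> str:
    if error_type == "PermanentFailureException":
        return "fix code or data, then redrive"
    if error_type == "RecoverableException":
        return "redrive"  # transient issue should have resolved itself
    if error_type == "OperationalError":
        return "wait for load to subside, then redrive"
    if error_type in ("ConnectionError", "TimeoutError"):
        return "verify downstream is healthy, then redrive"
    return "escalate"  # unknown pattern: needs a human
```

A helper like this keeps the table and the on-call tooling from drifting apart.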

Lambda Error Diagnosis

When: Lambda error rate exceeds 5% threshold.

  1. Check CloudWatch Logs — filter by level = "ERROR" in the module's log group
  2. Identify the error:

| Log Pattern | Diagnosis |
| --- | --- |
| Import errors at cold start | Dependency issue — check Lambda layer |
| RecoverableException spikes | Downstream service degradation |
| OperationalError spikes | Database contention — check Aurora metrics |
| TimeoutError | Lambda timeout too short, or downstream service slow |
| Memory exceeded | Lambda memory needs increase — check MaxMemoryUsed |

  3. Resolution:
     • Transient downstream issue → wait and monitor (SQS retries handle this)
     • Code bug → deploy a fix, then redrive DLQ messages accumulated during the outage
     • Resource constraint → adjust Lambda memory/timeout/concurrency via CDK
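The log-pattern table above can be turned into a quick tally over pulled ERROR lines. A minimal sketch — the regexes are illustrative guesses at the strings your handlers emit, not confirmed log formats:

```python
import re
from collections import Counter

# Illustrative patterns keyed to the diagnoses in the table above;
# adjust the regexes to match what your handlers actually log.
PATTERNS = {
    "dependency issue": re.compile(r"ImportError|ModuleNotFoundError"),
    "downstream degradation": re.compile(r"RecoverableException"),
    "database contention": re.compile(r"OperationalError"),
    "timeout": re.compile(r"Task timed out|TimeoutError"),
}

def tally_error_logs(lines: list) -> Counter:
    """Count how many ERROR lines match each diagnosis bucket."""
    counts = Counter()
    for line in lines:
        for diagnosis, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[diagnosis] += 1
    return counts
```

Feed it the output of a CloudWatch Logs Insights query (or `filter_log_events`) to see which diagnosis dominates before picking a resolution.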

Backpressure Response

When: ApproximateAgeOfOldestMessage alarm fires (messages aging in queue).

  1. Check queue depth — is it growing or stable?
     • Growing: messages arriving faster than they are processed → need more concurrency or slower ingest
     • Stable but old: processing is stuck → check Lambda errors
  2. Check the AgeOfFirstAttempt metric — measures lag between SQS send and handler pickup
  3. Check Lambda concurrent executions — is it at MaximumConcurrency?
     • If at max → temporarily increase (verify Aurora can handle more connections)
     • If below max → Lambda is erroring, not slow — go to "Lambda Error Diagnosis"
  4. Per-case backpressure: if one case's large import is degrading others, the bottleneck is Aurora writer contention, not Lambda. Response:
     • Monitor DatabaseConnections — if near the limit, do NOT increase Lambda concurrency
     • Wait for the large import to complete — SQS buffers absorb the delay
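The decision steps above reduce to three boolean inputs, all readable from CloudWatch. A sketch of the decision (the action strings are shorthand for the steps above, not real tooling):

```python
def backpressure_action(depth_growing: bool,
                        at_max_concurrency: bool,
                        aurora_has_headroom: bool) -> str:
    """Condense the runbook's backpressure triage into one decision."""
    if not depth_growing:
        # Stable but old: processing is stuck, not slow.
        return "check Lambda errors"
    if not at_max_concurrency:
        # Falling behind while below the concurrency cap: Lambda is erroring.
        return "go to Lambda Error Diagnosis"
    if aurora_has_headroom:
        return "temporarily raise MaximumConcurrency"
    # Aurora writer contention: adding Lambdas makes it worse.
    return "wait for large import to complete"
```

Encoding the tree this way makes the "do NOT increase concurrency without Aurora headroom" guard explicit rather than tribal knowledge.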

Concurrency Exhaustion

When: Lambda throttle alarm fires.

  1. Identify which module — check the Throttles metric per Lambda function
  2. Check whether it is an account-level limit — AWS has per-region concurrent execution limits
  3. Resolution:
     • Single-module spike → increase MaximumConcurrency on the event source mapping
     • Account-level → request a limit increase via AWS Support
     • Before increasing concurrency, verify Aurora connection headroom (see performance-scaling rules)
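For the single-module case, the event source mapping's MaximumConcurrency can be bumped ad hoc while the durable change goes through CDK. A hedged sketch — the mapping UUID is whatever `list_event_source_mappings` returns for the module's queue:

```python
def scaling_update_args(mapping_uuid: str, max_concurrency: int) -> dict:
    """Build the update_event_source_mapping kwargs for an SQS-triggered Lambda."""
    # SQS event source mappings require MaximumConcurrency of at least 2.
    if max_concurrency < 2:
        raise ValueError("MaximumConcurrency must be at least 2")
    return {
        "UUID": mapping_uuid,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

# Emergency application (commented out; the durable change belongs in CDK):
# import boto3
# boto3.client("lambda").update_event_source_mapping(
#     **scaling_update_args("<mapping-uuid>", 50))
```

Keeping the argument-building pure makes the guard rail (minimum of 2) testable without touching AWS.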

Database Connection Issues

When: Lambda logs show connection failures or OperationalError outside of deadlocks.

  1. Check RDS Proxy metrics — DatabaseConnections, QueryDatabaseResponseLatency
  2. Check Aurora metrics — CPUUtilization, FreeableMemory, DatabaseConnections
  3. Common causes:

| Symptom | Cause | Fix |
| --- | --- | --- |
| max_connections exceeded | Too many concurrent Lambdas | Reduce MaximumConcurrency |
| RDS Proxy pinning exhaustion | Per-case schema switching pins connections | Reduce concurrent case processing |
| Aurora writer CPU > 90% | Large imports or bulk operations | Wait for completion or throttle ingest |
| Deadlock storms | Missing jitter on retries | Deploy jitter fix (see BACKLOG) |
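The "missing jitter on retries" fix in the last row can be sketched as exponential backoff with full jitter, so colliding writers desynchronize instead of retrying in lockstep. A minimal sketch — in real code the except clause would target the driver's OperationalError (e.g. pymysql.err.OperationalError) rather than bare Exception:

```python
import random
import time

def with_deadlock_retry(op, attempts: int = 3, base: float = 0.1, cap: float = 2.0):
    """Retry `op` on MySQL deadlock (1213) / lock-wait timeout (1205),
    sleeping a random interval (full jitter) that doubles each attempt."""
    for attempt in range(attempts):
        try:
            return op()
        except Exception as exc:
            code = exc.args[0] if exc.args else None
            if code not in (1213, 1205) or attempt == attempts - 1:
                raise  # non-deadlock error, or out of retries
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The full-jitter shape (uniform over [0, backoff]) is what breaks up the retry storms, since fixed backoffs re-collide.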

Checkpoint Pipeline Stuck

When: A document batch stops progressing (no new checkpoint transitions).

  1. Query checkpoint state — find documents stuck at a specific checkpoint step
  2. Check whether the handler is processing — look for recent log entries for the stuck documents
  3. Common causes:
     • Message consumed but handler timed out before the checkpoint update → message returns to the queue after the visibility timeout
     • Checkpoint update failed (CheckpointUpdateError) → document stuck at the previous step
     • Upstream dependency unavailable → RecoverableException cycling
  4. Resolution:
     • If the message is back in the queue: it will retry automatically
     • If the message is in the DLQ: redrive after fixing the root cause
     • If no message exists (lost): manually re-publish the event to restart processing from the last checkpoint
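For the lost-message case, re-publishing means reconstructing the event payload by hand. A hypothetical sketch — the field names, event_type value, and attribute shape are illustrative, not the real NGE schema; copy the actual shape from a healthy message before sending:

```python
import json

def rebuild_checkpoint_event(document_id: str, case_id: str,
                             last_checkpoint: str) -> dict:
    """Build send_message kwargs to resume a document from its last
    recorded checkpoint. All field names here are assumptions."""
    body = {
        "document_id": document_id,
        "case_id": case_id,
        "resume_from": last_checkpoint,
    }
    return {
        "MessageBody": json.dumps(body),
        "MessageAttributes": {
            "event_type": {"DataType": "String",
                           "StringValue": "checkpoint.resume"},
        },
    }

# Re-publish to the SOURCE queue (not the DLQ):
# sqs.send_message(QueueUrl=source_queue_url, **rebuild_checkpoint_event(...))
```

Sending to the source queue keeps the normal retry/DLQ path intact for the replayed message.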

Post-Incident Checklist

After resolving any production incident:

  • DLQ depth is 0 for all affected modules
  • CloudWatch alarms are all in OK state
  • Queue depth is draining (not growing)
  • Lambda error rate is below threshold
  • Affected batches completed successfully (check Athena events)
  • Root cause identified and documented
  • If code fix needed: PR opened with test covering the failure scenario
  • If operational: BACKLOG.md updated with prevention item