Article Review: Group 1 — SQS+Lambda Integration¶
Articles Reviewed¶
- Understanding Amazon SQS and AWS Lambda Event Source Mapping — Serverless Guru / AWS APN Blog (Feb 2023)
- Lessons learned from combining SQS and Lambda in a data project — Miia Niemelä / Solita (Jan 2020)
Key Concepts¶
How Event Source Mapping Works Internally¶
AWS manages up to 1,000 parallel polling threads per region. Each thread:

1. Long-polls the SQS queue
2. Receives a batch of messages
3. Synchronously invokes the Lambda function
4. On success → deletes the messages from the queue
5. On error (non-throttle) → does nothing; messages reappear after the visibility timeout
6. On throttle → retries a few times, then gives up; messages reappear with an incremented ReceiveCount
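The loop above can be sketched with a tiny in-memory model. All names and data structures here are illustrative stand-ins; the real poller is AWS-managed and not observable.

```python
class Message:
    """In-memory stand-in for an SQS message."""
    def __init__(self, body):
        self.body = body
        self.receive_count = 0
        self.invisible_until = 0.0  # epoch seconds; 0 means visible now

def poll_once(queue, handler, now, batch_size=10, visibility_timeout=30):
    """One iteration of a polling thread: receive a batch, invoke the
    handler synchronously, delete on success. On error, do nothing;
    the messages simply become visible again after the timeout."""
    batch = [m for m in queue if m.invisible_until <= now][:batch_size]
    if not batch:
        return "empty"
    for m in batch:
        m.receive_count += 1
        m.invisible_until = now + visibility_timeout
    try:
        handler([m.body for m in batch])   # synchronous Lambda invoke
    except Exception:
        return "error"                     # reappear after visibility timeout
    for m in batch:
        queue.remove(m)                    # success → delete from queue
    return "ok"
```

Note that a failed batch is never explicitly re-queued by the poller; it simply waits out the visibility timeout before it can be received again.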
Lambda Scaling Ramp-Up¶
When messages start arriving, Lambda reads up to 5 batches initially, then adds up to 60 more instances per minute until reaching the concurrency limit. This means a sudden burst of 10,000 messages won't immediately scale to 1,000 Lambdas — there's a ramp-up period where messages queue up.
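As a rough back-of-envelope, assuming the documented ramp-up of 5 initial concurrent batches plus up to 60 new instances per minute:

```python
def minutes_to_reach(target_concurrency, initial=5, per_minute=60):
    """Whole minutes of ramp-up before `target_concurrency` concurrent
    Lambda instances are available (5 initially, +60 per minute)."""
    if target_concurrency <= initial:
        return 0
    deficit = target_concurrency - initial
    return -(-deficit // per_minute)  # ceiling division

print(minutes_to_reach(1000))  # → 17
```

So a burst needing 1,000 concurrent instances sits behind roughly a quarter hour of ramp-up, with messages accumulating in the queue in the meantime.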
The Throttling → DLQ Trap¶
The most critical insight from both articles. When reserved concurrency is reached:

1. Polling threads continue to aggressively poll and attempt to invoke Lambda
2. Invocations are throttled (not queued)
3. Throttled messages become visible again with ReceiveCount incremented
4. Once ReceiveCount > maxReceiveCount, the message goes to the DLQ
5. Valid, unprocessed messages end up in the DLQ
This is NOT a failure of the message — it's a failure of capacity planning.
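The trap can be made concrete with a small simulation (an illustrative model, not the real SQS/Lambda internals): each polling round, only `capacity_per_round` invocations succeed, and every other receive counts as a throttled attempt that increments ReceiveCount.

```python
def simulate_burst(n_messages, capacity_per_round, max_receive_count):
    """Return (processed, sent_to_dlq) after draining a burst where only
    `capacity_per_round` invocations succeed per polling round and every
    other receive is a throttled attempt."""
    receive_counts = [0] * n_messages
    processed = dlq = 0
    pending = list(range(n_messages))
    while pending:
        survivors = []
        for position, msg in enumerate(pending):
            receive_counts[msg] += 1
            if position < capacity_per_round:
                processed += 1          # invoked successfully → deleted
            elif receive_counts[msg] > max_receive_count:
                dlq += 1                # valid message redriven to DLQ
            else:
                survivors.append(msg)   # reappears after visibility timeout
        pending = survivors
    return processed, dlq

print(simulate_burst(10, 2, 3))  # → (8, 2): two valid messages hit the DLQ
print(simulate_burst(10, 2, 5))  # → (10, 0): higher maxReceiveCount absorbs it
```

Nothing about the messages changed between the two runs; only the capacity headroom and the redrive threshold did.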
Three Approaches to Concurrency Control¶
| Approach | Mechanism | Throttling Risk | DLQ Risk |
|---|---|---|---|
| Reserved Concurrency | put_function_concurrency() | HIGH — polling threads still try to invoke aggressively | HIGH — throttled invocations increment ReceiveCount |
| MaximumConcurrency on ESM | ScalingConfig.MaximumConcurrency | LOW — the event source mapping itself limits invocations | LOW — messages stay in the queue naturally |
| FIFO Queue + Message Groups | Number of unique MessageGroupIds = max parallelism | NONE — only one invocation per message group | NONE — controlled at the polling level |
MaximumConcurrency (introduced late 2022) is the recommended approach for standard queues. It caps invocations at the event source mapping level, so polling threads never try to invoke more Lambdas than allowed. Messages stay safely in the queue until capacity frees up.
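A minimal sketch of setting the cap, assuming boto3. The helper name and placeholder values are ours; the `ScalingConfig` parameter and its minimum value of 2 are real.

```python
def esm_kwargs(function_name, queue_arn, max_concurrency, batch_size=10):
    """Build create_event_source_mapping kwargs with an ESM-level cap."""
    if max_concurrency < 2:
        raise ValueError("ScalingConfig.MaximumConcurrency minimum is 2")
    return {
        "FunctionName": function_name,
        "EventSourceArn": queue_arn,
        "BatchSize": batch_size,
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

# Hypothetical usage (ARN and names are placeholders):
# lambda_client.create_event_source_mapping(**esm_kwargs(
#     "DocumentLoader", "arn:aws:sqs:us-east-1:123456789012:doc-queue", 10))
```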
Visibility Timeout Formula¶
Article 1 formula: VisibilityTimeout >= Lambda_timeout + BatchWindow + 30s buffer
Article 2 (AWS docs): VisibilityTimeout >= 6 × Lambda_timeout
The second is more conservative. The first is more practical — the key is that visibility timeout must exceed the total time from poll-to-delete.
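Plugging in DocumentLoader's numbers (900s Lambda timeout, 60s batch window) shows the spread between the two formulas:

```python
def visibility_practical(lambda_timeout_s, batch_window_s, buffer_s=30):
    """Article 1: poll-to-delete time plus a safety buffer."""
    return lambda_timeout_s + batch_window_s + buffer_s

def visibility_conservative(lambda_timeout_s):
    """Article 2 / AWS docs rule of thumb: six times the function timeout."""
    return 6 * lambda_timeout_s

print(visibility_practical(900, 60))   # → 990
print(visibility_conservative(900))    # → 5400
```

Both results are well under the SQS maximum visibility timeout of 43,200s (12 hours).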
DLQ Placement¶
DLQ must be configured on the source SQS queue, not on the Lambda function. Lambda's DLQ only handles async invocations. SQS→Lambda event source mapping is synchronous, so Lambda's DLQ is never triggered.
Mapping to NGE Architecture¶
What We Do Right¶
- MaximumConcurrency on event source mapping — Our CDK config uses ScalingConfig.MaximumConcurrency, which is the recommended approach. This prevents aggressive throttling.
- ReportBatchItemFailures — Our batch Lambda (DocumentLoader) uses partial failure reporting, so one bad record doesn't cause the entire batch to retry.
- Visibility timeout = Lambda timeout (900s) — Both are 15 minutes, preventing duplicate delivery during normal processing.
- Structured exception hierarchy — PermanentFailures go directly to the DLQ; RecoverableExceptions requeue with backoff. This is more sophisticated than the articles describe.
- DLQ on source queue — Our CDK deploys the DLQ on the SQS queue, not on the Lambda.
Potential Issues Found¶
1. VALIDATED: Dual Concurrency Control (Correct Pattern)¶
Our dynamic Lambda creation sets BOTH:
- ScalingConfig.MaximumConcurrency on the event source mapping
- put_function_concurrency(ReservedConcurrentExecutions=concurrency) on the Lambda
From the lambda-sqs-integration pattern (lines 254-260):

```python
attach_event_source(lambda_client, sqs_client, new_lambda, queue_url, concurrency)
lambda_client.put_function_concurrency(
    FunctionName=new_lambda,
    ReservedConcurrentExecutions=concurrency,
)
```
This is the safest configuration. The two settings serve complementary purposes:

- MaximumConcurrency (ESM) = ceiling — prevents the event source mapping from over-invoking beyond capacity
- ReservedConcurrency (Lambda) = floor/guarantee — reserves account-level capacity so other Lambda functions can't starve this one
Without ReservedConcurrency, if other Lambdas in the account consumed all available concurrency, the ESM would try to invoke our Lambda and get throttled — causing the exact DLQ trap the articles describe.
Both are set to the same concurrency value, so the ESM never tries to invoke beyond what's reserved.
Key invariant to maintain: MaximumConcurrency <= ReservedConcurrency. If MaximumConcurrency ever exceeds ReservedConcurrency, the ESM will try to invoke more Lambdas than there are reserved slots, causing throttling.
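A cheap guard for this invariant in the provisioning path (the function name is ours, purely illustrative):

```python
def check_concurrency_invariant(maximum_concurrency, reserved_concurrency):
    """Fail fast if the ESM ceiling could exceed the reserved capacity."""
    if maximum_concurrency > reserved_concurrency:
        raise ValueError(
            f"MaximumConcurrency ({maximum_concurrency}) > "
            f"ReservedConcurrency ({reserved_concurrency}): throttling risk"
        )
```

Calling this right before attaching the event source mapping turns a silent misconfiguration into a loud deploy-time failure.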
The articles warn about using reserved concurrency ALONE (without MaximumConcurrency). That's the old approach where polling threads aggressively invoke past the reserved limit. Our dual approach avoids this entirely.
2. MEDIUM: Visibility Timeout Doesn't Account for Batch Window¶
For DocumentLoader (batch processing):

- Lambda timeout: 900s
- Batch window: 60s
- Current visibility timeout: 900s
- Should be: 900 + 60 + 30 = 990s minimum
The polling thread holds messages invisible while batching (60s) + while Lambda processes (up to 900s). If both hit their maximums, the total could exceed 900s.
Recommendation: Set visibility timeout to at least 990s, or round up to 1020s (17 minutes), for batch-processing queues.
3. LOW: maxReceiveCount Audit¶
Both articles recommend maxReceiveCount ≥ 5 to absorb transient throttling. Need to verify our current values across all SQS queues.
If maxReceiveCount is set too low (e.g., 3), even a brief throttling spike during burst loads would send valid messages to the DLQ.
Recommendation: Audit and ensure all NGE SQS queues have maxReceiveCount ≥ 5.
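The audit itself is mostly parsing each queue's RedrivePolicy attribute. The attribute format below is real (a JSON string, as returned by `sqs.get_queue_attributes`); the sample values are made up.

```python
import json

def audit_redrive_policy(redrive_policy_json, minimum=5):
    """Parse a queue's RedrivePolicy attribute and flag maxReceiveCount
    values below `minimum`. Returns (max_receive_count, ok)."""
    policy = json.loads(redrive_policy_json)
    count = int(policy["maxReceiveCount"])  # may be stored as a string
    return count, count >= minimum

sample = '{"deadLetterTargetArn": "arn:aws:sqs:us-east-1:123456789012:dlq", "maxReceiveCount": "3"}'
print(audit_redrive_policy(sample))  # → (3, False)
```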
4. INFO: FIFO Queue Opportunity for Per-Case Backpressure¶
The FIFO queue + message group pattern from Article 1 maps interestingly to our per-case processing model. If we used a FIFO queue with case_id as the MessageGroupId, AWS would naturally limit parallelism to one Lambda per case — providing built-in per-case backpressure without any application logic.
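A toy illustration of the cap (the case_id-to-group mapping is our hypothetical design, not something we have built):

```python
def max_parallelism(messages):
    """messages: list of (message_group_id, body) tuples. With a FIFO
    queue, at most one batch per group is in flight at a time, so
    effective parallelism equals the number of distinct group IDs."""
    return len({group_id for group_id, _ in messages})

backlog = [("case-1", "doc-a"), ("case-1", "doc-b"), ("case-2", "doc-c")]
print(max_parallelism(backlog))  # → 2
```

Three queued documents, but only two can process concurrently because two of them share a case: the backpressure falls out of the queue semantics, not application code.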
Trade-off: FIFO queues have lower throughput (300 messages per second without batching, 3,000 with batching, more in high-throughput mode) and higher cost. Our current model of per-batch dynamic queues may be more flexible.
Recommendation: Keep as reference for ADR-005 (bulk operations) where per-case ordering might matter.
New Backlog Items¶
| Item | Priority | Related Backlog |
|---|---|---|
| Increase batch-processing queue visibility timeout to 990s | MEDIUM | New |
| Audit maxReceiveCount across all NGE SQS queues (ensure >= 5) | MEDIUM | New |
| Evaluate FIFO queues for per-case backpressure in ADR-005 | LOW | #7 (per-case backpressure) |
Summary¶
Both articles converge on the same core lesson: reserved concurrency ALONE + SQS event source mapping creates a dangerous interaction where valid messages get throttled into the DLQ. Our approach — using BOTH MaximumConcurrency (ceiling) and ReservedConcurrency (floor guarantee) set to the same value — is the correct pattern that avoids this trap entirely. The remaining items are the batch window gap in visibility timeout (could cause rare duplicate processing under peak load) and a maxReceiveCount audit to ensure resilience during burst scenarios.