Article Review: Group 1 — SQS+Lambda Integration

Articles Reviewed

  1. Understanding Amazon SQS and AWS Lambda Event Source Mapping — Serverless Guru / AWS APN Blog (Feb 2023)
  2. Lessons learned from combining SQS and Lambda in a data project — Miia Niemelä / Solita (Jan 2020)

Key Concepts

How Event Source Mapping Works Internally

AWS manages up to 1,000 parallel polling threads per region. Each thread:

  1. Long-polls the SQS queue
  2. Receives a batch of messages
  3. Synchronously invokes the Lambda function
  4. On success → deletes the messages from the queue
  5. On error (non-throttle) → does nothing; messages reappear after the visibility timeout
  6. On throttle → retries a few times, then gives up; messages reappear with an incremented ReceiveCount

Lambda Scaling Ramp-Up

When messages start arriving, Lambda reads up to 5 batches initially, then adds up to 60 more instances per minute until reaching the concurrency limit. This means a sudden burst of 10,000 messages won't immediately scale to 1,000 Lambdas — there's a ramp-up period where messages queue up.
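As a back-of-envelope check, the ramp-up can be estimated with a small helper (a sketch; the function name is mine, it treats the 5 initial batches as roughly 5 initial instances, and it assumes the 60-instances-per-minute figure from the articles):

```python
import math

def ramp_up_minutes(target_concurrency, initial=5, per_minute=60):
    """Rough minutes for the event source mapping to scale from `initial`
    concurrent instances to the target, adding `per_minute` instances/minute."""
    if target_concurrency <= initial:
        return 0
    return math.ceil((target_concurrency - initial) / per_minute)
```

For a target of 1,000 concurrent instances this gives roughly 17 minutes of ramp-up, during which a burst of messages accumulates in the queue rather than being processed immediately.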

The Throttling → DLQ Trap

The most critical insight from both articles. When reserved concurrency is reached:

  1. Polling threads continue to aggressively poll and attempt to invoke Lambda
  2. Invocations are throttled (not queued)
  3. Throttled messages become visible again with ReceiveCount incremented
  4. Once ReceiveCount > maxReceiveCount, the message goes to the DLQ
  5. Valid, unprocessed messages end up in the DLQ

This is NOT a failure of the message — it's a failure of capacity planning.

Three Approaches to Concurrency Control

| Approach | Mechanism | Throttling Risk | DLQ Risk |
| --- | --- | --- | --- |
| Reserved Concurrency | `put_function_concurrency()` | HIGH — polling threads still try to invoke aggressively | HIGH — throttled invocations increment ReceiveCount |
| MaximumConcurrency on ESM | `ScalingConfig.MaximumConcurrency` | LOW — the event source mapping itself limits invocations | LOW — messages stay in the queue naturally |
| FIFO Queue + Message Groups | number of unique MessageGroupIds = max parallelism | NONE — only one invocation per message group | NONE — controlled at the polling level |

MaximumConcurrency (introduced late 2022) is the recommended approach for standard queues. It caps invocations at the event source mapping level, so polling threads never try to invoke more Lambdas than allowed. Messages stay safely in the queue until capacity frees up.
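A boto3 sketch of what setting this cap looks like (the helper names are mine; the 2–1000 bounds follow the documented limits for MaximumConcurrency):

```python
def esm_params(queue_arn, function_name, max_concurrency, batch_size=10):
    """Build CreateEventSourceMapping parameters with a concurrency ceiling."""
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    return {
        "EventSourceArn": queue_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        # The ceiling: the ESM itself never invokes past this value.
        "ScalingConfig": {"MaximumConcurrency": max_concurrency},
    }

def create_capped_mapping(queue_arn, function_name, max_concurrency):
    import boto3  # imported lazily so the parameter builder stays testable offline
    return boto3.client("lambda").create_event_source_mapping(
        **esm_params(queue_arn, function_name, max_concurrency)
    )
```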

Visibility Timeout Formula

Article 1 formula: VisibilityTimeout >= Lambda_timeout + BatchWindow + 30s buffer

Article 2 (AWS docs): VisibilityTimeout >= 6 × Lambda_timeout

The second is more conservative. The first is more practical — the key is that visibility timeout must exceed the total time from poll-to-delete.
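Both formulas are easy to encode side by side (a sketch; the helper name is mine):

```python
def visibility_timeouts(lambda_timeout_s, batch_window_s=0, buffer_s=30):
    """Return (practical, conservative) visibility timeouts for an SQS→Lambda queue."""
    practical = lambda_timeout_s + batch_window_s + buffer_s  # Article 1
    conservative = 6 * lambda_timeout_s                       # Article 2 / AWS docs
    return practical, conservative
```

For a 900s Lambda with a 60s batch window this returns (990, 5400), which is the spread between the two recommendations.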

DLQ Placement

DLQ must be configured on the source SQS queue, not on the Lambda function. Lambda's DLQ only handles async invocations. SQS→Lambda event source mapping is synchronous, so Lambda's DLQ is never triggered.
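A minimal boto3 sketch of wiring the DLQ onto the source queue (helper names are mine; note that RedrivePolicy is passed as a JSON string):

```python
import json

def redrive_policy(dlq_arn, max_receive_count=5):
    """RedrivePolicy attribute value for the *source* queue (a JSON string)."""
    return json.dumps({
        "deadLetterTargetArn": dlq_arn,
        "maxReceiveCount": str(max_receive_count),
    })

def attach_dlq(queue_url, dlq_arn, max_receive_count=5):
    import boto3  # imported lazily; only needed for the live AWS call
    boto3.client("sqs").set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"RedrivePolicy": redrive_policy(dlq_arn, max_receive_count)},
    )
```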

Mapping to NGE Architecture

What We Do Right

  1. MaximumConcurrency on event source mapping — Our CDK config uses ScalingConfig.MaximumConcurrency, which is the recommended approach. This prevents aggressive throttling.

  2. ReportBatchItemFailures — Our batch Lambda (DocumentLoader) uses partial failure reporting, so one bad record doesn't cause the entire batch to retry.

  3. Visibility timeout = Lambda timeout (900s) — Both are 15 minutes, preventing duplicate delivery during normal processing.

  4. Structured exception hierarchy — PermanentFailures go directly to DLQ; RecoverableExceptions requeue with backoff. This is more sophisticated than the articles describe.
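A partial-failure handler in the spirit of points 2 and 4 might look like this (a sketch: PermanentFailure and process are stand-ins for the real exception hierarchy and per-record work, and the "flaky"/"bad-forever" bodies are illustrative only):

```python
class PermanentFailure(Exception):
    """Input that retrying will never fix (stand-in for the real hierarchy)."""

def process(body):
    # Placeholder for the real per-record work.
    if body == "bad-forever":
        raise PermanentFailure(body)
    if body == "flaky":
        raise TimeoutError(body)

def handler(event, context=None):
    failures = []
    for record in event["Records"]:
        try:
            process(record["body"])
        except PermanentFailure:
            # Permanent: do NOT report as a failure, so SQS deletes it with the
            # batch; forward it to the DLQ/error store explicitly instead.
            pass
        except Exception:
            # Transient: report it, so only this record becomes visible again.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

With ReportBatchItemFailures enabled on the event source mapping, SQS only re-delivers the records listed in `batchItemFailures`.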

  5. DLQ on source queue — Our CDK deploys the DLQ on the SQS queue, not on the Lambda.

Potential Issues Found

1. VALIDATED: Dual Concurrency Control (Correct Pattern)

Our dynamic Lambda creation sets BOTH:

  - ScalingConfig.MaximumConcurrency on the event source mapping
  - put_function_concurrency(ReservedConcurrentExecutions=concurrency) on the Lambda

From the lambda-sqs-integration pattern (lines 254-260):

```python
# Ceiling: the event source mapping's MaximumConcurrency stops polling threads
# from invoking past capacity.
attach_event_source(lambda_client, sqs_client, new_lambda, queue_url, concurrency)

# Floor: reserve the same concurrency at the account level so other functions
# cannot starve this Lambda.
lambda_client.put_function_concurrency(
    FunctionName=new_lambda,
    ReservedConcurrentExecutions=concurrency,
)
```

This is the safest configuration. The two settings serve complementary purposes:

  - MaximumConcurrency (ESM) = ceiling — prevents the event source mapping from over-invoking beyond capacity
  - ReservedConcurrency (Lambda) = floor/guarantee — reserves account-level capacity so other Lambda functions can't starve this one

Without ReservedConcurrency, if other Lambdas in the account consumed all available concurrency, the ESM would try to invoke our Lambda and get throttled — causing the exact DLQ trap the articles describe.

Both are set to the same concurrency value, so the ESM never tries to invoke beyond what's reserved.

Key invariant to maintain: MaximumConcurrency <= ReservedConcurrency. If MaximumConcurrency ever exceeds ReservedConcurrency, the ESM will try to invoke more Lambdas than there are reserved slots, causing throttling.

The articles warn about using reserved concurrency ALONE (without MaximumConcurrency). That's the old approach where polling threads aggressively invoke past the reserved limit. Our dual approach avoids this entirely.

2. MEDIUM: Visibility Timeout Doesn't Account for Batch Window

For DocumentLoader (batch processing):

  - Lambda timeout: 900s
  - Batch window: 60s
  - Current visibility timeout: 900s
  - Should be: 900 + 60 + 30 = 990s minimum

The polling thread holds messages invisible while batching (60s) + while Lambda processes (up to 900s). If both hit their maximums, the total could exceed 900s.

Recommendation: Set visibility timeout to at least 990s for batch-processing queues; rounding up to 1020s (17 minutes) leaves extra margin.

3. LOW: maxReceiveCount Audit

Both articles recommend maxReceiveCount ≥ 5 to absorb transient throttling. Need to verify our current values across all SQS queues.

If maxReceiveCount is set too low (e.g., 3), even a brief throttling spike during burst loads would send valid messages to the DLQ.

Recommendation: Audit and ensure all NGE SQS queues have maxReceiveCount ≥ 5.
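The audit could be scripted with boto3 (a sketch; the client is injected so the check is testable offline, and queues with no RedrivePolicy at all are flagged too):

```python
import json

def audit_max_receive_count(sqs, minimum=5):
    """Return (queue_url, maxReceiveCount) pairs where the redrive policy is
    missing or below `minimum`. `sqs` is a boto3 SQS client.
    Note: list_queues returns at most 1,000 URLs; paginate for larger accounts."""
    flagged = []
    for queue_url in sqs.list_queues().get("QueueUrls", []):
        attrs = sqs.get_queue_attributes(
            QueueUrl=queue_url, AttributeNames=["RedrivePolicy"]
        ).get("Attributes", {})
        policy = attrs.get("RedrivePolicy")
        count = int(json.loads(policy)["maxReceiveCount"]) if policy else None
        if count is None or count < minimum:
            flagged.append((queue_url, count))
    return flagged
```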

4. INFO: FIFO Queue Opportunity for Per-Case Backpressure

The FIFO queue + message group pattern from Article 1 maps interestingly to our per-case processing model. If we used a FIFO queue with case_id as the MessageGroupId, AWS would naturally limit parallelism to one Lambda per case — providing built-in per-case backpressure without any application logic.

Trade-off: FIFO queues have lower throughput (300 TPS without high-throughput, 3,000 with) and higher cost. Our current model of per-batch dynamic queues may be more flexible.

Recommendation: Keep as reference for ADR-005 (bulk operations) where per-case ordering might matter.
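For reference, the per-case pattern only requires setting MessageGroupId on send (a sketch; the function name is mine, and it assumes a .fifo queue with ContentBasedDeduplication enabled, otherwise a MessageDeduplicationId must also be supplied):

```python
def send_case_event(sqs, queue_url, case_id, body):
    """Send to a FIFO queue keyed by case: at most one batch is in flight per
    case_id, so the number of distinct case_ids caps Lambda parallelism."""
    return sqs.send_message(
        QueueUrl=queue_url,           # must end in ".fifo"
        MessageBody=body,
        MessageGroupId=str(case_id),  # one in-flight batch per case
    )
```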

New Backlog Items

| Item | Priority | Related Backlog |
| --- | --- | --- |
| Increase batch-processing queue visibility timeout to 990s | MEDIUM | New |
| Audit maxReceiveCount across all NGE SQS queues (ensure >= 5) | MEDIUM | New |
| Evaluate FIFO queues for per-case backpressure in ADR-005 | LOW | #7 (per-case backpressure) |

Summary

Both articles converge on the same core lesson: reserved concurrency ALONE + SQS event source mapping creates a dangerous interaction where valid messages get throttled into the DLQ. Our approach — using BOTH MaximumConcurrency (ceiling) and ReservedConcurrency (floor guarantee) set to the same value — is the correct pattern that avoids this trap entirely. The remaining items are the batch window gap in visibility timeout (could cause rare duplicate processing under peak load) and a maxReceiveCount audit to ensure resilience during burst scenarios.
