# Article Review: Groups 9-13 — AWS Deep Dive Audit

## Approach
Rather than reviewing 12 AWS reference docs (312M) cover-to-cover, this audit extracted specific sections relevant to our patterns, backlog items, and architectural decisions. Findings are mapped directly to our existing patterns and BACKLOG.md items.
## Sources Consulted
| Doc | Size | Sections Extracted |
|---|---|---|
| AWS SQS Deep Dive (Joud W. Awad) | 10M | Visibility timeout, heartbeat, DLQ/redrive, FIFO, MaximumConcurrency |
| AWS SQS+Lambda Deep Dive (Hiziroglou) | 25M | Event source mapping, batch window, concurrency |
| AWS SNS+EventBridge Deep Dive (Hiziroglou) | 24M | Filter policies, message attributes, fan-out |
| AWS Aurora Deep Dive (Joud W. Awad) | 33M | Locking, multi-tenant, connection pooling, RDS Proxy, Serverless v2 |
| AWS ECS Deep Dive (Joud W. Awad) | 18M | Health checks, task scaling, sidecar patterns |
| Lambda Developer Guide (AWS) | 26M | TOC/structure reference |
| Aurora User Guide (AWS) | 26M | TOC/structure reference |
| SQS Developer Guide (AWS) | 5M | TOC/structure reference |
## Audit Results by Pattern / Backlog Item

### 1. Visibility Timeout (BACKLOG: Increase to 990s)
Our pattern: VisibilityTimeout = 900s = Lambda timeout (15 min)
AWS recommendation (SQS Deep Dive):
"Setting the visibility timeout depends on how long it takes your application to process and delete a message."
Two strategies:
- Known processing time: set the visibility timeout to the maximum processing time
- Unknown processing time: use a heartbeat — set the initial timeout short (e.g., 2 min), then keep extending by 2 min every minute while still processing
Audit finding: Our 900s matches the Lambda timeout, which is correct for the "known time" approach. However, for batch-processing Lambdas with a 60s batch window, the AWS formula from the APN blog (Group 1) — Lambda_timeout + BatchWindow + 30s = 990s — is more precise.
Verdict: BACKLOG item confirmed. Increase to 990s for batch-processing queues.
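The APN formula can be expressed as a tiny helper (constant names are illustrative, not from our codebase):

```python
# Illustrative constants for the APN formula: Lambda timeout + batch window + buffer.
LAMBDA_TIMEOUT_S = 900   # our Lambda timeout (15 min)
BATCH_WINDOW_S = 60      # ESM maximum batching window
BUFFER_S = 30            # safety margin from the APN blog

def queue_visibility_timeout(lambda_timeout: int = LAMBDA_TIMEOUT_S,
                             batch_window: int = BATCH_WINDOW_S,
                             buffer: int = BUFFER_S) -> int:
    """Visibility timeout for a batch-processing SQS queue."""
    return lambda_timeout + batch_window + buffer

queue_visibility_timeout()  # 900 + 60 + 30 = 990
```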
### 2. Heartbeat Extension (BACKLOG: documentloader checkpoint pipeline)
Our pattern: No heartbeat. Fixed 900s visibility timeout.
AWS recommendation (SQS Deep Dive):
"Create a heartbeat for your consumer process: Specify the initial visibility timeout (for example, 2 minutes) and then — as long as your consumer still works on the message — keep extending the visibility timeout by 2 minutes every minute."
This is directly applicable to our checkpoint pipeline where long-running document processing could exceed the visibility timeout.
Verdict: BACKLOG item confirmed with specific implementation guidance. Use ChangeMessageVisibility API to extend timeout during checkpoint processing. Pattern: initial 2 min timeout, extend by 2 min every minute while processing.
### 3. DLQ and ApproximateAgeOfOldestMessage (BACKLOG: AgeOfFirstAttempt metric)
AWS warning (SQS Deep Dive):
"Including a poison pill message in a queue can distort the ApproximateAgeOfOldestMessage CloudWatch metric by giving an incorrect age of the poison pill message. Configuring a dead-letter queue helps avoid false alarms when using this metric."
Audit finding: Our backlog item to add AgeOfFirstAttempt metric is validated. The AWS metric ApproximateAgeOfOldestMessage can be distorted by stuck messages. A custom metric tracking age at first processing attempt would be more reliable. Additionally, the DLQ must be properly configured to prevent poison pills from corrupting this metric.
Verdict: BACKLOG item confirmed. Also ensure the DLQ is properly draining poison pills.
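One way to compute the custom metric, as a sketch: use the SQS system attributes `SentTimestamp` and `ApproximateReceiveCount` (which the consumer must request when receiving) and only emit on the first delivery, so a stuck message cannot inflate the value the way it inflates ApproximateAgeOfOldestMessage. The function name is illustrative:

```python
import time

def age_of_first_attempt_seconds(message, now=None):
    """Age of an SQS message at its FIRST delivery; None on retries.

    message is a received-message dict with the SentTimestamp (epoch ms)
    and ApproximateReceiveCount system attributes.
    """
    attrs = message["Attributes"]
    if int(attrs["ApproximateReceiveCount"]) != 1:
        return None  # retry: skip, so poison pills cannot distort the metric
    now = time.time() if now is None else now
    return now - int(attrs["SentTimestamp"]) / 1000.0
```

The returned value could then be published via CloudWatch `PutMetricData` as a custom AgeOfFirstAttempt metric.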
### 4. MaximumConcurrency (Pattern: lambda-sqs-integration.md)
AWS warning (SQS Deep Dive):
"Maximum Concurrency: this is a very critical option that you have to specify carefully especially in a standard queue as it affects how your messages are processed, specifying this to a high number may cause your consumers to compete over messages and also to consume your lambda concurrent limit per region in your account."
Audit finding: Validates our dual concurrency control pattern (ESM MaximumConcurrency + Lambda ReservedConcurrency). The doc confirms that MaximumConcurrency is the primary knob for SQS→Lambda scaling.
Verdict: Pattern VALIDATED. No changes needed.
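For reference, the knob lives in the event source mapping's ScalingConfig; a small validated builder (helper name is ours, not AWS's; the 2–1000 range is the documented bound for SQS event source mappings):

```python
def esm_scaling_config(max_concurrency: int) -> dict:
    """ScalingConfig payload for an SQS event source mapping.

    AWS accepts MaximumConcurrency values from 2 to 1000; per our dual
    concurrency control pattern, keep it at or below the function's
    reserved concurrency so scaling does not hit throttles.
    """
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    return {"ScalingConfig": {"MaximumConcurrency": max_concurrency}}

# Usage sketch:
#   boto3.client("lambda").update_event_source_mapping(
#       UUID=esm_uuid, **esm_scaling_config(10))
```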
### 5. SELECT...FOR UPDATE on Aurora (BACKLOG: Bates fencing token)
AWS Aurora Deep Dive finding:
"Write Forwarding does not support: SELECT … FOR UPDATE queries that require row locking"
Audit finding: CRITICAL for ADR-006. Our Bates stamping plan uses SELECT...FOR UPDATE on BatesPattern for atomic number increment. This MUST execute on the primary writer instance, NOT through reader endpoints or write forwarding. If our Lambda connects to a reader endpoint (for general reads), the Bates increment must explicitly use the writer endpoint.
Verdict: New BACKLOG item. Ensure the Bates stamp Lambda uses the writer endpoint for SELECT...FOR UPDATE. Add this as a requirement in ADR-006.
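A hypothetical routing guard, to make the requirement concrete (hostnames, table, and column names are illustrative, not our actual schema):

```python
# Illustrative Aurora endpoints: the cluster (writer) endpoint vs the
# read-only endpoint. Write forwarding rejects SELECT ... FOR UPDATE,
# so locking reads must never be routed through a reader.
WRITER_HOST = "cluster.cluster-XXXX.us-east-1.rds.amazonaws.com"
READER_HOST = "cluster.cluster-ro-XXXX.us-east-1.rds.amazonaws.com"

# Illustrative Bates increment statement (run inside a transaction that
# also performs the UPDATE, so the row lock makes the increment atomic).
BATES_INCREMENT_SQL = (
    "SELECT next_number FROM bates_pattern WHERE id = %(id)s FOR UPDATE"
)

def endpoint_for(sql: str) -> str:
    """Route any locking read to the writer; plain reads may use readers."""
    return WRITER_HOST if "for update" in sql.lower() else READER_HOST
```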
### 6. Aurora Connection Pooling / RDS Proxy (Pattern: database-session.md)
Our pattern: In-Lambda engine cache with MAX_CACHE_SIZE = 50 and LRU eviction.
AWS Aurora Deep Dive recommends:
"You can set up the connection between your Lambda function and your DB cluster through RDS Proxy to improve your database performance... connections that benefit from the connection pooling that RDS Proxy offers."
"Connection pooling simplifies your application logic. You don't need to write application code to minimize the number of simultaneous open connections."
"This technique also reduces the chance of 'too many connections' errors."
Benefits of RDS Proxy over in-Lambda caching:
- Connection reuse across Lambda invocations (not just within the same execution environment)
- Automatic connection health checks
- Multiplexing — shares connections across multiple Lambda invocations
- Pinning only when needed (e.g., session-level variables)
Audit finding: Our in-Lambda engine cache works but is suboptimal. Each Lambda execution environment maintains its own connection cache. With dynamic Lambda cloning (per-batch), we could have N batches × concurrency × 1 connection = many connections to Aurora. RDS Proxy would pool these.
Verdict: New potential BACKLOG item. Evaluate RDS Proxy for NGE Lambda→Aurora connections. Would reduce connection pressure and simplify our engine cache. However, adds latency (~1ms) and cost. Best evaluated when connection pressure becomes a problem.
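For comparison, our current in-Lambda pattern can be sketched roughly as below (assuming engine objects expose `dispose()` to close pooled connections, as SQLAlchemy engines do; class and method names are illustrative). The key limitation is visible in the code: the cache lives inside one execution environment, so pooling never crosses environments the way RDS Proxy does.

```python
from collections import OrderedDict

MAX_CACHE_SIZE = 50

class EngineCache:
    """Per-execution-environment LRU cache of DB engines, keyed by URL."""

    def __init__(self, factory, max_size=MAX_CACHE_SIZE):
        self._factory = factory        # callable: url -> engine
        self._max = max_size
        self._cache = OrderedDict()    # insertion order tracks recency

    def get(self, url):
        if url in self._cache:
            self._cache.move_to_end(url)   # mark most recently used
            return self._cache[url]
        engine = self._cache[url] = self._factory(url)
        if len(self._cache) > self._max:
            _, evicted = self._cache.popitem(last=False)  # least recently used
            evicted.dispose()              # close its pooled connections
        return engine
```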
### 7. Aurora Serverless v2 for Multi-Tenant (Pattern: ADR-003)
AWS Aurora Deep Dive:
"Multi-Tenant Applications: In scenarios where multiple tenants share database resources, Aurora Serverless v2 manages individual database capacity automatically, allowing each tenant's cluster to scale based on their specific activity levels."
Audit finding: Aurora Serverless v2 would auto-scale our Aurora cluster based on workload from different cases. Currently we use provisioned Aurora. For burst workloads (large imports), Serverless v2's ability to scale in 0.5 ACU increments could reduce cost while handling peaks.
Verdict: Informational. Worth evaluating when Aurora costs or scaling become an issue. Not urgent — provisioned Aurora works fine for current scale.
### 8. FIFO Queues for Per-Case Ordering (BACKLOG: per-case backpressure)
AWS SQS Deep Dive:
"FIFO Queue is known as First-in-first-out delivery... Exactly-Once Processing — A message is delivered once and remains..."
"Amazon SQS uses the value of each message's message group ID as input to an internal hash function."
"Amazon SQS is optimized for uniform distribution of items across a FIFO queue's partitions... message group IDs that can have a large number of distinct values."
Audit finding: Using case_id as MessageGroupId would provide per-case ordering AND natural backpressure (one processing thread per case). With many distinct case_ids, SQS optimally distributes across partitions. This validates the Group 1 recommendation.
Verdict: BACKLOG item validated. FIFO with case_id as MessageGroupId is a viable per-case backpressure mechanism. Trade-off: FIFO caps throughput at 300 TPS in standard mode versus 3,000 TPS with high-throughput mode enabled.
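A send-side sketch of the pattern (helper name is ours; the kwargs match SQS `SendMessage` for FIFO queues). MessageGroupId serializes processing within a case while distinct cases proceed in parallel; the content-based deduplication id guards against duplicate sends (alternatively, enable ContentBasedDeduplication on the queue):

```python
import hashlib
import json

def fifo_send_kwargs(queue_url, case_id, payload):
    """Build SendMessage kwargs for per-case ordering on a FIFO queue."""
    body = json.dumps(payload, sort_keys=True)  # canonical form for dedup
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": str(case_id),           # one ordered stream per case
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# Usage sketch: boto3.client("sqs").send_message(**fifo_send_kwargs(url, 42, msg))
```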
### 9. ECS Health Checks (Pattern: ecs-long-running-workloads.md)
AWS ECS Deep Dive:
"When a health check is defined in a task definition, the container runs the health check process inside the container... The health check consists of: Command, Interval, Timeout, Retries..."
Audit finding: Our ECS modules (documentexporter, documentextractor, documentuploader) should define container health checks in task definitions. This ensures ECS replaces unhealthy containers automatically.
Verdict: Informational. Verify health checks are defined in CDK task definitions for all ECS modules.
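The task-definition shape for reference, as a small builder (defaults are ECS's documented defaults; units are seconds; the helper name and the health endpoint in the usage line are illustrative):

```python
def container_health_check(command, interval=30, timeout=5, retries=3,
                           start_period=60):
    """healthCheck block for an ECS task definition container.

    CMD-SHELL runs the command inside the container; exit 0 = healthy.
    """
    return {
        "command": ["CMD-SHELL", command],
        "interval": interval,        # seconds between checks
        "timeout": timeout,          # seconds before a check is failed
        "retries": retries,          # consecutive failures before UNHEALTHY
        "startPeriod": start_period, # grace period for slow-starting containers
    }

# Usage sketch for a module exposing an HTTP health endpoint:
#   container_health_check("curl -f http://localhost:8080/health || exit 1")
```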
## Summary: Pattern Audit Results
| Pattern/Item | AWS Doc Finding | Verdict |
|---|---|---|
| Visibility timeout 900s | Correct for known time; add 90s for batch window | Confirmed: increase to 990s |
| Heartbeat extension | AWS explicitly recommends for unknown processing time | Confirmed: implement in checkpoint pipeline |
| AgeOfFirstAttempt metric | ApproximateAgeOfOldestMessage distorted by poison pills | Confirmed: custom metric needed |
| Dual concurrency control | MaximumConcurrency is "very critical" — our dual pattern is correct | Validated |
| SELECT...FOR UPDATE | NOT supported through Aurora write forwarding | New finding: must use writer endpoint |
| In-Lambda connection cache | RDS Proxy recommended for Lambda→Aurora | Evaluate RDS Proxy |
| Aurora Serverless v2 | Auto-scales for multi-tenant | Informational |
| FIFO for per-case backpressure | MessageGroupId = case_id provides natural ordering | Validated |
| ECS health checks | Should be defined in task definitions | Verify in CDK |
## New/Updated BACKLOG Items
| Item | Priority | Source |
|---|---|---|
| Ensure Bates stamp Lambda uses writer endpoint for SELECT...FOR UPDATE | HIGH | Aurora doc: write forwarding limitation |
| Evaluate RDS Proxy for Lambda→Aurora connection pooling | LOW | Aurora doc: connection pooling recommendation |
| Verify ECS health checks in CDK task definitions | LOW | ECS doc: container health check pattern |