# Article Review: Groups 9-13 — AWS Deep Dive Audit

## Approach
Rather than reviewing 12 AWS reference docs (312M) cover-to-cover, this audit extracted specific sections relevant to our patterns, backlog items, and architectural decisions. Findings are mapped directly to our existing patterns and BACKLOG.md items.
## Sources Consulted
| Doc | Size | Sections Extracted |
|---|---|---|
| AWS SQS Deep Dive (Joud W. Awad) | 10M | Visibility timeout, heartbeat, DLQ/redrive, FIFO, MaximumConcurrency |
| AWS SQS+Lambda Deep Dive (Hiziroglou) | 25M | Event source mapping, batch window, concurrency |
| AWS SNS+EventBridge Deep Dive (Hiziroglou) | 24M | Filter policies, message attributes, fan-out |
| AWS Aurora Deep Dive (Joud W. Awad) | 33M | Locking, multi-tenant, connection pooling, RDS Proxy, Serverless v2 |
| AWS ECS Deep Dive (Joud W. Awad) | 18M | Health checks, task scaling, sidecar patterns |
| Lambda Developer Guide (AWS) | 26M | TOC/structure reference |
| Aurora User Guide (AWS) | 26M | TOC/structure reference |
| SQS Developer Guide (AWS) | 5M | TOC/structure reference |
## Audit Results by Pattern / Backlog Item

### 1. Visibility Timeout (BACKLOG: Increase to 990s)
Our pattern: VisibilityTimeout = 900s = Lambda timeout (15 min)
AWS recommendation (SQS Deep Dive):
"Setting the visibility timeout depends on how long it takes your application to process and delete a message."
Two strategies:
- Known processing time: set the visibility timeout to the maximum processing time
- Unknown processing time: use a heartbeat — set the initial timeout short (e.g., 2 min), then keep extending by 2 min every minute while still processing
Audit finding: Our 900s matches the Lambda timeout, which is correct for the "known time" approach. However, for batch-processing Lambdas with a 60s batch window, the AWS formula from the APN blog (Group 1) — Lambda_timeout + BatchWindow + 30s = 990s — is more precise.
Verdict: BACKLOG item confirmed. Increase to 990s for batch-processing queues.
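The APN formula can be expressed as a tiny helper (constant names are illustrative, not from our codebase):

```python
# Illustrative constants for the APN formula: Lambda timeout + batch window + buffer.
LAMBDA_TIMEOUT_S = 900   # our Lambda timeout (15 min)
BATCH_WINDOW_S = 60      # ESM maximum batching window
BUFFER_S = 30            # safety margin from the APN blog

def queue_visibility_timeout(lambda_timeout: int = LAMBDA_TIMEOUT_S,
                             batch_window: int = BATCH_WINDOW_S,
                             buffer: int = BUFFER_S) -> int:
    """Visibility timeout for a batch-processing SQS queue."""
    return lambda_timeout + batch_window + buffer

queue_visibility_timeout()  # 900 + 60 + 30 = 990
```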
### 2. Heartbeat Extension (BACKLOG: documentloader checkpoint pipeline)
Our pattern: No heartbeat. Fixed 900s visibility timeout.
AWS recommendation (SQS Deep Dive):
"Create a heartbeat for your consumer process: Specify the initial visibility timeout (for example, 2 minutes) and then — as long as your consumer still works on the message — keep extending the visibility timeout by 2 minutes every minute."
This is directly applicable to our checkpoint pipeline where long-running document processing could exceed the visibility timeout.
Verdict: BACKLOG item confirmed with specific implementation guidance. Use ChangeMessageVisibility API to extend timeout during checkpoint processing. Pattern: initial 2 min timeout, extend by 2 min every minute while processing.
### 3. DLQ and ApproximateAgeOfOldestMessage (BACKLOG: AgeOfFirstAttempt metric)
AWS warning (SQS Deep Dive):
"Including a poison pill message in a queue can distort the ApproximateAgeOfOldestMessage CloudWatch metric by giving an incorrect age of the poison pill message. Configuring a dead-letter queue helps avoid false alarms when using this metric."
Audit finding: Our backlog item to add AgeOfFirstAttempt metric is validated. The AWS metric ApproximateAgeOfOldestMessage can be distorted by stuck messages. A custom metric tracking age at first processing attempt would be more reliable. Additionally, the DLQ must be properly configured to prevent poison pills from corrupting this metric.
Verdict: BACKLOG item confirmed. Also ensure the DLQ is properly draining poison pills.
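One way to compute the custom metric, as a sketch: use the SQS system attributes `SentTimestamp` and `ApproximateReceiveCount` (which the consumer must request when receiving) and only emit on the first delivery, so a stuck message cannot inflate the value the way it inflates ApproximateAgeOfOldestMessage. The function name is illustrative:

```python
import time

def age_of_first_attempt_seconds(message, now=None):
    """Age of an SQS message at its FIRST delivery; None on retries.

    message is a received-message dict with the SentTimestamp (epoch ms)
    and ApproximateReceiveCount system attributes.
    """
    attrs = message["Attributes"]
    if int(attrs["ApproximateReceiveCount"]) != 1:
        return None  # retry: skip, so poison pills cannot distort the metric
    now = time.time() if now is None else now
    return now - int(attrs["SentTimestamp"]) / 1000.0
```

The returned value could then be published via CloudWatch `PutMetricData` as a custom AgeOfFirstAttempt metric.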
### 4. MaximumConcurrency (Pattern: lambda-sqs-integration.md)
AWS warning (SQS Deep Dive):
"Maximum Concurrency: this is a very critical option that you have to specify carefully especially in a standard queue as it affects how your messages are processed, specifying this to a high number may cause your consumers to compete over messages and also to consume your lambda concurrent limit per region in your account."
Audit finding: Validates our dual concurrency control pattern (ESM MaximumConcurrency + Lambda ReservedConcurrency). The doc confirms that MaximumConcurrency is the primary knob for SQS→Lambda scaling.
Verdict: Pattern VALIDATED. No changes needed.
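For reference, the knob lives in the event source mapping's ScalingConfig; a small validated builder (helper name is ours, not AWS's; the 2–1000 range is the documented bound for SQS event source mappings):

```python
def esm_scaling_config(max_concurrency: int) -> dict:
    """ScalingConfig payload for an SQS event source mapping.

    AWS accepts MaximumConcurrency values from 2 to 1000; per our dual
    concurrency control pattern, keep it at or below the function's
    reserved concurrency so scaling does not hit throttles.
    """
    if not 2 <= max_concurrency <= 1000:
        raise ValueError("MaximumConcurrency must be between 2 and 1000")
    return {"ScalingConfig": {"MaximumConcurrency": max_concurrency}}

# Usage sketch:
#   boto3.client("lambda").update_event_source_mapping(
#       UUID=esm_uuid, **esm_scaling_config(10))
```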
### 5. SELECT...FOR UPDATE on Aurora (BACKLOG: Bates fencing token)
AWS Aurora Deep Dive finding:
"Write Forwarding does not support: SELECT … FOR UPDATE queries that require row locking"
Audit finding: CRITICAL for ADR-006. Our Bates stamping plan uses SELECT...FOR UPDATE on BatesPattern for atomic number increment. This MUST execute on the primary writer instance, NOT through reader endpoints or write forwarding. If our Lambda connects to a reader endpoint (for general reads), the Bates increment must explicitly use the writer endpoint.
Verdict: New BACKLOG item. Ensure the Bates stamp Lambda uses the writer endpoint for SELECT...FOR UPDATE. Add this as a requirement in ADR-006.
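A hypothetical routing guard, to make the requirement concrete (hostnames, table, and column names are illustrative, not our actual schema):

```python
# Illustrative Aurora endpoints: the cluster (writer) endpoint vs the
# read-only endpoint. Write forwarding rejects SELECT ... FOR UPDATE,
# so locking reads must never be routed through a reader.
WRITER_HOST = "cluster.cluster-XXXX.us-east-1.rds.amazonaws.com"
READER_HOST = "cluster.cluster-ro-XXXX.us-east-1.rds.amazonaws.com"

# Illustrative Bates increment statement (run inside a transaction that
# also performs the UPDATE, so the row lock makes the increment atomic).
BATES_INCREMENT_SQL = (
    "SELECT next_number FROM bates_pattern WHERE id = %(id)s FOR UPDATE"
)

def endpoint_for(sql: str) -> str:
    """Route any locking read to the writer; plain reads may use readers."""
    return WRITER_HOST if "for update" in sql.lower() else READER_HOST
```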
### 6. Aurora Connection Pooling / RDS Proxy (Pattern: database-session.md)
Our pattern: In-Lambda engine cache with MAX_CACHE_SIZE = 50 and LRU eviction.
AWS Aurora Deep Dive recommends:
"You can set up the connection between your Lambda function and your DB cluster through RDS Proxy to improve your database performance... connections that benefit from the connection pooling that RDS Proxy offers."
"Connection pooling simplifies your application logic. You don't need to write application code to minimize the number of simultaneous open connections."
"This technique also reduces the chance of 'too many connections' errors."
Benefits of RDS Proxy over in-Lambda caching:
- Connection reuse across Lambda invocations (not just within the same execution environment)
- Automatic connection health checks
- Multiplexing — shares connections across multiple Lambda invocations
- Pinning only when needed (e.g., session-level variables)
Audit finding: Our in-Lambda engine cache works but is suboptimal. Each Lambda execution environment maintains its own connection cache. With dynamic Lambda cloning (per-batch), we could have N batches × concurrency × 1 connection = many connections to Aurora. RDS Proxy would pool these.
Verdict: New potential BACKLOG item. Evaluate RDS Proxy for NGE Lambda→Aurora connections. Would reduce connection pressure and simplify our engine cache. However, adds latency (~1ms) and cost. Best evaluated when connection pressure becomes a problem.
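For comparison, our current in-Lambda pattern can be sketched roughly as below (assuming engine objects expose `dispose()` to close pooled connections, as SQLAlchemy engines do; class and method names are illustrative). The key limitation is visible in the code: the cache lives inside one execution environment, so pooling never crosses environments the way RDS Proxy does.

```python
from collections import OrderedDict

MAX_CACHE_SIZE = 50

class EngineCache:
    """Per-execution-environment LRU cache of DB engines, keyed by URL."""

    def __init__(self, factory, max_size=MAX_CACHE_SIZE):
        self._factory = factory        # callable: url -> engine
        self._max = max_size
        self._cache = OrderedDict()    # insertion order tracks recency

    def get(self, url):
        if url in self._cache:
            self._cache.move_to_end(url)   # mark most recently used
            return self._cache[url]
        engine = self._cache[url] = self._factory(url)
        if len(self._cache) > self._max:
            _, evicted = self._cache.popitem(last=False)  # least recently used
            evicted.dispose()              # close its pooled connections
        return engine
```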
### 7. Aurora Serverless v2 for Multi-Tenant (Pattern: ADR-003)
AWS Aurora Deep Dive:
"Multi-Tenant Applications: In scenarios where multiple tenants share database resources, Aurora Serverless v2 manages individual database capacity automatically, allowing each tenant's cluster to scale based on their specific activity levels."
Audit finding: Aurora Serverless v2 would auto-scale our Aurora cluster based on workload from different cases. Currently we use provisioned Aurora. For burst workloads (large imports), Serverless v2's ability to scale in 0.5 ACU increments could reduce cost while handling peaks.
Verdict: Informational. Worth evaluating when Aurora costs or scaling become an issue. Not urgent — provisioned Aurora works fine for current scale.
### 8. FIFO Queues for Per-Case Ordering (BACKLOG: per-case backpressure)
AWS SQS Deep Dive:
"FIFO Queue is known as First-in-first-out delivery... Exactly-Once Processing — A message is delivered once and remains..."
"Amazon SQS uses the value of each message's message group ID as input to an internal hash function."
"Amazon SQS is optimized for uniform distribution of items across a FIFO queue's partitions... message group IDs that can have a large number of distinct values."
Audit finding: Using case_id as MessageGroupId would provide per-case ordering AND natural backpressure (one processing thread per case). With many distinct case_ids, SQS optimally distributes across partitions. This validates the Group 1 recommendation.
Verdict: BACKLOG item validated. FIFO with case_id as MessageGroupId is a viable per-case backpressure mechanism. Trade-off: FIFO caps throughput at 300 TPS in standard mode versus 3,000 TPS with high-throughput mode enabled.
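A send-side sketch of the pattern (helper name is ours; the kwargs match SQS `SendMessage` for FIFO queues). MessageGroupId serializes processing within a case while distinct cases proceed in parallel; the content-based deduplication id guards against duplicate sends (alternatively, enable ContentBasedDeduplication on the queue):

```python
import hashlib
import json

def fifo_send_kwargs(queue_url, case_id, payload):
    """Build SendMessage kwargs for per-case ordering on a FIFO queue."""
    body = json.dumps(payload, sort_keys=True)  # canonical form for dedup
    return {
        "QueueUrl": queue_url,
        "MessageBody": body,
        "MessageGroupId": str(case_id),           # one ordered stream per case
        "MessageDeduplicationId": hashlib.sha256(body.encode()).hexdigest(),
    }

# Usage sketch: boto3.client("sqs").send_message(**fifo_send_kwargs(url, 42, msg))
```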
### 9. ECS Health Checks (Pattern: ecs-long-running-workloads.md)
AWS ECS Deep Dive:
"When a health check is defined in a task definition, the container runs the health check process inside the container... The health check consists of: Command, Interval, Timeout, Retries..."
Audit finding: Our ECS modules (documentexporter, documentextractor, documentuploader) should define container health checks in task definitions. This ensures ECS replaces unhealthy containers automatically.
Verdict: Informational. Verify health checks are defined in CDK task definitions for all ECS modules.
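The task-definition shape for reference, as a small builder (defaults are ECS's documented defaults; units are seconds; the helper name and the health endpoint in the usage line are illustrative):

```python
def container_health_check(command, interval=30, timeout=5, retries=3,
                           start_period=60):
    """healthCheck block for an ECS task definition container.

    CMD-SHELL runs the command inside the container; exit 0 = healthy.
    """
    return {
        "command": ["CMD-SHELL", command],
        "interval": interval,        # seconds between checks
        "timeout": timeout,          # seconds before a check is failed
        "retries": retries,          # consecutive failures before UNHEALTHY
        "startPeriod": start_period, # grace period for slow-starting containers
    }

# Usage sketch for a module exposing an HTTP health endpoint:
#   container_health_check("curl -f http://localhost:8080/health || exit 1")
```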
## Summary: Pattern Audit Results
| Pattern/Item | AWS Doc Finding | Verdict |
|---|---|---|
| Visibility timeout 900s | Correct for known time; add 90s for batch window | Confirmed: increase to 990s |
| Heartbeat extension | AWS explicitly recommends for unknown processing time | Confirmed: implement in checkpoint pipeline |
| AgeOfFirstAttempt metric | ApproximateAgeOfOldestMessage distorted by poison pills | Confirmed: custom metric needed |
| Dual concurrency control | MaximumConcurrency is "very critical" — our dual pattern is correct | Validated |
| SELECT...FOR UPDATE | NOT supported through Aurora write forwarding | New finding: must use writer endpoint |
| In-Lambda connection cache | RDS Proxy recommended for Lambda→Aurora | Evaluate RDS Proxy |
| Aurora Serverless v2 | Auto-scales for multi-tenant | Informational |
| FIFO for per-case backpressure | MessageGroupId = case_id provides natural ordering | Validated |
| ECS health checks | Should be defined in task definitions | Verify in CDK |
## New/Updated BACKLOG Items
| Item | Priority | Source |
|---|---|---|
| Ensure Bates stamp Lambda uses writer endpoint for SELECT...FOR UPDATE | HIGH | Aurora doc: write forwarding limitation |
| Evaluate RDS Proxy for Lambda→Aurora connection pooling | LOW | Aurora doc: connection pooling recommendation |
| Verify ECS health checks in CDK task definitions | LOW | ECS doc: container health check pattern |