Reference Implementation: documentexchanger¶
Document batch transfer service that copies documents (with metadata, annotations, custodians, bates stamps, confidentiality codes, issues, and OCR text) between eDiscovery cases. Dynamically provisions processor Lambdas and SQS queues per exchange job, orchestrates bulk database ETL via AWS Glue, and tears down resources on completion.
Architecture Overview¶
documentexchanger transfers documents from a source case to a destination case. An "exchange" involves: copying files on S3, copying database records via Glue ETL, optionally processing PDF annotations via Nutrient, re-OCR of attachment pages, and uploading to the destination case's document engine.
The module uses a dynamic infrastructure orchestration pattern similar to documentuploader — creating and destroying processor Lambdas and SQS queues per exchange job at runtime. However, it differs by using Lambda processors (not ECS) and AWS Glue for bulk database operations (not per-record Lambda).
Why Dynamic Infrastructure?¶
| Concern | Static (pre-created) | Dynamic (per-exchange) |
|---|---|---|
| Queue isolation | Shared queue, filter by attributes | Dedicated queue per exchange job |
| Lambda scaling | Shared concurrency pool | Per-exchange concurrency isolation |
| Cleanup | Manual resource tracking | Delete queue/Lambda = full cleanup |
| Conditional processing | All queues always exist | Only create queues needed for this exchange |
Architecture Tree¶
documentexchanger/
├── lib/
│ ├── global/
│ │ ├── lambda/
│ │ │ ├── orchestrator/ # Main orchestration Lambda
│ │ │ │ ├── index.ts # Dual entry: API Gateway + SQS
│ │ │ │ ├── types.ts # ExchangeRequest, ExchangeContext interfaces
│ │ │ │ ├── event-publisher.ts # SNS event construction + publishing
│ │ │ │ ├── event-processor.ts # Lifecycle event handler (de-orchestration)
│ │ │ │ ├── queue-manager.ts # Dynamic SQS queue creation/deletion
│ │ │ │ ├── lambda-manager.ts # Dynamic Lambda creation/deletion
│ │ │ │ ├── step-function-trigger.ts # Glue job orchestration via Step Functions
│ │ │ │ ├── uploader-coordinator.ts # Nutrient uploader messaging (circuit breaker)
│ │ │ │ ├── resource-reconstructor.ts # Rebuild queue/Lambda ARNs from naming
│ │ │ │ ├── sns-helper.ts # SNS publishing utilities
│ │ │ │ └── sqs-helper.ts # SQS queue utilities
│ │ │ ├── processors/
│ │ │ │ ├── s3_copy_processor/ # S3 server-side document copy
│ │ │ │ ├── pre_uploader_processor/ # PDF annotation processing via Nutrient
│ │ │ │ ├── re_ocr_processor/ # Re-OCR attachment pages via Nutrient
│ │ │ │ ├── search_text_processor/ # Search indexing (stub)
│ │ │ │ └── shared/
│ │ │ │ ├── logger.ts # Winston structured JSON logging
│ │ │ │ ├── error-handler.ts # SNS error publishing with processor mapping
│ │ │ │ └── shared-module.ts # Max receive count retry logic
│ │ │ └── exchange-status-update/ # DB status update Lambda
│ │ ├── global-dlq-processor/ # Global DLQ handler
│ │ ├── orchestrator-stack.ts # CDK: orchestrator Lambda + layer
│ │ ├── sqs-stack.ts # CDK: global SQS queues + SNS subscriptions
│ │ ├── stepfunction-stack.ts # CDK: Glue job state machine
│ │ ├── glue-job-stack.ts # CDK: Glue ETL job
│ │ ├── global-lambda-stack.ts # CDK: status update + DLQ Lambdas
│ │ ├── apigateway-stack.ts # CDK: REST API for exchange requests
│ │ ├── roles-stack.ts # CDK: IAM roles
│ │ └── proxy-stack.ts # CDK: RDS proxy
│ ├── glue/
│ │ └── script.py # PySpark ETL: bulk DB copy between cases
│ ├── helpers/
│ │ ├── config-helper.ts # Environment × region config
│ │ └── name-helper.ts # Resource naming conventions
│ └── constructs/
│ └── global-queue.ts # Reusable SQS + DLQ construct
├── config.json # Per-environment settings
├── pipeline/ # CDK pipeline for Glue script
└── test/ # Jest CDK + processor tests
Language Stack¶
| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Orchestrator | TypeScript | Lambda (Node.js) | Exchange lifecycle management |
| S3 Copy Processor | TypeScript | Lambda (Node.js) | Server-side S3 document copy |
| Pre-Uploader Processor | TypeScript | Lambda (Node.js) | PDF annotation processing via Nutrient |
| Re-OCR Processor | TypeScript | Lambda (Node.js) | Text extraction from Nutrient pages |
| Search Text Processor | TypeScript | Lambda (Node.js) | Search indexing (stub) |
| Status Update | TypeScript | Lambda (Node.js) | DB status field updates |
| DLQ Processor | TypeScript | Lambda (Node.js) | Error event publishing |
| Glue ETL Script | Python (PySpark) | AWS Glue | Bulk database record copy |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |
All Lambda processors share a common Lambda Layer compiled from the
processors/shared/ TypeScript code.
Pattern Mapping¶
Patterns Implemented¶
| Architecture Pattern | documentexchanger Implementation | Notes |
|---|---|---|
| SNS Event Publishing | event-publisher.ts — typed event builders | TypeScript, same event structure (source, jobId, caseId, etc.) |
| SQS Handler | Orchestrator processes from global SQS queue | Standard parse → route → handle flow |
| Retry & Resilience | SharedModule.handleErrors() — max receive count check | 3 retries before DLQ, similar to standard pattern |
| Job Processor Orchestration | Orchestrator creates queues + Lambdas per exchange | Dynamic infrastructure like documentloader/documentuploader |
| Idempotent Handlers | Resource reconstruction from naming conventions | No persistent state needed — ARNs rebuilt from deterministic names |
| Config Management | config.json environment × region matrix | Same pattern as other modules |
| CDK Infrastructure | Multiple stacks (orchestrator, SQS, Step Functions, Glue, API GW) | Standard multi-stack composition |
| SNS Filter Policies | Per-queue filters (eventType, caseId, batchId) | Standard attribute-based filtering |
| Structured Logging | Winston JSON logger with context fields | caseId, batchId, documentId, exhibitIds |
| Multi-Tenancy | Per-case database naming: {DB_NAME_PREFIX}_{case_id} | Standard multi-tenant pattern |
Patterns NOT Used¶
| Pattern | Why Not |
|---|---|
| Hexagonal core/shell | Orchestration-focused module — no domain logic to separate |
| Checkpoint Pipeline | No multi-step per-document processing; each processor is single-action |
| Exception Hierarchy (Python) | TypeScript module — uses max-receive-count + DLQ instead |
| SQLAlchemy ORM | Database operations via PySpark Glue job, not Lambda |
| Elasticsearch Bulk Indexing | Search text processor is a stub |
| @retry_on_db_conflict | No MySQL writes from Lambda; bulk writes via Glue |
Event Flow¶
Exchange Lifecycle¶
API Gateway POST /exchange
│
▼
Orchestrator Lambda (sync)
├── Create SQS queues (conditional: 2-5 based on annotation needs)
├── Create processor Lambdas (with event source mappings)
├── Apply resource-based policies
├── Send message to uploader orchestrator
├── Trigger Step Function (Glue job)
└── Publish EXCHANGER_INITIATED
Step Function
├── RunGlueJob (PySpark: copy DB records source→dest)
│ retry: 3 attempts, 30s interval, 2.0 backoff
├── DB_COPY_FINISHED → InvokePayloadLambda
│ (generates per-document messages for processor queues)
└── PAYLOAD_CONSTRUCTION_FINISHED → SNS
Processor Lambdas (parallel, per-document)
├── S3 Copy: CopyObject source→dest S3
├── Pre-Uploader: Fetch/flatten annotations via Nutrient
├── Re-OCR: Extract text from Nutrient pages
└── Search Text: Index for search (stub)
De-Orchestration (triggered by PAYLOAD_CONSTRUCTION_FINISHED)
├── Check queue depths (re-enqueue with 90s delay if not empty)
├── Delete processor Lambdas
├── Delete per-exchange SQS queues
└── Publish EXCHANGER_FINISHED
Inbound Events (Orchestrator Consumes)¶
| Event | Source | Action |
|---|---|---|
| API Gateway POST | Nextpoint backend | Create infrastructure, start exchange |
| PAYLOAD_CONSTRUCTION_FINISHED | Step Function | Begin de-orchestration |
| EXCHANGER_DELETE_INITIATED | Self (re-enqueued) | Continue de-orchestration (check queues) |
| GLUE_JOB_FAILED | Step Function | Cleanup and publish EXCHANGER_FAILED |
Outbound Events (Published to SNS)¶
| Event | Source | Downstream |
|---|---|---|
| EXCHANGER_INITIATED | Orchestrator | Status update Lambda |
| DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| NATIVE_DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| DOCUMENT_NOT_FOUND | S3 Copy / Pre-Uploader | Error tracking |
| DOCUMENT_PRE_PROCESSED | Pre-Uploader Processor | Monitoring |
| DB_COPY_FINISHED | Step Function | Payload construction |
| EXCHANGER_DELETE_INITIATED | Event Processor | Self (de-orchestration loop) |
| EXCHANGER_FINISHED | Event Processor | Status update Lambda |
| EXCHANGER_FAILED | Error Handler | Status update Lambda |
Key Design Decisions¶
1. Dual Entry Point (API Gateway + SQS)¶
Decision: Single Lambda handler processes both synchronous API requests and asynchronous SQS events.
API Gateway POST → Orchestrator (creates infrastructure, returns jobId)
SQS events → Orchestrator (lifecycle management: cleanup, retry)
Why:
- API Gateway provides a synchronous request/response path for the initial exchange trigger
- SQS provides asynchronous retry for lifecycle events (de-orchestration)
- A single handler avoids duplicating orchestration logic
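A minimal sketch of this dual-entry routing. The names handleApiRequest and handleLifecycleEvent are hypothetical stand-ins for the real orchestration functions:

```typescript
// Simplified event shapes; real Lambda event types carry many more fields.
interface ApiGatewayLikeEvent { httpMethod: string; body: string | null; }
interface SqsLikeEvent { Records: { body: string }[]; }

// Stubs standing in for the actual orchestration logic (hypothetical names).
async function handleLifecycleEvent(message: unknown): Promise<void> {
  // de-orchestration / retry routing would live here
}
async function handleApiRequest(request: unknown): Promise<string> {
  return "job-123"; // stub: create infrastructure, return jobId
}

function isSqsEvent(event: ApiGatewayLikeEvent | SqsLikeEvent): event is SqsLikeEvent {
  return Array.isArray((event as SqsLikeEvent).Records);
}

async function handler(event: ApiGatewayLikeEvent | SqsLikeEvent) {
  if (isSqsEvent(event)) {
    // Asynchronous path: lifecycle events (cleanup, retries)
    for (const record of event.Records) {
      await handleLifecycleEvent(JSON.parse(record.body));
    }
    return;
  }
  // Synchronous path: API Gateway POST /exchange returns jobId to the caller
  const request = JSON.parse(event.body ?? "{}");
  return { statusCode: 200, body: JSON.stringify({ jobId: await handleApiRequest(request) }) };
}
```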
2. Conditional Queue/Lambda Creation¶
Decision: Only create the queues and Lambdas needed for a specific exchange
based on requiresAnnotations and includeNatives flags.
| Configuration | Queues Created | Lambdas Created |
|---|---|---|
| No annotations | s3_copy, doc_uploader | S3 Copy |
| Annotations only | pre_uploader, re_ocr, search_text, doc_uploader | Pre-Uploader, Re-OCR, Search Text |
| Annotations + natives | s3_copy, pre_uploader, re_ocr, search_text, doc_uploader | S3 Copy, Pre-Uploader, Re-OCR, Search Text |
Why: Minimizes resource creation overhead and avoids idle queues/Lambdas. An exchange without annotations doesn't need Nutrient processing infrastructure.
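The table above can be expressed as a small selection function. The flag and queue-type names follow the table, but the function itself is illustrative, not the actual code:

```typescript
type QueueType = "s3_copy" | "pre_uploader" | "re_ocr" | "search_text" | "doc_uploader";

// Decide which per-exchange queues to provision based on the request flags.
function queuesForExchange(requiresAnnotations: boolean, includeNatives: boolean): QueueType[] {
  const queues: QueueType[] = [];
  // s3_copy is needed when there are no annotations, or when natives are included
  if (!requiresAnnotations || includeNatives) queues.push("s3_copy");
  // Nutrient processing chain is only needed for annotation exchanges
  if (requiresAnnotations) queues.push("pre_uploader", "re_ocr", "search_text");
  queues.push("doc_uploader"); // always needed for destination ingestion
  return queues;
}
```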
3. Resource Reconstruction (Stateless Cleanup)¶
Decision: Rebuild queue URLs and Lambda ARNs from deterministic naming conventions rather than persisting them in a database.
// ResourceReconstructor rebuilds from naming pattern:
// Queue: doc_exch_{type}_queue_{caseId}_{batchId}
// Lambda: doc_exch_{type}_lambda_{caseId}_{batchId}
const queueUrl = `https://sqs.${region}.amazonaws.com/${accountId}/${queueName}`;
const lambdaArn = `arn:aws:lambda:${region}:${accountId}:function:${functionName}`;
Why: Eliminates need for persistent state storage (DynamoDB/SSM). If the orchestrator Lambda restarts mid-lifecycle, it can reconstruct all resource references from the exchange request parameters alone.
4. Circuit Breaker for Uploader Communication¶
Decision: UploaderCoordinator implements a circuit breaker pattern when messaging the Nutrient uploader orchestrator.
CLOSED state (normal):
→ Send messages to uploader queue
→ On failure: increment failure counter
→ If failures >= 5: transition to OPEN
OPEN state (circuit tripped):
→ Reject messages immediately (don't attempt send)
→ After 30s timeout: transition to HALF_OPEN
HALF_OPEN state (testing):
→ Allow one test message
→ If success: transition to CLOSED
→ If failure: transition back to OPEN
Why: The Nutrient uploader is a separate module with its own queue and scaling constraints. If it's overwhelmed or down, the circuit breaker prevents the exchanger from flooding it with messages and consuming its own Lambda concurrency on retries.
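A minimal circuit-breaker sketch matching the states above (threshold 5 failures, 30-second cooldown). The real UploaderCoordinator wraps SQS sends; here the action is injected, and the clock is injectable for testing:

```typescript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly threshold = 5,
    private readonly cooldownMs = 30_000,
    private readonly now: () => number = Date.now,
  ) {}

  async call<T>(send: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      // Reject immediately until the cooldown elapses, then allow one probe.
      if (this.now() - this.openedAt < this.cooldownMs) throw new Error("circuit open");
      this.state = "HALF_OPEN";
    }
    try {
      const result = await send();
      this.state = "CLOSED"; // success closes the circuit and resets the counter
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe, or reaching the threshold, trips the circuit open.
      if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```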
5. AWS Glue for Bulk Database Copy¶
Decision: Use PySpark Glue job for bulk database ETL rather than per-record Lambda processing.
Glue Job (script.py):
Source DB: {DB_NAME_PREFIX}_{source_case_id}
Target DB: {DB_NAME_PREFIX}_{destination_case_id}
Tables copied: documents, document_variants, exhibits,
label_category_labels, custom_field_values, custodians,
custodian_exhibits, confidentiality_codes, confidentiality_logs,
relevancy_logs, folder_tags, folder_taggings, label_categories, etc.
Connection: JDBC via RDS proxy (writer + reader endpoints)
Optimizations:
- ThreadPoolExecutor (BFS_MAX_WORKERS=8) for parallel child table processing
- Thread-safe column caching (_columns_cache with lock)
- JDBC fetch size tuning (spark.sql.jdbc.fetchSize=10000)
- _execute_with_reconnect() for connection resilience
- delete_rows_in_batches() for safe bulk deletions
Post-copy operations:
- Clear document_properties (NULL) on exchanged exhibits
- Assign source exhibit ID as number field on target exhibits
- Remap confidentiality_status bitwise to target case codes
- Deduplicate label categories and folder tags by name
Why: Copying thousands of database records per-document via Lambda would be slow and expensive. Glue's Spark engine handles bulk JDBC reads/writes efficiently, and the Glue job runs independently of the Lambda timeout limit.
6. Step Functions for Glue Orchestration¶
Decision: Use Step Functions to orchestrate the Glue job → payload construction → notification sequence.
RunGlueJob (retry: 3 attempts, 30s interval, 2.0 backoff)
→ Success: DB_COPY_FINISHED → InvokePayloadLambda
→ Failure: GLUE_JOB_FAILED → SNS notification
Why: Glue jobs can run for minutes to hours. Step Functions provides native retry/backoff, failure handling, and SNS notifications without requiring a polling Lambda.
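For reference, the retry policy above (30s interval, 2.0 backoff rate, 3 attempts) produces exponentially growing waits before each retry. A sketch of the formula:

```typescript
// Wait before retry attempt n (1-based) is intervalSeconds * backoffRate^(n-1),
// matching how a Step Functions Retry block spaces its attempts.
function retryWaitsSeconds(intervalSeconds: number, backoffRate: number, maxAttempts: number): number[] {
  return Array.from({ length: maxAttempts }, (_, n) => intervalSeconds * Math.pow(backoffRate, n));
}
// retryWaitsSeconds(30, 2.0, 3) → [30, 60, 120]
```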
Processor Details¶
S3 Copy Processor¶
Copies document files from source case S3 to destination case S3:
Receive SQS message (documentId, sourcePath, destPath)
→ S3.CopyObject (server-side copy, no data transfer through Lambda)
→ If NoSuchKey: publish DOCUMENT_NOT_FOUND, skip
→ On success: publish DOCUMENT_COPIED
→ Send message to uploader queue (for Nutrient ingestion)
Server-side copy avoids downloading/uploading files through Lambda memory.
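A sketch of how the server-side copy request could be built from an SQS message. The message field names are assumptions, not the actual schema:

```typescript
// Hypothetical message shape for illustration.
interface CopyMessage {
  sourceBucket: string;
  sourcePath: string;
  destBucket: string;
  destPath: string;
}

// Builds CopyObject input: S3 performs the copy server-side, so the object
// bytes never pass through the Lambda. The S3 API expects the source key
// portion of CopySource to be URL-encoded.
function buildCopyParams(msg: CopyMessage) {
  return {
    CopySource: `${msg.sourceBucket}/${encodeURIComponent(msg.sourcePath)}`,
    Bucket: msg.destBucket,
    Key: msg.destPath,
  };
}
```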
Pre-Uploader Processor (Annotation Processing)¶
Handles PDF annotation transfer via Nutrient API. Supports 7 annotation types, each configurable as "brand_in" (flatten into PDF) or "layer" (overlay JSON):
| Annotation Type | Nutrient Type | Routing Config | Notes |
|---|---|---|---|
| Redactions | pspdfkit/markup/redact | redactionsAs | Applied first (ordering matters for overlap correctness) |
| Highlights | pspdfkit/markup/highlight | highlightsAs | Standard markup type |
| Notes | pspdfkit/note | notesAs | Standard markup type |
| Stickers | pspdfkit/image | stickersAs | Collects binary image attachments via collectImageAttachments() |
| Bates stamps | pspdfkit/text:bates_stamp | batesAs | Virtual type with :bates_stamp suffix for disambiguation |
| Confidentiality codes | pspdfkit/text:confidentiality_stamp | confidentialityCodesAs | Virtual type with :confidentiality_stamp suffix |
| Comments | highlight with isCommentThreadRoot | Always layer | Fetches thread details, writes to S3 separately |
Receive SQS message (documentId, exhibitId, annotationTypes)
→ Check S3 source PDF exists (skip with {skipped: true} if not)
→ Construct Nutrient document ID: document_{caseId}_{batchId}_{documentId}_{exhibitId}
→ Fetch annotations from Nutrient API
→ For "brand_in": Apply redactions first, then other types (ordering for overlap correctness)
→ For "layer": Upload JSON annotation data (stickers include image binary data)
→ For "comments": Fetch comment thread details, write to S3
→ On success: publish DOCUMENT_PRE_PROCESSED
→ Send message to uploader queue
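The "redactions first" rule for brand-in processing can be sketched as a stable partition, with the annotation shape simplified to just its Nutrient type string:

```typescript
interface Annotation { type: string; }

// Order annotations so redactions are applied before any other markup; relative
// order within each group is preserved (stable partition). This keeps overlapping
// markup layered on top of already-redacted content, never the reverse.
function orderForBrandIn(annotations: Annotation[]): Annotation[] {
  const isRedaction = (a: Annotation) => a.type === "pspdfkit/markup/redact";
  return [...annotations.filter(isRedaction), ...annotations.filter(a => !isRedaction(a))];
}
```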
Re-OCR Processor¶
Extracts text from Nutrient document pages:
Receive SQS message (documentId, exhibitId, attachmentPosition)
→ Call Nutrient API: GET /documents/{id}/pages/{position}/text
→ Upload text to S3: case_{caseId}/processed-files/{batchId}/txt/{documentId}/attachments/{position}/content.txt
→ Send message to search_text queue
Uses token caching for Nutrient API efficiency across messages in a batch.
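The destination key template above, written out as a hypothetical helper:

```typescript
// Builds the S3 key for extracted attachment text, following the path
// template documented for the Re-OCR processor.
function reOcrTextKey(caseId: number, batchId: number, documentId: number, position: number): string {
  return `case_${caseId}/processed-files/${batchId}/txt/${documentId}/attachments/${position}/content.txt`;
}
```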
De-Orchestration (Cleanup)¶
When PAYLOAD_CONSTRUCTION_FINISHED arrives, the orchestrator begins teardown:
1. Publish EXCHANGER_DELETE_INITIATED (starts cleanup loop)
2. Check all per-exchange queue depths
→ If any queue has messages: re-enqueue with 90s delay
→ If all queues empty: proceed to deletion
3. Delete event source mappings (Lambda ← SQS)
4. Delete processor Lambdas
5. Delete per-exchange SQS queues
6. Publish EXCHANGER_FINISHED
90-second delay allows in-flight messages to complete processing before queue deletion. The self-referencing event loop (similar to documentuploader) provides natural polling without Step Functions.
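The cleanup-loop decision (step 2 above) can be sketched as a pure function; names are illustrative:

```typescript
interface CleanupDecision {
  action: "DELETE_RESOURCES" | "RE_ENQUEUE";
  delaySeconds?: number;
}

// If any per-exchange queue still has messages, re-enqueue the delete event
// with a delay; only when all queues are drained do we proceed to deletion.
function decideCleanup(queueDepths: Record<string, number>, delaySeconds = 90): CleanupDecision {
  const pending = Object.values(queueDepths).some(depth => depth > 0);
  return pending ? { action: "RE_ENQUEUE", delaySeconds } : { action: "DELETE_RESOURCES" };
}
```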
Error Handling¶
Max receive count pattern (TypeScript equivalent of exception hierarchy):
| Receive Count | Action |
|---|---|
| < MAX_RECEIVE_COUNT (3) | Throw error → SQS requeues with visibility timeout |
| ≥ MAX_RECEIVE_COUNT (3) | Publish error to SNS, don't throw (message consumed) |
Error-to-event mapping:
| Processor | Error Event Type |
|---|---|
| s3_copy_processor | S3_DOC_COPIED (status: ERROR) |
| pre_uploader_processor | PRE_UPLOADER_FINISHED (status: ERROR) |
| orchestrator | EXCHANGER_FAILED |
Global DLQ: All per-exchange queues use a shared global DLQ. The DLQ processor publishes error events to SNS for monitoring.
SNS Filter Policies¶
| Queue | Filter |
|---|---|
| Orchestrator (exch-orch) | eventType: [PAYLOAD_CONSTRUCTION_FINISHED, EXCHANGER_DELETE_INITIATED, GLUE_JOB_FAILED] |
| Status Update (exch-status) | eventType: [EXCHANGER_FINISHED, EXCHANGER_INITIATED] |
| Re-OCR (per-exchange) | eventType: [DOCUMENT_UPLOADED], caseId: [{destCaseId}], batchId: [{batchId}] |
Divergences from Standard Architecture¶
| Standard Pattern | documentexchanger Divergence | Reason |
|---|---|---|
| Python 3.10+ | TypeScript (all Lambdas) + Python (Glue only) | TypeScript preferred for orchestration logic |
| Hexagonal core/shell | No separation — orchestration + infrastructure mixed | No domain logic; pure orchestration and S3/API operations |
| Exception hierarchy | Max-receive-count + DLQ in TypeScript | No RecoverableException/PermanentFailureException in TS |
| SQLAlchemy ORM | PySpark DataFrames via Glue | Bulk ETL, not per-record CRUD |
| Multi-tenant MySQL (Lambda) | Glue job reads/writes per-case databases | Lambda doesn't touch database directly |
| Checkpoint pipeline | N/A — single-action per processor message | No multi-step per-document processing |
| ECS long-running tasks | Lambda processors + Glue job | Operations fit within Lambda timeout |
Multi-Region Deployment¶
Same c2/c4/c5 naming convention as other modules:
| Region | Prefix | SNS Topic | Staging Bucket |
|---|---|---|---|
| us-east-1 | c2 | Per-region topic ARN | Per-region S3 bucket |
| us-west-1 | c4 | Per-region topic ARN | Per-region S3 bucket |
| ca-central-1 | c5 | Per-region topic ARN | Per-region S3 bucket |
Pre-Deployment Architectural Review¶
Status: Build phase — not yet deployed. The following issues were identified during Principal Architect review and must be addressed before production deployment.
P0 — Pre-Deployment Blockers¶
1. Idempotency Violations (Architecture Rule: All handlers MUST be idempotent)¶
Orchestrator handler is NOT idempotent. If the API Gateway handler fails after creating queues/Lambdas but before returning success, a retry creates duplicate resources. No deduplication check or request ID tracking exists.
// Current: generates new jobId every call — retries create new exchanges
const jobId = randomUUID();
// Required: check if exchange already initiated for this request
// Option A: Accept client-provided idempotency key
// Option B: DynamoDB conditional put on (sourceCaseId, destCaseId, labelId, batchId)
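One way to implement Option B's deterministic key, sketched under the assumption that the four request fields uniquely identify an exchange:

```typescript
import { createHash } from "node:crypto";

// Hash the request identity so a client retry maps to the same DynamoDB
// partition key instead of minting a fresh jobId. Field names mirror
// Option B above; the helper itself is hypothetical.
function idempotencyKey(sourceCaseId: number, destCaseId: number, labelId: number, batchId: number): string {
  return createHash("sha256")
    .update(`${sourceCaseId}:${destCaseId}:${labelId}:${batchId}`)
    .digest("hex");
}
// A DynamoDB PutItem with ConditionExpression "attribute_not_exists(pk)" on this
// key then rejects duplicate initiations instead of creating a second exchange.
```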
SQS message processing lacks partial batch failure handling. Line 217
throws errors immediately, causing the entire batch to fail and retry from
the beginning. Architecture requires {"batchItemFailures": [...]} for
individual record failures.
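A sketch of the required partial-batch shape (processOne is injected; this is the pattern, not the module's code):

```typescript
interface SqsRecord { messageId: string; body: string; }

// Process each record independently and report only the failed messageIds.
// With ReportBatchItemFailures enabled, SQS redelivers only those records
// instead of the entire batch.
async function handleBatch(
  records: SqsRecord[],
  processOne: (record: SqsRecord) => Promise<void>,
): Promise<{ batchItemFailures: { itemIdentifier: string }[] }> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      await processOne(record);
    } catch {
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```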
2. Race Condition in De-Orchestration Cleanup¶
Critical: The cleanup checks areAllQueuesEmpty() after a 90-second
delay, then deletes queues. But messages can arrive AFTER the check but
BEFORE deletion:
Time 0s: EXCHANGER_DELETE_INITIATED received
Time 90s: Check queue depth → 0 messages → "safe to delete"
Time 91s: New message arrives in queue (from slow processor)
Time 92s: Queue deleted → message LOST, processor Lambda errors
Fix: Cleanup must proceed in this order:
1. Remove SNS subscriptions (stop new messages)
2. Disable event source mappings (stop Lambda polling)
3. Wait for in-flight Lambda invocations to complete
4. Verify queues empty
5. Delete Lambdas
6. Delete queues
3. Missing Exception Type Hierarchy¶
The codebase uses generic throw error everywhere instead of the architecture's
exception hierarchy. All processors use SharedModule.handleErrors() which
only checks receive count — it doesn't distinguish between:
- Recoverable (Nutrient API timeout → retry)
- Permanent (document genuinely missing → DLQ immediately)
- Silent (already processed → delete message, log success)
Current behavior: All errors get 3 retries regardless of whether retrying will help.
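A minimal TypeScript equivalent of the missing hierarchy; the class names mirror the three categories above and are a proposal, not existing code:

```typescript
// Transient failure (e.g. Nutrient API timeout): retrying may help.
class RecoverableError extends Error {}
// Permanent failure (e.g. document genuinely missing): send to DLQ immediately.
class PermanentError extends Error {}
// Already processed: consume the message and log success.
class SilentError extends Error {}

// Route an error to the appropriate disposition instead of blindly retrying.
function retryDecision(err: Error): "RETRY" | "DLQ" | "CONSUME" {
  if (err instanceof SilentError) return "CONSUME";
  if (err instanceof PermanentError) return "DLQ";
  return "RETRY"; // RecoverableError and unknown errors default to retry
}
```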
4. No Transactional State Tracking¶
The orchestrator performs 6 steps (create queues → create Lambdas → apply policies → send uploader message → trigger Step Function → publish event). If any step fails after step 2, partial resources exist with no compensation or rollback.
Required: DynamoDB state table tracking orchestration progress:
PK: exchangeJobId
status: INITIATING | QUEUES_CREATED | LAMBDAS_CREATED | POLICIES_APPLIED |
STEP_FUNCTION_TRIGGERED | INITIATED | DE_ORCHESTRATING | FINISHED | FAILED
queueUrls: [...], lambdaArns: [...]
createdAt, updatedAt
This also eliminates the fragile resource reconstruction pattern (rebuilding ARNs from naming conventions that could silently generate wrong names if conventions change).
5. Hardcoded Secrets¶
Architecture rule: No secrets in code — use AWS Secrets Manager. Even in non-prod, hardcoded secrets in source code are a security violation. Use Secrets Manager for all environments.
P1 — High Priority (Pre-GA)¶
6. Overly Permissive IAM Policies¶
// orchestrator-stack.ts — WILDCARD PERMISSIONS
'sqs:*', // Should be: sqs:CreateQueue, sqs:DeleteQueue, sqs:SendMessage, etc.
'lambda:*', // Should be: lambda:CreateFunction, lambda:DeleteFunction, etc.
'ec2:*', // Should be: ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, etc.
Violates least-privilege principle. Scope to specific actions and resource ARN
patterns (e.g., arn:aws:sqs:*:*:doc_exch_*).
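A scoped statement might look like the following (the action list and ARN pattern are illustrative, not a vetted policy):

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "sqs:SendMessage",
    "sqs:GetQueueAttributes"
  ],
  "Resource": "arn:aws:sqs:*:*:doc_exch_*"
}
```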
7. Inconsistent Logging¶
Mixed console.log() / console.warn() with Winston logger.INFO() /
logger.ERROR() across the orchestrator. Architecture requires structured
JSON logging with consistent context fields (caseId, batchId, jobId,
documentId).
Fix: Replace all console.* calls with the Winston logger from
processors/shared/logger.ts.
8. Missing Tests¶
| Component | Test Status |
|---|---|
| Orchestrator (index.ts) | ❌ No tests |
| Event Processor | ❌ No tests |
| Queue Manager | ❌ No tests |
| Lambda Manager | ❌ No tests |
| S3 Copy Processor | ⚠️ 3 basic tests only |
| Pre-Uploader Processor | ❌ No tests |
| Re-OCR Processor | ❌ No tests |
| Search Text Processor | ❌ No tests |
| Status Update Lambda | ❌ No tests |
| Glue Script | ❌ No tests |
| CDK Stacks | ⚠️ Placeholder only |
Architecture rule: Every handler test must verify idempotent behavior. None of the existing tests verify idempotency.
9. Fragile Lambda ZIP Construction¶
LambdaManager hand-rolls ZIP file format with manual CRC32 calculation and
binary buffer construction instead of using a proper ZIP library. This is
extremely fragile — binary format bugs will cause silent Lambda creation
failures.
Fix: Use archiver or jszip npm package.
10. Glue Script Lacks Safety Guarantees¶
# script.py — No rollback if job fails mid-copy
# No idempotency checks for already-copied data
# No validation of secret format
# No transaction safety on bulk writes
If the Glue job fails halfway through copying tables, partial data exists in the destination database with no way to identify what was copied versus what wasn't. Needs one of:
- Idempotent writes (INSERT IGNORE / ON DUPLICATE KEY UPDATE)
- A transaction wrapper with rollback
- Checkpoint tracking per table
P2 — Medium Priority¶
11. Hardcoded Configuration Values¶
| Value | Location | Should Be |
|---|---|---|
| 15 min Lambda timeout | orchestrator-stack.ts | Environment variable |
| 20000 bulk update size | search_text_processor | Environment variable |
| 1000 SQS batch size | lambda-manager.ts | Config per environment |
| 300s batch window | lambda-manager.ts | Config per environment |
| 90s cleanup delay | event-processor.ts | Config per environment |
12. Token Initialization Race Condition¶
PreUploaderProcessor stores Nutrient API token as instance variable
initialized async in init(). If two SQS messages arrive rapidly on a warm
Lambda, both call init() concurrently — race condition on token state.
Fix: Use module-level cached token (like re_ocr_processor does correctly
with let cachedToken).
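A sketch of that fix: cache the promise rather than the resolved token, so concurrent warm invocations share one in-flight fetch (fetchToken stands in for the Nutrient auth call):

```typescript
// Module-level cache: survives across invocations on a warm Lambda.
let cachedToken: Promise<string> | undefined;

function getToken(fetchToken: () => Promise<string>): Promise<string> {
  if (!cachedToken) {
    cachedToken = fetchToken().catch(err => {
      cachedToken = undefined; // don't cache failures; next call retries
      throw err;
    });
  }
  return cachedToken;
}
```

Because the promise is stored synchronously, two messages arriving back-to-back both await the same fetch instead of racing to initialize instance state.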
13. Connection Pool Staleness¶
SearchTextProcessor caches a MySQL connection pool at module level but
never validates or refreshes it. Long-running warm Lambdas may use stale
connections that have been terminated by RDS proxy.
Fix: Add pool health check or use pool.getConnection() with error
handling that recreates the pool on connection failure.
14. CDK Compilation During Synthesis¶
// orchestrator-stack.ts — BLOCKS CDK SYNTHESIS
execSync('npm run compile-processors', { stdio: 'inherit' });
Compiling TypeScript during CDK synthesis makes deployment dependent on
build environment state and adds latency to every cdk synth. Should be a
separate build step in CI/CD pipeline.
Architecture Compliance Summary¶
| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ⚠️ Partial | Events published, but routing is fragile (order-dependent if/else chain) |
| Idempotent handlers | ❌ FAIL | No deduplication checks anywhere |
| Exception hierarchy | ❌ FAIL | Generic errors thrown; no Recoverable/Permanent/Silent distinction |
| Checkpoint/state tracking | ❌ FAIL | No DynamoDB state table; relies on fragile resource reconstruction |
| Multi-tenant DB | ✅ OK | Glue script handles per-case databases correctly |
| Structured logging | ⚠️ Partial | Processors use Winston; orchestrator mixes console.log |
| No secrets in code | ❌ FAIL | Hardcoded 'secret' in lambda-manager.ts |
| IAM least privilege | ❌ FAIL | Wildcard permissions on SQS, Lambda, EC2 |
| Testing (idempotency) | ❌ FAIL | No idempotency tests; most components untested |
| Partial batch failures | ❌ FAIL | Missing batchItemFailures response |
Lessons Learned¶
- Resource reconstruction eliminates state persistence — deterministic naming conventions let you rebuild ARNs/URLs without DynamoDB or SSM, simplifying cleanup after Lambda restarts.
- Conditional infrastructure minimizes waste — only creating the queues/Lambdas needed for a specific exchange type reduces provisioning time and avoids idle resources.
- Circuit breaker prevents cascading failures — when the Nutrient uploader is overwhelmed, the circuit breaker stops the exchanger from wasting Lambda concurrency on failed sends.
- Glue excels at bulk cross-database ETL — copying thousands of records between per-case databases is a natural fit for Spark, avoiding Lambda timeout and connection pool constraints.
- Dual entry points (API + SQS) enable sync + async in one handler — the initial exchange request needs a synchronous response (jobId), while lifecycle events are naturally asynchronous.
- De-orchestration delay (90s) prevents message loss — checking queue depth before deletion ensures in-flight messages complete processing.