Reference Implementation: documentexchanger

Document batch transfer service that copies documents (with metadata, annotations, custodians, Bates stamps, confidentiality codes, issues, and OCR text) between eDiscovery cases. It dynamically provisions processor Lambdas and SQS queues per exchange job, orchestrates bulk database ETL via AWS Glue, and tears down resources on completion.

Architecture Overview

documentexchanger transfers documents from a source case to a destination case. An "exchange" involves: copying files on S3, copying database records via Glue ETL, optionally processing PDF annotations via Nutrient, re-OCR of attachment pages, and uploading to the destination case's document engine.

The module uses a dynamic infrastructure orchestration pattern similar to documentuploader — creating and destroying processor Lambdas and SQS queues per exchange job at runtime. However, it differs by using Lambda processors (not ECS) and AWS Glue for bulk database operations (not per-record Lambda).

Why Dynamic Infrastructure?

| Concern | Static (pre-created) | Dynamic (per-exchange) |
| --- | --- | --- |
| Queue isolation | Shared queue, filter by attributes | Dedicated queue per exchange job |
| Lambda scaling | Shared concurrency pool | Per-exchange concurrency isolation |
| Cleanup | Manual resource tracking | Delete queue/Lambda = full cleanup |
| Conditional processing | All queues always exist | Only create queues needed for this exchange |

Architecture Tree

documentexchanger/
├── lib/
│   ├── global/
│   │   ├── lambda/
│   │   │   ├── orchestrator/                      # Main orchestration Lambda
│   │   │   │   ├── index.ts                       # Dual entry: API Gateway + SQS
│   │   │   │   ├── types.ts                       # ExchangeRequest, ExchangeContext interfaces
│   │   │   │   ├── event-publisher.ts             # SNS event construction + publishing
│   │   │   │   ├── event-processor.ts             # Lifecycle event handler (de-orchestration)
│   │   │   │   ├── queue-manager.ts               # Dynamic SQS queue creation/deletion
│   │   │   │   ├── lambda-manager.ts              # Dynamic Lambda creation/deletion
│   │   │   │   ├── step-function-trigger.ts       # Glue job orchestration via Step Functions
│   │   │   │   ├── uploader-coordinator.ts        # Nutrient uploader messaging (circuit breaker)
│   │   │   │   ├── resource-reconstructor.ts      # Rebuild queue/Lambda ARNs from naming
│   │   │   │   ├── sns-helper.ts                  # SNS publishing utilities
│   │   │   │   └── sqs-helper.ts                  # SQS queue utilities
│   │   │   ├── processors/
│   │   │   │   ├── s3_copy_processor/             # S3 server-side document copy
│   │   │   │   ├── pre_uploader_processor/        # PDF annotation processing via Nutrient
│   │   │   │   ├── re_ocr_processor/              # Re-OCR attachment pages via Nutrient
│   │   │   │   ├── search_text_processor/         # Search indexing (stub)
│   │   │   │   └── shared/
│   │   │   │       ├── logger.ts                  # Winston structured JSON logging
│   │   │   │       ├── error-handler.ts           # SNS error publishing with processor mapping
│   │   │   │       └── shared-module.ts           # Max receive count retry logic
│   │   │   └── exchange-status-update/            # DB status update Lambda
│   │   ├── global-dlq-processor/                  # Global DLQ handler
│   │   ├── orchestrator-stack.ts                  # CDK: orchestrator Lambda + layer
│   │   ├── sqs-stack.ts                           # CDK: global SQS queues + SNS subscriptions
│   │   ├── stepfunction-stack.ts                  # CDK: Glue job state machine
│   │   ├── glue-job-stack.ts                      # CDK: Glue ETL job
│   │   ├── global-lambda-stack.ts                 # CDK: status update + DLQ Lambdas
│   │   ├── apigateway-stack.ts                    # CDK: REST API for exchange requests
│   │   ├── roles-stack.ts                         # CDK: IAM roles
│   │   └── proxy-stack.ts                         # CDK: RDS proxy
│   ├── glue/
│   │   └── script.py                              # PySpark ETL: bulk DB copy between cases
│   ├── helpers/
│   │   ├── config-helper.ts                       # Environment × region config
│   │   └── name-helper.ts                         # Resource naming conventions
│   └── constructs/
│       └── global-queue.ts                        # Reusable SQS + DLQ construct
├── config.json                                    # Per-environment settings
├── pipeline/                                      # CDK pipeline for Glue script
└── test/                                          # Jest CDK + processor tests

Language Stack

| Component | Language | Runtime | Purpose |
| --- | --- | --- | --- |
| Orchestrator | TypeScript | Lambda (Node.js) | Exchange lifecycle management |
| S3 Copy Processor | TypeScript | Lambda (Node.js) | Server-side S3 document copy |
| Pre-Uploader Processor | TypeScript | Lambda (Node.js) | PDF annotation processing via Nutrient |
| Re-OCR Processor | TypeScript | Lambda (Node.js) | Text extraction from Nutrient pages |
| Search Text Processor | TypeScript | Lambda (Node.js) | Search indexing (stub) |
| Status Update | TypeScript | Lambda (Node.js) | DB status field updates |
| DLQ Processor | TypeScript | Lambda (Node.js) | Error event publishing |
| Glue ETL Script | Python (PySpark) | AWS Glue | Bulk database record copy |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |

All Lambda processors share a common Lambda Layer compiled from the processors/shared/ TypeScript code.

Pattern Mapping

Patterns Implemented

| Architecture Pattern | documentexchanger Implementation | Notes |
| --- | --- | --- |
| SNS Event Publishing | event-publisher.ts — typed event builders | TypeScript, same event structure (source, jobId, caseId, etc.) |
| SQS Handler | Orchestrator processes from global SQS queue | Standard parse → route → handle flow |
| Retry & Resilience | SharedModule.handleErrors() — max receive count check | 3 retries before DLQ, similar to standard pattern |
| Job Processor Orchestration | Orchestrator creates queues + Lambdas per exchange | Dynamic infrastructure like documentloader/documentuploader |
| Idempotent Handlers | Resource reconstruction from naming conventions | No persistent state needed — ARNs rebuilt from deterministic names |
| Config Management | config.json environment × region matrix | Same pattern as other modules |
| CDK Infrastructure | Multiple stacks (orchestrator, SQS, Step Functions, Glue, API GW) | Standard multi-stack composition |
| SNS Filter Policies | Per-queue filters (eventType, caseId, batchId) | Standard attribute-based filtering |
| Structured Logging | Winston JSON logger with context fields | caseId, batchId, documentId, exhibitIds |
| Multi-Tenancy | Per-case database naming: {DB_NAME_PREFIX}_{case_id} | Standard multi-tenant pattern |

Patterns NOT Used

| Pattern | Why Not |
| --- | --- |
| Hexagonal core/shell | Orchestration-focused module — no domain logic to separate |
| Checkpoint Pipeline | No multi-step per-document processing; each processor is single-action |
| Exception Hierarchy (Python) | TypeScript module — uses max-receive-count + DLQ instead |
| SQLAlchemy ORM | Database operations via PySpark Glue job, not Lambda |
| Elasticsearch Bulk Indexing | Search text processor is a stub |
| @retry_on_db_conflict | No MySQL writes from Lambda; bulk writes via Glue |

Event Flow

Exchange Lifecycle

API Gateway POST /exchange
Orchestrator Lambda (sync)
  ├── Create SQS queues (conditional: 2-5 based on annotation needs)
  ├── Create processor Lambdas (with event source mappings)
  ├── Apply resource-based policies
  ├── Send message to uploader orchestrator
  ├── Trigger Step Function (Glue job)
  └── Publish EXCHANGER_INITIATED

Step Function
  ├── RunGlueJob (PySpark: copy DB records source→dest)
  │     retry: 3 attempts, 30s interval, 2.0 backoff
  ├── DB_COPY_FINISHED → InvokePayloadLambda
  │     (generates per-document messages for processor queues)
  └── PAYLOAD_CONSTRUCTION_FINISHED → SNS

Processor Lambdas (parallel, per-document)
  ├── S3 Copy: CopyObject source→dest S3
  ├── Pre-Uploader: Fetch/flatten annotations via Nutrient
  ├── Re-OCR: Extract text from Nutrient pages
  └── Search Text: Index for search (stub)

De-Orchestration (triggered by PAYLOAD_CONSTRUCTION_FINISHED)
  ├── Check queue depths (re-enqueue with 90s delay if not empty)
  ├── Delete processor Lambdas
  ├── Delete per-exchange SQS queues
  └── Publish EXCHANGER_FINISHED

Inbound Events (Orchestrator Consumes)

| Event | Source | Action |
| --- | --- | --- |
| API Gateway POST | Nextpoint backend | Create infrastructure, start exchange |
| PAYLOAD_CONSTRUCTION_FINISHED | Step Function | Begin de-orchestration |
| EXCHANGER_DELETE_INITIATED | Self (re-enqueued) | Continue de-orchestration (check queues) |
| GLUE_JOB_FAILED | Step Function | Cleanup and publish EXCHANGER_FAILED |

Outbound Events (Published to SNS)

| Event | Source | Downstream |
| --- | --- | --- |
| EXCHANGER_INITIATED | Orchestrator | Status update Lambda |
| DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| NATIVE_DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| DOCUMENT_NOT_FOUND | S3 Copy / Pre-Uploader | Error tracking |
| DOCUMENT_PRE_PROCESSED | Pre-Uploader Processor | Monitoring |
| DB_COPY_FINISHED | Step Function | Payload construction |
| EXCHANGER_DELETE_INITIATED | Event Processor | Self (de-orchestration loop) |
| EXCHANGER_FINISHED | Event Processor | Status update Lambda |
| EXCHANGER_FAILED | Error Handler | Status update Lambda |

Key Design Decisions

1. Dual Entry Point (API Gateway + SQS)

Decision: Single Lambda handler processes both synchronous API requests and asynchronous SQS events.

API Gateway POST → Orchestrator (creates infrastructure, returns jobId)
SQS events    → Orchestrator (lifecycle management: cleanup, retry)

Why:
- API Gateway provides synchronous request/response for the initial exchange trigger
- SQS provides asynchronous retry for lifecycle events (de-orchestration)
- A single handler avoids duplicating orchestration logic
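The entry-point routing reduces to a shape check on the incoming event. This is a minimal sketch, not the actual index.ts code — the function name routeEntry and the stripped-down event type are illustrative; the payload shapes follow the standard AWS Lambda event formats:

```typescript
// API Gateway proxy events carry httpMethod; SQS events carry Records[]
// whose eventSource is "aws:sqs".
type IncomingEvent = {
  httpMethod?: string;
  Records?: Array<{ eventSource?: string }>;
};

function routeEntry(event: IncomingEvent): "api" | "sqs" | "unknown" {
  if (typeof event.httpMethod === "string") return "api"; // synchronous exchange request
  const records = event.Records ?? [];
  if (records.length > 0 && records.every((r) => r.eventSource === "aws:sqs")) {
    return "sqs"; // asynchronous lifecycle event
  }
  return "unknown";
}
```

In the real handler, the "api" branch creates infrastructure and returns the jobId, while the "sqs" branch handles lifecycle management.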

2. Conditional Queue/Lambda Creation

Decision: Only create the queues and Lambdas needed for a specific exchange based on requiresAnnotations and includeNatives flags.

| Configuration | Queues Created | Lambdas Created |
| --- | --- | --- |
| No annotations | s3_copy, doc_uploader | S3 Copy |
| Annotations only | pre_uploader, re_ocr, search_text, doc_uploader | Pre-Uploader, Re-OCR, Search Text |
| Annotations + natives | s3_copy, pre_uploader, re_ocr, search_text, doc_uploader | S3 Copy, Pre-Uploader, Re-OCR, Search Text |

Why: Minimizes resource creation overhead and avoids idle queues/Lambdas. An exchange without annotations doesn't need Nutrient processing infrastructure.
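The table above reduces to a small decision function. A sketch, with an illustrative function name (the flags themselves come from the exchange request):

```typescript
// Decide which per-exchange queues to provision from the request flags.
function selectQueues(requiresAnnotations: boolean, includeNatives: boolean): string[] {
  const queues: string[] = [];
  // s3_copy is needed unless the exchange is annotations-only
  if (!requiresAnnotations || includeNatives) queues.push("s3_copy");
  // Nutrient-related queues only exist when annotations must be processed
  if (requiresAnnotations) queues.push("pre_uploader", "re_ocr", "search_text");
  queues.push("doc_uploader"); // always created
  return queues;
}
```

This yields the 2–5 queue range seen in the exchange lifecycle diagram.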

3. Resource Reconstruction (Stateless Cleanup)

Decision: Rebuild queue URLs and Lambda ARNs from deterministic naming conventions rather than persisting them in a database.

// ResourceReconstructor rebuilds from naming pattern:
// Queue: doc_exch_{type}_queue_{caseId}_{batchId}
// Lambda: doc_exch_{type}_lambda_{caseId}_{batchId}

const queueUrl = `https://sqs.${region}.amazonaws.com/${accountId}/${queueName}`;
const lambdaArn = `arn:aws:lambda:${region}:${accountId}:function:${functionName}`;

Why: Eliminates need for persistent state storage (DynamoDB/SSM). If the orchestrator Lambda restarts mid-lifecycle, it can reconstruct all resource references from the exchange request parameters alone.
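A minimal sketch of the reconstruction, assuming the naming pattern shown above (the helper names are illustrative, not the actual ResourceReconstructor API):

```typescript
// Rebuild queue names and URLs from the deterministic naming convention —
// no lookup table or persisted state required.
function queueName(type: string, caseId: string, batchId: string): string {
  return `doc_exch_${type}_queue_${caseId}_${batchId}`;
}

function queueUrl(
  region: string,
  accountId: string,
  type: string,
  caseId: string,
  batchId: string,
): string {
  return `https://sqs.${region}.amazonaws.com/${accountId}/${queueName(type, caseId, batchId)}`;
}
```

The trade-off noted later in the review applies: if the convention ever changes, helpers like these silently produce wrong names.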

4. Circuit Breaker for Uploader Communication

Decision: UploaderCoordinator implements a circuit breaker pattern when messaging the Nutrient uploader orchestrator.

CLOSED state (normal):
  → Send messages to uploader queue
  → On failure: increment failure counter
  → If failures >= 5: transition to OPEN

OPEN state (circuit tripped):
  → Reject messages immediately (don't attempt send)
  → After 30s timeout: transition to HALF_OPEN

HALF_OPEN state (testing):
  → Allow one test message
  → If success: transition to CLOSED
  → If failure: transition back to OPEN

Why: The Nutrient uploader is a separate module with its own queue and scaling constraints. If it's overwhelmed or down, the circuit breaker prevents the exchanger from flooding it with messages and consuming its own Lambda concurrency on retries.
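The state machine above can be sketched as follows. This is a minimal illustration — the real UploaderCoordinator wraps actual SQS sends — with the clock injected for testability:

```typescript
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

// Circuit breaker with a 5-failure threshold and 30s reset timeout,
// matching the states described above.
class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private threshold = 5,
    private resetMs = 30_000,
    private now: () => number = Date.now,
  ) {}

  canSend(): boolean {
    if (this.state === "OPEN" && this.now() - this.openedAt >= this.resetMs) {
      this.state = "HALF_OPEN"; // allow one test message after the timeout
    }
    return this.state !== "OPEN";
  }

  recordSuccess(): void {
    this.state = "CLOSED";
    this.failures = 0;
  }

  recordFailure(): void {
    this.failures += 1;
    // A HALF_OPEN failure re-opens immediately; otherwise trip at the threshold.
    if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
      this.state = "OPEN";
      this.openedAt = this.now();
      this.failures = 0;
    }
  }
}
```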

5. AWS Glue for Bulk Database Copy

Decision: Use PySpark Glue job for bulk database ETL rather than per-record Lambda processing.

Glue Job (script.py):
  Source DB: {DB_NAME_PREFIX}_{source_case_id}
  Target DB: {DB_NAME_PREFIX}_{destination_case_id}

  Tables copied: documents, document_variants, exhibits,
    label_category_labels, custom_field_values, custodians,
    custodian_exhibits, confidentiality_codes, confidentiality_logs,
    relevancy_logs, folder_tags, folder_taggings, label_categories, etc.

  Connection: JDBC via RDS proxy (writer + reader endpoints)

  Optimizations:
    - ThreadPoolExecutor (BFS_MAX_WORKERS=8) for parallel child table processing
    - Thread-safe column caching (_columns_cache with lock)
    - JDBC fetch size tuning (spark.sql.jdbc.fetchSize=10000)
    - _execute_with_reconnect() for connection resilience
    - delete_rows_in_batches() for safe bulk deletions

  Post-copy operations:
    - Clear document_properties (NULL) on exchanged exhibits
    - Assign source exhibit ID as number field on target exhibits
    - Remap confidentiality_status bitwise to target case codes
    - Deduplicate label categories and folder tags by name

Why: Copying thousands of database records per-document via Lambda would be slow and expensive. Glue's Spark engine handles bulk JDBC reads/writes efficiently, and the Glue job runs independently of the Lambda timeout limit.

6. Step Functions for Glue Orchestration

Decision: Use Step Functions to orchestrate the Glue job → payload construction → notification sequence.

RunGlueJob (retry: 3 attempts, 30s interval, 2.0 backoff)
  → Success: DB_COPY_FINISHED → InvokePayloadLambda
  → Failure: GLUE_JOB_FAILED → SNS notification

Why: Glue jobs can run for minutes to hours. Step Functions provides native retry/backoff, failure handling, and SNS notifications without requiring a polling Lambda.
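In Amazon States Language, the retry policy described above would look roughly like this fragment (the PublishGlueJobFailed state name and the States.ALL error matcher are illustrative; the retry parameters come from the sequence above):

```json
{
  "RunGlueJob": {
    "Type": "Task",
    "Resource": "arn:aws:states:::glue:startJobRun.sync",
    "Retry": [
      {
        "ErrorEquals": ["States.ALL"],
        "IntervalSeconds": 30,
        "MaxAttempts": 3,
        "BackoffRate": 2.0
      }
    ],
    "Catch": [
      { "ErrorEquals": ["States.ALL"], "Next": "PublishGlueJobFailed" }
    ],
    "Next": "InvokePayloadLambda"
  }
}
```

The `.sync` integration makes the state wait for the Glue job to finish, so no polling Lambda is needed.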

Processor Details

S3 Copy Processor

Copies document files from source case S3 to destination case S3:

Receive SQS message (documentId, sourcePath, destPath)
  → S3.CopyObject (server-side copy, no data transfer through Lambda)
  → If NoSuchKey: publish DOCUMENT_NOT_FOUND, skip
  → On success: publish DOCUMENT_COPIED
  → Send message to uploader queue (for Nutrient ingestion)

Server-side copy avoids downloading/uploading files through Lambda memory.

Pre-Uploader Processor (Annotation Processing)

Handles PDF annotation transfer via Nutrient API. Supports 7 annotation types, each configurable as "brand_in" (flatten into PDF) or "layer" (overlay JSON):

| Annotation Type | Nutrient Type | Routing Config | Notes |
| --- | --- | --- | --- |
| Redactions | pspdfkit/markup/redact | redactionsAs | Applied first (ordering matters for overlap correctness) |
| Highlights | pspdfkit/markup/highlight | highlightsAs | Standard markup type |
| Notes | pspdfkit/note | notesAs | Standard markup type |
| Stickers | pspdfkit/image | stickersAs | Collects binary image attachments via collectImageAttachments() |
| Bates stamps | pspdfkit/text:bates_stamp | batesAs | Virtual type with :bates_stamp suffix for disambiguation |
| Confidentiality codes | pspdfkit/text:confidentiality_stamp | confidentialityCodesAs | Virtual type with :confidentiality_stamp suffix |
| Comments | highlight with isCommentThreadRoot | Always layer | Fetches thread details, writes to S3 separately |

Receive SQS message (documentId, exhibitId, annotationTypes)
  → Check S3 source PDF exists (skip with {skipped: true} if not)
  → Construct Nutrient document ID: document_{caseId}_{batchId}_{documentId}_{exhibitId}
  → Fetch annotations from Nutrient API
  → For "brand_in": Apply redactions first, then other types (ordering for overlap correctness)
  → For "layer": Upload JSON annotation data (stickers include image binary data)
  → For "comments": Fetch comment thread details, write to S3
  → On success: publish DOCUMENT_PRE_PROCESSED
  → Send message to uploader queue

Re-OCR Processor

Extracts text from Nutrient document pages:

Receive SQS message (documentId, exhibitId, attachmentPosition)
  → Call Nutrient API: GET /documents/{id}/pages/{position}/text
  → Upload text to S3: case_{caseId}/processed-files/{batchId}/txt/{documentId}/attachments/{position}/content.txt
  → Send message to search_text queue

Uses token caching for Nutrient API efficiency across messages in a batch.

De-Orchestration (Cleanup)

When PAYLOAD_CONSTRUCTION_FINISHED arrives, the orchestrator begins teardown:

1. Publish EXCHANGER_DELETE_INITIATED (starts cleanup loop)
2. Check all per-exchange queue depths
   → If any queue has messages: re-enqueue with 90s delay
   → If all queues empty: proceed to deletion
3. Delete event source mappings (Lambda ← SQS)
4. Delete processor Lambdas
5. Delete per-exchange SQS queues
6. Publish EXCHANGER_FINISHED

The 90-second delay allows in-flight messages to complete processing before queue deletion. The self-referencing event loop (similar to documentuploader) provides natural polling without Step Functions.

Error Handling

Max receive count pattern (TypeScript equivalent of exception hierarchy):

| Receive Count | Action |
| --- | --- |
| < MAX_RECEIVE_COUNT (3) | Throw error → SQS requeues with visibility timeout |
| ≥ MAX_RECEIVE_COUNT (3) | Publish error to SNS, don't throw (message consumed) |
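The decision in SharedModule.handleErrors() reduces to a receive-count check. A sketch with illustrative names (SQS supplies ApproximateReceiveCount as a message attribute on each delivery):

```typescript
const MAX_RECEIVE_COUNT = 3;

type ErrorAction = "rethrow" | "publish_and_consume";

// Given how many times SQS has delivered the message, decide whether to
// let it retry or to publish the error and consume the message.
function decideErrorAction(approximateReceiveCount: number): ErrorAction {
  // Below the cap: rethrow so SQS redelivers after the visibility timeout.
  if (approximateReceiveCount < MAX_RECEIVE_COUNT) return "rethrow";
  // At or above the cap: publish the error event to SNS and swallow the
  // exception so the message is deleted instead of looping forever.
  return "publish_and_consume";
}
```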

Error-to-event mapping:

| Processor | Error Event Type |
| --- | --- |
| s3_copy_processor | S3_DOC_COPIED (status: ERROR) |
| pre_uploader_processor | PRE_UPLOADER_FINISHED (status: ERROR) |
| orchestrator | EXCHANGER_FAILED |

Global DLQ: All per-exchange queues use a shared global DLQ. The DLQ processor publishes error events to SNS for monitoring.

SNS Filter Policies

| Queue | Filter |
| --- | --- |
| Orchestrator (exch-orch) | eventType: [PAYLOAD_CONSTRUCTION_FINISHED, EXCHANGER_DELETE_INITIATED, GLUE_JOB_FAILED] |
| Status Update (exch-status) | eventType: [EXCHANGER_FINISHED, EXCHANGER_INITIATED] |
| Re-OCR (per-exchange) | eventType: [DOCUMENT_UPLOADED], caseId: [{destCaseId}], batchId: [{batchId}] |

Divergences from Standard Architecture

| Standard Pattern | documentexchanger Divergence | Reason |
| --- | --- | --- |
| Python 3.10+ | TypeScript (all Lambdas) + Python (Glue only) | TypeScript preferred for orchestration logic |
| Hexagonal core/shell | No separation — orchestration + infrastructure mixed | No domain logic; pure orchestration and S3/API operations |
| Exception hierarchy | Max-receive-count + DLQ in TypeScript | No RecoverableException/PermanentFailureException in TS |
| SQLAlchemy ORM | PySpark DataFrames via Glue | Bulk ETL, not per-record CRUD |
| Multi-tenant MySQL (Lambda) | Glue job reads/writes per-case databases | Lambda doesn't touch database directly |
| Checkpoint pipeline | N/A — single-action per processor message | No multi-step per-document processing |
| ECS long-running tasks | Lambda processors + Glue job | Operations fit within Lambda timeout |

Multi-Region Deployment

Same c2/c4/c5 naming convention as other modules:

| Region | Prefix | SNS Topic | Staging Bucket |
| --- | --- | --- | --- |
| us-east-1 | c2 | Per-region topic ARN | Per-region S3 bucket |
| us-west-1 | c4 | Per-region topic ARN | Per-region S3 bucket |
| ca-central-1 | c5 | Per-region topic ARN | Per-region S3 bucket |

Pre-Deployment Architectural Review

Status: Build phase — not yet deployed. The following issues were identified during Principal Architect review and must be addressed before production deployment.

P0 — Pre-Deployment Blockers

1. Idempotency Violations (Architecture Rule: All handlers MUST be idempotent)

Orchestrator handler is NOT idempotent. If the API Gateway handler fails after creating queues/Lambdas but before returning success, a retry creates duplicate resources. No deduplication check or request ID tracking exists.

// Current: generates new jobId every call — retries create new exchanges
const jobId = randomUUID();

// Required: check if exchange already initiated for this request
// Option A: Accept client-provided idempotency key
// Option B: DynamoDB conditional put on (sourceCaseId, destCaseId, labelId, batchId)

SQS message processing lacks partial batch failure handling. Line 217 throws errors immediately, causing the entire batch to fail and retry from the beginning. Architecture requires {"batchItemFailures": [...]} for individual record failures.
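A synchronous sketch of the required partial-batch response (the record type and processRecord are stand-ins; real handlers are async and would await each record):

```typescript
interface SqsRecordLike {
  messageId: string;
  body: string;
}

// Process each record individually; report only the failures back to SQS
// so successful records in the batch are not redelivered.
function buildBatchResponse(
  records: SqsRecordLike[],
  processRecord: (r: SqsRecordLike) => void,
): { batchItemFailures: { itemIdentifier: string }[] } {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      processRecord(record);
    } catch {
      // Only the failed record is retried; the rest of the batch is consumed.
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```

Note that the response only takes effect if the event source mapping enables ReportBatchItemFailures in its FunctionResponseTypes.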

2. Race Condition in De-Orchestration Cleanup

Critical: The cleanup checks areAllQueuesEmpty() after a 90-second delay, then deletes queues. But messages can arrive AFTER the check but BEFORE deletion:

Time 0s:   EXCHANGER_DELETE_INITIATED received
Time 90s:  Check queue depth → 0 messages → "safe to delete"
Time 91s:  New message arrives in queue (from slow processor)
Time 92s:  Queue deleted → message LOST, processor Lambda errors

Fix: Cleanup order must be:
1. Remove SNS subscriptions (stop new messages)
2. Disable event source mappings (stop Lambda polling)
3. Wait for in-flight Lambda invocations to complete
4. Verify queues empty
5. Delete Lambdas
6. Delete queues

3. Missing Exception Type Hierarchy

The codebase uses generic throw error everywhere instead of the architecture's exception hierarchy. All processors use SharedModule.handleErrors() which only checks receive count — it doesn't distinguish between:

  • Recoverable (Nutrient API timeout → retry)
  • Permanent (document genuinely missing → DLQ immediately)
  • Silent (already processed → delete message, log success)

Current behavior: All errors get 3 retries regardless of whether retrying will help.

4. No Transactional State Tracking

The orchestrator performs 6 steps (create queues → create Lambdas → apply policies → send uploader message → trigger Step Function → publish event). If any step fails after step 2, partial resources exist with no compensation or rollback.

Required: DynamoDB state table tracking orchestration progress:

PK: exchangeJobId
status: INITIATING | QUEUES_CREATED | LAMBDAS_CREATED | POLICIES_APPLIED |
        STEP_FUNCTION_TRIGGERED | INITIATED | DE_ORCHESTRATING | FINISHED | FAILED
queueUrls: [...], lambdaArns: [...]
createdAt, updatedAt

This also eliminates the fragile resource reconstruction pattern (rebuilding ARNs from naming conventions that could silently generate wrong names if conventions change).

5. Hardcoded Secrets

// lambda-manager.ts line 308 — NON-PROD SECRET HARDCODED
NUTRIENT_API_TOKEN: 'secret',

Architecture rule: No secrets in code — use AWS Secrets Manager. Even in non-prod, hardcoded secrets in source code are a security violation. Use Secrets Manager for all environments.

P1 — High Priority (Pre-GA)

6. Overly Permissive IAM Policies

// orchestrator-stack.ts — WILDCARD PERMISSIONS
'sqs:*',      // Should be: sqs:CreateQueue, sqs:DeleteQueue, sqs:SendMessage, etc.
'lambda:*',   // Should be: lambda:CreateFunction, lambda:DeleteFunction, etc.
'ec2:*',      // Should be: ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, etc.

Violates least-privilege principle. Scope to specific actions and resource ARN patterns (e.g., arn:aws:sqs:*:*:doc_exch_*).
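A scoped statement for the SQS side might look like the following IAM policy fragment (the exact action list is illustrative; the doc_exch_* resource pattern matches the per-exchange naming convention):

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "sqs:SendMessage",
    "sqs:GetQueueAttributes",
    "sqs:SetQueueAttributes"
  ],
  "Resource": "arn:aws:sqs:*:*:doc_exch_*"
}
```

Analogous statements would scope the Lambda and EC2 actions to the functions and network interfaces the orchestrator actually manages.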

7. Inconsistent Logging

Mixed console.log() / console.warn() with Winston logger.INFO() / logger.ERROR() across the orchestrator. Architecture requires structured JSON logging with consistent context fields (caseId, batchId, jobId, documentId).

Fix: Replace all console.* calls with the Winston logger from processors/shared/logger.ts.

8. Missing Tests

| Component | Test Status |
| --- | --- |
| Orchestrator (index.ts) | ❌ No tests |
| Event Processor | ❌ No tests |
| Queue Manager | ❌ No tests |
| Lambda Manager | ❌ No tests |
| S3 Copy Processor | ⚠️ 3 basic tests only |
| Pre-Uploader Processor | ❌ No tests |
| Re-OCR Processor | ❌ No tests |
| Search Text Processor | ❌ No tests |
| Status Update Lambda | ❌ No tests |
| Glue Script | ❌ No tests |
| CDK Stacks | ⚠️ Placeholder only |

Architecture rule: Every handler test must verify idempotent behavior. None of the existing tests verify idempotency.

9. Fragile Lambda ZIP Construction

LambdaManager hand-rolls ZIP file format with manual CRC32 calculation and binary buffer construction instead of using a proper ZIP library. This is extremely fragile — binary format bugs will cause silent Lambda creation failures.

Fix: Use archiver or jszip npm package.

10. Glue Script Lacks Safety Guarantees

# script.py — No rollback if job fails mid-copy
# No idempotency checks for already-copied data
# No validation of secret format
# No transaction safety on bulk writes

If the Glue job fails halfway through copying tables, partial data exists in the destination database with no way to identify what was copied vs what wasn't. Needs one of:
- Idempotent writes (INSERT IGNORE / ON DUPLICATE KEY UPDATE)
- Transaction wrapper with rollback
- Checkpoint tracking per table

P2 — Medium Priority

11. Hardcoded Configuration Values

| Value | Location | Should Be |
| --- | --- | --- |
| 15 min Lambda timeout | orchestrator-stack.ts | Environment variable |
| 20000 bulk update size | search_text_processor | Environment variable |
| 1000 SQS batch size | lambda-manager.ts | Config per environment |
| 300s batch window | lambda-manager.ts | Config per environment |
| 90s cleanup delay | event-processor.ts | Config per environment |

12. Token Initialization Race Condition

PreUploaderProcessor stores Nutrient API token as instance variable initialized async in init(). If two SQS messages arrive rapidly on a warm Lambda, both call init() concurrently — race condition on token state.

Fix: Use module-level cached token (like re_ocr_processor does correctly with let cachedToken).
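A sketch of the module-level promise cache (fetchToken is injected here for illustration; the real processor would call the Nutrient auth endpoint). Caching the promise, rather than the resolved value, also deduplicates concurrent init() calls on a warm Lambda:

```typescript
// Module-level cache survives across invocations on a warm Lambda.
let cachedToken: Promise<string> | undefined;

function getToken(fetchToken: () => Promise<string>): Promise<string> {
  if (!cachedToken) {
    cachedToken = fetchToken().catch((err) => {
      cachedToken = undefined; // don't cache failures; next call retries
      throw err;
    });
  }
  // Concurrent callers share the same in-flight promise — no race.
  return cachedToken;
}
```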

13. Connection Pool Staleness

SearchTextProcessor caches a MySQL connection pool at module level but never validates or refreshes it. Long-running warm Lambdas may use stale connections that have been terminated by RDS proxy.

Fix: Add pool health check or use pool.getConnection() with error handling that recreates the pool on connection failure.

14. CDK Compilation During Synthesis

// orchestrator-stack.ts — BLOCKS CDK SYNTHESIS
execSync('npm run compile-processors', { stdio: 'inherit' });

Compiling TypeScript during CDK synthesis makes deployment dependent on build environment state and adds latency to every cdk synth. Should be a separate build step in CI/CD pipeline.

Architecture Compliance Summary

| Requirement | Status | Details |
| --- | --- | --- |
| Event-driven via SNS | ⚠️ Partial | Events published, but routing is fragile (order-dependent if/else chain) |
| Idempotent handlers | ❌ FAIL | No deduplication checks anywhere |
| Exception hierarchy | ❌ FAIL | Generic errors thrown; no Recoverable/Permanent/Silent distinction |
| Checkpoint/state tracking | ❌ FAIL | No DynamoDB state table; relies on fragile resource reconstruction |
| Multi-tenant DB | ✅ OK | Glue script handles per-case databases correctly |
| Structured logging | ⚠️ Partial | Processors use Winston; orchestrator mixes console.log |
| No secrets in code | ❌ FAIL | Hardcoded 'secret' in lambda-manager.ts |
| IAM least privilege | ❌ FAIL | Wildcard permissions on SQS, Lambda, EC2 |
| Testing (idempotency) | ❌ FAIL | No idempotency tests; most components untested |
| Partial batch failures | ❌ FAIL | Missing batchItemFailures response |

Lessons Learned

  1. Resource reconstruction eliminates state persistence — deterministic naming conventions let you rebuild ARNs/URLs without DynamoDB or SSM, simplifying cleanup after Lambda restarts.

  2. Conditional infrastructure minimizes waste — only creating queues/Lambdas needed for a specific exchange type reduces provisioning time and avoids idle resources.

  3. Circuit breaker prevents cascading failures — when the Nutrient uploader is overwhelmed, the circuit breaker stops the exchanger from wasting Lambda concurrency on failed sends.

  4. Glue excels for bulk cross-database ETL — copying thousands of records between per-case databases is a natural fit for Spark, avoiding Lambda timeout and connection pool constraints.

  5. Dual entry points (API + SQS) enable sync + async in one handler — the initial exchange request needs synchronous response (jobId), while lifecycle events are naturally asynchronous.

  6. De-orchestration delay (90s) prevents message loss — checking queue depth before deletion ensures in-flight messages complete processing.
