Reference Implementation: documentexchanger¶
Document batch transfer service that copies documents (with metadata, annotations, custodians, bates stamps, confidentiality codes, issues, and OCR text) between eDiscovery cases. Dynamically provisions processor Lambdas and SQS queues per exchange job, orchestrates bulk database ETL via AWS Glue, and tears down resources on completion.
Architecture Overview¶
documentexchanger transfers documents from a source case to a destination case. An "exchange" involves: copying files on S3, copying database records via Glue ETL, optionally processing PDF annotations via Nutrient, re-OCR of attachment pages, and uploading to the destination case's document engine.
The module uses a dynamic infrastructure orchestration pattern similar to documentuploader — creating and destroying processor Lambdas and SQS queues per exchange job at runtime. However, it differs by using Lambda processors (not ECS) and AWS Glue for bulk database operations (not per-record Lambda).
Why Dynamic Infrastructure?¶
| Concern | Static (pre-created) | Dynamic (per-exchange) |
|---|---|---|
| Queue isolation | Shared queue, filter by attributes | Dedicated queue per exchange job |
| Lambda scaling | Shared concurrency pool | Per-exchange concurrency isolation |
| Cleanup | Manual resource tracking | Delete queue/Lambda = full cleanup |
| Conditional processing | All queues always exist | Only create queues needed for this exchange |
Architecture Tree¶
documentexchanger/
├── lib/
│ ├── global/
│ │ ├── lambda/
│ │ │ ├── orchestrator/ # Main orchestration Lambda
│ │ │ │ ├── index.ts # Dual entry: API Gateway + SQS
│ │ │ │ ├── types.ts # ExchangeRequest, ExchangeContext interfaces
│ │ │ │ ├── event-publisher.ts # SNS event construction + publishing
│ │ │ │ ├── event-processor.ts # Lifecycle event handler (de-orchestration)
│ │ │ │ ├── queue-manager.ts # Dynamic SQS queue creation/deletion
│ │ │ │ ├── lambda-manager.ts # Dynamic Lambda creation/deletion
│ │ │ │ ├── step-function-trigger.ts # Glue job orchestration via Step Functions
│ │ │ │ ├── uploader-coordinator.ts # Nutrient uploader messaging (circuit breaker)
│ │ │ │ ├── resource-reconstructor.ts # Rebuild queue/Lambda ARNs from naming
│ │ │ │ ├── sns-helper.ts # SNS publishing utilities
│ │ │ │ └── sqs-helper.ts # SQS queue utilities
│ │ │ ├── processors/
│ │ │ │ ├── s3_copy_processor/ # S3 server-side document copy
│ │ │ │ ├── pre_uploader_processor/ # PDF annotation processing via Nutrient
│ │ │ │ ├── re_ocr_processor/ # Re-OCR attachment pages via Nutrient
│ │ │ │ ├── search_text_processor/ # Search indexing (stub)
│ │ │ │ └── shared/
│ │ │ │ ├── logger.ts # Winston structured JSON logging
│ │ │ │ ├── error-handler.ts # SNS error publishing with processor mapping
│ │ │ │ └── shared-module.ts # Max receive count retry logic
│ │ │ └── exchange-status-update/ # DB status update Lambda
│ │ ├── global-dlq-processor/ # Global DLQ handler
│ │ ├── orchestrator-stack.ts # CDK: orchestrator Lambda + layer
│ │ ├── sqs-stack.ts # CDK: global SQS queues + SNS subscriptions
│ │ ├── stepfunction-stack.ts # CDK: Glue job state machine
│ │ ├── glue-job-stack.ts # CDK: Glue ETL job
│ │ ├── global-lambda-stack.ts # CDK: status update + DLQ Lambdas
│ │ ├── apigateway-stack.ts # CDK: REST API for exchange requests
│ │ ├── roles-stack.ts # CDK: IAM roles
│ │ └── proxy-stack.ts # CDK: RDS proxy
│ ├── glue/
│ │ └── script.py # PySpark ETL: bulk DB copy between cases
│ ├── helpers/
│ │ ├── config-helper.ts # Environment × region config
│ │ └── name-helper.ts # Resource naming conventions
│ └── constructs/
│ └── global-queue.ts # Reusable SQS + DLQ construct
├── config.json # Per-environment settings
├── pipeline/ # CDK pipeline for Glue script
└── test/ # Jest CDK + processor tests
Language Stack¶
| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Orchestrator | TypeScript | Lambda (Node.js) | Exchange lifecycle management |
| S3 Copy Processor | TypeScript | Lambda (Node.js) | Server-side S3 document copy |
| Pre-Uploader Processor | TypeScript | Lambda (Node.js) | PDF annotation processing via Nutrient |
| Re-OCR Processor | TypeScript | Lambda (Node.js) | Text extraction from Nutrient pages |
| Search Text Processor | TypeScript | Lambda (Node.js) | Search indexing (stub) |
| Status Update | TypeScript | Lambda (Node.js) | DB status field updates |
| DLQ Processor | TypeScript | Lambda (Node.js) | Error event publishing |
| Glue ETL Script | Python (PySpark) | AWS Glue | Bulk database record copy |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |
All Lambda processors share a common Lambda Layer compiled from the
processors/shared/ TypeScript code.
Pattern Mapping¶
Patterns Implemented¶
| Architecture Pattern | documentexchanger Implementation | Notes |
|---|---|---|
| SNS Event Publishing | event-publisher.ts — typed event builders | TypeScript, same event structure (source, jobId, caseId, etc.) |
| SQS Handler | Orchestrator processes from global SQS queue | Standard parse → route → handle flow |
| Retry & Resilience | SharedModule.handleErrors() — max receive count check | 3 retries before DLQ, similar to standard pattern |
| Job Processor Orchestration | Orchestrator creates queues + Lambdas per exchange | Dynamic infrastructure like documentloader/documentuploader |
| Idempotent Handlers | Resource reconstruction from naming conventions | No persistent state needed — ARNs rebuilt from deterministic names |
| Config Management | config.json environment × region matrix | Same pattern as other modules |
| CDK Infrastructure | Multiple stacks (orchestrator, SQS, Step Functions, Glue, API GW) | Standard multi-stack composition |
| SNS Filter Policies | Per-queue filters (eventType, caseId, batchId) | Standard attribute-based filtering |
| Structured Logging | Winston JSON logger with context fields | caseId, batchId, documentId, exhibitIds |
| Multi-Tenancy | Per-case database naming: {DB_NAME_PREFIX}_{case_id} | Standard multi-tenant pattern |
Patterns NOT Used¶
| Pattern | Why Not |
|---|---|
| Hexagonal core/shell | Orchestration-focused module — no domain logic to separate |
| Checkpoint Pipeline | No multi-step per-document processing; each processor is single-action |
| Exception Hierarchy (Python) | TypeScript module — uses max-receive-count + DLQ instead |
| SQLAlchemy ORM | Database operations via PySpark Glue job, not Lambda |
| Elasticsearch Bulk Indexing | Search text processor is a stub |
| @retry_on_db_conflict | No MySQL writes from Lambda; bulk writes via Glue |
Event Flow¶
Exchange Lifecycle¶
API Gateway POST /exchange
│
▼
Orchestrator Lambda (sync)
├── Create SQS queues (conditional: 2-5 based on annotation needs)
├── Create processor Lambdas (with event source mappings)
├── Apply resource-based policies
├── Send message to uploader orchestrator
├── Trigger Step Function (Glue job)
└── Publish EXCHANGER_INITIATED
Step Function
├── RunGlueJob (PySpark: copy DB records source→dest)
│ retry: 3 attempts, 30s interval, 2.0 backoff
├── DB_COPY_FINISHED → InvokePayloadLambda
│ (generates per-document messages for processor queues)
└── PAYLOAD_CONSTRUCTION_FINISHED → SNS
Processor Lambdas (parallel, per-document)
├── S3 Copy: CopyObject source→dest S3
├── Pre-Uploader: Fetch/flatten annotations via Nutrient
├── Re-OCR: Extract text from Nutrient pages
└── Search Text: Index for search (stub)
De-Orchestration (triggered by PAYLOAD_CONSTRUCTION_FINISHED)
├── Check queue depths (re-enqueue with 90s delay if not empty)
├── Delete processor Lambdas
├── Delete per-exchange SQS queues
└── Publish EXCHANGER_FINISHED
Inbound Events (Orchestrator Consumes)¶
| Event | Source | Action |
|---|---|---|
| API Gateway POST | Nextpoint backend | Create infrastructure, start exchange |
| PAYLOAD_CONSTRUCTION_FINISHED | Step Function | Begin de-orchestration |
| EXCHANGER_DELETE_INITIATED | Self (re-enqueued) | Continue de-orchestration (check queues) |
| GLUE_JOB_FAILED | Step Function | Cleanup and publish EXCHANGER_FAILED |
Outbound Events (Published to SNS)¶
| Event | Source | Downstream |
|---|---|---|
| EXCHANGER_INITIATED | Orchestrator | Status update Lambda |
| DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| NATIVE_DOCUMENT_COPIED | S3 Copy Processor | Monitoring |
| DOCUMENT_NOT_FOUND | S3 Copy / Pre-Uploader | Error tracking |
| DOCUMENT_PRE_PROCESSED | Pre-Uploader Processor | Monitoring |
| DB_COPY_FINISHED | Step Function | Payload construction |
| EXCHANGER_DELETE_INITIATED | Event Processor | Self (de-orchestration loop) |
| EXCHANGER_FINISHED | Event Processor | Status update Lambda |
| EXCHANGER_FAILED | Error Handler | Status update Lambda |
Key Design Decisions¶
1. Dual Entry Point (API Gateway + SQS)¶
Decision: Single Lambda handler processes both synchronous API requests and asynchronous SQS events.
API Gateway POST → Orchestrator (creates infrastructure, returns jobId)
SQS events → Orchestrator (lifecycle management: cleanup, retry)
Why:
- API Gateway provides a synchronous request/response path for the initial exchange trigger
- SQS provides asynchronous retry for lifecycle events (de-orchestration)
- A single handler avoids duplicating orchestration logic
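A minimal sketch of this dual-entry routing. The names handleApiRequest and handleLifecycleEvent are hypothetical stand-ins for the real orchestration functions:

```typescript
// Simplified event shapes; real Lambda event types carry many more fields.
interface ApiGatewayLikeEvent { httpMethod: string; body: string | null; }
interface SqsLikeEvent { Records: { body: string }[]; }

// Stubs standing in for the actual orchestration logic (hypothetical names).
async function handleLifecycleEvent(message: unknown): Promise<void> {
  // de-orchestration / retry routing would live here
}
async function handleApiRequest(request: unknown): Promise<string> {
  return "job-123"; // stub: create infrastructure, return jobId
}

function isSqsEvent(event: ApiGatewayLikeEvent | SqsLikeEvent): event is SqsLikeEvent {
  return Array.isArray((event as SqsLikeEvent).Records);
}

async function handler(event: ApiGatewayLikeEvent | SqsLikeEvent) {
  if (isSqsEvent(event)) {
    // Asynchronous path: lifecycle events (cleanup, retries)
    for (const record of event.Records) {
      await handleLifecycleEvent(JSON.parse(record.body));
    }
    return;
  }
  // Synchronous path: API Gateway POST /exchange returns jobId to the caller
  const request = JSON.parse(event.body ?? "{}");
  return { statusCode: 200, body: JSON.stringify({ jobId: await handleApiRequest(request) }) };
}
```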
2. Conditional Queue/Lambda Creation¶
Decision: Only create the queues and Lambdas needed for a specific exchange
based on requiresAnnotations and includeNatives flags.
| Configuration | Queues Created | Lambdas Created |
|---|---|---|
| No annotations | s3_copy, doc_uploader | S3 Copy |
| Annotations only | pre_uploader, re_ocr, search_text, doc_uploader | Pre-Uploader, Re-OCR, Search Text |
| Annotations + natives | s3_copy, pre_uploader, re_ocr, search_text, doc_uploader | S3 Copy, Pre-Uploader, Re-OCR, Search Text |
Why: Minimizes resource creation overhead and avoids idle queues/Lambdas. An exchange without annotations doesn't need Nutrient processing infrastructure.
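The table above can be expressed as a small selection function. The flag and queue-type names follow the table, but the function itself is illustrative, not the actual code:

```typescript
type QueueType = "s3_copy" | "pre_uploader" | "re_ocr" | "search_text" | "doc_uploader";

// Decide which per-exchange queues to provision based on the request flags.
function queuesForExchange(requiresAnnotations: boolean, includeNatives: boolean): QueueType[] {
  const queues: QueueType[] = [];
  // s3_copy is needed when there are no annotations, or when natives are included
  if (!requiresAnnotations || includeNatives) queues.push("s3_copy");
  // Nutrient processing chain is only needed for annotation exchanges
  if (requiresAnnotations) queues.push("pre_uploader", "re_ocr", "search_text");
  queues.push("doc_uploader"); // always needed for destination ingestion
  return queues;
}
```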
3. Resource Reconstruction (Stateless Cleanup)¶
Decision: Rebuild queue URLs and Lambda ARNs from deterministic naming conventions rather than persisting them in a database.
// ResourceReconstructor rebuilds from naming pattern:
// Queue: doc_exch_{type}_queue_{caseId}_{batchId}
// Lambda: doc_exch_{type}_lambda_{caseId}_{batchId}
const queueUrl = `https://sqs.${region}.amazonaws.com/${accountId}/${queueName}`;
const lambdaArn = `arn:aws:lambda:${region}:${accountId}:function:${functionName}`;
Why: Eliminates need for persistent state storage (DynamoDB/SSM). If the orchestrator Lambda restarts mid-lifecycle, it can reconstruct all resource references from the exchange request parameters alone.
4. Circuit Breaker for Uploader Communication¶
Decision: UploaderCoordinator implements a circuit breaker pattern when messaging the Nutrient uploader orchestrator.
CLOSED state (normal):
→ Send messages to uploader queue
→ On failure: increment failure counter
→ If failures >= 5: transition to OPEN
OPEN state (circuit tripped):
→ Reject messages immediately (don't attempt send)
→ After 30s timeout: transition to HALF_OPEN
HALF_OPEN state (testing):
→ Allow one test message
→ If success: transition to CLOSED
→ If failure: transition back to OPEN
Why: The Nutrient uploader is a separate module with its own queue and scaling constraints. If it's overwhelmed or down, the circuit breaker prevents the exchanger from flooding it with messages and consuming its own Lambda concurrency on retries.
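A minimal circuit-breaker sketch matching the states above (threshold 5 failures, 30-second cooldown). The real UploaderCoordinator wraps SQS sends; here the action is injected, and the clock is injectable for testing:

```typescript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly threshold = 5,
    private readonly cooldownMs = 30_000,
    private readonly now: () => number = Date.now,
  ) {}

  async call<T>(send: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      // Reject immediately until the cooldown elapses, then allow one probe.
      if (this.now() - this.openedAt < this.cooldownMs) throw new Error("circuit open");
      this.state = "HALF_OPEN";
    }
    try {
      const result = await send();
      this.state = "CLOSED"; // success closes the circuit and resets the counter
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures += 1;
      // A failed probe, or reaching the threshold, trips the circuit open.
      if (this.state === "HALF_OPEN" || this.failures >= this.threshold) {
        this.state = "OPEN";
        this.openedAt = this.now();
      }
      throw err;
    }
  }
}
```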
5. AWS Glue for Bulk Database Copy¶
Decision: Use PySpark Glue job for bulk database ETL rather than per-record Lambda processing.
Glue Job (script.py):
Source DB: {DB_NAME_PREFIX}_{source_case_id}
Target DB: {DB_NAME_PREFIX}_{destination_case_id}
Tables copied: documents, document_variants, exhibits,
label_category_labels, custom_field_values, custodians,
custodian_exhibits, confidentiality_codes, confidentiality_logs,
relevancy_logs, folder_tags, folder_taggings, label_categories, etc.
Connection: JDBC via RDS proxy (writer + reader endpoints)
Optimizations:
- ThreadPoolExecutor (BFS_MAX_WORKERS=8) for parallel child table processing
- Thread-safe column caching (_columns_cache with lock)
- JDBC fetch size tuning (spark.sql.jdbc.fetchSize=10000)
- _execute_with_reconnect() for connection resilience
- delete_rows_in_batches() for safe bulk deletions
Post-copy operations:
- Clear document_properties (NULL) on exchanged exhibits
- Assign source exhibit ID as number field on target exhibits
- Remap confidentiality_status bitwise to target case codes
- Deduplicate label categories and folder tags by name
Why: Copying thousands of database records per-document via Lambda would be slow and expensive. Glue's Spark engine handles bulk JDBC reads/writes efficiently, and the Glue job runs independently of the Lambda timeout limit.
6. Step Functions for Glue Orchestration¶
Decision: Use Step Functions to orchestrate the Glue job → payload construction → notification sequence.
RunGlueJob (retry: 3 attempts, 30s interval, 2.0 backoff)
→ Success: DB_COPY_FINISHED → InvokePayloadLambda
→ Failure: GLUE_JOB_FAILED → SNS notification
Why: Glue jobs can run for minutes to hours. Step Functions provides native retry/backoff, failure handling, and SNS notifications without requiring a polling Lambda.
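For reference, the retry policy above (30s interval, 2.0 backoff rate, 3 attempts) produces exponentially growing waits before each retry. A sketch of the formula:

```typescript
// Wait before retry attempt n (1-based) is intervalSeconds * backoffRate^(n-1),
// matching how a Step Functions Retry block spaces its attempts.
function retryWaitsSeconds(intervalSeconds: number, backoffRate: number, maxAttempts: number): number[] {
  return Array.from({ length: maxAttempts }, (_, n) => intervalSeconds * Math.pow(backoffRate, n));
}
// retryWaitsSeconds(30, 2.0, 3) → [30, 60, 120]
```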
Processor Details¶
S3 Copy Processor¶
Copies document files from source case S3 to destination case S3:
Receive SQS message (documentId, sourcePath, destPath)
→ S3.CopyObject (server-side copy, no data transfer through Lambda)
→ If NoSuchKey: publish DOCUMENT_NOT_FOUND, skip
→ On success: publish DOCUMENT_COPIED
→ Send message to uploader queue (for Nutrient ingestion)
Server-side copy avoids downloading/uploading files through Lambda memory.
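A sketch of how the server-side copy request could be built from an SQS message. The message field names are assumptions, not the actual schema:

```typescript
// Hypothetical message shape for illustration.
interface CopyMessage {
  sourceBucket: string;
  sourcePath: string;
  destBucket: string;
  destPath: string;
}

// Builds CopyObject input: S3 performs the copy server-side, so the object
// bytes never pass through the Lambda. The S3 API expects the source key
// portion of CopySource to be URL-encoded.
function buildCopyParams(msg: CopyMessage) {
  return {
    CopySource: `${msg.sourceBucket}/${encodeURIComponent(msg.sourcePath)}`,
    Bucket: msg.destBucket,
    Key: msg.destPath,
  };
}
```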
Pre-Uploader Processor (Annotation Processing)¶
Handles PDF annotation transfer via Nutrient API. Supports 7 annotation types, each configurable as "brand_in" (flatten into PDF) or "layer" (overlay JSON):
| Annotation Type | Nutrient Type | Routing Config | Notes |
|---|---|---|---|
| Redactions | pspdfkit/markup/redact | redactionsAs | Applied first (ordering matters for overlap correctness) |
| Highlights | pspdfkit/markup/highlight | highlightsAs | Standard markup type |
| Notes | pspdfkit/note | notesAs | Standard markup type |
| Stickers | pspdfkit/image | stickersAs | Collects binary image attachments via collectImageAttachments() |
| Bates stamps | pspdfkit/text:bates_stamp | batesAs | Virtual type with :bates_stamp suffix for disambiguation |
| Confidentiality codes | pspdfkit/text:confidentiality_stamp | confidentialityCodesAs | Virtual type with :confidentiality_stamp suffix |
| Comments | highlight with isCommentThreadRoot | Always layer | Fetches thread details, writes to S3 separately |
Receive SQS message (documentId, exhibitId, annotationTypes)
→ Check S3 source PDF exists (skip with {skipped: true} if not)
→ Construct Nutrient document ID: document_{caseId}_{batchId}_{documentId}_{exhibitId}
→ Fetch annotations from Nutrient API
→ For "brand_in": Apply redactions first, then other types (ordering for overlap correctness)
→ For "layer": Upload JSON annotation data (stickers include image binary data)
→ For "comments": Fetch comment thread details, write to S3
→ On success: publish DOCUMENT_PRE_PROCESSED
→ Send message to uploader queue
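The "redactions first" rule for brand-in processing can be sketched as a stable partition, with the annotation shape simplified to just its Nutrient type string:

```typescript
interface Annotation { type: string; }

// Order annotations so redactions are applied before any other markup; relative
// order within each group is preserved (stable partition). This keeps overlapping
// markup layered on top of already-redacted content, never the reverse.
function orderForBrandIn(annotations: Annotation[]): Annotation[] {
  const isRedaction = (a: Annotation) => a.type === "pspdfkit/markup/redact";
  return [...annotations.filter(isRedaction), ...annotations.filter(a => !isRedaction(a))];
}
```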
Re-OCR Processor¶
Extracts text from Nutrient document pages:
Receive SQS message (documentId, exhibitId, attachmentPosition)
→ Call Nutrient API: GET /documents/{id}/pages/{position}/text
→ Upload text to S3: case_{caseId}/processed-files/{batchId}/txt/{documentId}/attachments/{position}/content.txt
→ Send message to search_text queue
Uses token caching for Nutrient API efficiency across messages in a batch.
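The destination key template above, written out as a hypothetical helper:

```typescript
// Builds the S3 key for extracted attachment text, following the path
// template documented for the Re-OCR processor.
function reOcrTextKey(caseId: number, batchId: number, documentId: number, position: number): string {
  return `case_${caseId}/processed-files/${batchId}/txt/${documentId}/attachments/${position}/content.txt`;
}
```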
De-Orchestration (Cleanup)¶
When PAYLOAD_CONSTRUCTION_FINISHED arrives, the orchestrator begins teardown:
1. Publish EXCHANGER_DELETE_INITIATED (starts cleanup loop)
2. Check all per-exchange queue depths
→ If any queue has messages: re-enqueue with 90s delay
→ If all queues empty: proceed to deletion
3. Delete event source mappings (Lambda ← SQS)
4. Delete processor Lambdas
5. Delete per-exchange SQS queues
6. Publish EXCHANGER_FINISHED
90-second delay allows in-flight messages to complete processing before queue deletion. The self-referencing event loop (similar to documentuploader) provides natural polling without Step Functions.
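The cleanup-loop decision (step 2 above) can be sketched as a pure function; names are illustrative:

```typescript
interface CleanupDecision {
  action: "DELETE_RESOURCES" | "RE_ENQUEUE";
  delaySeconds?: number;
}

// If any per-exchange queue still has messages, re-enqueue the delete event
// with a delay; only when all queues are drained do we proceed to deletion.
function decideCleanup(queueDepths: Record<string, number>, delaySeconds = 90): CleanupDecision {
  const pending = Object.values(queueDepths).some(depth => depth > 0);
  return pending ? { action: "RE_ENQUEUE", delaySeconds } : { action: "DELETE_RESOURCES" };
}
```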
Error Handling¶
Max receive count pattern (TypeScript equivalent of exception hierarchy):
| Receive Count | Action |
|---|---|
| < MAX_RECEIVE_COUNT (3) | Throw error → SQS requeues with visibility timeout |
| ≥ MAX_RECEIVE_COUNT (3) | Publish error to SNS, don't throw (message consumed) |
Error-to-event mapping:
| Processor | Error Event Type |
|---|---|
| s3_copy_processor | S3_DOC_COPIED (status: ERROR) |
| pre_uploader_processor | PRE_UPLOADER_FINISHED (status: ERROR) |
| orchestrator | EXCHANGER_FAILED |
Global DLQ: All per-exchange queues use a shared global DLQ. The DLQ processor publishes error events to SNS for monitoring.
SNS Filter Policies¶
| Queue | Filter |
|---|---|
| Orchestrator (exch-orch) | eventType: [PAYLOAD_CONSTRUCTION_FINISHED, EXCHANGER_DELETE_INITIATED, GLUE_JOB_FAILED] |
| Status Update (exch-status) | eventType: [EXCHANGER_FINISHED, EXCHANGER_INITIATED] |
| Re-OCR (per-exchange) | eventType: [DOCUMENT_UPLOADED], caseId: [{destCaseId}], batchId: [{batchId}] |
Divergences from Standard Architecture¶
| Standard Pattern | documentexchanger Divergence | Reason |
|---|---|---|
| Python 3.10+ | TypeScript (all Lambdas) + Python (Glue only) | TypeScript preferred for orchestration logic |
| Hexagonal core/shell | No separation — orchestration + infrastructure mixed | No domain logic; pure orchestration and S3/API operations |
| Exception hierarchy | Max-receive-count + DLQ in TypeScript | No RecoverableException/PermanentFailureException in TS |
| SQLAlchemy ORM | PySpark DataFrames via Glue | Bulk ETL, not per-record CRUD |
| Multi-tenant MySQL (Lambda) | Glue job reads/writes per-case databases | Lambda doesn't touch database directly |
| Checkpoint pipeline | N/A — single-action per processor message | No multi-step per-document processing |
| ECS long-running tasks | Lambda processors + Glue job | Operations fit within Lambda timeout |
Multi-Region Deployment¶
Same c2/c4/c5 naming convention as other modules:
| Region | Prefix | SNS Topic | Staging Bucket |
|---|---|---|---|
| us-east-1 | c2 | Per-region topic ARN | Per-region S3 bucket |
| us-west-1 | c4 | Per-region topic ARN | Per-region S3 bucket |
| ca-central-1 | c5 | Per-region topic ARN | Per-region S3 bucket |
Pre-Deployment Architectural Review¶
Status: Build phase — not yet deployed. The following issues were identified during Principal Architect review and must be addressed before production deployment.
P0 — Pre-Deployment Blockers¶
1. Idempotency Violations (Architecture Rule: All handlers MUST be idempotent)¶
Orchestrator handler is NOT idempotent. If the API Gateway handler fails after creating queues/Lambdas but before returning success, a retry creates duplicate resources. No deduplication check or request ID tracking exists.
// Current: generates new jobId every call — retries create new exchanges
const jobId = randomUUID();
// Required: check if exchange already initiated for this request
// Option A: Accept client-provided idempotency key
// Option B: DynamoDB conditional put on (sourceCaseId, destCaseId, labelId, batchId)
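One way to implement Option B's deterministic key, sketched under the assumption that the four request fields uniquely identify an exchange:

```typescript
import { createHash } from "node:crypto";

// Hash the request identity so a client retry maps to the same DynamoDB
// partition key instead of minting a fresh jobId. Field names mirror
// Option B above; the helper itself is hypothetical.
function idempotencyKey(sourceCaseId: number, destCaseId: number, labelId: number, batchId: number): string {
  return createHash("sha256")
    .update(`${sourceCaseId}:${destCaseId}:${labelId}:${batchId}`)
    .digest("hex");
}
// A DynamoDB PutItem with ConditionExpression "attribute_not_exists(pk)" on this
// key then rejects duplicate initiations instead of creating a second exchange.
```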
SQS message processing lacks partial batch failure handling. Line 217
throws errors immediately, causing the entire batch to fail and retry from
the beginning. Architecture requires {"batchItemFailures": [...]} for
individual record failures.
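A sketch of the required partial-batch shape (processOne is injected; this is the pattern, not the module's code):

```typescript
interface SqsRecord { messageId: string; body: string; }

// Process each record independently and report only the failed messageIds.
// With ReportBatchItemFailures enabled, SQS redelivers only those records
// instead of the entire batch.
async function handleBatch(
  records: SqsRecord[],
  processOne: (record: SqsRecord) => Promise<void>,
): Promise<{ batchItemFailures: { itemIdentifier: string }[] }> {
  const batchItemFailures: { itemIdentifier: string }[] = [];
  for (const record of records) {
    try {
      await processOne(record);
    } catch {
      batchItemFailures.push({ itemIdentifier: record.messageId });
    }
  }
  return { batchItemFailures };
}
```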
2. Race Condition in De-Orchestration Cleanup¶
Critical: The cleanup checks areAllQueuesEmpty() after a 90-second
delay, then deletes queues. But messages can arrive AFTER the check but
BEFORE deletion:
Time 0s: EXCHANGER_DELETE_INITIATED received
Time 90s: Check queue depth → 0 messages → "safe to delete"
Time 91s: New message arrives in queue (from slow processor)
Time 92s: Queue deleted → message LOST, processor Lambda errors
Fix: Cleanup must proceed in this order:
1. Remove SNS subscriptions (stop new messages)
2. Disable event source mappings (stop Lambda polling)
3. Wait for in-flight Lambda invocations to complete
4. Verify queues empty
5. Delete Lambdas
6. Delete queues
3. Missing Exception Type Hierarchy¶
The codebase uses generic throw error everywhere instead of the architecture's
exception hierarchy. All processors use SharedModule.handleErrors() which
only checks receive count — it doesn't distinguish between:
- Recoverable (Nutrient API timeout → retry)
- Permanent (document genuinely missing → DLQ immediately)
- Silent (already processed → delete message, log success)
Current behavior: All errors get 3 retries regardless of whether retrying will help.
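A minimal TypeScript equivalent of the missing hierarchy; the class names mirror the three categories above and are a proposal, not existing code:

```typescript
// Transient failure (e.g. Nutrient API timeout): retrying may help.
class RecoverableError extends Error {}
// Permanent failure (e.g. document genuinely missing): send to DLQ immediately.
class PermanentError extends Error {}
// Already processed: consume the message and log success.
class SilentError extends Error {}

// Route an error to the appropriate disposition instead of blindly retrying.
function retryDecision(err: Error): "RETRY" | "DLQ" | "CONSUME" {
  if (err instanceof SilentError) return "CONSUME";
  if (err instanceof PermanentError) return "DLQ";
  return "RETRY"; // RecoverableError and unknown errors default to retry
}
```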
4. No Transactional State Tracking¶
The orchestrator performs 6 steps (create queues → create Lambdas → apply policies → send uploader message → trigger Step Function → publish event). If any step fails after step 2, partial resources exist with no compensation or rollback.
Required: DynamoDB state table tracking orchestration progress:
PK: exchangeJobId
status: INITIATING | QUEUES_CREATED | LAMBDAS_CREATED | POLICIES_APPLIED |
STEP_FUNCTION_TRIGGERED | INITIATED | DE_ORCHESTRATING | FINISHED | FAILED
queueUrls: [...], lambdaArns: [...]
createdAt, updatedAt
This also eliminates the fragile resource reconstruction pattern (rebuilding ARNs from naming conventions that could silently generate wrong names if conventions change).
5. Hardcoded Secrets¶
Architecture rule: No secrets in code — use AWS Secrets Manager. Even in non-prod, hardcoded secrets in source code are a security violation. Use Secrets Manager for all environments.
P1 — High Priority (Pre-GA)¶
6. Overly Permissive IAM Policies¶
// orchestrator-stack.ts — WILDCARD PERMISSIONS
'sqs:*', // Should be: sqs:CreateQueue, sqs:DeleteQueue, sqs:SendMessage, etc.
'lambda:*', // Should be: lambda:CreateFunction, lambda:DeleteFunction, etc.
'ec2:*', // Should be: ec2:CreateNetworkInterface, ec2:DescribeNetworkInterfaces, etc.
Violates least-privilege principle. Scope to specific actions and resource ARN
patterns (e.g., arn:aws:sqs:*:*:doc_exch_*).
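A scoped statement might look like the following (the action list and ARN pattern are illustrative, not a vetted policy):

```json
{
  "Effect": "Allow",
  "Action": [
    "sqs:CreateQueue",
    "sqs:DeleteQueue",
    "sqs:SendMessage",
    "sqs:GetQueueAttributes"
  ],
  "Resource": "arn:aws:sqs:*:*:doc_exch_*"
}
```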
7. Inconsistent Logging¶
Mixed console.log() / console.warn() with Winston logger.INFO() /
logger.ERROR() across the orchestrator. Architecture requires structured
JSON logging with consistent context fields (caseId, batchId, jobId,
documentId).
Fix: Replace all console.* calls with the Winston logger from
processors/shared/logger.ts.
8. Missing Tests¶
| Component | Test Status |
|---|---|
| Orchestrator (index.ts) | ❌ No tests |
| Event Processor | ❌ No tests |
| Queue Manager | ❌ No tests |
| Lambda Manager | ❌ No tests |
| S3 Copy Processor | ⚠️ 3 basic tests only |
| Pre-Uploader Processor | ❌ No tests |
| Re-OCR Processor | ❌ No tests |
| Search Text Processor | ❌ No tests |
| Status Update Lambda | ❌ No tests |
| Glue Script | ❌ No tests |
| CDK Stacks | ⚠️ Placeholder only |
Architecture rule: Every handler test must verify idempotent behavior. None of the existing tests verify idempotency.
9. Fragile Lambda ZIP Construction¶
LambdaManager hand-rolls ZIP file format with manual CRC32 calculation and
binary buffer construction instead of using a proper ZIP library. This is
extremely fragile — binary format bugs will cause silent Lambda creation
failures.
Fix: Use archiver or jszip npm package.
10. Glue Script Lacks Safety Guarantees¶
# script.py — No rollback if job fails mid-copy
# No idempotency checks for already-copied data
# No validation of secret format
# No transaction safety on bulk writes
If the Glue job fails halfway through copying tables, partial data exists in the destination database with no way to identify what was copied versus what wasn't. Needs one of:
- Idempotent writes (INSERT IGNORE / ON DUPLICATE KEY UPDATE)
- A transaction wrapper with rollback
- Checkpoint tracking per table
P2 — Medium Priority¶
11. Hardcoded Configuration Values¶
| Value | Location | Should Be |
|---|---|---|
| 15 min Lambda timeout | orchestrator-stack.ts | Environment variable |
| 20000 bulk update size | search_text_processor | Environment variable |
| 1000 SQS batch size | lambda-manager.ts | Config per environment |
| 300s batch window | lambda-manager.ts | Config per environment |
| 90s cleanup delay | event-processor.ts | Config per environment |
12. Token Initialization Race Condition¶
PreUploaderProcessor stores Nutrient API token as instance variable
initialized async in init(). If two SQS messages arrive rapidly on a warm
Lambda, both call init() concurrently — race condition on token state.
Fix: Use module-level cached token (like re_ocr_processor does correctly
with let cachedToken).
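A sketch of that fix: cache the promise rather than the resolved token, so concurrent warm invocations share one in-flight fetch (fetchToken stands in for the Nutrient auth call):

```typescript
// Module-level cache: survives across invocations on a warm Lambda.
let cachedToken: Promise<string> | undefined;

function getToken(fetchToken: () => Promise<string>): Promise<string> {
  if (!cachedToken) {
    cachedToken = fetchToken().catch(err => {
      cachedToken = undefined; // don't cache failures; next call retries
      throw err;
    });
  }
  return cachedToken;
}
```

Because the promise is stored synchronously, two messages arriving back-to-back both await the same fetch instead of racing to initialize instance state.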
13. Connection Pool Staleness¶
SearchTextProcessor caches a MySQL connection pool at module level but
never validates or refreshes it. Long-running warm Lambdas may use stale
connections that have been terminated by RDS proxy.
Fix: Add pool health check or use pool.getConnection() with error
handling that recreates the pool on connection failure.
14. CDK Compilation During Synthesis¶
// orchestrator-stack.ts — BLOCKS CDK SYNTHESIS
execSync('npm run compile-processors', { stdio: 'inherit' });
Compiling TypeScript during CDK synthesis makes deployment dependent on
build environment state and adds latency to every cdk synth. Should be a
separate build step in CI/CD pipeline.
Architecture Compliance Summary¶
| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ⚠️ Partial | Events published, but routing is fragile (order-dependent if/else chain) |
| Idempotent handlers | ❌ FAIL | No deduplication checks anywhere |
| Exception hierarchy | ❌ FAIL | Generic errors thrown; no Recoverable/Permanent/Silent distinction |
| Checkpoint/state tracking | ❌ FAIL | No DynamoDB state table; relies on fragile resource reconstruction |
| Multi-tenant DB | ✅ OK | Glue script handles per-case databases correctly |
| Structured logging | ⚠️ Partial | Processors use Winston; orchestrator mixes console.log |
| No secrets in code | ❌ FAIL | Hardcoded 'secret' in lambda-manager.ts |
| IAM least privilege | ❌ FAIL | Wildcard permissions on SQS, Lambda, EC2 |
| Testing (idempotency) | ❌ FAIL | No idempotency tests; most components untested |
| Partial batch failures | ❌ FAIL | Missing batchItemFailures response |
Lessons Learned¶
- Resource reconstruction eliminates state persistence — deterministic naming conventions let you rebuild ARNs/URLs without DynamoDB or SSM, simplifying cleanup after Lambda restarts.
- Conditional infrastructure minimizes waste — only creating the queues/Lambdas needed for a specific exchange type reduces provisioning time and avoids idle resources.
- Circuit breaker prevents cascading failures — when the Nutrient uploader is overwhelmed, the circuit breaker stops the exchanger from wasting Lambda concurrency on failed sends.
- Glue excels at bulk cross-database ETL — copying thousands of records between per-case databases is a natural fit for Spark, avoiding Lambda timeout and connection pool constraints.
- Dual entry points (API + SQS) enable sync + async in one handler — the initial exchange request needs a synchronous response (jobId), while lifecycle events are naturally asynchronous.
- De-orchestration delay (90s) prevents message loss — checking queue depth before deletion ensures in-flight messages complete processing.