Reference Implementation: documentuploader¶
Document upload and PDF processing service using Nutrient (PSPDFKit) engine. Dynamically provisions per-case/batch ECS infrastructure via CloudFormation, processes documents through a multi-container sidecar architecture, and tears down resources when complete.
Architecture Overview¶
documentuploader is not a standard Lambda + SQS service module. It is a dynamic infrastructure orchestrator that creates and destroys full CloudFormation stacks per case/batch at runtime. This is driven by the compute requirements of the Nutrient (PSPDFKit) document processing engine — PDF flattening, rendering, and annotation processing require persistent ECS Fargate tasks with sidecar containers, not ephemeral Lambda invocations.
Why Dynamic Stacks?¶
| Concern | Standard Module (Lambda) | documentuploader (Dynamic ECS) |
|---|---|---|
| Compute duration | < 15 min | Minutes to hours per batch |
| Resource isolation | Lambda concurrency | Full network + queue isolation per case |
| Scaling unit | Lambda invocations | ECS tasks with queue-depth step scaling |
| Database connections | Connection per invocation | Connection pool per task (throttled globally) |
| Cleanup | N/A | Full stack destruction on completion |
Architecture Tree¶
documentuploader/
├── orchestrator/ # Global orchestration Lambda
│ ├── index.ts # Main handler: stack lifecycle management
│ ├── event-factory.ts # SNS event construction helpers
│ └── dlq/
│ └── index.ts # Global DLQ handler → UPLOADER_FAILED events
├── global/
│ └── infrastructure.ts # Global CDK app entry (all shared stacks)
├── per-import/
│ └── infrastructure.ts # Per-import CDK app (synthesized → S3 template)
├── lib/
│ ├── nutrient/ # Global shared infrastructure stacks
│ │ ├── orchestrator-stack.ts # Orchestrator Lambda + global SQS + SNS subs
│ │ ├── ecs-stack.ts # Shared ECS cluster + ALB
│ │ ├── ecs-roles-stack.ts # IAM roles for ECS tasks
│ │ ├── networking-stack.ts # VPC, subnets, security groups
│ │ ├── database-stack.ts # Nutrient PostgreSQL RDS
│ │ ├── poller-image-stack.ts # ECR: Poller Docker image
│ │ ├── redirection-service-image-stack.ts # ECR: Redirection Service image
│ │ ├── jwt-key-stack.ts # JWT keys for Nutrient engine
│ │ ├── orchestrator-roles-stack.ts # IAM for orchestrator Lambda
│ │ ├── import-specific-infra-roles-stack.ts # IAM for per-import resources
│ │ ├── dlq/
│ │ │ └── index.ts # Per-import DLQ handler Lambda
│ │ └── message-transfer-lambda/
│ │ └── index.ts # PDF retry → main queue transfer
│ ├── uploader/ # Per-import nested stacks
│ │ ├── helper-stack.ts # SQS queues, DLQ, security groups
│ │ ├── task-definition-stack.ts # 3-container Fargate task definition
│ │ ├── ecs-service-stack.ts # QueueProcessingFargateService + scaling
│ │ └── uploader-ecs-stack.ts # Parent stack (composes nested stacks)
│ ├── doc_uploader_poller/ # Ruby — SQS consumer + Nutrient API caller
│ │ ├── doc_uploader_poller.rb # Shoryuken worker: main processing loop
│ │ ├── nutrient_api.rb # HTTP client for Nutrient engine
│ │ ├── Gemfile # Ruby dependencies (shoryuken, aws-sdk, etc.)
│ │ └── Dockerfile
│ ├── redirection-service/ # Node.js — S3 presigned URL generator
│ │ ├── app.js # Express server: presigned URLs + doc lookup
│ │ ├── package.json
│ │ └── Dockerfile
│ └── helpers/
│ └── config-helper.ts # Environment × region config loading
├── bin/configs/
│ ├── config.json # Per-environment infrastructure settings
│ └── nutrient-config.json # Nutrient engine-specific settings
├── pipeline/ # CI/CD pipeline stacks
└── test/ # Jest CDK tests
Hybrid Language Stack¶
| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Orchestrator | TypeScript | Lambda (Node.js) | Stack lifecycle, throttling, monitoring |
| DLQ Handler | TypeScript | Lambda (Node.js) | Error notification events |
| Message Transfer | TypeScript | Lambda (Node.js) | PDF retry → main queue forwarding |
| Poller | Ruby | ECS (Shoryuken) | SQS consumer, Nutrient API caller |
| DocEngine | Nutrient (binary) | ECS container | PDF processing, flattening, annotations |
| Redirection Service | Node.js | ECS (Express) | S3 presigned URL generation |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |
Why Ruby for the Poller? Shoryuken is a mature SQS processing framework with built-in visibility timeout management, receive count tracking, and concurrent worker pools — purpose-built for long-running SQS consumers on ECS.
Nutrient (PSPDFKit) Document Engine Integration¶
Nutrient (formerly PSPDFKit) is a commercial document processing engine that runs as a Docker container. documentuploader wraps it in a sidecar architecture where the Poller (Ruby) drives the API and the Redirection Service (Node.js) handles S3 access.
Nutrient API Operations¶
The NutrientApi Ruby class (nutrient_api.rb) communicates with DocEngine via localhost:5000 (same Fargate task, shared network namespace):

| Operation | Endpoint | Purpose |
|---|---|---|
| Load Document | `POST /api/documents` | Upload a document into Nutrient's storage (multipart form-data) |
| Flatten PDF | `POST /api/documents/{id}/apply_instructions` | Merge all layers (annotations, form fields, signatures) into static content |
Load Document flow:
Poller receives SQS message
→ Constructs document ID: document_{caseId}_{batchId}_{documentId}_{exhibitIds}
→ Calls Redirection Service (localhost:80) to get PDF content via presigned URL
→ POST /api/documents to Nutrient with file or URL reference
→ Returns Nutrient's internal document_id
Flatten PDF flow:
After load → POST /api/documents/{id}/apply_instructions
→ Instructions: { parts: [{ document: { id: "#self" } }], actions: [{ type: "flatten" }] }
→ Timeout: 240 seconds (CPU-intensive operation)
→ Merges annotations, form fields, signatures into non-editable layer
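The `apply_instructions` payload shown in the flow above can be expressed as a small builder. This is an illustrative TypeScript sketch (the actual Poller is Ruby); the payload shape mirrors the instructions in the flow, and the helper name is an assumption.

```typescript
// Build the flatten request body for POST /api/documents/{id}/apply_instructions.
// The shape matches the instructions shown in the flow above; the function
// name is illustrative, not the Poller's actual code.
function buildFlattenInstructions(): object {
  return {
    parts: [{ document: { id: "#self" } }], // operate on the already-loaded document
    actions: [{ type: "flatten" }],         // merge annotations/forms/signatures into page content
  };
}
```

The Poller would send this body with a 240-second timeout, matching the CPU-intensive flatten operation.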
Document ID Encoding Scheme¶
Documents are identified by a composite ID, `document_{caseId}_{batchId}_{documentId}_{exhibitIds}`, that encodes routing metadata. The Redirection Service parses this to resolve the S3 path:
- ext=pdf → case_{caseId}/processed-files/{batchId}/native/{documentId}
- ext=* (generated) → case_{caseId}/processed-files/{batchId}/pdf/{documentId}
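The parsing the Redirection Service performs can be sketched as follows. This is a hedged TypeScript illustration, not the actual Node.js implementation; the function and type names are assumptions, while the composite ID format and S3 path templates come from this section.

```typescript
// Illustrative parser for the composite document ID:
//   document_{caseId}_{batchId}_{documentId}_{exhibitIds}
interface ParsedDocumentId {
  caseId: string;
  batchId: string;
  documentId: string;
  exhibitIds: string;
}

function parseDocumentId(compositeId: string): ParsedDocumentId {
  const parts = compositeId.split("_");
  if (parts.length < 5 || parts[0] !== "document") {
    throw new Error(`Unexpected document ID format: ${compositeId}`);
  }
  const [, caseId, batchId, documentId, ...rest] = parts;
  return { caseId, batchId, documentId, exhibitIds: rest.join("_") };
}

// Resolve the S3 key from the parsed ID and the requested extension:
// ext=pdf maps to the native upload, anything else to the generated PDF.
function resolveS3Key(compositeId: string, ext: string): string {
  const { caseId, batchId, documentId } = parseDocumentId(compositeId);
  const variant = ext === "pdf" ? "native" : "pdf";
  return `case_${caseId}/processed-files/${batchId}/${variant}/${documentId}`;
}
```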
Redirection Service (Presigned URL Proxy)¶
Express.js sidecar (localhost:80) that provides S3 access without exposing credentials to the Nutrient engine:
| Endpoint | Purpose |
|---|---|
| `GET /documents/{documentId}?ext=pdf\|generated` | Generate presigned URL (7-day expiry), stream PDF content back |
| `GET /list/{documentId}` | Check if PDF exists on S3 (ListObjectsV2) |
| `GET /health` | Container health check |
Why a separate service? The Nutrient DocEngine fetches documents via HTTP URL. Rather than giving DocEngine direct S3 credentials, the Redirection Service generates presigned URLs and streams content. This isolates S3 access to a controlled proxy and enables the `HTTP_PROXY_REMOTE_FILE_DOWNLOAD=http://localhost:80` configuration on DocEngine.
Nutrient Configuration (nutrient-config.json)¶
Environment × region matrix controlling DocEngine behavior:
| Setting | Purpose | Example Values |
|---|---|---|
| `DATABASE_CONNECTIONS` | PostgreSQL pool per container | 5 (default), varies by env |
| `PSPDFKIT_WORKER_POOL_SIZE` | Internal worker threads | 30 |
| `ASSET_STORAGE_BACKEND` | Storage backend | `s3` |
| `ASSET_STORAGE_CACHE_SIZE` | Local cache for assets | 4 GB |
| `USE_REDIS_CACHE` | Enable Redis for caching | `true` (staging/prod) |
| `JWT_ALGORITHM` | Auth algorithm | RS256 |
| `SERVER_REQUEST_TIMEOUT` | Request timeout (ms) | 240000 (matches flatten timeout) |
| `ACTIVATION_KEY` | Nutrient license key | Per-environment |
Error Handling in Nutrient Operations¶
| Scenario | Poller Behavior |
|---|---|
| Nutrient API returns non-200 | Re-raise → Shoryuken requeues (up to MAX_RECEIVE_COUNT=4) |
| Flatten timeout (>240s) | Log error, publish to SNS, return nil |
| PDF not found on S3 (first attempt) | Send to PDF retry queue (3-min delay) |
| PDF not found on S3 (from retry queue) | Log and skip (SilentSuccess equivalent) |
| MAX_RECEIVE_COUNT exceeded (4) | Message moves to DLQ |
| Unexpected exception | Log, call handle_errors, publish SNS event |
Pattern Mapping¶
Patterns Implemented¶
| Architecture Pattern | documentuploader Implementation | Notes |
|---|---|---|
| SNS Event Publishing | `event-factory.ts` — typed event builders | TypeScript instead of Python `EventType` enum |
| SQS Handler | Orchestrator Lambda processes from global SQS | Standard parse → route → handle flow |
| Exception Hierarchy | Stack status → event type mapping | CF status codes replace Python exception types |
| Retry & Resilience | Re-enqueue with delay for lifecycle monitoring | 900s delay for throttled re-checks |
| Job Processor Orchestration | Orchestrator manages per-import stack lifecycle | Full CF stacks instead of SQS/DLQ/Lambda sets |
| Idempotent Handlers | SSM parameter store tracks stack state | Prevents duplicate stack creation |
| Config Management | `config.json` + `nutrient-config.json` matrix | Environment × region, same as standard |
| CDK Infrastructure | Global stacks + synthesized per-import template | Two-tier: global (deployed) + per-import (dynamic) |
| ECS Long-Running Workloads | Multi-container Fargate tasks | Sidecar pattern with localhost communication |
| Structured Logging | CloudWatch log groups per component | Per-import log streams for isolation |
Patterns NOT Used¶
| Pattern | Why Not |
|---|---|
| Checkpoint Pipeline | No multi-step document processing — single upload action per message |
| Database Session (MySQL) | Uses Nutrient's PostgreSQL, not multi-tenant MySQL |
| ORM Models | No SQLAlchemy — Nutrient engine manages its own database |
| Elasticsearch Bulk Indexing | No search indexing in upload phase |
| Concurrent Write Safety | No shared MySQL writes — isolated per-import queues |
| Chunk Dispatch | Documents dispatched individually to per-import queues |
Event Flow¶
Inbound Events (Orchestrator Consumes)¶
| Event | Source | Action |
|---|---|---|
| `JOB_STARTED` | Job Processor | Check connection pool → create CF stack or throttle |
| `LOADER_FINISHED` (COMPLETED/SUCCESS) | documentloader | Check queues empty → delete stack |
| `JOB_FINISHED` (COMPLETED) | Job Processor | Enable PDF retry message transfer Lambda |
| `IMPORT_CANCELLED` | User action | Cancel/delete uploader stack |
| `UPLOADER_THROTTLED` | Self (re-enqueued) | Retry stack creation after delay |
| `UPLOADER_CREATE_INITIATED` | Self (re-enqueued) | Poll CF stack creation status |
| `UPLOADER_DELETE_INITIATED` | Self (re-enqueued) | Poll CF stack deletion status |
| `UPLOADER_CANCEL_INITIATED` | Self (re-enqueued) | Poll CF stack cancellation status |
Outbound Events (Orchestrator Publishes)¶
| Event | Trigger | Downstream Consumer |
|---|---|---|
| `UPLOADER_THROTTLED` | Active connections > MAX_ALLOWED | Self (delayed re-enqueue) |
| `UPLOADER_CREATE_INITIATED` | CF CreateStack called | Self (status polling) |
| `UPLOADER_STARTED` | CF stack CREATE_COMPLETE | Job Processor |
| `UPLOADER_DELETE_INITIATED` | CF DeleteStack called | Self (status polling) |
| `UPLOADER_FINISHED` | CF stack DELETE_COMPLETE | Job Processor |
| `UPLOADER_CANCELLED` | CF stack deleted after cancel | Job Processor |
| `UPLOADER_CANCEL_INITIATED` | Cancel in progress | Self (status polling) |
| `UPLOADER_FAILED` | Stack creation failed after retries | Job Processor |
| `UPLOADER_WARNING` | Non-fatal error during orchestration | Monitoring |
| `DOCUMENT_UPLOADED` | Poller: document processed | documentextractor |
Self-Referencing Event Loop¶
The orchestrator uses SNS → SQS → self as a polling mechanism for CloudFormation stack status. Instead of sleeping or Step Functions wait states:
Orchestrator publishes UPLOADER_CREATE_INITIATED
→ SNS routes back to orchestrator's SQS queue
→ Orchestrator re-processes: checks CF stack status
→ If CREATE_IN_PROGRESS: re-publish UPLOADER_CREATE_INITIATED (poll again)
→ If CREATE_COMPLETE: publish UPLOADER_STARTED (done)
→ If ROLLBACK_COMPLETE: retry or publish UPLOADER_FAILED
This pattern avoids Step Functions costs while maintaining event-driven asynchronous polling. The SQS visibility timeout provides natural delay between status checks.
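The status-to-event decision in the polling loop above can be sketched as a pure function. The event and status names come from the flow; the function itself is illustrative, not the orchestrator's actual code.

```typescript
// Illustrative mapping from CloudFormation stack status to the next event the
// orchestrator publishes during the create-polling loop.
type NextAction =
  | { publish: "UPLOADER_CREATE_INITIATED" } // still in progress: poll again
  | { publish: "UPLOADER_STARTED" }          // create complete: notify Job Processor
  | { publish: "UPLOADER_FAILED" };          // rollback: retry upstream or fail

function nextCreateAction(stackStatus: string): NextAction {
  switch (stackStatus) {
    case "CREATE_IN_PROGRESS":
      return { publish: "UPLOADER_CREATE_INITIATED" };
    case "CREATE_COMPLETE":
      return { publish: "UPLOADER_STARTED" };
    case "ROLLBACK_IN_PROGRESS":
    case "ROLLBACK_COMPLETE":
      return { publish: "UPLOADER_FAILED" };
    default:
      throw new Error(`Unhandled stack status: ${stackStatus}`);
  }
}
```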
Key Design Decisions¶
1. Dynamic CloudFormation Stack Provisioning¶
Decision: Create and destroy full CloudFormation stacks per case/batch at runtime instead of using shared infrastructure.
Why:

- Resource isolation — each case gets its own SQS queues, ECS tasks, and network security groups; no cross-case interference
- Connection pool management — Nutrient's PostgreSQL has a hard connection limit; per-import stacks enable precise connection accounting
- Clean teardown — deleting the CF stack guarantees all resources are removed; no orphaned queues or tasks
- Independent scaling — each case scales based on its own queue depth
Trade-off: Stack creation takes 3-5 minutes (CF provisioning latency). Mitigated by the orchestrator's self-polling event loop.
2. Three-Container Sidecar Architecture¶
Decision: Single Fargate task runs three containers communicating via localhost:
┌─────────────────────────────────────────────┐
│ Fargate Task (shared network namespace) │
│ │
│ ┌──────────┐ localhost:5000 ┌───────────┐│
│ │ Poller │──────────────→ │ DocEngine ││
│ │ (Ruby) │ │ (Nutrient) ││
│ └─────┬─────┘ └───────────┘│
│ │ localhost:80 │
│ ▼ │
│ ┌─────────────────┐ │
│ │ Redirection Svc │──→ S3 Presigned URLs │
│ │ (Node.js) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────┘
Why:

- DocEngine requires co-located access (low latency for API calls)
- Redirection Service generates presigned URLs without embedding S3 credentials in the Poller or DocEngine
- All three scale together as a unit (queue depth drives task count)
3. SSM Parameter Store for Throttle State¶
Decision: Track per-stack throttle state in SSM Parameters
(/nge/docUploader/{stackName}) rather than DynamoDB or in-memory.
Why:

- Survives Lambda cold starts (no in-memory state loss)
- Simple key-value read/write (no table schema needed)
- Built-in TTL not needed — explicit cleanup on stack deletion
- Low-volume writes (only on throttle/unthrottle events)
4. PDF Retry Queue Pattern¶
Decision: Three-queue architecture per import: main queue, DLQ, and PDF retry queue.
Main Queue ←──────────── Message Transfer Lambda (enabled after JOB_FINISHED)
│ ↑
▼ │
Poller PDF Retry Queue
│ ↑
├── Success → SNS │
├── Failure → DLQ │
└── PDF not ready ────────────→┘ (3-min visibility timeout)
Why: PDFs may not be generated on S3 yet when the upload message arrives.
Rather than failing and consuming DLQ retries, the Poller sends the message to
a dedicated retry queue with a delay. The Message Transfer Lambda is initially
disabled and only enabled after JOB_FINISHED (when all PDFs are guaranteed
to exist), forwarding retry messages back to the main queue for reprocessing.
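The routing described above can be summarized in a small decision function. This is an illustrative TypeScript sketch (the real Poller is Ruby/Shoryuken); the route names are assumptions standing in for the actual queue operations.

```typescript
// Illustrative routing decision for a processed message, per the three-queue
// pattern: success publishes to SNS, hard failure goes to the DLQ, a missing
// PDF is parked on the retry queue on the first miss and skipped thereafter.
type Route = "publish_sns" | "dlq" | "pdf_retry_queue" | "skip";

function routeResult(opts: {
  pdfExistsOnS3: boolean;
  fromRetryQueue: boolean;
  processingSucceeded?: boolean;
}): Route {
  if (!opts.pdfExistsOnS3) {
    // First miss: park with a delay; a miss from the retry queue is a silent skip.
    return opts.fromRetryQueue ? "skip" : "pdf_retry_queue";
  }
  return opts.processingSucceeded ? "publish_sns" : "dlq";
}
```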
5. Connection Pool Throttling¶
Decision: Global throttling based on total ECS connection count vs
MAX_ALLOWED_CONNECTIONS.
active_connections = running_tasks × PER_TASK_CONNECTION
if active_connections > MAX_ALLOWED_CONNECTIONS:
publish UPLOADER_THROTTLED (re-enqueue with 900s delay)
else:
create per-import stack
Why: Nutrient's PostgreSQL RDS has a hard connection limit. Each ECS task
opens PER_TASK_CONNECTION connections (e.g., 10). The orchestrator checks the
ECS cluster's running task count before creating new stacks, preventing database
connection exhaustion across all active imports.
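The throttle check can be written as a pure predicate over the same quantities. The parameter names mirror the orchestrator's `MAX_ALLOWED_CONNECTIONS` and `PER_TASK_CONNECTION` environment variables; the helper itself is illustrative.

```typescript
// Throttle when projected connections would exceed the global cap, per the
// pseudocode above: active = running tasks × connections per task.
function shouldThrottle(
  runningTasks: number,
  perTaskConnections: number,
  maxAllowedConnections: number,
): boolean {
  const activeConnections = runningTasks * perTaskConnections;
  return activeConnections > maxAllowedConnections;
}
```

When this returns true, the orchestrator publishes `UPLOADER_THROTTLED` and re-enqueues with a 900-second delay instead of creating the stack.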
6. Queue-Depth Step Scaling¶
Decision: ECS auto-scaling based on SQS visible message count with step thresholds, not CPU/memory.
| Queue Depth | Scaling Action |
|---|---|
| > 5000 messages | Maximum capacity |
| > 2000 messages | High capacity |
| > 500 messages | Medium capacity |
| > 200 messages | Low capacity |
| < 200 messages | Minimum (1 task) |
Why: PDF processing is I/O-bound (S3 reads, API calls), not CPU-bound. Queue depth directly correlates with work remaining, making it a better scaling signal than resource utilization.
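A sketch of how the step thresholds above might map queue depth to a desired task count. The depth thresholds come from the table; the concrete capacity fractions per tier are assumptions, since the document only names the tiers Maximum/High/Medium/Low.

```typescript
// Illustrative step-scaling tiers keyed on SQS visible message count.
// Thresholds are from the table; the per-tier fractions of maxCapacity are
// assumed for illustration.
function desiredCapacity(queueDepth: number, maxCapacity: number): number {
  if (queueDepth > 5000) return maxCapacity;                  // Maximum
  if (queueDepth > 2000) return Math.ceil(maxCapacity * 0.75); // High
  if (queueDepth > 500) return Math.ceil(maxCapacity * 0.5);   // Medium
  if (queueDepth > 200) return Math.ceil(maxCapacity * 0.25);  // Low
  return 1;                                                    // Minimum: one task
}
```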
Divergences from Standard Architecture¶
| Standard Pattern | documentuploader Divergence | Reason |
|---|---|---|
| Python 3.10+ | Ruby (Poller) + TypeScript (Orchestrator) + Node.js (Redirection) | Shoryuken for SQS processing; Nutrient SDK integration |
| Lambda handlers | ECS Fargate tasks (Poller + DocEngine + Redirection) | Long-running compute, sidecar requirements |
| Hexagonal core/shell | No core/shell split — orchestrator is infrastructure-only | No domain logic; pure infrastructure lifecycle management |
| Multi-tenant MySQL | Nutrient PostgreSQL (single shared instance) | Nutrient engine manages its own database |
| SQLAlchemy ORM | No ORM — Nutrient engine handles persistence | Document storage is Nutrient's responsibility |
| SQS → Lambda integration | SQS → Shoryuken (Ruby) on ECS | Long-running consumer with connection pooling |
| Static infrastructure | Dynamic CF stack creation/deletion per case/batch | Resource isolation and connection pool management |
| Exception hierarchy (Python) | CF stack status codes drive event routing | TypeScript orchestrator maps CF statuses to events |
| Checkpoint pipeline | N/A — single-action processing per message | Each message is one document upload, no multi-step |
| SNS filter policies | SNS filter on orchestrator queue + per-import queue isolation | Two-tier: global routing + per-import physical separation |
Configuration Model¶
Two config files (environment × region matrix):
| File | Purpose | Example Keys |
|---|---|---|
| `config.json` | Infrastructure settings | VPC IDs, subnets, accounts, RDS endpoints |
| `nutrient-config.json` | Nutrient engine settings | Connection limits, worker pools, API tokens |
Runtime environment variables by component:
| Component | Key Variables |
|---|---|
| Orchestrator | MAX_ALLOWED_CONNECTIONS, PER_TASK_CONNECTION, TEMPLATE_BUCKET, RE_ENQUEUE_DELAY_IN_SECONDS |
| Poller | PER_CASE_DOC_UPLOADER_QUEUE_NAME, PDF_RETRY_QUEUE_PREFIX, REDIRECTION_SERVICE_ENDPOINT |
| DocEngine | PGHOST, DATABASE_CONNECTIONS, ASSET_STORAGE_S3_BUCKET, ACTIVATION_KEY |
| Redirection Svc | BUCKET_NAME, PORT, AWS_REGION |
CI/CD Pipeline Architecture¶
Three separate CodePipeline/CodeBuild pipelines:
| Pipeline | Purpose | Trigger |
|---|---|---|
| Per-Import Pipeline | Synthesizes per-import CDK → CF template, uploads to S3 | Code push to per-import stacks |
| SOCI Pipeline | Builds Seekable OCI (SOCI) index for DocEngine image | DocEngine image push |
| DocEngine Image Push | Pushes pre-built Nutrient DocEngine image to ECR | Manual / release |
SOCI (Seekable OCI): Enables lazy-loading of container image layers so ECS tasks start faster — critical when dynamic stacks spin up new tasks and image pull time directly impacts batch processing latency.
Multi-Region Deployment¶
Region prefix naming convention:
| Region | Prefix | Example Stack |
|---|---|---|
| us-east-1 | `c2` | `c2-staging-UploaderStack-123-456` |
| us-west-1 | `c4` | `c4-production-UploaderStack-123-456` |
| ca-central-1 | `c5` | `c5-production-UploaderStack-123-456` |
Each region has independent: VPC, subnets, RDS instance, Redis cluster, S3 bucket for assets, ACM certificate, and max connection thresholds.
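The naming convention can be captured in a trivial helper, assuming (based on the example stack names above) that the trailing numbers are the case and batch IDs:

```typescript
// Illustrative stack-name builder following the {regionPrefix}-{env}-UploaderStack-{caseId}-{batchId}
// convention. The caseId/batchId interpretation of the suffix is an assumption.
function uploaderStackName(
  regionPrefix: string,
  env: string,
  caseId: string,
  batchId: string,
): string {
  return `${regionPrefix}-${env}-UploaderStack-${caseId}-${batchId}`;
}
```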
Dynamic Worker Scaling¶
The orchestrator supports event-driven worker limit overrides:
JOB_STARTED event
→ eventDetail.workerLimit = 50 (from job processor)
→ Orchestrator: newCapacity = Math.ceil(50 / PER_TASK_CONNECTION)
→ Updates Application Auto Scaling maxCapacity
→ Adjusts step scaling policy if workerLimit > default maxWorkers
This allows the job processor to request more (or fewer) ECS tasks per import based on batch size or priority, overriding the default scaling configuration.
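The capacity computation above can be stated as a one-line helper (names are illustrative):

```typescript
// New ECS max capacity from a requested worker limit, per the flow above:
// round up so the last partial task still gets provisioned.
function capacityForWorkerLimit(workerLimit: number, perTaskConnection: number): number {
  return Math.ceil(workerLimit / perTaskConnection);
}
```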
Stack Cleanup on Deletion¶
When a per-import stack is deleted, the orchestrator performs explicit cleanup:
- Unsubscribe per-import queue from SNS topic (ARN stored in queue tags)
- Disable Message Transfer Lambda event source mapping
- Delete per-import SQS queues (main, DLQ, PDF retry)
- Delete SSM parameter store entry
- Delete CloudFormation stack (removes ECS tasks, security groups, log config)
This ensures no orphaned resources remain after batch processing completes.
Container Build Patterns¶
Both custom containers use Alpine-based minimal images:
Ruby Poller (multi-stage build):
Stage 1 (builder): ruby:3.2.2-alpine + build-base → compile native gems
Stage 2 (runner): ruby:3.2.2-alpine (minimal) → copy compiled gems
Entry: bundle exec shoryuken -q {QUEUE} -r ./doc_uploader_poller.rb -c {CONCURRENCY}
Redirection Service (single stage): a single Alpine-based Node.js image running the Express server (app.js).
Poller concurrency is configurable via pollerConcurrency config (default: 1),
controlling Shoryuken worker threads per ECS task.
Observability¶
Logging Strategy¶
Toggle between centralized and per-import log groups:
| Mode | Log Group Pattern | Use Case |
|---|---|---|
| Centralized | `/nge/{region}-{env}/docUploader/ecs/docEngine/{caseId}/{batchId}` | Production: per-case log isolation |
| Local | `/nge/{region}-{env}/docUploader/ecs` | Dev/test: simplified debugging |
Controlled by useCentralizedLogging config flag.
Structured Logging¶
All components emit structured JSON logs with consistent context:
{
"level": "INFO",
"message": "Document uploaded successfully",
"caseId": "123",
"batchId": "456",
"jobId": "789",
"documentId": "doc_001",
"exhibitIds": "[1,2,3]"
}
The Ruby Poller, Node.js Redirection Service, and TypeScript Orchestrator all follow the same field naming convention for cross-component log correlation.
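A minimal shared formatter matching those fields might look like the sketch below; the interface and function names are assumptions, not the actual shared code.

```typescript
// Illustrative structured-log formatter using the field names shown above,
// so all components emit correlatable JSON lines.
interface LogContext {
  caseId: string;
  batchId: string;
  jobId: string;
  documentId?: string;
  exhibitIds?: string;
}

function logLine(
  level: "INFO" | "WARN" | "ERROR",
  message: string,
  ctx: LogContext,
): string {
  return JSON.stringify({ level, message, ...ctx });
}
```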
Lessons Learned¶
- Self-referencing SNS→SQS polling is cost-effective — avoids Step Functions for simple status polling; SQS visibility timeout provides natural backoff.
- Dynamic stacks add latency but guarantee isolation — 3-5 minute CF creation time is acceptable for batch imports that run hours; isolation prevents noisy-neighbor problems across cases.
- Connection pool accounting is critical — without global throttling, a burst of concurrent imports can exhaust the database connection pool, causing cascading failures across all active imports.
- Disabled-by-default Lambdas are a useful coordination primitive — the Message Transfer Lambda starts disabled and is enabled via API call when upstream processing completes, avoiding premature retry of PDFs.
- Three-queue pattern separates concerns — main queue for active work, DLQ for permanent failures, retry queue for timing-dependent retries. Each queue has different visibility timeouts and consumer patterns.
- Ruby Shoryuken excels for long-running SQS consumers — built-in receive count tracking, visibility timeout extension, and concurrent worker pools make it ideal for ECS-based queue processing.
- SOCI indexes reduce ECS task startup time — lazy-loading container layers matters when dynamic stacks create tasks on demand and every minute of CF + image pull latency delays batch processing.
- Explicit cleanup prevents resource leaks — deleting a CF stack removes ECS resources but not SQS queues or SSM parameters; the orchestrator must clean these up separately.