
Reference Implementation: documentuploader

Document upload and PDF processing service using Nutrient (PSPDFKit) engine. Dynamically provisions per-case/batch ECS infrastructure via CloudFormation, processes documents through a multi-container sidecar architecture, and tears down resources when complete.

Architecture Overview

documentuploader is not a standard Lambda + SQS service module. It is a dynamic infrastructure orchestrator that creates and destroys full CloudFormation stacks per case/batch at runtime. This is driven by the compute requirements of the Nutrient (PSPDFKit) document processing engine — PDF flattening, rendering, and annotation processing require persistent ECS Fargate tasks with sidecar containers, not ephemeral Lambda invocations.

Why Dynamic Stacks?

| Concern | Standard Module (Lambda) | documentuploader (Dynamic ECS) |
| --- | --- | --- |
| Compute duration | < 15 min | Minutes to hours per batch |
| Resource isolation | Lambda concurrency | Full network + queue isolation per case |
| Scaling unit | Lambda invocations | ECS tasks with queue-depth step scaling |
| Database connections | Connection per invocation | Connection pool per task (throttled globally) |
| Cleanup | N/A | Full stack destruction on completion |

Architecture Tree

documentuploader/
├── orchestrator/                          # Global orchestration Lambda
│   ├── index.ts                           # Main handler: stack lifecycle management
│   ├── event-factory.ts                   # SNS event construction helpers
│   └── dlq/
│       └── index.ts                       # Global DLQ handler → UPLOADER_FAILED events
├── global/
│   └── infrastructure.ts                  # Global CDK app entry (all shared stacks)
├── per-import/
│   └── infrastructure.ts                  # Per-import CDK app (synthesized → S3 template)
├── lib/
│   ├── nutrient/                          # Global shared infrastructure stacks
│   │   ├── orchestrator-stack.ts          # Orchestrator Lambda + global SQS + SNS subs
│   │   ├── ecs-stack.ts                   # Shared ECS cluster + ALB
│   │   ├── ecs-roles-stack.ts             # IAM roles for ECS tasks
│   │   ├── networking-stack.ts            # VPC, subnets, security groups
│   │   ├── database-stack.ts              # Nutrient PostgreSQL RDS
│   │   ├── poller-image-stack.ts          # ECR: Poller Docker image
│   │   ├── redirection-service-image-stack.ts  # ECR: Redirection Service image
│   │   ├── jwt-key-stack.ts               # JWT keys for Nutrient engine
│   │   ├── orchestrator-roles-stack.ts    # IAM for orchestrator Lambda
│   │   ├── import-specific-infra-roles-stack.ts # IAM for per-import resources
│   │   ├── dlq/
│   │   │   └── index.ts                   # Per-import DLQ handler Lambda
│   │   └── message-transfer-lambda/
│   │       └── index.ts                   # PDF retry → main queue transfer
│   ├── uploader/                          # Per-import nested stacks
│   │   ├── helper-stack.ts                # SQS queues, DLQ, security groups
│   │   ├── task-definition-stack.ts       # 3-container Fargate task definition
│   │   ├── ecs-service-stack.ts           # QueueProcessingFargateService + scaling
│   │   └── uploader-ecs-stack.ts          # Parent stack (composes nested stacks)
│   ├── doc_uploader_poller/               # Ruby — SQS consumer + Nutrient API caller
│   │   ├── doc_uploader_poller.rb         # Shoryuken worker: main processing loop
│   │   ├── nutrient_api.rb                # HTTP client for Nutrient engine
│   │   ├── Gemfile                        # Ruby dependencies (shoryuken, aws-sdk, etc.)
│   │   └── Dockerfile
│   ├── redirection-service/               # Node.js — S3 presigned URL generator
│   │   ├── app.js                         # Express server: presigned URLs + doc lookup
│   │   ├── package.json
│   │   └── Dockerfile
│   └── helpers/
│       └── config-helper.ts               # Environment × region config loading
├── bin/configs/
│   ├── config.json                        # Per-environment infrastructure settings
│   └── nutrient-config.json               # Nutrient engine-specific settings
├── pipeline/                              # CI/CD pipeline stacks
└── test/                                  # Jest CDK tests

Hybrid Language Stack

| Component | Language | Runtime | Purpose |
| --- | --- | --- | --- |
| Orchestrator | TypeScript | Lambda (Node.js) | Stack lifecycle, throttling, monitoring |
| DLQ Handler | TypeScript | Lambda (Node.js) | Error notification events |
| Message Transfer | TypeScript | Lambda (Node.js) | PDF retry → main queue forwarding |
| Poller | Ruby | ECS (Shoryuken) | SQS consumer, Nutrient API caller |
| DocEngine | Nutrient (binary) | ECS container | PDF processing, flattening, annotations |
| Redirection Service | Node.js | ECS (Express) | S3 presigned URL generation |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |

Why Ruby for the Poller? Shoryuken is a mature SQS processing framework with built-in visibility timeout management, receive count tracking, and concurrent worker pools — purpose-built for long-running SQS consumers on ECS.

Nutrient (PSPDFKit) Document Engine Integration

Nutrient (formerly PSPDFKit) is a commercial document processing engine that runs as a Docker container. documentuploader wraps it in a sidecar architecture where the Poller (Ruby) drives the API and the Redirection Service (Node.js) handles S3 access.

Nutrient API Operations

The NutrientApi Ruby class (nutrient_api.rb) communicates with DocEngine via localhost:5000 (same Fargate task, shared network namespace):

| Operation | Endpoint | Purpose |
| --- | --- | --- |
| Load Document | POST /api/documents | Upload a document into Nutrient's storage (multipart form-data) |
| Flatten PDF | POST /api/documents/{id}/apply_instructions | Merge all layers (annotations, form fields, signatures) into static content |

Load Document flow:

Poller receives SQS message
  → Constructs document ID: document_{caseId}_{batchId}_{documentId}_{exhibitIds}
  → Calls Redirection Service (localhost:80) to get PDF content via presigned URL
  → POST /api/documents to Nutrient with file or URL reference
  → Returns Nutrient's internal document_id

Flatten PDF flow:

After load → POST /api/documents/{id}/apply_instructions
  → Instructions: { parts: [{ document: { id: "#self" } }], actions: [{ type: "flatten" }] }
  → Timeout: 240 seconds (CPU-intensive operation)
  → Merges annotations, form fields, signatures into non-editable layer
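The flatten request can be sketched in TypeScript as follows. The instructions payload shape comes from the flow above; the function names, the timeout constant, and the auth handling are illustrative assumptions (the real caller is the Ruby Poller):

```typescript
// Shape of the apply_instructions body described above.
interface Instructions {
  parts: { document: { id: string } }[];
  actions: { type: string }[];
}

// Matches the documented payload: "#self" refers to the loaded document.
function flattenInstructions(): Instructions {
  return {
    parts: [{ document: { id: "#self" } }],
    actions: [{ type: "flatten" }],
  };
}

const FLATTEN_TIMEOUT_MS = 240_000; // matches SERVER_REQUEST_TIMEOUT

// Illustrative equivalent of the Poller's call to DocEngine on localhost:5000.
// Auth header handling depends on the engine's configuration and is omitted here.
async function flattenDocument(docId: string) {
  return fetch(`http://localhost:5000/api/documents/${docId}/apply_instructions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(flattenInstructions()),
    signal: AbortSignal.timeout(FLATTEN_TIMEOUT_MS), // CPU-intensive: long timeout
  });
}
```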

Document ID Encoding Scheme

Documents are identified by a composite ID that encodes routing metadata:

document_{caseId}_{batchId}_{documentId}_{exhibitIds}

The Redirection Service parses this to resolve the S3 path:

  - ext=pdf → case_{caseId}/processed-files/{batchId}/native/{documentId}
  - ext=generated → case_{caseId}/processed-files/{batchId}/pdf/{documentId}
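A minimal TypeScript sketch of this parsing logic. The field order comes from the documented scheme; the internal format of the exhibitIds segment is not specified here, so it is treated as opaque, and the function names are illustrative:

```typescript
interface DocumentRef {
  caseId: string;
  batchId: string;
  documentId: string;
  exhibitIds: string; // opaque trailing segment; internal format unspecified
}

// Split document_{caseId}_{batchId}_{documentId}_{exhibitIds} into its parts.
function parseDocumentId(id: string): DocumentRef {
  const parts = id.split("_");
  if (parts[0] !== "document" || parts.length < 5) {
    throw new Error(`Unrecognized document ID: ${id}`);
  }
  const [, caseId, batchId, documentId, ...rest] = parts;
  return { caseId, batchId, documentId, exhibitIds: rest.join("_") };
}

// ext=pdf resolves to the native upload; ext=generated to the generated PDF.
function s3KeyFor(ref: DocumentRef, ext: "pdf" | "generated"): string {
  const folder = ext === "pdf" ? "native" : "pdf";
  return `case_${ref.caseId}/processed-files/${ref.batchId}/${folder}/${ref.documentId}`;
}
```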

Redirection Service (Presigned URL Proxy)

Express.js sidecar (localhost:80) that provides S3 access without exposing credentials to the Nutrient engine:

| Endpoint | Purpose |
| --- | --- |
| GET /documents/{documentId}?ext=pdf\|generated | Generate presigned URL (7-day expiry), stream PDF content back |
| GET /list/{documentId} | Check if PDF exists on S3 (ListObjectsV2) |
| GET /health | Container health check |

Why a separate service? The Nutrient DocEngine fetches documents via HTTP URL. Rather than giving DocEngine direct S3 credentials, the Redirection Service generates presigned URLs and streams content. This isolates S3 access to a controlled proxy and enables the HTTP_PROXY_REMOTE_FILE_DOWNLOAD=http://localhost:80 configuration on DocEngine.

Nutrient Configuration (nutrient-config.json)

Environment × region matrix controlling DocEngine behavior:

| Setting | Purpose | Example Values |
| --- | --- | --- |
| DATABASE_CONNECTIONS | PostgreSQL pool per container | 5 (default), varies by env |
| PSPDFKIT_WORKER_POOL_SIZE | Internal worker threads | 30 |
| ASSET_STORAGE_BACKEND | Storage backend | s3 |
| ASSET_STORAGE_CACHE_SIZE | Local cache for assets | 4 GB |
| USE_REDIS_CACHE | Enable Redis for caching | true (staging/prod) |
| JWT_ALGORITHM | Auth algorithm | RS256 |
| SERVER_REQUEST_TIMEOUT | Request timeout (ms) | 240000 (matches flatten timeout) |
| ACTIVATION_KEY | Nutrient license key | Per-environment |

Error Handling in Nutrient Operations

| Scenario | Poller Behavior |
| --- | --- |
| Nutrient API returns non-200 | Re-raise → Shoryuken requeues (up to MAX_RECEIVE_COUNT=4) |
| Flatten timeout (>240s) | Log error, publish to SNS, return nil |
| PDF not found on S3 (first attempt) | Send to PDF retry queue (3-min delay) |
| PDF not found on S3 (from retry queue) | Log and skip (SilentSuccess equivalent) |
| MAX_RECEIVE_COUNT exceeded (4) | Message moves to DLQ |
| Unexpected exception | Log, call handle_errors, publish SNS event |

Pattern Mapping

Patterns Implemented

| Architecture Pattern | documentuploader Implementation | Notes |
| --- | --- | --- |
| SNS Event Publishing | event-factory.ts — typed event builders | TypeScript instead of Python EventType enum |
| SQS Handler | Orchestrator Lambda processes from global SQS | Standard parse → route → handle flow |
| Exception Hierarchy | Stack status → event type mapping | CF status codes replace Python exception types |
| Retry & Resilience | Re-enqueue with delay for lifecycle monitoring | 900s delay for throttled re-checks |
| Job Processor Orchestration | Orchestrator manages per-import stack lifecycle | Full CF stacks instead of SQS/DLQ/Lambda sets |
| Idempotent Handlers | SSM parameter store tracks stack state | Prevents duplicate stack creation |
| Config Management | config.json + nutrient-config.json matrix | Environment × region, same as standard |
| CDK Infrastructure | Global stacks + synthesized per-import template | Two-tier: global (deployed) + per-import (dynamic) |
| ECS Long-Running Workloads | Multi-container Fargate tasks | Sidecar pattern with localhost communication |
| Structured Logging | CloudWatch log groups per component | Per-import log streams for isolation |

Patterns NOT Used

| Pattern | Why Not |
| --- | --- |
| Checkpoint Pipeline | No multi-step document processing — single upload action per message |
| Database Session (MySQL) | Uses Nutrient's PostgreSQL, not multi-tenant MySQL |
| ORM Models | No SQLAlchemy — Nutrient engine manages its own database |
| Elasticsearch Bulk Indexing | No search indexing in upload phase |
| Concurrent Write Safety | No shared MySQL writes — isolated per-import queues |
| Chunk Dispatch | Documents dispatched individually to per-import queues |

Event Flow

Inbound Events (Orchestrator Consumes)

| Event | Source | Action |
| --- | --- | --- |
| JOB_STARTED | Job Processor | Check connection pool → create CF stack or throttle |
| LOADER_FINISHED (COMPLETED/SUCCESS) | documentloader | Check queues empty → delete stack |
| JOB_FINISHED (COMPLETED) | Job Processor | Enable PDF retry message transfer Lambda |
| IMPORT_CANCELLED | User action | Cancel/delete uploader stack |
| UPLOADER_THROTTLED | Self (re-enqueued) | Retry stack creation after delay |
| UPLOADER_CREATE_INITIATED | Self (re-enqueued) | Poll CF stack creation status |
| UPLOADER_DELETE_INITIATED | Self (re-enqueued) | Poll CF stack deletion status |
| UPLOADER_CANCEL_INITIATED | Self (re-enqueued) | Poll CF stack cancellation status |
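The orchestrator's routing step for these inbound events can be sketched as a dispatch map. This is a hypothetical simplification; the handler names and the `route` helper are illustrative, and only the event types come from the table above:

```typescript
type Handler = (detail: Record<string, unknown>) => Promise<void> | void;

// One handler per inbound event type; bodies elided here.
const routes: Record<string, Handler> = {
  JOB_STARTED: () => { /* check connection pool → create stack or throttle */ },
  LOADER_FINISHED: () => { /* check queues empty → delete stack */ },
  JOB_FINISHED: () => { /* enable PDF retry message-transfer Lambda */ },
  IMPORT_CANCELLED: () => { /* cancel/delete uploader stack */ },
  UPLOADER_THROTTLED: () => { /* retry stack creation after delay */ },
  UPLOADER_CREATE_INITIATED: () => { /* poll CF creation status */ },
  UPLOADER_DELETE_INITIATED: () => { /* poll CF deletion status */ },
  UPLOADER_CANCEL_INITIATED: () => { /* poll CF cancellation status */ },
};

// Parse → route → handle: unknown event types fail loudly into the DLQ path.
async function route(eventType: string, detail: Record<string, unknown>): Promise<void> {
  const handler = routes[eventType];
  if (!handler) throw new Error(`Unhandled event type: ${eventType}`);
  await handler(detail);
}
```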

Outbound Events (Orchestrator Publishes)

| Event | Trigger | Downstream Consumer |
| --- | --- | --- |
| UPLOADER_THROTTLED | Active connections > MAX_ALLOWED | Self (delayed re-enqueue) |
| UPLOADER_CREATE_INITIATED | CF CreateStack called | Self (status polling) |
| UPLOADER_STARTED | CF stack CREATE_COMPLETE | Job Processor |
| UPLOADER_DELETE_INITIATED | CF DeleteStack called | Self (status polling) |
| UPLOADER_FINISHED | CF stack DELETE_COMPLETE | Job Processor |
| UPLOADER_CANCELLED | CF stack deleted after cancel | Job Processor |
| UPLOADER_CANCEL_INITIATED | Cancel in progress | Self (status polling) |
| UPLOADER_FAILED | Stack creation failed after retries | Job Processor |
| UPLOADER_WARNING | Non-fatal error during orchestration | Monitoring |
| DOCUMENT_UPLOADED | Poller: document processed | documentextractor |

Self-Referencing Event Loop

The orchestrator uses SNS → SQS → self as a polling mechanism for CloudFormation stack status, instead of sleeping or using Step Functions wait states:

Orchestrator publishes UPLOADER_CREATE_INITIATED
  → SNS routes back to orchestrator's SQS queue
    → Orchestrator re-processes: checks CF stack status
      → If CREATE_IN_PROGRESS: re-publish UPLOADER_CREATE_INITIATED (poll again)
      → If CREATE_COMPLETE: publish UPLOADER_STARTED (done)
      → If ROLLBACK_COMPLETE: retry or publish UPLOADER_FAILED

This pattern avoids Step Functions costs while maintaining event-driven asynchronous polling. The SQS visibility timeout provides natural delay between status checks.

Key Design Decisions

1. Dynamic CloudFormation Stack Provisioning

Decision: Create and destroy full CloudFormation stacks per case/batch at runtime instead of using shared infrastructure.

Why:

  - Resource isolation — each case gets its own SQS queues, ECS tasks, and network security groups; no cross-case interference
  - Connection pool management — Nutrient's PostgreSQL has a hard connection limit; per-import stacks enable precise connection accounting
  - Clean teardown — deleting the CF stack guarantees all resources are removed; no orphaned queues or tasks
  - Independent scaling — each case scales based on its own queue depth

Trade-off: Stack creation takes 3-5 minutes (CF provisioning latency). Mitigated by the orchestrator's self-polling event loop.

2. Three-Container Sidecar Architecture

Decision: Single Fargate task runs three containers communicating via localhost:

┌──────────────────────────────────────────────┐
│  Fargate Task (shared network namespace)     │
│                                              │
│  ┌───────────┐  localhost:5000  ┌──────────┐ │
│  │  Poller   │ ───────────────→ │ DocEngine│ │
│  │  (Ruby)   │                  │(Nutrient)│ │
│  └─────┬─────┘                  └──────────┘ │
│        │ localhost:80                        │
│        ▼                                     │
│  ┌──────────────────┐                        │
│  │ Redirection Svc  │ ──→ S3 Presigned URLs  │
│  │ (Node.js)        │                        │
│  └──────────────────┘                        │
└──────────────────────────────────────────────┘

Why:

  - DocEngine requires co-located access (low latency for API calls)
  - Redirection Service generates presigned URLs without embedding S3 credentials in the Poller or DocEngine
  - All three scale together as a unit (queue depth drives task count)

3. SSM Parameter Store for Throttle State

Decision: Track per-stack throttle state in SSM Parameters (/nge/docUploader/{stackName}) rather than DynamoDB or in-memory.

Why:

  - Survives Lambda cold starts (no in-memory state loss)
  - Simple key-value read/write (no table schema needed)
  - Built-in TTL not needed — explicit cleanup on stack deletion
  - Low-volume writes (only on throttle/unthrottle events)
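A sketch of how this parameter doubles as an idempotency guard. The parameter path comes from the text; the `ParamStore` interface, state shape, and function names are illustrative assumptions standing in for the SSM SDK calls:

```typescript
// Parameter path documented above: /nge/docUploader/{stackName}
function throttleParamName(stackName: string): string {
  return `/nge/docUploader/${stackName}`;
}

// Minimal abstraction over SSM GetParameter/PutParameter for this sketch.
interface ParamStore {
  get(name: string): string | undefined;
  put(name: string, value: string): void;
}

// Idempotent-create guard: if state already exists for this stack, a duplicate
// JOB_STARTED event is ignored instead of creating a second stack.
function shouldCreateStack(store: ParamStore, stackName: string): boolean {
  const name = throttleParamName(stackName);
  if (store.get(name) !== undefined) return false;
  store.put(name, JSON.stringify({ status: "CREATE_INITIATED", at: Date.now() }));
  return true;
}
```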

4. PDF Retry Queue Pattern

Decision: Three-queue architecture per import: main queue, DLQ, and PDF retry queue.

Main Queue ←──────────── Message Transfer Lambda (enabled after JOB_FINISHED)
    │                              ↑
    ▼                              │
  Poller                    PDF Retry Queue
    │                              ↑
    ├── Success → SNS              │
    ├── Failure → DLQ              │
    └── PDF not ready ────────────→┘  (3-min visibility timeout)

Why: PDFs may not be generated on S3 yet when the upload message arrives. Rather than failing and consuming DLQ retries, the Poller sends the message to a dedicated retry queue with a delay. The Message Transfer Lambda is initially disabled and only enabled after JOB_FINISHED (when all PDFs are guaranteed to exist), forwarding retry messages back to the main queue for reprocessing.
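The Poller's routing decision (production code is Ruby) reduces to a small pure function. A sketch, with illustrative names; the three outcomes correspond to the error-handling table earlier in this document:

```typescript
type Disposition = "process" | "sendToPdfRetryQueue" | "silentSuccess";

// pdfExistsOnS3: result of the Redirection Service's /list check.
// fromRetryQueue: whether this message arrived via the PDF retry queue.
function disposition(pdfExistsOnS3: boolean, fromRetryQueue: boolean): Disposition {
  if (pdfExistsOnS3) return "process";
  // First attempt: PDF may simply not be generated yet → delayed retry.
  if (!fromRetryQueue) return "sendToPdfRetryQueue";
  // Already retried after JOB_FINISHED: the PDF genuinely doesn't exist.
  return "silentSuccess"; // log and skip, don't burn DLQ retries
}
```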

5. Connection Pool Throttling

Decision: Global throttling based on total ECS connection count vs MAX_ALLOWED_CONNECTIONS.

active_connections = running_tasks × PER_TASK_CONNECTION
if active_connections > MAX_ALLOWED_CONNECTIONS:
    publish UPLOADER_THROTTLED (re-enqueue with 900s delay)
else:
    create per-import stack

Why: Nutrient's PostgreSQL RDS has a hard connection limit. Each ECS task opens PER_TASK_CONNECTION connections (e.g., 10). The orchestrator checks the ECS cluster's running task count before creating new stacks, preventing database connection exhaustion across all active imports.
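The arithmetic above, translated from the pseudocode into TypeScript (the constants are environment variables on the real orchestrator; the function name is illustrative):

```typescript
// True when creating another stack would risk exhausting the shared
// PostgreSQL connection pool, i.e. the import should be throttled.
function shouldThrottle(
  runningTasks: number,
  perTaskConnections: number, // PER_TASK_CONNECTION
  maxAllowed: number,         // MAX_ALLOWED_CONNECTIONS
): boolean {
  return runningTasks * perTaskConnections > maxAllowed;
}
```

On throttle, the orchestrator re-enqueues UPLOADER_THROTTLED with a 900-second delay rather than failing the import.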

6. Queue-Depth Step Scaling

Decision: ECS auto-scaling based on SQS visible message count with step thresholds, not CPU/memory.

| Queue Depth | Scaling Action |
| --- | --- |
| > 5000 messages | Maximum capacity |
| > 2000 messages | High capacity |
| > 500 messages | Medium capacity |
| > 200 messages | Low capacity |
| < 200 messages | Minimum (1 task) |

Why: PDF processing is I/O-bound (S3 reads, API calls), not CPU-bound. Queue depth directly correlates with work remaining, making it a better scaling signal than resource utilization.
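The step selection can be sketched as a pure function. Only the thresholds come from the table above; the intermediate tier capacities (75%, 50%, 25% of max) are placeholder assumptions, since the text names the tiers but not their sizes:

```typescript
// Map SQS visible-message count to a desired ECS task count.
function desiredCapacity(visibleMessages: number, maxCapacity: number): number {
  if (visibleMessages > 5000) return maxCapacity;                    // maximum
  if (visibleMessages > 2000) return Math.ceil(maxCapacity * 0.75);  // high (assumed)
  if (visibleMessages > 500) return Math.ceil(maxCapacity * 0.5);    // medium (assumed)
  if (visibleMessages > 200) return Math.ceil(maxCapacity * 0.25);   // low (assumed)
  return 1;                                                          // minimum
}
```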

Divergences from Standard Architecture

| Standard Pattern | documentuploader Divergence | Reason |
| --- | --- | --- |
| Python 3.10+ | Ruby (Poller) + TypeScript (Orchestrator) + Node.js (Redirection) | Shoryuken for SQS processing; Nutrient SDK integration |
| Lambda handlers | ECS Fargate tasks (Poller + DocEngine + Redirection) | Long-running compute, sidecar requirements |
| Hexagonal core/shell | No core/shell split — orchestrator is infrastructure-only | No domain logic; pure infrastructure lifecycle management |
| Multi-tenant MySQL | Nutrient PostgreSQL (single shared instance) | Nutrient engine manages its own database |
| SQLAlchemy ORM | No ORM — Nutrient engine handles persistence | Document storage is Nutrient's responsibility |
| SQS → Lambda integration | SQS → Shoryuken (Ruby) on ECS | Long-running consumer with connection pooling |
| Static infrastructure | Dynamic CF stack creation/deletion per case/batch | Resource isolation and connection pool management |
| Exception hierarchy (Python) | CF stack status codes drive event routing | TypeScript orchestrator maps CF statuses to events |
| Checkpoint pipeline | N/A — single-action processing per message | Each message is one document upload, no multi-step |
| SNS filter policies | SNS filter on orchestrator queue + per-import queue isolation | Two-tier: global routing + per-import physical separation |

Configuration Model

Two config files (environment × region matrix):

| File | Purpose | Example Keys |
| --- | --- | --- |
| config.json | Infrastructure settings | VPC IDs, subnets, accounts, RDS endpoints |
| nutrient-config.json | Nutrient engine settings | Connection limits, worker pools, API tokens |

Runtime environment variables by component:

| Component | Key Variables |
| --- | --- |
| Orchestrator | MAX_ALLOWED_CONNECTIONS, PER_TASK_CONNECTION, TEMPLATE_BUCKET, RE_ENQUEUE_DELAY_IN_SECONDS |
| Poller | PER_CASE_DOC_UPLOADER_QUEUE_NAME, PDF_RETRY_QUEUE_PREFIX, REDIRECTION_SERVICE_ENDPOINT |
| DocEngine | PGHOST, DATABASE_CONNECTIONS, ASSET_STORAGE_S3_BUCKET, ACTIVATION_KEY |
| Redirection Svc | BUCKET_NAME, PORT, AWS_REGION |

CI/CD Pipeline Architecture

Three separate CodePipeline/CodeBuild pipelines:

| Pipeline | Purpose | Trigger |
| --- | --- | --- |
| Per-Import Pipeline | Synthesizes per-import CDK → CF template, uploads to S3 | Code push to per-import stacks |
| SOCI Pipeline | Builds Seekable OCI (SOCI) index for DocEngine image | DocEngine image push |
| DocEngine Image Push | Pushes pre-built Nutrient DocEngine image to ECR | Manual / release |

SOCI (Seekable OCI): Enables lazy-loading of container image layers so ECS tasks start faster — critical when dynamic stacks spin up new tasks and image pull time directly impacts batch processing latency.

Multi-Region Deployment

Region prefix naming convention:

| Region | Prefix | Example Stack |
| --- | --- | --- |
| us-east-1 | c2 | c2-staging-UploaderStack-123-456 |
| us-west-1 | c4 | c4-production-UploaderStack-123-456 |
| ca-central-1 | c5 | c5-production-UploaderStack-123-456 |

Each region has independent: VPC, subnets, RDS instance, Redis cluster, S3 bucket for assets, ACM certificate, and max connection thresholds.
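The naming convention can be sketched as a lookup plus a formatter. The prefixes and the example stack-name shape come from the table above; the function names, and the assumption that the trailing numbers are caseId and batchId, are illustrative:

```typescript
// Region → stack-name prefix, per the table above.
const REGION_PREFIX: Record<string, string> = {
  "us-east-1": "c2",
  "us-west-1": "c4",
  "ca-central-1": "c5",
};

// Build a per-import stack name like c2-staging-UploaderStack-123-456.
// The two trailing identifiers are assumed here to be caseId and batchId.
function uploaderStackName(region: string, env: string, caseId: string, batchId: string): string {
  const prefix = REGION_PREFIX[region];
  if (!prefix) throw new Error(`No prefix configured for region ${region}`);
  return `${prefix}-${env}-UploaderStack-${caseId}-${batchId}`;
}
```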

Dynamic Worker Scaling

The orchestrator supports event-driven worker limit overrides:

JOB_STARTED event
  → eventDetail.workerLimit = 50 (from job processor)
  → Orchestrator: newCapacity = Math.ceil(50 / PER_TASK_CONNECTION)
  → Updates Application Auto Scaling maxCapacity
  → Adjusts step scaling policy if workerLimit > default maxWorkers

This allows the job processor to request more (or fewer) ECS tasks per import based on batch size or priority, overriding the default scaling configuration.
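The override arithmetic above as a small helper (a sketch; the function name and the handling of a missing workerLimit are assumptions):

```typescript
// Derive the Application Auto Scaling maxCapacity from an optional
// workerLimit carried on the JOB_STARTED event detail.
function overrideMaxCapacity(
  workerLimit: number | undefined,
  perTaskConnections: number, // PER_TASK_CONNECTION
  defaultMaxCapacity: number,
): number {
  if (workerLimit === undefined) return defaultMaxCapacity;
  return Math.ceil(workerLimit / perTaskConnections);
}
```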

Stack Cleanup on Deletion

When a per-import stack is deleted, the orchestrator performs explicit cleanup:

  1. Unsubscribe per-import queue from SNS topic (ARN stored in queue tags)
  2. Disable Message Transfer Lambda event source mapping
  3. Delete per-import SQS queues (main, DLQ, PDF retry)
  4. Delete SSM parameter store entry
  5. Delete CloudFormation stack (removes ECS tasks, security groups, log config)

This ensures no orphaned resources remain after batch processing completes.

Container Build Patterns

Both custom containers use Alpine-based minimal images:

Ruby Poller (multi-stage build):

Stage 1 (builder): ruby:3.2.2-alpine + build-base → compile native gems
Stage 2 (runner):  ruby:3.2.2-alpine (minimal) → copy compiled gems
Entry: bundle exec shoryuken -q {QUEUE} -r ./doc_uploader_poller.rb -c {CONCURRENCY}

Redirection Service (single stage):

node:20-alpine → npm ci --omit=dev → node app.js

Poller concurrency is configurable via pollerConcurrency config (default: 1), controlling Shoryuken worker threads per ECS task.

Observability

Logging Strategy

Toggle between centralized and per-import log groups:

| Mode | Log Group Pattern | Use Case |
| --- | --- | --- |
| Centralized | /nge/{region}-{env}/docUploader/ecs/docEngine/{caseId}/{batchId} | Production: per-case log isolation |
| Local | /nge/{region}-{env}/docUploader/ecs | Dev/test: simplified debugging |

Controlled by useCentralizedLogging config flag.

Structured Logging

All components emit structured JSON logs with consistent context:

{
  "level": "INFO",
  "message": "Document uploaded successfully",
  "caseId": "123",
  "batchId": "456",
  "jobId": "789",
  "documentId": "doc_001",
  "exhibitIds": "[1,2,3]"
}

The Ruby Poller, Node.js Redirection Service, and TypeScript Orchestrator all follow the same field naming convention for cross-component log correlation.
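A minimal helper producing that shape, as the TypeScript Orchestrator might implement it (the function and interface names are illustrative; only the field names come from the convention above):

```typescript
// Shared context fields used across Poller, Redirection Service, and Orchestrator.
interface LogContext {
  caseId: string;
  batchId: string;
  jobId?: string;
  documentId?: string;
  exhibitIds?: string;
}

// Emit one structured JSON log line with consistent field naming.
function logLine(level: "INFO" | "WARN" | "ERROR", message: string, ctx: LogContext): string {
  return JSON.stringify({ level, message, ...ctx });
}
```

Keeping the field names identical across all three runtimes is what makes cross-component correlation (e.g. filtering a CloudWatch Insights query on caseId and batchId) possible.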

Lessons Learned

  1. Self-referencing SNS→SQS polling is cost-effective — avoids Step Functions for simple status polling; SQS visibility timeout provides natural backoff.

  2. Dynamic stacks add latency but guarantee isolation — 3-5 minute CF creation time is acceptable for batch imports that run hours; isolation prevents noisy-neighbor problems across cases.

  3. Connection pool accounting is critical — without global throttling, a burst of concurrent imports can exhaust the database connection pool, causing cascading failures across all active imports.

  4. Disabled-by-default Lambdas are a useful coordination primitive — the Message Transfer Lambda starts disabled and is enabled via API call when upstream processing completes, avoiding premature retry of PDFs.

  5. Three-queue pattern separates concerns — main queue for active work, DLQ for permanent failures, retry queue for timing-dependent retries. Each queue has different visibility timeouts and consumer patterns.

  6. Ruby Shoryuken excels for long-running SQS consumers — built-in receive count tracking, visibility timeout extension, and concurrent worker pools make it ideal for ECS-based queue processing.

  7. SOCI indexes reduce ECS task startup time — lazy-loading container layers matters when dynamic stacks create tasks on demand and every minute of CF + image pull latency delays batch processing.

  8. Explicit cleanup prevents resource leaks — deleting a CF stack removes ECS resources but not SQS queues or SSM parameters; the orchestrator must clean these up separately.
