
Reference Implementation: documentuploader

Document upload and PDF processing service using Nutrient (PSPDFKit) engine. Dynamically provisions per-case/batch ECS infrastructure via CloudFormation, processes documents through a multi-container sidecar architecture, and tears down resources when complete.

Architecture Overview

documentuploader is not a standard Lambda + SQS service module. It is a dynamic infrastructure orchestrator that creates and destroys full CloudFormation stacks per case/batch at runtime. This is driven by the compute requirements of the Nutrient (PSPDFKit) document processing engine — PDF flattening, rendering, and annotation processing require persistent ECS Fargate tasks with sidecar containers, not ephemeral Lambda invocations.

Why Dynamic Stacks?

| Concern | Standard Module (Lambda) | documentuploader (Dynamic ECS) |
| --- | --- | --- |
| Compute duration | < 15 min | Minutes to hours per batch |
| Resource isolation | Lambda concurrency | Full network + queue isolation per case |
| Scaling unit | Lambda invocations | ECS tasks with queue-depth step scaling |
| Database connections | Connection per invocation | Connection pool per task (throttled globally) |
| Cleanup | N/A | Full stack destruction on completion |

Architecture Tree

documentuploader/
├── orchestrator/                          # Global orchestration Lambda
│   ├── index.ts                           # Main handler: stack lifecycle management
│   ├── event-factory.ts                   # SNS event construction helpers
│   └── dlq/
│       └── index.ts                       # Global DLQ handler → UPLOADER_FAILED events
├── global/
│   └── infrastructure.ts                  # Global CDK app entry (all shared stacks)
├── per-import/
│   └── infrastructure.ts                  # Per-import CDK app (synthesized → S3 template)
├── lib/
│   ├── nutrient/                          # Global shared infrastructure stacks
│   │   ├── orchestrator-stack.ts          # Orchestrator Lambda + global SQS + SNS subs
│   │   ├── ecs-stack.ts                   # Shared ECS cluster + ALB
│   │   ├── ecs-roles-stack.ts             # IAM roles for ECS tasks
│   │   ├── networking-stack.ts            # VPC, subnets, security groups
│   │   ├── database-stack.ts              # Nutrient PostgreSQL RDS
│   │   ├── poller-image-stack.ts          # ECR: Poller Docker image
│   │   ├── redirection-service-image-stack.ts  # ECR: Redirection Service image
│   │   ├── jwt-key-stack.ts               # JWT keys for Nutrient engine
│   │   ├── orchestrator-roles-stack.ts    # IAM for orchestrator Lambda
│   │   ├── import-specific-infra-roles-stack.ts # IAM for per-import resources
│   │   ├── dlq/
│   │   │   └── index.ts                   # Per-import DLQ handler Lambda
│   │   └── message-transfer-lambda/
│   │       └── index.ts                   # PDF retry → main queue transfer
│   ├── uploader/                          # Per-import nested stacks
│   │   ├── helper-stack.ts                # SQS queues, DLQ, security groups
│   │   ├── task-definition-stack.ts       # 3-container Fargate task definition
│   │   ├── ecs-service-stack.ts           # QueueProcessingFargateService + scaling
│   │   └── uploader-ecs-stack.ts          # Parent stack (composes nested stacks)
│   ├── doc_uploader_poller/               # Ruby — SQS consumer + Nutrient API caller
│   │   ├── doc_uploader_poller.rb         # Shoryuken worker: main processing loop
│   │   ├── nutrient_api.rb                # HTTP client for Nutrient engine
│   │   ├── Gemfile                        # Ruby dependencies (shoryuken, aws-sdk, etc.)
│   │   └── Dockerfile
│   ├── redirection-service/               # Node.js — S3 presigned URL generator
│   │   ├── app.js                         # Express server: presigned URLs + doc lookup
│   │   ├── package.json
│   │   └── Dockerfile
│   └── helpers/
│       └── config-helper.ts               # Environment × region config loading
├── bin/configs/
│   ├── config.json                        # Per-environment infrastructure settings
│   └── nutrient-config.json               # Nutrient engine-specific settings
├── pipeline/                              # CI/CD pipeline stacks
└── test/                                  # Jest CDK tests

Hybrid Language Stack

| Component | Language | Runtime | Purpose |
| --- | --- | --- | --- |
| Orchestrator | TypeScript | Lambda (Node.js) | Stack lifecycle, throttling, monitoring |
| DLQ Handler | TypeScript | Lambda (Node.js) | Error notification events |
| Message Transfer | TypeScript | Lambda (Node.js) | PDF retry → main queue forwarding |
| Poller | Ruby | ECS (Shoryuken) | SQS consumer, Nutrient API caller |
| DocEngine | Nutrient (binary) | ECS container | PDF processing, flattening, annotations |
| Redirection Service | Node.js | ECS (Express) | S3 presigned URL generation |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |

Why Ruby for the Poller? Shoryuken is a mature SQS processing framework with built-in visibility timeout management, receive count tracking, and concurrent worker pools — purpose-built for long-running SQS consumers on ECS.

Nutrient (PSPDFKit) Document Engine Integration

Nutrient (formerly PSPDFKit) is a commercial document processing engine that runs as a Docker container. documentuploader wraps it in a sidecar architecture where the Poller (Ruby) drives the API and the Redirection Service (Node.js) handles S3 access.

Nutrient API Operations

The NutrientApi Ruby class (nutrient_api.rb) communicates with DocEngine via localhost:5000 (same Fargate task, shared network namespace):

| Operation | Endpoint | Purpose |
| --- | --- | --- |
| Load Document | POST /api/documents | Upload a document into Nutrient's storage (multipart form-data) |
| Flatten PDF | POST /api/documents/{id}/apply_instructions | Merge all layers (annotations, form fields, signatures) into static content |

Load Document flow:

Poller receives SQS message
  → Constructs document ID: document_{caseId}_{batchId}_{documentId}_{exhibitIds}
  → Calls Redirection Service (localhost:80) to get PDF content via presigned URL
  → POST /api/documents to Nutrient with file or URL reference
  → Returns Nutrient's internal document_id

Flatten PDF flow:

After load → POST /api/documents/{id}/apply_instructions
  → Instructions: { parts: [{ document: { id: "#self" } }], actions: [{ type: "flatten" }] }
  → Timeout: 240 seconds (CPU-intensive operation)
  → Merges annotations, form fields, signatures into non-editable layer
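The flatten request can be sketched in TypeScript as follows. The instructions payload shape comes from the flow above; the function names, the timeout constant, and the auth handling are illustrative assumptions (the real caller is the Ruby Poller):

```typescript
// Shape of the apply_instructions body described above.
interface Instructions {
  parts: { document: { id: string } }[];
  actions: { type: string }[];
}

// Matches the documented payload: "#self" refers to the loaded document.
function flattenInstructions(): Instructions {
  return {
    parts: [{ document: { id: "#self" } }],
    actions: [{ type: "flatten" }],
  };
}

const FLATTEN_TIMEOUT_MS = 240_000; // matches SERVER_REQUEST_TIMEOUT

// Illustrative equivalent of the Poller's call to DocEngine on localhost:5000.
// Auth header handling depends on the engine's configuration and is omitted here.
async function flattenDocument(docId: string) {
  return fetch(`http://localhost:5000/api/documents/${docId}/apply_instructions`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(flattenInstructions()),
    signal: AbortSignal.timeout(FLATTEN_TIMEOUT_MS), // CPU-intensive: long timeout
  });
}
```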

Document ID Encoding Scheme

Documents are identified by a composite ID that encodes routing metadata:

document_{caseId}_{batchId}_{documentId}_{exhibitIds}

The Redirection Service parses this to resolve the S3 path:

  - ext=pdf → case_{caseId}/processed-files/{batchId}/native/{documentId}
  - ext=generated → case_{caseId}/processed-files/{batchId}/pdf/{documentId}
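A minimal TypeScript sketch of this parsing logic. The field order comes from the documented scheme; the internal format of the exhibitIds segment is not specified here, so it is treated as opaque, and the function names are illustrative:

```typescript
interface DocumentRef {
  caseId: string;
  batchId: string;
  documentId: string;
  exhibitIds: string; // opaque trailing segment; internal format unspecified
}

// Split document_{caseId}_{batchId}_{documentId}_{exhibitIds} into its parts.
function parseDocumentId(id: string): DocumentRef {
  const parts = id.split("_");
  if (parts[0] !== "document" || parts.length < 5) {
    throw new Error(`Unrecognized document ID: ${id}`);
  }
  const [, caseId, batchId, documentId, ...rest] = parts;
  return { caseId, batchId, documentId, exhibitIds: rest.join("_") };
}

// ext=pdf resolves to the native upload; ext=generated to the generated PDF.
function s3KeyFor(ref: DocumentRef, ext: "pdf" | "generated"): string {
  const folder = ext === "pdf" ? "native" : "pdf";
  return `case_${ref.caseId}/processed-files/${ref.batchId}/${folder}/${ref.documentId}`;
}
```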

Redirection Service (Presigned URL Proxy)

Express.js sidecar (localhost:80) that provides S3 access without exposing credentials to the Nutrient engine:

| Endpoint | Purpose |
| --- | --- |
| GET /documents/{documentId}?ext=pdf\|generated | Generate presigned URL (7-day expiry), stream PDF content back |
| GET /list/{documentId} | Check if PDF exists on S3 (ListObjectsV2) |
| GET /health | Container health check |

Why a separate service? The Nutrient DocEngine fetches documents via HTTP URL. Rather than giving DocEngine direct S3 credentials, the Redirection Service generates presigned URLs and streams content. This isolates S3 access to a controlled proxy and enables the HTTP_PROXY_REMOTE_FILE_DOWNLOAD=http://localhost:80 configuration on DocEngine.

Nutrient Configuration (nutrient-config.json)

Environment × region matrix controlling DocEngine behavior:

| Setting | Purpose | Example Values |
| --- | --- | --- |
| DATABASE_CONNECTIONS | PostgreSQL pool per container | 5 (default), varies by env |
| PSPDFKIT_WORKER_POOL_SIZE | Internal worker threads | 30 |
| ASSET_STORAGE_BACKEND | Storage backend | s3 |
| ASSET_STORAGE_CACHE_SIZE | Local cache for assets | 4 GB |
| USE_REDIS_CACHE | Enable Redis for caching | true (staging/prod) |
| JWT_ALGORITHM | Auth algorithm | RS256 |
| SERVER_REQUEST_TIMEOUT | Request timeout (ms) | 240000 (matches flatten timeout) |
| ACTIVATION_KEY | Nutrient license key | Per-environment |

Error Handling in Nutrient Operations

| Scenario | Poller Behavior |
| --- | --- |
| Nutrient API returns non-200 | Re-raise → Shoryuken requeues (up to MAX_RECEIVE_COUNT=4) |
| Flatten timeout (>240s) | Log error, publish to SNS, return nil |
| PDF not found on S3 (first attempt) | Send to PDF retry queue (3-min delay) |
| PDF not found on S3 (from retry queue) | Log and skip (SilentSuccess equivalent) |
| MAX_RECEIVE_COUNT exceeded (4) | Message moves to DLQ |
| Unexpected exception | Log, call handle_errors, publish SNS event |

Pattern Mapping

Patterns Implemented

| Architecture Pattern | documentuploader Implementation | Notes |
| --- | --- | --- |
| SNS Event Publishing | event-factory.ts — typed event builders | TypeScript instead of Python EventType enum |
| SQS Handler | Orchestrator Lambda processes from global SQS | Standard parse → route → handle flow |
| Exception Hierarchy | Stack status → event type mapping | CF status codes replace Python exception types |
| Retry & Resilience | Re-enqueue with delay for lifecycle monitoring | 900s delay for throttled re-checks |
| Job Processor Orchestration | Orchestrator manages per-import stack lifecycle | Full CF stacks instead of SQS/DLQ/Lambda sets |
| Idempotent Handlers | SSM parameter store tracks stack state | Prevents duplicate stack creation |
| Config Management | config.json + nutrient-config.json matrix | Environment × region, same as standard |
| CDK Infrastructure | Global stacks + synthesized per-import template | Two-tier: global (deployed) + per-import (dynamic) |
| ECS Long-Running Workloads | Multi-container Fargate tasks | Sidecar pattern with localhost communication |
| Structured Logging | CloudWatch log groups per component | Per-import log streams for isolation |

Patterns NOT Used

| Pattern | Why Not |
| --- | --- |
| Checkpoint Pipeline | No multi-step document processing — single upload action per message |
| Database Session (MySQL) | Uses Nutrient's PostgreSQL, not multi-tenant MySQL |
| ORM Models | No SQLAlchemy — Nutrient engine manages its own database |
| Elasticsearch Bulk Indexing | No search indexing in upload phase |
| Concurrent Write Safety | No shared MySQL writes — isolated per-import queues |
| Chunk Dispatch | Documents dispatched individually to per-import queues |

Event Flow

Inbound Events (Orchestrator Consumes)

| Event | Source | Action |
| --- | --- | --- |
| JOB_STARTED | Job Processor | Check connection pool → create CF stack or throttle |
| LOADER_FINISHED (COMPLETED/SUCCESS) | documentloader | Check queues empty → delete stack |
| JOB_FINISHED (COMPLETED) | Job Processor | Enable PDF retry message transfer Lambda |
| IMPORT_CANCELLED | User action | Cancel/delete uploader stack |
| UPLOADER_THROTTLED | Self (re-enqueued) | Retry stack creation after delay |
| UPLOADER_CREATE_INITIATED | Self (re-enqueued) | Poll CF stack creation status |
| UPLOADER_DELETE_INITIATED | Self (re-enqueued) | Poll CF stack deletion status |
| UPLOADER_CANCEL_INITIATED | Self (re-enqueued) | Poll CF stack cancellation status |
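The orchestrator's routing step for these inbound events can be sketched as a dispatch map. This is a hypothetical simplification; the handler names and the `route` helper are illustrative, and only the event types come from the table above:

```typescript
type Handler = (detail: Record<string, unknown>) => Promise<void> | void;

// One handler per inbound event type; bodies elided here.
const routes: Record<string, Handler> = {
  JOB_STARTED: () => { /* check connection pool → create stack or throttle */ },
  LOADER_FINISHED: () => { /* check queues empty → delete stack */ },
  JOB_FINISHED: () => { /* enable PDF retry message-transfer Lambda */ },
  IMPORT_CANCELLED: () => { /* cancel/delete uploader stack */ },
  UPLOADER_THROTTLED: () => { /* retry stack creation after delay */ },
  UPLOADER_CREATE_INITIATED: () => { /* poll CF creation status */ },
  UPLOADER_DELETE_INITIATED: () => { /* poll CF deletion status */ },
  UPLOADER_CANCEL_INITIATED: () => { /* poll CF cancellation status */ },
};

// Parse → route → handle: unknown event types fail loudly into the DLQ path.
async function route(eventType: string, detail: Record<string, unknown>): Promise<void> {
  const handler = routes[eventType];
  if (!handler) throw new Error(`Unhandled event type: ${eventType}`);
  await handler(detail);
}
```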

Outbound Events (Orchestrator Publishes)

| Event | Trigger | Downstream Consumer |
| --- | --- | --- |
| UPLOADER_THROTTLED | Active connections > MAX_ALLOWED | Self (delayed re-enqueue) |
| UPLOADER_CREATE_INITIATED | CF CreateStack called | Self (status polling) |
| UPLOADER_STARTED | CF stack CREATE_COMPLETE | Job Processor |
| UPLOADER_DELETE_INITIATED | CF DeleteStack called | Self (status polling) |
| UPLOADER_FINISHED | CF stack DELETE_COMPLETE | Job Processor |
| UPLOADER_CANCELLED | CF stack deleted after cancel | Job Processor |
| UPLOADER_CANCEL_INITIATED | Cancel in progress | Self (status polling) |
| UPLOADER_FAILED | Stack creation failed after retries | Job Processor |
| UPLOADER_WARNING | Non-fatal error during orchestration | Monitoring |
| DOCUMENT_UPLOADED | Poller: document processed | documentextractor |

Self-Referencing Event Loop

The orchestrator uses SNS → SQS → self as a polling mechanism for CloudFormation stack status, instead of sleeping or using Step Functions wait states:

Orchestrator publishes UPLOADER_CREATE_INITIATED
  → SNS routes back to orchestrator's SQS queue
    → Orchestrator re-processes: checks CF stack status
      → If CREATE_IN_PROGRESS: re-publish UPLOADER_CREATE_INITIATED (poll again)
      → If CREATE_COMPLETE: publish UPLOADER_STARTED (done)
      → If ROLLBACK_COMPLETE: retry or publish UPLOADER_FAILED

This pattern avoids Step Functions costs while maintaining event-driven asynchronous polling. The SQS visibility timeout provides natural delay between status checks.

Key Design Decisions

1. Dynamic CloudFormation Stack Provisioning

Decision: Create and destroy full CloudFormation stacks per case/batch at runtime instead of using shared infrastructure.

Why:

  - Resource isolation — each case gets its own SQS queues, ECS tasks, and network security groups; no cross-case interference
  - Connection pool management — Nutrient's PostgreSQL has a hard connection limit; per-import stacks enable precise connection accounting
  - Clean teardown — deleting the CF stack guarantees all resources are removed; no orphaned queues or tasks
  - Independent scaling — each case scales based on its own queue depth

Trade-off: Stack creation takes 3-5 minutes (CF provisioning latency). Mitigated by the orchestrator's self-polling event loop.

2. Three-Container Sidecar Architecture

Decision: Single Fargate task runs three containers communicating via localhost:

┌──────────────────────────────────────────────┐
│  Fargate Task (shared network namespace)     │
│                                              │
│  ┌───────────┐  localhost:5000  ┌──────────┐ │
│  │  Poller   │ ───────────────→ │ DocEngine│ │
│  │  (Ruby)   │                  │(Nutrient)│ │
│  └─────┬─────┘                  └──────────┘ │
│        │ localhost:80                        │
│        ▼                                     │
│  ┌──────────────────┐                        │
│  │ Redirection Svc  │ ──→ S3 Presigned URLs  │
│  │ (Node.js)        │                        │
│  └──────────────────┘                        │
└──────────────────────────────────────────────┘

Why:

  - DocEngine requires co-located access (low latency for API calls)
  - Redirection Service generates presigned URLs without embedding S3 credentials in the Poller or DocEngine
  - All three scale together as a unit (queue depth drives task count)

3. SSM Parameter Store for Throttle State

Decision: Track per-stack throttle state in SSM Parameters (/nge/docUploader/{stackName}) rather than DynamoDB or in-memory.

Why:

  - Survives Lambda cold starts (no in-memory state loss)
  - Simple key-value read/write (no table schema needed)
  - Built-in TTL not needed — explicit cleanup on stack deletion
  - Low-volume writes (only on throttle/unthrottle events)
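A sketch of how this parameter doubles as an idempotency guard. The parameter path comes from the text; the `ParamStore` interface, state shape, and function names are illustrative assumptions standing in for the SSM SDK calls:

```typescript
// Parameter path documented above: /nge/docUploader/{stackName}
function throttleParamName(stackName: string): string {
  return `/nge/docUploader/${stackName}`;
}

// Minimal abstraction over SSM GetParameter/PutParameter for this sketch.
interface ParamStore {
  get(name: string): string | undefined;
  put(name: string, value: string): void;
}

// Idempotent-create guard: if state already exists for this stack, a duplicate
// JOB_STARTED event is ignored instead of creating a second stack.
function shouldCreateStack(store: ParamStore, stackName: string): boolean {
  const name = throttleParamName(stackName);
  if (store.get(name) !== undefined) return false;
  store.put(name, JSON.stringify({ status: "CREATE_INITIATED", at: Date.now() }));
  return true;
}
```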

4. PDF Retry Queue Pattern

Decision: Three-queue architecture per import: main queue, DLQ, and PDF retry queue.

Main Queue ←──────────── Message Transfer Lambda (enabled after JOB_FINISHED)
    │                              ↑
    ▼                              │
  Poller                    PDF Retry Queue
    │                              ↑
    ├── Success → SNS              │
    ├── Failure → DLQ              │
    └── PDF not ready ────────────→┘  (3-min visibility timeout)

Why: PDFs may not be generated on S3 yet when the upload message arrives. Rather than failing and consuming DLQ retries, the Poller sends the message to a dedicated retry queue with a delay. The Message Transfer Lambda is initially disabled and only enabled after JOB_FINISHED (when all PDFs are guaranteed to exist), forwarding retry messages back to the main queue for reprocessing.
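The Poller's routing decision (production code is Ruby) reduces to a small pure function. A sketch, with illustrative names; the three outcomes correspond to the error-handling table earlier in this document:

```typescript
type Disposition = "process" | "sendToPdfRetryQueue" | "silentSuccess";

// pdfExistsOnS3: result of the Redirection Service's /list check.
// fromRetryQueue: whether this message arrived via the PDF retry queue.
function disposition(pdfExistsOnS3: boolean, fromRetryQueue: boolean): Disposition {
  if (pdfExistsOnS3) return "process";
  // First attempt: PDF may simply not be generated yet → delayed retry.
  if (!fromRetryQueue) return "sendToPdfRetryQueue";
  // Already retried after JOB_FINISHED: the PDF genuinely doesn't exist.
  return "silentSuccess"; // log and skip, don't burn DLQ retries
}
```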

5. Connection Pool Throttling

Decision: Global throttling based on total ECS connection count vs MAX_ALLOWED_CONNECTIONS.

active_connections = running_tasks × PER_TASK_CONNECTION
if active_connections > MAX_ALLOWED_CONNECTIONS:
    publish UPLOADER_THROTTLED (re-enqueue with 900s delay)
else:
    create per-import stack

Why: Nutrient's PostgreSQL RDS has a hard connection limit. Each ECS task opens PER_TASK_CONNECTION connections (e.g., 10). The orchestrator checks the ECS cluster's running task count before creating new stacks, preventing database connection exhaustion across all active imports.
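The arithmetic above, translated from the pseudocode into TypeScript (the constants are environment variables on the real orchestrator; the function name is illustrative):

```typescript
// True when creating another stack would risk exhausting the shared
// PostgreSQL connection pool, i.e. the import should be throttled.
function shouldThrottle(
  runningTasks: number,
  perTaskConnections: number, // PER_TASK_CONNECTION
  maxAllowed: number,         // MAX_ALLOWED_CONNECTIONS
): boolean {
  return runningTasks * perTaskConnections > maxAllowed;
}
```

On throttle, the orchestrator re-enqueues UPLOADER_THROTTLED with a 900-second delay rather than failing the import.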

6. Queue-Depth Step Scaling

Decision: ECS auto-scaling based on SQS visible message count with step thresholds, not CPU/memory.

| Queue Depth | Scaling Action |
| --- | --- |
| > 5000 messages | Maximum capacity |
| > 2000 messages | High capacity |
| > 500 messages | Medium capacity |
| > 200 messages | Low capacity |
| < 200 messages | Minimum (1 task) |

Why: PDF processing is I/O-bound (S3 reads, API calls), not CPU-bound. Queue depth directly correlates with work remaining, making it a better scaling signal than resource utilization.
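The step selection can be sketched as a pure function. Only the thresholds come from the table above; the intermediate tier capacities (75%, 50%, 25% of max) are placeholder assumptions, since the text names the tiers but not their sizes:

```typescript
// Map SQS visible-message count to a desired ECS task count.
function desiredCapacity(visibleMessages: number, maxCapacity: number): number {
  if (visibleMessages > 5000) return maxCapacity;                    // maximum
  if (visibleMessages > 2000) return Math.ceil(maxCapacity * 0.75);  // high (assumed)
  if (visibleMessages > 500) return Math.ceil(maxCapacity * 0.5);    // medium (assumed)
  if (visibleMessages > 200) return Math.ceil(maxCapacity * 0.25);   // low (assumed)
  return 1;                                                          // minimum
}
```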

Divergences from Standard Architecture

| Standard Pattern | documentuploader Divergence | Reason |
| --- | --- | --- |
| Python 3.10+ | Ruby (Poller) + TypeScript (Orchestrator) + Node.js (Redirection) | Shoryuken for SQS processing; Nutrient SDK integration |
| Lambda handlers | ECS Fargate tasks (Poller + DocEngine + Redirection) | Long-running compute, sidecar requirements |
| Hexagonal core/shell | No core/shell split — orchestrator is infrastructure-only | No domain logic; pure infrastructure lifecycle management |
| Multi-tenant MySQL | Nutrient PostgreSQL (single shared instance) | Nutrient engine manages its own database |
| SQLAlchemy ORM | No ORM — Nutrient engine handles persistence | Document storage is Nutrient's responsibility |
| SQS → Lambda integration | SQS → Shoryuken (Ruby) on ECS | Long-running consumer with connection pooling |
| Static infrastructure | Dynamic CF stack creation/deletion per case/batch | Resource isolation and connection pool management |
| Exception hierarchy (Python) | CF stack status codes drive event routing | TypeScript orchestrator maps CF statuses to events |
| Checkpoint pipeline | N/A — single-action processing per message | Each message is one document upload, no multi-step |
| SNS filter policies | SNS filter on orchestrator queue + per-import queue isolation | Two-tier: global routing + per-import physical separation |

Configuration Model

Two config files (environment × region matrix):

| File | Purpose | Example Keys |
| --- | --- | --- |
| config.json | Infrastructure settings | VPC IDs, subnets, accounts, RDS endpoints |
| nutrient-config.json | Nutrient engine settings | Connection limits, worker pools, API tokens |

Runtime environment variables by component:

| Component | Key Variables |
| --- | --- |
| Orchestrator | MAX_ALLOWED_CONNECTIONS, PER_TASK_CONNECTION, TEMPLATE_BUCKET, RE_ENQUEUE_DELAY_IN_SECONDS |
| Poller | PER_CASE_DOC_UPLOADER_QUEUE_NAME, PDF_RETRY_QUEUE_PREFIX, REDIRECTION_SERVICE_ENDPOINT |
| DocEngine | PGHOST, DATABASE_CONNECTIONS, ASSET_STORAGE_S3_BUCKET, ACTIVATION_KEY |
| Redirection Svc | BUCKET_NAME, PORT, AWS_REGION |

CI/CD Pipeline Architecture

Three separate CodePipeline/CodeBuild pipelines:

| Pipeline | Purpose | Trigger |
| --- | --- | --- |
| Per-Import Pipeline | Synthesizes per-import CDK → CF template, uploads to S3 | Code push to per-import stacks |
| SOCI Pipeline | Builds Seekable OCI (SOCI) index for DocEngine image | DocEngine image push |
| DocEngine Image Push | Pushes pre-built Nutrient DocEngine image to ECR | Manual / release |

SOCI (Seekable OCI): Enables lazy-loading of container image layers so ECS tasks start faster — critical when dynamic stacks spin up new tasks and image pull time directly impacts batch processing latency.

Multi-Region Deployment

Region prefix naming convention:

| Region | Prefix | Example Stack |
| --- | --- | --- |
| us-east-1 | c2 | c2-staging-UploaderStack-123-456 |
| us-west-1 | c4 | c4-production-UploaderStack-123-456 |
| ca-central-1 | c5 | c5-production-UploaderStack-123-456 |

Each region has independent: VPC, subnets, RDS instance, Redis cluster, S3 bucket for assets, ACM certificate, and max connection thresholds.
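The naming convention can be sketched as a lookup plus a formatter. The prefixes and the example stack-name shape come from the table above; the function names, and the assumption that the trailing numbers are caseId and batchId, are illustrative:

```typescript
// Region → stack-name prefix, per the table above.
const REGION_PREFIX: Record<string, string> = {
  "us-east-1": "c2",
  "us-west-1": "c4",
  "ca-central-1": "c5",
};

// Build a per-import stack name like c2-staging-UploaderStack-123-456.
// The two trailing identifiers are assumed here to be caseId and batchId.
function uploaderStackName(region: string, env: string, caseId: string, batchId: string): string {
  const prefix = REGION_PREFIX[region];
  if (!prefix) throw new Error(`No prefix configured for region ${region}`);
  return `${prefix}-${env}-UploaderStack-${caseId}-${batchId}`;
}
```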

Dynamic Worker Scaling

The orchestrator supports event-driven worker limit overrides:

JOB_STARTED event
  → eventDetail.workerLimit = 50 (from job processor)
  → Orchestrator: newCapacity = Math.ceil(50 / PER_TASK_CONNECTION)
  → Updates Application Auto Scaling maxCapacity
  → Adjusts step scaling policy if workerLimit > default maxWorkers

This allows the job processor to request more (or fewer) ECS tasks per import based on batch size or priority, overriding the default scaling configuration.
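The override arithmetic above as a small helper (a sketch; the function name and the handling of a missing workerLimit are assumptions):

```typescript
// Derive the Application Auto Scaling maxCapacity from an optional
// workerLimit carried on the JOB_STARTED event detail.
function overrideMaxCapacity(
  workerLimit: number | undefined,
  perTaskConnections: number, // PER_TASK_CONNECTION
  defaultMaxCapacity: number,
): number {
  if (workerLimit === undefined) return defaultMaxCapacity;
  return Math.ceil(workerLimit / perTaskConnections);
}
```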

Stack Cleanup on Deletion

When a per-import stack is deleted, the orchestrator performs explicit cleanup:

  1. Unsubscribe per-import queue from SNS topic (ARN stored in queue tags)
  2. Disable Message Transfer Lambda event source mapping
  3. Delete per-import SQS queues (main, DLQ, PDF retry)
  4. Delete SSM parameter store entry
  5. Delete CloudFormation stack (removes ECS tasks, security groups, log config)

This ensures no orphaned resources remain after batch processing completes.

Container Build Patterns

Both custom containers use Alpine-based minimal images:

Ruby Poller (multi-stage build):

Stage 1 (builder): ruby:3.2.2-alpine + build-base → compile native gems
Stage 2 (runner):  ruby:3.2.2-alpine (minimal) → copy compiled gems
Entry: bundle exec shoryuken -q {QUEUE} -r ./doc_uploader_poller.rb -c {CONCURRENCY}

Redirection Service (single stage):

node:20-alpine → npm ci --omit=dev → node app.js

Poller concurrency is configurable via pollerConcurrency config (default: 1), controlling Shoryuken worker threads per ECS task.

Observability

Logging Strategy

Toggle between centralized and per-import log groups:

| Mode | Log Group Pattern | Use Case |
| --- | --- | --- |
| Centralized | /nge/{region}-{env}/docUploader/ecs/docEngine/{caseId}/{batchId} | Production: per-case log isolation |
| Local | /nge/{region}-{env}/docUploader/ecs | Dev/test: simplified debugging |

Controlled by useCentralizedLogging config flag.

Structured Logging

All components emit structured JSON logs with consistent context:

{
  "level": "INFO",
  "message": "Document uploaded successfully",
  "caseId": "123",
  "batchId": "456",
  "jobId": "789",
  "documentId": "doc_001",
  "exhibitIds": "[1,2,3]"
}

The Ruby Poller, Node.js Redirection Service, and TypeScript Orchestrator all follow the same field naming convention for cross-component log correlation.
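A minimal helper producing that shape, as the TypeScript Orchestrator might implement it (the function and interface names are illustrative; only the field names come from the convention above):

```typescript
// Shared context fields used across Poller, Redirection Service, and Orchestrator.
interface LogContext {
  caseId: string;
  batchId: string;
  jobId?: string;
  documentId?: string;
  exhibitIds?: string;
}

// Emit one structured JSON log line with consistent field naming.
function logLine(level: "INFO" | "WARN" | "ERROR", message: string, ctx: LogContext): string {
  return JSON.stringify({ level, message, ...ctx });
}
```

Keeping the field names identical across all three runtimes is what makes cross-component correlation (e.g. filtering a CloudWatch Insights query on caseId and batchId) possible.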

Lessons Learned

  1. Self-referencing SNS→SQS polling is cost-effective — avoids Step Functions for simple status polling; SQS visibility timeout provides natural backoff.

  2. Dynamic stacks add latency but guarantee isolation — 3-5 minute CF creation time is acceptable for batch imports that run hours; isolation prevents noisy-neighbor problems across cases.

  3. Connection pool accounting is critical — without global throttling, a burst of concurrent imports can exhaust the database connection pool, causing cascading failures across all active imports.

  4. Disabled-by-default Lambdas are a useful coordination primitive — the Message Transfer Lambda starts disabled and is enabled via API call when upstream processing completes, avoiding premature retry of PDFs.

  5. Three-queue pattern separates concerns — main queue for active work, DLQ for permanent failures, retry queue for timing-dependent retries. Each queue has different visibility timeouts and consumer patterns.

  6. Ruby Shoryuken excels for long-running SQS consumers — built-in receive count tracking, visibility timeout extension, and concurrent worker pools make it ideal for ECS-based queue processing.

  7. SOCI indexes reduce ECS task startup time — lazy-loading container layers matters when dynamic stacks create tasks on demand and every minute of CF + image pull latency delays batch processing.

  8. Explicit cleanup prevents resource leaks — deleting a CF stack removes ECS resources but not SQS queues or SSM parameters; the orchestrator must clean these up separately.
