Reference Implementation: documentexporter
Overview
The documentexporter module exports documents from the Nextpoint eDiscovery platform into standard litigation production formats (TIFF, JPEG, PDF, native files) with loadfile generation (LOG, OPT, LFP, DII) and 7zip packaging.
Unlike documentloader (which was built with the architecture patterns in this repo), documentexporter predates those patterns and uses a different architectural approach: Step Functions + ECS containers rather than SNS/SQS + Lambda handlers.
This reference implementation documents how documentexporter works and where it aligns with or diverges from the standard architecture.
Pattern Mapping
| Pattern | Status | documentexporter Implementation |
|---|---|---|
| Hexagonal boundaries | Diverges | Flat containers/shared/ module — no core/shell separation. Business logic and infrastructure (S3, API calls) are mixed within the same files |
| Exception hierarchy | Diverges | Uses container exit codes (20–50) instead of RecoverableException/PermanentFailureException/SilentSuccessException. Step Functions catches exit codes and routes to error handler |
| SNS events | Diverges | No SNS events. Export Lambda is invoked directly by the Nextpoint backend (Sidekiq). No event-driven communication |
| SQS handler | Not used | No SQS queues. Step Functions orchestrates ECS tasks directly |
| Checkpoint pipeline | Partial | Step Functions provides implicit checkpointing via state transitions (ExportVolumes → LoadfileGenerator → Zipper → Completer). Not a database-backed checkpoint state machine |
| Database sessions | Diverges | Uses SQLite manifest (downloaded from S3) instead of RDS session management. No writer/reader session pattern |
| Retry/resilience | Partial | Step Functions handles retries at the state level. No @retry_on_db_conflict (no MySQL writes). Nutrient API errors are caught and logged, allowing export to continue for remaining documents |
| Idempotency | Partial | Identity cache prevents duplicate filenames within an export. No checkpoint-based distributed lock |
| Multi-tenancy | Aligns | Per-case isolation via case_{npcase_id} S3 prefix and EFS directory structure |
| CDK infrastructure | Partial | Single CDK stack (not two-stack composition). Uses ECS Fargate + EFS instead of Lambda + SQS. Step Functions for orchestration |
| Config management | Diverges | Environment variables injected by Step Functions per-task (not CONFIG_MAP). Nutrient config via JSON file |
| Structured logging | Diverges | Custom timestamp-prefixed logger, not JSON structured logging |
| ORM models | Not used | No SQLAlchemy models. Reads from SQLite manifest created by the Nextpoint backend |
Architecture
documentexporter/
├── lambda/
│ └── handler.ts # Export orchestrator — builds Step Functions state machine
├── updater_lambda/
│ └── handler.ts # Completion/error handler — updates Nextpoint API
├── cleanup_lambda/
│ └── handler.ts # Scheduled: deletes state machines older than 14 days
├── containers/
│ ├── shared/ # Shared Python library (18 modules, flat structure)
│ │ ├── constants.py # Export enums (NativeInclusion, ImageFormats)
│ │ ├── db.py # SQLite manifest handling
│ │ ├── export.py # Export metadata model
│ │ ├── exhibit.py # Document processing with multiprocessing pools
│ │ ├── identifier.py # Document naming with identity dedup cache
│ │ ├── template.py # Export template configuration
│ │ ├── nutrient.py # Document engine client (batch page download)
│ │ ├── loadfile.py # Loadfile generators (LOG, OPT, LFP, DII)
│ │ ├── s3.py # S3 operations with pagination
│ │ ├── nextpoint_api.py # Nextpoint API client (XML, HMAC SHA1 auth)
│ │ ├── document_markup.py # Annotation handling
│ │ ├── custom_pages.py # Cover/summary page generation
│ │ ├── native_placeholder.py # Placeholder guard rails for unsupported native files
│ │ ├── expansive_pdf.py # Multi-volume PDF (1000-page split threshold)
│ │ └── ...
│ ├── exporter/main.py # Per-volume document processor
│ ├── loadfile-generator/main.py # Loadfile assembly orchestrator
│ ├── zipper/main.py # 7zip compression (20GB volume splits)
│ ├── tests/ # Python container tests (pytest)
│ │ ├── conftest.py # Test fixtures
│ │ └── shared/ # Unit tests for shared modules
│ └── base/ # Pre-built Docker base images
├── lib/
│ ├── nge-export-service-stack.ts # CDK infrastructure (ECS, EFS, Step Functions)
│ └── devConfig.ts # Local bridge development configuration
└── test/ # Jest tests (TypeScript Lambda tests only)
Orchestration: Step Functions + ECS
The export Lambda dynamically builds a Step Functions state machine definition at runtime based on the export type, then starts an execution:
Export Lambda (invoked by Nextpoint backend)
│
├─ Upload volumes_input.json to S3
├─ Create Step Functions state machine
└─ Start execution
│
├─ Map State (max 20 concurrent volumes)
│ └─ ExportOneVolume (ECS Task)
│ ├─ Exporter container (processes documents)
│ ├─ Nutrient sidecar (renders PDFs/images, health-checked)
│ └─ Redirection sidecar (caches asset URLs)
│
├─ LoadfileGenerator (ECS Task)
│ └─ Generates LOG/OPT/LFP/DII files
│
├─ Zipper (ECS Task)
│ └─ 7zip compress with 20GB splits
│
├─ Completer (Updater Lambda)
│ └─ Updates Nextpoint API, aggregates page counts
│
└─ ErrorHandler (Updater Lambda with error status)
└─ Maps exit codes to user-friendly messages
Two export modes:
- Normal: Full pipeline (volumes → loadfiles → zip → complete)
- Metadata-only: Skips volume processing, goes directly to loadfile generation
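The two modes can be pictured as two variants of one Amazon States Language definition. The sketch below is illustrative only: it is written in Python for consistency with the container code (the real orchestrator is the TypeScript export Lambda), `build_definition` and the `$.volumes` input path are hypothetical, task parameters such as cluster and task definition are omitted, and it assumes metadata-only exports still run the Zipper and Completer.

```python
from typing import Optional

MAX_CONCURRENT_VOLUMES = 20  # Map state limit described in this document


def task(resource: str, next_state: Optional[str]) -> dict:
    """Build a Task state that routes any failure to ErrorHandler."""
    state = {
        "Type": "Task",
        "Resource": resource,
        "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ErrorHandler"}],
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition(metadata_only: bool) -> dict:
    """Return an ASL-shaped dict for either export mode (hypothetical helper)."""
    ecs = "arn:aws:states:::ecs:runTask.sync"
    fn = "arn:aws:states:::lambda:invoke"
    states = {
        "LoadfileGenerator": task(ecs, "Zipper"),
        "Zipper": task(ecs, "Completer"),
        "Completer": task(fn, None),
        "ErrorHandler": {"Type": "Task", "Resource": fn, "End": True},
    }
    start_at = "LoadfileGenerator"  # metadata-only exports skip volume processing
    if not metadata_only:
        states["ExportVolumes"] = {
            "Type": "Map",
            "ItemsPath": "$.volumes",  # assumed input shape
            "MaxConcurrency": MAX_CONCURRENT_VOLUMES,
            "ItemProcessor": {
                "ProcessorConfig": {"Mode": "INLINE"},
                "StartAt": "ExportOneVolume",
                "States": {
                    "ExportOneVolume": {"Type": "Task", "Resource": ecs, "End": True}
                },
            },
            # Errors are caught on the Map state because states inside the
            # item processor cannot transition to states outside it.
            "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "ErrorHandler"}],
            "Next": "LoadfileGenerator",
        }
        start_at = "ExportVolumes"
    return {"StartAt": start_at, "States": states}
```

Building the definition per export keeps the metadata-only path free of an empty Map state, at the cost of creating one state machine per export.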
Key Design Decisions
Step Functions + ECS Instead of Lambda + SQS
Document export is compute-intensive (image conversion, OCR, PDF rendering) and requires large working storage (EFS mounts). ECS Fargate tasks get:
- 4 vCPU, 8–24 GB RAM per task
- 200GB ephemeral storage for exporter
- EFS mounts for cross-container file sharing
- Sidecar containers (Nutrient document engine + redirection service)
Lambda's 15-minute timeout and limited storage would be insufficient.
Exit Code Error Propagation
Container exit codes map to user-facing error messages:
| Exit Code | Meaning |
|---|---|
| 20 | Document not found in Nutrient API |
| 21 | Nutrient API error (4xx/5xx) |
| 22 | Network error calling Nutrient API |
| 23 | Outdated/corrupted data from Nutrient API |
| 30–32 | Zipper errors (compression, S3 upload) |
| 40–47 | Loadfile generation errors |
| 50 | Nextpoint API error |
| 1 | Generic export failure |
The Updater Lambda reads exit codes from Step Functions task output and translates them to user-friendly messages.
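A minimal sketch of both sides of this contract follows, written in Python for illustration (the real Updater Lambda is TypeScript). The class names, the `run` wrapper, and `message_for` are hypothetical, the message strings paraphrase the table above rather than the production wording, and falling back to the generic message for unknown codes is an assumption.

```python
from typing import Callable

# Single-code entries from the table above.
EXIT_CODE_MESSAGES = {
    1: "Generic export failure",
    20: "Document not found in Nutrient API",
    21: "Nutrient API error (4xx/5xx)",
    22: "Network error calling Nutrient API",
    23: "Outdated/corrupted data from Nutrient API",
    50: "Nextpoint API error",
}
# Ranged entries from the table above (30-32 and 40-47, inclusive).
EXIT_CODE_RANGES = [
    (range(30, 33), "Zipper error (compression, S3 upload)"),
    (range(40, 48), "Loadfile generation error"),
]


class ExportError(Exception):
    """Container-side error that carries the exit code to report."""
    exit_code = 1


class NutrientDocumentNotFound(ExportError):
    exit_code = 20


def run(main: Callable[[], None]) -> int:
    """Run a container entry point and translate failures into an exit code."""
    try:
        main()
    except ExportError as exc:
        return exc.exit_code
    except Exception:
        return 1  # anything unexpected becomes the generic failure
    return 0


def message_for(exit_code: int) -> str:
    """Updater-side lookup: exit code to user-facing message."""
    if exit_code in EXIT_CODE_MESSAGES:
        return EXIT_CODE_MESSAGES[exit_code]
    for codes, message in EXIT_CODE_RANGES:
        if exit_code in codes:
            return message
    return EXIT_CODE_MESSAGES[1]  # assumed fallback for unknown codes
```

A container entry point would end with something like `sys.exit(run(main))`, which is the only channel through which the error type reaches Step Functions.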
SQLite Manifest Instead of RDS
Export input data arrives as a SQLite database, packaged as a zip file created by
the Nextpoint backend. The manifest contains the exhibits_for_export and exhibits
tables. Because it is self-contained and read-only, the export needs no per-case
database connections.
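As a rough illustration of treating the manifest as read-only input, the sketch below opens an already-unzipped manifest with SQLite's `mode=ro` URI flag. The table names come from this document; the helper names and the idea of counting rows are hypothetical, and the real db.py may work differently.

```python
import sqlite3
from pathlib import Path

# Table names listed in this document.
MANIFEST_TABLES = ("exhibits_for_export", "exhibits")


def open_manifest(path: Path) -> sqlite3.Connection:
    """Open the manifest read-only so accidental writes fail fast."""
    uri = path.resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row
    return conn


def table_counts(conn: sqlite3.Connection) -> dict:
    """Return the row count of each known manifest table."""
    counts = {}
    for table in MANIFEST_TABLES:  # fixed allow-list, never user input
        counts[table] = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return counts
```

Opening with `mode=ro` makes any write attempt raise `sqlite3.OperationalError`, which matches the read-only role the manifest plays here.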
Multiprocessing for CPU-Intensive Operations
The exporter container uses Python multiprocessing.Pool for image conversion
(PNG → TIFF/JPEG) and OCR (tesserocr). Combined with document prerendering
(fetching next document while processing current), this maximizes throughput.
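The prerendering idea (fetch the next document while the current one is being processed) can be sketched with a single background thread. This is an illustration only: this document does not say how documentexporter implements prefetching, using a thread for the I/O-bound fetch is an assumption, and `fetch`, `process`, and `process_with_prefetch` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

Doc = TypeVar("Doc")
Result = TypeVar("Result")


def process_with_prefetch(
    doc_ids: Iterable[int],
    fetch: Callable[[int], Doc],
    process: Callable[[Doc], Result],
) -> List[Result]:
    """Process documents in order while the next one downloads in the background."""
    results: List[Result] = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for doc_id in doc_ids:
            upcoming = pool.submit(fetch, doc_id)  # start fetching the next document
            if pending is not None:
                # Process the previous document while `upcoming` is in flight.
                results.append(process(pending.result()))
            pending = upcoming
        if pending is not None:
            results.append(process(pending.result()))  # last document
    return results
```

In this arrangement `process` would be the place that hands page images to a `multiprocessing.Pool`, so network wait and CPU work overlap instead of alternating.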
Volume-Based Parallelism
Documents are pre-split into volumes by the backend. Step Functions Map state processes up to 20 volumes concurrently, each as a separate ECS task. This provides natural parallelism boundaries and fault isolation.
Storage Layout
S3 (coordination data, final outputs)
s3://{bucket}/case_{npcase_id}/exports/export_{export_id}/
├── volumes_input.json # Step Functions input
├── volumes.json # Updated by zipper with S3 paths
├── loadfile_data/
│ ├── metadata.json # Metadata-only exports
│ └── {volume_position}.json # Per-volume loadfile data
├── error_logs/{volume}.txt # Per-volume error logs
├── page_counts/page_count_{volume}.txt
└── zips/{export_name}.zip[.001,.002,...]
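The key layout above lends itself to a few small path helpers. The functions below are hypothetical (the actual s3.py is not shown in this document) and only restate the layout as code.

```python
def export_prefix(npcase_id: int, export_id: int) -> str:
    """Per-case, per-export S3 prefix (without the bucket)."""
    return f"case_{npcase_id}/exports/export_{export_id}/"


def loadfile_data_key(npcase_id: int, export_id: int, volume_position: int) -> str:
    """Key of the per-volume loadfile data written for the loadfile generator."""
    return f"{export_prefix(npcase_id, export_id)}loadfile_data/{volume_position}.json"


def page_count_key(npcase_id: int, export_id: int, volume: int) -> str:
    """Key of the per-volume page count file aggregated at completion."""
    return f"{export_prefix(npcase_id, export_id)}page_counts/page_count_{volume}.txt"


def error_log_key(npcase_id: int, export_id: int, volume: int) -> str:
    """Key of the per-volume error log."""
    return f"{export_prefix(npcase_id, export_id)}error_logs/{volume}.txt"
```

Deriving every key from `export_prefix` keeps the `case_{npcase_id}` isolation boundary in one place.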
EFS (working storage during export)
/mnt/efs/{case_id}/export_{export_id}/
├── {cleansed_name}/
│ ├── NATIVES/ # Native file downloads
│ ├── IMAGES/{volume}/ # Converted images per volume
│ └── TEXT/{volume}/ # OCR text per volume
└── tmp/zip/ # Loadfile assembly scratch space
ECS Task Resources
| Task | CPU | Memory | Storage | Sidecars |
|---|---|---|---|---|
| Exporter | 4096 | 8192 MB | 200GB ephemeral + EFS | Nutrient (16GB), Redirection (8GB) |
| Zipper | 4096 | 24576 MB | EFS only | None |
| Loadfile Generator | 4096 | 24576 MB | EFS only | None |
Divergences from Standard Architecture
These are documented for visibility, not as criticism — documentexporter was built before the standard patterns were established and its compute requirements (large files, image processing, sidecars) differ from the Lambda-based ingestion pipeline.
1. No Hexagonal Boundaries
Business logic and infrastructure are mixed in containers/shared/. For example,
exhibit.py contains both document processing logic and direct S3/Nutrient API calls.
The standard pattern would separate these into core/ (pure logic) and shell/
(infrastructure adapters).
2. No SNS Event Communication
The module is invoked directly by the Nextpoint backend and communicates completion via direct API calls (HMAC-authenticated XML). It doesn't participate in the event-driven architecture. Adding SNS events for EXPORT_STARTED, EXPORT_COMPLETED would integrate it into the platform event stream.
3. No Exception-Based Flow Control
Instead of RecoverableException/PermanentFailureException controlling SQS message disposition, exit codes control Step Functions error routing. This works but couples error semantics to container process boundaries.
4. TypeScript Orchestration + Python Processing
The orchestration layer (Lambda handlers) is in TypeScript while processing is in Python. The standard pattern uses Python throughout. This creates a mixed-language codebase.
5. No Structured JSON Logging
Uses a custom timestamp-prefixed logger instead of the standard JSON formatter with context fields (caseId, batchId, jobId). This makes log aggregation and querying harder in CloudWatch.
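For comparison, a JSON formatter carrying those context fields could look roughly like the sketch below. It uses only the standard `logging` module; the class name and the exact output schema are assumptions, since the standard formatter itself is not reproduced in this document.

```python
import json
import logging

# Context fields named in this document.
CONTEXT_FIELDS = ("caseId", "batchId", "jobId")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with optional context fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)  # set via logger's `extra=`
            if value is not None:
                payload[field] = value
        return json.dumps(payload)
```

A caller would attach the formatter to a handler and pass context per call, for example `logger.info("volume done", extra={"caseId": 42})`, which makes the fields filterable in CloudWatch Logs Insights.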
6. Python Testing Foundation (Early Stage)
Python container tests now exist under containers/tests/ with pytest infrastructure
(pytest.ini, requirements-test.txt, conftest.py). Initial tests cover S3
operations and attachment handling. Coverage is still limited — most Python container
code relies on end-to-end testing via the Nextpoint frontend. Jest tests cover
TypeScript Lambda handlers.
Nutrient (PSPDFKit) Sidecar Architecture
The exporter task runs three containers in a single Fargate task with strict dependency ordering via health checks:
Container Start Order (dependency chain):
1. Redirection Service starts first (port 80, health check every 5s)
2. Nutrient DocEngine starts after Redirection is HEALTHY (port 5000, health check every 5s, 10 retries)
3. Exporter starts after Nutrient is HEALTHY
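The dependency chain above can be modelled as a small graph and rendered into ECS-style `dependsOn` entries. This is a sketch under assumptions: the container names are hypothetical, the health check commands are not given in this document and are left out, and the real definitions live in the TypeScript CDK stack.

```python
from graphlib import TopologicalSorter

# container -> containers that must be HEALTHY before it starts
DEPENDS_ON = {
    "redirection": [],
    "nutrient": ["redirection"],
    "exporter": ["nutrient"],
}

# Intervals and retries from the start order above (seconds / attempts).
HEALTH_CHECKS = {
    "redirection": {"interval": 5},
    "nutrient": {"interval": 5, "retries": 10},
}


def start_order(depends_on: dict) -> list:
    """Return container names with every dependency before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


def container_definitions(depends_on: dict, health_checks: dict) -> list:
    """Build minimal ECS-shaped container definitions in start order."""
    definitions = []
    for name in start_order(depends_on):
        definition = {"name": name}
        if name in health_checks:
            definition["healthCheck"] = dict(health_checks[name])
        if depends_on[name]:
            definition["dependsOn"] = [
                {"containerName": dep, "condition": "HEALTHY"}
                for dep in depends_on[name]
            ]
        definitions.append(definition)
    return definitions
```

Because the exporter only depends on Nutrient, and Nutrient on the redirection service, a failing health check anywhere in the chain keeps the exporter from starting against a half-ready sidecar.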
| Container | CPU | Memory (prod) | Memory (staging) | Port | Purpose |
|---|---|---|---|---|---|
| Exporter | 4096 | 8192 MB | 8192 MB | — | Document processing, image conversion, OCR |
| Nutrient DocEngine | 4096 | 16384 MB | 12288 MB | 5000 | PDF rendering, page image generation |
| Redirection Service | 1024 | 8192 MB | 8192 MB | 80 | S3 asset download proxy (Ruby) |
Nutrient resources are environment-specific (nge-export-service-stack.ts:55-59):
production gets higher allocation to handle large document sets.
Nutrient configuration (lib/nutrient-config.json) is a per-environment ×
per-region matrix controlling DocEngine behavior: PostgreSQL RDS, Redis cache,
S3 asset storage, worker pool size, and activation keys.
Image Processing & OCR
Document Processing Logic (exhibit.py):
- Nutrient page count is used as the authoritative page count when the DB
verified page count exceeds it (prevents out-of-bounds page requests)
- Redaction annotations are applied conditionally — the apply_redactions
Nutrient endpoint is only called when redactions are present
- Native placeholder documents have guard rails (native_placeholder.py)
that short-circuit processing to a single-page placeholder image download
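The first two rules above reduce to two small decisions, sketched here with hypothetical function names. The annotation shape (a dict with a `type` key equal to `"redaction"`) is an assumption made for the example and is not the Nutrient annotation schema.

```python
from typing import Iterable, Mapping


def effective_page_count(db_verified_pages: int, nutrient_pages: int) -> int:
    """Never request more pages than Nutrient reports for the document."""
    return min(db_verified_pages, nutrient_pages)


def needs_redaction_call(annotations: Iterable[Mapping]) -> bool:
    """Only call the apply_redactions endpoint when a redaction exists."""
    return any(a.get("type") == "redaction" for a in annotations)
```

For example, a document recorded with 12 verified pages that Nutrient renders as 10 pages would be exported as pages 0 through 9, avoiding out-of-bounds page requests.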
Image Conversion (multiprocessing):
- PNG → TIFF/JPEG via Python multiprocessing.Pool
- Color detection using NumPy array comparison (determines TIFF compression)
- Configurable pool size: POOL_WORKERS env var or CPU count
- Thread oversubscription prevention: OMP_NUM_THREADS=1
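The pool sizing rule can be sketched as follows. `POOL_WORKERS` and `OMP_NUM_THREADS` are the variables named above; the helper names and the handling of missing or invalid values (fall back to the CPU count) are assumptions.

```python
import os
from typing import Mapping, Optional


def resolve_pool_size(env: Optional[Mapping[str, str]] = None) -> int:
    """Use POOL_WORKERS when it is a positive integer, else the CPU count."""
    env = os.environ if env is None else env
    raw = env.get("POOL_WORKERS", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return os.cpu_count() or 1


def worker_environment(env: Mapping[str, str]) -> dict:
    """Copy of the environment with native thread pools pinned to one thread."""
    pinned = dict(env)
    # Each pool worker is already one process per core; letting native
    # libraries spawn their own threads would oversubscribe the CPU.
    pinned["OMP_NUM_THREADS"] = "1"
    return pinned
```

The resolved size would then be passed as `multiprocessing.Pool(processes=resolve_pool_size())`, with `OMP_NUM_THREADS` set before the workers import any native libraries.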
OCR (Tesseract via tesserocr C++ API):
- OSD (Orientation/Script Detection) for auto-rotation
- Preprocessing pipeline: deskew + sharpen before recognition
- Batch processing with multiprocessing pools
- Lazy-loaded: OCR library only present in exporter base image
Expansive PDF Generation (pikepdf):
- Multi-volume support with 1000-page split threshold
- Outline/bookmark generation for document navigation
- Family relationship grouping (parent-child documents)
- Summary cover page injection
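One plausible reading of the 1000-page split threshold is a greedy grouping of whole documents into PDF volumes. The sketch below shows that reading only: this document does not describe the actual rule in expansive_pdf.py (for example, how families or oversized documents are handled), so `split_into_volumes` and its behaviour are assumptions.

```python
from typing import List, Sequence

SPLIT_THRESHOLD = 1000  # pages per PDF volume, from the list above


def split_into_volumes(
    page_counts: Sequence[int], threshold: int = SPLIT_THRESHOLD
) -> List[List[int]]:
    """Group document indexes into volumes without splitting a document.

    A new volume starts when adding the next document would exceed the
    threshold. A single document larger than the threshold gets a volume
    of its own (assumed behaviour).
    """
    volumes: List[List[int]] = []
    current: List[int] = []
    pages_in_current = 0
    for index, pages in enumerate(page_counts):
        if current and pages_in_current + pages > threshold:
            volumes.append(current)
            current, pages_in_current = [], 0
        current.append(index)
        pages_in_current += pages
    if current:
        volumes.append(current)
    return volumes
```

With page counts `[400, 500, 200, 1200, 100]` this grouping yields four volumes: documents 0 and 1 together, then documents 2, 3, and 4 each on their own.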
CI/CD Pipeline
Bitbucket → CodePipeline → CodeBuild:
- Branch push triggers CodePipeline via CodeStar Connections
- CodeBuild prebuild: ECR login, base image version check
- CodeBuild build: cdk deploy --all
- Conditional developer alias support for dev isolation
Base Image Versioning:
- Hash-based tags: 1.0-{file_hash}
- Three base images: exporter-base (Python 3.12 + Tesseract + Leptonica),
loadfile-base (Python 3.12 + 7zip), zipper-base (Python 3.12)
- check-and-build-bases.sh compares hash → only rebuilds on change
- Stored per-environment in ECR: {env}-nge/exporter-base
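The hash-based tagging scheme can be sketched as below. The `1.0-{file_hash}` format comes from the list above; the choice of SHA-256, the 12-character truncation, which files feed the hash, and both function names are assumptions, since check-and-build-bases.sh is not reproduced here.

```python
import hashlib
from pathlib import Path
from typing import Iterable, Set


def base_image_tag(files: Iterable[Path], version: str = "1.0") -> str:
    """Derive a tag that changes whenever any input file changes."""
    digest = hashlib.sha256()
    for path in sorted(files):
        digest.update(path.name.encode())  # renames also change the tag
        digest.update(path.read_bytes())
    return f"{version}-{digest.hexdigest()[:12]}"


def needs_rebuild(tag: str, existing_tags: Set[str]) -> bool:
    """Rebuild only when the registry has no image with this tag."""
    return tag not in existing_tags
```

Content-derived tags make the rebuild check idempotent: an unchanged Dockerfile maps to a tag already in ECR, so the build step can be skipped.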
Multi-Region Deployment
| Region | Prefix | Nutrient DB | Redis | S3 Bucket |
|---|---|---|---|---|
| us-east-1 | c2 | c2_{env}_nutrientdb | region-specific ElastiCache | trialmanager-{env} |
| us-west-1 | c4 | c4_{env}_nutrientdb | region-specific ElastiCache | c4-trialmanager-{env} |
| ca-central-1 | c5 | c5_{env}_nutrientdb | region-specific ElastiCache | c5-trialmanager-{env} |
Infrastructure parameters looked up via SSM: /nge/shared/{region_prefix}/vpc/vpcId, etc.
Cleanup Lambda
Scheduled Lambda that prevents state machine accumulation:
- Lists all Step Functions state machines by prefix
- Deletes machines older than configurable age (default 24 hours)
- Retry logic with exponential backoff for API throttling
- Non-fatal: logs errors but doesn't fail deployment
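The selection and backoff logic can be sketched as two pure functions. This is Python for illustration (the real cleanup Lambda is TypeScript); the function names, the backoff base, and the cap are hypothetical. The input dicts use the `name`, `stateMachineArn`, and `creationDate` fields that the Step Functions list API returns.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, Mapping, Optional


def expired_machines(
    machines: Iterable[Mapping],
    prefix: str,
    max_age: timedelta,
    now: Optional[datetime] = None,
) -> List[str]:
    """Return ARNs of prefixed state machines older than max_age."""
    now = now or datetime.now(timezone.utc)
    return [
        m["stateMachineArn"]
        for m in machines
        if m["name"].startswith(prefix) and now - m["creationDate"] > max_age
    ]


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 30.0) -> List[float]:
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * 2 ** attempt) for attempt in range(attempts)]
```

Keeping the age as an explicit `max_age` argument leaves the retention policy to configuration rather than to the selection code.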
Security
Authentication:
- Nextpoint API: HMAC-SHA1 signed requests (XML format)
- Nutrient DocEngine: JWT (RS256) + API auth token
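HMAC-SHA1 signing itself needs only the Python standard library, as sketched below. What goes into the signed string (method, path, timestamp, body) is specific to the Nextpoint API and is not documented here, so the `string_to_sign` argument and both function names are placeholders.

```python
import hashlib
import hmac


def sign(secret: str, string_to_sign: str) -> str:
    """Return the hex HMAC-SHA1 signature of a canonical request string."""
    return hmac.new(
        secret.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha1
    ).hexdigest()


def verify(secret: str, string_to_sign: str, signature: str) -> bool:
    """Constant-time comparison, as a receiving service would do."""
    return hmac.compare_digest(sign(secret, string_to_sign), signature)
```

The secret would come from Secrets Manager at runtime rather than from the environment or source code.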
Secrets Management:
- Nutrient RDS credentials: AWS Secrets Manager
- JWT public key: Secrets Manager
- Nextpoint API key: Secrets Manager (accessed by Updater Lambda)
- Production-only: ACTIVATION_KEY, SECRET_KEY_BASE from Secrets Manager
Network:
- ECS tasks in private subnets with NAT
- EFS transit encryption enabled
- EFS lifecycle: auto-delete files after 7 days
- Security groups: task SG → EFS SG (NFS port 2049)
Loadfile Formats
The module generates four standard litigation loadfile formats:
| Format | Separator | Purpose |
|---|---|---|
| LOG | Tab | Alias, Volume, Path, DocStart, FolderBreak, BoxBreak, Pages |
| OPT | Tab | Same schema as LOG (alternate consumer format) |
| LFP | Semicolon | Image mapping: identity, boundary, position, volume path, rotation |
| DII | Custom | Document index: Bates/control number ranges + page image references |
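As a small illustration of the tab-separated LOG layout, the sketch below renders rows using the column names from the table. The sample values, the absence of a header row, and the `render_log_rows` helper are assumptions; the real loadfile.py may order, quote, or terminate lines differently.

```python
import csv
import io
from typing import Iterable, Mapping

# Column order taken from the LOG row of the table above.
LOG_COLUMNS = ["Alias", "Volume", "Path", "DocStart", "FolderBreak", "BoxBreak", "Pages"]


def render_log_rows(rows: Iterable[Mapping], delimiter: str = "\t") -> str:
    """Render one line per page image, leaving missing columns empty."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    for row in rows:
        writer.writerow([row.get(column, "") for column in LOG_COLUMNS])
    return buffer.getvalue()
```

Routing output through `csv.writer` means a value that happens to contain the delimiter is quoted instead of silently shifting columns.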
Lessons Learned
- ECS is appropriate for compute-heavy export. Image conversion, OCR, and PDF rendering exceed Lambda's resource limits. ECS Fargate with EFS provides the necessary CPU, memory, and storage.
- Step Functions provides implicit orchestration. The state machine handles retry, parallelism, and error routing without custom queue management, but it lacks the fine-grained message-level control of SQS.
- SQLite manifests decouple from RDS. Exporting doesn't need live database connections. A self-contained manifest avoids connection pool pressure during large exports.
- Volume-based parallelism has natural limits. Max 20 concurrent volumes balances throughput with ECS task limits and EFS IOPS.
- Exit code error mapping is fragile. Adding new error types requires coordinating between Python containers and TypeScript Lambda. Exception hierarchies with typed errors are more maintainable.
- Mixed TypeScript/Python increases maintenance burden. Developers need proficiency in both languages. A single-language approach reduces context switching.