Reference Implementation: documentexporter¶

Overview¶

The documentexporter module exports documents from the Nextpoint eDiscovery platform into standard litigation production formats (TIFF, JPEG, PDF, native files) with loadfile generation (LOG, OPT, LFP, DII) and 7zip packaging.

Unlike documentloader (which was built with the architecture patterns in this repo), documentexporter predates those patterns and uses a different architectural approach: Step Functions + ECS containers rather than SNS/SQS + Lambda handlers.

This reference implementation documents how documentexporter works and where it aligns with or diverges from the standard architecture.

Pattern Mapping¶

Pattern	Status	documentexporter Implementation
Hexagonal boundaries	Diverges	Flat `containers/shared/` module — no core/shell separation. Business logic and infrastructure (S3, API calls) are mixed within the same files
Exception hierarchy	Diverges	Uses container exit codes (20–50) instead of RecoverableException/PermanentFailureException/SilentSuccessException. Step Functions catches exit codes and routes to error handler
SNS events	Diverges	No SNS events. Export Lambda is invoked directly by the Nextpoint backend (Sidekiq). No event-driven communication
SQS handler	Not used	No SQS queues. Step Functions orchestrates ECS tasks directly
Checkpoint pipeline	Partial	Step Functions provides implicit checkpointing via state transitions (ExportVolumes → LoadfileGenerator → Zipper → Completer). Not a database-backed checkpoint state machine
Database sessions	Diverges	Uses SQLite manifest (downloaded from S3) instead of RDS session management. No writer/reader session pattern
Retry/resilience	Partial	Step Functions handles retries at the state level. No @retry_on_db_conflict (no MySQL writes). Nutrient API errors are caught and logged, allowing export to continue for remaining documents
Idempotency	Partial	Identity cache prevents duplicate filenames within an export. No checkpoint-based distributed lock
Multi-tenancy	Aligns	Per-case isolation via `case_{npcase_id}` S3 prefix and EFS directory structure
CDK infrastructure	Partial	Single CDK stack (not two-stack composition). Uses ECS Fargate + EFS instead of Lambda + SQS. Step Functions for orchestration
Config management	Diverges	Environment variables injected by Step Functions per-task (not CONFIG_MAP). Nutrient config via JSON file
Structured logging	Diverges	Custom timestamp-prefixed logger, not JSON structured logging
ORM models	Not used	No SQLAlchemy models. Reads from SQLite manifest created by the Nextpoint backend

Architecture¶

documentexporter/
├── lambda/
│   └── handler.ts                  # Export orchestrator — builds Step Functions state machine
├── updater_lambda/
│   └── handler.ts                  # Completion/error handler — updates Nextpoint API
├── cleanup_lambda/
│   └── handler.ts                  # Scheduled: deletes state machines older than 14 days
├── containers/
│   ├── shared/                     # Shared Python library (18 modules, flat structure)
│   │   ├── constants.py            # Export enums (NativeInclusion, ImageFormats)
│   │   ├── db.py                   # SQLite manifest handling
│   │   ├── export.py               # Export metadata model
│   │   ├── exhibit.py              # Document processing with multiprocessing pools
│   │   ├── identifier.py           # Document naming with identity dedup cache
│   │   ├── template.py             # Export template configuration
│   │   ├── nutrient.py             # Document engine client (batch page download)
│   │   ├── loadfile.py             # Loadfile generators (LOG, OPT, LFP, DII)
│   │   ├── s3.py                   # S3 operations with pagination
│   │   ├── nextpoint_api.py        # Nextpoint API client (XML, HMAC SHA1 auth)
│   │   ├── document_markup.py      # Annotation handling
│   │   ├── custom_pages.py         # Cover/summary page generation
│   │   ├── native_placeholder.py   # Placeholder guard rails for unsupported native files
│   │   ├── expansive_pdf.py        # Multi-volume PDF (1000-page split threshold)
│   │   └── ...
│   ├── exporter/main.py            # Per-volume document processor
│   ├── loadfile-generator/main.py  # Loadfile assembly orchestrator
│   ├── zipper/main.py              # 7zip compression (20GB volume splits)
│   ├── tests/                      # Python container tests (pytest)
│   │   ├── conftest.py             # Test fixtures
│   │   └── shared/                 # Unit tests for shared modules
│   └── base/                       # Pre-built Docker base images
├── lib/
│   ├── nge-export-service-stack.ts # CDK infrastructure (ECS, EFS, Step Functions)
│   └── devConfig.ts                # Local bridge development configuration
└── test/                           # Jest tests (TypeScript Lambda tests only)

Orchestration: Step Functions + ECS¶

The export Lambda dynamically builds a Step Functions state machine definition at runtime based on the export type, then starts an execution:

Export Lambda (invoked by Nextpoint backend)
  │
  ├─ Upload volumes_input.json to S3
  ├─ Create Step Functions state machine
  └─ Start execution
      │
      ├─ Map State (max 20 concurrent volumes)
      │   └─ ExportOneVolume (ECS Task)
      │       ├─ Exporter container (processes documents)
      │       ├─ Nutrient sidecar (renders PDFs/images, health-checked)
      │       └─ Redirection sidecar (caches asset URLs)
      │
      ├─ LoadfileGenerator (ECS Task)
      │   └─ Generates LOG/OPT/LFP/DII files
      │
      ├─ Zipper (ECS Task)
      │   └─ 7zip compress with 20GB splits
      │
      ├─ Completer (Updater Lambda)
      │   └─ Updates Nextpoint API, aggregates page counts
      │
      └─ ErrorHandler (Updater Lambda with error status)
          └─ Maps exit codes to user-friendly messages

Two export modes:

- Normal: full pipeline (volumes → loadfiles → zip → complete)
- Metadata-only: skips volume processing and goes directly to loadfile generation
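The mode-dependent definition can be sketched as a function that returns an Amazon States Language document. This is a minimal illustration in Python (the real orchestrator is a TypeScript Lambda); the state names come from the diagram above, while `build_definition`, the `$.volumes` items path, and the omitted task parameters are assumptions.

```python
# Service integration ARNs for ECS tasks and Lambda invocations.
ECS_RUN_TASK = "arn:aws:states:::ecs:runTask.sync"
LAMBDA_INVOKE = "arn:aws:states:::lambda:invoke"
# Any failure is routed to the ErrorHandler state.
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "ErrorHandler"}]
MAX_CONCURRENT_VOLUMES = 20


def ecs_task(next_state: str) -> dict:
    """ECS task state skeleton; task Parameters are omitted for brevity."""
    return {
        "Type": "Task",
        "Resource": ECS_RUN_TASK,
        "Catch": CATCH_ALL,
        "Next": next_state,
    }


def build_definition(metadata_only: bool) -> dict:
    """Build a state machine definition for the given export mode."""
    states = {
        "LoadfileGenerator": ecs_task("Zipper"),
        "Zipper": ecs_task("Completer"),
        "Completer": {"Type": "Task", "Resource": LAMBDA_INVOKE, "End": True},
        "ErrorHandler": {"Type": "Task", "Resource": LAMBDA_INVOKE, "End": True},
    }
    if metadata_only:
        # Metadata-only exports skip volume processing entirely.
        return {"StartAt": "LoadfileGenerator", "States": states}

    states["ExportVolumes"] = {
        "Type": "Map",
        "MaxConcurrency": MAX_CONCURRENT_VOLUMES,
        "ItemsPath": "$.volumes",  # hypothetical input path
        "ItemProcessor": {
            "StartAt": "ExportOneVolume",
            "States": {
                "ExportOneVolume": {
                    "Type": "Task",
                    "Resource": ECS_RUN_TASK,
                    "End": True,
                }
            },
        },
        "Catch": CATCH_ALL,
        "Next": "LoadfileGenerator",
    }
    return {"StartAt": "ExportVolumes", "States": states}
```

Serializing the result with `json.dumps` would give the definition string passed when creating the state machine.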

Key Design Decisions¶

Step Functions + ECS Instead of Lambda + SQS¶

Document export is compute-intensive (image conversion, OCR, PDF rendering) and requires large working storage (EFS mounts). ECS Fargate tasks get:

- 4 vCPU, 8–24 GB RAM per task
- 200GB ephemeral storage for exporter
- EFS mounts for cross-container file sharing
- Sidecar containers (Nutrient document engine + redirection service)

Lambda's 15-minute execution limit and 10 GB cap on ephemeral storage would be insufficient for this workload.

Exit Code Error Propagation¶

Container exit codes map to user-facing error messages:

Exit Code	Meaning
20	Document not found in Nutrient API
21	Nutrient API error (4xx/5xx)
22	Network error calling Nutrient API
23	Outdated/corrupted data from Nutrient API
30–32	Zipper errors (compression, S3 upload)
40–47	Loadfile generation errors
50	Nextpoint API error
1	Generic export failure

The Updater Lambda reads exit codes from Step Functions task output and translates them to user-friendly messages.
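The table above amounts to a lookup with two ranges. The following sketch shows one way to express it; it is written in Python for illustration only (the Updater Lambda itself is TypeScript), and `describe_exit_code` is a hypothetical name.

```python
# Exit codes with a single, specific meaning (taken from the table above).
EXACT_MESSAGES = {
    1: "Generic export failure",
    20: "Document not found in Nutrient API",
    21: "Nutrient API error (4xx/5xx)",
    22: "Network error calling Nutrient API",
    23: "Outdated/corrupted data from Nutrient API",
    50: "Nextpoint API error",
}


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a user-facing message."""
    if code in EXACT_MESSAGES:
        return EXACT_MESSAGES[code]
    if 30 <= code <= 32:
        return "Zipper error (compression, S3 upload)"
    if 40 <= code <= 47:
        return "Loadfile generation error"
    # Codes outside the documented ranges fall back to a generic message.
    return f"Unknown export error (exit code {code})"
```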

SQLite Manifest Instead of RDS¶

Export input data arrives as a SQLite database, zipped by the Nextpoint backend, that contains the exhibits_for_export and exhibits tables. This avoids per-case database connections during export: the manifest is self-contained and read-only.
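A minimal sketch of consuming such a manifest with the standard library follows. The helper names and the assumption that the archive holds a single SQLite file are illustrative; only the table names come from the text above. Opening the file with `mode=ro` makes the read-only contract explicit.

```python
import sqlite3
import zipfile
from pathlib import Path

MANIFEST_TABLES = {"exhibits_for_export", "exhibits"}


def open_manifest(zip_path: Path, workdir: Path) -> sqlite3.Connection:
    """Extract the zipped manifest and open it as a read-only database."""
    with zipfile.ZipFile(zip_path) as archive:
        # Assumption: the archive contains exactly one SQLite file.
        member = archive.namelist()[0]
        db_path = Path(archive.extract(member, path=workdir))
    # A file: URI with mode=ro rejects any write attempt.
    return sqlite3.connect(db_path.resolve().as_uri() + "?mode=ro", uri=True)


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in one of the known manifest tables."""
    if table not in MANIFEST_TABLES:
        raise ValueError(f"unexpected table: {table}")
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```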

Multiprocessing for CPU-Intensive Operations¶

The exporter container uses Python multiprocessing.Pool for image conversion (PNG → TIFF/JPEG) and OCR (tesserocr). Combined with document prerendering (fetching the next document while the current one is processed), this keeps the worker processes busy instead of waiting on downloads.
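The prerendering idea can be sketched as a one-item lookahead. The example below uses a background thread for the fetch, which is an assumption (the source does not say how exhibit.py overlaps the two stages); `fetch` and `process` are hypothetical callables standing in for the download and conversion steps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, Iterator, TypeVar

Doc = TypeVar("Doc")
Result = TypeVar("Result")


def process_with_prefetch(
    doc_ids: Iterable[int],
    fetch: Callable[[int], Doc],
    process: Callable[[Doc], Result],
) -> Iterator[Result]:
    """Process documents in order while fetching the next one in the background."""
    ids = list(doc_ids)
    if not ids:
        return
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, ids[0])
        for position in range(len(ids)):
            current = pending.result()  # wait for the prefetched document
            if position + 1 < len(ids):
                # Start downloading the next document before processing this one.
                pending = pool.submit(fetch, ids[position + 1])
            yield process(current)
```

For example, `list(process_with_prefetch([1, 2, 3], fetch, process))` returns the processed results in input order.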

Volume-Based Parallelism¶

Documents are pre-split into volumes by the backend. Step Functions Map state processes up to 20 volumes concurrently, each as a separate ECS task. This provides natural parallelism boundaries and fault isolation.

Storage Layout¶

S3 (coordination data, final outputs)¶

s3://{bucket}/case_{npcase_id}/exports/export_{export_id}/
├── volumes_input.json          # Step Functions input
├── volumes.json                # Updated by zipper with S3 paths
├── loadfile_data/
│   ├── metadata.json           # Metadata-only exports
│   └── {volume_position}.json  # Per-volume loadfile data
├── error_logs/{volume}.txt     # Per-volume error logs
├── page_counts/page_count_{volume}.txt
└── zips/{export_name}.zip[.001,.002,...]
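The key layout above can be captured in a few helpers so that every component builds paths the same way. The function names are hypothetical; the key patterns are taken from the tree.

```python
def export_prefix(npcase_id: int, export_id: int) -> str:
    """Per-case, per-export key prefix (no bucket, trailing slash included)."""
    return f"case_{npcase_id}/exports/export_{export_id}/"


def loadfile_data_key(npcase_id: int, export_id: int, volume_position: int) -> str:
    """Key of the per-volume loadfile data written for the loadfile generator."""
    return f"{export_prefix(npcase_id, export_id)}loadfile_data/{volume_position}.json"


def error_log_key(npcase_id: int, export_id: int, volume: int) -> str:
    """Key of the per-volume error log."""
    return f"{export_prefix(npcase_id, export_id)}error_logs/{volume}.txt"


def page_count_key(npcase_id: int, export_id: int, volume: int) -> str:
    """Key of the per-volume page count file."""
    return f"{export_prefix(npcase_id, export_id)}page_counts/page_count_{volume}.txt"
```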

EFS (working storage during export)¶

/mnt/efs/{case_id}/export_{export_id}/
├── {cleansed_name}/
│   ├── NATIVES/                # Native file downloads
│   ├── IMAGES/{volume}/        # Converted images per volume
│   └── TEXT/{volume}/          # OCR text per volume
└── tmp/zip/                    # Loadfile assembly scratch space

ECS Task Resources¶

Task	CPU	Memory	Storage	Sidecars
Exporter	4096	8192 MB	200GB ephemeral + EFS	Nutrient (16GB), Redirection (8GB)
Zipper	4096	24576 MB	EFS only	None
Loadfile Generator	4096	24576 MB	EFS only	None

Divergences from Standard Architecture¶

These are documented for visibility, not as criticism — documentexporter was built before the standard patterns were established and its compute requirements (large files, image processing, sidecars) differ from the Lambda-based ingestion pipeline.

1. No Hexagonal Boundaries¶

Business logic and infrastructure are mixed in containers/shared/. For example, exhibit.py contains both document processing logic and direct S3/Nutrient API calls. The standard pattern would separate these into core/ (pure logic) and shell/ (infrastructure adapters).

2. No Event-Driven Communication¶

The module is invoked directly by the Nextpoint backend and communicates completion via direct API calls (HMAC-authenticated XML). It doesn't participate in the event-driven architecture. Adding SNS events such as EXPORT_STARTED and EXPORT_COMPLETED would integrate it into the platform event stream.

3. No Exception-Based Flow Control¶

Instead of RecoverableException/PermanentFailureException controlling SQS message disposition, exit codes control Step Functions error routing. This works but couples error semantics to container process boundaries.

4. TypeScript Orchestration + Python Processing¶

The orchestration layer (Lambda handlers) is in TypeScript while processing is in Python. The standard pattern uses Python throughout. This creates a mixed-language codebase.

5. No Structured JSON Logging¶

Uses a custom timestamp-prefixed logger instead of the standard JSON formatter with context fields (caseId, batchId, jobId). This makes log aggregation and querying harder in CloudWatch.
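For comparison, a JSON formatter with the context fields named above can be built on the standard `logging` module. This is a sketch, not the platform's actual formatter; the `JsonFormatter` class name and the output key names other than caseId, batchId, and jobId are assumptions.

```python
import json
import logging

CONTEXT_FIELDS = ("caseId", "batchId", "jobId")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Context fields arrive through the `extra=` argument of logging calls.
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                payload[field] = getattr(record, field)
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `logger.info("volume exported", extra={"caseId": 42, "jobId": "abc"})` then produces one queryable JSON line per event.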

6. Python Testing Foundation (Early Stage)¶

Python container tests now exist under containers/tests/ with pytest infrastructure (pytest.ini, requirements-test.txt, conftest.py). Initial tests cover S3 operations and attachment handling. Coverage is still limited — most Python container code relies on end-to-end testing via the Nextpoint frontend. Jest tests cover TypeScript Lambda handlers.

Nutrient (PSPDFKit) Sidecar Architecture¶

The exporter task runs three containers in a single Fargate task with strict dependency ordering via health checks:

Container Start Order (dependency chain):
  1. Redirection Service starts first (port 80, health check every 5s)
  2. Nutrient DocEngine starts after Redirection is HEALTHY (port 5000, health check every 5s, 10 retries)
  3. Exporter starts after Nutrient is HEALTHY

Container	CPU	Memory (prod)	Memory (staging)	Port	Purpose
Exporter	4096	8192 MB	8192 MB	—	Document processing, image conversion, OCR
Nutrient DocEngine	4096	16384 MB	12288 MB	5000	PDF rendering, page image generation
Redirection Service	1024	8192 MB	8192 MB	80	S3 asset download proxy (Ruby)

Nutrient resources are environment-specific (nge-export-service-stack.ts:55-59): production gets higher allocation to handle large document sets.

Nutrient configuration (lib/nutrient-config.json) is a per-environment × per-region matrix controlling DocEngine behavior: PostgreSQL RDS, Redis cache, S3 asset storage, worker pool size, and activation keys.

Image Processing & OCR¶

Document Processing Logic (exhibit.py):

- The Nutrient page count is used as the authoritative page count when the DB verified page count exceeds it (prevents out-of-bounds page requests)
- Redaction annotations are applied conditionally: the apply_redactions Nutrient endpoint is only called when redactions are present
- Native placeholder documents have guard rails (native_placeholder.py) that short-circuit processing to a single-page placeholder image download
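The first two rules can be sketched as small pure functions. The names are hypothetical, and the behavior when the DB count is lower than the Nutrient count (keeping the DB value) is an assumption, since the source only describes the case where the DB count is higher.

```python
def effective_page_count(db_verified_pages: int, nutrient_pages: int) -> int:
    """Pick the page count to request, never exceeding what Nutrient reports."""
    if db_verified_pages > nutrient_pages:
        # Requesting pages beyond Nutrient's count would be out of bounds.
        return nutrient_pages
    # Assumption: otherwise the DB verified count is kept as-is.
    return db_verified_pages


def needs_redaction_call(redaction_count: int) -> bool:
    """Only call the redaction endpoint when the document has redactions."""
    return redaction_count > 0
```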

Image Conversion (multiprocessing):

- PNG → TIFF/JPEG via Python multiprocessing.Pool
- Color detection using NumPy array comparison (determines TIFF compression)
- Configurable pool size: POOL_WORKERS env var or CPU count
- Thread oversubscription prevention: OMP_NUM_THREADS=1
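The pool sizing and oversubscription settings above could look roughly like this. The function names are hypothetical, and the validation of POOL_WORKERS is an assumption about how invalid values are handled.

```python
import multiprocessing
import os
from typing import Callable, List, Sequence

# Limit native thread pools to one thread per worker process so that
# N processes do not each spawn N threads. Set before numeric libraries load.
os.environ.setdefault("OMP_NUM_THREADS", "1")


def pool_size() -> int:
    """Return POOL_WORKERS when it is a positive integer, else the CPU count."""
    raw = os.environ.get("POOL_WORKERS", "")
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return os.cpu_count() or 1


def convert_all(paths: Sequence[str], convert: Callable[[str], str]) -> List[str]:
    """Run `convert` over all paths in a process pool.

    `convert` must be a picklable, module-level function.
    """
    with multiprocessing.Pool(processes=pool_size()) as pool:
        return pool.map(convert, paths)
```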

OCR (Tesseract via tesserocr C++ API):

- OSD (Orientation/Script Detection) for auto-rotation
- Preprocessing pipeline: deskew + sharpen before recognition
- Batch processing with multiprocessing pools
- Lazy-loaded: OCR library only present in exporter base image

Expansive PDF Generation (pikepdf):

- Multi-volume support with 1000-page split threshold
- Outline/bookmark generation for document navigation
- Family relationship grouping (parent-child documents)
- Summary cover page injection
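The 1000-page threshold can be illustrated with a simple range splitter. This sketch splits purely by page number; whether expansive_pdf.py instead splits at document or family boundaries is not stated in the source, so treat the strategy and the function name as assumptions.

```python
from typing import List, Tuple

SPLIT_THRESHOLD = 1000


def split_page_ranges(
    total_pages: int, threshold: int = SPLIT_THRESHOLD
) -> List[Tuple[int, int]]:
    """Split 1-based pages into inclusive ranges of at most `threshold` pages."""
    ranges = []
    start = 1
    while start <= total_pages:
        end = min(start + threshold - 1, total_pages)
        ranges.append((start, end))
        start = end + 1
    return ranges
```

A 2500-page export would yield three output volumes covering pages 1–1000, 1001–2000, and 2001–2500.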

CI/CD Pipeline¶

Bitbucket → CodePipeline → CodeBuild:

- Branch push triggers CodePipeline via CodeStar Connections
- CodeBuild prebuild: ECR login, base image version check
- CodeBuild build: cdk deploy --all
- Conditional developer alias support for dev isolation

Base Image Versioning:

- Hash-based tags: 1.0-{file_hash}
- Three base images: exporter-base (Python 3.12 + Tesseract + Leptonica), loadfile-base (Python 3.12 + 7zip), zipper-base (Python 3.12)
- check-and-build-bases.sh compares hash → only rebuilds on change
- Stored per-environment in ECR: {env}-nge/exporter-base
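The hash-based tagging scheme can be sketched as follows. The real logic lives in check-and-build-bases.sh; the choice of SHA-256, the 12-character truncation, and which files feed the hash are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def base_image_tag(files: Iterable[Path], version: str = "1.0") -> str:
    """Derive a tag of the form {version}-{file_hash} from the build inputs."""
    digest = hashlib.sha256()
    for path in sorted(files):
        # Include the file name so renames also change the tag.
        digest.update(path.name.encode("utf-8"))
        digest.update(path.read_bytes())
    return f"{version}-{digest.hexdigest()[:12]}"


def needs_rebuild(files: Iterable[Path], existing_tags: Iterable[str]) -> bool:
    """Rebuild only when no image with the current tag exists yet."""
    return base_image_tag(files) not in set(existing_tags)
```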

Multi-Region Deployment¶

Region	Prefix	Nutrient DB	Redis	S3 Bucket
us-east-1	`c2`	`c2_{env}_nutrientdb`	region-specific ElastiCache	`trialmanager-{env}`
us-west-1	`c4`	`c4_{env}_nutrientdb`	region-specific ElastiCache	`c4-trialmanager-{env}`
ca-central-1	`c5`	`c5_{env}_nutrientdb`	region-specific ElastiCache	`c5-trialmanager-{env}`

Infrastructure parameters looked up via SSM: /nge/shared/{region_prefix}/vpc/vpcId, etc.
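The naming rules in the table can be expressed as a small lookup. The helper names are hypothetical; the prefixes, name patterns, and the SSM path come from the table and sentence above. Note that us-east-1 is the only region whose bucket name carries no prefix.

```python
REGION_PREFIXES = {
    "us-east-1": "c2",
    "us-west-1": "c4",
    "ca-central-1": "c5",
}


def nutrient_db_name(region: str, env: str) -> str:
    """Name of the Nutrient database for a region and environment."""
    return f"{REGION_PREFIXES[region]}_{env}_nutrientdb"


def bucket_name(region: str, env: str) -> str:
    """S3 bucket name; the c2 (us-east-1) bucket is unprefixed."""
    prefix = REGION_PREFIXES[region]
    if prefix == "c2":
        return f"trialmanager-{env}"
    return f"{prefix}-trialmanager-{env}"


def vpc_id_parameter(region: str) -> str:
    """SSM parameter path holding the shared VPC ID for a region."""
    return f"/nge/shared/{REGION_PREFIXES[region]}/vpc/vpcId"
```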

Cleanup Lambda¶

Scheduled Lambda that prevents state machine accumulation:

- Lists all Step Functions state machines by prefix
- Deletes machines older than configurable age (default 24 hours)
- Retry logic with exponential backoff for API throttling
- Non-fatal: logs errors but doesn't fail deployment
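The retry behavior can be sketched generically. The Cleanup Lambda is TypeScript, so this Python version only illustrates the technique; the function name, attempt count, and base delay are assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    operation: Callable[[], T],
    is_throttled: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `operation` on throttling errors, doubling the delay each time."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception as exc:
            # Re-raise non-throttling errors and the final failed attempt.
            if not is_throttled(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the function testable without real delays.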

Security¶

Authentication:

- Nextpoint API: HMAC-SHA1 signed requests (XML format)
- Nutrient DocEngine: JWT (RS256) + API auth token
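HMAC-SHA1 signing itself needs only the standard library. The sketch below signs arbitrary bytes; the canonical string that the Nextpoint API actually signs, and how the signature is transmitted, are not described in the source, so the function names and inputs are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, message: bytes) -> str:
    """Return the hex-encoded HMAC-SHA1 signature of `message`."""
    return hmac.new(secret, message, hashlib.sha1).hexdigest()


def verify(secret: bytes, message: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    return hmac.compare_digest(sign(secret, message), signature)
```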

Secrets Management:

- Nutrient RDS credentials: AWS Secrets Manager
- JWT public key: Secrets Manager
- Nextpoint API key: Secrets Manager (accessed by Updater Lambda)
- Production-only: ACTIVATION_KEY, SECRET_KEY_BASE from Secrets Manager

Network:

- ECS tasks in private subnets with NAT
- EFS transit encryption enabled
- EFS lifecycle: auto-delete files after 7 days
- Security groups: task SG → EFS SG (NFS port 2049)

Loadfile Formats¶

The module generates four standard litigation loadfile formats:

Format	Separator	Purpose
LOG	Tab	Alias, Volume, Path, DocStart, FolderBreak, BoxBreak, Pages
OPT	Tab	Same schema as LOG (alternate consumer format)
LFP	Semicolon	Image mapping: identity, boundary, position, volume path, rotation
DII	Custom	Document index: Bates/control number ranges + page image references
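As an illustration of the tab-separated LOG layout, the sketch below writes rows in the column order listed in the table. The sample values, the absence of a header row, and the `render_log_rows` name are assumptions; loadfile.py may differ in all three.

```python
import csv
import io
from typing import Dict, Iterable

LOG_FIELDS = ("Alias", "Volume", "Path", "DocStart", "FolderBreak", "BoxBreak", "Pages")


def render_log_rows(rows: Iterable[Dict[str, object]]) -> str:
    """Render rows as tab-separated lines in LOG column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Missing fields are written as empty columns.
        writer.writerow([row.get(field, "") for field in LOG_FIELDS])
    return buffer.getvalue()
```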

Lessons Learned¶

ECS is appropriate for compute-heavy export — Image conversion, OCR, and PDF rendering exceed Lambda's resource limits. ECS Fargate with EFS provides the necessary CPU, memory, and storage.
Step Functions provides implicit orchestration — The state machine handles retry, parallelism, and error routing without custom queue management. But it lacks the fine-grained message-level control of SQS.
SQLite manifests decouple from RDS — Exporting doesn't need live database connections. A self-contained manifest avoids connection pool pressure during large exports.
Volume-based parallelism has natural limits — Max 20 concurrent volumes balances throughput with ECS task limits and EFS IOPS.
Exit code error mapping is fragile — Adding new error types requires coordinating between Python containers and TypeScript Lambda. Exception hierarchies with typed errors are more maintainable.
Mixed TypeScript/Python increases maintenance burden — Developers need proficiency in both languages. A single-language approach reduces context switching.
