Reference Implementation: documentexporter

Overview

The documentexporter module exports documents from the Nextpoint eDiscovery platform into standard litigation production formats (TIFF, JPEG, PDF, native files) with loadfile generation (LOG, OPT, LFP, DII) and 7zip packaging.

Unlike documentloader (which was built with the architecture patterns in this repo), documentexporter predates those patterns and uses a different architectural approach: Step Functions + ECS containers rather than SNS/SQS + Lambda handlers.

This reference implementation documents how documentexporter works and where it aligns with or diverges from the standard architecture.

Pattern Mapping

| Pattern | Status | documentexporter Implementation |
| --- | --- | --- |
| Hexagonal boundaries | Diverges | Flat containers/shared/ module with no core/shell separation. Business logic and infrastructure (S3, API calls) are mixed within the same files |
| Exception hierarchy | Diverges | Uses container exit codes (20–50) instead of RecoverableException/PermanentFailureException/SilentSuccessException. Step Functions catches exit codes and routes to error handler |
| SNS events | Diverges | No SNS events. Export Lambda is invoked directly by the Nextpoint backend (Sidekiq). No event-driven communication |
| SQS handler | Not used | No SQS queues. Step Functions orchestrates ECS tasks directly |
| Checkpoint pipeline | Partial | Step Functions provides implicit checkpointing via state transitions (ExportVolumes → LoadfileGenerator → Zipper → Completer). Not a database-backed checkpoint state machine |
| Database sessions | Diverges | Uses SQLite manifest (downloaded from S3) instead of RDS session management. No writer/reader session pattern |
| Retry/resilience | Partial | Step Functions handles retries at the state level. No @retry_on_db_conflict (no MySQL writes). Nutrient API errors are caught and logged, allowing export to continue for remaining documents |
| Idempotency | Partial | Identity cache prevents duplicate filenames within an export. No checkpoint-based distributed lock |
| Multi-tenancy | Aligns | Per-case isolation via case_{npcase_id} S3 prefix and EFS directory structure |
| CDK infrastructure | Partial | Single CDK stack (not two-stack composition). Uses ECS Fargate + EFS instead of Lambda + SQS. Step Functions for orchestration |
| Config management | Diverges | Environment variables injected by Step Functions per-task (not CONFIG_MAP). Nutrient config via JSON file |
| Structured logging | Diverges | Custom timestamp-prefixed logger, not JSON structured logging |
| ORM models | Not used | No SQLAlchemy models. Reads from SQLite manifest created by the Nextpoint backend |

Architecture

documentexporter/
├── lambda/
│   └── handler.ts                  # Export orchestrator — builds Step Functions state machine
├── updater_lambda/
│   └── handler.ts                  # Completion/error handler — updates Nextpoint API
├── cleanup_lambda/
│   └── handler.ts                  # Scheduled: deletes state machines older than 14 days
├── containers/
│   ├── shared/                     # Shared Python library (18 modules, flat structure)
│   │   ├── constants.py            # Export enums (NativeInclusion, ImageFormats)
│   │   ├── db.py                   # SQLite manifest handling
│   │   ├── export.py               # Export metadata model
│   │   ├── exhibit.py              # Document processing with multiprocessing pools
│   │   ├── identifier.py           # Document naming with identity dedup cache
│   │   ├── template.py             # Export template configuration
│   │   ├── nutrient.py             # Document engine client (batch page download)
│   │   ├── loadfile.py             # Loadfile generators (LOG, OPT, LFP, DII)
│   │   ├── s3.py                   # S3 operations with pagination
│   │   ├── nextpoint_api.py        # Nextpoint API client (XML, HMAC SHA1 auth)
│   │   ├── document_markup.py      # Annotation handling
│   │   ├── custom_pages.py         # Cover/summary page generation
│   │   ├── native_placeholder.py   # Placeholder guard rails for unsupported native files
│   │   ├── expansive_pdf.py        # Multi-volume PDF (1000-page split threshold)
│   │   └── ...
│   ├── exporter/main.py            # Per-volume document processor
│   ├── loadfile-generator/main.py  # Loadfile assembly orchestrator
│   ├── zipper/main.py              # 7zip compression (20GB volume splits)
│   ├── tests/                      # Python container tests (pytest)
│   │   ├── conftest.py             # Test fixtures
│   │   └── shared/                 # Unit tests for shared modules
│   └── base/                       # Pre-built Docker base images
├── lib/
│   ├── nge-export-service-stack.ts # CDK infrastructure (ECS, EFS, Step Functions)
│   └── devConfig.ts                # Local bridge development configuration
└── test/                           # Jest tests (TypeScript Lambda tests only)

Orchestration: Step Functions + ECS

The export Lambda dynamically builds a Step Functions state machine definition at runtime based on the export type, then starts an execution:

Export Lambda (invoked by Nextpoint backend)
  ├─ Upload volumes_input.json to S3
  ├─ Create Step Functions state machine
  └─ Start execution
      ├─ Map State (max 20 concurrent volumes)
      │   └─ ExportOneVolume (ECS Task)
      │       ├─ Exporter container (processes documents)
      │       ├─ Nutrient sidecar (renders PDFs/images, health-checked)
      │       └─ Redirection sidecar (caches asset URLs)
      ├─ LoadfileGenerator (ECS Task)
      │   └─ Generates LOG/OPT/LFP/DII files
      ├─ Zipper (ECS Task)
      │   └─ 7zip compress with 20GB splits
      ├─ Completer (Updater Lambda)
      │   └─ Updates Nextpoint API, aggregates page counts
      └─ ErrorHandler (Updater Lambda with error status)
          └─ Maps exit codes to user-friendly messages

Two export modes:

- Normal: full pipeline (volumes → loadfiles → zip → complete)
- Metadata-only: skips volume processing and goes directly to loadfile generation
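The two modes can be illustrated with a small sketch of the state machine definition in Amazon States Language. This is an illustration only: the real orchestrator is the TypeScript export Lambda, and the `$.volumes` items path, the omitted task `Parameters`, and the choice of the `ecs:runTask.sync` and `lambda:invoke` integrations are assumptions rather than details taken from the module.

```python
from typing import Optional

MAX_CONCURRENT_VOLUMES = 20  # Map state limit described above

ECS_RUN_TASK = "arn:aws:states:::ecs:runTask.sync"  # assumed integration
LAMBDA_INVOKE = "arn:aws:states:::lambda:invoke"    # assumed integration
CATCH_ALL = [{"ErrorEquals": ["States.ALL"], "Next": "ErrorHandler"}]


def task_state(resource: str, next_state: Optional[str]) -> dict:
    """Build a Task state (Parameters omitted) that routes failures to ErrorHandler."""
    state = {"Type": "Task", "Resource": resource, "Catch": CATCH_ALL}
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition(metadata_only: bool) -> dict:
    """Return a definition for either the normal or the metadata-only mode."""
    states = {
        "LoadfileGenerator": task_state(ECS_RUN_TASK, "Zipper"),
        "Zipper": task_state(ECS_RUN_TASK, "Completer"),
        "Completer": task_state(LAMBDA_INVOKE, None),
        "ErrorHandler": {"Type": "Task", "Resource": LAMBDA_INVOKE, "End": True},
    }
    if not metadata_only:
        states["ExportVolumes"] = {
            "Type": "Map",
            "ItemsPath": "$.volumes",  # hypothetical input shape
            "MaxConcurrency": MAX_CONCURRENT_VOLUMES,
            "Iterator": {
                "StartAt": "ExportOneVolume",
                "States": {
                    "ExportOneVolume": {
                        "Type": "Task",
                        "Resource": ECS_RUN_TASK,
                        "End": True,
                    }
                },
            },
            "Catch": CATCH_ALL,
            "Next": "LoadfileGenerator",
        }
    start = "LoadfileGenerator" if metadata_only else "ExportVolumes"
    return {"StartAt": start, "States": states}
```

Metadata-only mode simply starts at LoadfileGenerator and never defines the Map state, which matches the "skips volume processing" behaviour described above.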

Key Design Decisions

Step Functions + ECS Instead of Lambda + SQS

Document export is compute-intensive (image conversion, OCR, PDF rendering) and requires large working storage (EFS mounts). ECS Fargate tasks get:

- 4 vCPU, 8–24 GB RAM per task
- 200GB ephemeral storage for exporter
- EFS mounts for cross-container file sharing
- Sidecar containers (Nutrient document engine + redirection service)

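Lambda's 15-minute timeout and its ephemeral storage cap (10 GB at most) would be insufficient for this workload, and Lambda has no equivalent of long-running sidecar containers.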

Exit Code Error Propagation

Container exit codes map to user-facing error messages:

| Exit Code | Meaning |
| --- | --- |
| 20 | Document not found in Nutrient API |
| 21 | Nutrient API error (4xx/5xx) |
| 22 | Network error calling Nutrient API |
| 23 | Outdated/corrupted data from Nutrient API |
| 30–32 | Zipper errors (compression, S3 upload) |
| 40–47 | Loadfile generation errors |
| 50 | Nextpoint API error |
| 1 | Generic export failure |

The Updater Lambda reads exit codes from Step Functions task output and translates them to user-friendly messages.
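As a rough illustration of this translation step, the table above can be expressed as a lookup. The real translation lives in the TypeScript Updater Lambda; the function name, the message wording for ranges, and the fallback to the generic message for unknown codes are assumptions made for this sketch.

```python
# Single codes copied from the exit code table above.
EXIT_CODE_MESSAGES = {
    1: "Generic export failure",
    20: "Document not found in Nutrient API",
    21: "Nutrient API error (4xx/5xx)",
    22: "Network error calling Nutrient API",
    23: "Outdated/corrupted data from Nutrient API",
    50: "Nextpoint API error",
}

# Code ranges from the table; the per-code meanings are not documented here.
EXIT_CODE_RANGES = [
    (range(30, 33), "Zipper error (compression, S3 upload)"),
    (range(40, 48), "Loadfile generation error"),
]


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a user-facing message."""
    if code in EXIT_CODE_MESSAGES:
        return EXIT_CODE_MESSAGES[code]
    for codes, message in EXIT_CODE_RANGES:
        if code in codes:
            return message
    # Assumption: anything unrecognised is reported as a generic failure.
    return EXIT_CODE_MESSAGES[1]
```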

SQLite Manifest Instead of RDS

Export input data arrives as a SQLite database, packaged as a zip file by the Nextpoint backend and downloaded from S3. It contains the exhibits_for_export and exhibits tables. This avoids per-case database connections during export: the manifest is self-contained and read-only.
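A minimal sketch of reading such a manifest, assuming it has already been downloaded and unzipped to a local path. The helper names are hypothetical and no column names are assumed, since the table schemas are not documented here. Opening the file with SQLite's `mode=ro` URI parameter makes the read-only contract explicit.

```python
import sqlite3
from pathlib import Path

MANIFEST_TABLES = {"exhibits_for_export", "exhibits"}


def open_manifest(path: Path) -> sqlite3.Connection:
    """Open the manifest read-only so accidental writes fail fast."""
    uri = f"{path.resolve().as_uri()}?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    conn.row_factory = sqlite3.Row  # rows can be accessed by column name
    return conn


def count_rows(conn: sqlite3.Connection, table: str) -> int:
    """Count rows in one of the known manifest tables."""
    if table not in MANIFEST_TABLES:
        raise ValueError(f"unexpected manifest table: {table}")
    # Safe to interpolate: the table name was checked against a fixed set.
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```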

Multiprocessing for CPU-Intensive Operations

The exporter container uses Python multiprocessing.Pool for image conversion (PNG → TIFF/JPEG) and OCR (tesserocr). Combined with document prerendering (fetching the next document while the current one is being processed), this maximizes throughput.
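The prerendering idea can be sketched as a one-item lookahead: while one document is being processed, the fetch for the next one is already in flight. How exhibit.py actually implements this is not described here, so the function below, its name, and the use of a background thread for the fetch are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

K = TypeVar("K")  # document identifier
D = TypeVar("D")  # fetched document
R = TypeVar("R")  # processing result

_DONE = object()  # sentinel marking an exhausted iterator


def process_with_prefetch(
    doc_ids: Iterable[K],
    fetch: Callable[[K], D],
    process: Callable[[D], R],
) -> List[R]:
    """Process documents in order while fetching the next one in the background."""
    results: List[R] = []
    ids = iter(doc_ids)
    first = next(ids, _DONE)
    if first is _DONE:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, first)
        for next_id in ids:
            current = pending.result()
            # Start fetching the next document before processing this one.
            pending = pool.submit(fetch, next_id)
            results.append(process(current))
        results.append(process(pending.result()))
    return results
```

A thread is enough for the fetch side because it is I/O-bound; the CPU-bound conversion and OCR work is what the process pool is for.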

Volume-Based Parallelism

Documents are pre-split into volumes by the backend. Step Functions Map state processes up to 20 volumes concurrently, each as a separate ECS task. This provides natural parallelism boundaries and fault isolation.

Storage Layout

S3 (coordination data, final outputs)

s3://{bucket}/case_{npcase_id}/exports/export_{export_id}/
├── volumes_input.json          # Step Functions input
├── volumes.json                # Updated by zipper with S3 paths
├── loadfile_data/
│   ├── metadata.json           # Metadata-only exports
│   └── {volume_position}.json  # Per-volume loadfile data
├── error_logs/{volume}.txt     # Per-volume error logs
├── page_counts/page_count_{volume}.txt
└── zips/{export_name}.zip[.001,.002,...]

EFS (working storage during export)

/mnt/efs/{case_id}/export_{export_id}/
├── {cleansed_name}/
│   ├── NATIVES/                # Native file downloads
│   ├── IMAGES/{volume}/        # Converted images per volume
│   └── TEXT/{volume}/          # OCR text per volume
└── tmp/zip/                    # Loadfile assembly scratch space
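The two layouts above can be captured in a few path helpers. The function names are hypothetical; only the path shapes come from the trees above.

```python
from pathlib import PurePosixPath

EFS_ROOT = PurePosixPath("/mnt/efs")


def s3_export_prefix(npcase_id: int, export_id: int) -> str:
    """Key prefix for all S3 objects belonging to one export."""
    return f"case_{npcase_id}/exports/export_{export_id}/"


def s3_loadfile_data_key(npcase_id: int, export_id: int, volume_position: int) -> str:
    """Key of the per-volume loadfile data JSON."""
    prefix = s3_export_prefix(npcase_id, export_id)
    return f"{prefix}loadfile_data/{volume_position}.json"


def efs_export_dir(case_id: int, export_id: int) -> PurePosixPath:
    """Working directory for one export on the shared EFS mount."""
    return EFS_ROOT / str(case_id) / f"export_{export_id}"


def efs_images_dir(case_id: int, export_id: int, cleansed_name: str, volume: str) -> PurePosixPath:
    """Directory holding converted images for one volume."""
    return efs_export_dir(case_id, export_id) / cleansed_name / "IMAGES" / volume
```

Keeping the case identifier at the top of every path is what gives the per-case isolation noted in the pattern mapping.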

ECS Task Resources

| Task | CPU | Memory | Storage | Sidecars |
| --- | --- | --- | --- | --- |
| Exporter | 4096 | 8192 MB | 200GB ephemeral + EFS | Nutrient (16GB), Redirection (8GB) |
| Zipper | 4096 | 24576 MB | EFS only | None |
| Loadfile Generator | 4096 | 24576 MB | EFS only | None |

Divergences from Standard Architecture

These are documented for visibility, not as criticism — documentexporter was built before the standard patterns were established and its compute requirements (large files, image processing, sidecars) differ from the Lambda-based ingestion pipeline.

1. No Hexagonal Boundaries

Business logic and infrastructure are mixed in containers/shared/. For example, exhibit.py contains both document processing logic and direct S3/Nutrient API calls. The standard pattern would separate these into core/ (pure logic) and shell/ (infrastructure adapters).

2. No SNS Event Communication

The module is invoked directly by the Nextpoint backend and communicates completion via direct API calls (HMAC-authenticated XML). It doesn't participate in the event-driven architecture. Adding SNS events such as EXPORT_STARTED and EXPORT_COMPLETED would integrate it into the platform event stream.

3. No Exception-Based Flow Control

Instead of RecoverableException/PermanentFailureException controlling SQS message disposition, exit codes control Step Functions error routing. This works but couples error semantics to container process boundaries.

4. TypeScript Orchestration + Python Processing

The orchestration layer (Lambda handlers) is in TypeScript while processing is in Python. The standard pattern uses Python throughout. This creates a mixed-language codebase.

5. No Structured JSON Logging

Uses a custom timestamp-prefixed logger instead of the standard JSON formatter with context fields (caseId, batchId, jobId). This makes log aggregation and querying harder in CloudWatch.

6. Python Testing Foundation (Early Stage)

Python container tests now exist under containers/tests/ with pytest infrastructure (pytest.ini, requirements-test.txt, conftest.py). Initial tests cover S3 operations and attachment handling. Coverage is still limited — most Python container code relies on end-to-end testing via the Nextpoint frontend. Jest tests cover TypeScript Lambda handlers.

Nutrient (PSPDFKit) Sidecar Architecture

The exporter task runs three containers in a single Fargate task with strict dependency ordering via health checks:

Container Start Order (dependency chain):
  1. Redirection Service starts first (port 80, health check every 5s)
  2. Nutrient DocEngine starts after Redirection is HEALTHY (port 5000, health check every 5s, 10 retries)
  3. Exporter starts after Nutrient is HEALTHY

| Container | CPU | Memory (prod) | Memory (staging) | Port | Purpose |
| --- | --- | --- | --- | --- | --- |
| Exporter | 4096 | 8192 MB | 8192 MB | | Document processing, image conversion, OCR |
| Nutrient DocEngine | 4096 | 16384 MB | 12288 MB | 5000 | PDF rendering, page image generation |
| Redirection Service | 1024 | 8192 MB | 8192 MB | 80 | S3 asset download proxy (Ruby) |
Nutrient resources are environment-specific (nge-export-service-stack.ts:55-59): production gets higher allocation to handle large document sets.

Nutrient configuration (lib/nutrient-config.json) is a per-environment × per-region matrix controlling DocEngine behavior: PostgreSQL RDS, Redis cache, S3 asset storage, worker pool size, and activation keys.

Image Processing & OCR

Document Processing Logic (exhibit.py):

- Nutrient page count is used as the authoritative page count when the DB verified page count exceeds it (prevents out-of-bounds page requests)
- Redaction annotations are applied conditionally: the apply_redactions Nutrient endpoint is only called when redactions are present
- Native placeholder documents have guard rails (native_placeholder.py) that short-circuit processing to a single-page placeholder image download
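The first two rules reduce to small guards, sketched below. The function names and the annotation shape (a dict with a `type` key equal to `"redaction"`) are hypothetical and do not reflect the actual Nutrient payload format.

```python
from typing import Iterable, Mapping


def effective_page_count(db_verified_pages: int, nutrient_pages: int) -> int:
    """Prefer the Nutrient count when the DB claims more pages than Nutrient has."""
    if db_verified_pages > nutrient_pages:
        return nutrient_pages
    return db_verified_pages


def page_indexes(db_verified_pages: int, nutrient_pages: int) -> range:
    """Zero-based page indexes that are safe to request from Nutrient."""
    return range(effective_page_count(db_verified_pages, nutrient_pages))


def should_apply_redactions(annotations: Iterable[Mapping[str, object]]) -> bool:
    """Only call the redaction endpoint when at least one redaction exists."""
    # Hypothetical annotation shape; the real payload format may differ.
    return any(a.get("type") == "redaction" for a in annotations)
```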

Image Conversion (multiprocessing):

- PNG → TIFF/JPEG via Python multiprocessing.Pool
- Color detection using NumPy array comparison (determines TIFF compression)
- Configurable pool size: POOL_WORKERS env var or CPU count
- Thread oversubscription prevention: OMP_NUM_THREADS=1
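The pool sizing and thread-limit settings can be sketched as follows. The helper names and the validation of POOL_WORKERS are assumptions; only the two environment variable names and the CPU-count fallback come from the list above.

```python
import os
from typing import Mapping, MutableMapping


def resolve_pool_workers(environ: Mapping[str, str] = os.environ) -> int:
    """Use POOL_WORKERS when set, otherwise fall back to the CPU count."""
    raw = environ.get("POOL_WORKERS", "").strip()
    if raw:
        workers = int(raw)
        if workers < 1:
            raise ValueError("POOL_WORKERS must be a positive integer")
        return workers
    return os.cpu_count() or 1


def limit_native_threads(environ: MutableMapping[str, str] = os.environ) -> None:
    """Pin OpenMP to one thread per process to avoid oversubscription.

    This has to run before importing libraries that read OMP_NUM_THREADS
    at load time, otherwise the setting may have no effect.
    """
    environ["OMP_NUM_THREADS"] = "1"
```

With one OpenMP thread per worker, the number of busy threads matches the pool size instead of multiplying pool size by the CPU count.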

OCR (Tesseract via tesserocr C++ API):

- OSD (Orientation/Script Detection) for auto-rotation
- Preprocessing pipeline: deskew + sharpen before recognition
- Batch processing with multiprocessing pools
- Lazy-loaded: OCR library only present in exporter base image

Expansive PDF Generation (pikepdf):

- Multi-volume support with 1000-page split threshold
- Outline/bookmark generation for document navigation
- Family relationship grouping (parent-child documents)
- Summary cover page injection

CI/CD Pipeline

Bitbucket → CodePipeline → CodeBuild:

  1. Branch push triggers CodePipeline via CodeStar Connections
  2. CodeBuild prebuild: ECR login, base image version check
  3. CodeBuild build: cdk deploy --all
  4. Conditional developer alias support for dev isolation

Base Image Versioning:

- Hash-based tags: 1.0-{file_hash}
- Three base images: exporter-base (Python 3.12 + Tesseract + Leptonica), loadfile-base (Python 3.12 + 7zip), zipper-base (Python 3.12)
- check-and-build-bases.sh compares hash → only rebuilds on change
- Stored per-environment in ECR: {env}-nge/exporter-base
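The hash-based tagging scheme can be illustrated in a few lines. The actual logic lives in check-and-build-bases.sh; which files are hashed, the hash algorithm (SHA-256 here), and the truncated digest length are all assumptions for this sketch.

```python
import hashlib
from typing import Iterable


def base_image_tag(file_contents: Iterable[bytes], version: str = "1.0", length: int = 12) -> str:
    """Derive a tag of the form {version}-{file_hash} from the build inputs."""
    digest = hashlib.sha256()
    for content in file_contents:
        digest.update(content)
    return f"{version}-{digest.hexdigest()[:length]}"


def needs_rebuild(tag: str, existing_tags: Iterable[str]) -> bool:
    """Rebuild only when no image with this tag exists in the repository."""
    return tag not in set(existing_tags)
```

Because the tag is a pure function of the inputs, an unchanged Dockerfile always maps to a tag that already exists, and the rebuild is skipped.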

Multi-Region Deployment

| Region | Prefix | Nutrient DB | Redis | S3 Bucket |
| --- | --- | --- | --- | --- |
| us-east-1 | c2 | c2_{env}_nutrientdb | region-specific ElastiCache | trialmanager-{env} |
| us-west-1 | c4 | c4_{env}_nutrientdb | region-specific ElastiCache | c4-trialmanager-{env} |
| ca-central-1 | c5 | c5_{env}_nutrientdb | region-specific ElastiCache | c5-trialmanager-{env} |

Infrastructure parameters looked up via SSM: /nge/shared/{region_prefix}/vpc/vpcId, etc.
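The naming rules in the table can be sketched as lookups keyed by region. The function names are hypothetical; note that, per the table, the us-east-1 bucket carries no region prefix while the other two do.

```python
REGION_PREFIXES = {
    "us-east-1": "c2",
    "us-west-1": "c4",
    "ca-central-1": "c5",
}


def nutrient_db_name(region: str, env: str) -> str:
    """Name of the Nutrient database for a region and environment."""
    return f"{REGION_PREFIXES[region]}_{env}_nutrientdb"


def s3_bucket_name(region: str, env: str) -> str:
    """Bucket name; us-east-1 is the only region without a prefix."""
    if region == "us-east-1":
        return f"trialmanager-{env}"
    return f"{REGION_PREFIXES[region]}-trialmanager-{env}"


def vpc_id_parameter(region: str) -> str:
    """SSM parameter path holding the VPC id for a region."""
    return f"/nge/shared/{REGION_PREFIXES[region]}/vpc/vpcId"
```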

Cleanup Lambda

Scheduled Lambda that prevents state machine accumulation:

- Lists all Step Functions state machines by prefix
- Deletes machines older than configurable age (default 24 hours)
- Retry logic with exponential backoff for API throttling
- Non-fatal: logs errors but doesn't fail deployment
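The selection and retry logic can be sketched without any AWS calls. The real handler is TypeScript; the function names, retry parameters, and throttle predicate here are assumptions. The dictionaries mirror the `name`, `stateMachineArn`, and `creationDate` fields returned by the Step Functions ListStateMachines API.

```python
import time
from datetime import datetime, timedelta
from typing import Callable, List, TypeVar

T = TypeVar("T")


def select_expired(
    machines: List[dict],
    prefix: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=24),
) -> List[str]:
    """Return ARNs of prefixed state machines created before the cutoff."""
    cutoff = now - max_age
    return [
        m["stateMachineArn"]
        for m in machines
        if m["name"].startswith(prefix) and m["creationDate"] < cutoff
    ]


def call_with_backoff(
    fn: Callable[[], T],
    is_throttle: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on throttling errors, doubling the delay after each attempt."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            if not is_throttle(exc) or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

Injecting `sleep` keeps the backoff schedule testable without waiting in real time.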

Security

Authentication:

- Nextpoint API: HMAC-SHA1 signed requests (XML format)
- Nutrient DocEngine: JWT (RS256) + API auth token
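For readers unfamiliar with HMAC request signing, the general technique looks like the sketch below. The canonical string (method, path, and body joined by newlines) and the hex encoding are assumptions chosen for illustration; the format the Nextpoint API actually expects is not documented here.

```python
import hashlib
import hmac


def sign_request(secret: bytes, method: str, path: str, body: bytes) -> str:
    """Compute a hex HMAC-SHA1 signature over a canonical request string."""
    # Hypothetical canonical form: METHOD, path, and body joined by newlines.
    message = b"\n".join([method.upper().encode(), path.encode(), body])
    return hmac.new(secret, message, hashlib.sha1).hexdigest()


def verify_signature(secret: bytes, method: str, path: str, body: bytes, signature: str) -> bool:
    """Check a signature using a constant-time comparison."""
    expected = sign_request(secret, method, path, body)
    return hmac.compare_digest(expected, signature)
```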

Secrets Management:

- Nutrient RDS credentials: AWS Secrets Manager
- JWT public key: Secrets Manager
- Nextpoint API key: Secrets Manager (accessed by Updater Lambda)
- Production-only: ACTIVATION_KEY, SECRET_KEY_BASE from Secrets Manager

Network:

- ECS tasks in private subnets with NAT
- EFS transit encryption enabled
- EFS lifecycle: auto-delete files after 7 days
- Security groups: task SG → EFS SG (NFS port 2049)

Loadfile Formats

The module generates four standard litigation loadfile formats:

| Format | Separator | Purpose |
| --- | --- | --- |
| LOG | Tab | Alias, Volume, Path, DocStart, FolderBreak, BoxBreak, Pages |
| OPT | Tab | Same schema as LOG (alternate consumer format) |
| LFP | Semicolon | Image mapping: identity, boundary, position, volume path, rotation |
| DII | Custom | Document index: Bates/control number ranges + page image references |
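As an illustration of the tab-separated LOG/OPT schema, the sketch below writes rows with the seven fields from the table. The function name, the sample values, the absence of a header row, and the `\n` line terminator are assumptions; loadfile.py may differ on all of them.

```python
import csv
import io
from typing import Iterable, Mapping, TextIO

LOG_FIELDS = ["Alias", "Volume", "Path", "DocStart", "FolderBreak", "BoxBreak", "Pages"]


def write_log_rows(rows: Iterable[Mapping[str, object]], out: TextIO) -> None:
    """Write one tab-separated line per page image, in LOG_FIELDS order."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for row in rows:
        # Missing fields are written as empty columns.
        writer.writerow([row.get(field, "") for field in LOG_FIELDS])


# Usage with a hypothetical first page of a three-page document.
buffer = io.StringIO()
write_log_rows(
    [{"Alias": "NP0001", "Volume": "VOL001", "Path": "IMAGES/VOL001/NP0001.tif",
      "DocStart": "Y", "Pages": 3}],
    buffer,
)
```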

Lessons Learned

  1. ECS is appropriate for compute-heavy export — Image conversion, OCR, and PDF rendering exceed Lambda's resource limits. ECS Fargate with EFS provides the necessary CPU, memory, and storage.

  2. Step Functions provides implicit orchestration — The state machine handles retry, parallelism, and error routing without custom queue management. But it lacks the fine-grained message-level control of SQS.

  3. SQLite manifests decouple from RDS — Exporting doesn't need live database connections. A self-contained manifest avoids connection pool pressure during large exports.

  4. Volume-based parallelism has natural limits — Max 20 concurrent volumes balances throughput with ECS task limits and EFS IOPS.

  5. Exit code error mapping is fragile — Adding new error types requires coordinating between Python containers and TypeScript Lambda. Exception hierarchies with typed errors are more maintainable.

  6. Mixed TypeScript/Python increases maintenance burden — Developers need proficiency in both languages. A single-language approach reduces context switching.
