
Reference Implementation: documentpageservice

PDF page manipulation service for post-processing document exhibits in the Nextpoint eDiscovery platform. Handles page reordering, rotation, addition, removal, and document splitting via a Java ECS Fargate task triggered by API Gateway.

Architecture Overview

documentpageservice is a request-driven service, not an event-driven SNS/SQS module. It processes one PDF manipulation job per ECS task invocation, triggered by API Gateway → Lambda → ECS RunTask. The API call itself is synchronous, but it returns as soon as the task is launched; the job then runs to completion in its own Fargate task. This is fundamentally different from the batch-processing, event-driven architecture used by documentloader, documentextractor, and documentuploader.

Why ECS Instead of Lambda?

| Concern | Lambda | ECS (chosen) |
|---|---|---|
| PDF processing libraries | Limited (no native libs) | Full Apache PDFBox + Hyland extractors |
| File system | 512 MB /tmp by default (configurable up to 10 GB) | Configurable ephemeral storage |
| Startup time | Cold start with large JARs | Container image ships with extractors preinstalled |
| Native libraries | Difficult (Lambda layers) | Native Hyland libs via LD_LIBRARY_PATH |
| Execution time | 15 min max | No timeout limit |

Architecture Tree

documentpageservice/
├── DocumentPageService/                      # Java ECS task
│   ├── src/main/java/com/nextpoint/
│   │   ├── JobRouter.java                   # Entry point — routes to job handler
│   │   ├── AddPageJob.java                  # Add pages (convert + merge + OCR)
│   │   ├── RemovePageJob.java               # Remove all pages from exhibit
│   │   ├── pdfAlterationService.java        # PDF manipulation (PDFBox)
│   │   ├── s3Service.java                   # S3 download/upload operations
│   │   ├── NextpointAPI.java                # HTTP client (XML, HMAC-SHA1)
│   │   ├── Environment.java                 # Environment config enum
│   │   └── pluginhost/
│   │       ├── TaskDocument.java            # Hyland extraction context
│   │       └── TaskExtractionContext.java   # Hyland task wrapper
│   ├── Dockerfile                           # Multi-stage: gradle build → JRE runtime
│   ├── build.gradle.kts
│   └── extractor-zips/                      # Hyland native binaries (ARM64 Linux)
├── infrastructure/                          # AWS CDK (TypeScript)
│   ├── lib/
│   │   ├── api-stack.ts                    # API Gateway + Lambda handler
│   │   ├── ecs-stack.ts                    # ECS cluster + Fargate task definition
│   │   └── shared-resources-stack.ts       # VPC endpoints + security groups
│   ├── lambda/handlers/handler.ts          # Lambda: parse request → ECS RunTask
│   └── config/
│       └── index.ts                        # Environment × region config
└── test/

Language Stack

| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Page Service | Java | ECS Fargate (JRE) | PDF manipulation via Apache PDFBox |
| API Handler | TypeScript | Lambda (Node.js 22) | Parse request, launch ECS task |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |

Request Flow

Nextpoint Backend
  ↓
API Gateway (POST /pageService/jobRouter)
  ↓
Lambda Handler (handler.ts)
  ├── Validate job type against whitelist
  ├── Build ECS RunTask overrides (env vars from request body)
  └── ecs.runTask() → returns taskArn
  ↓
ECS Fargate Task (Java)
  ├── JobRouter.main() reads environment variables
  ├── Routes to job handler (addPage/removePages/editPdf/rotatePages/splitDocument)
  ├── Downloads PDF from S3
  ├── Manipulates PDF (PDFBox)
  ├── Uploads modified PDF to S3
  ├── Updates Nextpoint API (attachments, exhibits, bates numbers)
  └── Sets processing_in_nge=false via API
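
The Lambda step in this flow can be sketched as a pure function that checks the job type and builds the RunTask overrides. This is a minimal sketch, not the actual handler.ts: the container name `page-service`, the `jobType` body field, and the env var naming (`JOB_TYPE`, upper-snake-cased body keys) are assumptions. Only the five job names come from this document. The returned object follows the shape of the `overrides.containerOverrides[].environment` field of ECS RunTask.

```typescript
// Job names come from the "Five Job Types" section; everything else is illustrative.
const ALLOWED_JOBS = ["addPage", "removePages", "editPdf", "rotatePages", "splitDocument"] as const;
type JobType = (typeof ALLOWED_JOBS)[number];

interface EnvVar { name: string; value: string; }
interface ContainerOverride { name: string; environment: EnvVar[]; }
interface RunTaskOverrides { containerOverrides: ContainerOverride[]; }

// Hypothetical container name; the real task definition may use a different one.
const CONTAINER_NAME = "page-service";

function isAllowedJob(job: unknown): job is JobType {
  return typeof job === "string" && (ALLOWED_JOBS as readonly string[]).includes(job);
}

// camelCase -> UPPER_SNAKE_CASE, e.g. "caseId" -> "CASE_ID" (assumed naming scheme)
function toEnvName(key: string): string {
  return key.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase();
}

function buildOverrides(body: Record<string, unknown>): RunTaskOverrides {
  const job = body["jobType"];
  if (!isAllowedJob(job)) {
    throw new Error(`Unsupported job type: ${String(job)}`);
  }
  const environment: EnvVar[] = [{ name: "JOB_TYPE", value: job }];
  for (const [key, value] of Object.entries(body)) {
    if (key === "jobType" || value === undefined || value === null) continue;
    // ECS environment values must be strings, so arrays and numbers are JSON-encoded.
    const text = typeof value === "string" ? value : JSON.stringify(value);
    environment.push({ name: toEnvName(key), value: text });
  }
  return { containerOverrides: [{ name: CONTAINER_NAME, environment }] };
}
```

Rejecting unknown job types before calling RunTask keeps malformed requests from paying for a Fargate task launch.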

Five Job Types

| Job | Operation | PDF Library | Hyland Extractors / Notes |
|---|---|---|---|
| addPage | Merge attachment into exhibit PDF | PDFBox merge | RENDER_PDF (convert non-PDF), EXTRACT_TEXT (OCR) |
| removePages | Delete all pages from exhibit | PDFBox page removal | |
| editPdf | Reorder pages per new order array | PDFBox page extraction + rebuild | |
| rotatePages | Rotate pages by degree values | PDFBox rotation | Optional OCR on rotated pages |
| splitDocument | Split exhibit at page boundaries | PDFBox splitter | Creates new exhibits + bates ranges |
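
The editPdf reorder can be illustrated independently of PDFBox. The sketch below assumes the new order array holds 1-based original page numbers and must be a full permutation; the document does not say whether the real job also allows dropping or duplicating pages. The function names are hypothetical.

```typescript
// Assumed contract: `order` is a permutation of 1..pageCount.
function validatePageOrder(order: number[], pageCount: number): void {
  if (order.length !== pageCount) {
    throw new Error(`Expected ${pageCount} entries, got ${order.length}`);
  }
  const seen = new Set<number>();
  for (const page of order) {
    if (!Number.isInteger(page) || page < 1 || page > pageCount) {
      throw new Error(`Page number out of range: ${page}`);
    }
    if (seen.has(page)) {
      throw new Error(`Duplicate page number: ${page}`);
    }
    seen.add(page);
  }
}

// Returns the pages in the requested order without mutating the input.
function reorderPages<T>(pages: T[], order: number[]): T[] {
  validatePageOrder(order, pages.length);
  return order.map((page) => pages[page - 1]);
}
```

Validating the array before touching the PDF means a bad request fails without leaving a partially rebuilt document.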

addPage Detail

The most complex job — converts non-PDF files, merges, and OCRs:

1. Download source attachment from S3
2. Detect MIME type (Apache Tika)
3. If non-PDF: convert via Hyland RENDER_PDF extractor
4. Split into individual pages
5. For each page:
   a. Extract text (Hyland EXTRACT_TEXT / OCR)
   b. Create attachment record via Nextpoint API
   c. Upload page PDF to S3
6. Merge all pages into exhibit PDF
7. Upload merged PDF to S3
8. Update exhibit metadata via API

Search text limit: 8 MB per field (MAX_ALLOWED_SEARCH_TEXT).
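
One way to enforce such a limit is to truncate on a UTF-8 character boundary. This is a sketch under assumptions: that the limit counts UTF-8 bytes, that 8 MB means 8 × 1024 × 1024, and that over-limit text is truncated rather than rejected. None of these is stated here, and `truncateUtf8` is a hypothetical helper.

```typescript
// Assumed interpretation: 8 MB of UTF-8 bytes.
const MAX_ALLOWED_SEARCH_TEXT = 8 * 1024 * 1024;

function truncateUtf8(text: string, maxBytes: number = MAX_ALLOWED_SEARCH_TEXT): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  // bytes[end] is the first byte that would be dropped. If it is a continuation
  // byte (10xxxxxx), the cut would split a character, so back up to its lead byte.
  let end = maxBytes;
  while (end > 0 && (bytes[end] & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Cutting on a character boundary avoids sending a malformed trailing sequence to the API.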

Pattern Mapping

| Architecture Pattern | documentpageservice Implementation | Notes |
|---|---|---|
| SNS Event Publishing | Not used | Synchronous API-driven, no events |
| SQS Handler | Not used | No queue processing |
| Exception Hierarchy | Not used | Generic Java exceptions; terminal failures |
| Hexagonal core/shell | Not followed | Mixed business logic + infrastructure in service classes |
| Multi-tenancy | Via API | case_id passed per request to Nextpoint API |
| CDK Infrastructure | Three stacks | SharedResources → ECS → API (dependency chain) |
| Idempotent Handlers | Not implemented | No dedup checks; ECS tasks are fire-and-forget |
| Config Management | YAML + env vars | nextpoint_api.yaml for API config; env vars for S3/job params |
| Structured Logging | Not followed | Custom NpLoggerImplementation, not JSON structured |

Hyland Document Extractor Integration

The module reuses Hyland document extraction plugins from documentextractor:

ECS Container (/app/extractors/)
  ├── hyland/                    # Native libraries (ARM64 Linux)
  │   ├── linux-aarch64-gcc-64/ # .so shared objects
  │   └── ...
  └── plugins/                   # Extraction task handlers
      ├── RENDER_PDF             # Convert any file format → PDF
      └── EXTRACT_TEXT           # OCR / text extraction from PDF pages

Plugin loading: TaskHandlerLoader discovers plugins at startup via the EXTRACTOR_PLUGINS_PATH environment variable. Native libraries are loaded by setting LD_LIBRARY_PATH=/app/extractors/hyland and passing -Djava.library.path=... in JAVA_OPTS.

Extraction context: TaskDocument and TaskExtractionContext implement the Hyland extraction framework interfaces, providing a StoredDocument and ExtractionContext for the plugin dispatch system.

Nextpoint API Client

NextpointAPI.java communicates with the Nextpoint backend via HTTP:

| Method | Endpoint | Purpose |
|---|---|---|
| getExhibit() | GET /documents/get/{id} | Fetch exhibit metadata |
| getAttachment() | GET /attachments/get/{id} | Fetch attachment metadata |
| findBatch() | GET /case/{caseId}/documents/... | Find batch for case |
| createAttachment() | POST /case/{caseId}/documents/... | Create page attachment |
| updateExhibit() | PUT /case/{caseId}/documents/... | Update exhibit after modification |
| sendProcessingErrorEmail() | POST /... | Error notification (prod only) |

Authentication: HMAC-SHA1 signed requests with date header and secret key.
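
The general shape of HMAC-SHA1 request signing is sketched below using Node's built-in crypto module. The canonical string (`METHOD`, path, and date joined by newlines) and the base64 encoding are assumptions for illustration only; the format that NextpointAPI.java actually signs is not documented here, and `hmacSha1Hex` and `signRequest` are hypothetical names.

```typescript
import { createHmac } from "node:crypto";

// Plain HMAC-SHA1, hex-encoded.
function hmacSha1Hex(secretKey: string, message: string): string {
  return createHmac("sha1", secretKey).update(message, "utf8").digest("hex");
}

// ASSUMPTION: the string to sign is "<METHOD>\n<path>\n<date>".
// The real client may include other fields or use a different layout.
function signRequest(method: string, path: string, date: string, secretKey: string): string {
  const stringToSign = [method.toUpperCase(), path, date].join("\n");
  return createHmac("sha1", secretKey).update(stringToSign, "utf8").digest("base64");
}
```

Including the date in the signed string lets the server reject stale or replayed requests, which is the usual reason such schemes send a date header.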

Response format: XML (parsed to JsonObject via XMLtoJSON).

S3 File Layout

s3://{bucket}/data-files/{md5_hash}/{case_id}/{batch_id}/{extraction_doc_id}/
  ├── pdf/
  │   ├── {file_name}                    # Original exhibit PDF
  │   └── {file_name}_page_{n}.pdf       # Individual page PDFs
  └── native/
      └── {original_file}                # Native file (pre-conversion)
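
The layout above maps to a few key-building helpers. A minimal sketch: the `ExhibitLocation` type, the field types, and the function names are hypothetical, and `fileName` is passed through unchanged because the layout does not say whether {file_name} includes an extension.

```typescript
// Hypothetical grouping of the path components used in the layout.
interface ExhibitLocation {
  md5Hash: string;
  caseId: number;
  batchId: number;
  extractionDocId: number;
}

function documentPrefix(loc: ExhibitLocation): string {
  return `data-files/${loc.md5Hash}/${loc.caseId}/${loc.batchId}/${loc.extractionDocId}/`;
}

// Key of the exhibit PDF itself.
function exhibitPdfKey(loc: ExhibitLocation, fileName: string): string {
  return `${documentPrefix(loc)}pdf/${fileName}`;
}

// Key of a single-page PDF, following the {file_name}_page_{n}.pdf pattern.
function pagePdfKey(loc: ExhibitLocation, fileName: string, page: number): string {
  return `${documentPrefix(loc)}pdf/${fileName}_page_${page}.pdf`;
}

// Key of the native (pre-conversion) file.
function nativeKey(loc: ExhibitLocation, originalFile: string): string {
  return `${documentPrefix(loc)}native/${originalFile}`;
}
```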

CDK Infrastructure

Three-Stack Deployment

Stack 1: SharedResourcesStack (pageService-shared)
  └── Security group, VPC endpoints (ECR, CloudWatch, S3)

Stack 2: EcsStack (pageService-ecs)
  ├── ECS Fargate cluster (container insights)
  ├── Task definition: 256 CPU / 512 MB memory
  ├── Docker image built from DocumentPageService/Dockerfile
  └── IAM: task role (S3FullAccess), execution role

Stack 3: ApiStack (pageService-api)
  ├── Lambda function (Node.js 22, parses request → ECS RunTask)
  ├── API Gateway REST API (POST /pageService/jobRouter)
  └── IAM: Lambda role (ECS RunTask + CloudWatch Logs)

Dependency chain: SharedResources stack → ECS stack → API stack (the API stack reads the ECS task definition ARN and cluster ARN from SSM parameters).

ECS Task Resource Allocation

| Resource | Value | Notes |
|---|---|---|
| CPU | 256 (0.25 vCPU) | Minimal; PDF operations are I/O-bound |
| Memory | 512 MB | Tight for large PDFs; may need tuning |
| Ephemeral storage | Default (20 GB) | Adequate for single-document processing |

Divergences from Standard Architecture

| Standard Pattern | documentpageservice Divergence | Reason |
|---|---|---|
| Python 3.10+ | Java (ECS) + TypeScript (Lambda) | PDFBox is Java; Hyland extractors require JVM |
| Event-driven SNS/SQS | Synchronous API Gateway → Lambda → ECS | Single-document operations, not batch processing |
| Hexagonal core/shell | Monolithic Java service classes | Small scope; separation not warranted |
| Exception hierarchy | Terminal failures only (no retry) | ECS tasks are one-shot; no message requeue |
| Multi-tenant MySQL | No direct DB access (via Nextpoint API) | Delegates to backend for all data operations |
| Structured JSON logging | Custom NpLoggerImplementation | Java logging ecosystem differs from Python |
| Idempotent handlers | Fire-and-forget ECS tasks | No duplicate protection; relies on caller |
| Secrets Manager | YAML config file with API keys | Should migrate to Secrets Manager |

Pre-Deployment Architectural Review

P0 — Blockers

1. No API Authentication

API Gateway endpoint has no authentication — no API key, IAM auth, or Cognito authorizer. Any caller with the URL can trigger ECS tasks.

Fix: Add IAM authorization or API key requirement at minimum.

2. Hardcoded Credentials in Docker Build

CDK reads AWS credentials from local profile at synthesis time and passes them as Docker build args. Credentials may be embedded in the Docker image layer cache.

Fix: Use IAM task roles exclusively (already configured). Remove credential injection from Dockerfile and CDK.

3. S3FullAccess IAM Policy

ECS task role has AmazonS3FullAccess — allows read/write/delete on ALL S3 buckets in the account.

Fix: Scope to specific bucket ARN: arn:aws:s3:::{bucket}/* with only s3:GetObject, s3:PutObject, s3:DeleteObject.
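
The scoped policy could look like the document built below. A sketch only: `scopedS3Policy` is a hypothetical helper that returns a plain IAM policy JSON object, and how it would be attached to the task role in ecs-stack.ts is left out.

```typescript
interface PolicyStatement {
  Effect: "Allow" | "Deny";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a policy limited to object-level access on one bucket.
function scopedS3Policy(bucket: string): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        // Object ARNs only; the bucket itself is not a resource here.
        Resource: [`arn:aws:s3:::${bucket}/*`],
      },
    ],
  };
}
```

If the service ever needs to list keys, `s3:ListBucket` would need its own statement on the bucket ARN (`arn:aws:s3:::{bucket}`), since that action does not apply to object ARNs.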

P1 — High Priority

4. No Retry or Error Recovery

ECS task failures are terminal — no retry mechanism, no DLQ, no error event. The only error handling is sending an email in production.

Fix: Lambda handler should poll task status and publish SNS error event on failure. Consider Step Functions for retry orchestration.
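
Whatever mechanism observes the task, it needs a rule for deciding whether a stopped task failed. The sketch below classifies a task summary shaped like one entry of an ECS DescribeTasks response (`lastStatus`, `stoppedReason`, `containers[].exitCode`). Treating "STOPPED with every container exiting 0" as success is an assumption, and the type and function names are hypothetical.

```typescript
// Subset of the fields returned per task by ECS DescribeTasks.
interface ContainerSummary { name?: string; exitCode?: number; reason?: string; }
interface TaskSummary {
  lastStatus?: string;
  stoppedReason?: string;
  containers?: ContainerSummary[];
}

type Outcome =
  | { state: "running" }
  | { state: "succeeded" }
  | { state: "failed"; reason: string };

function classifyTask(task: TaskSummary): Outcome {
  if (task.lastStatus !== "STOPPED") return { state: "running" };
  const containers = task.containers ?? [];
  // A missing exit code means the container never ran to completion.
  const bad = containers.find((c) => c.exitCode !== 0);
  if (containers.length === 0 || bad !== undefined) {
    const reason = bad?.reason ?? task.stoppedReason ?? "unknown";
    return { state: "failed", reason };
  }
  return { state: "succeeded" };
}
```

A `failed` outcome is the point at which an SNS error event would be published.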

5. No Idempotency

Same request can trigger multiple ECS tasks processing the same document concurrently, potentially corrupting the PDF.

Fix: Add a processing lock (DynamoDB conditional write or Nextpoint API processing_in_nge flag check before launching task).
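
The locking idea can be sketched with an in-memory store standing in for the conditional write. The key format, the 15-minute TTL, and all names are hypothetical; the TTL exists so that a crashed task cannot hold the lock indefinitely.

```typescript
// In-memory stand-in for a conditional write. In DynamoDB the same check could be a
// PutItem with ConditionExpression "attribute_not_exists(pk) OR expiresAt < :now".
class InMemoryLockStore {
  private readonly locks = new Map<string, { owner: string; expiresAtMs: number }>();

  // Succeeds only if no unexpired lock exists for the key.
  putIfAbsent(key: string, owner: string, expiresAtMs: number, nowMs: number): boolean {
    const existing = this.locks.get(key);
    if (existing !== undefined && existing.expiresAtMs >= nowMs) return false;
    this.locks.set(key, { owner, expiresAtMs });
    return true;
  }

  // Releases the lock only if the caller still owns it.
  release(key: string, owner: string): void {
    if (this.locks.get(key)?.owner === owner) this.locks.delete(key);
  }
}

// Hypothetical upper bound on the runtime of one job.
const LOCK_TTL_MS = 15 * 60 * 1000;

function tryAcquireExhibitLock(
  store: InMemoryLockStore,
  caseId: number,
  exhibitId: number,
  requestId: string,
  nowMs: number,
): boolean {
  const key = `case#${caseId}#exhibit#${exhibitId}`;
  return store.putIfAbsent(key, requestId, nowMs + LOCK_TTL_MS, nowMs);
}
```

The Lambda would call this before RunTask and reject the request when it returns false, so two tasks never modify the same exhibit at once.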

6. Minimal Test Coverage

Only one placeholder test (testApp() returns true). No PDF manipulation tests, no S3 mock tests, no API client tests.

7. Low Resource Allocation

256 CPU / 512 MB may be insufficient for large PDFs with many pages. PDFBox loads entire documents into memory. A 500-page PDF could OOM.

Fix: Make CPU/memory configurable per environment. Consider 1024 CPU / 2048 MB for production.
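
Per-environment sizing could be expressed as a small lookup with validation, since Fargate only accepts certain CPU/memory pairs. A sketch: the environment name `production` and the function names are hypothetical, the two sizes are the ones mentioned in this section, and the table of valid pairs covers only sizes up to 4 vCPU.

```typescript
interface TaskSize { cpu: number; memoryMiB: number; }

// Current allocation, used when no override exists for the environment.
const DEFAULT_SIZE: TaskSize = { cpu: 256, memoryMiB: 512 };

// Hypothetical environment name; the size is the one suggested for production.
const SIZE_OVERRIDES: Record<string, TaskSize> = {
  production: { cpu: 1024, memoryMiB: 2048 },
};

function range(min: number, max: number, step: number): number[] {
  const values: number[] = [];
  for (let v = min; v <= max; v += step) values.push(v);
  return values;
}

// Memory values (MiB) Fargate accepts for each CPU value, up to 4 vCPU.
function allowedMemory(cpu: number): number[] {
  switch (cpu) {
    case 256: return [512, 1024, 2048];
    case 512: return range(1024, 4096, 1024);
    case 1024: return range(2048, 8192, 1024);
    case 2048: return range(4096, 16384, 1024);
    case 4096: return range(8192, 30720, 1024);
    default: return [];
  }
}

function isValidFargateSize(size: TaskSize): boolean {
  return allowedMemory(size.cpu).includes(size.memoryMiB);
}

function resolveTaskSize(envName: string): TaskSize {
  const size = SIZE_OVERRIDES[envName] ?? DEFAULT_SIZE;
  if (!isValidFargateSize(size)) {
    throw new Error(`Invalid Fargate size: ${size.cpu} CPU / ${size.memoryMiB} MiB`);
  }
  return size;
}
```

Validating at synthesis time surfaces an unsupported pair before deployment rather than as a failed task registration.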

P2 — Medium Priority

8. Single-Use S3 Client

s3Service.java creates a new S3Client per operation instead of reusing a long-lived client. Adds connection setup overhead per S3 call.

9. No CloudWatch Metrics or Alarms

No custom metrics or alarms for job duration or failure rates.

10. Synchronous API Could Timeout

API Gateway's default integration timeout is 29 seconds, so the Lambda cannot wait for the job to finish: it launches the ECS task and returns the task ARN immediately. The caller has no built-in way to check whether the job succeeded.

Consider: Adding a status endpoint or SNS notification on completion.

Architecture Compliance Summary

| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ❌ N/A | Synchronous API; by design, not a pattern violation |
| Idempotent handlers | ❌ FAIL | No duplicate protection |
| Exception hierarchy | ❌ N/A | Java ECS task, not Python Lambda |
| Multi-tenant DB | ✅ OK | Via Nextpoint API (case_id per request) |
| No secrets in code | ⚠️ WARN | YAML config file, not Secrets Manager |
| IAM least privilege | ❌ FAIL | S3FullAccess on task role |
| API authentication | ❌ FAIL | No auth on API Gateway |
| Testing | ❌ FAIL | Placeholder test only |

Lessons Learned

  1. API Gateway → Lambda → ECS is a viable pattern for one-shot jobs — Lambda validates and launches, ECS does the heavy lifting. Decouples request handling from processing.

  2. Hyland extractor reuse across modules reduces duplication — same native libraries used in documentextractor and documentpageservice for file conversion and OCR.

  3. PDFBox is sufficient for page-level operations — doesn't need Nutrient/PSPDFKit for basic reorder/rotate/split/merge operations. Nutrient is only needed for annotation-aware processing.

  4. Fire-and-forget ECS tasks need status tracking — without a callback or polling mechanism, the caller has no way to know if the job succeeded or failed.

  5. Low resource allocation works for simple operations — 256 CPU / 512 MB handles small PDFs, but production workloads with large documents will need higher limits.
