Reference Implementation: documentpageservice¶
PDF page manipulation service for post-processing document exhibits in the Nextpoint eDiscovery platform. Handles page reordering, rotation, addition, removal, and document splitting via a Java ECS Fargate task triggered by API Gateway.
Architecture Overview¶
documentpageservice is a request-driven service, not an event-driven SNS/SQS module. It processes one PDF manipulation job per ECS task invocation, triggered by API Gateway → Lambda → ECS RunTask. The API call is synchronous only in the sense that Lambda launches the task and returns its ARN immediately; the PDF work itself runs in the ECS task after the response has been sent. This is fundamentally different from the batch-processing, event-driven architecture used by documentloader, documentextractor, and documentuploader.
Why ECS Instead of Lambda?¶
| Concern | Lambda | ECS (chosen) |
|---|---|---|
| PDF processing libraries | Limited (no native libs) | Full Apache PDFBox + Hyland extractors |
| File system | 512 MB /tmp by default (configurable up to 10 GB) | Configurable ephemeral storage |
| Startup time | Cold start with large JARs | Container pre-warmed with extractors |
| Native libraries | Difficult (Lambda layers) | Native Hyland libs via LD_LIBRARY_PATH |
| Execution time | 15 min max | No timeout limit |
Architecture Tree¶
documentpageservice/
├── DocumentPageService/                        # Java ECS task
│   ├── src/main/java/com/nextpoint/
│   │   ├── JobRouter.java                      # Entry point: routes to job handler
│   │   ├── AddPageJob.java                     # Add pages (convert + merge + OCR)
│   │   ├── RemovePageJob.java                  # Remove all pages from exhibit
│   │   ├── pdfAlterationService.java           # PDF manipulation (PDFBox)
│   │   ├── s3Service.java                      # S3 download/upload operations
│   │   ├── NextpointAPI.java                   # HTTP client (XML, HMAC-SHA1)
│   │   ├── Environment.java                    # Environment config enum
│   │   └── pluginhost/
│   │       ├── TaskDocument.java               # Hyland extraction context
│   │       └── TaskExtractionContext.java      # Hyland task wrapper
│   ├── Dockerfile                              # Multi-stage: gradle build → JRE runtime
│   ├── build.gradle.kts
│   └── extractor-zips/                         # Hyland native binaries (ARM64 Linux)
├── infrastructure/                             # AWS CDK (TypeScript)
│   ├── lib/
│   │   ├── api-stack.ts                        # API Gateway + Lambda handler
│   │   ├── ecs-stack.ts                        # ECS cluster + Fargate task definition
│   │   └── shared-resources-stack.ts           # VPC endpoints + security groups
│   ├── lambda/handlers/handler.ts              # Lambda: parse request → ECS RunTask
│   └── config/
│       └── index.ts                            # Environment × region config
└── test/
Language Stack¶
| Component | Language | Runtime | Purpose |
|---|---|---|---|
| Page Service | Java | ECS Fargate (JRE) | PDF manipulation via Apache PDFBox |
| API Handler | TypeScript | Lambda (Node.js 22) | Parse request, launch ECS task |
| Infrastructure | TypeScript | CDK | AWS resource provisioning |
Request Flow¶
Nextpoint Backend
    │
    ▼
API Gateway (POST /pageService/jobRouter)
    │
    ▼
Lambda Handler (handler.ts)
    ├── Validate job type against whitelist
    ├── Build ECS RunTask overrides (env vars from request body)
    └── ecs.runTask() → returns taskArn
    │
    ▼
ECS Fargate Task (Java)
    ├── JobRouter.main() reads environment variables
    ├── Routes to job handler (addPage/removePages/editPdf/rotatePages/splitDocument)
    ├── Downloads PDF from S3
    ├── Manipulates PDF (PDFBox)
    ├── Uploads modified PDF to S3
    ├── Updates Nextpoint API (attachments, exhibits, bates numbers)
    └── Sets processing_in_nge=false via API
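The two Lambda steps in the flow above (whitelist check, then building RunTask overrides from the request body) might look roughly like the sketch below. It is not the code in handler.ts: the request field name `jobType`, the container name `page-service`, and the upper-snake-case environment variable naming are all assumptions made for illustration. Only the shape of `overrides.containerOverrides[].environment` follows the ECS RunTask request format.

```typescript
// Job types listed in this document; the real whitelist lives in handler.ts.
const ALLOWED_JOBS = ["addPage", "removePages", "editPdf", "rotatePages", "splitDocument"] as const;
type JobType = (typeof ALLOWED_JOBS)[number];

interface KeyValue { name: string; value: string; }
interface TaskOverrides {
  containerOverrides: { name: string; environment: KeyValue[] }[];
}

function isJobType(value: unknown): value is JobType {
  return typeof value === "string" && (ALLOWED_JOBS as readonly string[]).includes(value);
}

// Hypothetical helper: turns request fields into container env vars.
// Assumptions: the job type arrives as `jobType`, and env var names are the
// upper-snake-case form of the request keys (caseId -> CASE_ID).
function buildOverrides(body: Record<string, unknown>, containerName = "page-service"): TaskOverrides {
  if (!isJobType(body.jobType)) {
    throw new Error(`Unsupported job type: ${String(body.jobType)}`);
  }
  const environment = Object.entries(body)
    .filter(([, v]) => v !== undefined && v !== null)
    .map(([k, v]) => ({
      name: k.replace(/([a-z0-9])([A-Z])/g, "$1_$2").toUpperCase(),
      // ECS env values are strings, so non-string values are JSON-encoded.
      value: typeof v === "string" ? v : JSON.stringify(v),
    }));
  return { containerOverrides: [{ name: containerName, environment }] };
}
```

Rejecting unknown job types before calling RunTask keeps a malformed request from starting a container at all. A stricter version would also whitelist which request fields may become environment variables.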
Five Job Types¶
| Job | Operation | PDF Library | Hyland Extractors |
|---|---|---|---|
| addPage | Merge attachment into exhibit PDF | PDFBox merge | RENDER_PDF (convert non-PDF), EXTRACT_TEXT (OCR) |
| removePages | Delete all pages from exhibit | PDFBox page removal | None |
| editPdf | Reorder pages per new order array | PDFBox page extraction + rebuild | None |
| rotatePages | Rotate pages by degree values | PDFBox rotation | Optional OCR on rotated pages |
| splitDocument | Split exhibit at page boundaries | PDFBox splitter | Creates new exhibits + bates ranges |
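To make the editPdf and rotatePages rows concrete, the sketch below shows the index arithmetic behind them, independent of PDFBox. It assumes the new order array holds 1-based page numbers and uses every page exactly once; the real job may accept a different format. The rotation helper relies on the PDF rule that a page's rotation is a multiple of 90 degrees.

```typescript
// Assumption: newOrder contains 1-based page numbers, each used exactly once.
function reorderPages<T>(pages: readonly T[], newOrder: readonly number[]): T[] {
  if (newOrder.length !== pages.length) {
    throw new Error(`Expected ${pages.length} entries, got ${newOrder.length}`);
  }
  const seen = new Set<number>();
  return newOrder.map((pageNumber) => {
    if (!Number.isInteger(pageNumber) || pageNumber < 1 || pageNumber > pages.length) {
      throw new Error(`Page ${pageNumber} is out of range`);
    }
    if (seen.has(pageNumber)) {
      throw new Error(`Page ${pageNumber} appears more than once`);
    }
    seen.add(pageNumber);
    return pages[pageNumber - 1] as T;
  });
}

// Adds a rotation delta to a page's current rotation and normalizes the
// result into the range 0..270. Deltas that are not multiples of 90 are rejected.
function applyRotation(currentDegrees: number, deltaDegrees: number): number {
  if (deltaDegrees % 90 !== 0) {
    throw new Error(`Rotation must be a multiple of 90, got ${deltaDegrees}`);
  }
  return (((currentDegrees + deltaDegrees) % 360) + 360) % 360;
}
```

Validating the order array before touching the PDF means a bad request fails without a partially rebuilt document being uploaded.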
addPage Detail¶
The most complex job — converts non-PDF files, merges, and OCRs:
1. Download source attachment from S3
2. Detect MIME type (Apache Tika)
3. If non-PDF: convert via Hyland RENDER_PDF extractor
4. Split into individual pages
5. For each page:
a. Extract text (Hyland EXTRACT_TEXT / OCR)
b. Create attachment record via Nextpoint API
c. Upload page PDF to S3
6. Merge all pages into exhibit PDF
7. Upload merged PDF to S3
8. Update exhibit metadata via API
Search text limit: 8 MB per field (MAX_ALLOWED_SEARCH_TEXT).
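The document does not say how the 8 MB search text limit is enforced. One plausible approach is to truncate the extracted text to the byte limit without splitting a multi-byte UTF-8 character, sketched below. Both the truncation behavior and the reading of "8 MB" as 8 × 1024 × 1024 bytes are assumptions; only the constant name comes from the source.

```typescript
// Assumption: "8 MB" means 8 * 1024 * 1024 bytes of UTF-8 text.
const MAX_ALLOWED_SEARCH_TEXT = 8 * 1024 * 1024;

// Truncates text to at most maxBytes of UTF-8 without cutting a character in half.
function truncateUtf8(text: string, maxBytes: number = MAX_ALLOWED_SEARCH_TEXT): string {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // If the first excluded byte is a continuation byte (10xxxxxx), the cut falls
  // inside a multi-byte character, so back up to the start of that character.
  while (end > 0 && ((bytes[end] ?? 0) & 0xc0) === 0x80) end--;
  return new TextDecoder().decode(bytes.subarray(0, end));
}
```

Measuring in bytes rather than characters matters for OCR output in non-Latin scripts, where one character can take up to four bytes.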
Pattern Mapping¶
| Architecture Pattern | documentpageservice Implementation | Notes |
|---|---|---|
| SNS Event Publishing | Not used | Synchronous API-driven, no events |
| SQS Handler | Not used | No queue processing |
| Exception Hierarchy | Not used | Generic Java exceptions; terminal failures |
| Hexagonal core/shell | Not followed | Mixed business logic + infrastructure in service classes |
| Multi-tenancy | Via API | case_id passed per request to Nextpoint API |
| CDK Infrastructure | Three stacks | SharedResources → ECS → API (dependency chain) |
| Idempotent Handlers | Not implemented | No dedup checks; ECS tasks are fire-and-forget |
| Config Management | YAML + env vars | nextpoint_api.yaml for API config; env vars for S3/job params |
| Structured Logging | Not followed | Custom NpLoggerImplementation, not JSON structured |
Hyland Document Extractor Integration¶
The module reuses Hyland document extraction plugins from documentextractor:
ECS Container (/app/extractors/)
├── hyland/                       # Native libraries (ARM64 Linux)
│   ├── linux-aarch64-gcc-64/     # .so shared objects
│   └── ...
└── plugins/                      # Extraction task handlers
    ├── RENDER_PDF                # Convert any file format → PDF
    └── EXTRACT_TEXT              # OCR / text extraction from PDF pages
Plugin loading: TaskHandlerLoader discovers plugins at startup via the EXTRACTOR_PLUGINS_PATH environment variable. Native libraries are loaded via LD_LIBRARY_PATH=/app/extractors/hyland and JAVA_OPTS=-Djava.library.path=....
Extraction context: TaskDocument and TaskExtractionContext implement the Hyland extraction framework interfaces, providing a StoredDocument and ExtractionContext for the plugin dispatch system.
Nextpoint API Client¶
NextpointAPI.java communicates with the Nextpoint backend via HTTP:
| Method | Endpoint | Purpose |
|---|---|---|
| getExhibit() | GET /documents/get/{id} | Fetch exhibit metadata |
| getAttachment() | GET /attachments/get/{id} | Fetch attachment metadata |
| findBatch() | GET /case/{caseId}/documents/... | Find batch for case |
| createAttachment() | POST /case/{caseId}/documents/... | Create page attachment |
| updateExhibit() | PUT /case/{caseId}/documents/... | Update exhibit after modification |
| sendProcessingErrorEmail() | POST /... | Error notification (prod only) |
Authentication: HMAC-SHA1 signed requests with date header and secret key.
Response format: XML (parsed to JsonObject via XMLtoJSON).
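For readers unfamiliar with HMAC-signed requests, the sketch below shows the general technique: compute an HMAC-SHA1 over a canonical string that includes the date, then send the date and signature as headers. The canonical string layout (`METHOD\npath\ndate`) and the `X-Signature` header name are hypothetical; the actual format used by NextpointAPI.java is not described here. The example is in TypeScript for consistency with the other sketches, while the real client is Java.

```typescript
import { createHmac } from "node:crypto";

// Computes HMAC-SHA1 and returns it as lowercase hex.
function hmacSha1Hex(secret: string, message: string): string {
  return createHmac("sha1", secret).update(message, "utf8").digest("hex");
}

// Hypothetical signing scheme: the canonical string and header names are
// assumptions for illustration, not the Nextpoint API's actual contract.
function signRequest(
  method: string,
  path: string,
  secret: string,
  now: Date = new Date(),
): Record<string, string> {
  const date = now.toUTCString(); // HTTP-date format, e.g. "Thu, 01 Jan 1970 00:00:00 GMT"
  const canonical = `${method.toUpperCase()}\n${path}\n${date}`;
  return { Date: date, "X-Signature": hmacSha1Hex(secret, canonical) };
}
```

Including the date in the signed string lets the server reject stale requests, which limits replay of a captured signature.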
S3 File Layout¶
s3://{bucket}/data-files/{md5_hash}/{case_id}/{batch_id}/{extraction_doc_id}/
├── pdf/
│   ├── {file_name}                  # Original exhibit PDF
│   └── {file_name}_page_{n}.pdf     # Individual page PDFs
└── native/
    └── {original_file}              # Native file (pre-conversion)
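The layout above can be expressed as a few key-building helpers. This is a sketch only: the function and field names are made up for illustration, and the source of `{md5_hash}` is not specified here, so it is taken as an input rather than computed.

```typescript
// Hypothetical type: field names mirror the placeholders in the layout above.
interface ExhibitLocation {
  md5Hash: string;
  caseId: string | number;
  batchId: string | number;
  extractionDocId: string | number;
}

function exhibitPrefix(loc: ExhibitLocation): string {
  return `data-files/${loc.md5Hash}/${loc.caseId}/${loc.batchId}/${loc.extractionDocId}`;
}

// Key of the original exhibit PDF.
function exhibitPdfKey(loc: ExhibitLocation, fileName: string): string {
  return `${exhibitPrefix(loc)}/pdf/${fileName}`;
}

// Key of one individual page PDF ({file_name}_page_{n}.pdf).
function pagePdfKey(loc: ExhibitLocation, fileName: string, page: number): string {
  return `${exhibitPrefix(loc)}/pdf/${fileName}_page_${page}.pdf`;
}

// Key of the native (pre-conversion) file.
function nativeKey(loc: ExhibitLocation, originalFile: string): string {
  return `${exhibitPrefix(loc)}/native/${originalFile}`;
}
```

Keeping key construction in one place avoids the download and upload paths drifting apart when the layout changes.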
CDK Infrastructure¶
Three-Stack Deployment¶
Stack 1: SharedResourcesStack (pageService-shared)
└── Security group, VPC endpoints (ECR, CloudWatch, S3)
Stack 2: EcsStack (pageService-ecs)
├── ECS Fargate cluster (container insights)
├── Task definition: 256 CPU / 512 MB memory
├── Docker image built from DocumentPageService/Dockerfile
└── IAM: task role (S3FullAccess), execution role
Stack 3: ApiStack (pageService-api)
├── Lambda function (Node.js 22, parses request → ECS RunTask)
├── API Gateway REST API (POST /pageService/jobRouter)
└── IAM: Lambda role (ECS RunTask + CloudWatch Logs)
Dependency chain: SharedResources stack → ECS stack → API stack (the API stack reads the ECS task ARN and cluster ARN from SSM parameters).
ECS Task Resource Allocation¶
| Resource | Value | Notes |
|---|---|---|
| CPU | 256 (0.25 vCPU) | Minimal — PDF operations are I/O-bound |
| Memory | 512 MB | Tight for large PDFs; may need tuning |
| Ephemeral storage | Default (20 GB) | Adequate for single-document processing |
Divergences from Standard Architecture¶
| Standard Pattern | documentpageservice Divergence | Reason |
|---|---|---|
| Python 3.10+ | Java (ECS) + TypeScript (Lambda) | PDFBox is Java; Hyland extractors require JVM |
| Event-driven SNS/SQS | Synchronous API Gateway → Lambda → ECS | Single-document operations, not batch processing |
| Hexagonal core/shell | Monolithic Java service classes | Small scope; separation not warranted |
| Exception hierarchy | Terminal failures only (no retry) | ECS tasks are one-shot; no message requeue |
| Multi-tenant MySQL | No direct DB access — via Nextpoint API | Delegates to backend for all data operations |
| Structured JSON logging | Custom NpLoggerImplementation | Java logging ecosystem differs from Python |
| Idempotent handlers | Fire-and-forget ECS tasks | No duplicate protection; relies on caller |
| Secrets Manager | YAML config file with API keys | Should migrate to Secrets Manager |
Pre-Deployment Architectural Review¶
P0 — Blockers¶
1. No API Authentication¶
API Gateway endpoint has no authentication — no API key, IAM auth, or Cognito authorizer. Any caller with the URL can trigger ECS tasks.
Fix: Add IAM authorization or API key requirement at minimum.
2. Hardcoded Credentials in Docker Build¶
CDK reads AWS credentials from local profile at synthesis time and passes them as Docker build args. Credentials may be embedded in the Docker image layer cache.
Fix: Use IAM task roles exclusively (already configured). Remove credential injection from Dockerfile and CDK.
3. S3FullAccess IAM Policy¶
ECS task role has AmazonS3FullAccess, which allows read/write/delete on ALL S3 buckets in the account.
Fix: Scope the policy to the specific bucket ARN, arn:aws:s3:::{bucket}/*, with only s3:GetObject, s3:PutObject, and s3:DeleteObject.
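A scoped policy along the lines of this fix could be generated as a plain IAM policy document, as in the sketch below. The bucket name in the usage example is a placeholder, and the optional key prefix is an extra narrowing step that is not part of the original recommendation.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string[];
}

interface PolicyDocument {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

// Builds a least-privilege replacement for AmazonS3FullAccess.
// `keyPrefix` is an optional assumption, e.g. "data-files/", to narrow access further.
function scopedS3Policy(bucket: string, keyPrefix = ""): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
        Resource: [`arn:aws:s3:::${bucket}/${keyPrefix}*`],
      },
    ],
  };
}

// Usage with a placeholder bucket name:
const examplePolicy = scopedS3Policy("example-exhibit-bucket");
```

The three actions are object-level, so the object ARN pattern is enough. If the task ever needs to list keys, that would require a separate statement on the bucket ARN itself.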
P1 — High Priority¶
4. No Retry or Error Recovery¶
ECS task failures are terminal — no retry mechanism, no DLQ, no error event. The only error handling is sending an email in production.
Fix: Lambda handler should poll task status and publish SNS error event on failure. Consider Step Functions for retry orchestration.
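Whatever component ends up polling, it has to turn an ECS task description into a success or failure decision. The sketch below classifies a task from the `lastStatus`, `stoppedReason`, and per-container `exitCode` and `reason` fields that ECS DescribeTasks returns. The types are trimmed local copies rather than AWS SDK imports, and the outcome shape is an assumption.

```typescript
// Trimmed local view of the fields used from an ECS DescribeTasks response.
interface ContainerSummary { name?: string; exitCode?: number; reason?: string; }
interface TaskSummary { lastStatus?: string; stoppedReason?: string; containers?: ContainerSummary[]; }

type TaskOutcome =
  | { state: "running" }
  | { state: "succeeded" }
  | { state: "failed"; reason: string };

function classifyTask(task: TaskSummary): TaskOutcome {
  // Any status other than STOPPED (PROVISIONING, PENDING, RUNNING, ...) is still in flight.
  if (task.lastStatus !== "STOPPED") return { state: "running" };

  const containers = task.containers ?? [];
  // A missing exit code means the container never ran to completion, so it counts as failed.
  const failed = containers.find((c) => c.exitCode !== 0);
  if (containers.length > 0 && failed === undefined) return { state: "succeeded" };

  const code = failed?.exitCode;
  const detail = code === undefined ? "no exit code reported" : `exit code ${code}`;
  return { state: "failed", reason: failed?.reason ?? task.stoppedReason ?? detail };
}
```

A caller would publish the error event or schedule a retry only for the `failed` state and keep polling while the state is `running`.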
5. No Idempotency¶
Same request can trigger multiple ECS tasks processing the same document concurrently, potentially corrupting the PDF.
Fix: Add a processing lock (a DynamoDB conditional write, or a check of the Nextpoint API processing_in_nge flag before launching the task).
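The locking idea can be sketched against a small store interface, with an in-memory implementation standing in for DynamoDB. All names here are hypothetical. In a real deployment `putIfAbsent` would map to a DynamoDB put with a condition such as `attribute_not_exists` on the key, and the lock would be cleared when the ECS task finishes rather than when the launch call returns.

```typescript
// Hypothetical lock abstraction; a DynamoDB-backed version would implement
// putIfAbsent as a conditional write that fails if the key already exists.
interface LockStore {
  putIfAbsent(key: string, owner: string): Promise<boolean>;
  remove(key: string, owner: string): Promise<void>;
}

class InMemoryLockStore implements LockStore {
  private readonly locks = new Map<string, string>();

  async putIfAbsent(key: string, owner: string): Promise<boolean> {
    if (this.locks.has(key)) return false;
    this.locks.set(key, owner);
    return true;
  }

  async remove(key: string, owner: string): Promise<void> {
    // Only the owner may release, so a late duplicate cannot clear someone else's lock.
    if (this.locks.get(key) === owner) this.locks.delete(key);
  }
}

// Launches a job only if no other request holds the lock for this exhibit.
// Returns undefined when the exhibit is already being processed.
async function launchOnce<T>(
  store: LockStore,
  caseId: number,
  exhibitId: number,
  requestId: string,
  launch: () => Promise<T>,
): Promise<T | undefined> {
  const key = `${caseId}#${exhibitId}`;
  if (!(await store.putIfAbsent(key, requestId))) return undefined;
  try {
    // On success the lock stays held; the task is expected to release it when done.
    return await launch();
  } catch (err) {
    await store.remove(key, requestId); // launch failed, so free the exhibit again
    throw err;
  }
}
```

A production lock would also need an expiry so that a crashed task does not block the exhibit indefinitely.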
6. Minimal Test Coverage¶
Only one placeholder test (testApp() returns true). No PDF manipulation tests, no S3 mock tests, no API client tests.
7. Low Resource Allocation¶
256 CPU / 512 MB may be insufficient for large PDFs with many pages. PDFBox keeps loaded documents in main memory by default, so a 500-page PDF could exhaust the JVM heap and kill the task.
Fix: Make CPU/memory configurable per environment. Consider 1024 CPU / 2048 MB for production.
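Per-environment sizing could live next to the existing environment config, as in the sketch below. The environment names and the map itself are assumptions; only the 256 / 512 and 1024 / 2048 values come from this review. The validity check reflects the fact that Fargate accepts only certain CPU and memory pairings, and sizes above 4 vCPU are left out of the sketch.

```typescript
type EnvName = "dev" | "prod"; // hypothetical environment names

interface TaskSize { cpu: number; memoryMiB: number; }

// Fargate pairs each CPU value with a fixed set or range of memory values.
// Sizes above 4096 CPU units are not covered by this sketch and are rejected.
function isValidFargateSize({ cpu, memoryMiB }: TaskSize): boolean {
  const GiB = 1024;
  const inRange = (min: number, max: number) =>
    memoryMiB >= min && memoryMiB <= max && memoryMiB % GiB === 0;
  switch (cpu) {
    case 256: return [512, 1024, 2048].includes(memoryMiB);
    case 512: return inRange(1 * GiB, 4 * GiB);
    case 1024: return inRange(2 * GiB, 8 * GiB);
    case 2048: return inRange(4 * GiB, 16 * GiB);
    case 4096: return inRange(8 * GiB, 30 * GiB);
    default: return false;
  }
}

const TASK_SIZES: Record<EnvName, TaskSize> = {
  dev: { cpu: 256, memoryMiB: 512 },    // current allocation
  prod: { cpu: 1024, memoryMiB: 2048 }, // value suggested in this review
};

// Returns the task size for an environment, failing fast on an invalid pairing.
function taskSizeFor(env: EnvName): TaskSize {
  const size = TASK_SIZES[env];
  if (!isValidFargateSize(size)) {
    throw new Error(`Invalid Fargate size for ${env}: ${size.cpu} CPU / ${size.memoryMiB} MiB`);
  }
  return size;
}
```

Checking the pairing at synthesis time surfaces a bad combination before deployment instead of as a failed task definition registration.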
P2 — Medium Priority¶
8. Single-Use S3 Client¶
s3Service.java creates a new S3Client per operation instead of reusing a long-lived client. This adds connection setup overhead to every S3 call.
9. No CloudWatch Metrics or Alarms¶
No custom metrics or alarms for job duration or failure rates. Queue depth monitoring does not apply here because the service has no queue.
10. Synchronous API Could Timeout¶
API Gateway has a 29-second default integration timeout. The Lambda stays within it because it launches the ECS task and returns the task ARN without waiting for completion, but that leaves the caller with no built-in way to check whether the job succeeded.
Consider: Adding a status endpoint or SNS notification on completion.
Architecture Compliance Summary¶
| Requirement | Status | Details |
|---|---|---|
| Event-driven via SNS | ❌ N/A | Synchronous API — by design, not a pattern violation |
| Idempotent handlers | ❌ FAIL | No duplicate protection |
| Exception hierarchy | ❌ N/A | Java ECS task, not Python Lambda |
| Multi-tenant DB | ✅ OK | Via Nextpoint API (case_id per request) |
| No secrets in code | ⚠️ WARN | YAML config file, not Secrets Manager |
| IAM least privilege | ❌ FAIL | S3FullAccess on task role |
| API authentication | ❌ FAIL | No auth on API Gateway |
| Testing | ❌ FAIL | Placeholder test only |
Lessons Learned¶
- API Gateway → Lambda → ECS is a viable pattern for one-shot jobs: Lambda validates and launches, ECS does the heavy lifting. This decouples request handling from processing.
- Hyland extractor reuse across modules reduces duplication: the same native libraries are used in documentextractor and documentpageservice for file conversion and OCR.
- PDFBox is sufficient for page-level operations: basic reorder/rotate/split/merge operations do not need Nutrient/PSPDFKit. Nutrient is only needed for annotation-aware processing.
- Fire-and-forget ECS tasks need status tracking: without a callback or polling mechanism, the caller has no way to know whether the job succeeded or failed.
- Low resource allocation works for simple operations: 256 CPU / 512 MB handles small PDFs, but production workloads with large documents will need higher limits.