Reference Implementation: documentpageservice¶

PDF page manipulation service for post-processing document exhibits in the Nextpoint eDiscovery platform. Handles page reordering, rotation, addition, removal, and document splitting via a Java ECS Fargate task triggered by API Gateway.

Architecture Overview¶

documentpageservice is a synchronous, request-driven service, not an event-driven SNS/SQS module. It processes one PDF manipulation job per ECS task invocation, triggered by API Gateway → Lambda → ECS RunTask. Only the launch is synchronous: the Lambda returns the task ARN, and the job then runs unattended in the ECS task. This is fundamentally different from the batch-processing, event-driven architecture used by documentloader, documentextractor, and documentuploader.

Why ECS Instead of Lambda?¶

Concern	Lambda	ECS (chosen)
PDF processing libraries	Limited (no native libs)	Full Apache PDFBox + Hyland extractors
File system	512 MB /tmp by default (configurable up to 10 GB)	Configurable ephemeral storage
Startup time	Cold start with large JARs	Extractors pre-installed in the container image
Native libraries	Difficult (Lambda layers)	Native Hyland libs via LD_LIBRARY_PATH
Execution time	15 min max	No timeout limit

Architecture Tree¶

documentpageservice/
├── DocumentPageService/                      # Java ECS task
│   ├── src/main/java/com/nextpoint/
│   │   ├── JobRouter.java                   # Entry point — routes to job handler
│   │   ├── AddPageJob.java                  # Add pages (convert + merge + OCR)
│   │   ├── RemovePageJob.java               # Remove all pages from exhibit
│   │   ├── pdfAlterationService.java        # PDF manipulation (PDFBox)
│   │   ├── s3Service.java                   # S3 download/upload operations
│   │   ├── NextpointAPI.java                # HTTP client (XML, HMAC-SHA1)
│   │   ├── Environment.java                 # Environment config enum
│   │   └── pluginhost/
│   │       ├── TaskDocument.java            # Hyland extraction context
│   │       └── TaskExtractionContext.java   # Hyland task wrapper
│   ├── Dockerfile                           # Multi-stage: gradle build → JRE runtime
│   ├── build.gradle.kts
│   └── extractor-zips/                      # Hyland native binaries (ARM64 Linux)
├── infrastructure/                          # AWS CDK (TypeScript)
│   ├── lib/
│   │   ├── api-stack.ts                    # API Gateway + Lambda handler
│   │   ├── ecs-stack.ts                    # ECS cluster + Fargate task definition
│   │   └── shared-resources-stack.ts       # VPC endpoints + security groups
│   ├── lambda/handlers/handler.ts          # Lambda: parse request → ECS RunTask
│   └── config/
│       └── index.ts                        # Environment × region config
└── test/

Language Stack¶

Component	Language	Runtime	Purpose
Page Service	Java	ECS Fargate (JRE)	PDF manipulation via Apache PDFBox
API Handler	TypeScript	Lambda (Node.js 22)	Parse request, launch ECS task
Infrastructure	TypeScript	CDK	AWS resource provisioning

Request Flow¶

Nextpoint Backend
  │
  ▼
API Gateway (POST /pageService/jobRouter)
  │
  ▼
Lambda Handler (handler.ts)
  ├── Validate job type against whitelist
  ├── Build ECS RunTask overrides (env vars from request body)
  └── ecs.runTask() → returns taskArn
  │
  ▼
ECS Fargate Task (Java)
  ├── JobRouter.main() reads environment variables
  ├── Routes to job handler (addPage/removePages/editPdf/rotatePages/splitDocument)
  ├── Downloads PDF from S3
  ├── Manipulates PDF (PDFBox)
  ├── Uploads modified PDF to S3
  ├── Updates Nextpoint API (attachments, exhibits, bates numbers)
  └── Sets processing_in_nge=false via API
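
The routing step in the flow above can be sketched as a lookup from job type to handler. This is a minimal illustration, not the actual JobRouter.java: the class name JobRouterSketch, the JOB_TYPE variable name, and the stub handlers are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch of job routing; not the actual JobRouter.java. */
final class JobRouterSketch {

    // Job type -> handler. A handler receives the environment and returns a status string.
    private final Map<String, Function<Map<String, String>, String>> handlers = new LinkedHashMap<>();

    JobRouterSketch register(String jobType, Function<Map<String, String>, String> handler) {
        handlers.put(jobType, handler);
        return this;
    }

    /** Registers stub handlers for the five job types named in this document. */
    static JobRouterSketch defaultRouter() {
        JobRouterSketch router = new JobRouterSketch();
        for (String job : new String[] {"addPage", "removePages", "editPdf", "rotatePages", "splitDocument"}) {
            router.register(job, env -> "handled:" + job);
        }
        return router;
    }

    /** Looks up the handler for JOB_TYPE (an assumed variable name) and runs it. */
    String route(Map<String, String> env) {
        String jobType = env.get("JOB_TYPE");
        Function<Map<String, String>, String> handler = jobType == null ? null : handlers.get(jobType);
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported job type: " + jobType);
        }
        return handler.apply(env);
    }

    public static void main(String[] args) {
        // A real task would pass System.getenv(); a fixed map keeps the sketch runnable.
        System.out.println(defaultRouter().route(Map.of("JOB_TYPE", "rotatePages")));
    }
}
```

Rejecting unknown job types inside the task as well as in the Lambda whitelist gives a second line of defense if the task is ever launched by another caller.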

Five Job Types¶

Job	Operation	PDF Library	Hyland Extractors
`addPage`	Merge attachment into exhibit PDF	PDFBox merge	RENDER_PDF (convert non-PDF), EXTRACT_TEXT (OCR)
`removePages`	Delete all pages from exhibit	PDFBox page removal	—
`editPdf`	Reorder pages per new order array	PDFBox page extraction + rebuild	—
`rotatePages`	Rotate pages by degree values	PDFBox rotation	Optional OCR on rotated pages
`splitDocument`	Split exhibit at page boundaries	PDFBox splitter	None (creates new exhibits + Bates ranges)
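
The core logic behind editPdf and rotatePages can be illustrated independently of PDFBox. The sketch below assumes that the new order array holds zero-based indices and must be a full permutation of the existing pages, and that rotations are multiples of 90 degrees (the PDF page rotation entry only allows such values). The class and method names are hypothetical; the real service operates on PDFBox page objects.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of reorder and rotate logic; the real service uses PDFBox pages. */
final class PageOpsSketch {

    /** Returns a list where result[i] = pages[newOrder[i]]; newOrder must be a permutation of 0..n-1. */
    static <T> List<T> reorder(List<T> pages, int[] newOrder) {
        if (newOrder.length != pages.size()) {
            throw new IllegalArgumentException("order length must equal page count");
        }
        boolean[] seen = new boolean[pages.size()];
        List<T> result = new ArrayList<>(pages.size());
        for (int index : newOrder) {
            if (index < 0 || index >= pages.size() || seen[index]) {
                throw new IllegalArgumentException("invalid or duplicate page index: " + index);
            }
            seen[index] = true;
            result.add(pages.get(index));
        }
        return result;
    }

    /** Adds a rotation delta to the current rotation and normalizes to 0, 90, 180, or 270. */
    static int rotate(int currentDegrees, int deltaDegrees) {
        if (deltaDegrees % 90 != 0) {
            throw new IllegalArgumentException("rotation must be a multiple of 90");
        }
        return Math.floorMod(currentDegrees + deltaDegrees, 360);
    }

    public static void main(String[] args) {
        System.out.println(reorder(List.of("p1", "p2", "p3"), new int[] {2, 0, 1}));
        System.out.println(rotate(270, 90));
    }
}
```

Validating the order array before touching the document means a malformed request fails before any PDF is rewritten or uploaded.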

addPage Detail¶

addPage is the most complex job: it converts non-PDF files to PDF, runs OCR on each page, and merges the result into the exhibit PDF:

1. Download source attachment from S3
2. Detect MIME type (Apache Tika)
3. If non-PDF: convert via Hyland RENDER_PDF extractor
4. Split into individual pages
5. For each page:
   a. Extract text (Hyland EXTRACT_TEXT / OCR)
   b. Create attachment record via Nextpoint API
   c. Upload page PDF to S3
6. Merge all pages into exhibit PDF
7. Upload merged PDF to S3
8. Update exhibit metadata via API

Search text limit: 8 MB per field (MAX_ALLOWED_SEARCH_TEXT).
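
One way to enforce such a limit is to cut the extracted text at a character boundary so that the stored value never exceeds the byte budget. The sketch below assumes the 8 MB limit is measured in UTF-8 bytes and that oversized text is truncated rather than rejected; neither detail is confirmed by the source, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical sketch of a search text size limit; units and truncation policy are assumptions. */
final class SearchTextLimitSketch {

    // Assumed to be bytes; the source only states "8 MB per field".
    static final int MAX_ALLOWED_SEARCH_TEXT = 8 * 1024 * 1024;

    /** Returns the longest prefix of text whose UTF-8 encoding fits in maxBytes, never splitting a code point. */
    static String truncateUtf8(String text, int maxBytes) {
        int bytes = 0;
        int i = 0;
        while (i < text.length()) {
            int codePoint = text.codePointAt(i);
            int size = utf8Length(codePoint);
            if (bytes + size > maxBytes) {
                break;
            }
            bytes += size;
            i += Character.charCount(codePoint);
        }
        return text.substring(0, i);
    }

    /** Number of bytes UTF-8 uses for a code point. */
    private static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    public static void main(String[] args) {
        String clipped = truncateUtf8("h\u00e9llo", 3);
        System.out.println(clipped.getBytes(StandardCharsets.UTF_8).length);
    }
}
```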

Pattern Mapping¶

Architecture Pattern	documentpageservice Implementation	Notes
SNS Event Publishing	Not used	Synchronous API-driven, no events
SQS Handler	Not used	No queue processing
Exception Hierarchy	Not used	Generic Java exceptions; terminal failures
Hexagonal core/shell	Not followed	Mixed business logic + infrastructure in service classes
Multi-tenancy	Via API	`case_id` passed per request to Nextpoint API
CDK Infrastructure	Three stacks	SharedResources → ECS → API (dependency chain)
Idempotent Handlers	Not implemented	No dedup checks; ECS tasks are fire-and-forget
Config Management	YAML + env vars	`nextpoint_api.yaml` for API config; env vars for S3/job params
Structured Logging	Not followed	Custom NpLoggerImplementation, not JSON structured

Hyland Document Extractor Integration¶

The module reuses Hyland document extraction plugins from documentextractor:

ECS Container (/app/extractors/)
  ├── hyland/                    # Native libraries (ARM64 Linux)
  │   ├── linux-aarch64-gcc-64/ # .so shared objects
  │   └── ...
  └── plugins/                   # Extraction task handlers
      ├── RENDER_PDF             # Convert any file format → PDF
      └── EXTRACT_TEXT           # OCR / text extraction from PDF pages

Plugin loading: TaskHandlerLoader discovers plugins at startup via the EXTRACTOR_PLUGINS_PATH environment variable. Native libraries are loaded via LD_LIBRARY_PATH=/app/extractors/hyland and JAVA_OPTS=-Djava.library.path=....

Extraction context: TaskDocument and TaskExtractionContext implement the Hyland extraction framework interfaces, providing a StoredDocument and ExtractionContext for the plugin dispatch system.

Nextpoint API Client¶

NextpointAPI.java communicates with the Nextpoint backend via HTTP:

Method	Endpoint	Purpose
`getExhibit()`	`GET /documents/get/{id}`	Fetch exhibit metadata
`getAttachment()`	`GET /attachments/get/{id}`	Fetch attachment metadata
`findBatch()`	`GET /case/{caseId}/documents/...`	Find batch for case
`createAttachment()`	`POST /case/{caseId}/documents/...`	Create page attachment
`updateExhibit()`	`PUT /case/{caseId}/documents/...`	Update exhibit after modification
`sendProcessingErrorEmail()`	`POST /...`	Error notification (prod only)

Authentication: HMAC-SHA1 signed requests with date header and secret key.
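
As an illustration of HMAC-SHA1 request signing with the standard Java crypto API, the sketch below signs a string built from the HTTP method, path, and date header. The string-to-sign layout, the hex encoding, and all names are assumptions; the actual format used by NextpointAPI.java is not described in the source.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

/** Hypothetical sketch of HMAC-SHA1 signing; the real string-to-sign format is not documented here. */
final class HmacSigningSketch {

    /** Computes HMAC-SHA1 over message with secretKey and returns lowercase hex. */
    static String hmacSha1Hex(String secretKey, String message) {
        try {
            Mac mac = Mac.getInstance("HmacSHA1");
            mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
            byte[] digest = mac.doFinal(message.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC-SHA1 unavailable or key invalid", e);
        }
    }

    /** Signs method, path, and date joined by newlines (an assumed layout). */
    static String sign(String secretKey, String method, String path, String dateHeader) {
        return hmacSha1Hex(secretKey, method + "\n" + path + "\n" + dateHeader);
    }

    public static void main(String[] args) {
        System.out.println(sign("example-secret", "GET", "/documents/get/123", "Tue, 01 Jan 2030 00:00:00 GMT"));
    }
}
```

Including the date header in the signed string lets the server reject stale requests, which is the usual reason such schemes sign a timestamp.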

Response format: XML (parsed to JsonObject via XMLtoJSON).

S3 File Layout¶

s3://{bucket}/data-files/{md5_hash}/{case_id}/{batch_id}/{extraction_doc_id}/
  ├── pdf/
  │   ├── {file_name}                    # Original exhibit PDF
  │   └── {file_name}_page_{n}.pdf       # Individual page PDFs
  └── native/
      └── {original_file}                # Native file (pre-conversion)
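
The layout above maps directly onto a small set of key-building helpers. This is a sketch under assumptions: identifiers are treated as opaque strings, the helper names are hypothetical, and the page key simply appends the suffix to the file name exactly as the layout shows.

```java
/** Hypothetical helpers that build S3 keys following the layout shown above. */
final class S3LayoutSketch {

    /** data-files/{md5_hash}/{case_id}/{batch_id}/{extraction_doc_id}/ */
    static String basePrefix(String md5Hash, String caseId, String batchId, String extractionDocId) {
        return "data-files/" + md5Hash + "/" + caseId + "/" + batchId + "/" + extractionDocId + "/";
    }

    /** Key of the exhibit PDF under pdf/. */
    static String exhibitPdfKey(String base, String fileName) {
        return base + "pdf/" + fileName;
    }

    /** Key of an individual page PDF: {file_name}_page_{n}.pdf under pdf/. */
    static String pagePdfKey(String base, String fileName, int pageNumber) {
        return base + "pdf/" + fileName + "_page_" + pageNumber + ".pdf";
    }

    /** Key of the native (pre-conversion) file under native/. */
    static String nativeKey(String base, String originalFile) {
        return base + "native/" + originalFile;
    }

    public static void main(String[] args) {
        String base = basePrefix("abc123", "42", "7", "1001");
        System.out.println(pagePdfKey(base, "exhibit", 3));
    }
}
```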

CDK Infrastructure¶

Three-Stack Deployment¶

Stack 1: SharedResourcesStack (pageService-shared)
  └── Security group, VPC endpoints (ECR, CloudWatch, S3)

Stack 2: EcsStack (pageService-ecs)
  ├── ECS Fargate cluster (container insights)
  ├── Task definition: 256 CPU / 512 MB memory
  ├── Docker image built from DocumentPageService/Dockerfile
  └── IAM: task role (S3FullAccess), execution role

Stack 3: ApiStack (pageService-api)
  ├── Lambda function (Node.js 22, parses request → ECS RunTask)
  ├── API Gateway REST API (POST /pageService/jobRouter)
  └── IAM: Lambda role (ECS RunTask + CloudWatch Logs)

Dependency chain: SharedResources stack → ECS stack → API stack (the API stack reads the ECS task definition ARN and cluster ARN from SSM parameters).

ECS Task Resource Allocation¶

Resource	Value	Notes
CPU	256 (0.25 vCPU)	Minimal — PDF operations are I/O-bound
Memory	512 MB	Tight for large PDFs; may need tuning
Ephemeral storage	Default (20 GB)	Adequate for single-document processing

Divergences from Standard Architecture¶

Standard Pattern	documentpageservice Divergence	Reason
Python 3.10+	Java (ECS) + TypeScript (Lambda)	PDFBox is Java; Hyland extractors require JVM
Event-driven SNS/SQS	Synchronous API Gateway → Lambda → ECS	Single-document operations, not batch processing
Hexagonal core/shell	Monolithic Java service classes	Small scope; separation not warranted
Exception hierarchy	Terminal failures only (no retry)	ECS tasks are one-shot; no message requeue
Multi-tenant MySQL	No direct DB access — via Nextpoint API	Delegates to backend for all data operations
Structured JSON logging	Custom NpLoggerImplementation	Java logging ecosystem differs from Python
Idempotent handlers	Fire-and-forget ECS tasks	No duplicate protection; relies on caller
Secrets Manager	YAML config file with API keys	Should migrate to Secrets Manager

Pre-Deployment Architectural Review¶

P0 — Blockers¶

1. No API Authentication¶

API Gateway endpoint has no authentication — no API key, IAM auth, or Cognito authorizer. Any caller with the URL can trigger ECS tasks.

Fix: Add IAM authorization or API key requirement at minimum.

2. Hardcoded Credentials in Docker Build¶

CDK reads AWS credentials from the local profile at synthesis time and passes them as Docker build args. Build args are recorded in the image history, so the credentials can persist in the built image and its layer cache.

Fix: Use IAM task roles exclusively (already configured). Remove credential injection from Dockerfile and CDK.

3. S3FullAccess IAM Policy¶

ECS task role has AmazonS3FullAccess — allows read/write/delete on ALL S3 buckets in the account.

Fix: Scope to specific bucket ARN: arn:aws:s3:::{bucket}/* with only s3:GetObject, s3:PutObject, s3:DeleteObject.

P1 — High Priority¶

4. No Retry or Error Recovery¶

ECS task failures are terminal — no retry mechanism, no DLQ, no error event. The only error handling is sending an email in production.

Fix: Detect task failure (for example, with an EventBridge rule on ECS task state change events, or by polling task status) and publish an SNS error event on failure. Consider Step Functions for retry orchestration.

5. No Idempotency¶

Same request can trigger multiple ECS tasks processing the same document concurrently, potentially corrupting the PDF.

Fix: Add a processing lock (DynamoDB conditional write or Nextpoint API processing_in_nge flag check before launching task).
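
The locking idea can be sketched with an in-memory map standing in for the lock store. In the sketch below, putIfAbsent plays the role of a DynamoDB conditional write that succeeds only when no item exists for the exhibit (for example, with an attribute_not_exists condition). The class name, the use of exhibit ID as the lock key, and the task ID as owner are assumptions; a real lock would also need an expiry so that a crashed task does not block the exhibit indefinitely.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for a conditional-write processing lock. */
final class ProcessingLockSketch {

    // Exhibit ID -> ID of the task currently processing it.
    private final ConcurrentHashMap<String, String> locks = new ConcurrentHashMap<>();

    /** Acquires the lock only if no task holds it, like a conditional put that fails on an existing item. */
    boolean tryAcquire(String exhibitId, String taskId) {
        return locks.putIfAbsent(exhibitId, taskId) == null;
    }

    /** Releases the lock only if it is held by the given task. */
    boolean release(String exhibitId, String taskId) {
        return locks.remove(exhibitId, taskId);
    }

    public static void main(String[] args) {
        ProcessingLockSketch lock = new ProcessingLockSketch();
        System.out.println(lock.tryAcquire("exhibit-1", "task-a")); // first caller wins
        System.out.println(lock.tryAcquire("exhibit-1", "task-b")); // duplicate request is refused
    }
}
```

Checking the lock in the Lambda before calling RunTask avoids paying for a Fargate task that would only discover the conflict after startup.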

6. Minimal Test Coverage¶

Only one placeholder test (testApp() returns true). No PDF manipulation tests, no S3 mock tests, no API client tests.

7. Low Resource Allocation¶

256 CPU / 512 MB may be insufficient for large PDFs with many pages. By default, PDFBox buffers the whole document in main memory, so a 500-page PDF could cause an out-of-memory failure.

Fix: Make CPU/memory configurable per environment. Consider 1024 CPU / 2048 MB for production.

P2 — Medium Priority¶

8. Single-Use S3 Client¶

s3Service.java creates a new S3Client per operation instead of reusing a long-lived client. This adds connection setup overhead to every S3 call.

9. No CloudWatch Metrics or Alarms¶

No custom metrics or alarms exist for job duration or failure rates.

10. Synchronous API Could Timeout¶

API Gateway has a 29-second default integration timeout. The Lambda only launches the ECS task and returns the task ARN without waiting for completion, so the request itself stays within that limit, but the caller has no built-in way to check whether the job succeeded.

Consider: Adding a status endpoint or SNS notification on completion.

Architecture Compliance Summary¶

Requirement	Status	Details
Event-driven via SNS	N/A	Synchronous API by design, not a pattern violation
Idempotent handlers	❌ FAIL	No duplicate protection
Exception hierarchy	N/A	Java ECS task, not Python Lambda
Multi-tenant DB	✅ OK	Via Nextpoint API (case_id per request)
No secrets in code	⚠️ WARN	YAML config file, not Secrets Manager
IAM least privilege	❌ FAIL	S3FullAccess on task role
API authentication	❌ FAIL	No auth on API Gateway
Testing	❌ FAIL	Placeholder test only

Lessons Learned¶

API Gateway → Lambda → ECS is a viable pattern for one-shot jobs — Lambda validates and launches, ECS does the heavy lifting. Decouples request handling from processing.
Hyland extractor reuse across modules reduces duplication — same native libraries used in documentextractor and documentpageservice for file conversion and OCR.
PDFBox is sufficient for page-level operations — doesn't need Nutrient/PSPDFKit for basic reorder/rotate/split/merge operations. Nutrient is only needed for annotation-aware processing.
Fire-and-forget ECS tasks need status tracking — without a callback or polling mechanism, the caller has no way to know if the job succeeded or failed.
Low resource allocation works for simple operations — 256 CPU / 512 MB handles small PDFs, but production workloads with large documents will need higher limits.
