Reference Implementation: documentextractor

Overview

The documentextractor module is the core extraction service for Nextpoint's Next Gen Engine (NGE). It takes raw files (emails, PSTs, documents, loadfiles) and extracts content: text, PDFs, page images, metadata, and attachments.

This is the largest NGE module, using a hybrid Java/Kotlin + TypeScript architecture: Java ECS workers perform extraction, while TypeScript Lambda handlers orchestrate jobs and manage worker state via DynamoDB.

Like documentexporter, this module predates the standard Python architecture patterns. It aligns with the event-driven principle (uses SNS) but diverges significantly in language, compute model, and state management.

Pattern Mapping

| Pattern | Status | documentextractor Implementation |
| --- | --- | --- |
| Hexagonal boundaries | Diverges | Flat shared/ directory in TypeScript handlers. Java worker has component separation (worker, plugins, common-models) but not core/shell |
| Exception hierarchy | Diverges | Uses task result status codes (SUCCESS, PASSWORD_PROTECTED, INVALID_FILE, etc.) and problem codes (PDF_UNAVAILABLE, TEXT_UNAVAILABLE) instead of Recoverable/Permanent/Silent exceptions |
| SNS events | Aligns | Central topic {prefix}-extraction-task-status with 13 event types. MessageAttributes include eventType, caseId, batchId for filter-based routing |
| SQS handler | Partial | TypeScript Lambda handlers consume SQS (extraction-status-handler, failed-task-handler) with batch item failures. But Java workers consume directly from SQS, not via Lambda |
| Checkpoint pipeline | Diverges | No database-backed checkpoint state machine. Task pipeline is implicit via task dependencies — each extraction task can spawn child tasks (unpack → extract-text → render-pdf) |
| Database sessions | Not used | No RDS. Uses DynamoDB for worker state management. SQLAlchemy not involved |
| Retry/resilience | Partial | Exponential backoff for service claim (5 attempts, 250ms base, jitter). SQS visibility timeout extension for long tasks. PDF retry without OCR on failure. Dual-queue cascading failover (small → large → DLQ) |
| Idempotency | Partial | DynamoDB conditional updates for service claims (prevents race conditions). Job-add is idempotent (checks if jobId already in set). No checkpoint-based distributed lock |
| Multi-tenancy | Aligns | Per-case ECS service assignment. Queue names include caseId. S3 paths per case. Single ECS service serves one case at a time |
| Job processor orchestration | Aligns (different implementation) | Dynamic per-job queue creation (small + large + loader + uploader). Job completion with queue cleanup. Service rotation between jobs |
| CDK infrastructure | Partial | Multi-stack (API stack, Worker stack, SNS stack). ECS Fargate for workers. API Gateway + Lambda for orchestration. DynamoDB for state |
| Config management | Partial | TypeScript config with environment/region matrix. No Python CONFIG_MAP or Secrets Manager caching |
| Structured logging | Partial | Context-aware logger with jobId/caseId/batchId. TypeScript implementation, not Python JSON formatter |
| ORM models | Not used | No SQLAlchemy. DynamoDB for worker state, S3 for task metadata |

Architecture

documentextractor/
├── infrastructure/
│   ├── api/import/
│   │   ├── create/index.ts                  # POST /import — create job, assign worker
│   │   ├── delete/index.ts                  # DELETE /import — cancel job
│   │   ├── worker-finished-handler/index.ts # CloudWatch alarm → job completion
│   │   ├── extraction-status-handler/index.ts # SNS → Rails API status relay
│   │   ├── failed-task-handler/index.ts     # DLQ → error handling + retry
│   │   └── shared/
│   │       ├── worker-state.ts              # DynamoDB service registry
│   │       ├── task-pool.ts                 # Priority-based job scheduling
│   │       ├── job-completion.ts            # Multi-stage completion flow
│   │       ├── task-sns-events.ts           # SNS event types and schemas
│   │       ├── queue-names.ts               # Queue naming conventions
│   │       ├── worker-task.ts               # Task interface definitions
│   │       ├── sns-helper.ts                # SNS publishing
│   │       ├── sqs-helper.ts                # SQS operations
│   │       ├── s3-helper.ts                 # S3 operations
│   │       ├── cloudwatch-helper.ts         # Metrics and alarms
│   │       └── job-settings.ts              # Job/task settings keys
│   └── lib/
│       ├── api-stack.ts                     # Lambda + API Gateway CDK
│       ├── worker-stack.ts                  # ECS Fargate + Auto Scaling CDK
│       ├── sns-stack.ts                     # SNS topic CDK
│       ├── psm-stack.ts                     # PSM: Firehose → S3 Parquet pipeline
│       ├── psm-roles-stack.ts               # PSM: S3 bucket, Glue DB/table, IAM roles
│       ├── psm-monitoring-stack.ts          # PSM: CloudWatch alarms for pipeline health
│       └── helpers/psm-resource-names.ts    # PSM: naming conventions
├── components/
│   ├── extraction-worker/                   # Java — ECS worker process
│   │   ├── src/main/java/.../worker/
│   │   │   ├── Worker.java                  # Main worker loop
│   │   │   ├── queues/TaskPool.java         # Priority scheduling (Java mirror)
│   │   │   └── threadpool/TaskThreadPool.java
│   │   └── Docker/Dockerfile                # Worker image with Hyland native libs
│   ├── extractor-plugins/
│   │   ├── hyland/                          # Hyland Document Filters v11 (SDK 25.3.0) — current
│   │   │   ├── src/.../hyland/
│   │   │   │   ├── HylandTaskHandler.java   # Main task routing
│   │   │   │   ├── HylandTaskHandlerFactory.java  # SPI factory
│   │   │   │   ├── DocumentFiltersWrapper.java    # Lazy-loading singleton API wrapper
│   │   │   │   ├── TextExtractor.java       # Native text extraction
│   │   │   │   ├── PdfRenderer.java         # PDF rendering
│   │   │   │   ├── ImageRenderer.java       # Page image rendering
│   │   │   │   ├── DocumentUnpacker.java    # Archive/mailbox unpacking
│   │   │   │   ├── MetadataProcessor.java   # Metadata extraction
│   │   │   │   └── api/
│   │   │   │       ├── HylandErrorCodes.java
│   │   │   │       └── HylandDocumentTypes.java
│   │   │   ├── config/{local,arm,x86}/ISYS11df.ini  # Per-architecture config
│   │   │   └── lib/ISYS11df.jar             # Hyland Java API
│   │   └── hyland23/                        # Hyland Document Filters v23 (SDK 23.3.3) — legacy
│   │       ├── src/.../hyland23/
│   │       │   ├── Hyland23TaskHandlerFactory.java
│   │       │   ├── Hyland23PstUnpacker.java # PST-specific unpacking
│   │       │   └── DocumentFiltersWrapper.java
│   │       └── lib/ISYS11df.jar
│   ├── loadfile-extractor/                  # Java — loadfile CSV parsing + import
│   │   └── src/.../loadfiles/
│   │       ├── LoadFileExtractor.java       # CSV parsing, field mapping, fan-out
│   │       └── LoadFileTaskHandler.java     # SPI task handler
│   ├── nist-database/                       # Java — NIST hash lookup (deNIST)
│   │   └── src/.../nist/
│   │       ├── NistDatabase.java            # Interface
│   │       └── NistS3Database.java          # S3-backed MD5 existence check
│   ├── lef/                                 # Java — LEF transcript/deposition parser
│   │   └── src/.../lef/LefContentParser.java
│   ├── common-models/                       # Java — shared data models
│   │   └── src/main/java/.../v0/common/
│   │       ├── WorkerTask.java              # Task definition
│   │       ├── NpIdentifiers.java           # Case/batch/job IDs
│   │       └── status/ExtractionStatusEvent.java
│   └── status-events/                       # Java — status publishing
│       └── src/main/java/.../statusevents/
│           └── Status.java                  # Status constants
└── pipeline/                                # CI/CD pipeline CDK

Event Flow

Nextpoint Backend (API call)
CreateImportHandler (Lambda)
  ├─ Validate import request (FILES/MAILBOX/LOADFILE/TRANSCRIPT)
  ├─ Create SQS queues: small + large task queues, loader, uploader
  ├─ Populate initial tasks (job-start notification + intake task)
  ├─ Claim ECS service (DynamoDB conditional update, 5 retries)
  │   ├─ Reuse: case already has service → add job to existing service
  │   └─ New: find FREE service → claim with condition check
  └─ Publish JOB_QUEUED event via SNS
Java ECS Worker (long-running container)
  ├─ Poll task queues via priority-based TaskPool
  ├─ Execute extraction plugin for task type
  │   ├─ intake-mailbox → unpack PST/MBOX → child tasks
  │   ├─ intake-document → extract single file → child tasks
  │   ├─ extract-text → native text extraction
  │   ├─ text-from-image → OCR single image
  │   ├─ render-pdf → convert to PDF
  │   └─ ... (15+ task types)
  ├─ Publish DOCUMENT_PROCESSED, TASK_FINISHED events
  └─ Extend SQS visibility timeout for long tasks
        ▼ (on scale-in alarm)
WorkerFinishedHandler (Lambda, triggered by CloudWatch)
  ├─ Parse alarm dimensions → extract serviceId
  ├─ Query DynamoDB for service's current job
  └─ Trigger JobCompletionHelper
JobCompletionHelper
  ├─ Check all job queues for remaining messages
  ├─ If messages exist: adjust task count, return (not done)
  ├─ If empty: wait 1 minute, re-check (SQS consistency)
  ├─ Delete job queues, retrieve batchId from tags
  ├─ Find next job via TaskPool priority algorithm
  │   ├─ Next job found → rotate service to next job
  │   └─ No jobs left → release service (scale ECS to 0, FREE in DynamoDB)
  └─ Publish JOB_FINISHED event via SNS
        ▼ (on task failure)
FailedTaskHandler (Lambda, triggered by DLQ)
  ├─ Route by task type:
  │   ├─ PDF render failed → retry without OCR (re-queue)
  │   ├─ OCR inline images → graceful degradation (WARNING, not ERROR)
  │   ├─ Text extraction → TEXT_UNAVAILABLE problem code
  │   └─ Unpack/intake → FILE_MISSING_OR_INCOMPLETE
  ├─ Publish TASK_FINISHED error event
  └─ Conditionally publish DOCUMENT_PROCESSED (if document still needs completion)

Key Design Decisions

DynamoDB Worker State Registry

ECS services are tracked in a DynamoDB table as a worker pool:

| Field | Type | Purpose |
| --- | --- | --- |
| ServiceId | Number (PK) | ECS service identifier |
| Status | String (GSI) | FREE or BUSY |
| CaseId | String (GSI) | Currently assigned case |
| CurrentJobId | String | Active job being processed |
| JobIds | String Set | All jobs assigned to this service |
| WorkerLimit | Number | Max concurrent tasks |
| ServiceArn | String | ECS service ARN |
| ScalingResourceId | String | Auto Scaling resource ID |

Service claim uses DynamoDB conditional updates (status=FREE → BUSY) with exponential backoff retry (5 attempts, 250ms base, random jitter) to handle race conditions when multiple imports compete for services.

Multi-job per service: A service can hold multiple jobs for the same case, rotating between them via priority ordering. This avoids cold-start overhead when the same case has concurrent imports.
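
The claim loop can be reduced to the following TypeScript sketch. This is illustrative, not the module's actual code: the DynamoDB conditional update (Status FREE → BUSY, failing the condition check when another import wins the race) is abstracted behind an injected `tryClaim` function so the retry logic stands alone.

```typescript
// Hypothetical reduction of the service-claim retry described above.
// `tryClaim` stands in for a DynamoDB UpdateItem with a ConditionExpression;
// it resolves false when the conditional check fails (service already BUSY).
type ClaimFn = () => Promise<boolean>;

async function claimServiceWithBackoff(
  tryClaim: ClaimFn,
  attempts = 5,
  baseMs = 250,
): Promise<boolean> {
  for (let i = 0; i < attempts; i++) {
    if (await tryClaim()) return true; // conditional update succeeded
    if (i === attempts - 1) break;     // no sleep after the final attempt
    const backoff = baseMs * 2 ** i;        // 250, 500, 1000, ... ms
    const jitter = Math.random() * baseMs;  // random jitter to de-correlate racers
    await new Promise((r) => setTimeout(r, backoff + jitter));
  }
  return false; // every attempt was contended — caller surfaces the failure
}
```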

Priority-Based Job Scheduling

The TaskPool algorithm combines job priority + queue creation timestamp into a single sort key:

order = (priority << 56) | timestamp
  • Lower priority number = higher priority (default: 50)
  • Timestamp prevents starvation of older lower-priority jobs
  • Algorithm is mirrored in both TypeScript (Lambda handlers) and Java (ECS workers)

Workers rotate through jobs by: complete current job → find next by priority → rotate service state → adjust task count.
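
A minimal TypeScript sketch of the composite key (the real algorithm exists in both TypeScript and Java; the job shape here is illustrative). Note BigInt is required in JavaScript — the numeric bitwise operators truncate to 32 bits and would discard both fields:

```typescript
// Composite sort key: (priority << 56) | timestamp, as described above.
function jobOrder(priority: number, timestampMs: number): bigint {
  return (BigInt(priority) << 56n) | BigInt(timestampMs);
}

// Lower priority number sorts first; among equal priorities the older
// queue (smaller creation timestamp) wins, preventing starvation.
function nextJob<T extends { priority: number; createdMs: number }>(
  jobs: T[],
): T | undefined {
  return [...jobs].sort((a, b) => {
    const d = jobOrder(a.priority, a.createdMs) - jobOrder(b.priority, b.createdMs);
    return d < 0n ? -1 : d > 0n ? 1 : 0;
  })[0];
}
```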

Dual-Queue with Cascading Failover

Each job gets two task queues:

{prefix}_nptasks_{caseId}_{jobId}_small   (quick tasks, single files)
{prefix}_nptasks_{caseId}_{jobId}_large   (long tasks, PST extraction, etc.)

Failover cascade:

  1. Task fails twice on small queue → moved to large queue
  2. Task fails twice on large queue → moved to shared DLQ
  3. DLQ triggers FailedTaskHandler → task-specific error handling or retry
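
The cascade amounts to a small routing rule, sketched here in TypeScript (tier and threshold names are illustrative, not the module's actual identifiers):

```typescript
// Route a failed task to its next tier: two failures on the current
// queue escalate it — small -> large -> shared DLQ.
type Tier = "small" | "large" | "dlq";

function nextTier(current: Tier, failuresOnTier: number): Tier {
  if (failuresOnTier < 2) return current; // stay on this queue and retry
  if (current === "small") return "large"; // escalate to more time/resources
  return "dlq"; // large queue exhausted -> FailedTaskHandler takes over
}
```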

Plugin-Based Extraction

The Java worker uses a plugin architecture for extraction. Each task type maps to a handler plugin:

| Task Type | Purpose |
| --- | --- |
| intake-mailbox | PST/MBOX/OST unpacking |
| intake-document | Individual file intake |
| intake-loadfile | Load file CSV processing |
| unpack-document | Container unpacking |
| unpack-mailbox | Email container unpacking |
| extract-text | Native text extraction |
| text-from-image | Single-image OCR |
| text-from-image-list | Multi-image OCR |
| ocr-inline-images | Embedded image OCR (graceful failure) |
| ocr-page-images | Page-level OCR |
| render-pdf | Native → PDF conversion |
| combine-page-images | Page image assembly |
| render-page-images | Page rendering |
| render-xlsx | Excel rendering |

Intake tasks spawn child tasks (unpack → extract-text → render-pdf), creating an implicit task DAG.

Hyland Document Filters Integration

The extraction engine is powered by Hyland Document Filters — a commercial SDK for document parsing, text extraction, image rendering, and format conversion. Loaded via Java SPI (META-INF/services/TaskHandlerFactory).

Two SDK versions coexist:

| Version | SDK | Release | Role | Why |
| --- | --- | --- | --- | --- |
| v11 | 25.3.0 | Current | Primary extraction engine | Handles all task types including mailbox unpacking |
| v23 | 23.3.3 | Legacy | PST-only fallback | Retained because v25.3.0 has a bug extracting MSG files with attachments >8MB from PST and other mailboxes |

Core components (v11 plugin):

  • DocumentFiltersWrapper.java — Lazy-loading singleton wrapping the com.perceptive.documentfilters.* API. Initialized once per worker lifetime
  • HylandTaskHandler.java — Routes task types to specialized handlers: TextExtractor, PdfRenderer, ImageRenderer, DocumentUnpacker, MetadataProcessor
  • HylandTaskHandlerFactory.java — SPI factory, registered in META-INF/services/com.nextpoint.documentextractor.v0.interfaces.TaskHandlerFactory

Supported task types via Hyland:

  • Text extraction (native + OCR via Tesseract)
  • PDF rendering from any supported format
  • Page image rendering (PNG output)
  • Archive/mailbox unpacking (ZIP, RAR, 7Z, PST, MBOX, OST, MSG)
  • Metadata extraction
  • Excel rendering

Native library deployment:

  • Architecture-specific builds: ARM64 (linux-aarch64-gcc-64), x86_64 (linux-intel-gcc-64)
  • Loaded via JNI: LD_LIBRARY_PATH=/worker/extractors/hyland
  • Docker image bundles native libs, fonts, and per-architecture config (ISYS11df.ini)

Configuration (ISYS11df.ini):

  • OCR engine: Tesseract
  • DPI: 96 default, AUTO for images
  • Excel extraction mode: CSV
  • PDF/TIFF compression settings
  • Image enumeration enabled
  • Spreadsheet page limit: 10,000

Worker Loop and Graceful Shutdown

The Java worker (Worker.java) runs a continuous polling loop:

  1. Capacity-gated polling: capacityTracker.maximumTaskSize() blocks via CountDownLatch when no thread capacity is available. Only polls SQS when a slot opens.
  2. Exponential backoff on idle: After 3+ consecutive empty polls with full capacity available, the worker exits gracefully (triggers scale-in).
  3. Zombie queue detection: If a queue is deleted mid-processing (e.g., job cancelled), the worker retries once to find other queues before giving up.
  4. SIGTERM shutdown hook: Runtime.getRuntime().addShutdownHook() stops accepting new tasks immediately but allows in-flight tasks to finish. Logs total execution time and capacity usage percentage on exit.

Capacity Tracking and Worker Lifetime

Each worker has a dual capacity model:

| Capacity Type | Default | Purpose |
| --- | --- | --- |
| Per-task (concurrent) | 8 slots | Small tasks = 1 slot, large tasks = 5 slots (or 8 for spot) |
| Lifetime (total) | 600 units | Worker auto-terminates after ~120 large or ~600 small tasks |
  • Spot instance restrictions: Spot workers can only process certain task types (OCR, text extraction, PDF) — not mailbox unpacking (too long, risk of spot termination mid-PST)
  • Metrics thread: Virtual thread periodically reports WorkerCapacityUsage to CloudWatch for scaling decisions
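
The dual model can be sketched as a small accounting class (a hypothetical TypeScript reduction — the real tracker is Java, gating admission with a CountDownLatch; names here are assumptions):

```typescript
// Illustrative dual-capacity accounting: concurrent slots gate admission,
// lifetime units force eventual worker recycling.
class CapacityTracker {
  private inFlight = 0;
  private lifetimeUsed = 0;
  constructor(private slots = 8, private lifetime = 600) {}

  // A task is admitted only if a concurrent slot is free AND lifetime remains.
  tryAcquire(cost: number): boolean {
    if (this.inFlight + cost > this.slots) return false;
    if (this.lifetimeUsed + cost > this.lifetime) return false;
    this.inFlight += cost;
    this.lifetimeUsed += cost; // lifetime is never given back
    return true;
  }

  release(cost: number): void {
    this.inFlight -= cost; // frees the concurrent slot only
  }

  shouldRetire(): boolean {
    return this.lifetimeUsed >= this.lifetime; // worker exits, container recycles
  }
}
```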

SQS Visibility Timeout Extension

Rather than setting a large visibility timeout upfront (which delays retries on crash), the worker incrementally extends it:

  • Checks every 5 minutes (VISIBILITY_CHECKPOINT_INCREMENT)
  • Extends by actual elapsed time + 30-second safety padding
  • If the worker crashes without catching the exception, the message becomes visible again in ~5 minutes (not the full task timeout)
  • Uses Java virtual threads for the visibility monitor to avoid exhausting the platform thread pool

Task Timeout and Retry Escalation

Workers enforce timeouts via Future.get(timeout, TimeUnit.MILLISECONDS):

| Failure | Action |
| --- | --- |
| OutOfMemoryError | Requeue as LARGE task, terminate worker |
| TimeoutException on SMALL task | Requeue as LARGE task (more time/resources) |
| TimeoutException on LARGE task | Fail permanently |
| PDF render failure | Retry once with NO_OCR=true (OCR is common failure source) |
| RestartTaskException | Restart immediately on same worker (no requeue) |

Max retries per task: 1. Combined with the dual-queue failover (small → large → DLQ), a task gets multiple chances across different resource tiers.
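
The escalation rules above can be encoded as a decision table. This is an illustrative TypeScript sketch, not the worker's actual API (the real handling lives in the Java task thread pool):

```typescript
// Decision-table sketch of the failure/action mapping described above.
type QueueSize = "SMALL" | "LARGE";
type Failure = "OOM" | "TIMEOUT" | "PDF_RENDER" | "RESTART";
type Action =
  | { kind: "requeue"; as: "LARGE" }
  | { kind: "retry"; settings?: { NO_OCR: true } }
  | { kind: "fail" };

function onFailure(failure: Failure, queue: QueueSize): Action {
  switch (failure) {
    case "OOM":
      return { kind: "requeue", as: "LARGE" }; // worker also terminates
    case "TIMEOUT":
      return queue === "SMALL"
        ? { kind: "requeue", as: "LARGE" } // escalate to more time/resources
        : { kind: "fail" };                // LARGE timeout fails permanently
    case "PDF_RENDER":
      return { kind: "retry", settings: { NO_OCR: true } }; // OCR is the usual culprit
    case "RESTART":
      return { kind: "retry" }; // same worker, no requeue
  }
}
```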

NIST Hash Filtering (deNIST)

Known system files are filtered via NIST National Software Reference Library:

  • S3-backed hash database: Bucket nextpoint-dm-nsrl-hash stores MD5 hashes as S3 object keys
  • Lookup: s3.exists(NIST_BUCKET, md5.toUpperCase()) — simple existence check
  • On match: Throws DeNistException, caught at worker level — file silently skipped with success status (not counted as an error)
  • Prevents indexing of OS files, common software, and other known non-responsive files in eDiscovery

Loadfile Processing

The loadfile extractor (LoadFileExtractor.java) handles bulk imports via CSV:

  • Duplicate protection: Before processing, checks for tracker file with MD5 hash of loadfile path in S3. If exists, skips (idempotent)
  • Parallel inventory: Uses CompletableFuture to asynchronously build image and text file inventories while parsing the CSV
  • Field mapping: Generates normalized nextpoint_load_file.csv when custom field mappings are provided, backs up original
  • Task fan-out: Creates UNPACK_LOADFILE_RECORD child tasks for each parsed record with metadata
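
The duplicate-protection check reduces to an MD5-keyed existence test. A sketch, assuming the tracker key is simply the MD5 hex digest of the loadfile path (the exact S3 key layout is not shown above, and `existsInS3` is an injected stand-in for an S3 HeadObject call):

```typescript
import { createHash } from "node:crypto";

// Derive the idempotency tracker key from the loadfile's path.
function trackerKey(loadfilePath: string): string {
  return createHash("md5").update(loadfilePath).digest("hex");
}

// Skip processing when a tracker file already exists for this loadfile.
async function alreadyProcessed(
  loadfilePath: string,
  existsInS3: (key: string) => Promise<boolean>,
): Promise<boolean> {
  return existsInS3(trackerKey(loadfilePath));
}
```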

Rails Backend Communication (HMAC-SHA1)

The extraction-status-handler relays SNS events to the Rails backend:

  • Webhook callback: POST /batches/update/{batchId} to Rails API endpoint
  • HMAC-SHA1 signature: Signs [HTTP_METHOD, API_PATH, ISO_DATE] joined with newlines, Base64-encoded in API-Authorization header
  • Status mapping: Event types → Rails batch statuses (WORKER_STARTED→processing, JOB_FINISHED→extractor_complete, etc.)
  • Per-environment endpoints: Each environment has its own Rails API URL
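
Signature construction can be sketched with Node's crypto module (the secret's provenance and the exact header transport are assumptions beyond what is stated above):

```typescript
import { createHmac } from "node:crypto";

// HMAC-SHA1 over [HTTP method, API path, ISO date] joined with newlines,
// Base64-encoded — the value sent in the API-Authorization header.
function signRequest(
  secret: string,
  method: string,
  apiPath: string,
  isoDate: string,
): string {
  const payload = [method, apiPath, isoDate].join("\n");
  return createHmac("sha1", secret).update(payload).digest("base64");
}

// e.g. headers["API-Authorization"] =
//   signRequest(secret, "POST", "/batches/update/123", new Date().toISOString());
```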

CloudWatch-Driven Scaling

ECS service scaling uses CloudWatch metrics:

  • WorkerCapacityUsage: Per-service utilization metric (dimension: ServiceId)
  • WorkerPendingWorkload: Pending task count
  • Scale-in alarm triggers WorkerFinishedHandler → job completion flow
  • Neutral utilization metric (0.65) placed after state changes to prevent false scale-in

PSM — Processing Status Monitor

PSM is an event analytics pipeline that captures all extraction status events into a queryable data lake for monitoring, debugging, and operational visibility.

Data flow:

SNS (extraction-task-status topic)
  → Kinesis Firehose (SNS subscription, raw message delivery)
    → JQ metadata extraction (caseId, batchId for partitioning)
      → JSON → Parquet conversion (via Glue schema)
        → S3 (Hive-partitioned by caseId/batchId)
          → Queryable via Athena

Three CDK stacks:

| Stack | Purpose |
| --- | --- |
| PsmRolesStack | S3 bucket (nge-psm-statuses), Glue database + table, Firehose IAM role, CloudWatch log group |
| PsmStack | Firehose delivery stream subscribed to SNS topic. 64MB/60s buffering, dynamic partitioning via JQ, JSON→Parquet format conversion |
| PsmMonitoringStack | Four CloudWatch alarms monitoring pipeline health |

Glue table schema:

| Column | Type | Purpose |
| --- | --- | --- |
| jobId | string | Extraction job identifier |
| eventType | string | JOB_QUEUED, DOCUMENT_PROCESSED, etc. |
| eventDetail | string | JSON details (problem codes, file paths) |
| status | string | SUCCESS or ERROR |
| documentId | string | Document identifier |
| timestamp | timestamp | Event timestamp |
| source | string | Always "extractor" |
| exhibitIds | string | Associated exhibit identifiers |

Partition keys: caseId (int), batchId (int) — enables efficient per-case and per-batch queries without scanning the entire dataset.

S3 layout:

s3://{bucket}/statuses/caseid={caseId}/batchid={batchId}/   (Parquet files)
s3://{bucket}/errors/{error-type}/{yyyy}/{MM}/{dd}/          (delivery failures)

Pipeline health alarms (PsmMonitoringStack):

| Alarm | Threshold | Meaning |
| --- | --- | --- |
| Throttling | >90% of records/sec limit | Firehose is near capacity |
| Data freshness | >300 seconds | Events are delayed reaching S3 |
| Delivery ratio | ≤1 (delivered/incoming bytes) | Data loss or backpressure |
| Failed conversions | >5% of records | JSON→Parquet conversion failures |

All alarms fire to a dedicated SNS topic (nge-psm-statuses-alarm) for operational alerting.

Task-Specific Error Recovery

The FailedTaskHandler implements targeted recovery strategies:

  • PDF render failure with OCR: Re-queue task with NO_OCR setting to retry without OCR (which is a common failure source)
  • OCR inline images: Downgrade to WARNING (non-critical — document still usable)
  • Text extraction: Mark TEXT_UNAVAILABLE but still complete document processing
  • Unpack/intake: Mark FILE_MISSING_OR_INCOMPLETE

SNS Event Types

type EventType =
  | 'JOB_QUEUED'           // Import request received, queues created
  | 'JOB_STARTED'          // Worker began processing
  | 'JOB_FINISHED'         // All tasks complete, queues deleted
  | 'DOCUMENT_ADDED'       // New document discovered (e.g., from PST)
  | 'DOCUMENT_PROCESSED'   // Document fully extracted
  | 'TASK_ADDED'           // Extraction task queued
  | 'TASK_FINISHED'        // Extraction task completed or failed
  | 'WORKER_STARTED'       // ECS worker container started
  | 'WORKER_FINISHED'      // ECS worker scaling down
  | 'ERROR'                // Extraction error
  | 'WARNING'              // Non-fatal issue (e.g., OCR inline failure)
  | 'NOTIFICATION'         // Informational
  | 'IMPORT_CANCELLED'     // Job cancelled by user

Message attributes: eventType, caseId, batchId — used by SNS filter policies to route events to downstream queues (doc_loader, doc_uploader).

Import Types

| Type | Trigger | Initial Task |
| --- | --- | --- |
| FILES | Individual file import | intake-document per file |
| MAILBOX | PST/MBOX/OST import | intake-mailbox |
| LOADFILE | Load file CSV with field mapping | intake-loadfile |
| TRANSCRIPT | Deposition transcripts (LEF format) | unpack-preprocessed-container |
| MANUAL | Auto-detected → FILES or LOADFILE | Depends on detection |

Queue Naming Conventions

| Queue | Name Pattern | Purpose |
| --- | --- | --- |
| Small tasks | {prefix}_nptasks_{caseId}_{jobId}_small | Quick extraction tasks |
| Large tasks | {prefix}_nptasks_{caseId}_{jobId}_large | Long-running extraction tasks |
| Failed tasks | {prefix}-failed-extraction-tasks | Shared DLQ for all jobs |
| Doc loader | {prefix}_doc_loader_{caseId}_{batchId} | Downstream: document loading |
| Doc uploader | {prefix}_doc_uploader_{caseId}_{batchId} | Downstream: document uploading |

Tags on queues: NextpointJobId, NextpointBatchId, NextpointJobPriority, TopicSubscriptionArn.
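
The naming patterns above are mechanical enough to express as builders; a minimal sketch (function names are illustrative, and `prefix` is the per-environment prefix):

```typescript
// Queue-name builders following the documented patterns.
type QueueSize = "small" | "large";

const taskQueue = (prefix: string, caseId: number, jobId: number, size: QueueSize) =>
  `${prefix}_nptasks_${caseId}_${jobId}_${size}`;

const failedTasksQueue = (prefix: string) => `${prefix}-failed-extraction-tasks`;

const docLoaderQueue = (prefix: string, caseId: number, batchId: number) =>
  `${prefix}_doc_loader_${caseId}_${batchId}`;
```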

Divergences from Standard Architecture

1. Java/Kotlin + TypeScript (Not Python)

The extraction worker is Java/Kotlin (ECS Fargate); orchestration is TypeScript (Lambda). The standard architecture uses Python throughout. This was driven by the Hyland Document Filters SDK being a Java/JNI library, plus the Java ecosystem's availability of document-processing tooling (Tesseract bindings, PST parsing). The choice predates the Python architecture standardization.

2. DynamoDB for State (Not RDS)

Worker state lives in DynamoDB instead of RDS/MySQL. This makes sense for the worker pool use case — high-throughput conditional updates, no relational queries needed. But it means the module doesn't share the multi-tenant per-case database pattern.

3. No Hexagonal Boundaries

TypeScript shared/ directory mixes domain logic (task-pool.ts priority algorithm, job-completion.ts flow) with infrastructure (worker-state.ts DynamoDB, sqs-helper.ts). The standard pattern would separate these.

4. Implicit Task DAG (Not Checkpoint Pipeline)

Extraction creates a dynamic task graph (intake → unpack → extract → render) rather than a fixed checkpoint pipeline. Each task can spawn child tasks. This is more flexible than the 11-step checkpoint pattern but harder to resume mid-pipeline.

5. CloudWatch Alarm-Driven Lifecycle

Job completion is triggered by CloudWatch scale-in alarms, not by explicit completion events. This couples job lifecycle to infrastructure scaling signals.

6. Dual-Language Priority Algorithm

The TaskPool priority algorithm is implemented in both TypeScript and Java, creating a maintenance burden. Changes must be synchronized across languages.

Lessons Learned

  1. DynamoDB conditional updates are effective for worker pools — The claim/release pattern with conditional checks handles race conditions without distributed locks.

  2. Dual-queue (small/large) improves throughput — Quick tasks don't get stuck behind long PST extractions. Cascading failover provides graceful degradation.

  3. Task-specific error recovery reduces false failures — Retrying PDF renders without OCR, and downgrading OCR inline failures to warnings, significantly reduces false error rates.

  4. 1-minute SQS emptiness re-check is necessary — SQS's eventual consistency means a queue can appear empty momentarily. The re-check prevents premature job completion.

  5. Priority + timestamp ordering prevents starvation — Pure priority ordering starves low-priority jobs. The composite key ensures older jobs eventually get processed.

  6. Neutral utilization metric prevents false scale-in — After state changes, placing a 0.65 utilization metric prevents CloudWatch from triggering premature scale-in alarms.

  7. Incremental visibility extension beats large upfront timeout — Setting a 5-minute checkpoint instead of a 30-minute timeout means crashed tasks retry in minutes, not hours. The 30-second padding prevents race conditions.

  8. Task size escalation is a useful retry strategy — Requeuing a failed SMALL task as LARGE gives it more time and resources. Combined with OCR-disabled retry for PDF rendering, this handles most transient failures automatically.

  9. Worker lifetime limits prevent resource leaks — Capping workers at ~600 tasks forces periodic container recycling, preventing memory leaks and stale state accumulation in long-running JVM processes.

  10. NIST deduplication saves significant processing — Filtering known system files before extraction avoids wasting compute on OS files, common software, and other non-responsive content.

  11. Virtual threads (Java 21) simplify concurrent monitoring — Visibility monitors, metrics publishers, and cleanup agents can each run in their own virtual thread without exhausting the platform thread pool.