Reference Implementation: documentextractor¶
Overview¶
The documentextractor module is the core extraction service for Nextpoint's Next Gen Engine (NGE). It takes raw files (emails, PSTs, documents, loadfiles) and extracts content: text, PDFs, page images, metadata, and attachments.
This is the largest NGE module, built on a hybrid Java/Kotlin + TypeScript architecture: Java ECS workers perform the extraction work, while TypeScript Lambda handlers orchestrate jobs and manage worker state via DynamoDB.
Like documentexporter, this module predates the standard Python architecture patterns. It aligns with the event-driven principle (uses SNS) but diverges significantly in language, compute model, and state management.
Pattern Mapping¶
| Pattern | Status | documentextractor Implementation |
|---|---|---|
| Hexagonal boundaries | Diverges | Flat shared/ directory in TypeScript handlers. Java worker has component separation (worker, plugins, common-models) but not core/shell |
| Exception hierarchy | Diverges | Uses task result status codes (SUCCESS, PASSWORD_PROTECTED, INVALID_FILE, etc.) and problem codes (PDF_UNAVAILABLE, TEXT_UNAVAILABLE) instead of Recoverable/Permanent/Silent exceptions |
| SNS events | Aligns | Central topic {prefix}-extraction-task-status with 13 event types. MessageAttributes include eventType, caseId, batchId for filter-based routing |
| SQS handler | Partial | TypeScript Lambda handlers consume SQS (extraction-status-handler, failed-task-handler) with batch item failures. But Java workers consume directly from SQS, not via Lambda |
| Checkpoint pipeline | Diverges | No database-backed checkpoint state machine. Task pipeline is implicit via task dependencies — each extraction task can spawn child tasks (unpack → extract-text → render-pdf) |
| Database sessions | Not used | No RDS. Uses DynamoDB for worker state management. SQLAlchemy not involved |
| Retry/resilience | Partial | Exponential backoff for service claim (5 attempts, 250ms base, jitter). SQS visibility timeout extension for long tasks. PDF retry without OCR on failure. Dual-queue cascading failover (small → large → DLQ) |
| Idempotency | Partial | DynamoDB conditional updates for service claims (prevents race conditions). Job-add is idempotent (checks if jobId already in set). No checkpoint-based distributed lock |
| Multi-tenancy | Aligns | Per-case ECS service assignment. Queue names include caseId. S3 paths per case. Single ECS service serves one case at a time |
| Job processor orchestration | Aligns (different implementation) | Dynamic per-job queue creation (small + large + loader + uploader). Job completion with queue cleanup. Service rotation between jobs |
| CDK infrastructure | Partial | Multi-stack (API stack, Worker stack, SNS stack). ECS Fargate for workers. API Gateway + Lambda for orchestration. DynamoDB for state |
| Config management | Partial | TypeScript config with environment/region matrix. No Python CONFIG_MAP or Secrets Manager caching |
| Structured logging | Partial | Context-aware logger with jobId/caseId/batchId. TypeScript implementation, not Python JSON formatter |
| ORM models | Not used | No SQLAlchemy. DynamoDB for worker state, S3 for task metadata |
Architecture¶
documentextractor/
├── infrastructure/
│ ├── api/import/
│ │ ├── create/index.ts # POST /import — create job, assign worker
│ │ ├── delete/index.ts # DELETE /import — cancel job
│ │ ├── worker-finished-handler/index.ts # CloudWatch alarm → job completion
│ │ ├── extraction-status-handler/index.ts # SNS → Rails API status relay
│ │ ├── failed-task-handler/index.ts # DLQ → error handling + retry
│ │ └── shared/
│ │ ├── worker-state.ts # DynamoDB service registry
│ │ ├── task-pool.ts # Priority-based job scheduling
│ │ ├── job-completion.ts # Multi-stage completion flow
│ │ ├── task-sns-events.ts # SNS event types and schemas
│ │ ├── queue-names.ts # Queue naming conventions
│ │ ├── worker-task.ts # Task interface definitions
│ │ ├── sns-helper.ts # SNS publishing
│ │ ├── sqs-helper.ts # SQS operations
│ │ ├── s3-helper.ts # S3 operations
│ │ ├── cloudwatch-helper.ts # Metrics and alarms
│ │ └── job-settings.ts # Job/task settings keys
│ └── lib/
│ ├── api-stack.ts # Lambda + API Gateway CDK
│ ├── worker-stack.ts # ECS Fargate + Auto Scaling CDK
│ ├── sns-stack.ts # SNS topic CDK
│ ├── psm-stack.ts # PSM: Firehose → S3 Parquet pipeline
│ ├── psm-roles-stack.ts # PSM: S3 bucket, Glue DB/table, IAM roles
│ ├── psm-monitoring-stack.ts # PSM: CloudWatch alarms for pipeline health
│ └── helpers/psm-resource-names.ts # PSM: naming conventions
├── components/
│ ├── extraction-worker/ # Java — ECS worker process
│ │ ├── src/main/java/.../worker/
│ │ │ ├── Worker.java # Main worker loop
│ │ │ ├── queues/TaskPool.java # Priority scheduling (Java mirror)
│ │ │ └── threadpool/TaskThreadPool.java
│ │ └── Docker/Dockerfile # Worker image with Hyland native libs
│ ├── extractor-plugins/
│ │ ├── hyland/ # Hyland Document Filters v11 (SDK 25.3.0) — current
│ │ │ ├── src/.../hyland/
│ │ │ │ ├── HylandTaskHandler.java # Main task routing
│ │ │ │ ├── HylandTaskHandlerFactory.java # SPI factory
│ │ │ │ ├── DocumentFiltersWrapper.java # Lazy-loading singleton API wrapper
│ │ │ │ ├── TextExtractor.java # Native text extraction
│ │ │ │ ├── PdfRenderer.java # PDF rendering
│ │ │ │ ├── ImageRenderer.java # Page image rendering
│ │ │ │ ├── DocumentUnpacker.java # Archive/mailbox unpacking
│ │ │ │ ├── MetadataProcessor.java # Metadata extraction
│ │ │ │ └── api/
│ │ │ │ ├── HylandErrorCodes.java
│ │ │ │ └── HylandDocumentTypes.java
│ │ │ ├── config/{local,arm,x86}/ISYS11df.ini # Per-architecture config
│ │ │ └── lib/ISYS11df.jar # Hyland Java API
│ │ └── hyland23/ # Hyland Document Filters v23 (SDK 23.3.3) — legacy
│ │ ├── src/.../hyland23/
│ │ │ ├── Hyland23TaskHandlerFactory.java
│ │ │ ├── Hyland23PstUnpacker.java # PST-specific unpacking
│ │ │ └── DocumentFiltersWrapper.java
│ │ └── lib/ISYS11df.jar
│ ├── loadfile-extractor/ # Java — loadfile CSV parsing + import
│ │ └── src/.../loadfiles/
│ │ ├── LoadFileExtractor.java # CSV parsing, field mapping, fan-out
│ │ └── LoadFileTaskHandler.java # SPI task handler
│ ├── nist-database/ # Java — NIST hash lookup (deNIST)
│ │ └── src/.../nist/
│ │ ├── NistDatabase.java # Interface
│ │ └── NistS3Database.java # S3-backed MD5 existence check
│ ├── lef/ # Java — LEF transcript/deposition parser
│ │ └── src/.../lef/LefContentParser.java
│ ├── common-models/ # Java — shared data models
│ │ └── src/main/java/.../v0/common/
│ │ ├── WorkerTask.java # Task definition
│ │ ├── NpIdentifiers.java # Case/batch/job IDs
│ │ └── status/ExtractionStatusEvent.java
│ └── status-events/ # Java — status publishing
│ └── src/main/java/.../statusevents/
│ └── Status.java # Status constants
└── pipeline/ # CI/CD pipeline CDK
Event Flow¶
Nextpoint Backend (API call)
│
▼
CreateImportHandler (Lambda)
├─ Validate import request (FILES/MAILBOX/LOADFILE/TRANSCRIPT)
├─ Create SQS queues: small + large task queues, loader, uploader
├─ Populate initial tasks (job-start notification + intake task)
├─ Claim ECS service (DynamoDB conditional update, 5 retries)
│ ├─ Reuse: case already has service → add job to existing service
│ └─ New: find FREE service → claim with condition check
└─ Publish JOB_QUEUED event via SNS
│
▼
Java ECS Worker (long-running container)
├─ Poll task queues via priority-based TaskPool
├─ Execute extraction plugin for task type
│ ├─ intake-mailbox → unpack PST/MBOX → child tasks
│ ├─ intake-document → extract single file → child tasks
│ ├─ extract-text → native text extraction
│ ├─ text-from-image → OCR single image
│ ├─ render-pdf → convert to PDF
│ └─ ... (15+ task types)
├─ Publish DOCUMENT_PROCESSED, TASK_FINISHED events
└─ Extend SQS visibility timeout for long tasks
│
▼ (on scale-in alarm)
WorkerFinishedHandler (Lambda, triggered by CloudWatch)
├─ Parse alarm dimensions → extract serviceId
├─ Query DynamoDB for service's current job
└─ Trigger JobCompletionHelper
│
▼
JobCompletionHelper
├─ Check all job queues for remaining messages
├─ If messages exist: adjust task count, return (not done)
├─ If empty: wait 1 minute, re-check (SQS consistency)
├─ Delete job queues, retrieve batchId from tags
├─ Find next job via TaskPool priority algorithm
│ ├─ Next job found → rotate service to next job
│ └─ No jobs left → release service (scale ECS to 0, FREE in DynamoDB)
└─ Publish JOB_FINISHED event via SNS
│
▼ (on task failure)
FailedTaskHandler (Lambda, triggered by DLQ)
├─ Route by task type:
│ ├─ PDF render failed → retry without OCR (re-queue)
│ ├─ OCR inline images → graceful degradation (WARNING, not ERROR)
│ ├─ Text extraction → TEXT_UNAVAILABLE problem code
│ └─ Unpack/intake → FILE_MISSING_OR_INCOMPLETE
├─ Publish TASK_FINISHED error event
└─ Conditionally publish DOCUMENT_PROCESSED (if document still needs completion)
Key Design Decisions¶
DynamoDB Worker State Registry¶
ECS services are tracked in a DynamoDB table as a worker pool:
| Field | Type | Purpose |
|---|---|---|
| ServiceId | Number (PK) | ECS service identifier |
| Status | String (GSI) | FREE or BUSY |
| CaseId | String (GSI) | Currently assigned case |
| CurrentJobId | String | Active job being processed |
| JobIds | String Set | All jobs assigned to this service |
| WorkerLimit | Number | Max concurrent tasks |
| ServiceArn | String | ECS service ARN |
| ScalingResourceId | String | Auto Scaling resource ID |
Service claim uses DynamoDB conditional updates (status=FREE → BUSY) with exponential backoff retry (5 attempts, 250ms base, random jitter) to handle race conditions when multiple imports compete for services.
Multi-job per service: A service can hold multiple jobs for the same case, rotating between them via priority ordering. This avoids cold-start overhead when the same case has concurrent imports.
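The claim flow above can be sketched as a DynamoDB conditional update plus the stated retry schedule (5 attempts, 250ms base, random jitter). This is an illustrative sketch, not the module's actual code — the table name, attribute names, and helper signatures are assumptions:

```typescript
// Hypothetical shape of the service-claim request: the update only
// succeeds while Status is still FREE, so two competing imports cannot
// both claim the same service. Table/attribute names are illustrative.
interface ClaimParams {
  TableName: string;
  Key: { ServiceId: { N: string } };
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, { S: string }>;
}

function buildClaimParams(serviceId: number, caseId: string): ClaimParams {
  return {
    TableName: "worker-state", // assumed table name
    Key: { ServiceId: { N: String(serviceId) } },
    UpdateExpression: "SET #s = :busy, CaseId = :caseId",
    // The claim fails (ConditionalCheckFailedException) if another
    // Lambda grabbed the service first — the caller then backs off.
    ConditionExpression: "#s = :free",
    ExpressionAttributeNames: { "#s": "Status" },
    ExpressionAttributeValues: {
      ":busy": { S: "BUSY" },
      ":free": { S: "FREE" },
      ":caseId": { S: caseId },
    },
  };
}

// Retry schedule: base * 2^attempt plus random jitter up to one base unit.
function backoffDelaysMs(
  attempts = 5,
  baseMs = 250,
  rng: () => number = Math.random,
): number[] {
  return Array.from(
    { length: attempts },
    (_, i) => baseMs * 2 ** i + Math.floor(rng() * baseMs),
  );
}
```

On a conditional-check failure the caller sleeps for the next delay in the schedule and retries; after five failed attempts the import is rejected.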
Priority-Based Job Scheduling¶
The TaskPool algorithm combines job priority + queue creation timestamp into a single sort key:
- Lower priority number = higher priority (default: 50)
- Timestamp prevents starvation of older lower-priority jobs
- Algorithm is mirrored in both TypeScript (Lambda handlers) and Java (ECS workers)
Workers rotate through jobs by: complete current job → find next by priority → rotate service state → adjust task count.
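The composite ordering can be sketched as a single lexicographic sort key — a minimal illustration of the idea, assuming zero-padded fields (the actual TaskPool key format is not documented here):

```typescript
interface JobEntry {
  jobId: string;
  priority: number; // lower number = higher priority (default 50)
  createdAt: number; // queue creation timestamp (ms)
}

// Combine priority and creation time into one sortable key. Padding keeps
// lexicographic order consistent with numeric order.
function sortKey(priority: number, createdAtMs: number): string {
  return (
    priority.toString().padStart(3, "0") +
    "-" +
    createdAtMs.toString().padStart(13, "0")
  );
}

// Pick the next job: highest priority first, oldest queue breaking ties,
// so low-priority jobs are never starved indefinitely.
function nextJob(jobs: JobEntry[]): JobEntry | undefined {
  return [...jobs].sort((a, b) =>
    sortKey(a.priority, a.createdAt).localeCompare(sortKey(b.priority, b.createdAt)),
  )[0];
}
```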
Dual-Queue with Cascading Failover¶
Each job gets two task queues:
{prefix}_nptasks_{caseId}_{jobId}_small (quick tasks, single files)
{prefix}_nptasks_{caseId}_{jobId}_large (long tasks, PST extraction, etc.)
Failover cascade:
1. Task fails twice on small queue → moved to large queue
2. Task fails twice on large queue → moved to shared DLQ
3. DLQ triggers FailedTaskHandler → task-specific error handling or retry
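The cascade reduces to a small routing decision on each failed receive. A sketch, assuming a receive-count threshold of 2 (the labels and threshold constant are illustrative, not the module's identifiers):

```typescript
type QueueTier = "small" | "large" | "dlq";

// Decide where a failed task goes next, given how many times it has been
// received on its current queue.
function nextQueue(
  current: QueueTier,
  receiveCount: number,
  maxReceives = 2,
): QueueTier {
  if (receiveCount < maxReceives) return current; // retry in place
  if (current === "small") return "large"; // escalate to more time/resources
  return "dlq"; // large queue exhausted → FailedTaskHandler takes over
}
```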
Plugin-Based Extraction¶
The Java worker uses a plugin architecture for extraction. Each task type maps to a handler plugin:
| Task Type | Purpose |
|---|---|
| intake-mailbox | PST/MBOX/OST unpacking |
| intake-document | Individual file intake |
| intake-loadfile | Load file CSV processing |
| unpack-document | Container unpacking |
| unpack-mailbox | Email container unpacking |
| extract-text | Native text extraction |
| text-from-image | Single-image OCR |
| text-from-image-list | Multi-image OCR |
| ocr-inline-images | Embedded image OCR (graceful failure) |
| ocr-page-images | Page-level OCR |
| render-pdf | Native → PDF conversion |
| combine-page-images | Page image assembly |
| render-page-images | Page rendering |
| render-xlsx | Excel rendering |
Intake tasks spawn child tasks (unpack → extract-text → render-pdf), creating an implicit task DAG.
Hyland Document Filters Integration¶
The extraction engine is powered by Hyland Document Filters — a commercial
SDK for document parsing, text extraction, image rendering, and format conversion.
Loaded via Java SPI (META-INF/services/TaskHandlerFactory).
Two SDK versions coexist:
| Version (SDK release) | Status | Role | Why |
|---|---|---|---|
| v11 (25.3.0) | Current | Primary extraction engine | Handles all task types including mailbox unpacking |
| v23 (23.3.3) | Legacy | PST-only fallback | Retained because SDK 25.3.0 has a bug extracting MSG files with attachments >8MB from PST and other mailboxes |
Core components (v11 plugin):
- DocumentFiltersWrapper.java — Lazy-loading singleton wrapping the
com.perceptive.documentfilters.* API. Initialized once per worker lifetime
- HylandTaskHandler.java — Routes task types to specialized handlers:
TextExtractor, PdfRenderer, ImageRenderer, DocumentUnpacker, MetadataProcessor
- HylandTaskHandlerFactory.java — SPI factory, registered in
META-INF/services/com.nextpoint.documentextractor.v0.interfaces.TaskHandlerFactory
Supported task types via Hyland:
- Text extraction (native + OCR via Tesseract)
- PDF rendering from any supported format
- Page image rendering (PNG output)
- Archive/mailbox unpacking (ZIP, RAR, 7Z, PST, MBOX, OST, MSG)
- Metadata extraction
- Excel rendering
Native library deployment:
- Architecture-specific builds: ARM64 (linux-aarch64-gcc-64), x86_64 (linux-intel-gcc-64)
- Loaded via JNI: LD_LIBRARY_PATH=/worker/extractors/hyland
- Docker image bundles native libs, fonts, and per-architecture config (ISYS11df.ini)
Configuration (ISYS11df.ini):
- OCR engine: Tesseract
- DPI: 96 default, AUTO for images
- Excel extraction mode: CSV
- PDF/TIFF compression settings
- Image enumeration enabled
- Spreadsheet page limit: 10,000
Worker Loop and Graceful Shutdown¶
The Java worker (Worker.java) runs a continuous polling loop:
- Capacity-gated polling: capacityTracker.maximumTaskSize() blocks via CountDownLatch when no thread capacity is available. The worker only polls SQS when a slot opens.
- Exponential backoff on idle: After 3+ consecutive empty polls with full capacity available, the worker exits gracefully (triggers scale-in).
- Zombie queue detection: If a queue is deleted mid-processing (e.g., job cancelled), the worker retries once to find other queues before giving up.
- SIGTERM shutdown hook: Runtime.getRuntime().addShutdownHook() stops accepting new tasks immediately but allows in-flight tasks to finish. Logs total execution time and capacity usage percentage on exit.
Capacity Tracking and Worker Lifetime¶
Each worker has a dual capacity model:
| Capacity Type | Default | Purpose |
|---|---|---|
| Per-task (concurrent) | 8 slots | Small tasks=1 slot, large tasks=5 slots (or 8 for spot) |
| Lifetime (total) | 600 units | Worker auto-terminates after ~120 large or ~600 small tasks |
- Spot instance restrictions: Spot workers can only process certain task types (OCR, text extraction, PDF) — not mailbox unpacking (too long, risk of spot termination mid-PST)
- Metrics thread: A virtual thread periodically reports WorkerCapacityUsage to CloudWatch for scaling decisions
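The dual capacity model can be sketched as a small slot-accounting function. The numbers come from the table above; the names and gating logic are an illustrative approximation, not the worker's actual implementation:

```typescript
interface Capacity {
  freeSlots: number; // concurrent slots, default 8
  lifetimeLeft: number; // lifetime units, default 600
}

// Slot cost per task size; spot workers pay 8 slots for a large task.
function slotCost(size: "SMALL" | "LARGE", spot = false): number {
  return size === "SMALL" ? 1 : spot ? 8 : 5;
}

// Try to admit a task. Returns the updated capacity, or null when either
// concurrent slots or remaining lifetime units are insufficient — the
// latter is what forces the worker to recycle after ~120 large or ~600
// small tasks.
function take(
  c: Capacity,
  size: "SMALL" | "LARGE",
  spot = false,
): Capacity | null {
  const cost = slotCost(size, spot);
  if (c.freeSlots < cost || c.lifetimeLeft < cost) return null;
  return { freeSlots: c.freeSlots - cost, lifetimeLeft: c.lifetimeLeft - cost };
}
```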
SQS Visibility Timeout Extension¶
Rather than setting a large visibility timeout upfront (which delays retries on crash), the worker incrementally extends it:
- Checks every 5 minutes (VISIBILITY_CHECKPOINT_INCREMENT)
- Extends by actual elapsed time + 30-second safety padding
- If the worker crashes without catching the exception, the message becomes visible again in ~5 minutes (not the full task timeout)
- Uses Java virtual threads for the visibility monitor to avoid exhausting the platform thread pool
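The extension arithmetic is simple enough to show directly. A sketch under the assumptions stated above (5-minute checkpoint, 30-second padding); the function name is illustrative:

```typescript
// Checkpoint interval and safety padding from the description above.
const VISIBILITY_CHECKPOINT_INCREMENT_MS = 5 * 60 * 1000;
const SAFETY_PADDING_S = 30;

// At each checkpoint, extend the message's visibility timeout to the
// actual elapsed task time plus padding — so a crashed worker's message
// reappears within roughly one checkpoint interval, not the full task
// timeout.
function nextVisibilityTimeoutS(taskStartMs: number, nowMs: number): number {
  const elapsedS = Math.ceil((nowMs - taskStartMs) / 1000);
  return elapsedS + SAFETY_PADDING_S;
}
```

A monitor thread would call this every `VISIBILITY_CHECKPOINT_INCREMENT_MS` and pass the result to SQS `ChangeMessageVisibility`.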
Task Timeout and Retry Escalation¶
Workers enforce timeouts via Future.get(timeout, TimeUnit.MILLISECONDS):
| Failure | Action |
|---|---|
| OutOfMemoryError | Requeue as LARGE task, terminate worker |
| TimeoutException on SMALL task | Requeue as LARGE task (more time/resources) |
| TimeoutException on LARGE task | Fail permanently |
| PDF render failure | Retry once with NO_OCR=true (OCR is common failure source) |
| RestartTaskException | Restart immediately on same worker (no requeue) |
Max retries per task: 1. Combined with the dual-queue failover (small → large → DLQ), a task gets multiple chances across different resource tiers.
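The escalation table reduces to a routing function. The failure and action names below are paraphrases of the table, not the worker's actual enums:

```typescript
type Failure = "OOM" | "TIMEOUT" | "PDF_RENDER" | "RESTART";
type TaskSize = "SMALL" | "LARGE";

// Map a failure (and the failing task's size) to the recovery action
// described in the table above.
function onFailure(f: Failure, size: TaskSize): string {
  switch (f) {
    case "OOM":
      return "requeue-as-large-and-terminate-worker";
    case "TIMEOUT":
      // Small tasks get a second chance with more time/resources;
      // large tasks have already had them.
      return size === "SMALL" ? "requeue-as-large" : "fail-permanently";
    case "PDF_RENDER":
      return "retry-with-NO_OCR"; // OCR is the common failure source
    case "RESTART":
      return "restart-on-same-worker"; // no requeue
  }
}
```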
NIST Hash Filtering (deNIST)¶
Known system files are filtered via NIST National Software Reference Library:
- S3-backed hash database: Bucket nextpoint-dm-nsrl-hash stores MD5 hashes as S3 object keys
- Lookup: s3.exists(NIST_BUCKET, md5.toUpperCase()) — simple existence check
- On match: Throws DeNistException, caught at worker level — file silently skipped with success status (not counted as an error)
- Prevents indexing of OS files, common software, and other known non-responsive files in eDiscovery
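The lookup key construction can be sketched in a few lines (shown here in TypeScript for consistency with the doc's other examples, though the actual implementation is Java). The bucket name is from the text; the helper itself is illustrative:

```typescript
import { createHash } from "node:crypto";

// Bucket name from the description above.
const NIST_BUCKET = "nextpoint-dm-nsrl-hash";

// The deNIST key is simply the file's MD5, uppercased — existence of an
// S3 object with that key marks a known (non-responsive) system file.
function nistKey(fileBytes: Buffer): string {
  return createHash("md5").update(fileBytes).digest("hex").toUpperCase();
}

// A real check would then be: await s3.exists(NIST_BUCKET, nistKey(bytes))
```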
Loadfile Processing¶
The loadfile extractor (LoadFileExtractor.java) handles bulk imports via CSV:
- Duplicate protection: Before processing, checks for a tracker file in S3 keyed by the MD5 hash of the loadfile path. If it exists, skips (idempotent)
- Parallel inventory: Uses CompletableFuture to asynchronously build image and text file inventories while parsing the CSV
- Field mapping: Generates a normalized nextpoint_load_file.csv when custom field mappings are provided, backing up the original
- Task fan-out: Creates UNPACK_LOADFILE_RECORD child tasks for each parsed record with metadata
Rails Backend Communication (HMAC-SHA1)¶
The extraction-status-handler relays SNS events to the Rails backend:
- Webhook callback: POST /batches/update/{batchId} to the Rails API endpoint
- HMAC-SHA1 signature: Signs [HTTP_METHOD, API_PATH, ISO_DATE] joined with newlines, Base64-encoded in the API-Authorization header
- Status mapping: Event types → Rails batch statuses (WORKER_STARTED → processing, JOB_FINISHED → extractor_complete, etc.)
- Per-environment endpoints: Each environment has its own Rails API URL
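The signing scheme described above can be sketched directly with Node's crypto module. Secret handling and the helper name are illustrative; only the payload shape (method, path, ISO date joined by newlines, HMAC-SHA1, Base64) comes from the text:

```typescript
import { createHmac } from "node:crypto";

// Sign a Rails callback request: HMAC-SHA1 over the newline-joined
// [HTTP_METHOD, API_PATH, ISO_DATE], Base64-encoded for the
// API-Authorization header.
function signRequest(
  method: string,
  path: string,
  isoDate: string,
  secret: string,
): string {
  const payload = [method, path, isoDate].join("\n");
  return createHmac("sha1", secret).update(payload).digest("base64");
}

// e.g. headers["API-Authorization"] =
//   signRequest("POST", "/batches/update/42", new Date().toISOString(), secret);
```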
CloudWatch-Driven Scaling¶
ECS service scaling uses CloudWatch metrics:
- WorkerCapacityUsage: Per-service utilization metric (dimension: ServiceId)
- WorkerPendingWorkload: Pending task count
- Scale-in alarm triggers WorkerFinishedHandler → job completion flow
- Neutral utilization metric (0.65) placed after state changes to prevent false scale-in
PSM — Processing Status Monitor¶
PSM is an event analytics pipeline that captures all extraction status events into a queryable data lake for monitoring, debugging, and operational visibility.
Data flow:
SNS (extraction-task-status topic)
→ Kinesis Firehose (SNS subscription, raw message delivery)
→ JQ metadata extraction (caseId, batchId for partitioning)
→ JSON → Parquet conversion (via Glue schema)
→ S3 (Hive-partitioned by caseId/batchId)
→ Queryable via Athena
Three CDK stacks:
| Stack | Purpose |
|---|---|
| PsmRolesStack | S3 bucket (nge-psm-statuses), Glue database + table, Firehose IAM role, CloudWatch log group |
| PsmStack | Firehose delivery stream subscribed to SNS topic. 64MB/60s buffering, dynamic partitioning via JQ, JSON→Parquet format conversion |
| PsmMonitoringStack | Four CloudWatch alarms monitoring pipeline health |
Glue table schema:
| Column | Type | Purpose |
|---|---|---|
| jobId | string | Extraction job identifier |
| eventType | string | JOB_QUEUED, DOCUMENT_PROCESSED, etc. |
| eventDetail | string | JSON details (problem codes, file paths) |
| status | string | SUCCESS or ERROR |
| documentId | string | Document identifier |
| timestamp | timestamp | Event timestamp |
| source | string | Always "extractor" |
| exhibitIds | string | Associated exhibit identifiers |
Partition keys: caseId (int), batchId (int) — enables efficient per-case
and per-batch queries without scanning the entire dataset.
S3 layout:
s3://{bucket}/statuses/caseid={caseId}/batchid={batchId}/ (Parquet files)
s3://{bucket}/errors/{error-type}/{yyyy}/{MM}/{dd}/ (delivery failures)
Pipeline health alarms (PsmMonitoringStack):
| Alarm | Threshold | Meaning |
|---|---|---|
| Throttling | >90% of records/sec limit | Firehose is near capacity |
| Data freshness | >300 seconds | Events are delayed reaching S3 |
| Delivery ratio | ≤1 (delivered/incoming bytes) | Data loss or backpressure |
| Failed conversions | >5% of records | JSON→Parquet conversion failures |
All alarms fire to a dedicated SNS topic (nge-psm-statuses-alarm) for
operational alerting.
Task-Specific Error Recovery¶
The FailedTaskHandler implements targeted recovery strategies:
- PDF render failure with OCR: Re-queue task with NO_OCR setting to retry
without OCR (which is a common failure source)
- OCR inline images: Downgrade to WARNING (non-critical — document still usable)
- Text extraction: Mark TEXT_UNAVAILABLE but still complete document processing
- Unpack/intake: Mark FILE_MISSING_OR_INCOMPLETE
SNS Event Types¶
type EventType =
| 'JOB_QUEUED' // Import request received, queues created
| 'JOB_STARTED' // Worker began processing
| 'JOB_FINISHED' // All tasks complete, queues deleted
| 'DOCUMENT_ADDED' // New document discovered (e.g., from PST)
| 'DOCUMENT_PROCESSED' // Document fully extracted
| 'TASK_ADDED' // Extraction task queued
| 'TASK_FINISHED' // Extraction task completed or failed
| 'WORKER_STARTED' // ECS worker container started
| 'WORKER_FINISHED' // ECS worker scaling down
| 'ERROR' // Extraction error
| 'WARNING' // Non-fatal issue (e.g., OCR inline failure)
| 'NOTIFICATION' // Informational
| 'IMPORT_CANCELLED' // Job cancelled by user
Message attributes: eventType, caseId, batchId — used by SNS filter
policies to route events to downstream queues (doc_loader, doc_uploader).
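The filter-based routing can be illustrated with a minimal policy and a matcher mirroring SNS exact-match semantics. The policy values (case ID, event type) are hypothetical; only the attribute names come from the text:

```typescript
type FilterPolicy = Record<string, string[]>;

// Hypothetical policy for a per-case doc_loader subscription: deliver only
// DOCUMENT_PROCESSED events for one case.
const docLoaderPolicy: FilterPolicy = {
  eventType: ["DOCUMENT_PROCESSED"],
  caseId: ["1234"], // illustrative case ID
};

// SNS exact-match semantics: every key in the policy must match one of its
// listed values in the message attributes; extra attributes are ignored.
function matches(policy: FilterPolicy, attrs: Record<string, string>): boolean {
  return Object.entries(policy).every(([key, values]) =>
    values.includes(attrs[key]),
  );
}
```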
Import Types¶
| Type | Trigger | Initial Task |
|---|---|---|
| FILES | Individual file import | intake-document per file |
| MAILBOX | PST/MBOX/OST import | intake-mailbox |
| LOADFILE | Load file CSV with field mapping | intake-loadfile |
| TRANSCRIPT | Deposition transcripts (LEF format) | unpack-preprocessed-container |
| MANUAL | Auto-detected → FILES or LOADFILE | Depends on detection |
Queue Naming Conventions¶
| Queue | Name Pattern | Purpose |
|---|---|---|
| Small tasks | {prefix}_nptasks_{caseId}_{jobId}_small | Quick extraction tasks |
| Large tasks | {prefix}_nptasks_{caseId}_{jobId}_large | Long-running extraction tasks |
| Failed tasks | {prefix}-failed-extraction-tasks | Shared DLQ for all jobs |
| Doc loader | {prefix}_doc_loader_{caseId}_{batchId} | Downstream: document loading |
| Doc uploader | {prefix}_doc_uploader_{caseId}_{batchId} | Downstream: document uploading |
Tags on queues: NextpointJobId, NextpointBatchId, NextpointJobPriority,
TopicSubscriptionArn.
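The naming patterns in the table translate directly into helper functions; a sketch, with `prefix` standing in for the per-environment `{prefix}`:

```typescript
// Small/large task queues are per-job; the DLQ is shared per environment;
// loader/uploader queues are per-batch. Patterns follow the table above.
function taskQueueName(
  prefix: string,
  caseId: number,
  jobId: string,
  size: "small" | "large",
): string {
  return `${prefix}_nptasks_${caseId}_${jobId}_${size}`;
}

function failedTasksQueueName(prefix: string): string {
  return `${prefix}-failed-extraction-tasks`;
}

function docLoaderQueueName(prefix: string, caseId: number, batchId: number): string {
  return `${prefix}_doc_loader_${caseId}_${batchId}`;
}

function docUploaderQueueName(prefix: string, caseId: number, batchId: number): string {
  return `${prefix}_doc_uploader_${caseId}_${batchId}`;
}
```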
Divergences from Standard Architecture¶
1. Java/Kotlin + TypeScript (Not Python)¶
The extraction worker is Java/Kotlin (ECS Fargate), orchestration is TypeScript (Lambda). The standard architecture uses Python throughout. This was driven by the Hyland Document Filters SDK being a Java/JNI library, plus Java ecosystem availability for document processing (Tesseract bindings, PST parsing). Predates the Python architecture standardization.
2. DynamoDB for State (Not RDS)¶
Worker state lives in DynamoDB instead of RDS/MySQL. This makes sense for the worker pool use case — high-throughput conditional updates, no relational queries needed. But it means the module doesn't share the multi-tenant per-case database pattern.
3. No Hexagonal Boundaries¶
TypeScript shared/ directory mixes domain logic (task-pool.ts priority algorithm, job-completion.ts flow) with infrastructure (worker-state.ts DynamoDB, sqs-helper.ts). The standard pattern would separate these.
4. Implicit Task DAG (Not Checkpoint Pipeline)¶
Extraction creates a dynamic task graph (intake → unpack → extract → render) rather than a fixed checkpoint pipeline. Each task can spawn child tasks. This is more flexible than the 11-step checkpoint pattern but harder to resume mid-pipeline.
5. CloudWatch Alarm-Driven Lifecycle¶
Job completion is triggered by CloudWatch scale-in alarms, not by explicit completion events. This couples job lifecycle to infrastructure scaling signals.
6. Dual-Language Priority Algorithm¶
The TaskPool priority algorithm is implemented in both TypeScript and Java, creating a maintenance burden. Changes must be synchronized across languages.
Lessons Learned¶
- DynamoDB conditional updates are effective for worker pools — The claim/release pattern with conditional checks handles race conditions without distributed locks.
- Dual-queue (small/large) improves throughput — Quick tasks don't get stuck behind long PST extractions. Cascading failover provides graceful degradation.
- Task-specific error recovery reduces false failures — Retrying PDF renders without OCR, and downgrading OCR inline failures to warnings, significantly reduces false error rates.
- 1-minute SQS emptiness re-check is necessary — SQS's eventual consistency means a queue can appear empty momentarily. The re-check prevents premature job completion.
- Priority + timestamp ordering prevents starvation — Pure priority ordering starves low-priority jobs. The composite key ensures older jobs eventually get processed.
- Neutral utilization metric prevents false scale-in — After state changes, placing a 0.65 utilization metric prevents CloudWatch from triggering premature scale-in alarms.
- Incremental visibility extension beats large upfront timeout — Setting a 5-minute checkpoint instead of a 30-minute timeout means crashed tasks retry in minutes, not hours. The 30-second padding prevents race conditions.
- Task size escalation is a useful retry strategy — Requeuing a failed SMALL task as LARGE gives it more time and resources. Combined with OCR-disabled retry for PDF rendering, this handles most transient failures automatically.
- Worker lifetime limits prevent resource leaks — Capping workers at ~600 tasks forces periodic container recycling, preventing memory leaks and stale state accumulation in long-running JVM processes.
- NIST deduplication saves significant processing — Filtering known system files before extraction avoids wasting compute on OS files, common software, and other non-responsive content.
- Virtual threads (Java 21) simplify concurrent monitoring — Visibility monitors, metrics publishers, and cleanup agents can each run in their own virtual thread without exhausting the platform thread pool.