Reference Implementation: documentextractor¶
Overview¶
The documentextractor module is the core extraction service for Nextpoint's Next Gen Engine (NGE). It takes raw files (emails, PSTs, documents, loadfiles) and extracts content: text, PDFs, page images, metadata, and attachments.
This is the largest NGE module, built on a hybrid Java/Kotlin + TypeScript architecture: Java ECS workers perform the extraction work, while TypeScript Lambda handlers orchestrate jobs and manage worker state via DynamoDB.
Like documentexporter, this module predates the standard Python architecture patterns. It aligns with the event-driven principle (uses SNS) but diverges significantly in language, compute model, and state management.
Pattern Mapping¶
| Pattern | Status | documentextractor Implementation |
|---|---|---|
| Hexagonal boundaries | Diverges | Flat shared/ directory in TypeScript handlers. Java worker has component separation (worker, plugins, common-models) but not core/shell |
| Exception hierarchy | Diverges | Uses task result status codes (SUCCESS, PASSWORD_PROTECTED, INVALID_FILE, etc.) and problem codes (PDF_UNAVAILABLE, TEXT_UNAVAILABLE) instead of Recoverable/Permanent/Silent exceptions |
| SNS events | Aligns | Central topic {prefix}-extraction-task-status with 13 event types. MessageAttributes include eventType, caseId, batchId for filter-based routing |
| SQS handler | Partial | TypeScript Lambda handlers consume SQS (extraction-status-handler, failed-task-handler) with batch item failures. But Java workers consume directly from SQS, not via Lambda |
| Checkpoint pipeline | Diverges | No database-backed checkpoint state machine. Task pipeline is implicit via task dependencies — each extraction task can spawn child tasks (unpack → extract-text → render-pdf) |
| Database sessions | Not used | No RDS. Uses DynamoDB for worker state management. SQLAlchemy not involved |
| Retry/resilience | Partial | Exponential backoff for service claim (5 attempts, 250ms base, jitter). SQS visibility timeout extension for long tasks. PDF retry without OCR on failure. Dual-queue cascading failover (small → large → DLQ) |
| Idempotency | Partial | DynamoDB conditional updates for service claims (prevents race conditions). Job-add is idempotent (checks if jobId already in set). No checkpoint-based distributed lock |
| Multi-tenancy | Aligns | Per-case ECS service assignment. Queue names include caseId. S3 paths per case. Single ECS service serves one case at a time |
| Job processor orchestration | Aligns (different implementation) | Dynamic per-job queue creation (small + large + loader + uploader). Job completion with queue cleanup. Service rotation between jobs |
| CDK infrastructure | Partial | Multi-stack (API stack, Worker stack, SNS stack). ECS Fargate for workers. API Gateway + Lambda for orchestration. DynamoDB for state |
| Config management | Partial | TypeScript config with environment/region matrix. No Python CONFIG_MAP or Secrets Manager caching |
| Structured logging | Partial | Context-aware logger with jobId/caseId/batchId. TypeScript implementation, not Python JSON formatter |
| ORM models | Not used | No SQLAlchemy. DynamoDB for worker state, S3 for task metadata |
Architecture¶
documentextractor/
├── infrastructure/
│ ├── api/import/
│ │ ├── create/index.ts # POST /import — create job, assign worker
│ │ ├── delete/index.ts # DELETE /import — cancel job
│ │ ├── worker-finished-handler/index.ts # CloudWatch alarm → job completion
│ │ ├── extraction-status-handler/index.ts # SNS → Rails API status relay
│ │ ├── failed-task-handler/index.ts # DLQ → error handling + retry
│ │ └── shared/
│ │ ├── worker-state.ts # DynamoDB service registry
│ │ ├── task-pool.ts # Priority-based job scheduling
│ │ ├── job-completion.ts # Multi-stage completion flow
│ │ ├── task-sns-events.ts # SNS event types and schemas
│ │ ├── queue-names.ts # Queue naming conventions
│ │ ├── worker-task.ts # Task interface definitions
│ │ ├── sns-helper.ts # SNS publishing
│ │ ├── sqs-helper.ts # SQS operations
│ │ ├── s3-helper.ts # S3 operations
│ │ ├── cloudwatch-helper.ts # Metrics and alarms
│ │ └── job-settings.ts # Job/task settings keys
│ └── lib/
│ ├── api-stack.ts # Lambda + API Gateway CDK
│ ├── worker-stack.ts # ECS Fargate + Auto Scaling CDK
│ ├── sns-stack.ts # SNS topic CDK
│ ├── psm-stack.ts # PSM: Firehose → S3 Parquet pipeline
│ ├── psm-roles-stack.ts # PSM: S3 bucket, Glue DB/table, IAM roles
│ ├── psm-monitoring-stack.ts # PSM: CloudWatch alarms for pipeline health
│ └── helpers/psm-resource-names.ts # PSM: naming conventions
├── components/
│ ├── extraction-worker/ # Java — ECS worker process
│ │ ├── src/main/java/.../worker/
│ │ │ ├── Worker.java # Main worker loop
│ │ │ ├── queues/TaskPool.java # Priority scheduling (Java mirror)
│ │ │ └── threadpool/TaskThreadPool.java
│ │ └── Docker/Dockerfile # Worker image with Hyland native libs
│ ├── extractor-plugins/
│ │ ├── hyland/ # Hyland Document Filters v11 (SDK 25.3.0) — current
│ │ │ ├── src/.../hyland/
│ │ │ │ ├── HylandTaskHandler.java # Main task routing
│ │ │ │ ├── HylandTaskHandlerFactory.java # SPI factory
│ │ │ │ ├── DocumentFiltersWrapper.java # Lazy-loading singleton API wrapper
│ │ │ │ ├── TextExtractor.java # Native text extraction
│ │ │ │ ├── PdfRenderer.java # PDF rendering
│ │ │ │ ├── ImageRenderer.java # Page image rendering
│ │ │ │ ├── DocumentUnpacker.java # Archive/mailbox unpacking
│ │ │ │ ├── MetadataProcessor.java # Metadata extraction
│ │ │ │ └── api/
│ │ │ │ ├── HylandErrorCodes.java
│ │ │ │ └── HylandDocumentTypes.java
│ │ │ ├── config/{local,arm,x86}/ISYS11df.ini # Per-architecture config
│ │ │ └── lib/ISYS11df.jar # Hyland Java API
│ │ └── hyland23/ # Hyland Document Filters v23 (SDK 23.3.3) — legacy
│ │ ├── src/.../hyland23/
│ │ │ ├── Hyland23TaskHandlerFactory.java
│ │ │ ├── Hyland23PstUnpacker.java # PST-specific unpacking
│ │ │ └── DocumentFiltersWrapper.java
│ │ └── lib/ISYS11df.jar
│ ├── loadfile-extractor/ # Java — loadfile CSV parsing + import
│ │ └── src/.../loadfiles/
│ │ ├── LoadFileExtractor.java # CSV parsing, field mapping, fan-out
│ │ └── LoadFileTaskHandler.java # SPI task handler
│ ├── nist-database/ # Java — NIST hash lookup (deNIST)
│ │ └── src/.../nist/
│ │ ├── NistDatabase.java # Interface
│ │ └── NistS3Database.java # S3-backed MD5 existence check
│ ├── lef/ # Java — LEF transcript/deposition parser
│ │ └── src/.../lef/LefContentParser.java
│ ├── common-models/ # Java — shared data models
│ │ └── src/main/java/.../v0/common/
│ │ ├── WorkerTask.java # Task definition
│ │ ├── NpIdentifiers.java # Case/batch/job IDs
│ │ └── status/ExtractionStatusEvent.java
│ └── status-events/ # Java — status publishing
│ └── src/main/java/.../statusevents/
│ └── Status.java # Status constants
└── pipeline/ # CI/CD pipeline CDK
Event Flow¶
Nextpoint Backend (API call)
│
▼
CreateImportHandler (Lambda)
├─ Validate import request (FILES/MAILBOX/LOADFILE/TRANSCRIPT)
├─ Create SQS queues: small + large task queues, loader, uploader
├─ Populate initial tasks (job-start notification + intake task)
├─ Claim ECS service (DynamoDB conditional update, 5 retries)
│ ├─ Reuse: case already has service → add job to existing service
│ └─ New: find FREE service → claim with condition check
└─ Publish JOB_QUEUED event via SNS
│
▼
Java ECS Worker (long-running container)
├─ Poll task queues via priority-based TaskPool
├─ Execute extraction plugin for task type
│ ├─ intake-mailbox → unpack PST/MBOX → child tasks
│ ├─ intake-document → extract single file → child tasks
│ ├─ extract-text → native text extraction
│ ├─ text-from-image → OCR single image
│ ├─ render-pdf → convert to PDF
│ └─ ... (15+ task types)
├─ Publish DOCUMENT_PROCESSED, TASK_FINISHED events
└─ Extend SQS visibility timeout for long tasks
│
▼ (on scale-in alarm)
WorkerFinishedHandler (Lambda, triggered by CloudWatch)
├─ Parse alarm dimensions → extract serviceId
├─ Query DynamoDB for service's current job
└─ Trigger JobCompletionHelper
│
▼
JobCompletionHelper
├─ Check all job queues for remaining messages
├─ If messages exist: adjust task count, return (not done)
├─ If empty: wait 1 minute, re-check (SQS consistency)
├─ Delete job queues, retrieve batchId from tags
├─ Find next job via TaskPool priority algorithm
│ ├─ Next job found → rotate service to next job
│ └─ No jobs left → release service (scale ECS to 0, FREE in DynamoDB)
└─ Publish JOB_FINISHED event via SNS
│
▼ (on task failure)
FailedTaskHandler (Lambda, triggered by DLQ)
├─ Route by task type:
│ ├─ PDF render failed → retry without OCR (re-queue)
│ ├─ OCR inline images → graceful degradation (WARNING, not ERROR)
│ ├─ Text extraction → TEXT_UNAVAILABLE problem code
│ └─ Unpack/intake → FILE_MISSING_OR_INCOMPLETE
├─ Publish TASK_FINISHED error event
└─ Conditionally publish DOCUMENT_PROCESSED (if document still needs completion)
Key Design Decisions¶
DynamoDB Worker State Registry¶
ECS services are tracked in a DynamoDB table as a worker pool:
| Field | Type | Purpose |
|---|---|---|
| ServiceId | Number (PK) | ECS service identifier |
| Status | String (GSI) | FREE or BUSY |
| CaseId | String (GSI) | Currently assigned case |
| CurrentJobId | String | Active job being processed |
| JobIds | String Set | All jobs assigned to this service |
| WorkerLimit | Number | Max concurrent tasks |
| ServiceArn | String | ECS service ARN |
| ScalingResourceId | String | Auto Scaling resource ID |
Service claim uses DynamoDB conditional updates (status=FREE → BUSY) with exponential backoff retry (5 attempts, 250ms base, random jitter) to handle race conditions when multiple imports compete for services.
Multi-job per service: A service can hold multiple jobs for the same case, rotating between them via priority ordering. This avoids cold-start overhead when the same case has concurrent imports.
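The claim flow above can be sketched as a DynamoDB conditional update plus the stated retry schedule (5 attempts, 250ms base, random jitter). This is an illustrative sketch, not the module's actual code — the table name, attribute names, and helper signatures are assumptions:

```typescript
// Hypothetical shape of the service-claim request: the update only
// succeeds while Status is still FREE, so two competing imports cannot
// both claim the same service. Table/attribute names are illustrative.
interface ClaimParams {
  TableName: string;
  Key: { ServiceId: { N: string } };
  UpdateExpression: string;
  ConditionExpression: string;
  ExpressionAttributeNames: Record<string, string>;
  ExpressionAttributeValues: Record<string, { S: string }>;
}

function buildClaimParams(serviceId: number, caseId: string): ClaimParams {
  return {
    TableName: "worker-state", // assumed table name
    Key: { ServiceId: { N: String(serviceId) } },
    UpdateExpression: "SET #s = :busy, CaseId = :caseId",
    // The claim fails (ConditionalCheckFailedException) if another
    // Lambda grabbed the service first — the caller then backs off.
    ConditionExpression: "#s = :free",
    ExpressionAttributeNames: { "#s": "Status" },
    ExpressionAttributeValues: {
      ":busy": { S: "BUSY" },
      ":free": { S: "FREE" },
      ":caseId": { S: caseId },
    },
  };
}

// Retry schedule: base * 2^attempt plus random jitter up to one base unit.
function backoffDelaysMs(
  attempts = 5,
  baseMs = 250,
  rng: () => number = Math.random,
): number[] {
  return Array.from(
    { length: attempts },
    (_, i) => baseMs * 2 ** i + Math.floor(rng() * baseMs),
  );
}
```

On a conditional-check failure the caller sleeps for the next delay in the schedule and retries; after five failed attempts the import is rejected.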
Priority-Based Job Scheduling¶
The TaskPool algorithm combines job priority + queue creation timestamp into a single sort key:
- Lower priority number = higher priority (default: 50)
- Timestamp prevents starvation of older lower-priority jobs
- Algorithm is mirrored in both TypeScript (Lambda handlers) and Java (ECS workers)
Workers rotate through jobs by: complete current job → find next by priority → rotate service state → adjust task count.
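The composite ordering can be sketched as a single lexicographic sort key — a minimal illustration of the idea, assuming zero-padded fields (the actual TaskPool key format is not documented here):

```typescript
interface JobEntry {
  jobId: string;
  priority: number; // lower number = higher priority (default 50)
  createdAt: number; // queue creation timestamp (ms)
}

// Combine priority and creation time into one sortable key. Padding keeps
// lexicographic order consistent with numeric order.
function sortKey(priority: number, createdAtMs: number): string {
  return (
    priority.toString().padStart(3, "0") +
    "-" +
    createdAtMs.toString().padStart(13, "0")
  );
}

// Pick the next job: highest priority first, oldest queue breaking ties,
// so low-priority jobs are never starved indefinitely.
function nextJob(jobs: JobEntry[]): JobEntry | undefined {
  return [...jobs].sort((a, b) =>
    sortKey(a.priority, a.createdAt).localeCompare(sortKey(b.priority, b.createdAt)),
  )[0];
}
```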
Dual-Queue with Cascading Failover¶
Each job gets two task queues:
{prefix}_nptasks_{caseId}_{jobId}_small (quick tasks, single files)
{prefix}_nptasks_{caseId}_{jobId}_large (long tasks, PST extraction, etc.)
Failover cascade:
1. Task fails twice on small queue → moved to large queue
2. Task fails twice on large queue → moved to shared DLQ
3. DLQ triggers FailedTaskHandler → task-specific error handling or retry
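The cascade reduces to a small routing decision on each failed receive. A sketch, assuming a receive-count threshold of 2 (the labels and threshold constant are illustrative, not the module's identifiers):

```typescript
type QueueTier = "small" | "large" | "dlq";

// Decide where a failed task goes next, given how many times it has been
// received on its current queue.
function nextQueue(
  current: QueueTier,
  receiveCount: number,
  maxReceives = 2,
): QueueTier {
  if (receiveCount < maxReceives) return current; // retry in place
  if (current === "small") return "large"; // escalate to more time/resources
  return "dlq"; // large queue exhausted → FailedTaskHandler takes over
}
```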
Plugin-Based Extraction¶
The Java worker uses a plugin architecture for extraction. Each task type maps to a handler plugin:
| Task Type | Purpose |
|---|---|
| intake-mailbox | PST/MBOX/OST unpacking |
| intake-document | Individual file intake |
| intake-loadfile | Load file CSV processing |
| unpack-document | Container unpacking |
| unpack-mailbox | Email container unpacking |
| extract-text | Native text extraction |
| text-from-image | Single-image OCR |
| text-from-image-list | Multi-image OCR |
| ocr-inline-images | Embedded image OCR (graceful failure) |
| ocr-page-images | Page-level OCR |
| render-pdf | Native → PDF conversion |
| combine-page-images | Page image assembly |
| render-page-images | Page rendering |
| render-xlsx | Excel rendering |
Intake tasks spawn child tasks (unpack → extract-text → render-pdf), creating an implicit task DAG.
Hyland Document Filters Integration¶
The extraction engine is powered by Hyland Document Filters — a commercial
SDK for document parsing, text extraction, image rendering, and format conversion.
Loaded via Java SPI (META-INF/services/TaskHandlerFactory).
Two SDK versions coexist:
| Version (SDK release) | Status | Role | Why |
|---|---|---|---|
| v11 (25.3.0) | Current | Primary extraction engine | Handles all task types including mailbox unpacking |
| v23 (23.3.3) | Legacy | PST-only fallback | Retained because SDK 25.3.0 has a bug extracting MSG files with attachments >8MB from PST and other mailboxes |
Core components (v11 plugin):
- DocumentFiltersWrapper.java — Lazy-loading singleton wrapping the
com.perceptive.documentfilters.* API. Initialized once per worker lifetime
- HylandTaskHandler.java — Routes task types to specialized handlers:
TextExtractor, PdfRenderer, ImageRenderer, DocumentUnpacker, MetadataProcessor
- HylandTaskHandlerFactory.java — SPI factory, registered in
META-INF/services/com.nextpoint.documentextractor.v0.interfaces.TaskHandlerFactory
Supported task types via Hyland:
- Text extraction (native + OCR via Tesseract)
- PDF rendering from any supported format
- Page image rendering (PNG output)
- Archive/mailbox unpacking (ZIP, RAR, 7Z, PST, MBOX, OST, MSG)
- Metadata extraction
- Excel rendering
Native library deployment:
- Architecture-specific builds: ARM64 (linux-aarch64-gcc-64), x86_64 (linux-intel-gcc-64)
- Loaded via JNI: LD_LIBRARY_PATH=/worker/extractors/hyland
- Docker image bundles native libs, fonts, and per-architecture config (ISYS11df.ini)
Configuration (ISYS11df.ini):
- OCR engine: Tesseract
- DPI: 96 default, AUTO for images
- Excel extraction mode: CSV
- PDF/TIFF compression settings
- Image enumeration enabled
- Spreadsheet page limit: 10,000
Worker Loop and Graceful Shutdown¶
The Java worker (Worker.java) runs a continuous polling loop:
- Capacity-gated polling: capacityTracker.maximumTaskSize() blocks via CountDownLatch when no thread capacity is available. The worker only polls SQS when a slot opens.
- Exponential backoff on idle: After 3+ consecutive empty polls with full capacity available, the worker exits gracefully (triggers scale-in).
- Zombie queue detection: If a queue is deleted mid-processing (e.g., job cancelled), the worker retries once to find other queues before giving up.
- SIGTERM shutdown hook: Runtime.getRuntime().addShutdownHook() stops accepting new tasks immediately but allows in-flight tasks to finish. Logs total execution time and capacity usage percentage on exit.
Capacity Tracking and Worker Lifetime¶
Each worker has a dual capacity model:
| Capacity Type | Default | Purpose |
|---|---|---|
| Per-task (concurrent) | 8 slots | Small tasks=1 slot, large tasks=5 slots (or 8 for spot) |
| Lifetime (total) | 600 units | Worker auto-terminates after ~120 large or ~600 small tasks |
- Spot instance restrictions: Spot workers can only process certain task types (OCR, text extraction, PDF) — not mailbox unpacking (too long, risk of spot termination mid-PST)
- Metrics thread: A virtual thread periodically reports WorkerCapacityUsage to CloudWatch for scaling decisions
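The dual capacity model can be sketched as a small slot-accounting function. The numbers come from the table above; the names and gating logic are an illustrative approximation, not the worker's actual implementation:

```typescript
interface Capacity {
  freeSlots: number; // concurrent slots, default 8
  lifetimeLeft: number; // lifetime units, default 600
}

// Slot cost per task size; spot workers pay 8 slots for a large task.
function slotCost(size: "SMALL" | "LARGE", spot = false): number {
  return size === "SMALL" ? 1 : spot ? 8 : 5;
}

// Try to admit a task. Returns the updated capacity, or null when either
// concurrent slots or remaining lifetime units are insufficient — the
// latter is what forces the worker to recycle after ~120 large or ~600
// small tasks.
function take(
  c: Capacity,
  size: "SMALL" | "LARGE",
  spot = false,
): Capacity | null {
  const cost = slotCost(size, spot);
  if (c.freeSlots < cost || c.lifetimeLeft < cost) return null;
  return { freeSlots: c.freeSlots - cost, lifetimeLeft: c.lifetimeLeft - cost };
}
```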
SQS Visibility Timeout Extension¶
Rather than setting a large visibility timeout upfront (which delays retries on crash), the worker incrementally extends it:
- Checks every 5 minutes (VISIBILITY_CHECKPOINT_INCREMENT)
- Extends by actual elapsed time + 30-second safety padding
- If the worker crashes without catching the exception, the message becomes visible again in ~5 minutes (not the full task timeout)
- Uses Java virtual threads for the visibility monitor to avoid exhausting the platform thread pool
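The extension arithmetic is simple enough to show directly. A sketch under the assumptions stated above (5-minute checkpoint, 30-second padding); the function name is illustrative:

```typescript
// Checkpoint interval and safety padding from the description above.
const VISIBILITY_CHECKPOINT_INCREMENT_MS = 5 * 60 * 1000;
const SAFETY_PADDING_S = 30;

// At each checkpoint, extend the message's visibility timeout to the
// actual elapsed task time plus padding — so a crashed worker's message
// reappears within roughly one checkpoint interval, not the full task
// timeout.
function nextVisibilityTimeoutS(taskStartMs: number, nowMs: number): number {
  const elapsedS = Math.ceil((nowMs - taskStartMs) / 1000);
  return elapsedS + SAFETY_PADDING_S;
}
```

A monitor thread would call this every `VISIBILITY_CHECKPOINT_INCREMENT_MS` and pass the result to SQS `ChangeMessageVisibility`.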
Task Timeout and Retry Escalation¶
Workers enforce timeouts via Future.get(timeout, TimeUnit.MILLISECONDS):
| Failure | Action |
|---|---|
| OutOfMemoryError | Requeue as LARGE task, terminate worker |
| TimeoutException on SMALL task | Requeue as LARGE task (more time/resources) |
| TimeoutException on LARGE task | Fail permanently |
| PDF render failure | Retry once with NO_OCR=true (OCR is common failure source) |
| RestartTaskException | Restart immediately on same worker (no requeue) |
Max retries per task: 1. Combined with the dual-queue failover (small → large → DLQ), a task gets multiple chances across different resource tiers.
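The escalation table reduces to a routing function. The failure and action names below are paraphrases of the table, not the worker's actual enums:

```typescript
type Failure = "OOM" | "TIMEOUT" | "PDF_RENDER" | "RESTART";
type TaskSize = "SMALL" | "LARGE";

// Map a failure (and the failing task's size) to the recovery action
// described in the table above.
function onFailure(f: Failure, size: TaskSize): string {
  switch (f) {
    case "OOM":
      return "requeue-as-large-and-terminate-worker";
    case "TIMEOUT":
      // Small tasks get a second chance with more time/resources;
      // large tasks have already had them.
      return size === "SMALL" ? "requeue-as-large" : "fail-permanently";
    case "PDF_RENDER":
      return "retry-with-NO_OCR"; // OCR is the common failure source
    case "RESTART":
      return "restart-on-same-worker"; // no requeue
  }
}
```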
NIST Hash Filtering (deNIST)¶
Known system files are filtered via NIST National Software Reference Library:
- S3-backed hash database: Bucket nextpoint-dm-nsrl-hash stores MD5 hashes as S3 object keys
- Lookup: s3.exists(NIST_BUCKET, md5.toUpperCase()) — simple existence check
- On match: Throws DeNistException, caught at worker level — file silently skipped with success status (not counted as an error)
- Prevents indexing of OS files, common software, and other known non-responsive files in eDiscovery
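The lookup key construction can be sketched in a few lines (shown here in TypeScript for consistency with the doc's other examples, though the actual implementation is Java). The bucket name is from the text; the helper itself is illustrative:

```typescript
import { createHash } from "node:crypto";

// Bucket name from the description above.
const NIST_BUCKET = "nextpoint-dm-nsrl-hash";

// The deNIST key is simply the file's MD5, uppercased — existence of an
// S3 object with that key marks a known (non-responsive) system file.
function nistKey(fileBytes: Buffer): string {
  return createHash("md5").update(fileBytes).digest("hex").toUpperCase();
}

// A real check would then be: await s3.exists(NIST_BUCKET, nistKey(bytes))
```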
Loadfile Processing¶
The loadfile extractor (LoadFileExtractor.java) handles bulk imports via CSV:
- Duplicate protection: Before processing, checks for a tracker file in S3 keyed by the MD5 hash of the loadfile path. If it exists, skips (idempotent)
- Parallel inventory: Uses CompletableFuture to asynchronously build image and text file inventories while parsing the CSV
- Field mapping: Generates a normalized nextpoint_load_file.csv when custom field mappings are provided, backing up the original
- Task fan-out: Creates UNPACK_LOADFILE_RECORD child tasks for each parsed record with metadata
Rails Backend Communication (HMAC-SHA1)¶
The extraction-status-handler relays SNS events to the Rails backend:
- Webhook callback: POST /batches/update/{batchId} to the Rails API endpoint
- HMAC-SHA1 signature: Signs [HTTP_METHOD, API_PATH, ISO_DATE] joined with newlines, Base64-encoded in the API-Authorization header
- Status mapping: Event types → Rails batch statuses (WORKER_STARTED → processing, JOB_FINISHED → extractor_complete, etc.)
- Per-environment endpoints: Each environment has its own Rails API URL
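The signing scheme described above can be sketched directly with Node's crypto module. Secret handling and the helper name are illustrative; only the payload shape (method, path, ISO date joined by newlines, HMAC-SHA1, Base64) comes from the text:

```typescript
import { createHmac } from "node:crypto";

// Sign a Rails callback request: HMAC-SHA1 over the newline-joined
// [HTTP_METHOD, API_PATH, ISO_DATE], Base64-encoded for the
// API-Authorization header.
function signRequest(
  method: string,
  path: string,
  isoDate: string,
  secret: string,
): string {
  const payload = [method, path, isoDate].join("\n");
  return createHmac("sha1", secret).update(payload).digest("base64");
}

// e.g. headers["API-Authorization"] =
//   signRequest("POST", "/batches/update/42", new Date().toISOString(), secret);
```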
CloudWatch-Driven Scaling¶
ECS service scaling uses CloudWatch metrics:
- WorkerCapacityUsage: Per-service utilization metric (dimension: ServiceId)
- WorkerPendingWorkload: Pending task count
- Scale-in alarm triggers WorkerFinishedHandler → job completion flow
- Neutral utilization metric (0.65) placed after state changes to prevent false scale-in
PSM — Processing Status Monitor¶
PSM is an event analytics pipeline that captures all extraction status events into a queryable data lake for monitoring, debugging, and operational visibility.
Data flow:
SNS (extraction-task-status topic)
→ Kinesis Firehose (SNS subscription, raw message delivery)
→ JQ metadata extraction (caseId, batchId for partitioning)
→ JSON → Parquet conversion (via Glue schema)
→ S3 (Hive-partitioned by caseId/batchId)
→ Queryable via Athena
Three CDK stacks:
| Stack | Purpose |
|---|---|
| PsmRolesStack | S3 bucket (nge-psm-statuses), Glue database + table, Firehose IAM role, CloudWatch log group |
| PsmStack | Firehose delivery stream subscribed to SNS topic. 64MB/60s buffering, dynamic partitioning via JQ, JSON→Parquet format conversion |
| PsmMonitoringStack | Four CloudWatch alarms monitoring pipeline health |
Glue table schema:
| Column | Type | Purpose |
|---|---|---|
| jobId | string | Extraction job identifier |
| eventType | string | JOB_QUEUED, DOCUMENT_PROCESSED, etc. |
| eventDetail | string | JSON details (problem codes, file paths) |
| status | string | SUCCESS or ERROR |
| documentId | string | Document identifier |
| timestamp | timestamp | Event timestamp |
| source | string | Always "extractor" |
| exhibitIds | string | Associated exhibit identifiers |
Partition keys: caseId (int), batchId (int) — enables efficient per-case
and per-batch queries without scanning the entire dataset.
S3 layout:
s3://{bucket}/statuses/caseid={caseId}/batchid={batchId}/ (Parquet files)
s3://{bucket}/errors/{error-type}/{yyyy}/{MM}/{dd}/ (delivery failures)
Pipeline health alarms (PsmMonitoringStack):
| Alarm | Threshold | Meaning |
|---|---|---|
| Throttling | >90% of records/sec limit | Firehose is near capacity |
| Data freshness | >300 seconds | Events are delayed reaching S3 |
| Delivery ratio | ≤1 (delivered/incoming bytes) | Data loss or backpressure |
| Failed conversions | >5% of records | JSON→Parquet conversion failures |
All alarms fire to a dedicated SNS topic (nge-psm-statuses-alarm) for
operational alerting.
Task-Specific Error Recovery¶
The FailedTaskHandler implements targeted recovery strategies:
- PDF render failure with OCR: Re-queue task with NO_OCR setting to retry
without OCR (which is a common failure source)
- OCR inline images: Downgrade to WARNING (non-critical — document still usable)
- Text extraction: Mark TEXT_UNAVAILABLE but still complete document processing
- Unpack/intake: Mark FILE_MISSING_OR_INCOMPLETE
SNS Event Types¶
type EventType =
| 'JOB_QUEUED' // Import request received, queues created
| 'JOB_STARTED' // Worker began processing
| 'JOB_FINISHED' // All tasks complete, queues deleted
| 'DOCUMENT_ADDED' // New document discovered (e.g., from PST)
| 'DOCUMENT_PROCESSED' // Document fully extracted
| 'TASK_ADDED' // Extraction task queued
| 'TASK_FINISHED' // Extraction task completed or failed
| 'WORKER_STARTED' // ECS worker container started
| 'WORKER_FINISHED' // ECS worker scaling down
| 'ERROR' // Extraction error
| 'WARNING' // Non-fatal issue (e.g., OCR inline failure)
| 'NOTIFICATION' // Informational
| 'IMPORT_CANCELLED' // Job cancelled by user
Message attributes: eventType, caseId, batchId — used by SNS filter
policies to route events to downstream queues (doc_loader, doc_uploader).
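The filter-based routing can be illustrated with a minimal policy and a matcher mirroring SNS exact-match semantics. The policy values (case ID, event type) are hypothetical; only the attribute names come from the text:

```typescript
type FilterPolicy = Record<string, string[]>;

// Hypothetical policy for a per-case doc_loader subscription: deliver only
// DOCUMENT_PROCESSED events for one case.
const docLoaderPolicy: FilterPolicy = {
  eventType: ["DOCUMENT_PROCESSED"],
  caseId: ["1234"], // illustrative case ID
};

// SNS exact-match semantics: every key in the policy must match one of its
// listed values in the message attributes; extra attributes are ignored.
function matches(policy: FilterPolicy, attrs: Record<string, string>): boolean {
  return Object.entries(policy).every(([key, values]) =>
    values.includes(attrs[key]),
  );
}
```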
Import Types¶
| Type | Trigger | Initial Task |
|---|---|---|
| FILES | Individual file import | intake-document per file |
| MAILBOX | PST/MBOX/OST import | intake-mailbox |
| LOADFILE | Load file CSV with field mapping | intake-loadfile |
| TRANSCRIPT | Deposition transcripts (LEF format) | unpack-preprocessed-container |
| MANUAL | Auto-detected → FILES or LOADFILE | Depends on detection |
Queue Naming Conventions¶
| Queue | Name Pattern | Purpose |
|---|---|---|
| Small tasks | {prefix}_nptasks_{caseId}_{jobId}_small | Quick extraction tasks |
| Large tasks | {prefix}_nptasks_{caseId}_{jobId}_large | Long-running extraction tasks |
| Failed tasks | {prefix}-failed-extraction-tasks | Shared DLQ for all jobs |
| Doc loader | {prefix}_doc_loader_{caseId}_{batchId} | Downstream: document loading |
| Doc uploader | {prefix}_doc_uploader_{caseId}_{batchId} | Downstream: document uploading |
Tags on queues: NextpointJobId, NextpointBatchId, NextpointJobPriority,
TopicSubscriptionArn.
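The naming patterns in the table translate directly into helper functions; a sketch, with `prefix` standing in for the per-environment `{prefix}`:

```typescript
// Small/large task queues are per-job; the DLQ is shared per environment;
// loader/uploader queues are per-batch. Patterns follow the table above.
function taskQueueName(
  prefix: string,
  caseId: number,
  jobId: string,
  size: "small" | "large",
): string {
  return `${prefix}_nptasks_${caseId}_${jobId}_${size}`;
}

function failedTasksQueueName(prefix: string): string {
  return `${prefix}-failed-extraction-tasks`;
}

function docLoaderQueueName(prefix: string, caseId: number, batchId: number): string {
  return `${prefix}_doc_loader_${caseId}_${batchId}`;
}

function docUploaderQueueName(prefix: string, caseId: number, batchId: number): string {
  return `${prefix}_doc_uploader_${caseId}_${batchId}`;
}
```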
Divergences from Standard Architecture¶
1. Java/Kotlin + TypeScript (Not Python)¶
The extraction worker is Java/Kotlin (ECS Fargate), orchestration is TypeScript (Lambda). The standard architecture uses Python throughout. This was driven by the Hyland Document Filters SDK being a Java/JNI library, plus Java ecosystem availability for document processing (Tesseract bindings, PST parsing). Predates the Python architecture standardization.
2. DynamoDB for State (Not RDS)¶
Worker state lives in DynamoDB instead of RDS/MySQL. This makes sense for the worker pool use case — high-throughput conditional updates, no relational queries needed. But it means the module doesn't share the multi-tenant per-case database pattern.
3. No Hexagonal Boundaries¶
TypeScript shared/ directory mixes domain logic (task-pool.ts priority algorithm, job-completion.ts flow) with infrastructure (worker-state.ts DynamoDB, sqs-helper.ts). The standard pattern would separate these.
4. Implicit Task DAG (Not Checkpoint Pipeline)¶
Extraction creates a dynamic task graph (intake → unpack → extract → render) rather than a fixed checkpoint pipeline. Each task can spawn child tasks. This is more flexible than the 11-step checkpoint pattern but harder to resume mid-pipeline.
5. CloudWatch Alarm-Driven Lifecycle¶
Job completion is triggered by CloudWatch scale-in alarms, not by explicit completion events. This couples job lifecycle to infrastructure scaling signals.
6. Dual-Language Priority Algorithm¶
The TaskPool priority algorithm is implemented in both TypeScript and Java, creating a maintenance burden. Changes must be synchronized across languages.
Lessons Learned¶
- DynamoDB conditional updates are effective for worker pools — The claim/release pattern with conditional checks handles race conditions without distributed locks.
- Dual-queue (small/large) improves throughput — Quick tasks don't get stuck behind long PST extractions. Cascading failover provides graceful degradation.
- Task-specific error recovery reduces false failures — Retrying PDF renders without OCR, and downgrading OCR inline failures to warnings, significantly reduces false error rates.
- 1-minute SQS emptiness re-check is necessary — SQS's eventual consistency means a queue can appear empty momentarily. The re-check prevents premature job completion.
- Priority + timestamp ordering prevents starvation — Pure priority ordering starves low-priority jobs. The composite key ensures older jobs eventually get processed.
- Neutral utilization metric prevents false scale-in — After state changes, placing a 0.65 utilization metric prevents CloudWatch from triggering premature scale-in alarms.
- Incremental visibility extension beats large upfront timeout — Setting a 5-minute checkpoint instead of a 30-minute timeout means crashed tasks retry in minutes, not hours. The 30-second padding prevents race conditions.
- Task size escalation is a useful retry strategy — Requeuing a failed SMALL task as LARGE gives it more time and resources. Combined with OCR-disabled retry for PDF rendering, this handles most transient failures automatically.
- Worker lifetime limits prevent resource leaks — Capping workers at ~600 tasks forces periodic container recycling, preventing memory leaks and stale state accumulation in long-running JVM processes.
- NIST deduplication saves significant processing — Filtering known system files before extraction avoids wasting compute on OS files, common software, and other non-responsive content.
- Virtual threads (Java 21) simplify concurrent monitoring — Visibility monitors, metrics publishers, and cleanup agents can each run in their own virtual thread without exhausting the platform thread pool.