# ADR-010: Deposition/Transcript Processing Modernization (Litigation Suite)

## Status

Proposed

## Date

2026-03-19

## Context
Deposition and transcript processing is part of the Litigation suite. It handles parsing court reporter transcripts (multiple proprietary formats), managing deposition metadata, generating PDF/text exports, and coordinating with video processing.
### Current State
Workers (EC2 custom daemon):
| Worker | Purpose | External Tools |
|---|---|---|
| TranscriptParseWorker | Parse 6 transcript formats, extract text/metadata | xpdf |
| DepositionZipWorker | Extract ZIP packages of transcripts + videos | unzip |
Rails Sidekiq jobs:
| Job | Purpose | Dependencies |
|---|---|---|
| DepositionPdfJob | Generate PDF from transcript pages | Prawn (Ruby PDF) |
| DepositionSearchPdfJob | Search-hit PDF from ES results | ES + Prawn |
| DepositionTextJob | Plain text export of designated portions | None |
| DepositionShareJob | Copy depositions between cases | S3 copy, DB copy |
| DepositionDesignationMergeJob | Merge designation labels | Video timing calc |
| DepositionSummaryReportJob | CSV summary of depositions | Per-case DB |
| TranscriptMetadataReportJob | CSV of transcript metadata | Per-case DB |
| DepositionVolumeExhibitsInFolderLinkerJob | Link exhibits to deposition volume | Per-case DB |
Transcript formats (6): LEF (LiveNote), PTF (LiveNote binary), PTX (LiveNote variant), PDF, CMS (Court Manager), plain text
Key complexity:
- LEF files are password-protected ZIPs (password: "livenote") containing transcript + video sync + exhibits
- PTF is a proprietary binary format requiring a custom parser (`Nextpoint::LiveNote::PTF`)
- CMS uses an external binary (`extract_cms_transcript`) to extract from Microsoft JET/Access databases
- Auto-syncfile generation detects timestamps in transcript text and creates SVI-format syncfiles
- NGE awareness: TranscriptParseWorker routes exhibit creation through the NGE pipeline for NGE-enabled cases
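To make the LEF handling concrete, here is a minimal sketch of unpacking a LEF package with Python's standard `zipfile` module. It assumes the package uses standard ZipCrypto encryption (which `zipfile` can decrypt); the member-routing step is left to the caller and is illustrative, not the production logic:

```python
import io
import zipfile

LEF_PASSWORD = b"livenote"  # hardcoded package password noted above

def extract_lef(lef_bytes: bytes) -> dict[str, bytes]:
    """Extract every member of a LEF package (a password-protected ZIP).

    Returns a mapping of member name -> raw bytes. The caller routes
    transcript, video-sync, and exhibit members to their own handlers.
    Assumes standard ZipCrypto encryption, which zipfile supports via pwd=.
    """
    members: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(lef_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            members[info.filename] = zf.read(info, pwd=LEF_PASSWORD)
    return members
```

In the Phase 1 design this would live in `lef_parser.py`, with exhibit members emitted onward via `sns_ops.py`.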
Theater/Treatment (presentation layer):
| Component | Purpose | External Tools |
|---|---|---|
| TreatmentWorker (workers) | Create callout/highlight composite images | GD2 (libgd) |
| TheaterController (Rails) | Courtroom presentation interface | GD2 |
| TheaterProcessor (Rails) | Server-side image tile/callout rendering | GD2 |
## Why Modernize?
- Same EC2 dependency as video — custom polling daemon, single-tenant, no scaling
- Transcript parsing is stateless — ideal for Lambda (parse file, return structured data)
- PDF generation is bounded — Prawn PDF generation fits Lambda's 15-min timeout
- Theater is server-side image processing — GD2 rendering could move to Lambda/Fargate
- nextpoint-ai already modernized summarization — AI transcript summaries use EventBridge + Lambda
## Decision
A phased approach that prioritizes the highest-value, lowest-risk extractions first.
### Phase 1: Transcript Parsing Service (Lambda)
Extract transcript parsing into a Lambda function:
```
Rails App (or NGE module)
│
├── SNS: TranscriptParseRequested
│     ├── deposition_volume_id, s3_path, format_hint
│     │
│     ▼
│   SQS Queue → Lambda: TranscriptParser
│     │
│     ├── core/
│     │   ├── parsers/
│     │   │   ├── lef_parser.py       (LEF extraction + embedded exhibit routing)
│     │   │   ├── ptf_parser.py       (LiveNote binary format)
│     │   │   ├── ptx_parser.py       (LiveNote PTX variant)
│     │   │   ├── pdf_parser.py       (xpdf text extraction)
│     │   │   ├── cms_parser.py       (CMS/JET database extraction)
│     │   │   └── text_parser.py      (plain text with page detection)
│     │   ├── transcript_analyzer.py  (page breaks, line numbers, layout)
│     │   └── syncfile_generator.py   (auto-generate SVI from timestamps)
│     │
│     ├── shell/
│     │   ├── s3_ops.py       (download transcript, upload parsed data)
│     │   ├── db/database.py  (update deposition_volume with parse results)
│     │   └── sns_ops.py      (emit events for exhibit routing)
│     │
│     ▼
│   SNS: TranscriptParsed
│     ├── If LEF with exhibits → SNS: ExhibitsExtracted → documentextractor
│     └── If syncfile generated → SNS: SyncfileCreated → video sync handler
```
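A minimal sketch of the Lambda entry point this flow implies: unwrap the SQS record, decode the SNS `TranscriptParseRequested` message, and dispatch on `format_hint`. Only the event and field names come from the diagram; the parser registry and the inline `payload` field are illustrative stand-ins (the real function would download `msg["s3_path"]` via `s3_ops.py`):

```python
import json

def parse_text(raw: bytes) -> dict:
    """Stand-in for core/parsers/text_parser.py: split on form feeds as pages."""
    return {"pages": raw.decode("utf-8", "replace").split("\f")}

# format_hint -> parser callable; the real registry covers all six formats.
PARSERS = {"text": parse_text}

def route_record(record: dict) -> dict:
    """Handle one SQS record carrying an SNS TranscriptParseRequested message.

    SQS delivers the SNS envelope as the record body; the envelope's
    "Message" field holds our JSON event. The `payload` field here is a
    test convenience only -- production would fetch s3_path instead.
    """
    sns_envelope = json.loads(record["body"])
    msg = json.loads(sns_envelope["Message"])
    parser = PARSERS[msg.get("format_hint", "text")]
    return {
        "deposition_volume_id": msg["deposition_volume_id"],
        "parsed": parser(msg.get("payload", "").encode("utf-8")),
    }
```

The actual `lambda_handler(event, context)` would loop over `event["Records"]` calling `route_record`, then publish `TranscriptParsed` via `sns_ops.py`.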
Why Lambda (not ECS)?
- Parsing completes in seconds to low minutes (well within 15-min Lambda timeout)
- No heavy binary dependencies — xpdf is small and can be included as a Lambda layer
- The CMS parser needs the `extract_cms_transcript` binary — package it as a Lambda layer or a container image
### Phase 2: Deposition PDF/Text Export (Lambda)
Extract PDF and text generation:
```
SNS: DepositionExportRequested
│
├── type: pdf  → Lambda with Prawn or ReportLab
│                (transcript pages → formatted PDF with highlights, notes, margin options)
│
└── type: text → Lambda
                 (designated portions → plain text file)
```
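The export fan-out above amounts to a small dispatcher keyed on the event's `type` field. A sketch, in which the two renderer functions are placeholders for the real Prawn/ReportLab and plain-text paths:

```python
def render_pdf(msg: dict) -> str:
    # Placeholder: the real branch formats transcript pages as a PDF
    # with highlights, notes, and margin options.
    return f"s3://exports/{msg['deposition_volume_id']}.pdf"

def render_text(msg: dict) -> str:
    # Placeholder: the real branch emits designated portions as plain text.
    return f"s3://exports/{msg['deposition_volume_id']}.txt"

EXPORTERS = {"pdf": render_pdf, "text": render_text}

def handle_export(msg: dict) -> str:
    """Route a DepositionExportRequested message by its `type` field.

    Returns the (hypothetical) S3 path of the generated artifact.
    """
    exporter = EXPORTERS.get(msg.get("type"))
    if exporter is None:
        raise ValueError(f"unknown export type: {msg.get('type')!r}")
    return exporter(msg)
```

The output paths and `EXPORTERS` name are assumptions for illustration; the event name and the pdf/text split come from the diagram.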
### Phase 3: Theater/Treatment Modernization (ECS Fargate)
Theater and treatment processing are the most complex to modernize:
Option A: Keep in Rails (recommended for now)
- Theater is request-driven (user clicks → server renders tile)
- Low latency required (< 200ms per tile)
- GD2-based rendering works and is fast
- Caching in /tmp handles repeat requests
Option B: Extract to ECS (future, if needed)
- Pre-render treatment images via Fargate task
- Serve tiles from S3/CloudFront instead of server-side GD2
- Only worth it if theater usage grows significantly
Recommendation: Keep theater in Rails for now. The server-side tile rendering with file-based caching works well for the current usage pattern. Extract only if performance becomes an issue.
## What Stays in Rails
| Component | Why |
|---|---|
| Deposition / DepositionVolume models | Core data models, CRUD |
| DepositionShareJob | Cross-case DB + S3 copy — complex coordination, not compute-heavy |
| DepositionDesignationMergeJob | Label merging with video timing — domain logic, not compute |
| TheaterController / TheaterProcessor | Request-driven rendering, latency-sensitive |
| TreatmentWorker | GD2 image rendering — works, low volume, extract later if needed |
| DepositionSummaryReportJob | Simple DB query → CSV — covered by ADR-007 (reports service) |
## Consequences
### Positive
- Phase 1 removes EC2 dependency for the most common deposition operation (parsing)
- Consistent with nextpoint-ai — both use event-driven Lambda for transcript processing
- Format parsers become testable — isolated in Lambda, easier to test than embedded in worker
- LEF exhibit routing integrates with NGE — exhibits extracted from LEF files route through documentextractor
### Negative
- 6 format parsers to port — LEF, PTF, PTX, PDF, CMS, plain text (PTF binary parser is complex)
- CMS binary dependency — `extract_cms_transcript` is a compiled binary; it needs a Lambda layer or a container image
- Theater stays legacy — no modernization for the courtroom presentation system
### Risks
- PTF binary parser complexity — `Nextpoint::LiveNote::PTF` parses a proprietary binary format with blocks, values, and quickmarks. Porting to Python requires exact byte-level parity. Mitigation: wrap the existing Ruby parser in a Lambda container image (Ruby 3.x Lambda runtime).
- LEF password handling — LEF files use a hardcoded password ("livenote"). A security review is needed if extraction moves to a separate service.
- Auto-syncfile accuracy — timestamp detection in transcript text uses heuristics (time-of-day vs elapsed time). Must validate with real-world transcripts from multiple court reporters.
- Phase ordering matters — Phase 1 (parsing) must be stable before Phase 2 (PDF export) because PDF generation depends on parsed page data. Don't parallelize these phases.