Semantic Search: Use Case Mapping to Nextpoint Architecture¶
Overview¶
This document maps each semantic search use case to the Nextpoint architecture — what already exists, what the documentsearch module enables, and what requires additional services. Use cases are classified into tiers based on build complexity.
Personas (priority order):

1. Attorney — primary value target
2. Paralegal — secondary
3. Litigation support — not targeted (they benefit indirectly)
Value metrics:

- Reduce total time (effort + orchestration + wait) to complete a task
- Reduce effort time (hands-on attorney/paralegal work)
- Number of tasks automated by AI agent (human-in-the-loop)
Tier Definitions¶
| Tier | Requires | Nextpoint Component | Build Effort |
|---|---|---|---|
| T1: Hybrid Search | documentsearch module only | POST /search with filters | documentsearch module |
| T1+: Search + Existing Features | T1 + minor Rails UI integration | POST /search -> Rails tags/folders/review queue | Days of Rails work |
| T2: Agent Reasoning | LLM orchestration that runs multiple searches, compares, synthesizes | New agent service (extends nextpoint-ai pattern) | New module, weeks |
| T2+: Cross-Corpus Agent | T2 agent across documents AND depositions | Agent + Litigation suite integration | New module + cross-suite |
Use Case Coverage Summary¶
| # | Use Case | Side | Tier | Prototype Demoable? |
|---|---|---|---|---|
| 1 | Hot Doc Identification | Receiving | T1 | YES |
| 2 | Deposition Prep | Receiving | T1+ | YES |
| 3 | Investigating / Constructing Narrative | Receiving | T1 | YES |
| 4 | Gap Analysis | Receiving | T2 | Partial |
| 5 | Pattern and Practice Identification | Receiving | T1 / T2 | YES (T1) |
| 6 | Third-Party Subpoena Targeting | Receiving | T1 | YES |
| 7 | Motion to Compel Support | Receiving | T1 / T2 | YES (T1) |
| 8 | Settlement and Mediation Prep | Receiving | T1+ | YES |
| 9 | Responsive Review | Producing | T1+ | YES |
| 10 | Privilege Review | Producing | T1 | YES |
| 11 | Redaction Identification | Producing | T1 | YES |
| 12 | Clawback Risk Assessment | Producing | T1 | YES |
| 13 | Post-Deposition Analysis | Depositions | T1+ | YES |
| 14 | Multi-Deposition Analysis | Depositions | T2+ | No |
8 of 14 use cases are T1 (documentsearch module only). 11 of 14 are demoable with the 2-week prototype.
Receiving Side¶
1. Hot Doc Identification¶
Tier: T1 (Hybrid Search) Persona: Attorney Value: Eliminate effort time for keyword iteration. Reduce risk of missing evidence.
Today (keyword search): Attorney manually crafts boolean queries, iterates on terms. Search for "Special Purpose Entities" misses "SPE", "Raptor", "JEDI". Attorney must already know the vocabulary to find the evidence.
With documentsearch:
```
POST /search
{ "query": "internal Enron discussions about Special Purpose Entities",
  "case_id": 123,
  "mode": "hybrid" }
```
- BM25 leg finds documents containing "Special Purpose Entities" (exact match)
- Vector leg finds documents about SPEs, Raptor, JEDI (conceptual match)
- RRF fuses both into one ranked list
- One query replaces 5-10 keyword iterations
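The RRF fusion step above can be sketched in a few lines. This is an illustrative implementation of standard Reciprocal Rank Fusion (with the conventional k=60 constant), not the documentsearch module's actual code:

```python
def rrf_fuse(bm25_ids, vector_ids, k=60):
    """Fuse two ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so a document ranked well by either leg -- or, especially, by both --
    rises to the top of the fused list.
    """
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Doc 101 appears in both legs, so it outranks every single-leg hit.
fused = rrf_fuse(bm25_ids=[101, 202, 303], vector_ids=[404, 101, 505])
```

The key property for hot doc identification: a document matching both the exact phrase and the concept ranks above documents matching only one.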
What exists: QLE -> ES keyword search (existing).
What's new: documentsearch module (POST /search).
Rails changes: None for basic use. Search UI toggle for "Semantic Search" mode.
Example queries an attorney would run:

- "internal discussions about the accounting treatment of off-balance-sheet entities"
- "communications about product safety concerns before the recall"
- "negotiations with regulators about the compliance violations"
2. Deposition Prep¶
Tier: T1+ (Search + Existing Features) Persona: Attorney, Paralegal Value: Reduce effort time gathering every document connected to a witness's knowledge of a specific issue.
Today: Manual keyword search -> manually tag results -> assemble in folder as deposition binder. Paralegal does initial pass, attorney reviews. Hours per witness per topic.
With documentsearch:
```
POST /search
{ "query": "discussions about Special Purpose Entities",
  "case_id": 123,
  "filters": {
    "custodians": ["jeff.skilling@enron.com"]
  },
  "limit": 200 }
```
Search returns ranked exhibit IDs. Attorney or paralegal reviews top results and saves to a deposition prep folder.
What exists:
- Tagging (Rails: Tag model, bulk tag operations)
- Folders (Rails: folder assignment on exhibits)
- Export (Rails: document export to PDF/native)
What's new:
- documentsearch module (POST /search with custodian filter)
- "Save search results to folder" UI action (Rails frontend, ~1-2 days)
- Optional: "Create depo binder from search" workflow
Integration path:
1. POST /search returns ranked exhibit_id list
2. Rails bulk-tags results into "Skilling Depo Prep -- SPEs" folder
3. Attorney reviews in confidence-ranked order (highest relevance first)
Scope creep note (AI chronology summary): An AI-generated chronology of the depo prep corpus would further reduce effort. This connects documentsearch to nextpoint-ai:

- documentsearch finds the documents
- Rails folders them
- nextpoint-ai summarizes the folder into a chronology

The pieces exist; the orchestration (folder -> summarize) is new Rails work.
3. Investigating the Evidence / Constructing the Narrative¶
Tier: T1 (Hybrid Search) Persona: Attorney Value: Reduce total time to understand what happened -- the timeline of decisions, who knew what when, where the story changes.
Today: Attorney runs keyword searches, reads documents, mentally constructs timeline. Entirely cognitive labor. Misses documents where the relevant concept is expressed in unexpected language.
With documentsearch:
```
POST /search
{ "query": "discussions about the product defect before the recall announcement",
  "case_id": 123,
  "filters": {
    "date_range": { "end": "2023-06-15" }
  }}

POST /search
{ "query": "communications expressing concern about the testing results",
  "case_id": 123 }

POST /search
{ "query": "decisions about whether to disclose the issue to the board",
  "case_id": 123,
  "filters": {
    "date_range": { "start": "2023-01-01", "end": "2023-06-15" }
  }}
```
Why this works at T1: Date filters are applied in both BM25 and vector search legs before results are returned. The embedding model understands temporal and causal concepts ("before the recall", "expressing concern"). Hard date boundaries are enforced by the filter -- no hallucination risk on dates.
What exists: ES keyword search with date filters (existing). What's new: documentsearch module. No additional Rails changes needed -- date_range filter is in the module spec.
4. Find Gaps in Received Production¶
Tier: T2 (Agent Reasoning) Persona: Attorney Value: Reduce total time to identify facts about what is NOT in the production. Transform vague absence into specific, articulable gaps.
Today: Entirely manual. Attorney runs searches, mentally notes which custodians and date ranges appear, manually identifies absences by subtraction. Hours of work. And because keyword search misses synonyms, the attorney can never be confident a gap is real rather than a vocabulary mismatch.
With documentsearch alone (T1, partial): Attorney manually runs the same query filtered to each custodian and compares result counts. This is already a massive improvement over keyword-based gap analysis, because semantic search eliminates the vocabulary-mismatch problem.
```
# Run for each custodian manually
POST /search
{ "query": "discussions about safety defect in the suspension system",
  "case_id": 123,
  "filters": {
    "custodians": ["vp.engineering@company.com"],
    "date_range": { "start": "2023-03-01", "end": "2023-04-15" }
  }}
# Result: 0 documents

POST /search
{ "query": "discussions about safety defect in the suspension system",
  "case_id": 123,
  "filters": {
    "custodians": ["senior.engineer@company.com"],
    "date_range": { "start": "2023-03-01", "end": "2023-04-15" }
  }}
# Result: 23 documents
```
The gap is now visible: VP of Engineering has 0 results in a period where a peer engineer had 23.
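The core comparison logic is simple enough to sketch. This is an illustrative outline of the gap check, not the agent's actual implementation; the threshold and field names are assumptions:

```python
def find_topic_gaps(counts_by_custodian, other_activity, min_peer_avg=5):
    """Flag custodians with zero topic hits while peers are active on it.

    counts_by_custodian: topic search result count per custodian.
    other_activity: count of the custodian's OTHER documents in the window,
    used to rule out an inactive mailbox as the explanation for the silence.
    """
    gaps = []
    for custodian, count in counts_by_custodian.items():
        peers = [c for name, c in counts_by_custodian.items() if name != custodian]
        peer_avg = sum(peers) / len(peers) if peers else 0.0
        if count == 0 and peer_avg >= min_peer_avg and other_activity.get(custodian, 0) > 0:
            gaps.append({"custodian": custodian, "peer_avg": peer_avg,
                         "other_docs": other_activity[custodian]})
    return gaps

gaps = find_topic_gaps(
    {"vp.eng": 0, "sr.eng.1": 23, "sr.eng.2": 18, "sr.eng.3": 16, "qa.lead": 19},
    other_activity={"vp.eng": 47, "sr.eng.1": 120},
)
```

The `other_activity` check is what separates a meaningful gap (active mailbox, silent on this topic) from a trivial one (no documents at all).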
With T2 agent (full automation):
Agent input:

```
{ "topic": "safety defect in the suspension system",
  "custodians": ["vp.eng", "sr.eng.1", "sr.eng.2", "sr.eng.3", "qa.lead"],
  "date_range": { "start": "2023-03-01", "end": "2023-04-15" },
  "case_id": 123 }
```
Agent workflow:

```
1. Run the SAME semantic query for each custodian in the date range
   (5 parallel POST /search calls)
2. Collect: result count, top relevance scores, date distribution per custodian
3. LLM (Bedrock Claude) analyzes the pattern:
   - Compare result counts across custodians
   - Identify statistical outliers (zero or near-zero results)
   - Check whether the custodian has OTHER documents in that date range
     (active mailbox, just silent on this topic)
4. Return structured gap report:
   "VP of Engineering has 0 documents matching 'safety defect discussions'
    during March 1 - April 15, while 4 peer engineers averaged 19 results
    each. VP's mailbox was otherwise active (47 other documents in period).
    This absence is specific to the safety defect topic."
```
Architecture:

```
Rails UI -> API Gateway -> Gap Analysis Agent Lambda
                               |
                               +-> For each custodian: POST /search (documentsearch endpoint)
                               +-> For custodian activity: query exhibits table (existing MySQL)
                               +-> LLM synthesis (Bedrock Claude)
                               +-> Return structured report
```

Builds on nextpoint-ai's Orchestrator + Processor Lambda pattern. The documentsearch /search endpoint is the primitive the agent calls.
What exists: documentsearch module (T1), nextpoint-ai architecture pattern. What's new: Gap analysis agent service. Orchestrates multiple search calls and LLM synthesis.
Illustrative scenarios: See Appendix A (The Silent Executive, The Missing Pre-Decision Window).
5. Pattern and Practice Identification¶
Tier: T1 (basic) / T2 (synthesized) Persona: Attorney Value: Reduce total time to surface systemic behavior across a production.
With documentsearch (T1):
POST /search
{ "query": "instances where employees raised safety concerns and were told to stay the course",
"case_id": 123,
"limit": 100 }
Returns ranked documents. Attorney scans results and sees the pattern manually. Even this basic version is impossible with keyword search -- "stay the course" as a response to safety concerns is a behavioral pattern, not a keyword.
With T2 agent (pattern synthesis):
Agent workflow:

```
1. Run semantic search for the behavioral pattern
2. Cluster results by custodian, date, and organizational level
3. LLM analyzes:
   "This pattern appears 14 times across 6 custodians over 8 months.
    The earliest instance is March 2022 (Smith to Jones, Exhibit #142).
    The pattern intensifies in Q4 2022 (9 of 14 instances).
    Management-level responses account for 11 of 14 instances."
4. Present as structured timeline with document links
```
Feasibility note: This is where hallucination risk is highest. The LLM might over-interpret weak semantic matches as "pattern instances." Mitigation: always return the raw snippets alongside the synthesis so the attorney can verify each claimed instance.
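The clustering step (step 2) is deterministic and needs no LLM; only the narrative synthesis does. A minimal sketch of the aggregation, with assumed hit fields (`custodian`, ISO `date`):

```python
from collections import Counter

def summarize_pattern(hits):
    """Aggregate pattern-instance hits by custodian and calendar quarter,
    so the attorney can see where and when the behavior concentrates."""
    by_custodian = Counter(h["custodian"] for h in hits)
    by_quarter = Counter(
        f'{h["date"][:4]}-Q{(int(h["date"][5:7]) - 1) // 3 + 1}' for h in hits
    )
    return {"total": len(hits),
            "custodians": dict(by_custodian),
            "quarters": dict(by_quarter)}

summary = summarize_pattern([
    {"custodian": "smith", "date": "2022-03-14"},
    {"custodian": "smith", "date": "2022-10-02"},
    {"custodian": "jones", "date": "2022-11-20"},
])
```

Keeping the counts outside the LLM also reduces the hallucination risk flagged above: the "14 times across 6 custodians" numbers come from code, and the LLM only narrates them.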
What exists: documentsearch module (T1). What's new: Pattern analysis agent (T2). Uses same architecture as gap analysis agent.
6. Third-Party Subpoena Targeting¶
Tier: T1 (Hybrid Search) Persona: Attorney, Paralegal Value: Reduce effort time to identify what third parties likely have and what to ask for, without requiring counsel to know the names and terms to search.
Today: Attorney must already know which third parties are involved. Keyword searches for specific entity names. Misses references where the third party is described by role ("our insurer", "the bank") rather than by name.
With documentsearch:
```
POST /search
{ "query": "references to conversations with insurance companies about the product recall",
  "case_id": 123 }

POST /search
{ "query": "communications mentioning outside counsel regarding the merger",
  "case_id": 123 }

POST /search
{ "query": "discussions about what the regulator told us during the investigation",
  "case_id": 123 }
```
Semantic search surfaces references by ROLE ("our insurer said...", "I spoke with the bank about...") without requiring the attorney to know which insurer or which bank. The results identify the third parties to subpoena.
What exists: ES keyword search (existing). What's new: documentsearch module. No additional Rails changes.
7. Motion to Compel Support¶
Tier: T1 (basic) / T2 (automated) Persona: Attorney Value: Reduce effort time to build the factual basis for a motion to compel with specific grounds rather than general assertions of incompleteness.
With documentsearch (T1):
```
POST /search
{ "query": "emails referencing reports or analyses about safety testing",
  "case_id": 123 }
```
Surfaces individual emails that reference documents. Attorney manually compiles the list of referenced-but-not-produced documents.
With T2 agent (automated cross-reference):
Agent workflow:

```
1. Search for references to external documents, reports, analyses
2. Extract referenced document descriptions from search results (LLM)
3. Cross-reference against the production manifest (exhibits table in MySQL)
4. Identify referenced documents not present in production:
   "12 emails reference a 'Q3 Safety Assessment Report' (Exhibits #42, #67,
    #89, #103, #145, #178, #201, #234, #267, #290, #312, #345).
    This document does not appear in the production.
    8 emails reference 'Board Presentation on Product Risk' -- also absent."
5. Generate structured motion-to-compel language with specific citations
```
Architecture: Agent calls documentsearch + queries exhibits table (existing MySQL) to identify what's referenced but not produced. LLM formats output.
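The cross-reference itself (step 4) is a set difference. An illustrative sketch, assuming the LLM extraction step yields `(exhibit_id, title)` pairs and the manifest is a list of produced titles:

```python
def find_unproduced_references(referenced_titles, production_manifest):
    """Return referenced document titles absent from the production,
    with the exhibits that cite each one -- the specific citations
    a motion to compel needs."""
    produced = {t.lower() for t in production_manifest}
    missing = {}
    for exhibit_id, title in referenced_titles:
        if title.lower() not in produced:
            missing.setdefault(title, []).append(exhibit_id)
    return missing

missing = find_unproduced_references(
    referenced_titles=[(42, "Q3 Safety Assessment Report"),
                       (67, "Q3 Safety Assessment Report"),
                       (89, "Launch Checklist")],
    production_manifest=["Launch Checklist", "Board Minutes 2023-05"],
)
```

In practice title matching would need fuzzier comparison than exact lowercase equality; that is exactly the piece the LLM extraction/normalization step handles.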
What exists: documentsearch module (T1), exhibits table (MySQL). What's new: Motion-to-compel agent (T2).
8. Settlement and Mediation Prep¶
Tier: T1+ (Search + Existing Features) Persona: Attorney Value: Reduce effort time to curate the strongest evidence package for mediation, replacing re-review of previously flagged folders.
Today: Attorney re-reviews folders of previously flagged documents under time pressure. Relies on memory of which documents were strongest on each issue.
With documentsearch:
```
POST /search
{ "query": "documents establishing defendant's knowledge of the safety defect prior to the recall",
  "case_id": 123,
  "filters": {
    "date_range": { "end": "2023-06-01" }
  },
  "limit": 50 }
```
Results ranked by relevance. Top results are the strongest evidence on that issue. Save to a mediation prep folder.
What exists: Folders, export, print (Rails). What's new: documentsearch module + "save search results to folder" UI action (same as depo prep).
Producing Side¶
9. Responsive Review¶
Tier: T1+ (Search + Existing Features) Persona: Paralegal (primary reviewer), Attorney (quality control) Value: Reduce effort time to identify responsive documents. Cover the "back half" of a document set that time-boxed reviews never reach.
Today: Linear review in random or chronological order. Time-boxed reviews miss documents at the end of the queue. The most responsive documents might be buried at position 50,000 in a 100,000-document review set.
With documentsearch:
```
POST /search
{ "query": "all documents relating to the design and testing of the XYZ component",
  "case_id": 123,
  "limit": 500 }
```
The query is the RFP language itself. Results ranked by relevance. Reviewer works top-down. Most-likely-responsive documents are reviewed first.
Integration path:

1. POST /search returns scored exhibit IDs
2. Rails review queue accepts sort order by relevance score (instead of chronological or random)
3. Reviewer sees most-relevant documents first
4. The "back half" is now prioritized by semantic relevance, not chronological luck
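The ordering step is trivial but worth pinning down: ties and stability matter when reviewers work in parallel. An illustrative sketch, with assumed hit fields (`exhibit_id`, `score`):

```python
def relevance_order(search_results):
    """Turn scored /search hits into a review-queue ordering:
    highest relevance first, exhibit ID as a deterministic tiebreaker
    so the queue is stable across reloads."""
    return [hit["exhibit_id"]
            for hit in sorted(search_results,
                              key=lambda h: (-h["score"], h["exhibit_id"]))]

queue = relevance_order([
    {"exhibit_id": 9001, "score": 0.42},
    {"exhibit_id": 12, "score": 0.91},
    {"exhibit_id": 77, "score": 0.91},
])
```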
What exists:

- Review queue (Rails: review workflow, coding panel)
- Batch coding (Rails: bulk responsive/non-responsive tagging)
What's new:

- documentsearch module
- Review queue sort-by-relevance option (Rails frontend change)
- Optional: "one search per RFP request" workflow that pre-sorts the entire review set by responsiveness to each RFP
10. Privilege Review¶
Tier: T1 (Hybrid Search) Persona: Attorney (privilege calls require attorney judgment) Value: Reduce effort time identifying privilege documents. Surface privilege communications that keyword search misses. Reduce false positives from broad keyword terms.
Today: Keyword search for "privilege", "attorney", "counsel", "legal advice". Returns massive false positives: HR documents about "attorney fees", vendor contracts referencing "legal department", newsletters mentioning "legal". Misses: conversational privilege language where nobody uses the word "privilege."
With documentsearch:
```
POST /search
{ "query": "communications seeking or providing legal advice about the transaction",
  "case_id": 123 }

POST /search
{ "query": "messages where someone asks for in-house counsel's opinion before responding",
  "case_id": 123 }
```
Semantic search understands the CONCEPT of privilege: "I'd like to get Sarah's take on the liability question before we respond" -- where Sarah is in-house counsel. No keyword overlap with "privilege" or "attorney-client."
What exists: ES keyword search with privilege-related terms (existing). What's new: documentsearch module. No additional Rails changes needed. Results feed into existing privilege logging workflow.
Industry context: Traditional privilege logging averages ~7 documents/hour. AI-assisted privilege logging reaches ~35 documents/hour (5x). The EDRM community has accepted AI-assisted privilege workflows as defensible when combined with attorney oversight, validation sampling, and audit trails (EDRM, March 2026).
Full privilege workflow (T1 → T2 → T3):
| Step | What | Tier | Status |
|---|---|---|---|
| 1. Find privilege candidates | Semantic search across corpus | T1 | Designed (documentsearch) |
| 2. Classify candidates | LLM reads each doc, assesses privilege likelihood | T2 | Not yet designed |
| 3. Attorney review | Human judgment on flagged docs | Exists | Rails privilege workflow |
| 4. Privilege log generation | LLM formats entries into standard privilege log | T2/T3 | Not yet designed |
T1 provides step 1. Each subsequent step layers on top. Courts require human judgment for final privilege calls (step 3 is always attorney), but steps 1, 2, and 4 can be AI-assisted.
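Step 4 (log generation) is largely mechanical formatting once steps 1-3 have produced attorney-confirmed metadata. A sketch of one log entry; the field names are illustrative, not the Nextpoint schema, and real logs follow jurisdiction-specific formats:

```python
def privilege_log_entry(doc):
    """Format one attorney-confirmed document as a privilege-log row.
    Input fields (date, author, recipients, subject_matter) are assumed
    to come from the classification + review steps above."""
    return {
        "date": doc["date"],
        "author": doc["author"],
        "recipients": "; ".join(doc["recipients"]),
        "privilege_type": doc.get("privilege_type", "Attorney-Client"),
        "description": f'Email {doc["author"]} to {doc["recipients"][0]} '
                       f'seeking legal advice re: {doc["subject_matter"]}',
    }

entry = privilege_log_entry({
    "date": "2023-04-02",
    "author": "ceo@company.com",
    "recipients": ["gc@company.com"],
    "subject_matter": "acquisition risk",
})
```

Because the description template never invents facts beyond the reviewed metadata, this is the lowest-hallucination-risk piece of the workflow; the LLM's role in a T2/T3 version would be phrasing the subject-matter description, not deciding privilege.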
11. Redaction Identification¶
Tier: T1 (Hybrid Search) Persona: Paralegal, Attorney Value: Reduce effort time to find documents requiring redaction before production. Surface candidates keyword review misses.
Today: Keyword search for PII terms, then manual review. Misses documents where sensitive content is described in context rather than in formal terms.
With documentsearch:
```
POST /search
{ "query": "threads discussing the company's unrelated litigation with [other party]",
  "case_id": 123 }

POST /search
{ "query": "communications containing social security numbers or financial account details",
  "case_id": 123 }
```
What exists: Redaction tools (Rails: redaction workflow). What's new: documentsearch module. Results feed into existing redaction workflow.
Two layers of PII/PHI detection:
| Layer | What It Finds | Tier | Status |
|---|---|---|---|
| Semantic search (documentsearch T1) | Documents that DISCUSS PII/PHI topics (e.g., "employee medical conditions") | T1 | Designed |
| Entity detection (NER + regex, T2) | Specific PII/PHI entities within documents (the actual SSN, the actual name, the actual diagnosis) | T2 | Not yet designed |
T1 semantic search identifies documents that need redaction review. T2 entity detection identifies the specific text to redact within those documents.
Regulatory context: GDPR (fines up to 4% of global revenue), CCPA (up to $7,500 per violation), HIPAA (up to $2.1M per violation category). Auditable PII/PHI redaction workflows are now a prerequisite for responsible eDiscovery.
Categories and detection difficulty:
| Category | Example | T1 Semantic Search? | T2 Entity Detection? |
|---|---|---|---|
| Structured PII (SSN, credit card) | "123-45-6789" | No (not a concept) | Yes (regex) |
| Unstructured PII (names in context) | "John discussed his finances with..." | Partial (finds relevant docs) | Yes (NER) |
| PHI (health records) | "patient's treatment plan with Dr. Martinez" | Yes (finds health-related docs) | Yes (NER + medical context) |
| Composite PII | Job title + department + date = identifies one person | No | Yes (LLM classification) |
T1 handles PHI and unstructured PII well (conceptual matching). Structured PII (exact SSN patterns) requires T2 entity detection. Both layers feed into the existing Rails redaction workflow.
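The structured-PII leg of T2 is the easy part and can be sketched directly. An illustrative regex detector for SSN-shaped strings (the pattern is a common approximation, not a validated SSN rule -- it ignores area-number restrictions and will flag some non-SSN nine-digit strings):

```python
import re

# SSN-like patterns: 3-2-4 digits separated by dashes, spaces, or nothing.
SSN_RE = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")

def find_ssn_spans(text):
    """Return (start, end) character spans of SSN-shaped strings,
    ready to hand to a redaction workflow as boxes to black out."""
    return [m.span() for m in SSN_RE.finditer(text)]

spans = find_ssn_spans("Employee SSN 123-45-6789 on file; call 555-0100.")
```

Returning character spans rather than the matched text is deliberate: the redaction workflow needs positions, and the detector never needs to log the sensitive value itself.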
12. Clawback Risk Assessment¶
Tier: T1 (Hybrid Search) Persona: Attorney Value: Reduce effort time to assess inadvertent production exposure under time pressure. Replace manual keyword reconstruction with instant conceptual search.
Scenario: A privileged document was inadvertently produced. Attorney needs to immediately find other documents about the same subject matter that might also be privileged and need to be clawed back.
Today: Reconstruct keyword searches from memory under time pressure. Run multiple boolean queries. Miss documents where the same legal advice is discussed in different vocabulary.
With documentsearch:
```
POST /search
{ "query": "[paste the substance of the clawed-back document: attorney advice
            on the acquisition risks and regulatory exposure]",
  "case_id": 123 }
```
Instant results. The substance of the clawed-back document IS the query. Semantic search finds all documents discussing the same legal advice, even in different words.
What exists: Clawback workflow (Rails). What's new: documentsearch module. The urgency of clawback situations makes sub-second search response time critical -- this is where the 170ms hybrid search latency delivers direct value.
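Using the document itself as the query takes one small preprocessing step. An illustrative sketch, assuming embedding inputs have a length limit (the 1000-character cap here is an arbitrary placeholder, not the module's actual limit):

```python
def clawback_query(document_text, case_id, max_chars=1000):
    """Build a /search payload whose query is the substance of the
    clawed-back document itself: collapse whitespace, truncate to the
    embedding input budget, and search the same case in hybrid mode."""
    query = " ".join(document_text.split())[:max_chars]
    return {"query": query, "case_id": case_id, "mode": "hybrid", "limit": 50}

payload = clawback_query(
    "Privileged:\n  counsel's advice on\n  acquisition risk.", case_id=123
)
```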
Depositions¶
13. Post-Deposition Analysis¶
Tier: T1+ (Search + Existing Features) Persona: Attorney Value: Reduce effort time to find documents that support or contradict a witness's testimony. Replace manual keyword reconstruction from deposition notes.
Today: Attorney takes notes during deposition, then reconstructs keyword searches based on testimony. Misses documents where the relevant content uses different language than the witness used.
With documentsearch:
```
POST /search
{ "query": "documents consistent with the claim that engineering approved the final design",
  "case_id": 123 }

POST /search
{ "query": "documents contradicting the assertion that she was not copied on safety escalations",
  "case_id": 123,
  "filters": {
    "custodians": ["witness@company.com"]
  }}

POST /search
{ "query": "email threads where the witness participated in discussions about the timeline change",
  "case_id": 123,
  "filters": {
    "custodians": ["witness@company.com"]
  }}
```
Why semantic search is critical here: The "contradicting" query is particularly powerful. Semantic search understands contradiction as a concept. A document where the witness IS on the CC line of a safety escalation email contradicts her assertion that she was not copied -- semantic search makes that connection even though the document itself doesn't use the word "contradict."
What exists: Document viewer, tagging (Rails). What's new: documentsearch module + "save supporting/contradicting docs" workflow (minor Rails UI).
14. Multi-Deposition Analysis¶
Tier: T2+ (Cross-Corpus Agent) Persona: Attorney Value: Reduce total time to identify where witnesses contradicted each other or themselves. Surface inconsistencies that today only emerge through intensive parallel review and attorney recall.
Today: Attorney reads all transcripts, compares testimony manually. Relies on memory and notes across multiple depositions. Inconsistencies are discovered by chance or by extraordinary attorney diligence.
What this requires: An agent operating across the deposition transcript corpus AND the document production.
Agent workflow:

```
1. Search ALL deposition transcripts for testimony about [topic]
   (requires transcript embeddings -- Litigation suite integration)
2. Cluster testimony by witness
3. LLM compares across witnesses:
   "Witness A testified: 'I was not aware of the testing results until July.'
    Witness B testified: 'I briefed the entire team, including [Witness A],
    on the testing results in the May meeting.'
    These statements are inconsistent."
4. Cross-reference against documents:
   "Exhibit #234 (May meeting attendee list) includes Witness A,
    contradicting her testimony that she was unaware until July."
5. Present contradiction matrix with citations to transcript pages
   and document exhibit numbers
```
Architecture:

```
Rails UI -> API Gateway -> Multi-Depo Agent
                               |
                               +-> Transcript search (requires: transcript embeddings via
                               |   Litigation suite, which is currently Legacy-only)
                               +-> Document search (documentsearch POST /search)
                               +-> LLM contradiction analysis (Bedrock Claude)
                               +-> Return structured contradiction report
```
What exists: nextpoint-ai (transcript summarization), documentsearch (T1).

What's new:

- Transcript embedding pipeline (extend documentsearch to Litigation suite transcripts)
- Multi-deposition agent service
- Cross-corpus orchestration (documents + transcripts)
This is the hardest use case. It spans two data sources (Discovery documents and Litigation transcripts), requires LLM reasoning about consistency and contradiction, needs structured output, and depends on the Litigation suite which is currently Legacy-only. It is T2+ because it requires cross-suite integration that doesn't exist today.
Architecture Mapping¶
What Each Tier Builds On¶
```
T1:  documentsearch module
 |     POST /search (hybrid BM25 + vector, filters)
 |     Subscribes to DOCUMENT_PROCESSED events
 |     Backfill pipeline for existing documents
 |
T1+: + Minor Rails UI integration
 |     "Save to folder" action
 |     Review queue sort-by-relevance
 |     "Create depo binder from search" workflow
 |
T2:  + Agent orchestration service
 |     Multiple POST /search calls per workflow
 |     LLM synthesis (Bedrock Claude)
 |     Cross-reference with exhibits table (MySQL)
 |     Builds on nextpoint-ai Orchestrator + Processor pattern
 |     Evaluate: PageIndex for deep document reasoning (post-retrieval)
 |
T2+: + Litigation suite integration
       Transcript embedding pipeline
       Cross-corpus agent (documents + transcripts)
```
Component Dependencies¶
| Component | Use Cases Enabled | Depends On |
|---|---|---|
| documentsearch module | 1, 3, 5, 6, 7, 10, 11, 12 | Existing SNS pipeline, Voyage AI, vector store |
| "Save to folder" UI | 2, 8, 9, 13 | documentsearch, existing Rails folders/tags |
| Review queue sort | 9 | documentsearch, existing Rails review workflow |
| Gap analysis agent | 4 | documentsearch, Bedrock Claude, nextpoint-ai pattern |
| Pattern analysis agent | 5 (enhanced) | documentsearch, Bedrock Claude |
| Motion-to-compel agent | 7 (enhanced) | documentsearch, exhibits table, Bedrock Claude |
| Transcript embeddings | 14 | documentsearch (extended to transcripts), Litigation suite |
| Multi-deposition agent | 14 | Transcript embeddings, documentsearch, Bedrock Claude |
Existing Nextpoint Components Reused¶
| Existing Component | Reused By | How |
|---|---|---|
| SNS DOCUMENT_PROCESSED events | documentsearch ingest | Subscribe to existing event (zero upstream changes) |
| Elasticsearch 7.4 (per-case indices) | Hybrid search BM25 leg | Query existing indices for keyword results |
| Rails folders/tags | Depo prep, settlement prep, responsive review | Save search results to existing folder structures |
| Rails review queue | Responsive review | Sort by relevance score instead of chronological |
| Rails redaction workflow | Redaction identification | Feed search results into existing workflow |
| Rails privilege logging | Privilege review | Feed search results into existing workflow |
| nextpoint-ai architecture | T2 agents | Orchestrator + Processor Lambda pattern, Bedrock integration |
| Exhibits table (MySQL) | Motion to compel agent, backfill | Document manifest for cross-referencing and backfill source |
| PSM (Athena pipeline) | Progress tracking | DOCUMENT_EMBEDDED events tracked via existing Firehose pipeline |
| NgeCaseTrackerJob | Embedding progress | Rails polls embedding status via existing Athena polling |
Build Sequence¶
Phase 1: Prototype (2 weeks)¶
Goal: Validate retrieval quality on a known matter. Demo T1 use cases.
- documentsearch module (single case, pgvector, basic chunking)
- Standalone React frontend (Claude Code)
- Demo script: Hot Doc ID, Privilege Review, Narrative Investigation
Demoable use cases: 1, 3, 5 (basic), 6, 7 (basic), 10, 11, 12
Phase 2: Production T1 (1 quarter)¶
Goal: Ship hybrid search as a production feature.
- documentsearch module with production vector store (OpenSearch)
- Domain-specific chunking (email-aware, document section-aware)
- Backfill pipeline for existing documents
- Multi-tenant per-case isolation
- Rails UI integration: search toggle, result display with snippets
Unlocks use cases: All T1 (1, 3, 5, 6, 7, 10, 11, 12)
Phase 3: Rails Integration (weeks after Phase 2)¶
Goal: Connect search to existing workflows.
- "Save search results to folder" action
- Review queue sort-by-relevance option
- Depo prep binder workflow
- Embedding progress indicators in import status UI
Unlocks use cases: All T1+ (2, 8, 9, 13)
Phase 4: Agent Service (1 quarter after Phase 2)¶
Goal: Automated multi-search reasoning.
- Agent orchestration service (extends nextpoint-ai pattern)
- Gap analysis workflow
- Pattern identification workflow
- Motion to compel cross-reference workflow
- Evaluate PageIndex for deep document reasoning (see Appendix C)
Unlocks use cases: T2 (4, 5 enhanced, 7 enhanced)
Phase 5: Cross-Corpus (future)¶
Goal: Deposition + document integration.
- Transcript embedding pipeline (extend documentsearch to Litigation suite)
- Multi-deposition contradiction agent
- Cross-corpus search (documents + transcripts in one query)
Unlocks use cases: T2+ (14)
Appendix A: Gap Analysis Scenarios¶
Scenario 1: The Silent Executive¶
Case: Product liability. VP of Engineering allegedly knew about safety defect before product shipped. Production covers 18 months of emails.
Keyword search outcome: Terms like "defect", "safety", "failure", "risk" filtered to VP return noise (engineering jargon, HR docs) and a few genuine hits. Absence logged as "no responsive documents."
Semantic search outcome:
```
Query 1: "concerns raised about product performance before launch"
         VP results: 0 | Peer engineers average: 18
Query 2: "internal escalations about engineering problems in the suspension system"
         VP results: 0 | Peer engineers average: 23
Query 3: "messages where someone is told not to worry about a known issue"
         VP results: 0 | Peer engineers average: 8
```
Query 3 is the one keyword search can never touch. Nobody writes "don't worry about the known defect." They write "let's stay the course on the timeline" or "we'll address that in the next release cycle." Semantic search surfaces the reassurance language.
When ALL THREE queries return zero from the VP's mailbox during the six weeks before launch -- a period where every other senior engineer was actively discussing the issue -- that silence is specific and articulable. The attorney can depose the VP about that specific period armed with the knowledge that the documents that should exist don't appear.
T1 version: Attorney runs these 3 queries x N custodians manually. Still dramatically better than keyword search.
T2 version: Gap analysis agent runs all queries across all custodians automatically and presents the comparison matrix.
Scenario 2: The Missing Pre-Decision Window¶
Case: Breach of contract. Did defendant deliberately exit the agreement or simply fail to perform? Production covers two years of communications.
Keyword search outcome: "Contract", "agreement", "termination", "exit" return documents clustered around the formal termination date. The critical window -- six weeks before formal action -- is invisible.
Semantic search outcome:
```
Query 1: "internal discussions about whether to continue the relationship with [plaintiff]"
         CFO results: 0 in critical window | Deal team average: 12
Query 2: "financial analysis comparing cost of performance versus cost of breach"
         CEO results: 0 in critical window | Finance team average: 7
Query 3: "communications about alternative suppliers or partners for this project"
         CFO results: 0 in critical window | Procurement: 15
```
The production has robust traffic from every other custodian in that window. CFO and CEO have essentially nothing. That is not a search failure. That is a pattern. Counsel enters the CFO deposition knowing the documents that should exist don't appear.
Why Keyword Search Cannot Do Gap Analysis¶
Keyword search only tells you what IS there. Gap analysis requires characterizing what SHOULD be there and isn't. With keywords:
- Run searches, get results
- Manually note custodians and date ranges represented
- Cross-reference against expectations of who should have communicated
- Identify gaps by subtraction (fully manual, attorney-hours process)
And because keyword search misses synonyms and conceptual matches, you can never be confident that a gap is real rather than a vocabulary mismatch. If you search for "liability" and get nothing, you don't know if the custodian never discussed it or discussed it using words your keyword list didn't anticipate.
Semantic search inverts this. A conceptually broad query run against a specific custodian's documents during a specific time window produces a result set whose SIZE AND CONTENT is itself informative. A near-empty result for "any discussion of the acquisition risks during Q4" from the M&A lead is a signal, not just an absence of keyword hits.
That is the difference between a keyword gap and a meaningful gap. Only the second kind is actionable in litigation.
Appendix B: Use Case to API Mapping¶
Every use case ultimately calls the same endpoint with different parameters:
```
POST /search
{
  "query": "<natural language>",
  "case_id": <int>,
  "filters": {
    "custodians": ["<email>"],     # optional
    "date_range": {                # optional
      "start": "YYYY-MM-DD",
      "end": "YYYY-MM-DD"
    },
    "batch_ids": [<int>]           # optional
  },
  "mode": "hybrid",                # "hybrid" | "semantic" | "keyword"
  "limit": 20,
  "offset": 0
}
```
| Use Case | query | Key Filters |
|---|---|---|
| Hot Doc ID | Topic-based conceptual query | None or broad |
| Depo Prep | Topic + witness focus | custodians |
| Narrative | Temporal/causal questions | date_range |
| Gap Analysis (T1) | Same query, repeated per custodian | custodians + date_range |
| Pattern ID | Behavioral pattern description | None |
| Subpoena Targeting | Third-party references | None |
| Motion to Compel | References to absent documents | None |
| Settlement Prep | Strongest evidence on issue | date_range |
| Responsive Review | RFP language verbatim | None or batch_ids |
| Privilege Review | Privilege concept queries | None |
| Redaction ID | PII/sensitive content concepts | None |
| Clawback | Substance of clawed-back doc | None |
| Post-Depo | Testimony substance | custodians |
| Multi-Depo (T2+) | Topic across witnesses | (transcript corpus) |
The API surface is identical across all use cases. The attorney's natural-language query and filter selection are what differentiate them. This is the architectural simplicity: one endpoint, many workflows.
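As a concrete illustration of "one endpoint, many workflows," a minimal payload builder can express two different use cases from the table as parameter variations. The payload shape follows Appendix B; the helper function and the sample values are assumptions for illustration, not Nextpoint code.

```python
from typing import Any, Dict, List, Optional

def search_payload(
    query: str,
    case_id: int,
    custodians: Optional[List[str]] = None,
    date_range: Optional[Dict[str, str]] = None,
    batch_ids: Optional[List[int]] = None,
    mode: str = "hybrid",
    limit: int = 20,
    offset: int = 0,
) -> Dict[str, Any]:
    # Only populated filters are included, matching the "optional" fields above.
    filters: Dict[str, Any] = {}
    if custodians:
        filters["custodians"] = custodians
    if date_range:
        filters["date_range"] = date_range
    if batch_ids:
        filters["batch_ids"] = batch_ids
    return {"query": query, "case_id": case_id, "filters": filters,
            "mode": mode, "limit": limit, "offset": offset}

# Depo prep vs. gap analysis: same endpoint, different query + filters.
depo = search_payload("discussions of the recall decision", 42,
                      custodians=["witness@co.com"])
gap = search_payload("concerns raised about product performance", 42,
                     custodians=["vp@co.com"],
                     date_range={"start": "2023-01-01", "end": "2023-02-15"})
```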
Appendix C: PageIndex — T2 Evaluation Item¶
What It Is¶
PageIndex (VectifyAI, open-source) is a vectorless, reasoning-based retrieval framework. Instead of embedding chunks and doing similarity search, it builds a hierarchical tree index (machine-readable table of contents) from a document and uses LLM reasoning to navigate that tree at query time. Mafin 2.5 (built on PageIndex) achieved 98.7% accuracy on FinanceBench (financial document fact extraction) where typical vector RAG scored 65-80%.
Two-step architecture:

1. Index generation (once per document) — analyzes the document's natural structure (sections, subsections, headings, logical groupings). Each tree node has a title, a summary of what it covers, the questions it could answer, and a page range. The original document stays intact and accessible.
2. Reasoning-driven tree search (per query) — the LLM reads the top-level nodes, reasons about which branch likely contains the answer, follows that branch, reads the next level, and reasons again. It drills down to specific sections and retrieves complete sections (not fragments). Explainable by design — the tree traversal path is transparent.
References:

- Source: https://github.com/VectifyAI/PageIndex
- Cloud API: https://docs.pageindex.ai (REST + MCP protocol integration)
- Interactive demo: https://chat.pageindex.ai
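The two-step idea can be sketched with a toy tree and a traversal loop. This is not the real PageIndex data model or API (see the repository above); the node fields simply mirror the description, and `choose` stands in for the per-level LLM reasoning call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class TreeNode:
    title: str
    summary: str                   # what this section covers
    pages: Tuple[int, int]         # page range in the original document
    children: List["TreeNode"] = field(default_factory=list)

def tree_search(
    node: TreeNode,
    choose: Callable[[str, List[TreeNode]], TreeNode],  # LLM stand-in
    query: str,
    path: Optional[List[str]] = None,
) -> Tuple[TreeNode, List[str]]:
    # At each level the "LLM" reads child summaries and picks a branch;
    # the accumulated path is the explainable traversal trace.
    trail = (path or []) + [node.title]
    if not node.children:
        return node, trail
    return tree_search(choose(query, node.children), choose, query, trail)

doc = TreeNode("Merger Agreement", "full document", (1, 120), [
    TreeNode("Article 4: Indemnification", "liability cap provisions", (40, 55), [
        TreeNode("4.2 Caps", "liability cap amounts", (44, 46)),
    ]),
    TreeNode("Appendix G", "schedules", (100, 120)),
])

def keyword_choose(query: str, children: List[TreeNode]) -> TreeNode:
    # Crude stand-in for LLM reasoning: pick the child whose summary
    # shares the most words with the query.
    words = set(query.lower().split())
    return max(children, key=lambda c: len(words & set(c.summary.lower().split())))

leaf, trace = tree_search(doc, keyword_choose, "what is the liability cap")
# leaf.title == "4.2 Caps"; trace records the full traversal path
```

In the real system each `choose` step is an LLM invocation, which is where the per-query cost and latency in the comparison table below come from.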
What It Solves That Vector Search Does Not¶
Vector search finds WHICH documents are relevant across a corpus. PageIndex finds WHERE within a single document the answer lives, following internal cross-references that vector search cannot follow.
| Capability | Vector Search (T1) | PageIndex |
|---|---|---|
| Search across 500K documents | Yes | No (single document) |
| Conceptual matching ("discussions about SPEs") | Yes | N/A |
| Follow cross-references ("see Appendix G") | No | Yes |
| Navigate hierarchical structure (sections, tables) | No | Yes |
| Extract precise facts from structured documents | Weak | Strong |
| Latency per query | ~170ms | Seconds (multiple LLM calls) |
| Cost per query | ~$0.000001 | ~$0.01-0.10 (LLM invocations) |
Why It Does NOT Replace T1¶
PageIndex cannot search across a corpus. An attorney searching "discussions about the defect before anyone used the word defect" across 500K documents cannot run PageIndex on every document — that would be 500K x multiple LLM calls = millions of dollars and hours of latency.
Vector search and PageIndex solve different problems:

- Vector search: "Which of these 500K documents are about this concept?"
- PageIndex: "Where exactly in THIS document is the answer, following all internal references?"
Where It Fits: Post-Retrieval Reasoning in T2 Agents¶
PageIndex is a potential enhancement for the T2 agent service layer, used AFTER vector search narrows the corpus:
```
Step 1: documentsearch hybrid search
  -> finds 20 most relevant documents out of 500K
  -> ~170ms, ~$0.000001

Step 2: PageIndex deep reasoning on those 20 documents
  -> extracts precise answers, follows cross-references,
     resolves "see Section 4.2" and "per Appendix A"
  -> ~10-30 seconds, ~$0.20-1.00
```
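A T2 agent would compose the two steps roughly as follows. Both callables are stand-ins: `corpus_search` for the documentsearch `POST /search` call and `deep_read` for a per-document PageIndex traversal; the function names and signatures are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List

def answer_with_deep_read(
    question: str,
    corpus_search: Callable[[str, int], List[Dict[str, Any]]],  # POST /search stand-in
    deep_read: Callable[[Dict[str, Any], str], str],            # PageIndex stand-in
    top_k: int = 20,
) -> List[Dict[str, Any]]:
    # Step 1: cheap, fast corpus narrowing (vector + BM25 + RRF).
    candidates = corpus_search(question, top_k)
    # Step 2: expensive per-document reasoning, bounded to top_k documents
    # so LLM cost scales with top_k, never with corpus size.
    return [
        {"doc_id": doc["id"], "answer": deep_read(doc, question)}
        for doc in candidates
    ]
```

The key design point is the bound: the expensive step never sees more than `top_k` documents, which is why the combined cost stays in the dollars range rather than the millions estimated above for corpus-wide PageIndex.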
Use Cases Where PageIndex Adds Value¶
| Use Case | How PageIndex Helps | Tier |
|---|---|---|
| Deposition Prep | After finding relevant docs, extract specific facts and build chronology from structured documents | T2 |
| Investigating the Narrative | Deep extraction from identified key documents (contracts, board presentations) | T2 |
| Settlement Prep | Extract strongest evidence passages with cross-reference resolution | T2 |
| Motion to Compel | Follow references to cited documents ("per the Q3 Safety Report") within source documents | T2 |
| Responsive Review | For complex documents (contracts, regulatory filings), determine which sections are responsive | T2 |
Use Cases Where PageIndex Does NOT Help¶
| Use Case | Why Not |
|---|---|
| Hot Doc Identification | Corpus-level search problem, not single-document reasoning |
| Gap Analysis | Comparing result counts across custodians — corpus-level, not document-level |
| Privilege Review | Concept matching across corpus, not structured fact extraction |
| Clawback Risk Assessment | Need to find similar documents fast, not reason within one |
Evaluation Plan (Phase 4)¶
During T2 agent service development, evaluate PageIndex on:
- Structured legal documents — merger agreements, contracts, regulatory filings already identified by vector search. Does tree navigation find cross-referenced provisions that chunk-based snippets miss?
- Cost/latency trade-off — at $0.01-0.10 per document per query, is the improvement worth the cost when applied to 20 post-retrieval documents?
- Integration pattern — PageIndex tree indices could be pre-built at ingest time (alongside embeddings) or built on-demand for documents returned by search. Pre-built adds storage; on-demand adds latency.
- Comparison with long-context LLMs — as context windows grow (Claude supports 200K tokens), feeding the full document to an LLM may achieve similar results without the tree index. Compare PageIndex structured navigation vs. brute-force long-context on the same documents.
Decision: Not for T1, Evaluate for T2¶
PageIndex does not change the T1 architecture. The documentsearch module (vector + BM25 + RRF) remains the correct approach for corpus-level search. PageIndex is a T2 enhancement for structured document reasoning, to be evaluated during agent service development alongside long-context LLM alternatives.