Reference Implementation: documentsearch¶
Overview¶
documentsearch is the semantic search module for the Nextpoint eDiscovery platform. It generates vector embeddings during document ingestion and serves hybrid search (BM25 + vector) via API Gateway. Attorneys can run natural language queries like "internal discussions about Special Purpose Entities" and surface documents that keyword search would miss entirely.
EDRM Stage: 6 (Review) + 7 (Analysis) -- semantic search for document review and investigative analysis. Suite: Discovery (documents). Operates alongside existing keyword search, not replacing it.
Architecture¶
```
documentsearch/
├── lib/lambda/src/documentsearch/
│   ├── core/                          # Pure business logic -- NO AWS imports
│   │   ├── exceptions.py              # Recoverable/Permanent/Silent hierarchy
│   │   ├── checkpoints.py             # EmbeddingCheckpoints enum (7 steps)
│   │   ├── chunking/
│   │   │   ├── strategy.py            # ChunkingStrategy interface
│   │   │   ├── email_chunker.py       # Thread-aware: preserves reply chains + metadata
│   │   │   ├── document_chunker.py    # Section-aware: headings, paragraphs, tables
│   │   │   └── metadata.py            # Prepends sender/date/subject to chunks
│   │   ├── search/
│   │   │   ├── hybrid.py              # Reciprocal Rank Fusion (BM25 + vector)
│   │   │   ├── reranker.py            # Optional LLM reranking of top-K
│   │   │   └── filters.py             # Custodian, date range, batch filters
│   │   ├── models/
│   │   │   ├── db_models.py           # search_chunks, search_embedding_status tables
│   │   │   └── schemas.py             # EmbeddingRequest, SearchRequest, SearchResult
│   │   └── utils/
│   │       ├── db_transaction.py      # @retry_on_db_conflict
│   │       └── context_data.py        # ContextVar-based request context
│   ├── shell/                         # Infrastructure -- AWS, external APIs
│   │   ├── db/
│   │   │   └── database.py            # writer_session/reader_session (per-case)
│   │   ├── vectorstore/
│   │   │   ├── base.py                # VectorStore interface (abstract)
│   │   │   ├── opensearch_store.py    # OpenSearch k-NN implementation
│   │   │   └── pgvector_store.py      # pgvector implementation (prototype)
│   │   ├── keyword/
│   │   │   └── es_ops.py              # BM25 query against existing ES index
│   │   ├── embedding/
│   │   │   └── voyage_client.py       # Voyage AI voyage-law-2 API client
│   │   └── utils/
│   │       ├── sns_ops.py             # EventType enum, SNS publishing
│   │       ├── sqs_ops.py             # SQS message handling
│   │       └── s3_ops.py              # Fetch extracted text from S3
│   ├── handlers/
│   │   ├── embedding_index.py         # SQS handler -- ingest pipeline (chunk + embed)
│   │   ├── search_index.py            # API Gateway handler -- sync search queries
│   │   ├── backfill_index.py          # API Gateway handler -- trigger backfill for case
│   │   └── job_processor_index.py     # Job orchestrator -- per-batch infra lifecycle
│   └── config.py                      # Env vars, region config, Voyage API key ref
├── lib/lambda/tests/
│   ├── conftest.py                    # Auto-use fixtures: mock AWS, DB, Voyage API
│   ├── core/test_chunking.py
│   ├── core/test_hybrid_search.py
│   ├── core/test_rrf.py
│   └── shell/test_voyage_client.py
├── lib/                               # CDK infrastructure (TypeScript)
│   ├── common-resources-stack.ts      # Vector store infra, shared resources
│   ├── search-ingest-stack.ts         # Embedding Lambda + SQS + SNS subscription
│   ├── search-api-stack.ts            # API Gateway + search Lambda + backfill Lambda
│   └── search-stack.ts                # Top-level stack composition
└── CLAUDE.md
```
Pattern Mapping¶
| Pattern | documentsearch Implementation | Standard NGE Pattern |
|---|---|---|
| Hexagonal boundaries | core/ contains chunking, RRF, filters, checkpoints; shell/ wraps Voyage AI, OpenSearch/pgvector, ES, S3 | core/ + shell/ separation |
| Exception hierarchy | RecoverableException, PermanentFailureException, SilentSuccessException | Standard 3-type hierarchy |
| SNS events | 5 EventType enum values (DOCUMENT_EMBEDDED through BACKFILL_COMPLETED) | SNS events as facts (past tense) |
| SQS handler | embedding_index.py -- batch processing with partial failure support | Standard 6-step handler |
| Checkpoint pipeline | 7-step state machine (EMBEDDING_STARTED -> EMBEDDING_COMPLETE) | Checkpoint-based composite PK |
| Database sessions | writer_session, reader_session (per-case) | Per-case MySQL databases |
| Retry/resilience | @retry_on_db_conflict, SQS exponential backoff, DLQ redrive | Standard retry patterns |
| Idempotency | Checkpoint-based (composite PK on search_embedding_status) | Standard idempotency |
| Multi-tenancy | Per-case vector index + per-case MySQL for chunks | Per-case isolation |
| API design | API Gateway + Lambda for sync search, IAM auth | Same as documentexchanger |
| Job processor | Per-batch infra lifecycle for live ingest | Standard orchestration |
| Vector store abstraction | shell/vectorstore/base.py interface, swappable backends | N/A (new pattern) |
| Dual entry point | Async ingest (SQS) + sync query (API Gateway) | Same as documentexchanger |
| Backfill pipeline | Separate queue, lower concurrency, same embedding Lambda | N/A (new pattern) |
Key Design Decisions¶
Where It Fits in the Pipeline¶
Live Ingest (new documents)¶
```
Rails -> ProcessorApi.import()
        |
documentextractor  <-- entry point (UNCHANGED)
        | SNS: DOCUMENT_PROCESSED
        |---> documentloader   -> MySQL + ES    <-- existing (UNCHANGED)
        |---> documentuploader -> PDF/Nutrient  <-- existing (UNCHANGED)
        |---> PSM              -> Athena        <-- existing (UNCHANGED)
        +---> documentsearch (NEW) -> chunk, embed, store vectors
```
Subscribes to the existing DOCUMENT_PROCESSED SNS event via standard fan-out.
Zero changes to upstream modules. The SNS topic already exists; we add one
more SQS subscription with a filter policy.
Backfill (existing documents)¶
Existing documents already passed through the pipeline and will never trigger
new DOCUMENT_PROCESSED events. Their extracted text is already on S3 (from
documentextractor) and their metadata is already in MySQL (from documentloader).
Only chunking and embedding are needed.
```
Rails (admin UI or CLI)
        | POST /backfill { "case_id": 123 }
        v
Backfill Lambda (orchestrator)
        | Query MySQL: SELECT document_id, s3_path FROM exhibits
        |              WHERE NOT EXISTS in search_embedding_status
        v
For each un-embedded document:
        | Publish to SNS: { eventType: "BACKFILL_REQUESTED", ... }
        v
backfill_embedding_queue (SQS, lower concurrency)
        | Same Embedding Lambda code path
        v
Chunk -> Embed -> Store (identical to live ingest)
```
Key properties:
- Incremental: Only processes documents without existing embeddings. Safe to
re-run; checkpoint table prevents duplicate work.
- Resumable: If the backfill is interrupted (Lambda timeout, deployment, etc.),
re-triggering picks up where it left off.
- Throttled: Backfill queue runs at lower MaximumConcurrency than live ingest
to avoid starving live imports or hitting Voyage AI rate limits.
- Progress tracking: DOCUMENT_EMBEDDED events flow through PSM -> Athena.
Rails polls progress via existing NgeCaseTrackerJob.
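The orchestrator loop above can be sketched as follows. This is a minimal illustration, not the actual handler: `enqueue_backfill` and the injected `publish` callable are hypothetical names, and the real Lambda would page through the MySQL anti-join rather than take a pre-fetched list.

```python
import json
from typing import Callable, Iterable, Tuple


def enqueue_backfill(case_id: int,
                     unembedded: Iterable[Tuple[str, str]],
                     publish: Callable[[str], None]) -> int:
    """Publish one BACKFILL_REQUESTED event per un-embedded document.

    `unembedded` is the (document_id, s3_path) result of the anti-join
    against search_embedding_status; `publish` wraps the real SNS client
    so the loop stays unit-testable.
    """
    count = 0
    for document_id, s3_path in unembedded:
        publish(json.dumps({
            "source": "documentsearch",
            "eventType": "BACKFILL_REQUESTED",
            "caseId": case_id,
            "documentId": document_id,
            "eventDetail": {"s3_path": s3_path},
        }))
        count += 1
    return count
```

Because the queue consumer is checkpoint-idempotent, re-running this loop after an interruption simply re-publishes events for documents that still lack a completed status row.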
Search Query Flow¶
Semantic search sits alongside the existing keyword search; it does not replace it:
```
                      +--> existing ES index (BM25 keyword search)
                      |                          |
Rails search request -+       Reciprocal Rank Fusion (in documentsearch)
                      |                          ^
                      +--> vector store (k-NN semantic search)
```
Keyword search stays exactly where it is. The module adds a parallel vector search leg and fuses both result sets via RRF. Attorneys can still search Bates numbers and exact phrases (keyword path). Conceptual queries hit the vector path. RRF combines both into one ranked list.
Checkpoint Pipeline¶
```python
class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1           # Extracted text retrieved from S3
    CHUNKS_CREATED = 2         # Document chunked with metadata
    EMBEDDINGS_GENERATED = 3   # Voyage AI called, vectors returned
    VECTORS_STORED = 4         # Vectors upserted to vector store
    CHUNKS_STORED = 5          # Chunk text + metadata stored in MySQL
    EMBEDDING_COMPLETE = 6
```
Why 7 steps: Voyage AI API calls are the most expensive step (cost and latency).
If Lambda times out after generating embeddings but before storing them, we resume
from EMBEDDINGS_GENERATED and skip the re-call. Without this checkpoint, every
retry re-calls Voyage AI unnecessarily.
Composite PK: (npcase_id, document_id) -- one embedding pipeline per document.
No batch_id in the PK because the same document should never be embedded twice
regardless of which batch triggered it.
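The resume behavior can be illustrated with a small sketch. The enum is from the module; `steps_to_run` is a hypothetical helper showing how a retry skips everything at or below the recorded checkpoint, so a timeout after `EMBEDDINGS_GENERATED` never re-calls Voyage AI.

```python
from enum import Enum


class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1
    CHUNKS_CREATED = 2
    EMBEDDINGS_GENERATED = 3
    VECTORS_STORED = 4
    CHUNKS_STORED = 5
    EMBEDDING_COMPLETE = 6


def steps_to_run(recorded: int) -> list[EmbeddingCheckpoints]:
    """Return only the checkpoints ahead of the one recorded in
    search_embedding_status, so a retry resumes rather than restarts."""
    return [c for c in EmbeddingCheckpoints if c.value > recorded]
```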
Two Queues for Ingest¶
```
SNS topic (existing)
        |---> live_embedding_queue     (MaximumConcurrency: 10) <-- DOCUMENT_PROCESSED
        +---> backfill_embedding_queue (MaximumConcurrency: 3)  <-- BACKFILL_REQUESTED
                         |                     |
             Same Embedding Lambda (different concurrency limits)
```
Live ingest gets priority via higher concurrency. Backfill runs at lower concurrency so it doesn't compete for Voyage API rate limits or vector store write throughput. Both queues feed the same Lambda code -- the handler doesn't know or care whether the trigger was live or backfill.
Backfill Triggering Options¶
| Option | Mechanism | UX |
|---|---|---|
| Explicit per-case | Rails admin UI: "Enable Semantic Search" button per case. POST /backfill { case_id } | Full control, phased rollout |
| Bulk migration | Script iterates all nge_enabled? cases, triggers backfill for each. One-time job. | Initial rollout |
| On first search | If case has no embeddings, trigger backfill automatically. Return keyword-only results with "Semantic search is being prepared" message. | Zero admin overhead, best UX |
Recommended: Option 3 for GA, Option 1 for early access. On-first-search eliminates the enablement step entirely, but early access benefits from explicit control over which cases get the feature.
Re-embedding (Model Upgrades)¶
The embedding_model column on every chunk row tracks which model generated
each embedding. When upgrading (e.g., voyage-law-2 -> voyage-law-3):
```sql
-- Backfill filter becomes model-aware:
SELECT document_id, s3_path FROM exhibits
WHERE document_id NOT IN (
    SELECT document_id FROM search_embedding_status
    WHERE embedding_model = 'voyage-law-3'  -- target model
)
```
Same pipeline, same throttling, same progress tracking. Old vectors are overwritten in the vector store; old chunk rows are updated in MySQL.
Vector Store Abstraction¶
The shell/vectorstore/ layer abstracts the storage backend behind an interface:
```python
# shell/vectorstore/base.py
class VectorStore(ABC):
    @abstractmethod
    def create_index(self, case_id: int, dimension: int) -> None: ...

    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float],
               k: int, filters: dict) -> list[VectorResult]: ...

    @abstractmethod
    def delete_index(self, case_id: int) -> None: ...
```
This allows swapping backends without touching core/:
- Prototype: pgvector on Aurora PostgreSQL (simple, low ops overhead)
- Production option A: OpenSearch k-NN (native hybrid, matches ES pattern)
- Production option B: OpenSearch Serverless (no cluster sizing, pay-per-query)
- Evaluation: Bedrock Knowledge Bases (managed RAG, but limited customization)
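The swappability claim is easy to demonstrate with a test double. The sketch below uses a pared-down version of the interface (no `create_index`, `filters`, or typed result records) and an in-memory brute-force cosine implementation; `InMemoryVectorStore` is purely illustrative, not part of the module.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VectorRecord:
    chunk_id: str
    vector: list[float]


class VectorStore(ABC):
    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float], k: int) -> list[str]: ...


class InMemoryVectorStore(VectorStore):
    """Test double: per-case dict of vectors, brute-force cosine ranking."""

    def __init__(self) -> None:
        self._cases: dict[int, dict[str, list[float]]] = {}

    def upsert_vectors(self, case_id, vectors):
        index = self._cases.setdefault(case_id, {})
        for rec in vectors:
            index[rec.chunk_id] = rec.vector   # upsert: last write wins

    def search(self, case_id, query_vector, k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        index = self._cases.get(case_id, {})   # per-case isolation
        ranked = sorted(index.items(), key=lambda kv: cosine(kv[1], query_vector),
                        reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]
```

Any code written against `VectorStore` runs unchanged against OpenSearch, pgvector, or this double, which is the point of keeping the interface in shell/ and the callers backend-agnostic.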
Vector Store Options¶
Full analysis with cost models, scale limits, and query latency numbers:
see adr/adr-vector-store-selection.md.
Summary:
| Option | Native Hybrid | Replaces ES? | Monthly Cost | Recommendation |
|---|---|---|---|---|
| OpenSearch Managed | YES | YES | $3,900-5,000 | Production (high volume) |
| OpenSearch Serverless | YES | YES | $700-2,000 | Production (moderate volume) |
| Aurora pgvector | No | No | $3,000-3,500* | Prototype |
| Bedrock KB | No | No | Variable | Eliminated |
| MemoryDB Redis | No | No | $1,200-5,000 | Eliminated |
*pgvector cost includes maintaining existing ES cluster for keyword search.
Decision: pgvector for prototype, OpenSearch (Managed or Serverless) for
production. The VectorStore abstraction in shell/vectorstore/ makes the
switch a shell-layer change only. Bedrock KB and MemoryDB were eliminated for
lack of hybrid search, lack of Voyage AI support, or single-shard scale limits.
Multi-Tenant Isolation¶
Vector Store: Index-per-case¶
Why index-per-case:
- Matches existing ES per-case index pattern (Rails already uses this)
- Hard tenant boundary -- no filter policy bugs can leak data across cases
- Independent lifecycle -- delete case = delete index
- Index size stays manageable (even large cases are <1M documents)
MySQL: Per-case database¶
Chunk text and embedding metadata are stored in the per-case database
(nextpoint_case_{id}), same as documentloader's exhibits and attachments.
```python
class SearchChunk(Base):
    __tablename__ = "search_chunks"

    id = Column(Integer, primary_key=True, autoincrement=True)
    document_id = Column(String(255), nullable=False, index=True)
    chunk_id = Column(String(255), nullable=False, unique=True)
    chunk_index = Column(Integer, nullable=False)
    chunk_text = Column(Text, nullable=False)
    metadata_json = Column(JSON)           # custodian, date, subject, page range
    embedding_model = Column(String(100))  # "voyage-law-2"
    created_at = Column(DateTime, server_default=func.now())


class SearchEmbeddingStatus(Base):
    __tablename__ = "search_embedding_status"

    npcase_id = Column(Integer, primary_key=True)
    document_id = Column(String(255), primary_key=True)
    checkpoint_id = Column(Integer, nullable=False, default=0)
    embedding_model = Column(String(100))
    chunk_count = Column(Integer)
    status = Column(String(20))            # "complete", "in_progress", "failed"
    created_at = Column(DateTime, server_default=func.now())
    updated_at = Column(DateTime, onupdate=func.now())
```
Chunks are the source of truth for snippet rendering. Vectors in the vector store are derived and re-generable from chunk text.
Event Types¶
Events consumed¶
| Event | Source | SNS Filter | Purpose |
|---|---|---|---|
| DOCUMENT_PROCESSED | documentextractor | eventType, caseId, batchId | Triggers embedding for a new document |
| IMPORT_CANCELLED | documentloader | eventType, caseId, batchId | Cancels in-progress embedding work |
Events published¶
| Event | Purpose | Consumers |
|---|---|---|
| DOCUMENT_EMBEDDED | Embedding complete for one document | PSM (Athena tracking) |
| EMBEDDING_FAILED | Permanent embedding failure | PSM, Rails (error display) |
| EMBEDDING_JOB_FINISHED | All documents in batch embedded | Job Processor (teardown) |
| BACKFILL_STARTED | Backfill initiated for a case | PSM (tracking) |
| BACKFILL_COMPLETED | All existing docs in case embedded | PSM, Rails (UI update) |
SNS Message Structure¶
```
{
    "source": "documentsearch",
    "jobId": str,
    "caseId": int,
    "batchId": int,
    "documentId": str,
    "eventType": "DOCUMENT_EMBEDDED",
    "status": "SUCCESS",
    "timestamp": "2026-03-31T12:00:00Z",
    "eventDetail": {
        "chunk_count": 12,
        "embedding_model": "voyage-law-2",
        "processing_time_ms": 3400
    }
}
```
MessageAttributes on every publish: eventType, caseId, batchId.
Chunking Strategy¶
Why domain-specific chunking matters¶
Naive 500-token splitting will:
- Split an email reply from the message it's replying to
- Separate "stay the course on the timeline" from the safety defect context
- Lose sender/date/subject metadata that makes results useful for attorneys
Email chunking (core/chunking/email_chunker.py)¶
```
Input:  Email thread with 4 messages
Output: 4 chunks, one per message in thread

Each chunk:
[METADATA] From: jeff.skilling@enron.com | Date: 2001-03-15 | Subject: RE: Raptor update
[BODY] The full text of this single message in the thread
[CONTEXT] Replying to: andy.fastow@enron.com on 2001-03-14 re: "Raptor update"
```
- Preserves sender attribution (critical for custodian-scoped search)
- Keeps reply context without duplicating entire thread in every chunk
- Metadata prefix means the embedding captures WHO said WHAT and WHEN
Document chunking (core/chunking/document_chunker.py)¶
```
Input:  20-page contract
Output: N chunks at natural section boundaries

Strategy:
1. Split at section headings (if detected)
2. Within sections, split at paragraph boundaries
3. Target chunk size: 512 tokens (voyage-law-2 sweet spot)
4. Overlap: 50 tokens between adjacent chunks
5. Prepend document metadata to first chunk
```
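Steps 2-4 can be sketched as a greedy paragraph packer. This is a simplification of the real chunker: it uses whitespace tokens as a stand-in for the model tokenizer, skips heading detection, and the function name is hypothetical.

```python
def chunk_paragraphs(paragraphs: list[str], target_tokens: int = 512,
                     overlap_tokens: int = 50) -> list[str]:
    """Pack paragraphs into chunks of ~target_tokens; each new chunk starts
    with the last ~overlap_tokens of the previous one for continuity."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > target_tokens:
            chunks.append(current)
            current = current[-overlap_tokens:]  # carry overlap forward
        current.extend(words)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```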
Chunk metadata¶
Every chunk carries metadata in two forms:
- Prepended text (embedded with content -- improves retrieval quality)
- Structured fields (stored in MySQL/vector store -- enables filtering)
```python
@dataclass
class ChunkMetadata:
    document_id: str
    chunk_index: int
    custodian: str | None               # Email sender or document author
    date: datetime | None               # Email date or document date
    subject: str | None                 # Email subject or document title
    page_range: tuple[int, int] | None
    document_type: str                  # "email", "document", "spreadsheet"
```
Embedding Pipeline¶
Voyage AI integration (shell/embedding/voyage_client.py)¶
```python
class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128            # Voyage API limit
    MAX_TOKENS_PER_BATCH = 120_000

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Batch embed document chunks. Auto-batches to stay within limits."""
        # Voyage API: POST https://api.voyageai.com/v1/embeddings
        # input_type="document" for corpus embeddings

    def embed_query(self, query: str) -> list[float]:
        """Embed a single search query."""
        # input_type="query" -- asymmetric embedding for retrieval
```
Asymmetric embeddings: Voyage AI uses different input_type for documents
vs queries ("document" at ingest, "query" at search time). This is handled
in shell/, not core/. See patterns/asymmetric-embeddings.md for full
explanation, implementation, and AWS deployment options (direct API vs SageMaker
endpoint).
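The asymmetric switch amounts to one field in the request body. A sketch of building the payload for the endpoint above (`build_embed_payload` is an illustrative helper, not the module's actual function):

```python
def build_embed_payload(texts: list[str], for_query: bool = False) -> dict:
    """Request body for POST https://api.voyageai.com/v1/embeddings.
    input_type is the asymmetric-embedding switch: "document" at ingest time,
    "query" at search time."""
    return {
        "model": "voyage-law-2",
        "input": texts,
        "input_type": "query" if for_query else "document",
    }
```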
Rate limiting and concurrency¶
- Voyage AI rate limits: Managed via SQS
MaximumConcurrencyon embedding Lambda - Live ingest:
MaximumConcurrency: 10(higher priority) - Backfill:
MaximumConcurrency: 3(background, don't starve live) - Tuning: Adjust based on Voyage API tier and case sizes
Handler flow¶
```python
# handlers/embedding_index.py -- standard 6-step SQS handler
def lambda_handler(event, _context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            set_event(record)
            message = _parse_record(record)  # Double-encoded SNS->SQS
            set_npcase(str(message["caseId"]))
            if not get_batch_context():
                with writer_session() as session:
                    set_batch_context(_load_batch_context(session, message))
            SNS(message)
            _route_event(message)  # -> core process: fetch, chunk, embed, store
        except Exception as e:
            _handle_exception(e, record, batch_item_failures)
        finally:
            set_event(None)
            set_batch_context(None)
    return {"batchItemFailures": batch_item_failures}
```
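The "double-encoded" comment refers to standard SNS-to-SQS delivery (raw message delivery disabled): the SQS body is an SNS envelope whose `Message` field is the JSON string that was published. A sketch of what `_parse_record` likely does under that assumption:

```python
import json


def parse_record(record: dict) -> dict:
    """Unwrap an SQS record that carries an SNS envelope: first decode the
    SQS body, then decode the envelope's "Message" field."""
    envelope = json.loads(record["body"])
    return json.loads(envelope["Message"])
```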
Current Elasticsearch State (BM25 Leg)¶
Understanding the existing ES infrastructure is critical because the BM25 leg of hybrid search queries these indices read-only. No ES changes are needed.
BM25 Configuration¶
No custom BM25 parameter tuning: k1 and b are at the Elasticsearch 7.4 defaults; only search_type is explicitly set to a non-default value:

| Parameter | Current Value | ES Default | What It Controls |
|---|---|---|---|
| k1 | 1.2 | 1.2 | Term frequency saturation. Higher = repeated terms weighted more heavily. |
| b | 0.75 | 0.75 | Document length normalization. 1.0 = full normalization, 0.0 = none. |
| search_type | dfs_query_then_fetch | query_then_fetch | Global IDF scoring across all shards (not per-shard). |
Observations and Tuning Opportunities¶
Length normalization (b=0.75) may need tuning for legal search.
Default b=0.75 penalizes long documents relative to short ones. Legal
productions contain an extreme mix: 2-line emails, 5-page memos, 200-page
contracts. With b=0.75:
- A keyword match in a 200-page contract scores lower than the same match in
a 2-line email (normalized by document length)
- This is often wrong for legal search -- a contract mentioning "Special Purpose
Entity" once in 200 pages IS relevant, and shouldn't be penalized relative to
a short email
Post-prototype evaluation: Test b=0.5 or b=0.3 on a known matter and
compare BM25 ranking quality. Lower b reduces the length penalty, giving long
documents fairer ranking. This is a per-index setting change, not a code change.
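For reference, a sketch of what that per-index change could look like. BM25 similarity parameters are static index settings in ES 7.x, so the index would typically be closed and reopened around the update; the index name and target b value below are hypothetical:

```
POST /{physical_index}/_close

PUT /{physical_index}/_settings
{
  "index": {
    "similarity": {
      "default": { "type": "BM25", "k1": 1.2, "b": 0.5 }
    }
  }
}

POST /{physical_index}/_open
```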
dfs_query_then_fetch is already correct. Global IDF prevents scoring
anomalies from uneven document distribution across shards. This is important for
hybrid search determinism -- per-shard IDF would introduce variance.
Index Architecture¶
| Aspect | Current Configuration |
|---|---|
| Strategy | Shared physical indices with per-case filtered aliases (NOT index-per-case) |
| Alias naming | {environment}_{npcase_id}_{type} (e.g., production_12345_exhibits) |
| Physical index naming | {environment}_{type}_{identifier}_{sequential_number} |
| Max shard size | 30 GB (new physical index created when exceeded) |
| Join field | Parent-child for exhibit -> pages (exhibit_join) |
| Full-text field | search_text (uses nextpoint_analyzer) |
| Exhibit mapping | 290+ fields |
| Dynamic mapping | Strict (no auto-detected fields) |
Custom Analyzers¶
| Analyzer | Purpose |
|---|---|
| nextpoint_analyzer | Index-time: email parsing, edge n-gram, path hierarchies |
| nextpoint_search_analyzer | Search-time paired analyzer |
| edge_ngram_analyzer | Autocomplete support (min_gram: 1, max_gram: 6) |
| custom_path_tree | Folder path hierarchy processing |
How Hybrid Search Queries ES¶
The BM25 leg queries the existing per-case filtered alias:
```python
# shell/keyword/es_ops.py
def keyword_search(case_id, query, filters, size=100):
    alias = f"{config.ENVIRONMENT}_{case_id}_exhibits"
    # Standard ES query against existing alias
    # Uses existing nextpoint_search_analyzer
    # Returns BM25-scored results with document_id, score
    # No mapping changes, no new fields, read-only consumer
```
Key point: The vector index (OpenSearch/pgvector) and the keyword index (existing ES 7.4) are completely separate systems. The hybrid search module reads from both and fuses results in application code via RRF. If OpenSearch replaces ES 7.4 in the future, both legs move to one system and native hybrid search becomes possible (eliminating the need for application-level RRF).
Search API¶
Endpoints¶
| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | /search | IAM | Hybrid semantic + keyword search |
| POST | /backfill | IAM | Trigger embedding backfill for a case |
| GET | /status/{case_id} | IAM | Embedding progress for a case |
Search request¶
```json
{
    "query": "internal discussions about Special Purpose Entities",
    "case_id": 123,
    "filters": {
        "custodians": ["jeff.skilling@enron.com"],
        "date_range": {"start": "2001-01-01", "end": "2001-12-31"},
        "batch_ids": [456, 789]
    },
    "limit": 20,
    "offset": 0,
    "mode": "hybrid",
    "exact": false
}
```
| Parameter | Values | Default | Purpose |
|---|---|---|---|
| mode | "hybrid", "semantic", "keyword" | "hybrid" | Which search legs to run |
| exact | true, false | false | Brute-force k-NN + global IDF for bit-identical reproducibility. Slower (~500ms). Use for search methodology declarations. |
Search response¶
```json
{
    "status": "success",
    "data": {
        "results": [
            {
                "document_id": "abc-123",
                "exhibit_id": 42,
                "score": 0.87,
                "bm25_rank": 15,
                "vector_rank": 2,
                "snippets": [
                    {
                        "text": "We need to discuss the Raptor structure before...",
                        "chunk_id": "abc-123-chunk-4",
                        "page": 2
                    }
                ],
                "metadata": {
                    "author": "jeff.skilling@enron.com",
                    "date": "2001-03-15",
                    "subject": "RE: Raptor update"
                }
            }
        ],
        "total": 142,
        "embedding_status": "complete",
        "search_id": "uuid",
        "search_audit": {
            "embedding_model": "voyage-law-2",
            "ef_search": 256,
            "index_version": "case_123_vectors_v2",
            "exact_mode": false,
            "search_mode": "hybrid"
        },
        "timings": {
            "query_embedding_ms": 45,
            "vector_search_ms": 80,
            "keyword_search_ms": 40,
            "fusion_ms": 5,
            "total_ms": 170
        }
    },
    "requestId": "uuid"
}
```
Search handler flow¶
```python
# handlers/search_index.py -- API Gateway handler (<=25s)
def lambda_handler(event, _context):
    try:
        request = SearchRequest.validate(event["body"])
        set_npcase(str(request.case_id))

        # Step 1: Embed the query (Voyage AI, ~50ms)
        query_vector = voyage_client.embed_query(request.query)

        # Step 2: Parallel search (both legs)
        vector_results = vector_store.search(
            case_id=request.case_id,
            query_vector=query_vector,
            k=100,
            filters=request.filters,
        )
        bm25_results = es_ops.keyword_search(
            case_id=request.case_id,  # resolves the existing per-case ES alias
            query=request.query,
            filters=request.filters,
            size=100,
        )

        # Step 3: Reciprocal Rank Fusion (core/ -- pure logic)
        fused = hybrid_search.rrf(vector_results, bm25_results, k=60)

        # Step 4: Fetch snippets for top results
        top_results = fused[:request.limit]
        enriched = _enrich_with_snippets(top_results, request.case_id)
        return api_response(200, enriched)
    except ValidationError as e:
        return api_response(400, error=e.message)
    except Exception as e:
        log_message("search_error", error=str(e))
        return api_response(500, error="Search failed")
```
Reciprocal Rank Fusion (core/search/hybrid.py)¶
```python
def rrf(vector_results: list, bm25_results: list, k: int = 60) -> list:
    """
    Combine two ranked lists using Reciprocal Rank Fusion.

    score(d) = sum(1 / (k + rank_i(d))) for each result set i

    k=60 is the standard constant from the original RRF paper.
    """
    scores = defaultdict(float)
    metadata = {}
    for rank, result in enumerate(vector_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        metadata[doc_id] = {**result, "vector_rank": rank + 1}
    for rank, result in enumerate(bm25_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        if doc_id in metadata:
            metadata[doc_id]["bm25_rank"] = rank + 1
        else:
            metadata[doc_id] = {**result, "bm25_rank": rank + 1}
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"document_id": doc_id, "rrf_score": score, **metadata[doc_id]}
            for doc_id, score in ranked]
```
Pure business logic -- lives in core/search/, no AWS imports, fully unit-testable.
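A quick worked check of the formula, using the ranks from the sample response earlier (vector_rank 2, bm25_rank 15, k = 60), shows why appearing in both result sets dominates:

```python
# RRF score for a document ranked 2nd by vector and 15th by BM25:
both_legs = 1 / (60 + 2) + 1 / (60 + 15)   # ~0.0295

# RRF score for a document ranked 1st, but in only one result set:
one_leg = 1 / (60 + 1)                     # ~0.0164

# A document surfaced by both legs outranks even the #1 hit of a single leg.
assert both_legs > one_leg
```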
New Infrastructure Required¶
Complete Inventory¶
Everything the documentsearch module adds. Existing infrastructure is NOT modified -- see "Current Elasticsearch State" section above.
Voyage AI API Access¶
| Item | Details |
|---|---|
| What | API key for https://api.voyageai.com/v1/embeddings |
| Model | voyage-law-2 (1024 dimensions, legal-optimized) |
| Storage | Secrets Manager (documentsearch/voyage-api-key) |
| Network | Lambda -> NAT Gateway -> Voyage AI (HTTPS). NAT Gateway already exists. |
| Cost | $0.12 per million tokens |
| Alternative | SageMaker endpoint ($737/mo) if data-in-VPC required |
Vector Store¶
| Phase | Technology | Instance | Cost |
|---|---|---|---|
| Prototype | Aurora PostgreSQL 15 + pgvector | db.t3.medium | ~$157/mo (incl. RDS Proxy) |
| Production (option A) | OpenSearch Managed + k-NN | r6g.large.search x 2 | ~$3,900-5,000/mo |
| Production (option B) | OpenSearch Serverless | Auto-scaled OCUs | ~$700-2,000/mo |
Decision deferred. See adr/adr-vector-store-selection.md.
Lambda Functions (4)¶
| Lambda | Memory | Timeout | Trigger | Purpose |
|---|---|---|---|---|
| Embedding Lambda | 1024 MB | 900s | SQS (live + backfill) | Chunk, embed via Voyage AI, store vectors |
| Search Lambda | 512 MB | 25s | API Gateway | Embed query, parallel BM25 + k-NN, RRF |
| Backfill Lambda | 256 MB | 25s | API Gateway | Trigger backfill for a case |
| Job Processor Lambda | 256 MB | 900s | SQS | Per-batch infrastructure lifecycle |
All: Python 3.10+, private subnets, Lambda layer.
SQS Queues (4)¶
| Queue | Purpose | MaximumConcurrency | DLQ Retention |
|---|---|---|---|
| live_embedding_queue | DOCUMENT_PROCESSED events | 10 | 14 days |
| live_embedding_dlq | Dead letters (live) | N/A | 14 days |
| backfill_embedding_queue | BACKFILL_REQUESTED events | 3 | 14 days |
| backfill_embedding_dlq | Dead letters (backfill) | N/A | 14 days |
Visibility timeout = 900s. maxReceiveCount = 3.
SNS Subscriptions (2)¶
| Subscription | Filter Policy |
|---|---|
| Live ingest | eventType: ["DOCUMENT_PROCESSED", "IMPORT_CANCELLED"], caseId, batchId |
| Backfill | eventType: ["BACKFILL_REQUESTED"], caseId |
Added to existing shared SNS topic. Topic itself is unchanged.
API Gateway¶
| Endpoint | Auth | Timeout | Purpose |
|---|---|---|---|
| POST /search | IAM | 25s | Hybrid semantic + keyword search |
| POST /backfill | IAM | 25s | Trigger embedding for existing docs |
| GET /status/{case_id} | IAM | 10s | Embedding progress |
URL published to SSM Parameter Store (documentsearch_api_url).
MySQL Tables (per-case, 2 new tables)¶
Created in each nextpoint_case_{id} database by documentsearch's own migration:
| Table | Purpose | Size Estimate |
|---|---|---|
| search_chunks | Chunk text, metadata, model version | ~1 KB/chunk x chunks |
| search_embedding_status | Per-document checkpoint, status | ~100 bytes/document |
CloudWatch Alarms (5)¶
| Alarm | Threshold |
|---|---|
| Search p99 latency | > 2 seconds |
| Search error rate | > 5% |
| Embedding DLQ depth | > 0 for 15 minutes |
| Backfill DLQ depth | > 0 for 30 minutes |
| Voyage AI error rate | > 10 in 5 minutes |
CDK Stacks (3)¶
CommonResourcesStack -- deployed once per environment:
- Vector store infrastructure (OpenSearch domain or Aurora PostgreSQL)
- Secrets Manager reference for Voyage AI API key
- IAM roles for Lambda -> vector store access
- SNS topic subscription for DOCUMENT_PROCESSED events
SearchIngestStack -- embedding pipeline:
- Embedding Lambda + Lambda layer
- Live ingest SQS queue + DLQ
- Backfill SQS queue + DLQ (separate, lower concurrency)
- SNS filter policies
- MaximumConcurrency: 10 (live), MaximumConcurrency: 3 (backfill)
- Job Processor Lambda for per-batch infrastructure lifecycle
SearchApiStack -- search and backfill endpoints:
- API Gateway REST API
- Search/Backfill/Status Lambdas
- IAM authorizer (Rails -> Lambda)
- CloudWatch alarms
Cost Summary¶
Production corpus: 870M documents, 6.4B pages. Realistic backfill scope after filters (NGE-enabled ~10%, Discovery ~90%): ~78M documents.
One-Time Backfill Costs¶
| Phase | Documents | Voyage AI | Compute | Total |
|---|---|---|---|---|
| Prototype | 50K | $18 | ~$2 | ~$20 |
| Pilot (10 cases) | 500K | $180 | ~$15 | ~$195 |
| Phase 1 (100 cases) | 10M | $3,600 | ~$300 | ~$3,900 |
| Phase 2 (all NGE Discovery) | 78M | $28,000 | ~$2,300 | ~$30,300 |
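The Voyage AI column is consistent with an average of roughly 3,000 tokens per document at the $0.12/M-token rate quoted earlier; that per-document average is inferred from the table, not stated in it. A quick arithmetic check:

```python
docs = 78_000_000            # Phase 2 backfill scope
tokens_per_doc = 3_000       # inferred average; the real mix varies widely
price_per_million = 0.12     # Voyage AI $/M tokens, from the table above

cost = docs * tokens_per_doc / 1_000_000 * price_per_million
# 28,080 -- consistent with the ~$28,000 Phase 2 row (and the same
# per-document average reproduces the $3,600 Phase 1 and $180 pilot rows)
```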
Ongoing Monthly Costs (Post-Backfill, Steady State)¶
| Cost Category | Managed OpenSearch | Serverless OpenSearch |
|---|---|---|
| Infrastructure (vector store, SQS, CW) | $3,950-5,050 | $923-2,223 |
| New document embedding (~2-5M docs/mo) | $790-1,970 | $790-1,970 |
| Search queries (~30-100K/mo) | $15-50 | $15-50 |
| Total monthly | $4,755-7,070 | $1,728-4,243 |
New document embedding is the ongoing cost driver (~$790-1,970/mo for new imports flowing through the pipeline). Search query embedding cost is negligible (~$0.0000012 per query for Voyage AI).
Note: These costs are in addition to existing ES 7.4 cluster. If OpenSearch replaces ES 7.4 long-term (consolidation), the vector store cost is partially offset by the eliminated ES cluster.
See reference-implementations/semantic-search-infrastructure.md for full
breakdown by component, per-query cost analysis, and cost comparison.
Infrastructure That Already Exists (Zero Changes)¶
| Component | Role in Semantic Search | Modified? |
|---|---|---|
| SNS topic | Embedding pipeline subscribes to existing events | No |
| Extracted text on S3 | Input to chunking (documentextractor already ran) | No |
| Elasticsearch 7.4 | BM25 leg queries existing per-case aliases read-only | No |
| Aurora MySQL (per-case DBs) | New tables added via module migration; exhibits table read-only | No |
| PSM -> Firehose -> Athena | Captures DOCUMENT_EMBEDDED events automatically | No |
| NgeCaseTrackerJob | Polls embedding progress via existing Athena pipeline | No |
| VPC / subnets / NAT Gateway | CDK stacks deploy into existing network | No |
| Secrets Manager | Stores Voyage AI API key (standard pattern) | No |
| SSM Parameter Store | Publishes documentsearch API URL | No |
Integration with Rails¶
Ingest -- no Rails changes needed¶
The embedding pipeline subscribes to existing SNS events. Rails doesn't need to
know about it. PSM captures DOCUMENT_EMBEDDED events via the existing
Firehose -> Athena pipeline, so Rails can track embedding progress via
NgeCaseTrackerJob (existing polling mechanism).
Backfill -- one new API call¶
```ruby
# app/helpers/semantic_search_helper.rb (new)
def trigger_semantic_backfill(case_id)
  # IAM-authenticated call to documentsearch API Gateway
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/backfill",
    { case_id: case_id }.to_json
  )
  JSON.parse(response.body)
end

def semantic_search_status(case_id)
  response = iam_get(
    ssm_parameter("documentsearch_api_url") + "/status/#{case_id}"
  )
  JSON.parse(response.body)
end
```
Search -- new API call alongside existing search¶
# app/helpers/semantic_search_helper.rb (continued)
def semantic_search(case_id:, query:, filters: {}, limit: 20)
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/search",
    { query: query, case_id: case_id, filters: filters, limit: limit }.to_json
  )
  JSON.parse(response.body)
end
UI integration (post-prototype)¶
- "Semantic Search" toggle/tab alongside existing keyword search
- Results page reuses existing document list components
- Snippet highlighting is new (rendered from chunk text in search response)
- Side-by-side comparison mode (keyword vs semantic) for attorney evaluation
- "Semantic search is being prepared" state when backfill is in progress
- Embedding progress indicator in import status UI
Existing Data Handling (Complete Flow)¶
What Already Exists for Each Document¶
Every document that went through the NGE pipeline already has all the raw material needed for embedding. No re-extraction or re-processing is required.
| Data | Location | Created By | Available? |
|---|---|---|---|
| Raw file | S3 (case_{id}/documents/{uid}/(unknown)) | Upload | Yes |
| Extracted text | S3 (case_{id}/documents/{uid}/extracted.txt) | documentextractor | Yes |
| Metadata (author, date, subject) | MySQL (exhibits table in nextpoint_case_{id}) | documentloader | Yes |
| ES index entry (BM25-searchable) | ES alias ({env}_{case_id}_exhibits) | documentloader | Yes |
| Chunk text | Not yet created | documentsearch | No |
| Vector embeddings | Not yet created | documentsearch | No |
| Embedding status | Not yet created | documentsearch | No |
Backfill creates only the last three rows. The expensive work (extraction, metadata parsing, ES indexing) is already done.
What Backfill Does NOT Require¶
| Not Needed | Why |
|---|---|
| Re-extraction of text | Already on S3 from documentextractor |
| Re-indexing in Elasticsearch | Existing ES entries used as-is for BM25 leg |
| Rails downtime | Backfill is background SQS processing |
| Import re-processing | Documents don't re-enter the pipeline |
| Schema migration on exhibits table | New tables only; exhibits table is read-only |
| Upstream module changes | documentsearch subscribes independently |
Backfill Scale Estimates¶
Per-Case Estimates¶
| Case Size | Documents | Chunks (est.) | Voyage AI Cost | Time (MaxConc=3) | Time (MaxConc=10) |
|---|---|---|---|---|---|
| Small | 1,000 | 10,000 | ~$0.24 | ~10 min | ~3 min |
| Medium | 50,000 | 500,000 | ~$12 | ~8 hours | ~2.5 hours |
| Large | 500,000 | 5,000,000 | ~$120 | ~3.5 days | ~1 day |
Assumptions: 1 document per invocation, ~2 seconds per document (S3 fetch + chunk + Voyage API + vector store + MySQL). Voyage AI cost: ~10 chunks/doc x ~200 tokens/chunk x $0.12/M tokens, matching the chunk counts above (the corpus-wide average is closer to ~15 chunks/doc).
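The table's arithmetic can be reproduced with a small estimator. All constants are the document's planning assumptions, not measured values:

```python
# Rough per-case backfill estimator using the planning assumptions above.
def backfill_estimate(docs, chunks_per_doc=10, tokens_per_chunk=200,
                      usd_per_m_tokens=0.12, secs_per_doc=2.0, max_concurrency=3):
    """Return (voyage_cost_usd, wall_clock_minutes) for a backfill run."""
    tokens = docs * chunks_per_doc * tokens_per_chunk
    cost = tokens / 1_000_000 * usd_per_m_tokens
    minutes = docs * secs_per_doc / max_concurrency / 60
    return cost, minutes

# Small case from the table: 1,000 docs -> ~$0.24, ~11 min at MaxConcurrency=3
cost, minutes = backfill_estimate(1_000)
```

Raising `max_concurrency` shortens wall-clock time linearly but pushes harder against the Voyage API rate limit, which is why backfill runs at a lower concurrency than live ingest.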
Production Corpus (870M documents, 6.4B pages)¶
Production corpus analysis (end of 2025):
| Metric | Value |
|---|---|
| Total documents | 870M |
| Total pages | 6.4B |
| Avg pages per document | ~7.4 |
| Estimated chunks per document | ~15 |
Scope filters (cumulative):
| Filter | Docs After | Reduction | Rationale |
|---|---|---|---|
| Full corpus | 870M | -- | Starting point |
| NGE-enabled cases only | ~87M | ~90% out | Only ~10% of cases are NGE-enabled |
| Discovery suite only | ~78M | ~10% out | ~90% of docs are Discovery; Litigation is T2+ |
| Realistic backfill scope | ~78M | ~91% out (cumulative) | Active NGE Discovery cases |
Phased rollout:
| Phase | Scope | Docs | Cost | Timeline |
|---|---|---|---|---|
| Prototype | 1 known case | ~50K | ~$18 | 1 day |
| Pilot | 10 active NGE cases | ~500K | ~$180 | 1 day |
| Phase 1 | Top 100 active NGE cases | ~10M | ~$3,600 | 2-3 days |
| Phase 2 | All active NGE Discovery | ~78M | ~$28,000 | 1-2 weeks |
| On-demand | Remaining cases | Per-case | Per-case | On search |
Phase 1 ($3,600) proves value. Phase 2 ($28K) requires business case approval. On-demand backfill handles the long tail of 870M documents that will rarely be searched.
See reference-implementations/semantic-search-infrastructure.md for full
cost model with storage, compute, and rate limit analysis.
Step 1: Identify un-embedded documents¶
-- Per-case query (runs in backfill Lambda)
SELECT e.document_id, e.s3_text_path, e.author, e.date_sent, e.subject
FROM exhibits e
LEFT JOIN search_embedding_status s
ON s.document_id = e.document_id
AND s.embedding_model = %(target_model)s
WHERE s.document_id IS NULL
AND e.delete_at_gmt IS NULL
ORDER BY e.id
LIMIT 1000 OFFSET %(offset)s
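A sketch of how the backfill Lambda might page through this query until it is exhausted. The `run_query` callable is a hypothetical helper wrapping the SQL above, not the module's actual API:

```python
# Hedged sketch: drive the Step 1 query page by page until no rows remain.
# `run_query(limit, offset)` is an assumed helper that binds the SQL's
# %(offset)s parameter and returns a list of row objects.
def iter_unembedded(run_query, page_size=1000):
    offset = 0
    while True:
        rows = run_query(limit=page_size, offset=offset)
        if not rows:
            return  # no more un-embedded documents
        yield from rows
        offset += page_size
```

Note that LIMIT/OFFSET pagination degrades on very large cases because MySQL must skip past all prior rows; keyset pagination on `e.id` (WHERE e.id > last_seen ORDER BY e.id) is a common alternative if backfill queries become slow.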
Step 2: Publish backfill events (batched)¶
# handlers/backfill_index.py
def lambda_handler(event, _context):
    request = BackfillRequest.validate(event["body"])
    case_id = request.case_id
    with reader_session(npcase_id=str(case_id)) as session:
        total = _count_unembedded(session, case_id, config.EMBEDDING_MODEL)
        if total == 0:
            return api_response(200, {"message": "All documents already embedded"})
        documents = _get_unembedded_batch(session, case_id, config.EMBEDDING_MODEL)
    # One SNS message per document, published in slices of 100 so no single
    # iteration dominates the Lambda timeout
    for batch in chunked(documents, 100):
        for doc in batch:
            sns.publish(
                eventType="BACKFILL_REQUESTED",
                caseId=case_id,
                documentId=doc.document_id,
                eventDetail={
                    "s3_path": doc.s3_text_path,
                    "embedding_model": config.EMBEDDING_MODEL,
                },
            )
    return api_response(202, {
        "message": f"Backfill initiated for {total} documents",
        "case_id": case_id,
        "total_documents": total,
    })
Step 3: Embedding Lambda processes (same code path)¶
The embedding Lambda receives messages from either live_embedding_queue or
backfill_embedding_queue. The handler code is identical -- it doesn't
distinguish between live and backfill events.
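A sketch of that shared code path. The handler and helper names here are illustrative, not the module's actual identifiers:

```python
import json

def embed_document(case_id, document_id, s3_path, model):
    # Placeholder for the real chunk -> embed -> store pipeline.
    return {"case_id": case_id, "document_id": document_id, "model": model}

def lambda_handler(event, _context):
    # SQS batch; records may come from live_embedding_queue or
    # backfill_embedding_queue -- the handler never inspects which,
    # because both event types carry the same detail fields.
    processed = 0
    for record in event["Records"]:
        message = json.loads(record["body"])
        detail = message["eventDetail"]
        embed_document(
            case_id=message["caseId"],
            document_id=message["documentId"],
            s3_path=detail["s3_path"],
            model=detail["embedding_model"],
        )
        processed += 1
    return {"processed": processed}
```

Keeping one code path means backfill automatically inherits every fix and feature of live ingest; only the queue concurrency settings differ.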
Step 4: Track progress¶
-- Status query (runs in status Lambda)
SELECT
COUNT(*) as total_documents,
SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as embedded,
SUM(CASE WHEN status = 'in_progress' THEN 1 ELSE 0 END) as in_progress,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
embedding_model
FROM search_embedding_status
WHERE npcase_id = %(case_id)s
GROUP BY embedding_model
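The status Lambda can fold those rows into a per-model progress figure. A minimal sketch, with field names following the query above:

```python
def embedding_progress(rows):
    """Map embedding_model -> fraction of documents fully embedded,
    from the GROUP BY rows of the status query."""
    out = {}
    for row in rows:
        total = row["total_documents"]
        out[row["embedding_model"]] = (row["embedded"] / total) if total else 0.0
    return out
```

This is what the `/status/{case_id}` endpoint and the "Semantic search is being prepared" UI state would consume.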
Step 5: Handle edge cases¶
| Edge Case | Handling |
|---|---|
| Document re-imported after backfill | Live DOCUMENT_PROCESSED event triggers new embedding. Checkpoint PK prevents duplicate. Old chunks/vectors overwritten. |
| Backfill + live ingest concurrent | Both write to checkpoint table. Composite PK (npcase_id, document_id) means first writer wins. Second gets RecoverableException, retries, sees checkpoint = COMPLETE, skips. |
| Case deleted during backfill | IMPORT_CANCELLED event stops in-progress work. Remaining SQS messages processed and skipped (SilentSuccessException). Case deletion also deletes the vector index and per-case DB. |
| Voyage AI unavailable | RecoverableException -> SQS exponential backoff (120s -> 900s). After max retries, DLQ. Backfill can be re-triggered after API recovery. |
| Model upgrade mid-backfill | embedding_model column tracks per-document. Re-run backfill with new model -- filter picks up documents without new model's embeddings. |
Non-Functional Requirements¶
NFRs that must be defined before production deployment (Phase 1). Identified via gap analysis against the documentsearch architecture.
Performance¶
| Requirement | Target | Measurement |
|---|---|---|
| Search latency (p99) | < 2 seconds | CloudWatch alarm on API Gateway |
| Search latency (p50) | < 500ms | CloudWatch metric |
| Query embedding latency | < 100ms | Voyage AI response time |
| Embedding throughput (live ingest) | >= 2,560 docs/min | SQS age of oldest message |
| Backfill throughput | >= 2,560 docs/min (standard API) | Backfill progress tracking |
| Parallel search legs | BM25 and vector search run in parallel, not sequential | Implementation requirement |
Availability¶
| Requirement | Target | Notes |
|---|---|---|
| Search API uptime | 99.9% (8.76 hours downtime/year) | Match existing ES availability |
| Embedding pipeline uptime | 99.5% | Slight degradation acceptable; new docs queue in SQS |
| Degraded mode | If vector store is down, return BM25-only results | Graceful degradation, not full outage |
| Failover | OpenSearch multi-AZ (production) | Single-AZ acceptable for prototype |
Degraded mode is critical: if the vector store is unavailable, search should fall back to keyword-only results (BM25 leg) rather than failing entirely. The attorney gets results — just not semantically enhanced results.
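A sketch of that fallback in the search path. The three callables stand in for the module's real BM25 leg, vector leg, and RRF fusion:

```python
def hybrid_search(query, bm25_search, vector_search, fuse):
    keyword_hits = bm25_search(query)  # existing ES leg
    try:
        vector_hits = vector_search(query)
    except Exception:
        # Vector store unreachable: degrade to keyword-only and flag the
        # mode so the UI can tell the attorney results are not
        # semantically enhanced.
        return {"results": keyword_hits, "mode": "bm25_only_degraded"}
    return {"results": fuse(keyword_hits, vector_hits), "mode": "hybrid"}
```

Surfacing the degraded mode in the response (rather than silently returning fewer results) also matters for defensibility: the audit log records which mode produced a given result set.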
Data Retention and Lifecycle¶
| Data | Retention | Deletion Trigger |
|---|---|---|
| Vector embeddings (OpenSearch) | Lifetime of case | Case deletion -> delete vector index |
| Chunk text (MySQL search_chunks) | Lifetime of case | Case deletion -> table dropped with per-case DB |
| Embedding status (MySQL search_embedding_status) | Lifetime of case | Case deletion -> table dropped |
| Search audit logs (CloudWatch) | 90 days (configurable) | CloudWatch log group retention policy |
| Search audit logs (S3 archive) | 7 years (compliance) | S3 lifecycle policy |
When a case is deleted:
1. Delete the vector index (case_{id}_vectors) from OpenSearch
2. Per-case MySQL database is dropped (existing pattern) — search_chunks
and search_embedding_status tables are deleted with it
3. Search audit logs in CloudWatch are retained per retention policy
(they don't contain document content, only query metadata and result IDs)
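The three steps can be sketched as one teardown routine. Both callables are illustrative stand-ins for the VectorStore abstraction and the existing per-case DB drop:

```python
def on_case_deleted(case_id, vector_store, drop_case_db):
    # Step 1: remove the per-case vector index
    vector_store.delete_index(f"case_{case_id}_vectors")
    # Step 2: the existing per-case DB drop takes search_chunks and
    # search_embedding_status with it
    drop_case_db(case_id)
    # Step 3: no action here -- CloudWatch audit logs age out under the
    # retention policy and contain no document content
```

Deleting the vector index before the DB means a crash between steps leaves orphaned status rows (harmless; they go with the DB drop on retry) rather than orphaned vectors (a tenant-isolation concern).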
Disaster Recovery¶
| Component | Backup Strategy | RPO | RTO |
|---|---|---|---|
| OpenSearch vectors | Automated snapshots to S3 (hourly) | 1 hour | 2-4 hours (restore from snapshot) |
| MySQL chunks/status | Aurora automated backups (existing) | 5 minutes (Aurora continuous) | < 1 hour |
| Voyage AI API key | Secrets Manager (replicated) | 0 | Minutes |
| CDK infrastructure | Infrastructure as code, redeployable | 0 | 30-60 minutes |
Worst case (total OpenSearch loss): Re-embed from extracted text on S3. Chunks in MySQL are the source of truth for text; vectors are derived and re-generable. Full re-embedding of Phase 1 (10M docs) takes 2-3 days. This is the backup of last resort — snapshots should prevent it.
Security and Encryption¶
| Layer | Requirement | Implementation |
|---|---|---|
| Encryption at rest (vectors) | AES-256 | OpenSearch domain encryption enabled |
| Encryption at rest (chunks) | AES-256 | Aurora encryption enabled (existing) |
| Encryption in transit | TLS 1.2+ | All API calls, Lambda -> OpenSearch, Lambda -> Aurora |
| Voyage AI data handling | SOC 2 compliant | Document text sent to Voyage API for embedding. Voyage does not store input data (verify in terms). |
| IAM authentication | Rails -> API Gateway | Same pattern as documentexchanger |
| Per-case isolation | Hard tenant boundary | Vector index-per-case, MySQL per-case database |
| Secret management | Voyage API key in Secrets Manager | Rotatable, auditable |
If compliance requires document text to never leave VPC: switch from Voyage
direct API to SageMaker endpoint (see patterns/asymmetric-embeddings.md).
Compliance¶
| Standard | Status | Gap |
|---|---|---|
| SOC 2 | Partial | Voyage AI is SOC 2 compliant. End-to-end audit trail (search audit logs) designed. Need formal review of new components. |
| HIPAA | Partial | Encryption at rest/transit covered. BAA with Voyage AI needed if cases contain PHI. SageMaker endpoint avoids external data transfer. |
| Data residency | Depends on deployment | Voyage direct API: data leaves VPC. SageMaker: data stays in VPC. Document the choice. |
Scalability¶
| Dimension | Current Design | Limit | Scaling Action |
|---|---|---|---|
| Concurrent searches | Lambda concurrency | 1000 (AWS default) | Request limit increase |
| Cases with embeddings | Per-case OpenSearch index | 30K shards/domain | Tiered storage (hot/warm/cold) |
| Documents per case | Single vector index | 25M vectors per index | Sufficient for largest cases |
| Embedding throughput | SQS MaxConcurrency | Voyage API rate limit | Enterprise tier or SageMaker |
Observability¶
| Component | Monitoring | Alerting |
|---|---|---|
| Search latency | CloudWatch metrics (p50, p99) | Alarm if p99 > 2s |
| Search errors | CloudWatch error rate | Alarm if > 5% |
| Embedding DLQ | Queue depth monitoring | Alarm if > 0 for 15 min |
| Backfill DLQ | Queue depth monitoring | Alarm if > 0 for 30 min |
| Voyage AI errors | Error count metric | Alarm if > 10 in 5 min |
| OpenSearch cluster health | Cluster status (green/yellow/red) | Alarm on yellow or red |
| Embedding pipeline lag | SQS age of oldest message | Alarm if > 30 min |
| Retrieval quality | Not yet defined | Gap — define quality metrics post-prototype |
Retrieval Quality (Post-Prototype)¶
Retrieval quality monitoring is a gap that should be addressed after the prototype validates the baseline. Proposed approach:
- Golden query set: 20-50 queries with known-good results on a reference case
- Automated regression: Run golden queries after any change to chunking, embedding model, search parameters, or RRF weights
- Metrics: Precision@10, Recall@20, NDCG@20, Mean Reciprocal Rank
- Drift detection: Monthly automated run of golden queries, alert if metrics drop > 5% from baseline
This is not needed for the prototype — the prototype IS the quality validation. Define the golden query set during pilot (10 cases with attorney feedback).
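Two of the proposed metrics are simple enough to sketch directly: Precision@K and Mean Reciprocal Rank. Here `relevant` is the attorney-vetted known-good set for a golden query, and `retrieved` is the ranked result list under test:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k results that are in the known-good set."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per golden query.
    Scores 1/rank of the first relevant hit, averaged across queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

NDCG@20 and Recall@20 follow the same shape; running all four over the golden set after any chunking, model, or RRF-weight change gives the regression signal described above.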
Determinism and Legal Defensibility¶
Are search results deterministic?¶
Semantic search introduces one new source of non-determinism compared to keyword search: the HNSW approximate nearest neighbor algorithm. Breaking the pipeline down component by component:
| Component | Deterministic? | Details |
|---|---|---|
| Embedding generation (Voyage AI) | Yes | Same text produces the same vector every time. No temperature, no sampling. |
| BM25 search (Elasticsearch) | Mostly | Same query against a static index returns same scores. Minor variance from per-shard IDF calculation in multi-shard indices. Fixable with search_type=dfs_query_then_fetch. |
| HNSW vector search | No (approximate) | Returns approximately nearest neighbors, not guaranteed exact. See below. |
| RRF fusion | Yes | Pure math on ranked lists. Same inputs produce same output. |
| LLM reranking (optional, not in T1) | No | LLMs have sampling variance. Not used in T1 architecture. |
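The RRF row is easy to verify: fusion is a deterministic pure function of the ranked lists. A sketch matching the configured k=60:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Same ranked inputs always yield the same fused ordering."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties break by insertion order, which is itself deterministic
    return sorted(scores, key=scores.get, reverse=True)
```

So any run-to-run variance in hybrid results comes from the legs feeding RRF (HNSW approximation, per-shard IDF), never from the fusion step itself.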
HNSW approximation behavior¶
HNSW does not scan every vector. It traverses a graph to find approximate
nearest neighbors. For a static index (no concurrent writes), queries ARE
deterministic: same query vector + same graph + same ef_search = same
traversal = same results.
When the index is being updated (live ingest or backfill), results can vary between queries because the graph structure changes.
The approximation itself means HNSW may miss some true nearest neighbors in
exchange for speed. The ef_search parameter controls this trade-off:
ef_search=100 → examines ~100 candidates → ~5ms, ~95% recall
ef_search=256 → examines ~256 candidates → ~15ms, ~99% recall
ef_search=512 → examines ~512 candidates → ~30ms, ~99.5% recall
At ef_search=256 (our default), the results that vary between runs are
documents at the relevance boundary -- documents that are barely relevant
either way. The top results are effectively stable.
Comparison to keyword search¶
Keyword search has similar non-determinism sources that are less visible:
| Source of variation | Keyword search | Semantic search |
|---|---|---|
| Index updates between queries | Yes | Yes |
| Shard-level scoring variance | Yes (ES IDF per shard) | Yes (HNSW graph per shard) |
| Approximate algorithm | No (exact match) | Yes (HNSW) |
| Relevance ranking stability | High (BM25 is stable) | High for top results, minor variance at boundary |
| Model changes | N/A | Yes (re-embedding changes all vectors) |
Defensibility strategy¶
For legal defensibility, the architecture provides four mechanisms:
1. Search audit logging
Every search call is logged with full reproducibility data:
{
  "search_id": "uuid",
  "query": "communications about the safety defect",
  "query_vector": [0.22, -0.40, ...],   # the actual vector, not recomputed
  "case_id": 123,
  "filters": {"custodians": [...], "date_range": {...}},
  "timestamp": "2026-03-31T10:00:00Z",
  "embedding_model": "voyage-law-2",
  "ef_search": 256,
  "result_count": 142,
  "result_document_ids": ["doc-1", "doc-2", ...],
  "result_scores": [0.0323, 0.0318, ...],
  "index_version": "case_123_vectors_v2",
  "search_mode": "hybrid",
  "timings": {"query_embedding_ms": 45, "vector_search_ms": 80, ...}
}
The attorney can declare: "I ran this query at this time and received these specific 142 documents. The search log is attached as Exhibit A."
2. High ef_search default
ef_search=256 provides ~99% recall. The 1% of true nearest neighbors that
might be missed are at the relevance boundary -- documents with cosine
similarity barely above the threshold. Top results are stable across runs.
3. Point-in-time result snapshots
The legal workflow is: search -> tag results -> review tagged set. Once documents are tagged into a review folder, the review set is frozen. This is the same pattern attorneys already use with keyword search -- they don't re-run keyword searches mid-review and expect identical results while the index is being updated.
4. Exact mode for reproducibility-critical searches
For searches that must be bit-identical across runs (e.g., for a motion declaration or a search methodology affidavit), the API supports exact mode:
POST /search
{
  "query": "communications about the safety defect",
  "case_id": 123,
  "mode": "hybrid",
  "exact": true,
  "index_snapshot": "v2"
}
Exact mode:
- Runs brute-force k-NN (not HNSW) -- scans every vector, no approximation
- Pins to a specific index snapshot version
- Uses dfs_query_then_fetch for BM25 (global IDF, not per-shard)
- Returns bit-identical results every time against the same snapshot
- Slower (~500ms vs ~20ms) -- use for final declarations, not exploration
What to communicate to attorneys¶
Semantic search results are:
- Stable -- the top results for a given query are consistent across runs against a static index. Minor variations occur only at the relevance boundary.
- Reproducible -- every search is logged with the query vector, timestamp, full result set, and index version. The log is the definitive record.
- Not bit-identical during active indexing -- same as keyword search during active imports. Results stabilize once indexing is complete.
- Defensible -- result snapshots are preserved in audit logs. Exact mode available for search methodology declarations. The search methodology (hybrid BM25 + vector, RRF fusion, specific model and parameters) is fully documentable.
Search methodology documentation template¶
For declarations or search methodology letters, the system can generate:
Search Methodology Statement
Search Technology: Hybrid semantic + keyword search (documentsearch v1.0)
Embedding Model: Voyage AI voyage-law-2 (1024 dimensions, legal-optimized)
Keyword Engine: Elasticsearch 7.4 (BM25 scoring)
Fusion Method: Reciprocal Rank Fusion (k=60)
Vector Search: HNSW with ef_search=256 (~99% recall)
Search Parameters:
  Query: [natural language query text]
  Filters: [custodians, date range, batch IDs]
  Mode: hybrid (semantic + keyword)
  Results returned: [N] documents
Reproducibility:
  Search ID: [uuid]
  Executed: [timestamp]
  Index version: [version identifier]
  Full result set preserved in search audit log.
Divergences from Standard NGE Patterns¶
| Aspect | documentsearch | Standard NGE | Why |
|---|---|---|---|
| External API dependency | Voyage AI embedding API | No external APIs | Embedding generation requires specialized model not available in AWS |
| Dual ingest queues | Live + backfill at different concurrency | Single queue per event type | Backfill must not starve live ingest or Voyage rate limits |
| Sync API for search | API Gateway -> Lambda (25s) | Async SQS processing | Search requires synchronous response for UI |
| Vector store abstraction | shell/vectorstore/base.py interface | No storage abstraction | Allows prototype (pgvector) and production (OpenSearch) backends |
| No per-batch dynamic infra | Job Processor optional (may use static queues) | Per-batch Lambda/SQS creation | Embedding is lighter-weight than document loading; static queues may suffice |
| Backfill pipeline | Separate trigger and queue for existing data | No backfill concept | Must handle documents that pre-date the module |
Configuration¶
| Setting | Default | Notes |
|---|---|---|
| Embedding model | voyage-law-2 | Configurable for model upgrades |
| Embedding dimensions | 1024 | Must match model output |
| Chunk target size | 512 tokens | Voyage AI sweet spot |
| Chunk overlap | 50 tokens | Cross-boundary context preservation |
| RRF k constant | 60 | Standard from RRF paper |
| Live MaxConcurrency | 10 | Embedding Lambda SQS concurrency |
| Backfill MaxConcurrency | 3 | Lower priority than live |
| HNSW ef_search | 256 | ~99% recall. Higher = more accurate, slower. |
| Exact mode default | false | Brute-force k-NN for reproducibility. ~500ms vs ~20ms. |
| Search top-K per leg | 100 | Candidates from each search leg before RRF |
| Search Lambda timeout | 25s | API Gateway integration limit |
| Embedding Lambda timeout | 900s | Max Lambda timeout |
| Embedding Lambda memory | 1024MB | Sufficient for vector operations |
| Search Lambda memory | 512MB | Lighter than embedding |
Prototype vs Production¶
| Concern | Prototype | Production |
|---|---|---|
| Vector store | pgvector on single Aurora PostgreSQL | OpenSearch k-NN or OpenSearch Serverless |
| Multi-tenancy | Single hardcoded case | Per-case index lifecycle |
| Chunking | Paragraph-level with metadata prepend | Domain-specific (email/document/spreadsheet) |
| Rate limiting | SQS MaxConcurrency only | SQS MaxConcurrency + Voyage API rate tracking |
| Frontend | Standalone React app (Claude Code) | Rails UI integration (ERB + React) |
| Embedding versioning | None | Model version tracked, re-embedding pipeline |
| Cost tracking | None | Token counting per Voyage call, per-case allocation |
| Monitoring | CloudWatch logs only | Latency alarms, retrieval quality metrics |
| Backfill | Manual script | API + on-first-search auto-trigger |
| Backfill scope | 1 case (~50K docs, ~$18) | Phase 1: 10M docs (~$3,600), Phase 2: 78M docs (~$28K) |
Future Evaluation: PageIndex (T2)¶
PageIndex (VectifyAI, open-source) is a vectorless, reasoning-based retrieval framework that builds a hierarchical tree from a document and uses LLM reasoning to navigate it. Mafin 2.5 (built on PageIndex) achieved 98.7% on FinanceBench where typical vector RAG scored 65-80%.
PageIndex does NOT replace T1 (corpus-level search). It solves a different problem: deep structured reasoning WITHIN a single document (following cross-references, navigating sections/tables). It cannot search across a corpus.
How it works: Two-step process. (1) Index generation: analyzes document's natural structure (sections, subsections, headings) and builds a hierarchical tree where each node has a title, summary, answerable-questions, and page range. Original document stays intact. (2) Reasoning-driven tree search: LLM reads top-level nodes, reasons which branch likely contains the answer, follows it, repeats at each level, retrieves complete sections (not fragments). Retrieval is explainable — you see which tree nodes were traversed and which pages retrieved.
Core critique of vector RAG relevant to our chunking strategy: Semantic similarity ≠ relevance. A paragraph about "capital expenditure 2021" and "revenue projections 2023" embed near each other because both use financial language, but only one answers the question. Chunking destroys the author's organizational intelligence (sections, cross-references, logical flow). Our domain-aware chunking (email-aware, section-aware) mitigates this but doesn't eliminate it.
Potential T2 integration: Use vector search to find top 20 documents, then PageIndex for deep extraction from those 20. Relevant for depo prep, narrative investigation, settlement prep, motion to compel — use cases where the attorney needs precise facts from structured documents (contracts, regulatory filings) after search identifies them.
Additional resources: PageIndex is open-source (github.com/VectifyAI/PageIndex), has a cloud API (api.pageindex.ai/v1) with MCP protocol integration, and a chat interface (chat.pageindex.ai) for testing on custom documents.
Evaluate during Phase 4 (agent service development). Compare against
long-context LLMs on the same documents. See
reference-implementations/semantic-search-use-cases.md Appendix C for full
evaluation plan.
Key File Locations¶
| File | Purpose |
|---|---|
| lib/lambda/src/documentsearch/core/chunking/ | Domain-specific chunking logic |
| lib/lambda/src/documentsearch/core/search/hybrid.py | Reciprocal Rank Fusion |
| lib/lambda/src/documentsearch/shell/vectorstore/base.py | Vector store abstraction |
| lib/lambda/src/documentsearch/shell/embedding/voyage_client.py | Voyage AI integration |
| lib/lambda/src/documentsearch/handlers/embedding_index.py | Ingest handler (SQS) |
| lib/lambda/src/documentsearch/handlers/search_index.py | Search handler (API GW) |
| lib/lambda/src/documentsearch/handlers/backfill_index.py | Backfill trigger (API GW) |
| lib/lambda/src/documentsearch/handlers/job_processor_index.py | Batch lifecycle |
| lib/common-resources-stack.ts | Vector store + shared infra |
| lib/search-ingest-stack.ts | Embedding pipeline CDK |
| lib/search-api-stack.ts | Search API CDK |