Reference Implementation: documentsearch¶
Overview¶
documentsearch is the semantic search module for the Nextpoint eDiscovery platform. It generates vector embeddings during document ingestion and serves hybrid search (BM25 + vector) via API Gateway. Attorneys can run natural language queries like "internal discussions about Special Purpose Entities" and surface documents that keyword search would miss entirely.
EDRM Stage: 6 (Review) + 7 (Analysis) -- semantic search for document review and investigative analysis. Suite: Discovery (documents). Operates alongside existing keyword search, not replacing it.
Architecture¶
```
documentsearch/
├── lib/lambda/src/documentsearch/
│   ├── core/                          # Pure business logic -- NO AWS imports
│   │   ├── exceptions.py              # Recoverable/Permanent/Silent hierarchy
│   │   ├── checkpoints.py             # EmbeddingCheckpoints enum (7 steps)
│   │   ├── chunking/
│   │   │   ├── strategy.py            # ChunkingStrategy interface
│   │   │   ├── email_chunker.py       # Thread-aware: preserves reply chains + metadata
│   │   │   ├── document_chunker.py    # Section-aware: headings, paragraphs, tables
│   │   │   └── metadata.py            # Prepends sender/date/subject to chunks
│   │   ├── search/
│   │   │   ├── hybrid.py              # Reciprocal Rank Fusion (BM25 + vector)
│   │   │   ├── reranker.py            # Optional LLM reranking of top-K
│   │   │   └── filters.py             # Custodian, date range, batch filters
│   │   ├── models/
│   │   │   ├── db_models.py           # search_chunks, search_embedding_status tables
│   │   │   └── schemas.py             # EmbeddingRequest, SearchRequest, SearchResult
│   │   └── utils/
│   │       ├── db_transaction.py      # @retry_on_db_conflict
│   │       └── context_data.py        # ContextVar-based request context
│   ├── shell/                         # Infrastructure -- AWS, external APIs
│   │   ├── db/
│   │   │   └── database.py            # writer_session/reader_session (per-case)
│   │   ├── vectorstore/
│   │   │   ├── base.py                # VectorStore interface (abstract)
│   │   │   ├── opensearch_store.py    # OpenSearch k-NN implementation
│   │   │   └── pgvector_store.py      # pgvector implementation (prototype)
│   │   ├── keyword/
│   │   │   └── es_ops.py              # BM25 query against existing ES index
│   │   ├── embedding/
│   │   │   └── voyage_client.py       # Voyage AI voyage-law-2 API client
│   │   └── utils/
│   │       ├── sns_ops.py             # EventType enum, SNS publishing
│   │       ├── sqs_ops.py             # SQS message handling
│   │       └── s3_ops.py              # Fetch extracted text from S3
│   ├── handlers/
│   │   ├── embedding_index.py         # SQS handler -- ingest pipeline (chunk + embed)
│   │   ├── search_index.py            # API Gateway handler -- sync search queries
│   │   ├── backfill_index.py          # API Gateway handler -- trigger backfill for case
│   │   └── job_processor_index.py     # Job orchestrator -- per-batch infra lifecycle
│   └── config.py                      # Env vars, region config, Voyage API key ref
├── lib/lambda/tests/
│   ├── conftest.py                    # Auto-use fixtures: mock AWS, DB, Voyage API
│   ├── core/test_chunking.py
│   ├── core/test_hybrid_search.py
│   ├── core/test_rrf.py
│   └── shell/test_voyage_client.py
├── lib/                               # CDK infrastructure (TypeScript)
│   ├── common-resources-stack.ts      # Vector store infra, shared resources
│   ├── search-ingest-stack.ts         # Embedding Lambda + SQS + SNS subscription
│   ├── search-api-stack.ts            # API Gateway + search Lambda + backfill Lambda
│   └── search-stack.ts                # Top-level stack composition
└── CLAUDE.md
```
Pattern Mapping¶
| Pattern | documentsearch Implementation | Standard NGE Pattern |
|---|---|---|
| Hexagonal boundaries | core/ contains chunking, RRF, filters, checkpoints; shell/ wraps Voyage AI, OpenSearch/pgvector, ES, S3 | core/ + shell/ separation |
| Exception hierarchy | RecoverableException, PermanentFailureException, SilentSuccessException | Standard 3-type hierarchy |
| SNS events | 5 EventType enum values (DOCUMENT_EMBEDDED through BACKFILL_COMPLETED) | SNS events as facts (past tense) |
| SQS handler | embedding_index.py -- batch processing with partial failure support | Standard 6-step handler |
| Checkpoint pipeline | 7-step state machine (EMBEDDING_STARTED -> EMBEDDING_COMPLETE) | Checkpoint-based composite PK |
| Database sessions | writer_session, reader_session (per-case) | Per-case MySQL databases |
| Retry/resilience | @retry_on_db_conflict, SQS exponential backoff, DLQ redrive | Standard retry patterns |
| Idempotency | Checkpoint-based (composite PK on search_embedding_status) | Standard idempotency |
| Multi-tenancy | Per-case vector index + per-case MySQL for chunks | Per-case isolation |
| API design | API Gateway + Lambda for sync search, IAM auth | Same as documentexchanger |
| Job processor | Per-batch infra lifecycle for live ingest | Standard orchestration |
| Vector store abstraction | shell/vectorstore/base.py interface, swappable backends | N/A (new pattern) |
| Dual entry point | Async ingest (SQS) + sync query (API Gateway) | Same as documentexchanger |
| Backfill pipeline | Separate queue, lower concurrency, same embedding Lambda | N/A (new pattern) |
Key Design Decisions¶
Where It Fits in the Pipeline¶
Live Ingest (new documents)¶
```
Rails -> ProcessorApi.import()
        |
documentextractor  <-- entry point (UNCHANGED)
        | SNS: DOCUMENT_PROCESSED
        |---> documentloader   -> MySQL + ES    <-- existing (UNCHANGED)
        |---> documentuploader -> PDF/Nutrient  <-- existing (UNCHANGED)
        |---> PSM              -> Athena        <-- existing (UNCHANGED)
        +---> documentsearch (NEW) -> chunk, embed, store vectors
```
Subscribes to the existing DOCUMENT_PROCESSED SNS event via standard fan-out.
Zero changes to upstream modules. The SNS topic already exists; we add one
more SQS subscription with a filter policy.
Backfill (existing documents)¶
Existing documents already passed through the pipeline and will never trigger
new DOCUMENT_PROCESSED events. Their extracted text is already on S3 (from
documentextractor) and their metadata is already in MySQL (from documentloader).
Only chunking and embedding are needed.
```
Rails (admin UI or CLI)
        | POST /backfill { "case_id": 123 }
        v
Backfill Lambda (orchestrator)
        | Query MySQL: SELECT document_id, s3_path FROM exhibits
        |              WHERE NOT EXISTS in search_embedding_status
        v
For each un-embedded document:
        | Publish to SNS: { eventType: "BACKFILL_REQUESTED", ... }
        v
backfill_embedding_queue (SQS, lower concurrency)
        | Same Embedding Lambda code path
        v
Chunk -> Embed -> Store (identical to live ingest)
```
Key properties:
- Incremental: Only processes documents without existing embeddings. Safe to
re-run; checkpoint table prevents duplicate work.
- Resumable: If the backfill is interrupted (Lambda timeout, deployment, etc.),
re-triggering picks up where it left off.
- Throttled: Backfill queue runs at lower MaximumConcurrency than live ingest
to avoid starving live imports or hitting Voyage AI rate limits.
- Progress tracking: DOCUMENT_EMBEDDED events flow through PSM -> Athena.
Rails polls progress via existing NgeCaseTrackerJob.
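The orchestrator loop above can be sketched as follows. This is a minimal illustration, not the actual handler: `enqueue_backfill` and the injected `publish` callable are hypothetical names, and the real Lambda would page through the MySQL anti-join rather than take a pre-fetched list.

```python
import json
from typing import Callable, Iterable, Tuple


def enqueue_backfill(case_id: int,
                     unembedded: Iterable[Tuple[str, str]],
                     publish: Callable[[str], None]) -> int:
    """Publish one BACKFILL_REQUESTED event per un-embedded document.

    `unembedded` is the (document_id, s3_path) result of the anti-join
    against search_embedding_status; `publish` wraps the real SNS client
    so the loop stays unit-testable.
    """
    count = 0
    for document_id, s3_path in unembedded:
        publish(json.dumps({
            "source": "documentsearch",
            "eventType": "BACKFILL_REQUESTED",
            "caseId": case_id,
            "documentId": document_id,
            "eventDetail": {"s3_path": s3_path},
        }))
        count += 1
    return count
```

Because the queue consumer is checkpoint-idempotent, re-running this loop after an interruption simply re-publishes events for documents that still lack a completed status row.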
Search Query Flow¶
Semantic search sits alongside the existing keyword search; it does not replace it:
```
                      +--> existing ES index (BM25 keyword search)
                      |                          |
Rails search request -+       Reciprocal Rank Fusion (in documentsearch)
                      |                          ^
                      +--> vector store (k-NN semantic search)
```
Keyword search stays exactly where it is. The module adds a parallel vector search leg and fuses both result sets via RRF. Attorneys can still search Bates numbers and exact phrases (keyword path). Conceptual queries hit the vector path. RRF combines both into one ranked list.
Checkpoint Pipeline¶
```python
class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1           # Extracted text retrieved from S3
    CHUNKS_CREATED = 2         # Document chunked with metadata
    EMBEDDINGS_GENERATED = 3   # Voyage AI called, vectors returned
    VECTORS_STORED = 4         # Vectors upserted to vector store
    CHUNKS_STORED = 5          # Chunk text + metadata stored in MySQL
    EMBEDDING_COMPLETE = 6
```
Why 7 steps: Voyage AI API calls are the most expensive step (cost and latency).
If Lambda times out after generating embeddings but before storing them, we resume
from EMBEDDINGS_GENERATED and skip the re-call. Without this checkpoint, every
retry re-calls Voyage AI unnecessarily.
Composite PK: (npcase_id, document_id) -- one embedding pipeline per document.
No batch_id in the PK because the same document should never be embedded twice
regardless of which batch triggered it.
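The resume behavior can be illustrated with a small sketch. The enum is from the module; `steps_to_run` is a hypothetical helper showing how a retry skips everything at or below the recorded checkpoint, so a timeout after `EMBEDDINGS_GENERATED` never re-calls Voyage AI.

```python
from enum import Enum


class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1
    CHUNKS_CREATED = 2
    EMBEDDINGS_GENERATED = 3
    VECTORS_STORED = 4
    CHUNKS_STORED = 5
    EMBEDDING_COMPLETE = 6


def steps_to_run(recorded: int) -> list[EmbeddingCheckpoints]:
    """Return only the checkpoints ahead of the one recorded in
    search_embedding_status, so a retry resumes rather than restarts."""
    return [c for c in EmbeddingCheckpoints if c.value > recorded]
```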
Two Queues for Ingest¶
```
SNS topic (existing)
        |---> live_embedding_queue     (MaximumConcurrency: 10) <-- DOCUMENT_PROCESSED
        +---> backfill_embedding_queue (MaximumConcurrency: 3)  <-- BACKFILL_REQUESTED
                         |                     |
             Same Embedding Lambda (different concurrency limits)
```
Live ingest gets priority via higher concurrency. Backfill runs at lower concurrency so it doesn't compete for Voyage API rate limits or vector store write throughput. Both queues feed the same Lambda code -- the handler doesn't know or care whether the trigger was live or backfill.
Backfill Triggering Options¶
| Option | Mechanism | UX |
|---|---|---|
| Explicit per-case | Rails admin UI: "Enable Semantic Search" button per case. POST /backfill { case_id } | Full control, phased rollout |
| Bulk migration | Script iterates all nge_enabled? cases, triggers backfill for each. One-time job. | Initial rollout |
| On first search | If case has no embeddings, trigger backfill automatically. Return keyword-only results with "Semantic search is being prepared" message. | Zero admin overhead, best UX |
Recommended: Option 3 for GA, Option 1 for early access. On-first-search eliminates the enablement step entirely, but early access benefits from explicit control over which cases get the feature.
Re-embedding (Model Upgrades)¶
The embedding_model column on every chunk row tracks which model generated
each embedding. When upgrading (e.g., voyage-law-2 -> voyage-law-3):
```sql
-- Backfill filter becomes model-aware:
SELECT document_id, s3_path FROM exhibits
WHERE document_id NOT IN (
    SELECT document_id FROM search_embedding_status
    WHERE embedding_model = 'voyage-law-3'  -- target model
)
```
Same pipeline, same throttling, same progress tracking. Old vectors are overwritten in the vector store; old chunk rows are updated in MySQL.
Vector Store Abstraction¶
The shell/vectorstore/ layer abstracts the storage backend behind an interface:
```python
# shell/vectorstore/base.py
class VectorStore(ABC):
    @abstractmethod
    def create_index(self, case_id: int, dimension: int) -> None: ...

    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float],
               k: int, filters: dict) -> list[VectorResult]: ...

    @abstractmethod
    def delete_index(self, case_id: int) -> None: ...
```
This allows swapping backends without touching core/:
- Prototype: pgvector on Aurora PostgreSQL (simple, low ops overhead)
- Production option A: OpenSearch k-NN (native hybrid, matches ES pattern)
- Production option B: OpenSearch Serverless (no cluster sizing, pay-per-query)
- Evaluation: Bedrock Knowledge Bases (managed RAG, but limited customization)
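The swappability claim is easy to demonstrate with a test double. The sketch below uses a pared-down version of the interface (no `create_index`, `filters`, or typed result records) and an in-memory brute-force cosine implementation; `InMemoryVectorStore` is purely illustrative, not part of the module.

```python
import math
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class VectorRecord:
    chunk_id: str
    vector: list[float]


class VectorStore(ABC):
    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float], k: int) -> list[str]: ...


class InMemoryVectorStore(VectorStore):
    """Test double: per-case dict of vectors, brute-force cosine ranking."""

    def __init__(self) -> None:
        self._cases: dict[int, dict[str, list[float]]] = {}

    def upsert_vectors(self, case_id, vectors):
        index = self._cases.setdefault(case_id, {})
        for rec in vectors:
            index[rec.chunk_id] = rec.vector   # upsert: last write wins

    def search(self, case_id, query_vector, k):
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        index = self._cases.get(case_id, {})   # per-case isolation
        ranked = sorted(index.items(), key=lambda kv: cosine(kv[1], query_vector),
                        reverse=True)
        return [chunk_id for chunk_id, _ in ranked[:k]]
```

Any code written against `VectorStore` runs unchanged against OpenSearch, pgvector, or this double, which is the point of keeping the interface in shell/ and the callers backend-agnostic.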
Vector Store Options¶
Full analysis with cost models, scale limits, and query latency numbers:
see adr/adr-vector-store-selection.md.
Summary:
| Option | Native Hybrid | Replaces ES? | Monthly Cost | Recommendation |
|---|---|---|---|---|
| OpenSearch Managed | YES | YES | $3,900-5,000 | Production (high volume) |
| OpenSearch Serverless | YES | YES | $700-2,000 | Production (moderate volume) |
| Aurora pgvector | No | No | $3,000-3,500* | Prototype |
| Bedrock KB | No | No | Variable | Eliminated |
| MemoryDB Redis | No | No | $1,200-5,000 | Eliminated |
*pgvector cost includes maintaining existing ES cluster for keyword search.
Decision: pgvector for prototype, OpenSearch (Managed or Serverless) for
production. The VectorStore abstraction in shell/vectorstore/ makes the
switch a shell-layer change only. Bedrock KB and MemoryDB were eliminated for
lack of hybrid search, lack of Voyage AI support, or single-shard scale limits.
Multi-Tenant Isolation¶
Vector Store: Index-per-case¶
Why index-per-case:
- Matches existing ES per-case index pattern (Rails already uses this)
- Hard tenant boundary -- no filter policy bugs can leak data across cases
- Independent lifecycle -- delete case = delete index
- Index size stays manageable (even large cases are <1M documents)
MySQL: Per-case database¶
Chunk text and embedding metadata are stored in the per-case database
(nextpoint_case_{id}), same as documentloader's exhibits and attachments.
```python
class SearchChunk(Base):
    __tablename__ = "search_chunks"

    id = Column(Integer, primary_key=True, autoincrement=True)
    document_id = Column(String(255), nullable=False, index=True)
    chunk_id = Column(String(255), nullable=False, unique=True)
    chunk_index = Column(Integer, nullable=False)
    chunk_text = Column(Text, nullable=False)
    metadata_json = Column(JSON)           # custodian, date, subject, page range
    embedding_model = Column(String(100))  # "voyage-law-2"
    created_at = Column(DateTime, server_default=func.now())


class SearchEmbeddingStatus(Base):
    __tablename__ = "search_embedding_status"

    npcase_id = Column(Integer, primary_key=True)
    document_id = Column(String(255), primary_key=True)
    checkpoint_id = Column(Integer, nullable=False, default=0)
    embedding_model = Column(String(100))
    chunk_count = Column(Integer)
    status = Column(String(20))            # "complete", "in_progress", "failed"
    created_at = Column(DateTime, server_default=func.now())
    updated_at = Column(DateTime, onupdate=func.now())
```
Chunks are the source of truth for snippet rendering. Vectors in the vector store are derived and re-generable from chunk text.
Event Types¶
Events consumed¶
| Event | Source | SNS Filter | Purpose |
|---|---|---|---|
| DOCUMENT_PROCESSED | documentextractor | eventType, caseId, batchId | Triggers embedding for a new document |
| IMPORT_CANCELLED | documentloader | eventType, caseId, batchId | Cancels in-progress embedding work |
Events published¶
| Event | Purpose | Consumers |
|---|---|---|
| DOCUMENT_EMBEDDED | Embedding complete for one document | PSM (Athena tracking) |
| EMBEDDING_FAILED | Permanent embedding failure | PSM, Rails (error display) |
| EMBEDDING_JOB_FINISHED | All documents in batch embedded | Job Processor (teardown) |
| BACKFILL_STARTED | Backfill initiated for a case | PSM (tracking) |
| BACKFILL_COMPLETED | All existing docs in case embedded | PSM, Rails (UI update) |
SNS Message Structure¶
```
{
    "source": "documentsearch",
    "jobId": str,
    "caseId": int,
    "batchId": int,
    "documentId": str,
    "eventType": "DOCUMENT_EMBEDDED",
    "status": "SUCCESS",
    "timestamp": "2026-03-31T12:00:00Z",
    "eventDetail": {
        "chunk_count": 12,
        "embedding_model": "voyage-law-2",
        "processing_time_ms": 3400
    }
}
```
MessageAttributes on every publish: eventType, caseId, batchId.
Chunking Strategy¶
Why domain-specific chunking matters¶
Naive 500-token splitting will:
- Split an email reply from the message it's replying to
- Separate "stay the course on the timeline" from the safety defect context
- Lose sender/date/subject metadata that makes results useful for attorneys
Email chunking (core/chunking/email_chunker.py)¶
```
Input:  Email thread with 4 messages
Output: 4 chunks, one per message in thread

Each chunk:
[METADATA] From: jeff.skilling@enron.com | Date: 2001-03-15 | Subject: RE: Raptor update
[BODY] The full text of this single message in the thread
[CONTEXT] Replying to: andy.fastow@enron.com on 2001-03-14 re: "Raptor update"
```
- Preserves sender attribution (critical for custodian-scoped search)
- Keeps reply context without duplicating entire thread in every chunk
- Metadata prefix means the embedding captures WHO said WHAT and WHEN
Document chunking (core/chunking/document_chunker.py)¶
```
Input:  20-page contract
Output: N chunks at natural section boundaries

Strategy:
1. Split at section headings (if detected)
2. Within sections, split at paragraph boundaries
3. Target chunk size: 512 tokens (voyage-law-2 sweet spot)
4. Overlap: 50 tokens between adjacent chunks
5. Prepend document metadata to first chunk
```
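Steps 2-4 can be sketched as a greedy paragraph packer. This is a simplification of the real chunker: it uses whitespace tokens as a stand-in for the model tokenizer, skips heading detection, and the function name is hypothetical.

```python
def chunk_paragraphs(paragraphs: list[str], target_tokens: int = 512,
                     overlap_tokens: int = 50) -> list[str]:
    """Pack paragraphs into chunks of ~target_tokens; each new chunk starts
    with the last ~overlap_tokens of the previous one for continuity."""
    chunks: list[list[str]] = []
    current: list[str] = []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > target_tokens:
            chunks.append(current)
            current = current[-overlap_tokens:]  # carry overlap forward
        current.extend(words)
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]
```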
Chunk metadata¶
Every chunk carries metadata in two forms:
- Prepended text (embedded with content -- improves retrieval quality)
- Structured fields (stored in MySQL/vector store -- enables filtering)
```python
@dataclass
class ChunkMetadata:
    document_id: str
    chunk_index: int
    custodian: str | None               # Email sender or document author
    date: datetime | None               # Email date or document date
    subject: str | None                 # Email subject or document title
    page_range: tuple[int, int] | None
    document_type: str                  # "email", "document", "spreadsheet"
```
Embedding Pipeline¶
Voyage AI integration (shell/embedding/voyage_client.py)¶
```python
class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128            # Voyage API limit
    MAX_TOKENS_PER_BATCH = 120_000

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Batch embed document chunks. Auto-batches to stay within limits."""
        # Voyage API: POST https://api.voyageai.com/v1/embeddings
        # input_type="document" for corpus embeddings

    def embed_query(self, query: str) -> list[float]:
        """Embed a single search query."""
        # input_type="query" -- asymmetric embedding for retrieval
```
Asymmetric embeddings: Voyage AI uses different input_type for documents
vs queries ("document" at ingest, "query" at search time). This is handled
in shell/, not core/. See patterns/asymmetric-embeddings.md for full
explanation, implementation, and AWS deployment options (direct API vs SageMaker
endpoint).
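The asymmetric switch amounts to one field in the request body. A sketch of building the payload for the endpoint above (`build_embed_payload` is an illustrative helper, not the module's actual function):

```python
def build_embed_payload(texts: list[str], for_query: bool = False) -> dict:
    """Request body for POST https://api.voyageai.com/v1/embeddings.
    input_type is the asymmetric-embedding switch: "document" at ingest time,
    "query" at search time."""
    return {
        "model": "voyage-law-2",
        "input": texts,
        "input_type": "query" if for_query else "document",
    }
```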
Rate limiting and concurrency¶
- Voyage AI rate limits: Managed via SQS
MaximumConcurrencyon embedding Lambda - Live ingest:
MaximumConcurrency: 10(higher priority) - Backfill:
MaximumConcurrency: 3(background, don't starve live) - Tuning: Adjust based on Voyage API tier and case sizes
Handler flow¶
```python
# handlers/embedding_index.py -- standard 6-step SQS handler
def lambda_handler(event, _context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            set_event(record)
            message = _parse_record(record)  # Double-encoded SNS->SQS
            set_npcase(str(message["caseId"]))
            if not get_batch_context():
                with writer_session() as session:
                    set_batch_context(_load_batch_context(session, message))
            SNS(message)
            _route_event(message)  # -> core process: fetch, chunk, embed, store
        except Exception as e:
            _handle_exception(e, record, batch_item_failures)
        finally:
            set_event(None)
            set_batch_context(None)
    return {"batchItemFailures": batch_item_failures}
```
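The "double-encoded" comment refers to standard SNS-to-SQS delivery (raw message delivery disabled): the SQS body is an SNS envelope whose `Message` field is the JSON string that was published. A sketch of what `_parse_record` likely does under that assumption:

```python
import json


def parse_record(record: dict) -> dict:
    """Unwrap an SQS record that carries an SNS envelope: first decode the
    SQS body, then decode the envelope's "Message" field."""
    envelope = json.loads(record["body"])
    return json.loads(envelope["Message"])
```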
Current Elasticsearch State (BM25 Leg)¶
Understanding the existing ES infrastructure is critical because the BM25 leg of hybrid search queries these indices read-only. No ES changes are needed.
BM25 Configuration¶
No custom BM25 parameter tuning: k1 and b are at the Elasticsearch 7.4 defaults; only search_type is explicitly set to a non-default value:

| Parameter | Current Value | ES Default | What It Controls |
|---|---|---|---|
| k1 | 1.2 | 1.2 | Term frequency saturation. Higher = repeated terms weighted more heavily. |
| b | 0.75 | 0.75 | Document length normalization. 1.0 = full normalization, 0.0 = none. |
| search_type | dfs_query_then_fetch | query_then_fetch | Global IDF scoring across all shards (not per-shard). |
Observations and Tuning Opportunities¶
Length normalization (b=0.75) may need tuning for legal search.
Default b=0.75 penalizes long documents relative to short ones. Legal
productions contain an extreme mix: 2-line emails, 5-page memos, 200-page
contracts. With b=0.75:
- A keyword match in a 200-page contract scores lower than the same match in
a 2-line email (normalized by document length)
- This is often wrong for legal search -- a contract mentioning "Special Purpose
Entity" once in 200 pages IS relevant, and shouldn't be penalized relative to
a short email
Post-prototype evaluation: Test b=0.5 or b=0.3 on a known matter and
compare BM25 ranking quality. Lower b reduces the length penalty, giving long
documents fairer ranking. This is a per-index setting change, not a code change.
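For reference, a sketch of what that per-index change could look like. BM25 similarity parameters are static index settings in ES 7.x, so the index would typically be closed and reopened around the update; the index name and target b value below are hypothetical:

```
POST /{physical_index}/_close

PUT /{physical_index}/_settings
{
  "index": {
    "similarity": {
      "default": { "type": "BM25", "k1": 1.2, "b": 0.5 }
    }
  }
}

POST /{physical_index}/_open
```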
dfs_query_then_fetch is already correct. Global IDF prevents scoring
anomalies from uneven document distribution across shards. This is important for
hybrid search determinism -- per-shard IDF would introduce variance.
Index Architecture¶
| Aspect | Current Configuration |
|---|---|
| Strategy | Shared physical indices with per-case filtered aliases (NOT index-per-case) |
| Alias naming | {environment}_{npcase_id}_{type} (e.g., production_12345_exhibits) |
| Physical index naming | {environment}_{type}_{identifier}_{sequential_number} |
| Max shard size | 30 GB (new physical index created when exceeded) |
| Join field | Parent-child for exhibit -> pages (exhibit_join) |
| Full-text field | search_text (uses nextpoint_analyzer) |
| Exhibit mapping | 290+ fields |
| Dynamic mapping | Strict (no auto-detected fields) |
Custom Analyzers¶
| Analyzer | Purpose |
|---|---|
| nextpoint_analyzer | Index-time: email parsing, edge n-gram, path hierarchies |
| nextpoint_search_analyzer | Search-time paired analyzer |
| edge_ngram_analyzer | Autocomplete support (min_gram: 1, max_gram: 6) |
| custom_path_tree | Folder path hierarchy processing |
How Hybrid Search Queries ES¶
The BM25 leg queries the existing per-case filtered alias:
```python
# shell/keyword/es_ops.py
def keyword_search(case_id, query, filters, size=100):
    alias = f"{config.ENVIRONMENT}_{case_id}_exhibits"
    # Standard ES query against existing alias
    # Uses existing nextpoint_search_analyzer
    # Returns BM25-scored results with document_id, score
    # No mapping changes, no new fields, read-only consumer
```
Key point: The vector index (OpenSearch/pgvector) and the keyword index (existing ES 7.4) are completely separate systems. The hybrid search module reads from both and fuses results in application code via RRF. If OpenSearch replaces ES 7.4 in the future, both legs move to one system and native hybrid search becomes possible (eliminating the need for application-level RRF).
Search API¶
Endpoints¶
| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | /search | IAM | Hybrid semantic + keyword search |
| POST | /backfill | IAM | Trigger embedding backfill for a case |
| GET | /status/{case_id} | IAM | Embedding progress for a case |
Search request¶
```json
{
    "query": "internal discussions about Special Purpose Entities",
    "case_id": 123,
    "filters": {
        "custodians": ["jeff.skilling@enron.com"],
        "date_range": {"start": "2001-01-01", "end": "2001-12-31"},
        "batch_ids": [456, 789]
    },
    "limit": 20,
    "offset": 0,
    "mode": "hybrid",
    "exact": false
}
```
| Parameter | Values | Default | Purpose |
|---|---|---|---|
| mode | "hybrid", "semantic", "keyword" | "hybrid" | Which search legs to run |
| exact | true, false | false | Brute-force k-NN + global IDF for bit-identical reproducibility. Slower (~500ms). Use for search methodology declarations. |
Search response¶
```json
{
    "status": "success",
    "data": {
        "results": [
            {
                "document_id": "abc-123",
                "exhibit_id": 42,
                "score": 0.87,
                "bm25_rank": 15,
                "vector_rank": 2,
                "snippets": [
                    {
                        "text": "We need to discuss the Raptor structure before...",
                        "chunk_id": "abc-123-chunk-4",
                        "page": 2
                    }
                ],
                "metadata": {
                    "author": "jeff.skilling@enron.com",
                    "date": "2001-03-15",
                    "subject": "RE: Raptor update"
                }
            }
        ],
        "total": 142,
        "embedding_status": "complete",
        "search_id": "uuid",
        "search_audit": {
            "embedding_model": "voyage-law-2",
            "ef_search": 256,
            "index_version": "case_123_vectors_v2",
            "exact_mode": false,
            "search_mode": "hybrid"
        },
        "timings": {
            "query_embedding_ms": 45,
            "vector_search_ms": 80,
            "keyword_search_ms": 40,
            "fusion_ms": 5,
            "total_ms": 170
        }
    },
    "requestId": "uuid"
}
```
Search handler flow¶
```python
# handlers/search_index.py -- API Gateway handler (<=25s)
def lambda_handler(event, _context):
    try:
        request = SearchRequest.validate(event["body"])
        set_npcase(str(request.case_id))

        # Step 1: Embed the query (Voyage AI, ~50ms)
        query_vector = voyage_client.embed_query(request.query)

        # Step 2: Parallel search (both legs)
        vector_results = vector_store.search(
            case_id=request.case_id,
            query_vector=query_vector,
            k=100,
            filters=request.filters,
        )
        bm25_results = es_ops.keyword_search(
            case_id=request.case_id,  # resolves the existing per-case ES alias
            query=request.query,
            filters=request.filters,
            size=100,
        )

        # Step 3: Reciprocal Rank Fusion (core/ -- pure logic)
        fused = hybrid_search.rrf(vector_results, bm25_results, k=60)

        # Step 4: Fetch snippets for top results
        top_results = fused[:request.limit]
        enriched = _enrich_with_snippets(top_results, request.case_id)
        return api_response(200, enriched)
    except ValidationError as e:
        return api_response(400, error=e.message)
    except Exception as e:
        log_message("search_error", error=str(e))
        return api_response(500, error="Search failed")
```
Reciprocal Rank Fusion (core/search/hybrid.py)¶
```python
def rrf(vector_results: list, bm25_results: list, k: int = 60) -> list:
    """
    Combine two ranked lists using Reciprocal Rank Fusion.

    score(d) = sum(1 / (k + rank_i(d))) for each result set i

    k=60 is the standard constant from the original RRF paper.
    """
    scores = defaultdict(float)
    metadata = {}
    for rank, result in enumerate(vector_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        metadata[doc_id] = {**result, "vector_rank": rank + 1}
    for rank, result in enumerate(bm25_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        if doc_id in metadata:
            metadata[doc_id]["bm25_rank"] = rank + 1
        else:
            metadata[doc_id] = {**result, "bm25_rank": rank + 1}
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"document_id": doc_id, "rrf_score": score, **metadata[doc_id]}
            for doc_id, score in ranked]
```
Pure business logic -- lives in core/search/, no AWS imports, fully unit-testable.
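A quick worked check of the formula, using the ranks from the sample response earlier (vector_rank 2, bm25_rank 15, k = 60), shows why appearing in both result sets dominates:

```python
# RRF score for a document ranked 2nd by vector and 15th by BM25:
both_legs = 1 / (60 + 2) + 1 / (60 + 15)   # ~0.0295

# RRF score for a document ranked 1st, but in only one result set:
one_leg = 1 / (60 + 1)                     # ~0.0164

# A document surfaced by both legs outranks even the #1 hit of a single leg.
assert both_legs > one_leg
```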
New Infrastructure Required¶
Complete Inventory¶
Everything the documentsearch module adds. Existing infrastructure is NOT modified -- see "Current Elasticsearch State" section above.
Voyage AI API Access¶
| Item | Details |
|---|---|
| What | API key for https://api.voyageai.com/v1/embeddings |
| Model | voyage-law-2 (1024 dimensions, legal-optimized) |
| Storage | Secrets Manager (documentsearch/voyage-api-key) |
| Network | Lambda -> NAT Gateway -> Voyage AI (HTTPS). NAT Gateway already exists. |
| Cost | $0.12 per million tokens |
| Alternative | SageMaker endpoint ($737/mo) if data-in-VPC required |
Vector Store¶
| Phase | Technology | Instance | Cost |
|---|---|---|---|
| Prototype | Aurora PostgreSQL 15 + pgvector | db.t3.medium | ~$157/mo (incl. RDS Proxy) |
| Production (option A) | OpenSearch Managed + k-NN | r6g.large.search x 2 | ~$3,900-5,000/mo |
| Production (option B) | OpenSearch Serverless | Auto-scaled OCUs | ~$700-2,000/mo |
Decision deferred. See adr/adr-vector-store-selection.md.
Lambda Functions (4)¶
| Lambda | Memory | Timeout | Trigger | Purpose |
|---|---|---|---|---|
| Embedding Lambda | 1024 MB | 900s | SQS (live + backfill) | Chunk, embed via Voyage AI, store vectors |
| Search Lambda | 512 MB | 25s | API Gateway | Embed query, parallel BM25 + k-NN, RRF |
| Backfill Lambda | 256 MB | 25s | API Gateway | Trigger backfill for a case |
| Job Processor Lambda | 256 MB | 900s | SQS | Per-batch infrastructure lifecycle |
All: Python 3.10+, private subnets, Lambda layer.
SQS Queues (4)¶
| Queue | Purpose | MaximumConcurrency | DLQ Retention |
|---|---|---|---|
| live_embedding_queue | DOCUMENT_PROCESSED events | 10 | 14 days |
| live_embedding_dlq | Dead letters (live) | N/A | 14 days |
| backfill_embedding_queue | BACKFILL_REQUESTED events | 3 | 14 days |
| backfill_embedding_dlq | Dead letters (backfill) | N/A | 14 days |
Visibility timeout = 900s. maxReceiveCount = 3.
SNS Subscriptions (2)¶
| Subscription | Filter Policy |
|---|---|
| Live ingest | eventType: ["DOCUMENT_PROCESSED", "IMPORT_CANCELLED"], caseId, batchId |
| Backfill | eventType: ["BACKFILL_REQUESTED"], caseId |
Added to existing shared SNS topic. Topic itself is unchanged.
API Gateway¶
| Endpoint | Auth | Timeout | Purpose |
|---|---|---|---|
| POST /search | IAM | 25s | Hybrid semantic + keyword search |
| POST /backfill | IAM | 25s | Trigger embedding for existing docs |
| GET /status/{case_id} | IAM | 10s | Embedding progress |
URL published to SSM Parameter Store (documentsearch_api_url).
MySQL Tables (per-case, 2 new tables)¶
Created in each nextpoint_case_{id} database by documentsearch's own migration:
| Table | Purpose | Size Estimate |
|---|---|---|
| search_chunks | Chunk text, metadata, model version | ~1 KB/chunk x chunks |
| search_embedding_status | Per-document checkpoint, status | ~100 bytes/document |
CloudWatch Alarms (5)¶
| Alarm | Threshold |
|---|---|
| Search p99 latency | > 2 seconds |
| Search error rate | > 5% |
| Embedding DLQ depth | > 0 for 15 minutes |
| Backfill DLQ depth | > 0 for 30 minutes |
| Voyage AI error rate | > 10 in 5 minutes |
CDK Stacks (3)¶
CommonResourcesStack -- deployed once per environment:
- Vector store infrastructure (OpenSearch domain or Aurora PostgreSQL)
- Secrets Manager reference for Voyage AI API key
- IAM roles for Lambda -> vector store access
- SNS topic subscription for DOCUMENT_PROCESSED events
SearchIngestStack -- embedding pipeline:
- Embedding Lambda + Lambda layer
- Live ingest SQS queue + DLQ
- Backfill SQS queue + DLQ (separate, lower concurrency)
- SNS filter policies
- MaximumConcurrency: 10 (live), MaximumConcurrency: 3 (backfill)
- Job Processor Lambda for per-batch infrastructure lifecycle
SearchApiStack -- search and backfill endpoints:
- API Gateway REST API
- Search/Backfill/Status Lambdas
- IAM authorizer (Rails -> Lambda)
- CloudWatch alarms
Cost Summary¶
Production corpus: 870M documents, 6.4B pages. Realistic backfill scope after filters (NGE-enabled ~10%, Discovery ~90%): ~78M documents.
One-Time Backfill Costs¶
| Phase | Documents | Voyage AI | Compute | Total |
|---|---|---|---|---|
| Prototype | 50K | $18 | ~$2 | ~$20 |
| Pilot (10 cases) | 500K | $180 | ~$15 | ~$195 |
| Phase 1 (100 cases) | 10M | $3,600 | ~$300 | ~$3,900 |
| Phase 2 (all NGE Discovery) | 78M | $28,000 | ~$2,300 | ~$30,300 |
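The Voyage AI column is consistent with an average of roughly 3,000 tokens per document at the $0.12/M-token rate quoted earlier; that per-document average is inferred from the table, not stated in it. A quick arithmetic check:

```python
docs = 78_000_000            # Phase 2 backfill scope
tokens_per_doc = 3_000       # inferred average; the real mix varies widely
price_per_million = 0.12     # Voyage AI $/M tokens, from the table above

cost = docs * tokens_per_doc / 1_000_000 * price_per_million
# 28,080 -- consistent with the ~$28,000 Phase 2 row (and the same
# per-document average reproduces the $3,600 Phase 1 and $180 pilot rows)
```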
Ongoing Monthly Costs (Post-Backfill, Steady State)¶
| Cost Category | Managed OpenSearch | Serverless OpenSearch |
|---|---|---|
| Infrastructure (vector store, SQS, CW) | $3,950-5,050 | $923-2,223 |
| New document embedding (~2-5M docs/mo) | $790-1,970 | $790-1,970 |
| Search queries (~30-100K/mo) | $15-50 | $15-50 |
| Total monthly | $4,755-7,070 | $1,728-4,243 |
New document embedding is the ongoing cost driver (~$790-1,970/mo for new imports flowing through the pipeline). Search query embedding cost is negligible (~$0.0000012 per query for Voyage AI).
Note: These costs are in addition to existing ES 7.4 cluster. If OpenSearch replaces ES 7.4 long-term (consolidation), the vector store cost is partially offset by the eliminated ES cluster.
See reference-implementations/semantic-search-infrastructure.md for full
breakdown by component, per-query cost analysis, and cost comparison.
Infrastructure That Already Exists (Zero Changes)¶
| Component | Role in Semantic Search | Modified? |
|---|---|---|
| SNS topic | Embedding pipeline subscribes to existing events | No |
| Extracted text on S3 | Input to chunking (documentextractor already ran) | No |
| Elasticsearch 7.4 | BM25 leg queries existing per-case aliases read-only | No |
| Aurora MySQL (per-case DBs) | New tables added via module migration; exhibits table read-only | No |
| PSM -> Firehose -> Athena | Captures DOCUMENT_EMBEDDED events automatically | No |
| NgeCaseTrackerJob | Polls embedding progress via existing Athena pipeline | No |
| VPC / subnets / NAT Gateway | CDK stacks deploy into existing network | No |
| Secrets Manager | Stores Voyage AI API key (standard pattern) | No |
| SSM Parameter Store | Publishes documentsearch API URL | No |
Integration with Rails¶
Ingest -- no Rails changes needed¶
The embedding pipeline subscribes to existing SNS events. Rails doesn't need to
know about it. PSM captures DOCUMENT_EMBEDDED events via the existing
Firehose -> Athena pipeline, so Rails can track embedding progress via
NgeCaseTrackerJob (existing polling mechanism).
Backfill -- one new API call¶
```ruby
# app/helpers/semantic_search_helper.rb (new)
def trigger_semantic_backfill(case_id)
  # IAM-authenticated call to documentsearch API Gateway
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/backfill",
    { case_id: case_id }.to_json
  )
  JSON.parse(response.body)
end

def semantic_search_status(case_id)
  response = iam_get(
    ssm_parameter("documentsearch_api_url") + "/status/#{case_id}"
  )
  JSON.parse(response.body)
end
```
Search -- new API call alongside existing search¶
# app/helpers/semantic_search_helper.rb (continued)
def semantic_search(case_id:, query:, filters: {}, limit: 20)
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/search",
    { query: query, case_id: case_id, filters: filters, limit: limit }.to_json
  )
  JSON.parse(response.body)
end
UI integration (post-prototype)¶
- "Semantic Search" toggle/tab alongside existing keyword search
- Results page reuses existing document list components
- Snippet highlighting is new (rendered from chunk text in search response)
- Side-by-side comparison mode (keyword vs semantic) for attorney evaluation
- "Semantic search is being prepared" state when backfill is in progress
- Embedding progress indicator in import status UI
Existing Data Handling (Complete Flow)¶
What Already Exists for Each Document¶
Every document that went through the NGE pipeline already has all the raw material needed for embedding. No re-extraction or re-processing is required.
| Data | Location | Created By | Available? |
|---|---|---|---|
| Raw file | S3 (case_{id}/documents/{uid}/(unknown)) | Upload | Yes |
| Extracted text | S3 (case_{id}/documents/{uid}/extracted.txt) | documentextractor | Yes |
| Metadata (author, date, subject) | MySQL (exhibits table in nextpoint_case_{id}) | documentloader | Yes |
| ES index entry (BM25-searchable) | ES alias ({env}_{case_id}_exhibits) | documentloader | Yes |
| Chunk text | Not yet created | documentsearch | No |
| Vector embeddings | Not yet created | documentsearch | No |
| Embedding status | Not yet created | documentsearch | No |
Backfill creates only the last three rows. The expensive work (extraction, metadata parsing, ES indexing) is already done.
What Backfill Does NOT Require¶
| Not Needed | Why |
|---|---|
| Re-extraction of text | Already on S3 from documentextractor |
| Re-indexing in Elasticsearch | Existing ES entries used as-is for BM25 leg |
| Rails downtime | Backfill is background SQS processing |
| Import re-processing | Documents don't re-enter the pipeline |
| Schema migration on exhibits table | New tables only; exhibits table is read-only |
| Upstream module changes | documentsearch subscribes independently |
Backfill Scale Estimates¶
Per-Case Estimates¶
| Case Size | Documents | Chunks (est.) | Voyage AI Cost | Time (MaxConc=3) | Time (MaxConc=10) |
|---|---|---|---|---|---|
| Small | 1,000 | 10,000 | ~$0.24 | ~10 min | ~3 min |
| Medium | 50,000 | 500,000 | ~$12 | ~8 hours | ~2.5 hours |
| Large | 500,000 | 5,000,000 | ~$120 | ~3.5 days | ~1 day |
Assumptions: 1 document per invocation, ~2 seconds per document (S3 fetch + chunk + Voyage API + vector store + MySQL). Voyage AI cost: ~10 chunks/doc x ~200 tokens/chunk x $0.12/M tokens, matching the chunk counts above (the corpus-wide average is closer to ~15 chunks/doc).
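The table's arithmetic can be reproduced with a small estimator. All constants are the document's planning assumptions, not measured values:

```python
# Rough per-case backfill estimator using the planning assumptions above.
def backfill_estimate(docs, chunks_per_doc=10, tokens_per_chunk=200,
                      usd_per_m_tokens=0.12, secs_per_doc=2.0, max_concurrency=3):
    """Return (voyage_cost_usd, wall_clock_minutes) for a backfill run."""
    tokens = docs * chunks_per_doc * tokens_per_chunk
    cost = tokens / 1_000_000 * usd_per_m_tokens
    minutes = docs * secs_per_doc / max_concurrency / 60
    return cost, minutes

# Small case from the table: 1,000 docs -> ~$0.24, ~11 min at MaxConcurrency=3
cost, minutes = backfill_estimate(1_000)
```

Raising `max_concurrency` shortens wall-clock time linearly but pushes harder against the Voyage API rate limit, which is why backfill runs at a lower concurrency than live ingest.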
Production Corpus (870M documents, 6.4B pages)¶
Production corpus analysis (end of 2025):
| Metric | Value |
|---|---|
| Total documents | 870M |
| Total pages | 6.4B |
| Avg pages per document | ~7.4 |
| Estimated chunks per document | ~15 |
Scope filters (cumulative):
| Filter | Docs After | Reduction | Rationale |
|---|---|---|---|
| Full corpus | 870M | -- | Starting point |
| NGE-enabled cases only | ~87M | ~90% out | Only ~10% of cases are NGE-enabled |
| Discovery suite only | ~78M | ~10% out | ~90% of docs are Discovery; Litigation is T2+ |
| Realistic backfill scope | ~78M | ~91% out (cumulative) | Active NGE Discovery cases |
Phased rollout:
| Phase | Scope | Docs | Cost | Timeline |
|---|---|---|---|---|
| Prototype | 1 known case | ~50K | ~$18 | 1 day |
| Pilot | 10 active NGE cases | ~500K | ~$180 | 1 day |
| Phase 1 | Top 100 active NGE cases | ~10M | ~$3,600 | 2-3 days |
| Phase 2 | All active NGE Discovery | ~78M | ~$28,000 | 1-2 weeks |
| On-demand | Remaining cases | Per-case | Per-case | On search |
Phase 1 ($3,600) proves value. Phase 2 ($28K) requires business case approval. On-demand backfill handles the long tail of 870M documents that will rarely be searched.
See reference-implementations/semantic-search-infrastructure.md for full
cost model with storage, compute, and rate limit analysis.
Step 1: Identify un-embedded documents¶
-- Per-case query (runs in backfill Lambda)
SELECT e.document_id, e.s3_text_path, e.author, e.date_sent, e.subject
FROM exhibits e
LEFT JOIN search_embedding_status s
ON s.document_id = e.document_id
AND s.embedding_model = %(target_model)s
WHERE s.document_id IS NULL
AND e.delete_at_gmt IS NULL
ORDER BY e.id
LIMIT 1000 OFFSET %(offset)s
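A sketch of how the backfill Lambda might page through this query until it is exhausted. The `run_query` callable is a hypothetical helper wrapping the SQL above, not the module's actual API:

```python
# Hedged sketch: drive the Step 1 query page by page until no rows remain.
# `run_query(limit, offset)` is an assumed helper that binds the SQL's
# %(offset)s parameter and returns a list of row objects.
def iter_unembedded(run_query, page_size=1000):
    offset = 0
    while True:
        rows = run_query(limit=page_size, offset=offset)
        if not rows:
            return  # no more un-embedded documents
        yield from rows
        offset += page_size
```

Note that LIMIT/OFFSET pagination degrades on very large cases because MySQL must skip past all prior rows; keyset pagination on `e.id` (WHERE e.id > last_seen ORDER BY e.id) is a common alternative if backfill queries become slow.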
Step 2: Publish backfill events (batched)¶
# handlers/backfill_index.py
def lambda_handler(event, _context):
    request = BackfillRequest.validate(event["body"])
    case_id = request.case_id
    with reader_session(npcase_id=str(case_id)) as session:
        total = _count_unembedded(session, case_id, config.EMBEDDING_MODEL)
        if total == 0:
            return api_response(200, {"message": "All documents already embedded"})
        documents = _get_unembedded_batch(session, case_id, config.EMBEDDING_MODEL)
    # One SNS message per document, published in slices of 100 so no single
    # iteration dominates the Lambda timeout
    for batch in chunked(documents, 100):
        for doc in batch:
            sns.publish(
                eventType="BACKFILL_REQUESTED",
                caseId=case_id,
                documentId=doc.document_id,
                eventDetail={
                    "s3_path": doc.s3_text_path,
                    "embedding_model": config.EMBEDDING_MODEL,
                },
            )
    return api_response(202, {
        "message": f"Backfill initiated for {total} documents",
        "case_id": case_id,
        "total_documents": total,
    })
Step 3: Embedding Lambda processes (same code path)¶
The embedding Lambda receives messages from either live_embedding_queue or
backfill_embedding_queue. The handler code is identical -- it doesn't
distinguish between live and backfill events.
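A sketch of that shared code path. The handler and helper names here are illustrative, not the module's actual identifiers:

```python
import json

def embed_document(case_id, document_id, s3_path, model):
    # Placeholder for the real chunk -> embed -> store pipeline.
    return {"case_id": case_id, "document_id": document_id, "model": model}

def lambda_handler(event, _context):
    # SQS batch; records may come from live_embedding_queue or
    # backfill_embedding_queue -- the handler never inspects which,
    # because both event types carry the same detail fields.
    processed = 0
    for record in event["Records"]:
        message = json.loads(record["body"])
        detail = message["eventDetail"]
        embed_document(
            case_id=message["caseId"],
            document_id=message["documentId"],
            s3_path=detail["s3_path"],
            model=detail["embedding_model"],
        )
        processed += 1
    return {"processed": processed}
```

Keeping one code path means backfill automatically inherits every fix and feature of live ingest; only the queue concurrency settings differ.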
Step 4: Track progress¶
-- Status query (runs in status Lambda)
SELECT
COUNT(*) as total_documents,
SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as embedded,
SUM(CASE WHEN status = 'in_progress' THEN 1 ELSE 0 END) as in_progress,
SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
embedding_model
FROM search_embedding_status
WHERE npcase_id = %(case_id)s
GROUP BY embedding_model
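The status Lambda can fold those rows into a per-model progress figure. A minimal sketch, with field names following the query above:

```python
def embedding_progress(rows):
    """Map embedding_model -> fraction of documents fully embedded,
    from the GROUP BY rows of the status query."""
    out = {}
    for row in rows:
        total = row["total_documents"]
        out[row["embedding_model"]] = (row["embedded"] / total) if total else 0.0
    return out
```

This is what the `/status/{case_id}` endpoint and the "Semantic search is being prepared" UI state would consume.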
Step 5: Handle edge cases¶
| Edge Case | Handling |
|---|---|
| Document re-imported after backfill | Live DOCUMENT_PROCESSED event triggers new embedding. Checkpoint PK prevents duplicate. Old chunks/vectors overwritten. |
| Backfill + live ingest concurrent | Both write to checkpoint table. Composite PK (npcase_id, document_id) means first writer wins. Second gets RecoverableException, retries, sees checkpoint = COMPLETE, skips. |
| Case deleted during backfill | IMPORT_CANCELLED event stops in-progress work. Remaining SQS messages processed and skipped (SilentSuccessException). Case deletion also deletes the vector index and per-case DB. |
| Voyage AI unavailable | RecoverableException -> SQS exponential backoff (120s -> 900s). After max retries, DLQ. Backfill can be re-triggered after API recovery. |
| Model upgrade mid-backfill | embedding_model column tracks per-document. Re-run backfill with new model -- filter picks up documents without new model's embeddings. |
Non-Functional Requirements¶
NFRs that must be defined before production deployment (Phase 1). Identified via gap analysis against the documentsearch architecture.
Performance¶
| Requirement | Target | Measurement |
|---|---|---|
| Search latency (p99) | < 2 seconds | CloudWatch alarm on API Gateway |
| Search latency (p50) | < 500ms | CloudWatch metric |
| Query embedding latency | < 100ms | Voyage AI response time |
| Embedding throughput (live ingest) | >= 2,560 docs/min | SQS age of oldest message |
| Backfill throughput | >= 2,560 docs/min (standard API) | Backfill progress tracking |
| Parallel search legs | BM25 and vector search run in parallel, not sequential | Implementation requirement |
Availability¶
| Requirement | Target | Notes |
|---|---|---|
| Search API uptime | 99.9% (8.76 hours downtime/year) | Match existing ES availability |
| Embedding pipeline uptime | 99.5% | Slight degradation acceptable; new docs queue in SQS |
| Degraded mode | If vector store is down, return BM25-only results | Graceful degradation, not full outage |
| Failover | OpenSearch multi-AZ (production) | Single-AZ acceptable for prototype |
Degraded mode is critical: if the vector store is unavailable, search should fall back to keyword-only results (BM25 leg) rather than failing entirely. The attorney gets results — just not semantically enhanced results.
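A sketch of that fallback in the search path. The three callables stand in for the module's real BM25 leg, vector leg, and RRF fusion:

```python
def hybrid_search(query, bm25_search, vector_search, fuse):
    keyword_hits = bm25_search(query)  # existing ES leg
    try:
        vector_hits = vector_search(query)
    except Exception:
        # Vector store unreachable: degrade to keyword-only and flag the
        # mode so the UI can tell the attorney results are not
        # semantically enhanced.
        return {"results": keyword_hits, "mode": "bm25_only_degraded"}
    return {"results": fuse(keyword_hits, vector_hits), "mode": "hybrid"}
```

Surfacing the degraded mode in the response (rather than silently returning fewer results) also matters for defensibility: the audit log records which mode produced a given result set.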
Data Retention and Lifecycle¶
| Data | Retention | Deletion Trigger |
|---|---|---|
| Vector embeddings (OpenSearch) | Lifetime of case | Case deletion -> delete vector index |
| Chunk text (MySQL search_chunks) | Lifetime of case | Case deletion -> table dropped with per-case DB |
| Embedding status (MySQL search_embedding_status) | Lifetime of case | Case deletion -> table dropped |
| Search audit logs (CloudWatch) | 90 days (configurable) | CloudWatch log group retention policy |
| Search audit logs (S3 archive) | 7 years (compliance) | S3 lifecycle policy |
When a case is deleted:
1. Delete the vector index (case_{id}_vectors) from OpenSearch
2. Per-case MySQL database is dropped (existing pattern) — search_chunks
and search_embedding_status tables are deleted with it
3. Search audit logs in CloudWatch are retained per retention policy
(they don't contain document content, only query metadata and result IDs)
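The three steps can be sketched as one teardown routine. Both callables are illustrative stand-ins for the VectorStore abstraction and the existing per-case DB drop:

```python
def on_case_deleted(case_id, vector_store, drop_case_db):
    # Step 1: remove the per-case vector index
    vector_store.delete_index(f"case_{case_id}_vectors")
    # Step 2: the existing per-case DB drop takes search_chunks and
    # search_embedding_status with it
    drop_case_db(case_id)
    # Step 3: no action here -- CloudWatch audit logs age out under the
    # retention policy and contain no document content
```

Deleting the vector index before the DB means a crash between steps leaves orphaned status rows (harmless; they go with the DB drop on retry) rather than orphaned vectors (a tenant-isolation concern).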
Disaster Recovery¶
| Component | Backup Strategy | RPO | RTO |
|---|---|---|---|
| OpenSearch vectors | Automated snapshots to S3 (hourly) | 1 hour | 2-4 hours (restore from snapshot) |
| MySQL chunks/status | Aurora automated backups (existing) | 5 minutes (Aurora continuous) | < 1 hour |
| Voyage AI API key | Secrets Manager (replicated) | 0 | Minutes |
| CDK infrastructure | Infrastructure as code, redeployable | 0 | 30-60 minutes |
Worst case (total OpenSearch loss): Re-embed from extracted text on S3. Chunks in MySQL are the source of truth for text; vectors are derived and re-generable. Full re-embedding of Phase 1 (10M docs) takes 2-3 days. This is the backup of last resort — snapshots should prevent it.
Security and Encryption¶
| Layer | Requirement | Implementation |
|---|---|---|
| Encryption at rest (vectors) | AES-256 | OpenSearch domain encryption enabled |
| Encryption at rest (chunks) | AES-256 | Aurora encryption enabled (existing) |
| Encryption in transit | TLS 1.2+ | All API calls, Lambda -> OpenSearch, Lambda -> Aurora |
| Voyage AI data handling | SOC 2 compliant | Document text sent to Voyage API for embedding. Voyage does not store input data (verify in terms). |
| IAM authentication | Rails -> API Gateway | Same pattern as documentexchanger |
| Per-case isolation | Hard tenant boundary | Vector index-per-case, MySQL per-case database |
| Secret management | Voyage API key in Secrets Manager | Rotatable, auditable |
If compliance requires document text to never leave VPC: switch from Voyage
direct API to SageMaker endpoint (see patterns/asymmetric-embeddings.md).
Compliance¶
| Standard | Status | Gap |
|---|---|---|
| SOC 2 | Partial | Voyage AI is SOC 2 compliant. End-to-end audit trail (search audit logs) designed. Need formal review of new components. |
| HIPAA | Partial | Encryption at rest/transit covered. BAA with Voyage AI needed if cases contain PHI. SageMaker endpoint avoids external data transfer. |
| Data residency | Depends on deployment | Voyage direct API: data leaves VPC. SageMaker: data stays in VPC. Document the choice. |
Scalability¶
| Dimension | Current Design | Limit | Scaling Action |
|---|---|---|---|
| Concurrent searches | Lambda concurrency | 1000 (AWS default) | Request limit increase |
| Cases with embeddings | Per-case OpenSearch index | 30K shards/domain | Tiered storage (hot/warm/cold) |
| Documents per case | Single vector index | 25M vectors per index | Sufficient for largest cases |
| Embedding throughput | SQS MaxConcurrency | Voyage API rate limit | Enterprise tier or SageMaker |
Observability¶
| Component | Monitoring | Alerting |
|---|---|---|
| Search latency | CloudWatch metrics (p50, p99) | Alarm if p99 > 2s |
| Search errors | CloudWatch error rate | Alarm if > 5% |
| Embedding DLQ | Queue depth monitoring | Alarm if > 0 for 15 min |
| Backfill DLQ | Queue depth monitoring | Alarm if > 0 for 30 min |
| Voyage AI errors | Error count metric | Alarm if > 10 in 5 min |
| OpenSearch cluster health | Cluster status (green/yellow/red) | Alarm on yellow or red |
| Embedding pipeline lag | SQS age of oldest message | Alarm if > 30 min |
| Retrieval quality | Not yet defined | Gap — define quality metrics post-prototype |
Retrieval Quality (Post-Prototype)¶
Retrieval quality monitoring is a gap that should be addressed after the prototype validates the baseline. Proposed approach:
- Golden query set: 20-50 queries with known-good results on a reference case
- Automated regression: Run golden queries after any change to chunking, embedding model, search parameters, or RRF weights
- Metrics: Precision@10, Recall@20, NDCG@20, Mean Reciprocal Rank
- Drift detection: Monthly automated run of golden queries, alert if metrics drop > 5% from baseline
This is not needed for the prototype — the prototype IS the quality validation. Define the golden query set during pilot (10 cases with attorney feedback).
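Two of the proposed metrics are simple enough to sketch directly: Precision@K and Mean Reciprocal Rank. Here `relevant` is the attorney-vetted known-good set for a golden query, and `retrieved` is the ranked result list under test:

```python
def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k results that are in the known-good set."""
    top = retrieved[:k]
    return sum(1 for doc in top if doc in relevant) / k

def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per golden query.
    Scores 1/rank of the first relevant hit, averaged across queries."""
    total = 0.0
    for retrieved, relevant in runs:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)
```

NDCG@20 and Recall@20 follow the same shape; running all four over the golden set after any chunking, model, or RRF-weight change gives the regression signal described above.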
Determinism and Legal Defensibility¶
Are search results deterministic?¶
Semantic search introduces one new source of non-determinism compared to keyword search: the HNSW approximate nearest neighbor algorithm. Breaking the pipeline down component by component:
| Component | Deterministic? | Details |
|---|---|---|
| Embedding generation (Voyage AI) | Yes | Same text produces the same vector every time. No temperature, no sampling. |
| BM25 search (Elasticsearch) | Mostly | Same query against a static index returns same scores. Minor variance from per-shard IDF calculation in multi-shard indices. Fixable with search_type=dfs_query_then_fetch. |
| HNSW vector search | No (approximate) | Returns approximately nearest neighbors, not guaranteed exact. See below. |
| RRF fusion | Yes | Pure math on ranked lists. Same inputs produce same output. |
| LLM reranking (optional, not in T1) | No | LLMs have sampling variance. Not used in T1 architecture. |
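The RRF row is easy to verify: fusion is a deterministic pure function of the ranked lists. A sketch matching the configured k=60:

```python
def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).
    Same ranked inputs always yield the same fused ordering."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Ties break by insertion order, which is itself deterministic
    return sorted(scores, key=scores.get, reverse=True)
```

So any run-to-run variance in hybrid results comes from the legs feeding RRF (HNSW approximation, per-shard IDF), never from the fusion step itself.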
HNSW approximation behavior¶
HNSW does not scan every vector. It traverses a graph to find approximate
nearest neighbors. For a static index (no concurrent writes), queries ARE
deterministic: same query vector + same graph + same ef_search = same
traversal = same results.
When the index is being updated (live ingest or backfill), results can vary between queries because the graph structure changes.
The approximation itself means HNSW may miss some true nearest neighbors in
exchange for speed. The ef_search parameter controls this trade-off:
ef_search=100 → examines ~100 candidates → ~5ms, ~95% recall
ef_search=256 → examines ~256 candidates → ~15ms, ~99% recall
ef_search=512 → examines ~512 candidates → ~30ms, ~99.5% recall
At ef_search=256 (our default), the results that vary between runs are
documents at the relevance boundary -- documents that are barely relevant
either way. The top results are effectively stable.
Comparison to keyword search¶
Keyword search has similar non-determinism sources that are less visible:
| Source of variation | Keyword search | Semantic search |
|---|---|---|
| Index updates between queries | Yes | Yes |
| Shard-level scoring variance | Yes (ES IDF per shard) | Yes (HNSW graph per shard) |
| Approximate algorithm | No (exact match) | Yes (HNSW) |
| Relevance ranking stability | High (BM25 is stable) | High for top results, minor variance at boundary |
| Model changes | N/A | Yes (re-embedding changes all vectors) |
Defensibility strategy¶
For legal defensibility, the architecture provides four mechanisms:
1. Search audit logging
Every search call is logged with full reproducibility data:
{
  "search_id": "uuid",
  "query": "communications about the safety defect",
  "query_vector": [0.22, -0.40, ...],   # the actual vector, not recomputed
  "case_id": 123,
  "filters": {"custodians": [...], "date_range": {...}},
  "timestamp": "2026-03-31T10:00:00Z",
  "embedding_model": "voyage-law-2",
  "ef_search": 256,
  "result_count": 142,
  "result_document_ids": ["doc-1", "doc-2", ...],
  "result_scores": [0.0323, 0.0318, ...],
  "index_version": "case_123_vectors_v2",
  "search_mode": "hybrid",
  "timings": {"query_embedding_ms": 45, "vector_search_ms": 80, ...}
}
The attorney can declare: "I ran this query at this time and received these specific 142 documents. The search log is attached as Exhibit A."
2. High ef_search default
ef_search=256 provides ~99% recall. The 1% of true nearest neighbors that
might be missed are at the relevance boundary -- documents with cosine
similarity barely above the threshold. Top results are stable across runs.
3. Point-in-time result snapshots
The legal workflow is: search -> tag results -> review tagged set. Once documents are tagged into a review folder, the review set is frozen. This is the same pattern attorneys already use with keyword search -- they don't re-run keyword searches mid-review and expect identical results while the index is being updated.
4. Exact mode for reproducibility-critical searches
For searches that must be bit-identical across runs (e.g., for a motion declaration or a search methodology affidavit), the API supports exact mode:
POST /search
{
  "query": "communications about the safety defect",
  "case_id": 123,
  "mode": "hybrid",
  "exact": true,
  "index_snapshot": "v2"
}
Exact mode:
- Runs brute-force k-NN (not HNSW) -- scans every vector, no approximation
- Pins to a specific index snapshot version
- Uses dfs_query_then_fetch for BM25 (global IDF, not per-shard)
- Returns bit-identical results every time against the same snapshot
- Slower (~500ms vs ~20ms) -- use for final declarations, not exploration
What to communicate to attorneys¶
Semantic search results are:
- Stable -- the top results for a given query are consistent across runs against a static index. Minor variations occur only at the relevance boundary.
- Reproducible -- every search is logged with the query vector, timestamp, full result set, and index version. The log is the definitive record.
- Not bit-identical during active indexing -- same as keyword search during active imports. Results stabilize once indexing is complete.
- Defensible -- result snapshots are preserved in audit logs. Exact mode available for search methodology declarations. The search methodology (hybrid BM25 + vector, RRF fusion, specific model and parameters) is fully documentable.
Search methodology documentation template¶
For declarations or search methodology letters, the system can generate:
Search Methodology Statement
Search Technology: Hybrid semantic + keyword search (documentsearch v1.0)
Embedding Model: Voyage AI voyage-law-2 (1024 dimensions, legal-optimized)
Keyword Engine: Elasticsearch 7.4 (BM25 scoring)
Fusion Method: Reciprocal Rank Fusion (k=60)
Vector Search: HNSW with ef_search=256 (~99% recall)
Search Parameters:
  Query: [natural language query text]
  Filters: [custodians, date range, batch IDs]
  Mode: hybrid (semantic + keyword)
  Results returned: [N] documents
Reproducibility:
  Search ID: [uuid]
  Executed: [timestamp]
  Index version: [version identifier]
  Full result set preserved in search audit log.
Divergences from Standard NGE Patterns¶
| Aspect | documentsearch | Standard NGE | Why |
|---|---|---|---|
| External API dependency | Voyage AI embedding API | No external APIs | Embedding generation requires specialized model not available in AWS |
| Dual ingest queues | Live + backfill at different concurrency | Single queue per event type | Backfill must not starve live ingest or Voyage rate limits |
| Sync API for search | API Gateway -> Lambda (25s) | Async SQS processing | Search requires synchronous response for UI |
| Vector store abstraction | shell/vectorstore/base.py interface | No storage abstraction | Allows prototype (pgvector) and production (OpenSearch) backends |
| No per-batch dynamic infra | Job Processor optional (may use static queues) | Per-batch Lambda/SQS creation | Embedding is lighter-weight than document loading; static queues may suffice |
| Backfill pipeline | Separate trigger and queue for existing data | No backfill concept | Must handle documents that pre-date the module |
Configuration¶
| Setting | Default | Notes |
|---|---|---|
| Embedding model | voyage-law-2 | Configurable for model upgrades |
| Embedding dimensions | 1024 | Must match model output |
| Chunk target size | 512 tokens | Voyage AI sweet spot |
| Chunk overlap | 50 tokens | Cross-boundary context preservation |
| RRF k constant | 60 | Standard from RRF paper |
| Live MaxConcurrency | 10 | Embedding Lambda SQS concurrency |
| Backfill MaxConcurrency | 3 | Lower priority than live |
| HNSW ef_search | 256 | ~99% recall. Higher = more accurate, slower. |
| Exact mode default | false | Brute-force k-NN for reproducibility. ~500ms vs ~20ms. |
| Search top-K per leg | 100 | Candidates from each search leg before RRF |
| Search Lambda timeout | 25s | API Gateway integration limit |
| Embedding Lambda timeout | 900s | Max Lambda timeout |
| Embedding Lambda memory | 1024MB | Sufficient for vector operations |
| Search Lambda memory | 512MB | Lighter than embedding |
Prototype vs Production¶
| Concern | Prototype | Production |
|---|---|---|
| Vector store | pgvector on single Aurora PostgreSQL | OpenSearch k-NN or OpenSearch Serverless |
| Multi-tenancy | Single hardcoded case | Per-case index lifecycle |
| Chunking | Paragraph-level with metadata prepend | Domain-specific (email/document/spreadsheet) |
| Rate limiting | SQS MaxConcurrency only | SQS MaxConcurrency + Voyage API rate tracking |
| Frontend | Standalone React app (Claude Code) | Rails UI integration (ERB + React) |
| Embedding versioning | None | Model version tracked, re-embedding pipeline |
| Cost tracking | None | Token counting per Voyage call, per-case allocation |
| Monitoring | CloudWatch logs only | Latency alarms, retrieval quality metrics |
| Backfill | Manual script | API + on-first-search auto-trigger |
| Backfill scope | 1 case (~50K docs, ~$18) | Phase 1: 10M docs (~$3,600), Phase 2: 78M docs (~$28K) |
Future Evaluation: PageIndex (T2)¶
PageIndex (VectifyAI, open-source) is a vectorless, reasoning-based retrieval framework that builds a hierarchical tree from a document and uses LLM reasoning to navigate it. Mafin 2.5 (built on PageIndex) achieved 98.7% on FinanceBench where typical vector RAG scored 65-80%.
PageIndex does NOT replace T1 (corpus-level search). It solves a different problem: deep structured reasoning WITHIN a single document (following cross-references, navigating sections/tables). It cannot search across a corpus.
How it works: Two-step process. (1) Index generation: analyzes document's natural structure (sections, subsections, headings) and builds a hierarchical tree where each node has a title, summary, answerable-questions, and page range. Original document stays intact. (2) Reasoning-driven tree search: LLM reads top-level nodes, reasons which branch likely contains the answer, follows it, repeats at each level, retrieves complete sections (not fragments). Retrieval is explainable — you see which tree nodes were traversed and which pages retrieved.
Core critique of vector RAG relevant to our chunking strategy: Semantic similarity ≠ relevance. A paragraph about "capital expenditure 2021" and "revenue projections 2023" embed near each other because both use financial language, but only one answers the question. Chunking destroys the author's organizational intelligence (sections, cross-references, logical flow). Our domain-aware chunking (email-aware, section-aware) mitigates this but doesn't eliminate it.
Potential T2 integration: Use vector search to find top 20 documents, then PageIndex for deep extraction from those 20. Relevant for depo prep, narrative investigation, settlement prep, motion to compel — use cases where the attorney needs precise facts from structured documents (contracts, regulatory filings) after search identifies them.
Additional resources: PageIndex is open-source (github.com/VectifyAI/PageIndex), has a cloud API (api.pageindex.ai/v1) with MCP protocol integration, and a chat interface (chat.pageindex.ai) for testing on custom documents.
Evaluate during Phase 4 (agent service development). Compare against
long-context LLMs on the same documents. See
reference-implementations/semantic-search-use-cases.md Appendix C for full
evaluation plan.
Key File Locations¶
| File | Purpose |
|---|---|
| lib/lambda/src/documentsearch/core/chunking/ | Domain-specific chunking logic |
| lib/lambda/src/documentsearch/core/search/hybrid.py | Reciprocal Rank Fusion |
| lib/lambda/src/documentsearch/shell/vectorstore/base.py | Vector store abstraction |
| lib/lambda/src/documentsearch/shell/embedding/voyage_client.py | Voyage AI integration |
| lib/lambda/src/documentsearch/handlers/embedding_index.py | Ingest handler (SQS) |
| lib/lambda/src/documentsearch/handlers/search_index.py | Search handler (API GW) |
| lib/lambda/src/documentsearch/handlers/backfill_index.py | Backfill trigger (API GW) |
| lib/lambda/src/documentsearch/handlers/job_processor_index.py | Batch lifecycle |
| lib/common-resources-stack.ts | Vector store + shared infra |
| lib/search-ingest-stack.ts | Embedding pipeline CDK |
| lib/search-api-stack.ts | Search API CDK |