Reference Implementation: documentsearch

Overview

documentsearch is the semantic search module for the Nextpoint eDiscovery platform. It generates vector embeddings during document ingestion and serves hybrid search (BM25 + vector) via API Gateway. Attorneys can run natural language queries like "internal discussions about Special Purpose Entities" and surface documents that keyword search would miss entirely.

EDRM Stages: 6 (Review) and 7 (Analysis) -- semantic search for document review and investigative analysis. Suite: Discovery (documents). Operates alongside existing keyword search, not replacing it.

Architecture

documentsearch/
├── lib/lambda/src/documentsearch/
│   ├── core/                              # Pure business logic -- NO AWS imports
│   │   ├── exceptions.py                  # Recoverable/Permanent/Silent hierarchy
│   │   ├── checkpoints.py                 # EmbeddingCheckpoints enum (7 steps)
│   │   ├── chunking/
│   │   │   ├── strategy.py                # ChunkingStrategy interface
│   │   │   ├── email_chunker.py           # Thread-aware: preserves reply chains + metadata
│   │   │   ├── document_chunker.py        # Section-aware: headings, paragraphs, tables
│   │   │   └── metadata.py                # Prepends sender/date/subject to chunks
│   │   ├── search/
│   │   │   ├── hybrid.py                  # Reciprocal Rank Fusion (BM25 + vector)
│   │   │   ├── reranker.py                # Optional LLM reranking of top-K
│   │   │   └── filters.py                 # Custodian, date range, batch filters
│   │   ├── models/
│   │   │   ├── db_models.py               # search_chunks, search_embedding_status tables
│   │   │   └── schemas.py                 # EmbeddingRequest, SearchRequest, SearchResult
│   │   └── utils/
│   │       ├── db_transaction.py          # @retry_on_db_conflict
│   │       └── context_data.py            # ContextVar-based request context
│   ├── shell/                              # Infrastructure -- AWS, external APIs
│   │   ├── db/
│   │   │   └── database.py                # writer_session/reader_session (per-case)
│   │   ├── vectorstore/
│   │   │   ├── base.py                    # VectorStore interface (abstract)
│   │   │   ├── opensearch_store.py        # OpenSearch k-NN implementation
│   │   │   └── pgvector_store.py          # pgvector implementation (prototype)
│   │   ├── keyword/
│   │   │   └── es_ops.py                  # BM25 query against existing ES index
│   │   ├── embedding/
│   │   │   └── voyage_client.py           # Voyage AI voyage-law-2 API client
│   │   └── utils/
│   │       ├── sns_ops.py                 # EventType enum, SNS publishing
│   │       ├── sqs_ops.py                 # SQS message handling
│   │       └── s3_ops.py                  # Fetch extracted text from S3
│   ├── handlers/
│   │   ├── embedding_index.py             # SQS handler -- ingest pipeline (chunk + embed)
│   │   ├── search_index.py                # API Gateway handler -- sync search queries
│   │   ├── backfill_index.py              # API Gateway handler -- trigger backfill for case
│   │   └── job_processor_index.py         # Job orchestrator -- per-batch infra lifecycle
│   └── config.py                           # Env vars, region config, Voyage API key ref
├── lib/lambda/tests/
│   ├── conftest.py                         # Auto-use fixtures: mock AWS, DB, Voyage API
│   ├── core/test_chunking.py
│   ├── core/test_hybrid_search.py
│   ├── core/test_rrf.py
│   └── shell/test_voyage_client.py
├── lib/                                    # CDK infrastructure (TypeScript)
│   ├── common-resources-stack.ts           # Vector store infra, shared resources
│   ├── search-ingest-stack.ts              # Embedding Lambda + SQS + SNS subscription
│   ├── search-api-stack.ts                 # API Gateway + search Lambda + backfill Lambda
│   └── search-stack.ts                     # Top-level stack composition
└── CLAUDE.md

Pattern Mapping

| Pattern | documentsearch Implementation | Standard NGE Pattern |
|---|---|---|
| Hexagonal boundaries | core/ contains chunking, RRF, filters, checkpoints; shell/ wraps Voyage AI, OpenSearch/pgvector, ES, S3 | core/ + shell/ separation |
| Exception hierarchy | RecoverableException, PermanentFailureException, SilentSuccessException | Standard 3-type hierarchy |
| SNS events | 5 EventType enum values (DOCUMENT_EMBEDDED through BACKFILL_COMPLETED) | SNS events as facts (past tense) |
| SQS handler | embedding_index.py -- batch processing with partial failure support | Standard 6-step handler |
| Checkpoint pipeline | 7-step state machine (EMBEDDING_STARTED -> EMBEDDING_COMPLETE) | Checkpoint-based composite PK |
| Database sessions | writer_session, reader_session (per-case) | Per-case MySQL databases |
| Retry/resilience | @retry_on_db_conflict, SQS exponential backoff, DLQ redrive | Standard retry patterns |
| Idempotency | Checkpoint-based (composite PK on search_embedding_status) | Standard idempotency |
| Multi-tenancy | Per-case vector index + per-case MySQL for chunks | Per-case isolation |
| API design | API Gateway + Lambda for sync search, IAM auth | Same as documentexchanger |
| Job processor | Per-batch infra lifecycle for live ingest | Standard orchestration |
| Vector store abstraction | shell/vectorstore/base.py interface, swappable backends | N/A (new pattern) |
| Dual entry point | Async ingest (SQS) + sync query (API Gateway) | Same as documentexchanger |
| Backfill pipeline | Separate queue, lower concurrency, same embedding Lambda | N/A (new pattern) |

Key Design Decisions

Where It Fits in the Pipeline

Live Ingest (new documents)

Rails -> ProcessorApi.import()
  |
documentextractor                           <-- entry point (UNCHANGED)
  | SNS: DOCUMENT_PROCESSED
  |---> documentloader      -> MySQL + ES    <-- existing (UNCHANGED)
  |---> documentuploader    -> PDF/Nutrient  <-- existing (UNCHANGED)
  |---> PSM                 -> Athena        <-- existing (UNCHANGED)
  +---> documentsearch (NEW) -> chunk, embed, store vectors

Subscribes to the existing DOCUMENT_PROCESSED SNS event via standard fan-out. Zero changes to upstream modules. The SNS topic already exists; we add one more SQS subscription with a filter policy.

Backfill (existing documents)

Existing documents already passed through the pipeline and will never trigger new DOCUMENT_PROCESSED events. Their extracted text is already on S3 (from documentextractor) and their metadata is already in MySQL (from documentloader). Only chunking and embedding are needed.

Rails (admin UI or CLI)
  | POST /backfill { "case_id": 123 }
  v
Backfill Lambda (orchestrator)
  | Query MySQL: SELECT document_id, s3_path FROM exhibits
  |              WHERE NOT EXISTS in search_embedding_status
  v
For each un-embedded document:
  | Publish to SNS: { eventType: "BACKFILL_REQUESTED", ... }
  v
backfill_embedding_queue (SQS, lower concurrency)
  | Same Embedding Lambda code path
  v
Chunk -> Embed -> Store (identical to live ingest)

Key properties:
- Incremental: only processes documents without existing embeddings. Safe to re-run; the checkpoint table prevents duplicate work.
- Resumable: if the backfill is interrupted (Lambda timeout, deployment, etc.), re-triggering picks up where it left off.
- Throttled: the backfill queue runs at lower MaximumConcurrency than live ingest to avoid starving live imports or hitting Voyage AI rate limits.
- Progress tracking: DOCUMENT_EMBEDDED events flow through PSM -> Athena. Rails polls progress via the existing NgeCaseTrackerJob.

Search Query Flow

The module sits alongside existing keyword search; it does not replace it:

                          +--> existing ES index (BM25 keyword search)
                          |         |
Rails search request -----+    Reciprocal Rank Fusion (in documentsearch)
                          |         ^
                          +--> vector store (k-NN semantic search)

Keyword search stays exactly where it is. The module adds a parallel vector search leg and fuses both result sets via RRF. Attorneys can still search Bates numbers and exact phrases (keyword path). Conceptual queries hit the vector path. RRF combines both into one ranked list.

Checkpoint Pipeline

class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1          # Extracted text retrieved from S3
    CHUNKS_CREATED = 2        # Document chunked with metadata
    EMBEDDINGS_GENERATED = 3  # Voyage AI called, vectors returned
    VECTORS_STORED = 4        # Vectors upserted to vector store
    CHUNKS_STORED = 5         # Chunk text + metadata stored in MySQL
    EMBEDDING_COMPLETE = 6

Why 7 steps: Voyage AI API calls are the most expensive step (cost and latency). If Lambda times out after generating embeddings but before storing them, we resume from EMBEDDINGS_GENERATED and skip the re-call. Without this checkpoint, every retry re-calls Voyage AI unnecessarily.

Composite PK: (npcase_id, document_id) -- one embedding pipeline per document. No batch_id in the PK because the same document should never be embedded twice regardless of which batch triggered it.
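The resume behavior can be sketched as a short loop over the enum: any step at or below the stored checkpoint is skipped, so a retry after a timeout at VECTORS_STORED never re-calls Voyage AI. A minimal sketch (the enum is restated so the snippet stands alone; `resume_pipeline` and the step wiring are illustrative, not the module's actual helpers):

```python
from enum import Enum

class EmbeddingCheckpoints(Enum):
    EMBEDDING_STARTED = 0
    TEXT_FETCHED = 1
    CHUNKS_CREATED = 2
    EMBEDDINGS_GENERATED = 3
    VECTORS_STORED = 4
    CHUNKS_STORED = 5
    EMBEDDING_COMPLETE = 6

def resume_pipeline(current: int, steps: dict) -> list[str]:
    """Run only the steps whose checkpoint value is above `current`.

    `current` is the checkpoint_id loaded from search_embedding_status;
    `steps` maps each checkpoint to the callable that performs it.
    """
    executed = []
    for checkpoint, step_fn in sorted(steps.items(), key=lambda kv: kv[0].value):
        if checkpoint.value <= current:
            continue  # already done on a previous attempt -- e.g. skip the Voyage call
        step_fn()     # in the real pipeline, the checkpoint row is updated after each step
        executed.append(checkpoint.name)
    return executed
```

A retry that previously died after EMBEDDINGS_GENERATED (checkpoint 3) would resume at VECTORS_STORED, paying nothing to the embedding API.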

Two Queues for Ingest

SNS topic (existing)
  |---> live_embedding_queue    (MaximumConcurrency: 10)  <-- DOCUMENT_PROCESSED
  +---> backfill_embedding_queue (MaximumConcurrency: 3)  <-- BACKFILL_REQUESTED
              |                         |
           Same Embedding Lambda (different concurrency limits)

Live ingest gets priority via higher concurrency. Backfill runs at lower concurrency so it doesn't compete for Voyage API rate limits or vector store write throughput. Both queues feed the same Lambda code -- the handler doesn't know or care whether the trigger was live or backfill.

Backfill Triggering Options

| Option | Mechanism | UX |
|---|---|---|
| Explicit per-case | Rails admin UI: "Enable Semantic Search" button per case. POST /backfill { case_id } | Full control, phased rollout |
| Bulk migration | Script iterates all nge_enabled? cases, triggers backfill for each. One-time job. | Initial rollout |
| On first search | If the case has no embeddings, trigger backfill automatically. Return keyword-only results with a "Semantic search is being prepared" message. | Zero admin overhead, best UX |

Recommended: Option 3 for GA, Option 1 for early access. On-first-search eliminates the enablement step entirely, but early access benefits from explicit control over which cases get the feature.
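The on-first-search behavior can be sketched as a small guard in the search path. All the helpers passed in here (`has_embeddings`, `trigger_backfill`, etc.) are assumptions for illustration, not the module's actual API:

```python
def search_with_auto_backfill(case_id, request, *, has_embeddings,
                              trigger_backfill, keyword_search, hybrid_search):
    """First search on a cold case triggers backfill and degrades gracefully
    to keyword-only results instead of failing."""
    if not has_embeddings(case_id):
        # Idempotent: the checkpoint table skips already-embedded documents,
        # so a double-trigger does no duplicate work.
        trigger_backfill(case_id)
        return {"results": keyword_search(request),
                "notice": "Semantic search is being prepared"}
    return {"results": hybrid_search(request)}
```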

Re-embedding (Model Upgrades)

The embedding_model column on every chunk row tracks which model generated each embedding. When upgrading (e.g., voyage-law-2 -> voyage-law-3):

-- Backfill filter becomes model-aware:
SELECT document_id, s3_path FROM exhibits
WHERE document_id NOT IN (
    SELECT document_id FROM search_embedding_status
    WHERE embedding_model = 'voyage-law-3'  -- target model
)

Same pipeline, same throttling, same progress tracking. Old vectors are overwritten in the vector store; old chunk rows are updated in MySQL.

Vector Store Abstraction

The shell/vectorstore/ layer abstracts the storage backend behind an interface:

# shell/vectorstore/base.py
class VectorStore(ABC):
    @abstractmethod
    def create_index(self, case_id: int, dimension: int) -> None: ...

    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float],
               k: int, filters: dict) -> list[VectorResult]: ...

    @abstractmethod
    def delete_index(self, case_id: int) -> None: ...

This allows swapping backends without touching core/:
- Prototype: pgvector on Aurora PostgreSQL (simple, low ops overhead)
- Production option A: OpenSearch k-NN (native hybrid, matches ES pattern)
- Production option B: OpenSearch Serverless (no cluster sizing, pay-per-query)
- Evaluation: Bedrock Knowledge Bases (managed RAG, but limited customization)
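A minimal sketch of a backend behind the interface (trimmed to two methods for brevity). `InMemoryVectorStore` is a hypothetical test double, not one of the real backends; the point is that callers never change when the backend does:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
import math

@dataclass
class VectorRecord:
    chunk_id: str
    vector: list[float]
    metadata: dict

class VectorStore(ABC):
    @abstractmethod
    def upsert_vectors(self, case_id: int, vectors: list[VectorRecord]) -> None: ...

    @abstractmethod
    def search(self, case_id: int, query_vector: list[float], k: int) -> list[str]: ...

class InMemoryVectorStore(VectorStore):
    def __init__(self) -> None:
        self._indices: dict[int, dict[str, VectorRecord]] = {}  # one "index" per case

    def upsert_vectors(self, case_id, vectors):
        self._indices.setdefault(case_id, {}).update({v.chunk_id: v for v in vectors})

    def search(self, case_id, query_vector, k):
        # Brute-force cosine similarity -- fine for a test double,
        # replaced by k-NN / pgvector queries in the real backends.
        def cosine(a, b):
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return sum(x * y for x, y in zip(a, b)) / norm
        records = self._indices.get(case_id, {}).values()
        ranked = sorted(records, key=lambda r: cosine(query_vector, r.vector), reverse=True)
        return [r.chunk_id for r in ranked[:k]]
```

Swapping pgvector for OpenSearch is then a matter of providing another subclass in shell/vectorstore/; core/ depends only on the interface.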

Vector Store Options

Full analysis with cost models, scale limits, and query latency numbers: see adr/adr-vector-store-selection.md.

Summary:

| Option | Native Hybrid | Replaces ES? | Monthly Cost | Recommendation |
|---|---|---|---|---|
| OpenSearch Managed | Yes | Yes | $3,900-5,000 | Production (high volume) |
| OpenSearch Serverless | Yes | Yes | $700-2,000 | Production (moderate volume) |
| Aurora pgvector | No | No | $3,000-3,500* | Prototype |
| Bedrock KB | No | No | Variable | Eliminated |
| MemoryDB Redis | No | No | $1,200-5,000 | Eliminated |

*pgvector cost includes maintaining existing ES cluster for keyword search.

Decision: pgvector for prototype, OpenSearch (Managed or Serverless) for production. The VectorStore abstraction in shell/vectorstore/ makes the switch a shell-layer change only. Bedrock KB and MemoryDB were eliminated (no hybrid search, no Voyage AI support, or single-shard limits).

Multi-Tenant Isolation

Vector Store: Index-per-case

case_{case_id}_vectors    # One vector index per case

Why index-per-case:
- Matches the existing ES per-case index pattern (Rails already uses this)
- Hard tenant boundary -- no filter-policy bug can leak data across cases
- Independent lifecycle -- delete case = delete index
- Index size stays manageable (even large cases are <1M documents)

MySQL: Per-case database

Chunk text and embedding metadata are stored in the per-case database (nextpoint_case_{id}), same as documentloader's exhibits and attachments.

class SearchChunk(Base):
    __tablename__ = "search_chunks"
    id = Column(Integer, primary_key=True, autoincrement=True)
    document_id = Column(String(255), nullable=False, index=True)
    chunk_id = Column(String(255), nullable=False, unique=True)
    chunk_index = Column(Integer, nullable=False)
    chunk_text = Column(Text, nullable=False)
    metadata_json = Column(JSON)        # custodian, date, subject, page range
    embedding_model = Column(String(100))  # "voyage-law-2"
    created_at = Column(DateTime, server_default=func.now())

class SearchEmbeddingStatus(Base):
    __tablename__ = "search_embedding_status"
    npcase_id = Column(Integer, primary_key=True)
    document_id = Column(String(255), primary_key=True)
    checkpoint_id = Column(Integer, nullable=False, default=0)
    embedding_model = Column(String(100))
    chunk_count = Column(Integer)
    status = Column(String(20))         # "complete", "in_progress", "failed"
    created_at = Column(DateTime, server_default=func.now())
    updated_at = Column(DateTime, onupdate=func.now())

Chunks are the source of truth for snippet rendering. Vectors in the vector store are derived and re-generable from chunk text.

Event Types

Events consumed

| Event | Source | SNS Filter | Purpose |
|---|---|---|---|
| DOCUMENT_PROCESSED | documentextractor | eventType, caseId, batchId | Triggers embedding for a new document |
| IMPORT_CANCELLED | documentloader | eventType, caseId, batchId | Cancels in-progress embedding work |

Events published

| Event | Purpose | Consumers |
|---|---|---|
| DOCUMENT_EMBEDDED | Embedding complete for one document | PSM (Athena tracking) |
| EMBEDDING_FAILED | Permanent embedding failure | PSM, Rails (error display) |
| EMBEDDING_JOB_FINISHED | All documents in batch embedded | Job Processor (teardown) |
| BACKFILL_STARTED | Backfill initiated for a case | PSM (tracking) |
| BACKFILL_COMPLETED | All existing docs in case embedded | PSM, Rails (UI update) |

SNS Message Structure

{
    "source": "documentsearch",
    "jobId": str,
    "caseId": int,
    "batchId": int,
    "documentId": str,
    "eventType": "DOCUMENT_EMBEDDED",
    "status": "SUCCESS",
    "timestamp": "2026-03-31T12:00:00Z",
    "eventDetail": {
        "chunk_count": 12,
        "embedding_model": "voyage-law-2",
        "processing_time_ms": 3400
    }
}

MessageAttributes on every publish: eventType, caseId, batchId.
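Promoting those three fields to MessageAttributes is what lets SQS subscription filter policies match without parsing the JSON body. A sketch of building the publish call (`build_publish_kwargs` is an illustrative helper; the `DataType`/`StringValue` shape is the standard SNS attribute format, and the dict is passed to `sns_client.publish(**kwargs)`):

```python
import json

def build_publish_kwargs(topic_arn: str, message: dict) -> dict:
    """Build kwargs for sns_client.publish(**kwargs), promoting the three
    routing fields from the message body into MessageAttributes."""
    return {
        "TopicArn": topic_arn,
        "Message": json.dumps(message),
        "MessageAttributes": {
            "eventType": {"DataType": "String", "StringValue": message["eventType"]},
            "caseId": {"DataType": "Number", "StringValue": str(message["caseId"])},
            "batchId": {"DataType": "Number", "StringValue": str(message["batchId"])},
        },
    }
```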

Chunking Strategy

Why domain-specific chunking matters

Naive 500-token splitting will:
- Split an email reply from the message it's replying to
- Separate "stay the course on the timeline" from the safety defect context
- Lose the sender/date/subject metadata that makes results useful for attorneys

Email chunking (core/chunking/email_chunker.py)

Input:  Email thread with 4 messages
Output: 4 chunks, one per message in thread

Each chunk:
  [METADATA] From: jeff.skilling@enron.com | Date: 2001-03-15 | Subject: RE: Raptor update
  [BODY] The full text of this single message in the thread
  [CONTEXT] Replying to: andy.fastow@enron.com on 2001-03-14 re: "Raptor update"
  • Preserves sender attribution (critical for custodian-scoped search)
  • Keeps reply context without duplicating entire thread in every chunk
  • Metadata prefix means the embedding captures WHO said WHAT and WHEN

Document chunking (core/chunking/document_chunker.py)

Input:  20-page contract
Output: N chunks at natural section boundaries

Strategy:
  1. Split at section headings (if detected)
  2. Within sections, split at paragraph boundaries
  3. Target chunk size: 512 tokens (voyage-law-2 sweet spot)
  4. Overlap: 50 tokens between adjacent chunks
  5. Prepend document metadata to first chunk
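The paragraph-packing step above can be sketched as follows, using whitespace word counts as a stand-in for real tokenization. `chunk_paragraphs` is illustrative, not the actual document_chunker API; note that whole paragraphs are packed, so a paragraph longer than the target becomes its own oversized chunk in this sketch:

```python
def chunk_paragraphs(text: str, target: int = 512, overlap: int = 50) -> list[str]:
    """Pack paragraphs into ~target-token chunks, carrying `overlap` tokens
    of trailing context into the next chunk."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.extend(para.split())         # never split inside a paragraph
        if len(current) >= target:
            chunks.append(" ".join(current))
            current = current[-overlap:]     # overlap carried into the next chunk
    if current:
        chunks.append(" ".join(current))
    return chunks
```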

Chunk metadata

Every chunk carries metadata in two forms:
- Prepended text (embedded with the content -- improves retrieval quality)
- Structured fields (stored in MySQL/vector store -- enables filtering)

@dataclass
class ChunkMetadata:
    document_id: str
    chunk_index: int
    custodian: str | None        # Email sender or document author
    date: datetime | None         # Email date or document date
    subject: str | None           # Email subject or document title
    page_range: tuple[int, int] | None
    document_type: str            # "email", "document", "spreadsheet"

Embedding Pipeline

Voyage AI integration (shell/embedding/voyage_client.py)

class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128       # Voyage API limit
    MAX_TOKENS_PER_BATCH = 120_000

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """Batch embed document chunks. Auto-batches to stay within limits."""
        # Voyage API: POST https://api.voyageai.com/v1/embeddings
        # input_type="document" for corpus embeddings

    def embed_query(self, query: str) -> list[float]:
        """Embed a single search query."""
        # input_type="query" -- asymmetric embedding for retrieval

Asymmetric embeddings: Voyage AI uses different input_type for documents vs queries ("document" at ingest, "query" at search time). This is handled in shell/, not core/. See patterns/asymmetric-embeddings.md for full explanation, implementation, and AWS deployment options (direct API vs SageMaker endpoint).
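The auto-batching that embed_documents performs can be sketched as splitting the chunk list so each request respects both limits (MAX_BATCH_SIZE texts and MAX_TOKENS_PER_BATCH tokens). `batch_texts` and the precomputed token counts are illustrative:

```python
def batch_texts(texts: list[str], token_counts: list[int],
                max_batch: int = 128, max_tokens: int = 120_000) -> list[list[str]]:
    """Greedily pack texts into batches that satisfy both Voyage API limits."""
    batches, current, current_tokens = [], [], 0
    for text, tokens in zip(texts, token_counts):
        # Flush before adding if either limit would be exceeded.
        if current and (len(current) >= max_batch or current_tokens + tokens > max_tokens):
            batches.append(current)
            current, current_tokens = [], 0
        current.append(text)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches
```

With 300 chunks of ~1,000 tokens each, the token limit binds before the batch-size limit, yielding batches of 120, 120, and 60 texts.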

Rate limiting and concurrency

  • Voyage AI rate limits: Managed via SQS MaximumConcurrency on embedding Lambda
  • Live ingest: MaximumConcurrency: 10 (higher priority)
  • Backfill: MaximumConcurrency: 3 (background, don't starve live)
  • Tuning: Adjust based on Voyage API tier and case sizes

Handler flow

# handlers/embedding_index.py -- standard 6-step SQS handler

def lambda_handler(event, _context):
    batch_item_failures = []
    for record in event["Records"]:
        try:
            set_event(record)
            message = _parse_record(record)  # Double-encoded SNS->SQS
            set_npcase(str(message["caseId"]))

            if not get_batch_context():
                with writer_session() as session:
                    set_batch_context(_load_batch_context(session, message))

            SNS(message)
            _route_event(message)  # -> core process: fetch, chunk, embed, store

        except Exception as e:
            _handle_exception(e, record, batch_item_failures)
        finally:
            set_event(None)

    set_batch_context(None)
    return {"batchItemFailures": batch_item_failures}

Current Elasticsearch State (BM25 Leg)

Understanding the existing ES infrastructure is critical because the BM25 leg of hybrid search queries these indices read-only. No ES changes are needed.

BM25 Configuration

No custom BM25 tuning. Uses Elasticsearch 7.4 defaults:

| Parameter | Current Value | ES Default | What It Controls |
|---|---|---|---|
| k1 | 1.2 | 1.2 | Term-frequency saturation. Higher = repeated terms weighted more heavily. |
| b | 0.75 | 0.75 | Document-length normalization. 1.0 = full normalization, 0.0 = none. |
| search_type | dfs_query_then_fetch | query_then_fetch | Global IDF scoring across all shards (not per-shard). |

Observations and Tuning Opportunities

Length normalization (b=0.75) may need tuning for legal search.

Default b=0.75 penalizes long documents relative to short ones. Legal productions contain an extreme mix: 2-line emails, 5-page memos, 200-page contracts. With b=0.75:
- A keyword match in a 200-page contract scores lower than the same match in a 2-line email (normalized by document length)
- This is often wrong for legal search -- a contract mentioning "Special Purpose Entity" once in 200 pages IS relevant, and shouldn't be penalized relative to a short email

Post-prototype evaluation: Test b=0.5 or b=0.3 on a known matter and compare BM25 ranking quality. Lower b reduces the length penalty, giving long documents fairer ranking. This is a per-index setting change, not a code change.
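A sketch of what that settings change looks like. BM25 similarity parameters are static index settings in ES 7.x, so each physical index must be closed before the update and reopened after; the settings body below follows the standard similarity-module shape, while `bm25_settings_body` and the surrounding REST calls are illustrative:

```python
def bm25_settings_body(b: float, k1: float = 1.2) -> dict:
    """Settings body for tuning the default BM25 similarity on an index."""
    return {"index": {"similarity": {"default": {"type": "BM25", "k1": k1, "b": b}}}}

# Applied per physical index behind the per-case alias:
#   POST {index}/_close
#   PUT  {index}/_settings    <- bm25_settings_body(0.5)
#   POST {index}/_open
```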

dfs_query_then_fetch is already correct. Global IDF prevents scoring anomalies from uneven document distribution across shards. This is important for hybrid search determinism -- per-shard IDF would introduce variance.

Index Architecture

| Aspect | Current Configuration |
|---|---|
| Strategy | Shared physical indices with per-case filtered aliases (NOT index-per-case) |
| Alias naming | {environment}_{npcase_id}_{type} (e.g., production_12345_exhibits) |
| Physical index naming | {environment}_{type}_{identifier}_{sequential_number} |
| Max shard size | 30 GB (new physical index created when exceeded) |
| Join field | Parent-child for exhibit -> pages (exhibit_join) |
| Full-text field | search_text (uses nextpoint_analyzer) |
| Exhibit mapping | 290+ fields |
| Dynamic mapping | Strict (no auto-detected fields) |
Custom Analyzers

| Analyzer | Purpose |
|---|---|
| nextpoint_analyzer | Index-time: email parsing, edge n-gram, path hierarchies |
| nextpoint_search_analyzer | Search-time paired analyzer |
| edge_ngram_analyzer | Autocomplete support (min_gram: 1, max_gram: 6) |
| custom_path_tree | Folder path hierarchy processing |

How Hybrid Search Queries ES

The BM25 leg queries the existing per-case filtered alias:

# shell/keyword/es_ops.py
def keyword_search(case_id, query, filters, size=100):
    alias = f"{config.ENVIRONMENT}_{case_id}_exhibits"
    # Standard ES query against existing alias
    # Uses existing nextpoint_search_analyzer
    # Returns BM25-scored results with document_id, score
    # No mapping changes, no new fields, read-only consumer

Key point: The vector index (OpenSearch/pgvector) and the keyword index (existing ES 7.4) are completely separate systems. The hybrid search module reads from both and fuses results in application code via RRF. If OpenSearch replaces ES 7.4 in the future, both legs move to one system and native hybrid search becomes possible (eliminating the need for application-level RRF).

Search API

Endpoints

| Method | Path | Auth | Purpose |
|---|---|---|---|
| POST | /search | IAM | Hybrid semantic + keyword search |
| POST | /backfill | IAM | Trigger embedding backfill for a case |
| GET | /status/{case_id} | IAM | Embedding progress for a case |

Search request

{
    "query": "internal discussions about Special Purpose Entities",
    "case_id": 123,
    "filters": {
        "custodians": ["jeff.skilling@enron.com"],
        "date_range": {"start": "2001-01-01", "end": "2001-12-31"},
        "batch_ids": [456, 789]
    },
    "limit": 20,
    "offset": 0,
    "mode": "hybrid",
    "exact": false
}

| Parameter | Values | Default | Purpose |
|---|---|---|---|
| mode | "hybrid", "semantic", "keyword" | "hybrid" | Which search legs to run |
| exact | true, false | false | Brute-force k-NN + global IDF for bit-identical reproducibility. Slower (~500 ms). Use for search-methodology declarations. |

Search response

{
    "status": "success",
    "data": {
        "results": [
            {
                "document_id": "abc-123",
                "exhibit_id": 42,
                "score": 0.87,
                "bm25_rank": 15,
                "vector_rank": 2,
                "snippets": [
                    {
                        "text": "We need to discuss the Raptor structure before...",
                        "chunk_id": "abc-123-chunk-4",
                        "page": 2
                    }
                ],
                "metadata": {
                    "author": "jeff.skilling@enron.com",
                    "date": "2001-03-15",
                    "subject": "RE: Raptor update"
                }
            }
        ],
        "total": 142,
        "embedding_status": "complete",
        "search_id": "uuid",
        "search_audit": {
            "embedding_model": "voyage-law-2",
            "ef_search": 256,
            "index_version": "case_123_vectors_v2",
            "exact_mode": false,
            "search_mode": "hybrid"
        },
        "timings": {
            "query_embedding_ms": 45,
            "vector_search_ms": 80,
            "keyword_search_ms": 40,
            "fusion_ms": 5,
            "total_ms": 170
        }
    },
    "requestId": "uuid"
}

Search handler flow

# handlers/search_index.py -- API Gateway handler (<=25s)

def lambda_handler(event, _context):
    try:
        request = SearchRequest.validate(event["body"])
        set_npcase(str(request.case_id))

        # Step 1: Embed the query (Voyage AI, ~50ms)
        query_vector = voyage_client.embed_query(request.query)

        # Step 2: Run both legs (sequential in this sketch; parallelize if
        # latency budgets require it)
        vector_results = vector_store.search(
            case_id=request.case_id,
            query_vector=query_vector,
            k=100,
            filters=request.filters
        )
        bm25_results = es_ops.keyword_search(
            case_id=request.case_id,   # Resolves the existing per-case ES alias
            query=request.query,
            filters=request.filters,
            size=100
        )

        # Step 3: Reciprocal Rank Fusion (core/ -- pure logic)
        fused = hybrid_search.rrf(vector_results, bm25_results, k=60)

        # Step 4: Fetch snippets for top results
        top_results = fused[:request.limit]
        enriched = _enrich_with_snippets(top_results, request.case_id)

        return api_response(200, enriched)

    except ValidationError as e:
        return api_response(400, error=e.message)
    except Exception as e:
        log_message("search_error", error=str(e))
        return api_response(500, error="Search failed")

Reciprocal Rank Fusion (core/search/hybrid.py)

def rrf(vector_results: list, bm25_results: list, k: int = 60) -> list:
    """
    Combine two ranked lists using Reciprocal Rank Fusion.
    score(d) = sum(1 / (k + rank_i(d))) for each result set i
    k=60 is the standard constant from the original RRF paper.
    """
    scores = defaultdict(float)
    metadata = {}

    for rank, result in enumerate(vector_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        metadata[doc_id] = {**result, "vector_rank": rank + 1}

    for rank, result in enumerate(bm25_results):
        doc_id = result["document_id"]
        scores[doc_id] += 1.0 / (k + rank + 1)
        if doc_id in metadata:
            metadata[doc_id]["bm25_rank"] = rank + 1
        else:
            metadata[doc_id] = {**result, "bm25_rank": rank + 1}

    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{"document_id": doc_id, "rrf_score": score, **metadata[doc_id]}
            for doc_id, score in ranked]

Pure business logic -- lives in core/search/, no AWS imports, fully unit-testable.
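To make the fusion behavior concrete, here is a self-contained toy run of the same formula (restated so it executes on its own): a document ranked third in both legs outscores a document ranked first in only one leg, which is exactly why RRF rewards cross-leg consensus.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    """RRF score for one document given its rank in each result set."""
    return sum(1.0 / (k + r) for r in ranks)

both_legs = rrf_score([3, 3])  # rank 3 in the vector AND keyword legs: 2/63
one_leg = rrf_score([1])       # rank 1 in the vector leg only: 1/61
assert both_legs > one_leg     # consensus across legs beats a single top rank
```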

New Infrastructure Required

Complete Inventory

Everything the documentsearch module adds. Existing infrastructure is NOT modified -- see "Current Elasticsearch State" section above.

Voyage AI API Access

| Item | Details |
|---|---|
| What | API key for https://api.voyageai.com/v1/embeddings |
| Model | voyage-law-2 (1024 dimensions, legal-optimized) |
| Storage | Secrets Manager (documentsearch/voyage-api-key) |
| Network | Lambda -> NAT Gateway -> Voyage AI (HTTPS). NAT Gateway already exists. |
| Cost | $0.12 per million tokens |
| Alternative | SageMaker endpoint ($737/mo) if data-in-VPC required |

Vector Store

| Phase | Technology | Instance | Cost |
|---|---|---|---|
| Prototype | Aurora PostgreSQL 15 + pgvector | db.t3.medium | ~$157/mo (incl. RDS Proxy) |
| Production (option A) | OpenSearch Managed + k-NN | r6g.large.search x 2 | ~$3,900-5,000/mo |
| Production (option B) | OpenSearch Serverless | Auto-scaled OCUs | ~$700-2,000/mo |

Decision deferred. See adr/adr-vector-store-selection.md.

Lambda Functions (4)

| Lambda | Memory | Timeout | Trigger | Purpose |
|---|---|---|---|---|
| Embedding Lambda | 1024 MB | 900s | SQS (live + backfill) | Chunk, embed via Voyage AI, store vectors |
| Search Lambda | 512 MB | 25s | API Gateway | Embed query, parallel BM25 + k-NN, RRF |
| Backfill Lambda | 256 MB | 25s | API Gateway | Trigger backfill for a case |
| Job Processor Lambda | 256 MB | 900s | SQS | Per-batch infrastructure lifecycle |

All: Python 3.10+, private subnets, Lambda layer.

SQS Queues (4)

| Queue | Purpose | MaximumConcurrency | Retention |
|---|---|---|---|
| live_embedding_queue | DOCUMENT_PROCESSED events | 10 | 14 days |
| live_embedding_dlq | Dead letters (live) | N/A | 14 days |
| backfill_embedding_queue | BACKFILL_REQUESTED events | 3 | 14 days |
| backfill_embedding_dlq | Dead letters (backfill) | N/A | 14 days |

Visibility timeout = 900s. maxReceiveCount = 3.

SNS Subscriptions (2)

| Subscription | Filter Policy |
|---|---|
| Live ingest | eventType: ["DOCUMENT_PROCESSED", "IMPORT_CANCELLED"], caseId, batchId |
| Backfill | eventType: ["BACKFILL_REQUESTED"], caseId |

Added to existing shared SNS topic. Topic itself is unchanged.

API Gateway

| Endpoint | Auth | Timeout | Purpose |
|---|---|---|---|
| POST /search | IAM | 25s | Hybrid semantic + keyword search |
| POST /backfill | IAM | 25s | Trigger embedding for existing docs |
| GET /status/{case_id} | IAM | 10s | Embedding progress |

URL published to SSM Parameter Store (documentsearch_api_url).

MySQL Tables (per-case, 2 new tables)

Created in each nextpoint_case_{id} database by documentsearch's own migration:

| Table | Purpose | Size Estimate |
|---|---|---|
| search_chunks | Chunk text, metadata, model version | ~1 KB/chunk x chunks |
| search_embedding_status | Per-document checkpoint, status | ~100 bytes/document |

CloudWatch Alarms (5)

| Alarm | Threshold |
|---|---|
| Search p99 latency | > 2 seconds |
| Search error rate | > 5% |
| Embedding DLQ depth | > 0 for 15 minutes |
| Backfill DLQ depth | > 0 for 30 minutes |
| Voyage AI error rate | > 10 errors in 5 minutes |

CDK Stacks (3)

CommonResourcesStack -- deployed once per environment:
- Vector store infrastructure (OpenSearch domain or Aurora PostgreSQL)
- Secrets Manager reference for the Voyage AI API key
- IAM roles for Lambda -> vector store access
- SNS topic subscription for DOCUMENT_PROCESSED events

SearchIngestStack -- embedding pipeline:
- Embedding Lambda + Lambda layer
- Live ingest SQS queue + DLQ
- Backfill SQS queue + DLQ (separate, lower concurrency)
- SNS filter policies
- MaximumConcurrency: 10 (live), MaximumConcurrency: 3 (backfill)
- Job Processor Lambda for per-batch infrastructure lifecycle

SearchApiStack -- search and backfill endpoints:
- API Gateway REST API
- Search/Backfill/Status Lambdas
- IAM authorizer (Rails -> Lambda)
- CloudWatch alarms

Cost Summary

Production corpus: 870M documents, 6.4B pages. Realistic backfill scope after filters (NGE-enabled ~10%, Discovery ~90%): ~78M documents.

One-Time Backfill Costs

| Phase | Documents | Voyage AI | Compute | Total |
|---|---|---|---|---|
| Prototype | 50K | $18 | ~$2 | ~$20 |
| Pilot (10 cases) | 500K | $180 | ~$15 | ~$195 |
| Phase 1 (100 cases) | 10M | $3,600 | ~$300 | ~$3,900 |
| Phase 2 (all NGE Discovery) | 78M | $28,000 | ~$2,300 | ~$30,300 |

Ongoing Monthly Costs (Post-Backfill, Steady State)

| Cost Category | Managed OpenSearch | Serverless OpenSearch |
|---|---|---|
| Infrastructure (vector store, SQS, CW) | $3,950-5,050 | $923-2,223 |
| New document embedding (~2-5M docs/mo) | $790-1,970 | $790-1,970 |
| Search queries (~30-100K/mo) | $15-50 | $15-50 |
| Total monthly | $4,755-7,070 | $1,728-4,243 |

New document embedding is the ongoing cost driver (~$790-1,970/mo for new imports flowing through the pipeline). Search query embedding cost is negligible (~$0.0000012 per query for Voyage AI).

Note: These costs are in addition to existing ES 7.4 cluster. If OpenSearch replaces ES 7.4 long-term (consolidation), the vector store cost is partially offset by the eliminated ES cluster.

See reference-implementations/semantic-search-infrastructure.md for full breakdown by component, per-query cost analysis, and cost comparison.

Infrastructure That Already Exists (Zero Changes)

| Component | Role in Semantic Search | Modified? |
|---|---|---|
| SNS topic | Embedding pipeline subscribes to existing events | No |
| Extracted text on S3 | Input to chunking (documentextractor already ran) | No |
| Elasticsearch 7.4 | BM25 leg queries existing per-case aliases read-only | No |
| Aurora MySQL (per-case DBs) | New tables added via module migration; exhibits table read-only | No |
| PSM -> Firehose -> Athena | Captures DOCUMENT_EMBEDDED events automatically | No |
| NgeCaseTrackerJob | Polls embedding progress via existing Athena pipeline | No |
| VPC / subnets / NAT Gateway | CDK stacks deploy into existing network | No |
| Secrets Manager | Stores Voyage AI API key (standard pattern) | No |
| SSM Parameter Store | Publishes documentsearch API URL | No |

Integration with Rails

Ingest -- no Rails changes needed

The embedding pipeline subscribes to existing SNS events. Rails doesn't need to know about it. PSM captures DOCUMENT_EMBEDDED events via the existing Firehose -> Athena pipeline, so Rails can track embedding progress via NgeCaseTrackerJob (existing polling mechanism).

Backfill -- one new API call

# app/helpers/semantic_search_helper.rb (new)

def trigger_semantic_backfill(case_id)
  # IAM-authenticated call to documentsearch API Gateway
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/backfill",
    { case_id: case_id }.to_json
  )
  JSON.parse(response.body)
end

def semantic_search_status(case_id)
  response = iam_get(
    ssm_parameter("documentsearch_api_url") + "/status/#{case_id}"
  )
  JSON.parse(response.body)
end
# app/helpers/semantic_search_helper.rb (continued)

def semantic_search(case_id:, query:, filters: {}, limit: 20)
  response = iam_post(
    ssm_parameter("documentsearch_api_url") + "/search",
    { query: query, case_id: case_id, filters: filters, limit: limit }.to_json
  )
  JSON.parse(response.body)
end

UI integration (post-prototype)

  • "Semantic Search" toggle/tab alongside existing keyword search
  • Results page reuses existing document list components
  • Snippet highlighting is new (rendered from chunk text in search response)
  • Side-by-side comparison mode (keyword vs semantic) for attorney evaluation
  • "Semantic search is being prepared" state when backfill is in progress
  • Embedding progress indicator in import status UI

Existing Data Handling (Complete Flow)

What Already Exists for Each Document

Every document that went through the NGE pipeline already has all the raw material needed for embedding. No re-extraction or re-processing is required.

| Data | Location | Created By | Available? |
|---|---|---|---|
| Raw file | S3 (case_{id}/documents/{uid}/(unknown)) | Upload | Yes |
| Extracted text | S3 (case_{id}/documents/{uid}/extracted.txt) | documentextractor | Yes |
| Metadata (author, date, subject) | MySQL (exhibits table in nextpoint_case_{id}) | documentloader | Yes |
| ES index entry (BM25-searchable) | ES alias ({env}_{case_id}_exhibits) | documentloader | Yes |
| Chunk text | Not yet created | documentsearch | No |
| Vector embeddings | Not yet created | documentsearch | No |
| Embedding status | Not yet created | documentsearch | No |

Backfill creates only the last three rows. The expensive work (extraction, metadata parsing, ES indexing) is already done.

What Backfill Does NOT Require

| Not Needed | Why |
|---|---|
| Re-extraction of text | Already on S3 from documentextractor |
| Re-indexing in Elasticsearch | Existing ES entries used as-is for BM25 leg |
| Rails downtime | Backfill is background SQS processing |
| Import re-processing | Documents don't re-enter the pipeline |
| Schema migration on exhibits table | New tables only; exhibits table is read-only |
| Upstream module changes | documentsearch subscribes independently |

Backfill Scale Estimates

Per-Case Estimates

| Case Size | Documents | Chunks (est.) | Voyage AI Cost | Time (MaxConc=3) | Time (MaxConc=10) |
|---|---|---|---|---|---|
| Small | 1,000 | 10,000 | ~$0.24 | ~10 min | ~3 min |
| Medium | 50,000 | 500,000 | ~$12 | ~8 hours | ~2.5 hours |
| Large | 500,000 | 5,000,000 | ~$120 | ~3.5 days | ~1 day |

Assumptions: 1 document per invocation, ~2 seconds per document (S3 fetch + chunk + Voyage API + vector store + MySQL). Voyage AI cost: ~10 chunks/doc x ~200 tokens/chunk x $0.12/M tokens for these per-case estimates (the production-corpus average is ~15 chunks/doc).
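These estimates reduce to simple arithmetic. A sketch that reproduces the Voyage AI cost and backfill-time figures in the table above; the rates and per-document timing are the stated assumptions, not measured values:

```python
TOKENS_PER_CHUNK = 200
PRICE_PER_M_TOKENS = 0.12  # USD per million tokens, assumed Voyage AI rate

def voyage_cost(num_chunks, tokens_per_chunk=TOKENS_PER_CHUNK):
    """Embedding cost for a given number of chunks."""
    return num_chunks * tokens_per_chunk * PRICE_PER_M_TOKENS / 1_000_000

def backfill_hours(num_docs, seconds_per_doc=2, max_concurrency=3):
    """Wall-clock hours assuming 1 doc per invocation at fixed concurrency."""
    return num_docs * seconds_per_doc / max_concurrency / 3600

print(round(voyage_cost(10_000), 2))                       # small case: 0.24
print(round(voyage_cost(5_000_000)))                       # large case: 120
print(round(backfill_hours(500_000, max_concurrency=10)))  # ~28 hours ≈ 1 day
```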

Production Corpus (870M documents, 6.4B pages)

Production corpus analysis (end of 2025):

| Metric | Value |
|---|---|
| Total documents | 870M |
| Total pages | 6.4B |
| Avg pages per document | ~7.4 |
| Estimated chunks per document | ~15 |

Scope filters (cumulative):

| Filter | Docs After | Reduction | Rationale |
|---|---|---|---|
| Full corpus | 870M | -- | Starting point |
| NGE-enabled cases only | ~87M | ~90% out | Only ~10% of cases are NGE-enabled |
| Discovery suite only | ~78M | ~10% out | ~90% of docs are Discovery; Litigation is T2+ |
| Realistic backfill scope | ~78M | ~91% | Active NGE Discovery cases |

Phased rollout:

| Phase | Scope | Docs | Cost | Timeline |
|---|---|---|---|---|
| Prototype | 1 known case | ~50K | ~$18 | 1 day |
| Pilot | 10 active NGE cases | ~500K | ~$180 | 1 day |
| Phase 1 | Top 100 active NGE cases | ~10M | ~$3,600 | 2-3 days |
| Phase 2 | All active NGE Discovery | ~78M | ~$28,000 | 1-2 weeks |
| On-demand | Remaining cases | Per-case | Per-case | On search |

Phase 1 ($3,600) proves value. Phase 2 ($28K) requires business case approval. On-demand backfill handles the long tail of 870M documents that will rarely be searched.

See reference-implementations/semantic-search-infrastructure.md for full cost model with storage, compute, and rate limit analysis.

Step 1: Identify un-embedded documents

-- Per-case query (runs in backfill Lambda)
SELECT e.document_id, e.s3_text_path, e.author, e.date_sent, e.subject
FROM exhibits e
LEFT JOIN search_embedding_status s
    ON s.document_id = e.document_id
    AND s.embedding_model = %(target_model)s
WHERE s.document_id IS NULL
    AND e.delete_at_gmt IS NULL
ORDER BY e.id
LIMIT 1000 OFFSET %(offset)s

Step 2: Publish backfill events (batched)

# handlers/backfill_index.py

def lambda_handler(event, _context):
    request = BackfillRequest.validate(event["body"])
    case_id = request.case_id

    with reader_session(npcase_id=str(case_id)) as session:
        total = _count_unembedded(session, case_id, config.EMBEDDING_MODEL)
        if total == 0:
            return api_response(200, {"message": "All documents already embedded"})

        documents = _get_unembedded_batch(session, case_id, config.EMBEDDING_MODEL)

    # Publish in batches of 100 to avoid Lambda timeout
    for batch in chunked(documents, 100):
        for doc in batch:
            sns.publish(
                eventType="BACKFILL_REQUESTED",
                caseId=case_id,
                documentId=doc.document_id,
                eventDetail={
                    "s3_path": doc.s3_text_path,
                    "embedding_model": config.EMBEDDING_MODEL
                }
            )

    return api_response(202, {
        "message": f"Backfill initiated for {total} documents",
        "case_id": case_id,
        "total_documents": total
    })
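The handler above calls a `chunked` helper that is not shown in the source; a minimal sketch of what it is assumed to do:

```python
from itertools import islice

def chunked(items, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch

# 250 documents in batches of 100 -> batches of 100, 100, 50
batches = list(chunked(range(250), 100))
print([len(b) for b in batches])  # [100, 100, 50]
```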

Step 3: Embedding Lambda processes (same code path)

The embedding Lambda receives messages from either live_embedding_queue or backfill_embedding_queue. The handler code is identical -- it doesn't distinguish between live and backfill events.
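The idempotency that makes a single code path safe for both queues can be sketched as follows; the function name and the in-memory dict are illustrative stand-ins for the real handler and the MySQL checkpoint table keyed by (npcase_id, document_id):

```python
COMPLETE = "COMPLETE"

def handle_embedding_event(event, checkpoints):
    """Process one SQS message; skip work that already completed.

    `checkpoints` stands in for the checkpoint table: a re-delivered
    message, or a backfill/live overlap, finds COMPLETE and does nothing.
    """
    key = (event["caseId"], event["documentId"])
    if checkpoints.get(key) == COMPLETE:
        return "skipped"
    # ... fetch text, chunk, call Voyage AI, write vectors (elided) ...
    checkpoints[key] = COMPLETE
    return "embedded"

cp = {}
assert handle_embedding_event({"caseId": 1, "documentId": "d1"}, cp) == "embedded"
assert handle_embedding_event({"caseId": 1, "documentId": "d1"}, cp) == "skipped"
```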

Step 4: Track progress

-- Status query (runs in status Lambda)
SELECT
    COUNT(*) as total_documents,
    SUM(CASE WHEN status = 'complete' THEN 1 ELSE 0 END) as embedded,
    SUM(CASE WHEN status = 'in_progress' THEN 1 ELSE 0 END) as in_progress,
    SUM(CASE WHEN status = 'failed' THEN 1 ELSE 0 END) as failed,
    embedding_model
FROM search_embedding_status
WHERE npcase_id = %(case_id)s
GROUP BY embedding_model
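A sketch of how the status Lambda might shape those rows into an API response; the function name and response fields are illustrative, not the actual schema:

```python
def progress_summary(rows):
    """Turn status-query rows into per-model progress.

    Each row mirrors the SQL above:
    (total_documents, embedded, in_progress, failed, embedding_model).
    """
    return [
        {
            "embedding_model": model,
            "embedded": embedded,
            "in_progress": in_progress,
            "failed": failed,
            "total": total,
            "percent_complete": round(100 * embedded / total, 1) if total else 0.0,
        }
        for total, embedded, in_progress, failed, model in rows
    ]

summary = progress_summary([(1000, 750, 200, 50, "voyage-law-2")])
print(summary[0]["percent_complete"])  # 75.0
```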

Step 5: Handle edge cases

| Edge Case | Handling |
|---|---|
| Document re-imported after backfill | Live DOCUMENT_PROCESSED event triggers new embedding. Checkpoint PK prevents duplicate. Old chunks/vectors overwritten. |
| Backfill + live ingest concurrent | Both write to checkpoint table. Composite PK (npcase_id, document_id) means first writer wins. Second gets RecoverableException, retries, sees checkpoint = COMPLETE, skips. |
| Case deleted during backfill | IMPORT_CANCELLED event stops in-progress work. Remaining SQS messages processed and skipped (SilentSuccessException). Case deletion also deletes the vector index and per-case DB. |
| Voyage AI unavailable | RecoverableException -> SQS exponential backoff (120s -> 900s). After max retries, DLQ. Backfill can be re-triggered after API recovery. |
| Model upgrade mid-backfill | embedding_model column tracks per-document. Re-run backfill with new model -- filter picks up documents without new model's embeddings. |
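The model-upgrade behavior mirrors the backfill query's LEFT JOIN ... IS NULL filter. A hedged sketch in plain Python (`needs_embedding` and the tuple shapes are illustrative):

```python
def needs_embedding(doc_ids, status_rows, target_model):
    """Return documents lacking an embedding for `target_model`.

    `status_rows` stands in for search_embedding_status:
    (document_id, embedding_model) pairs. A document embedded only with
    an older model is re-selected, matching the SQL's anti-join.
    """
    done = {doc for doc, model in status_rows if model == target_model}
    return [d for d in doc_ids if d not in done]

status = [("d1", "voyage-2"), ("d2", "voyage-2"), ("d2", "voyage-law-2")]
# Upgrading to voyage-law-2: d1 and d3 still need it, d2 already has it.
print(needs_embedding(["d1", "d2", "d3"], status, "voyage-law-2"))  # ['d1', 'd3']
```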

Non-Functional Requirements

NFRs that must be defined before production deployment (Phase 1). Identified via gap analysis against the documentsearch architecture.

Performance

| Requirement | Target | Measurement |
|---|---|---|
| Search latency (p99) | < 2 seconds | CloudWatch alarm on API Gateway |
| Search latency (p50) | < 500ms | CloudWatch metric |
| Query embedding latency | < 100ms | Voyage AI response time |
| Embedding throughput (live ingest) | >= 2,560 docs/min | SQS age of oldest message |
| Backfill throughput | >= 2,560 docs/min (standard API) | Backfill progress tracking |
| Parallel search legs | BM25 and vector search run in parallel, not sequential | Implementation requirement |

Availability

| Requirement | Target | Notes |
|---|---|---|
| Search API uptime | 99.9% (8.76 hours downtime/year) | Match existing ES availability |
| Embedding pipeline uptime | 99.5% | Slight degradation acceptable; new docs queue in SQS |
| Degraded mode | If vector store is down, return BM25-only results | Graceful degradation, not full outage |
| Failover | OpenSearch multi-AZ (production) | Single-AZ acceptable for prototype |

Degraded mode is critical: if the vector store is unavailable, search should fall back to keyword-only results (BM25 leg) rather than failing entirely. The attorney gets results — just not semantically enhanced results.
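A minimal sketch of that fallback path; the exception class, leg functions, and `fuse` placeholder are illustrative stand-ins for the real shell layer, not the module's actual code:

```python
class VectorStoreDown(Exception):
    """Illustrative: raised when the vector store is unreachable."""

def fuse(a, b):
    """Placeholder for RRF fusion: merge, preserving first-seen order."""
    return a + [d for d in b if d not in a]

def hybrid_search(query, bm25_leg, vector_leg):
    bm25_results = bm25_leg(query)          # ES leg is independent of the vector store
    try:
        vector_results = vector_leg(query)
    except VectorStoreDown:
        # Vector store outage: keyword-only results, flagged as degraded
        return {"results": bm25_results, "mode": "bm25_only", "degraded": True}
    return {"results": fuse(bm25_results, vector_results),
            "mode": "hybrid", "degraded": False}

def broken_vector_leg(q):
    raise VectorStoreDown()

out = hybrid_search("spe discussions", lambda q: ["doc-1"], broken_vector_leg)
print(out)  # {'results': ['doc-1'], 'mode': 'bm25_only', 'degraded': True}
```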

Data Retention and Lifecycle

| Data | Retention | Deletion Trigger |
|---|---|---|
| Vector embeddings (OpenSearch) | Lifetime of case | Case deletion -> delete vector index |
| Chunk text (MySQL search_chunks) | Lifetime of case | Case deletion -> table dropped with per-case DB |
| Embedding status (MySQL search_embedding_status) | Lifetime of case | Case deletion -> table dropped |
| Search audit logs (CloudWatch) | 90 days (configurable) | CloudWatch log group retention policy |
| Search audit logs (S3 archive) | 7 years (compliance) | S3 lifecycle policy |

When a case is deleted:

  1. Delete the vector index (case_{id}_vectors) from OpenSearch
  2. The per-case MySQL database is dropped (existing pattern) -- search_chunks and search_embedding_status tables are deleted with it
  3. Search audit logs in CloudWatch are retained per retention policy (they don't contain document content, only query metadata and result IDs)

Disaster Recovery

| Component | Backup Strategy | RPO | RTO |
|---|---|---|---|
| OpenSearch vectors | Automated snapshots to S3 (hourly) | 1 hour | 2-4 hours (restore from snapshot) |
| MySQL chunks/status | Aurora automated backups (existing) | 5 minutes (Aurora continuous) | < 1 hour |
| Voyage AI API key | Secrets Manager (replicated) | 0 | Minutes |
| CDK infrastructure | Infrastructure as code, redeployable | 0 | 30-60 minutes |

Worst case (total OpenSearch loss): Re-embed from extracted text on S3. Chunks in MySQL are the source of truth for text; vectors are derived and re-generable. Full re-embedding of Phase 1 (10M docs) takes 2-3 days. This is the backup of last resort — snapshots should prevent it.

Security and Encryption

| Layer | Requirement | Implementation |
|---|---|---|
| Encryption at rest (vectors) | AES-256 | OpenSearch domain encryption enabled |
| Encryption at rest (chunks) | AES-256 | Aurora encryption enabled (existing) |
| Encryption in transit | TLS 1.2+ | All API calls, Lambda -> OpenSearch, Lambda -> Aurora |
| Voyage AI data handling | SOC 2 compliant | Document text sent to Voyage API for embedding. Voyage does not store input data (verify in terms). |
| IAM authentication | Rails -> API Gateway | Same pattern as documentexchanger |
| Per-case isolation | Hard tenant boundary | Vector index-per-case, MySQL per-case database |
| Secret management | Voyage API key in Secrets Manager | Rotatable, auditable |

If compliance requires document text to never leave VPC: switch from Voyage direct API to SageMaker endpoint (see patterns/asymmetric-embeddings.md).

Compliance

| Standard | Status | Gap |
|---|---|---|
| SOC 2 | Partial | Voyage AI is SOC 2 compliant. End-to-end audit trail (search audit logs) designed. Need formal review of new components. |
| HIPAA | Partial | Encryption at rest/transit covered. BAA with Voyage AI needed if cases contain PHI. SageMaker endpoint avoids external data transfer. |
| Data residency | Depends on deployment | Voyage direct API: data leaves VPC. SageMaker: data stays in VPC. Document the choice. |

Scalability

| Dimension | Current Design | Limit | Scaling Action |
|---|---|---|---|
| Concurrent searches | Lambda concurrency | 1000 (AWS default) | Request limit increase |
| Cases with embeddings | Per-case OpenSearch index | 30K shards/domain | Tiered storage (hot/warm/cold) |
| Documents per case | Single vector index | 25M vectors per index | Sufficient for largest cases |
| Embedding throughput | SQS MaxConcurrency | Voyage API rate limit | Enterprise tier or SageMaker |

Observability

| Component | Monitoring | Alerting |
|---|---|---|
| Search latency | CloudWatch metrics (p50, p99) | Alarm if p99 > 2s |
| Search errors | CloudWatch error rate | Alarm if > 5% |
| Embedding DLQ | Queue depth monitoring | Alarm if > 0 for 15 min |
| Backfill DLQ | Queue depth monitoring | Alarm if > 0 for 30 min |
| Voyage AI errors | Error count metric | Alarm if > 10 in 5 min |
| OpenSearch cluster health | Cluster status (green/yellow/red) | Alarm on yellow or red |
| Embedding pipeline lag | SQS age of oldest message | Alarm if > 30 min |
| Retrieval quality | Not yet defined | Gap -- define quality metrics post-prototype |

Retrieval Quality (Post-Prototype)

Retrieval quality monitoring is a gap that should be addressed after the prototype validates the baseline. Proposed approach:

  1. Golden query set: 20-50 queries with known-good results on a reference case
  2. Automated regression: Run golden queries after any change to chunking, embedding model, search parameters, or RRF weights
  3. Metrics: Precision@10, Recall@20, NDCG@20, Mean Reciprocal Rank
  4. Drift detection: Monthly automated run of golden queries, alert if metrics drop > 5% from baseline

This is not needed for the prototype — the prototype IS the quality validation. Define the golden query set during pilot (10 cases with attorney feedback).
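The first two metrics named above are each a few lines to compute. A minimal sketch of Precision@K and Mean Reciprocal Rank (not the monitoring implementation; NDCG omitted for brevity):

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    top = retrieved[:k]
    return sum(1 for d in top if d in relevant) / k

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1 / rank
    return 0.0

retrieved = ["d3", "d1", "d9", "d2"]   # ranked results for one golden query
relevant = {"d1", "d2"}                # attorney-validated relevant set
print(precision_at_k(retrieved, relevant, 2))  # 0.5  (d1 in top 2, d3 not)
print(mrr(retrieved, relevant))                # 0.5  (first hit at rank 2)
```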

Are search results deterministic?

Semantic search introduces one new source of non-determinism compared to keyword search: the HNSW approximate nearest neighbor algorithm. Understanding each component:

| Component | Deterministic? | Details |
|---|---|---|
| Embedding generation (Voyage AI) | Yes | Same text produces the same vector every time. No temperature, no sampling. |
| BM25 search (Elasticsearch) | Mostly | Same query against a static index returns same scores. Minor variance from per-shard IDF calculation in multi-shard indices. Fixable with search_type=dfs_query_then_fetch. |
| HNSW vector search | No (approximate) | Returns approximately nearest neighbors, not guaranteed exact. See below. |
| RRF fusion | Yes | Pure math on ranked lists. Same inputs produce same output. |
| LLM reranking (optional, not in T1) | No | LLMs have sampling variance. Not used in T1 architecture. |
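RRF's determinism follows from its definition: score(d) = sum over lists of 1 / (k + rank of d in that list). A minimal sketch, using the standard k=60:

```python
def rrf(ranked_lists, k=60):
    """Reciprocal Rank Fusion over any number of ranked lists.

    score(d) = sum over lists of 1 / (k + rank); pure arithmetic on
    ranks, so identical inputs always yield an identical ordering.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d1", "d2", "d3"]      # keyword leg
vector = ["d3", "d1", "d4"]    # vector leg
print(rrf([bm25, vector]))     # ['d1', 'd3', 'd2', 'd4']
```

d1 wins because it ranks highly in both legs (1/61 + 1/62), while d3's mix of 1st and 3rd place (1/63 + 1/61) scores slightly lower.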

HNSW approximation behavior

HNSW does not scan every vector. It traverses a graph to find approximate nearest neighbors. For a static index (no concurrent writes), queries ARE deterministic: same query vector + same graph + same ef_search = same traversal = same results.

When the index is being updated (live ingest or backfill), results can vary between queries because the graph structure changes.

The approximation itself means HNSW may miss some true nearest neighbors in exchange for speed. The ef_search parameter controls this trade-off:

ef_search=100  → examines ~100 candidates → ~5ms,  ~95% recall
ef_search=256  → examines ~256 candidates → ~15ms, ~99% recall
ef_search=512  → examines ~512 candidates → ~30ms, ~99.5% recall

At ef_search=256 (our default), the results that vary between runs are documents at the relevance boundary -- documents that are barely relevant either way. The top results are effectively stable.
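For concreteness, a hedged sketch of how this maps onto OpenSearch k-NN: ef_search is applied as an index-level setting, and the query body below uses illustrative index/field names, not the module's actual schema:

```python
# Index-level setting (illustrative): controls HNSW candidate-list size.
EF_SEARCH_SETTINGS = {
    "index": {"knn.algo_param.ef_search": 256}   # our default: ~99% recall
}

def knn_query_body(query_vector, k=100):
    """Query body for the vector leg's top-K candidates (before RRF).

    'embedding' is an assumed field name for the stored vectors.
    """
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }

body = knn_query_body([0.22, -0.40], k=100)
print(body["query"]["knn"]["embedding"]["k"])  # 100
```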

Keyword search has similar non-determinism sources that are less visible:

| Source of variation | Keyword search | Semantic search |
|---|---|---|
| Index updates between queries | Yes | Yes |
| Shard-level scoring variance | Yes (ES IDF per shard) | Yes (HNSW graph per shard) |
| Approximate algorithm | No (exact match) | Yes (HNSW) |
| Relevance ranking stability | High (BM25 is stable) | High for top results, minor variance at boundary |
| Model changes | N/A | Yes (re-embedding changes all vectors) |

Defensibility strategy

For legal defensibility, the architecture provides four mechanisms:

1. Search audit logging

Every search call is logged with full reproducibility data:

{
    "search_id": "uuid",
    "query": "communications about the safety defect",
    "query_vector": [0.22, -0.40, ...],   # the actual vector, not recomputed
    "case_id": 123,
    "filters": {"custodians": [...], "date_range": {...}},
    "timestamp": "2026-03-31T10:00:00Z",
    "embedding_model": "voyage-law-2",
    "ef_search": 256,
    "result_count": 142,
    "result_document_ids": ["doc-1", "doc-2", ...],
    "result_scores": [0.0323, 0.0318, ...],
    "index_version": "case_123_vectors_v2",
    "search_mode": "hybrid",
    "timings": {"query_embedding_ms": 45, "vector_search_ms": 80, ...}
}

The attorney can declare: "I ran this query at this time and received these specific 142 documents. The search log is attached as Exhibit A."

2. High ef_search default

ef_search=256 provides ~99% recall. The 1% of true nearest neighbors that might be missed are at the relevance boundary -- documents with cosine similarity barely above the threshold. Top results are stable across runs.

3. Point-in-time result snapshots

The legal workflow is: search -> tag results -> review tagged set. Once documents are tagged into a review folder, the review set is frozen. This is the same pattern attorneys already use with keyword search -- they don't re-run keyword searches mid-review and expect identical results while the index is being updated.

Search (results may vary slightly) -> Tag into folder (frozen) -> Review (stable)

4. Exact mode for reproducibility-critical searches

For searches that must be bit-identical across runs (e.g., for a motion declaration or a search methodology affidavit), the API supports exact mode:

POST /search
{
    "query": "communications about the safety defect",
    "case_id": 123,
    "mode": "hybrid",
    "exact": true,
    "index_snapshot": "v2"
}

Exact mode:

  • Runs brute-force k-NN (not HNSW) -- scans every vector, no approximation
  • Pins to a specific index snapshot version
  • Uses dfs_query_then_fetch for BM25 (global IDF, not per-shard)
  • Returns bit-identical results every time against the same snapshot
  • Slower (~500ms vs ~20ms) -- use for final declarations, not exploration

What to communicate to attorneys

Semantic search results are:

  • Stable -- the top results for a given query are consistent across runs against a static index. Minor variations occur only at the relevance boundary.
  • Reproducible -- every search is logged with the query vector, timestamp, full result set, and index version. The log is the definitive record.
  • Not bit-identical during active indexing -- same as keyword search during active imports. Results stabilize once indexing is complete.
  • Defensible -- result snapshots are preserved in audit logs. Exact mode available for search methodology declarations. The search methodology (hybrid BM25 + vector, RRF fusion, specific model and parameters) is fully documentable.

Search methodology documentation template

For declarations or search methodology letters, the system can generate:

Search Methodology Statement

Search Technology: Hybrid semantic + keyword search (documentsearch v1.0)
Embedding Model: Voyage AI voyage-law-2 (1024 dimensions, legal-optimized)
Keyword Engine: Elasticsearch 7.4 (BM25 scoring)
Fusion Method: Reciprocal Rank Fusion (k=60)
Vector Search: HNSW with ef_search=256 (~99% recall)

Search Parameters:
  Query: [natural language query text]
  Filters: [custodians, date range, batch IDs]
  Mode: hybrid (semantic + keyword)
  Results returned: [N] documents

Reproducibility:
  Search ID: [uuid]
  Executed: [timestamp]
  Index version: [version identifier]
  Full result set preserved in search audit log.

Divergences from Standard NGE Patterns

| Aspect | documentsearch | Standard NGE | Why |
|---|---|---|---|
| External API dependency | Voyage AI embedding API | No external APIs | Embedding generation requires specialized model not available in AWS |
| Dual ingest queues | Live + backfill at different concurrency | Single queue per event type | Backfill must not starve live ingest or Voyage rate limits |
| Sync API for search | API Gateway -> Lambda (25s) | Async SQS processing | Search requires synchronous response for UI |
| Vector store abstraction | shell/vectorstore/base.py interface | No storage abstraction | Allows prototype (pgvector) and production (OpenSearch) backends |
| No per-batch dynamic infra | Job Processor optional (may use static queues) | Per-batch Lambda/SQS creation | Embedding is lighter-weight than document loading; static queues may suffice |
| Backfill pipeline | Separate trigger and queue for existing data | No backfill concept | Must handle documents that pre-date the module |

Configuration

| Setting | Default | Notes |
|---|---|---|
| Embedding model | voyage-law-2 | Configurable for model upgrades |
| Embedding dimensions | 1024 | Must match model output |
| Chunk target size | 512 tokens | Voyage AI sweet spot |
| Chunk overlap | 50 tokens | Cross-boundary context preservation |
| RRF k constant | 60 | Standard from RRF paper |
| Live MaxConcurrency | 10 | Embedding Lambda SQS concurrency |
| Backfill MaxConcurrency | 3 | Lower priority than live |
| HNSW ef_search | 256 | ~99% recall. Higher = more accurate, slower. |
| Exact mode default | false | Brute-force k-NN for reproducibility. ~500ms vs ~20ms. |
| Search top-K per leg | 100 | Candidates from each search leg before RRF |
| Search Lambda timeout | 25s | API Gateway integration limit |
| Embedding Lambda timeout | 900s | Max Lambda timeout |
| Embedding Lambda memory | 1024MB | Sufficient for vector operations |
| Search Lambda memory | 512MB | Lighter than embedding |
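The chunk size and overlap settings combine as sliding token windows. A naive baseline sketch of that arithmetic (the real chunkers in core/chunking/ are domain-aware; `chunk_tokens` is illustrative):

```python
def chunk_tokens(tokens, size=512, overlap=50):
    """Fixed-size token windows with overlap, per the configuration above.

    Consecutive chunks share `overlap` tokens so context spanning a
    boundary appears in both chunks.
    """
    if size <= overlap:
        raise ValueError("chunk size must exceed overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

# 1,000 tokens -> 3 chunks: 0-511, 462-973, 924-999
chunks = chunk_tokens(list(range(1000)))
print(len(chunks), chunks[1][0])  # 3 462
```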

Prototype vs Production

| Concern | Prototype | Production |
|---|---|---|
| Vector store | pgvector on single Aurora PostgreSQL | OpenSearch k-NN or OpenSearch Serverless |
| Multi-tenancy | Single hardcoded case | Per-case index lifecycle |
| Chunking | Paragraph-level with metadata prepend | Domain-specific (email/document/spreadsheet) |
| Rate limiting | SQS MaxConcurrency only | SQS MaxConcurrency + Voyage API rate tracking |
| Frontend | Standalone React app (Claude Code) | Rails UI integration (ERB + React) |
| Embedding versioning | None | Model version tracked, re-embedding pipeline |
| Cost tracking | None | Token counting per Voyage call, per-case allocation |
| Monitoring | CloudWatch logs only | Latency alarms, retrieval quality metrics |
| Backfill | Manual script | API + on-first-search auto-trigger |
| Backfill scope | 1 case (~50K docs, ~$18) | Phase 1: 10M docs (~$3,600), Phase 2: 78M docs (~$28K) |

Future Evaluation: PageIndex (T2)

PageIndex (VectifyAI, open-source) is a vectorless, reasoning-based retrieval framework that builds a hierarchical tree from a document and uses LLM reasoning to navigate it. Mafin 2.5 (built on PageIndex) achieved 98.7% on FinanceBench where typical vector RAG scored 65-80%.

PageIndex does NOT replace T1 (corpus-level search). It solves a different problem: deep structured reasoning WITHIN a single document (following cross-references, navigating sections/tables). It cannot search across a corpus.

How it works: Two-step process. (1) Index generation: analyzes document's natural structure (sections, subsections, headings) and builds a hierarchical tree where each node has a title, summary, answerable-questions, and page range. Original document stays intact. (2) Reasoning-driven tree search: LLM reads top-level nodes, reasons which branch likely contains the answer, follows it, repeats at each level, retrieves complete sections (not fragments). Retrieval is explainable — you see which tree nodes were traversed and which pages retrieved.

Core critique of vector RAG relevant to our chunking strategy: Semantic similarity ≠ relevance. A paragraph about "capital expenditure 2021" and "revenue projections 2023" embed near each other because both use financial language, but only one answers the question. Chunking destroys the author's organizational intelligence (sections, cross-references, logical flow). Our domain-aware chunking (email-aware, section-aware) mitigates this but doesn't eliminate it.

Potential T2 integration: Use vector search to find top 20 documents, then PageIndex for deep extraction from those 20. Relevant for depo prep, narrative investigation, settlement prep, motion to compel — use cases where the attorney needs precise facts from structured documents (contracts, regulatory filings) after search identifies them.

Additional resources: PageIndex is open-source (github.com/VectifyAI/PageIndex), has a cloud API (api.pageindex.ai/v1) with MCP protocol integration, and a chat interface (chat.pageindex.ai) for testing on custom documents.

Evaluate during Phase 4 (agent service development). Compare against long-context LLMs on the same documents. See reference-implementations/semantic-search-use-cases.md Appendix C for full evaluation plan.

Key File Locations

| File | Purpose |
|---|---|
| lib/lambda/src/documentsearch/core/chunking/ | Domain-specific chunking logic |
| lib/lambda/src/documentsearch/core/search/hybrid.py | Reciprocal Rank Fusion |
| lib/lambda/src/documentsearch/shell/vectorstore/base.py | Vector store abstraction |
| lib/lambda/src/documentsearch/shell/embedding/voyage_client.py | Voyage AI integration |
| lib/lambda/src/documentsearch/handlers/embedding_index.py | Ingest handler (SQS) |
| lib/lambda/src/documentsearch/handlers/search_index.py | Search handler (API GW) |
| lib/lambda/src/documentsearch/handlers/backfill_index.py | Backfill trigger (API GW) |
| lib/lambda/src/documentsearch/handlers/job_processor_index.py | Batch lifecycle |
| lib/common-resources-stack.ts | Vector store + shared infra |
| lib/search-ingest-stack.ts | Embedding pipeline CDK |
| lib/search-api-stack.ts | Search API CDK |