
ADR: Vector Store Selection for Semantic Search

Status: Proposed
Date: 2026-03-31
Context: the documentsearch module needs a vector store for embedding storage and retrieval

Decision Drivers

  1. Multi-tenant isolation -- each case's vectors must be hard-isolated (data from case A must never leak to case B)
  2. Hybrid search -- attorneys need both keyword (Bates numbers, exact names) and semantic (conceptual queries) in one ranked result
  3. Scale -- ~1000 active cases, ranging from 100 to 500K documents each; 5-50 chunks per document; total vectors: 500 to 25M per case
  4. Existing stack -- Aurora MySQL, Elasticsearch 7.4, Lambda, SNS/SQS (all AWS)
  5. Embedding model -- Voyage AI voyage-law-2 (1024 dimensions)
  6. Index lifecycle -- create per case, delete on case deletion, rebuild on re-embedding

Options Evaluated

Option 1: Amazon OpenSearch Service (Managed) with k-NN Plugin

How it works: Managed OpenSearch cluster with k-NN plugin (Faiss HNSW engine). Index-per-case for tenant isolation. Search pipelines provide native hybrid search (BM25 + k-NN) with configurable normalization (min-max, z-score) or rank-based fusion (RRF) in a single query.

Multi-tenant model: Index-per-case (matches existing ES pattern). For 1000 cases, this means 1000+ indices. Practical with tiered approach: dedicated indices for large cases, shared index with case_id filtering for small cases.

Hybrid search: Native and best-in-class. The normalization-processor (min-max, z-score) and score-ranker processor (RRF) combine BM25 + k-NN in a single query with configurable weights. Together with Serverless, which shares the same pipeline, OpenSearch is the only option here that does true in-engine hybrid fusion.
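To make the in-engine fusion concrete, here is a minimal sketch of the two request bodies involved. The field names (chunk_text, embedding), helper names, and default weights are illustrative assumptions, not from this ADR:

```python
# Sketch: OpenSearch hybrid (BM25 + k-NN) request bodies.
# Field names, helper names, and weights are illustrative assumptions.

def hybrid_pipeline(bm25_weight: float = 0.3, knn_weight: float = 0.7) -> dict:
    """Search pipeline that min-max normalizes BM25 and k-NN scores,
    then combines them with configurable weights."""
    return {
        "phase_results_processors": [{
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [bm25_weight, knn_weight]},
                },
            }
        }]
    }

def hybrid_query(text: str, query_vector: list[float], k: int = 50) -> dict:
    """One request: a BM25 sub-query plus a k-NN sub-query, fused
    in-engine by the search pipeline above."""
    return {
        "size": k,
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"chunk_text": {"query": text}}},
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},
                ]
            }
        },
    }
```

The pipeline body would be PUT once per cluster and the query sent with a `search_pipeline` parameter referencing it; switching the normalization-processor for the score-ranker processor changes the fusion from weighted min-max to RRF.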

Key numbers:
  • Warm query latency: 5-20ms (k-NN), 20-50ms (hybrid)
  • Cold query penalty: seconds (HNSW graph must load from disk)
  • k-NN memory per vector (1024-dim, M=16): ~4.6 KB
  • Max shards per domain: 30,000 (hard limit 1,000/node)
  • k-NN graph memory: up to 50% of non-heap RAM per node

Cost estimate (1000 cases):
  • 6x r6g.2xlarge data nodes: ~$2,935/mo
  • 3x c6g.large masters: ~$247/mo
  • EBS gp3 (6 TB): ~$732/mo
  • Total: ~$3,900-$5,000/mo on-demand; ~$2,700-$3,400/mo with 1-year RI

Pros:
  • Strongest migration path from existing ES 7.4 (API-compatible)
  • Native in-engine hybrid search
  • Can REPLACE the existing ES cluster -- consolidates keyword + semantic into one system
  • Full index lifecycle control
  • Faiss HNSW with SIMD (AVX2/AVX-512), well-suited to 1024-dim vectors

Cons:
  • Highest operational complexity (cluster sizing, shard management, k-NN memory tuning)
  • Cold query penalty when HNSW graphs are evicted from cache (1000 indices compete for memory)
  • Highest base cost


Option 2: Amazon OpenSearch Serverless (Vector Search Collection)

How it works: Serverless OpenSearch with automatic scaling via OCUs (OpenSearch Capacity Units). Collection groups (Feb 2026) allow sharing OCU pools across collections, reducing per-tenant overhead.

Multi-tenant model: One collection with index-per-tenant (most cost-effective), or collection-per-tenant with collection groups for shared capacity. Default limit: 50 collections per account (increasable).

Hybrid search: Supported. Same Neural Search pipeline as managed OpenSearch.

Key numbers:
  • Warm query latency: 5-20ms (comparable to managed)
  • Index refresh interval: 60 seconds (NOT configurable) -- newly embedded docs are invisible for up to 60s
  • Only the Faiss HNSW engine is supported
  • Max OCUs: 200 per account (increasable)

Cost estimate:
  • Minimum: 4 half-OCUs = ~$350/mo for a single collection
  • 1000 cases as indices in a shared collection: ~$700-$2,000/mo
  • Storage: $0.024/GB-month

Pros:
  • No cluster management, patching, or shard tuning
  • Auto-scales (including to near-zero with collection groups)
  • Lowest operational overhead
  • Native hybrid search
  • Cheapest at low-to-moderate query volume

Cons:
  • 60-second refresh interval (unacceptable for real-time indexing during import)
  • Less tuning control (no per-query ef_search, limited engine options)
  • Newer service, fewer production references at scale
  • Cannot use the Lucene engine


Option 3: Aurora PostgreSQL + pgvector

How it works: PostgreSQL extension adding vector data type, HNSW and IVFFlat index types, and distance operators. Runs on existing Aurora infrastructure. Schema-per-tenant mirrors the existing _case_{id} MySQL pattern.

Multi-tenant model: Schema-per-tenant (each case gets its own schema with its own HNSW index). Natural fit with existing per-case MySQL databases. pgvector 0.8.0's iterative scan fixes historical issues with HNSW + tenant filtering.
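As a sketch of what schema-per-tenant plus iterative scan looks like in practice (the schema, table, and column names are assumptions modeled on the _case_{id} pattern; the placeholders are psycopg-style named parameters):

```python
# Sketch: per-case schema lookup using pgvector 0.8.0's iterative scan.
# Schema/table/column names are illustrative assumptions.

def case_search_sql(case_id: int) -> str:
    """HNSW similarity search scoped to one case's schema.
    hnsw.iterative_scan (pgvector >= 0.8.0) keeps walking the graph
    until enough rows survive the WHERE filter, instead of returning
    too few results when the filter is selective."""
    schema = f"_case_{case_id}"  # hard isolation: one schema per case
    return (
        "SET LOCAL hnsw.iterative_scan = 'relaxed_order';\n"
        "SELECT chunk_id, embedding <=> %(query_vec)s::vector AS distance\n"
        f"FROM {schema}.chunks\n"
        "WHERE document_id = ANY(%(doc_ids)s)\n"
        "ORDER BY distance\n"
        "LIMIT %(k)s;"
    )
```

SET LOCAL assumes the statement runs inside a transaction, which is the normal case when executed through a driver like psycopg.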

Hybrid search: NOT native. Must run two queries and fuse them:

  1. PostgreSQL full-text search (tsvector + ts_rank) or pg_trgm for keyword
  2. pgvector similarity search for semantic
  3. Combine in application code (RRF)

PostgreSQL full-text search is functional but significantly less capable than ES/OpenSearch BM25 for legal text: no advanced analyzers, no field boosting, no phrase slop, and nothing on the same level for fuzzy matching.
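Since fusion lives in application code for this option, the combining step itself is small. A minimal RRF sketch (the k=60 smoothing constant is the conventional default, not something this ADR specifies):

```python
# Sketch: application-level Reciprocal Rank Fusion of two ranked ID lists.

def rrf_fuse(keyword_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Each doc scores the sum of 1/(k + rank) over every list it
    appears in; docs found by both searches rise to the top."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A doc ranked first in only one list scores 1/61, while a doc ranked second in one list and first in the other scores 1/62 + 1/61, so agreement between the two searches outweighs a single high rank.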

Key numbers:
  • In-memory HNSW query: 5-20ms (comparable to OpenSearch)
  • HNSW index build (10M vectors, 1024-dim): 30-120 minutes depending on instance
  • Memory per vector (1024-dim, M=16): ~4.6 KB; halfvec cuts it by 50%
  • pgvector 0.8.0 reported a 150x speedup over earlier versions

Cost estimate:
  • Aurora PostgreSQL r6g.2xlarge: ~$900/mo
  • RDS Proxy (required for Lambda): ~$87/mo
  • Storage: $0.10-$0.20/GB-month
  • Total: ~$1,000-$1,500/mo
  • BUT: the existing ES cluster is still needed for keyword search (~$2,000+/mo)
  • True total: ~$3,000-$3,500/mo (pgvector + ES)

Pros:
  • Lowest standalone cost
  • Schema-per-tenant matches the existing MySQL pattern
  • Familiar SQL operations
  • Transactional consistency (chunks + vectors in the same DB)
  • RDS Proxy solves Lambda connection pooling
  • pgvector 0.8.0 performance is competitive

Cons:
  • NO native hybrid search -- must maintain TWO search systems (pgvector + ES)
  • PostgreSQL full-text search is weaker than ES BM25 for legal text
  • New database engine to operate (current stack is Aurora MySQL, not PostgreSQL)
  • HNSW index build is slow for large cases (30-120 min for 10M vectors)
  • Once the ES cluster is counted, it is NOT cheaper than OpenSearch


Option 4: Amazon Bedrock Knowledge Bases

How it works: Managed RAG pipeline. You provide documents, Bedrock handles chunking, embedding, vector storage, and retrieval.

Multi-tenant model: One KB per case (silo pattern). AWS explicitly supports this. For 1000 cases = 1000 Knowledge Bases.

Hybrid search: NOT supported. Bedrock KB does semantic retrieval only. No integration with existing ES for BM25. Must fuse externally.

Voyage AI support: NOT natively supported. Supported models: Amazon Titan, Cohere Embed, Amazon Nova. Voyage AI requires SageMaker JumpStart custom endpoint -- adds latency and complexity.

Key numbers:
  • Query latency: 50-200ms (API overhead; higher than a direct vector store)
  • Max 5M documents per data source
  • 100 GB per ingestion job
  • Custom chunking supported via a Lambda function

Cost:
  • Per-query and per-ingestion charges
  • Vector store backend costs apply separately (you choose: OpenSearch Serverless, pgvector, Pinecone, S3 Vectors)
  • Complex to estimate; highly variable

Pros:
  • Lowest development effort for basic RAG
  • Managed chunking pipeline
  • S3 Vectors backend (Dec 2025) up to 90% cheaper for storage

Cons:
  • No native Voyage AI support (the chosen embedding model)
  • No hybrid search
  • Black box -- cannot customize ranking, normalization, or fusion
  • 1000 KBs is significant management overhead
  • Adds an unnecessary managed layer over a vector store you can use directly


Option 5: Amazon MemoryDB for Redis (Vector Search)

How it works: In-memory vector search over the Redis protocol. Fastest raw k-NN performance.

Multi-tenant model: Key-prefix isolation only (case:123:vector:456). No schema-level or index-level isolation. For hard isolation, separate clusters (extremely expensive).

Hybrid search: None. Pure k-NN only.

Key numbers:
  • Query latency: 1-5ms (best-in-class)
  • CRITICAL: single-shard only -- no horizontal scaling for vector search
  • All vectors must fit in one node's memory
  • r7g.16xlarge max: ~423 GB usable memory

Cost:
  • r7g.4xlarge (~105 GB): ~$1,263/mo
  • r7g.16xlarge (~423 GB): ~$5,044/mo

Pros:
  • Fastest query latency
  • Simple API

Cons:
  • Single-shard limit is a hard ceiling
  • No text search, no hybrid
  • No tenant isolation beyond key prefix
  • Most expensive per GB
  • No index lifecycle management

Decision Matrix

| Dimension | OpenSearch Managed | OpenSearch Serverless | pgvector | Bedrock KB | MemoryDB |
|---|---|---|---|---|---|
| Multi-tenant isolation | Strong | Moderate | Strong | Strong | Weak |
| Native hybrid search | Yes | Yes | No | No | No |
| Voyage AI compatible | Yes | Yes | Yes | Via SageMaker | Yes |
| Replaces existing ES | Yes | Yes | No | No | No |
| Query latency | Good | Good | Good | Moderate | Best |
| Operational complexity | High | Low | Medium | Medium | Medium |
| True monthly cost | $3,900-5,000 | $700-2,000 | $3,000-3,500* | Variable | $1,200-5,000 |
| Index lifecycle control | Full | Limited | Full | Managed | Limited |
| Scale ceiling | 30K shards | 200 OCUs | Instance RAM | 5M docs | Single-node RAM |
| Migration from ES 7.4 | Natural | Moderate | N/A | N/A | N/A |

*pgvector cost includes maintaining existing ES cluster for keyword search

Recommendation

Prototype: Aurora PostgreSQL + pgvector

Why: Fastest to stand up. One CREATE EXTENSION vector on a small Aurora PostgreSQL instance. No cluster to size. The VectorStore abstraction in shell/vectorstore/ means we can swap backends without touching business logic. For a single demo case with known documents, pgvector performance is more than sufficient.
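A minimal sketch of that prototype setup, assuming a simple chunks table (the table, column names, and the choice of cosine distance are illustrative, not from this ADR):

```python
# Sketch: minimal pgvector prototype setup. Names are illustrative.

SETUP_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    chunk_id    bigserial PRIMARY KEY,
    document_id bigint NOT NULL,
    chunk_text  text NOT NULL,
    embedding   vector(1024)          -- voyage-law-2 output dimension
);

-- HNSW index with cosine distance (pgvector default build parameters)
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);
"""

def to_pgvector(vec: list[float]) -> str:
    """Format a Python list as a pgvector literal, e.g. '[0.1,0.25]',
    for passing as a query parameter."""
    return "[" + ",".join(f"{x:g}" for x in vec) + "]"
```

With this in place, a similarity query is a single `SELECT ... ORDER BY embedding <=> %s::vector LIMIT k` through any PostgreSQL driver.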

Prototype does NOT need native hybrid search -- the application-level RRF in core/search/hybrid.py combines pgvector results with existing ES BM25 results. This is adequate for validating retrieval quality.

Production: Evaluate OpenSearch Managed vs OpenSearch Serverless

The production decision should be deferred until the prototype validates:

  1. Retrieval quality -- does voyage-law-2 + hybrid search produce the "wow" moment?
  2. Query volume -- how many searches per day per case?
  3. Case distribution -- how many large cases (>100K docs) vs small cases?

If query volume is moderate and cases are mostly small: OpenSearch Serverless. Lower ops overhead, auto-scaling, native hybrid search, $700-2,000/mo.

If query volume is high or many large cases exist: OpenSearch Managed. Full tuning control, best hybrid search, consolidates keyword + semantic into one system (eliminates separate ES cluster), $2,700-3,400/mo with RI.

Either way, OpenSearch wins for production because:

  1. Native hybrid search eliminates the two-system problem
  2. It can replace the existing ES 7.4 cluster (consolidation)
  3. Index-per-case matches the existing isolation pattern
  4. The migration path from ES 7.4 is API-compatible

Eliminate from consideration

  • Bedrock KB: No native Voyage AI, no hybrid search, black-box ranking. Adds complexity without value for this use case.
  • MemoryDB: Single-shard limit, no hybrid search, no tenant isolation. Wrong tool for this job.

Embedding Provider Options

The vector store decision is independent of how embeddings are generated. The embedding provider is abstracted behind shell/embedding/provider.py. Three options were evaluated.

Voyage AI Direct API (Recommended)

Lambda (VPC) -> NAT Gateway -> https://api.voyageai.com/v1/embeddings
  • $0.12/million tokens. Zero infra. SOC 2 compliant.
  • Best retrieval quality (voyage-law-2 is legal-optimized).
  • Data leaves VPC (document text sent to Voyage AI). Acceptable for most SaaS e-discovery platforms.
  • Recommended for both prototype and production unless compliance requires data-in-VPC.
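A sketch of the direct call using only the stdlib. The payload fields (model, input, input_type) follow Voyage's embeddings API, the input_type value is what enables asymmetric document/query embeddings, and the response parsing assumes the documented `{"data": [{"embedding": ...}]}` shape:

```python
# Sketch: direct call to the Voyage AI endpoint above (stdlib only;
# the voyageai SDK is an alternative). Response shape assumed as documented.
import json
import urllib.request

VOYAGE_URL = "https://api.voyageai.com/v1/embeddings"

def embed_payload(texts: list[str], input_type: str = "document") -> dict:
    """input_type is 'document' at index time and 'query' at search time;
    voyage-law-2 embeds the two sides asymmetrically."""
    return {"model": "voyage-law-2", "input": texts, "input_type": input_type}

def embed(texts: list[str], api_key: str,
          input_type: str = "document") -> list[list[float]]:
    req = urllib.request.Request(
        VOYAGE_URL,
        data=json.dumps(embed_payload(texts, input_type)).encode(),
        headers={"Authorization": f"Bearer {api_key}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return [item["embedding"] for item in body["data"]]
```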

SageMaker Endpoint (If Data Residency Required)

Lambda (VPC) -> SageMaker Endpoint (VPC) -> voyage-law-2 on ml.g6.xlarge
  • voyage-law-2 available on AWS Marketplace (prodview-bknagyko2vl7a).
  • $0.22/million tokens + $737/mo instance cost. Data stays in VPC.
  • Only if compliance team mandates no document text leaves VPC.

Bedrock Native Models (Cost Optimization, Lower Quality)

Lambda (VPC) -> Bedrock InvokeModel -> amazon.titan-embed-text-v2:0
  • voyage-law-2 is NOT available on Bedrock.
  • Bedrock native options: Titan V2 ($0.02/M), Cohere Embed ($0.10/M), Nova Embed.
  • 5-15% lower retrieval quality on legal text vs voyage-law-2.
  • No asymmetric embedding support (Titan/Cohere are symmetric).
  • Evaluate as cost optimization AFTER prototype proves value with voyage-law-2. Starting with a weaker model risks the prototype failing to produce the "wow" moment with attorneys.

Embedding Provider Decision

| Phase | Provider | Why |
|---|---|---|
| Prototype | Voyage AI direct API | Best quality, simplest, ~$60/case |
| Production (default) | Voyage AI direct API | Still simplest, SOC 2 compliant |
| Production (data-in-VPC) | SageMaker endpoint | +$737/mo, data never leaves VPC |
| Cost optimization (later) | Evaluate Bedrock Titan V2 | 6x cheaper, test if quality acceptable |

The EmbeddingProvider abstraction makes switching a config change (EMBEDDING_PROVIDER=voyage_api|sagemaker|bedrock), not a code change. See patterns/asymmetric-embeddings.md for full implementation details.
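A sketch of how that config switch can be wired. Only VoyageDirectClient, the two method names, and the EMBEDDING_PROVIDER values appear in this ADR; the other class names and the registry are illustrative:

```python
# Sketch: config-driven embedding provider selection.
# Class bodies are stubs; registry and SageMakerClient name are illustrative.
import os
from typing import Protocol

class EmbeddingProvider(Protocol):
    """Contract from shell/embedding/provider.py."""
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    def embed_query(self, text: str) -> list[float]: ...

class VoyageDirectClient:
    """Direct calls to api.voyageai.com (prototype + production default)."""
    def embed_documents(self, texts):
        raise NotImplementedError("POST /v1/embeddings, input_type='document'")
    def embed_query(self, text):
        raise NotImplementedError("POST /v1/embeddings, input_type='query'")

class SageMakerClient:
    """In-VPC voyage-law-2 endpoint (data-residency path)."""
    def embed_documents(self, texts):
        raise NotImplementedError
    def embed_query(self, text):
        raise NotImplementedError

# 'bedrock' gets registered only if the Titan V2 evaluation pans out.
_REGISTRY = {"voyage_api": VoyageDirectClient, "sagemaker": SageMakerClient}

def get_provider() -> EmbeddingProvider:
    """EMBEDDING_PROVIDER selects the backend at config time, not in code."""
    return _REGISTRY[os.environ.get("EMBEDDING_PROVIDER", "voyage_api")]()
```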

Consequences

  • shell/vectorstore/base.py interface must support: create_index, upsert_vectors, search, delete_index
  • shell/embedding/provider.py interface must support: embed_documents, embed_query (swappable provider)
  • Prototype uses pgvector_store.py + VoyageDirectClient; production uses opensearch_store.py + same or SageMaker client
  • core/search/hybrid.py (RRF) remains the fusion logic regardless of vector store or embedding provider
  • If OpenSearch Managed is chosen for production, plan ES 7.4 -> OpenSearch migration as a separate workstream (consolidation opportunity)
  • OpenSearch Serverless 60-second refresh interval must be tested against real-time import UX expectations before choosing Serverless over Managed
  • If Bedrock Titan V2 is later evaluated for cost optimization, run retrieval quality comparison on a known matter against voyage-law-2 before switching
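The shell/vectorstore/base.py contract listed above can be sketched as a Protocol; only the four method names come from this document, and the parameter signatures are illustrative assumptions:

```python
# Sketch: backend-agnostic vector store contract (signatures are assumptions).
from typing import Protocol, Sequence

class VectorStore(Protocol):
    """pgvector_store.py (prototype) and opensearch_store.py (production)
    both implement these four methods, so swapping backends is a wiring
    change that never touches business logic."""

    def create_index(self, case_id: int, dimensions: int = 1024) -> None: ...

    def upsert_vectors(self, case_id: int, ids: Sequence[str],
                       vectors: Sequence[Sequence[float]]) -> None: ...

    def search(self, case_id: int, query_vector: Sequence[float],
               k: int = 50) -> list[tuple[str, float]]: ...

    def delete_index(self, case_id: int) -> None: ...
```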