ADR: Vector Store Selection for Semantic Search¶
Status: Proposed
Date: 2026-03-31
Context: the documentsearch module needs a vector store for embedding storage and retrieval
Decision Drivers¶
- Multi-tenant isolation -- each case's vectors must be hard-isolated (data from case A must never leak to case B)
- Hybrid search -- attorneys need both keyword search (Bates numbers, exact names) and semantic search (conceptual queries) in one ranked result list
- Scale -- ~1000 active cases, ranging from 100 to 500K documents each; 5-50 chunks per document; total vectors: 500 to 25M per case
- Existing stack -- Aurora MySQL, Elasticsearch 7.4, Lambda, SNS/SQS (all AWS)
- Embedding model -- Voyage AI voyage-law-2 (1024 dimensions)
- Index lifecycle -- create per case, delete on case deletion, rebuild on re-embedding
Options Evaluated¶
Option 1: Amazon OpenSearch Service (Managed) with k-NN Plugin¶
How it works: Managed OpenSearch cluster with k-NN plugin (Faiss HNSW engine). Index-per-case for tenant isolation. Search pipelines provide native hybrid search (BM25 + k-NN) with configurable normalization (min-max, z-score) or rank-based fusion (RRF) in a single query.
Multi-tenant model: Index-per-case (matches existing ES pattern). For 1000 cases, this means 1000+ indices. Practical with tiered approach: dedicated indices for large cases, shared index with case_id filtering for small cases.
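A sketch of that tiered routing, with an assumed 100K-document threshold and illustrative index names (neither comes from this ADR):

```python
from typing import Optional

LARGE_CASE_THRESHOLD = 100_000  # assumed cutoff for a dedicated index

def index_for_case(case_id: int, doc_count: int) -> tuple[str, Optional[dict]]:
    """Pick (index_name, extra_filter) under the tiered layout."""
    if doc_count >= LARGE_CASE_THRESHOLD:
        # Large case: dedicated index, isolation enforced by the index boundary
        return f"case_{case_id}", None
    # Small case: shared index, so every query MUST carry a case_id term filter
    return "cases_shared", {"term": {"case_id": case_id}}
```

The non-negotiable part is that the shared-index path always attaches the filter; isolation bugs live wherever a caller forgets it.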
Hybrid search: Native and best-in-class. The normalization-processor (score normalization) and score-ranker-processor (RRF) combine BM25 + k-NN in a single query with configurable weights. This is the ONLY option that does true in-engine hybrid fusion.
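As a concrete sketch of the in-engine fusion (the pipeline name, the body/embedding field names, and the 0.3/0.7 weights are illustrative, not values from this ADR), the pipeline definition and a per-case query body expressed as Python dicts:

```python
# PUT /_search/pipeline/hybrid-pipeline
hybrid_pipeline = {
    "description": "min-max normalize BM25 and k-NN scores, then weighted-average them",
    "phase_results_processors": [{
        "normalization-processor": {
            "normalization": {"technique": "min_max"},
            "combination": {
                "technique": "arithmetic_mean",
                "parameters": {"weights": [0.3, 0.7]},  # keyword weight, semantic weight
            },
        }
    }],
}

def hybrid_query(keywords: str, query_vector: list[float], k: int = 50) -> dict:
    """Body for GET /case_{id}/_search?search_pipeline=hybrid-pipeline."""
    return {
        "query": {
            "hybrid": {
                "queries": [
                    {"match": {"body": {"query": keywords}}},                  # BM25 leg
                    {"knn": {"embedding": {"vector": query_vector, "k": k}}},  # k-NN leg
                ]
            }
        }
    }
```

Both legs run in one request and the pipeline fuses the score lists server-side, which is exactly the two-system problem the other options cannot avoid.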
Key numbers:
- Warm query latency: 5-20ms (k-NN), 20-50ms (hybrid)
- Cold query penalty: seconds (HNSW graph must load from disk)
- k-NN memory per vector (1024-dim, M=16): ~4.6 KB
- Max shards per domain: 30,000 (hard limit of 1,000 per node)
- k-NN graph memory: up to 50% of non-heap RAM per node
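The ~4.6 KB figure can be sanity-checked against the formula OpenSearch documents for estimating HNSW memory, roughly 1.1 * (4 * dimensions + 8 * M) bytes per vector:

```python
def knn_memory_bytes(dimensions: int, m: int) -> float:
    # OpenSearch's documented HNSW estimate: 1.1 * (4 * d + 8 * M) bytes per vector
    return 1.1 * (4 * dimensions + 8 * m)

per_vector = knn_memory_bytes(1024, 16)            # ~4,646 bytes, i.e. the ~4.6 KB above
largest_case_gb = 25_000_000 * per_vector / 2**30  # graph memory for a 25M-vector case
```

A single 25M-vector case alone wants roughly 108 GB of k-NN graph memory, which is why the 50% non-heap cap and cache eviction across 1000 indices matter.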
Cost estimate (1000 cases):
- 6x r6g.2xlarge data nodes: ~$2,935/mo
- 3x c6g.large masters: ~$247/mo
- EBS gp3 (6 TB): ~$732/mo
- Total: ~$3,900-$5,000/mo on-demand; ~$2,700-$3,400/mo with a 1-year RI
Pros:
- Strongest migration path from the existing ES 7.4 cluster (API-compatible)
- Only option with native hybrid search
- Can REPLACE the existing ES cluster -- consolidates keyword + semantic into one system
- Full index lifecycle control
- Faiss HNSW with SIMD (AVX2/AVX-512), optimized for 1024-dim vectors
Cons:
- Highest operational complexity (cluster sizing, shard management, k-NN memory tuning)
- Cold query penalty when HNSW graphs are evicted from cache (1000 indices compete for memory)
- Highest base cost
Option 2: Amazon OpenSearch Serverless (Vector Search Collection)¶
How it works: Serverless OpenSearch with automatic scaling via OCUs (OpenSearch Capacity Units). Collection groups (Feb 2026) allow sharing OCU pools across collections, reducing per-tenant overhead.
Multi-tenant model: One collection with index-per-tenant (most cost-effective), or collection-per-tenant with collection groups for shared capacity. Default limit: 50 collections per account (increasable).
Hybrid search: Supported. Same Neural Search pipeline as managed OpenSearch.
Key numbers:
- Warm query latency: 5-20ms (comparable to managed)
- Index refresh interval: 60 seconds (NOT configurable) -- newly embedded docs remain invisible for up to 60s
- Only the Faiss HNSW engine is supported
- Max OCUs: 200 per account (increasable)
Cost estimate:
- Minimum: 4 half-OCUs = ~$350/mo for a single collection
- 1000 cases as indices in a shared collection: ~$700-$2,000/mo
- Storage: $0.024/GB-month
Pros:
- No cluster management, patching, or shard tuning
- Auto-scales (including to near-zero with collection groups)
- Lowest operational overhead
- Native hybrid search
- Cheapest at low-to-moderate query volume
Cons:
- 60-second refresh interval (unacceptable for real-time indexing during import)
- Less tuning control (no per-query ef_search, limited engine options)
- Newer service, fewer production references at scale
- Cannot use the Lucene engine
Option 3: Aurora PostgreSQL + pgvector¶
How it works: PostgreSQL extension adding vector data type, HNSW and IVFFlat index types, and distance operators. Runs on existing Aurora infrastructure. Schema-per-tenant mirrors the existing _case_{id} MySQL pattern.
Multi-tenant model: Schema-per-tenant (each case gets its own schema with its own HNSW index). Natural fit with existing per-case MySQL databases. pgvector 0.8.0's iterative scan fixes historical issues with HNSW + tenant filtering.
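A sketch of the per-case DDL under that schema-per-tenant layout (the table and column names are illustrative; the vector type and HNSW index syntax are pgvector's):

```python
def provision_case_sql(case_id: int) -> str:
    # One schema per case: its own table and its own HNSW index
    return f"""
    CREATE SCHEMA IF NOT EXISTS case_{case_id};
    CREATE TABLE IF NOT EXISTS case_{case_id}.chunks (
        id        bigserial PRIMARY KEY,
        doc_id    bigint NOT NULL,
        body      text   NOT NULL,
        embedding vector(1024) NOT NULL  -- voyage-law-2 dimensionality
    );
    CREATE INDEX IF NOT EXISTS chunks_embedding_hnsw
        ON case_{case_id}.chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);
    """

def drop_case_sql(case_id: int) -> str:
    # Case deletion drops the whole schema: hard isolation, clean lifecycle
    return f"DROP SCHEMA IF EXISTS case_{case_id} CASCADE;"
```

Dropping the schema covers the delete-on-case-deletion lifecycle requirement in one statement; rebuild-on-re-embedding is drop plus re-provision.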
Hybrid search: NOT native. Requires two queries plus application-side fusion:
1. PostgreSQL full-text search (tsvector + ts_rank) or pg_trgm for keyword
2. pgvector similarity search for semantic
3. Combine in application code (RRF)
PostgreSQL full-text search is functional but significantly less capable than ES/OpenSearch BM25 for legal text: no advanced analyzers, no field boosting, no phrase slop, and weaker fuzzy matching.
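Step 3 above -- the application-side fusion -- is small; a minimal reciprocal rank fusion over the two ranked ID lists might look like:

```python
def rrf_fuse(keyword_ids: list[str], vector_ids: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (keyword_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Docs that rank high in BOTH lists accumulate the largest scores
    return sorted(scores, key=scores.get, reverse=True)
```

k=60 is the constant commonly used since the original RRF paper. The fusion itself is trivial; the cost of this option is operating and keeping consistent the two systems that feed it.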
Key numbers:
- In-memory HNSW query: 5-20ms (comparable to OpenSearch)
- HNSW index build (10M vectors, 1024-dim): 30-120 minutes depending on instance
- Memory per vector (1024-dim, M=16): ~4.6 KB; halfvec cuts this by 50%
- pgvector 0.8.0 reported up to a 150x speedup over earlier versions
Cost estimate:
- Aurora PostgreSQL r6g.2xlarge: ~$900/mo
- RDS Proxy (required for Lambda): ~$87/mo
- Storage: $0.10-$0.20/GB-month
- Total: ~$1,000-$1,500/mo
- BUT: the existing ES cluster is still needed for keyword search (~$2,000+/mo)
- True total: ~$3,000-$3,500/mo (pgvector + ES)
Pros:
- Lowest standalone cost
- Schema-per-tenant matches the existing MySQL pattern
- Familiar SQL operations
- Transactional consistency (chunks + vectors in the same DB)
- RDS Proxy solves Lambda connection pooling
- pgvector 0.8.0 performance is competitive
Cons:
- NO native hybrid search -- must maintain TWO search systems (pgvector + ES)
- PostgreSQL full-text search is weaker than ES BM25 for legal text
- New database engine to operate (the current stack is Aurora MySQL, not PostgreSQL)
- HNSW index builds are slow for large cases (30-120 min for 10M vectors)
- Once the true cost includes ES, it is NOT cheaper than OpenSearch
Option 4: Amazon Bedrock Knowledge Bases¶
How it works: Managed RAG pipeline. You provide documents, Bedrock handles chunking, embedding, vector storage, and retrieval.
Multi-tenant model: One KB per case (silo pattern). AWS explicitly supports this. For 1000 cases = 1000 Knowledge Bases.
Hybrid search: NOT supported. Bedrock KB does semantic retrieval only. No integration with existing ES for BM25. Must fuse externally.
Voyage AI support: NOT natively supported. Supported models: Amazon Titan, Cohere Embed, Amazon Nova. Voyage AI requires SageMaker JumpStart custom endpoint -- adds latency and complexity.
Key numbers:
- Query latency: 50-200ms (API overhead; higher than a direct vector store)
- Max 5M documents per data source
- 100 GB per ingestion job
- Custom chunking supported via a Lambda function
Cost:
- Per-query and per-ingestion charges
- Vector store backend costs apply separately (you choose: OpenSearch Serverless, pgvector, Pinecone, S3 Vectors)
- Complex to estimate; highly variable
Pros:
- Lowest development effort for basic RAG
- Managed chunking pipeline
- S3 Vectors backend (Dec 2025) up to 90% cheaper for storage
Cons:
- No native Voyage AI support (the chosen embedding model)
- No hybrid search
- Black box -- cannot customize ranking, normalization, or fusion
- 1000 KBs is significant management overhead
- Adds an unnecessary managed layer over a vector store that can be used directly
Option 5: Amazon MemoryDB for Redis (Vector Search)¶
How it works: In-memory vector search on Redis protocol. Fastest raw k-NN performance.
Multi-tenant model: Key-prefix isolation only (case:123:vector:456). No schema-level or index-level isolation. For hard isolation, separate clusters (extremely expensive).
Hybrid search: None. Pure k-NN only.
Key numbers:
- Query latency: 1-5ms (best-in-class)
- CRITICAL: single-shard only -- no horizontal scaling for vector search
- All vectors must fit in one node's memory
- r7g.16xlarge max: ~423 GB usable memory
Cost:
- r7g.4xlarge (~105 GB): ~$1,263/mo
- r7g.16xlarge (~423 GB): ~$5,044/mo
Pros:
- Fastest query latency
- Simple API
Cons:
- Single-shard limit is a hard ceiling
- No text search, no hybrid
- No tenant isolation beyond key prefixes
- Most expensive per GB
- No index lifecycle management
Decision Matrix¶
| Dimension | OpenSearch Managed | OpenSearch Serverless | pgvector | Bedrock KB | MemoryDB |
|---|---|---|---|---|---|
| Multi-tenant isolation | Strong | Moderate | Strong | Strong | Weak |
| Native hybrid search | YES | YES | No | No | No |
| Voyage AI compatible | Yes | Yes | Yes | Via SageMaker | Yes |
| Replaces existing ES | YES | YES | No | No | No |
| Query latency | Good | Good | Good | Moderate | Best |
| Operational complexity | High | Low | Medium | Medium | Medium |
| True monthly cost | $3,900-5,000 | $700-2,000 | $3,000-3,500* | Variable | $1,200-5,000 |
| Index lifecycle control | Full | Limited | Full | Managed | Limited |
| Scale ceiling | 30K shards | 200 OCUs | Instance RAM | 5M docs | Single node RAM |
| Migration from ES 7.4 | Natural | Moderate | N/A | N/A | N/A |
*pgvector cost includes maintaining existing ES cluster for keyword search
Recommendation¶
Prototype: Aurora PostgreSQL + pgvector¶
Why: Fastest to stand up -- a single CREATE EXTENSION vector statement on a small Aurora PostgreSQL instance, with no cluster to size. The VectorStore abstraction in shell/vectorstore/ means we can swap backends without touching business logic. For a single demo case with known documents, pgvector performance is more than sufficient.
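A sketch of what the swappable abstraction might look like, with a brute-force in-memory stand-in useful for exercising callers in tests (the four method names follow the Consequences section; everything else here is illustrative):

```python
from abc import ABC, abstractmethod
import math

class VectorStore(ABC):
    """Backend-agnostic contract; pgvector and OpenSearch backends implement it."""
    @abstractmethod
    def create_index(self, case_id: int) -> None: ...
    @abstractmethod
    def upsert_vectors(self, case_id: int, items: dict[str, list[float]]) -> None: ...
    @abstractmethod
    def search(self, case_id: int, query: list[float], k: int = 10) -> list[str]: ...
    @abstractmethod
    def delete_index(self, case_id: int) -> None: ...

class InMemoryStore(VectorStore):
    """Brute-force cosine similarity -- a test double, not a real backend."""
    def __init__(self) -> None:
        self._cases: dict[int, dict[str, list[float]]] = {}

    def create_index(self, case_id: int) -> None:
        self._cases[case_id] = {}

    def upsert_vectors(self, case_id: int, items: dict[str, list[float]]) -> None:
        self._cases[case_id].update(items)

    def search(self, case_id: int, query: list[float], k: int = 10) -> list[str]:
        def cos(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            return dot / (math.hypot(*a) * math.hypot(*b) or 1.0)
        vecs = self._cases[case_id]
        return sorted(vecs, key=lambda i: cos(query, vecs[i]), reverse=True)[:k]

    def delete_index(self, case_id: int) -> None:
        self._cases.pop(case_id, None)
```

Because every backend takes case_id on every call, tenant routing (schema, index, or filter) stays inside the backend and out of business logic.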
The prototype does NOT need native hybrid search -- the application-level RRF in core/search/hybrid.py combines pgvector results with the existing ES BM25 results. This is adequate for validating retrieval quality.
Production: Evaluate OpenSearch Managed vs OpenSearch Serverless¶
The production decision should be deferred until the prototype validates:
1. Retrieval quality -- does voyage-law-2 + hybrid search produce the "wow" moment?
2. Query volume -- how many searches per day per case?
3. Case distribution -- how many large cases (>100K docs) vs small cases?
If query volume is moderate and cases are mostly small: OpenSearch Serverless. Lower ops overhead, auto-scaling, native hybrid search, $700-2,000/mo.
If query volume is high or many large cases exist: OpenSearch Managed. Full tuning control, best hybrid search, consolidates keyword + semantic into one system (eliminates separate ES cluster), $2,700-3,400/mo with RI.
Either way, OpenSearch wins for production because:
1. Native hybrid search eliminates the two-system problem
2. It can replace the existing ES 7.4 cluster (consolidation)
3. Index-per-case matches the existing isolation pattern
4. API-compatible migration path from ES 7.4
Eliminate from consideration¶
- Bedrock KB: No native Voyage AI, no hybrid search, black-box ranking. Adds complexity without value for this use case.
- MemoryDB: Single-shard limit, no hybrid search, no tenant isolation. Wrong tool for this job.
Related Decision: Embedding Model Deployment¶
The vector store decision is independent of how embeddings are generated. The
embedding provider is abstracted behind shell/embedding/provider.py. Three
options were evaluated:
Voyage AI Direct API (Recommended)¶
- $0.12/million tokens. Zero infra. SOC 2 compliant.
- Best retrieval quality (voyage-law-2 is legal-optimized).
- Data leaves VPC (document text sent to Voyage AI). Acceptable for most SaaS e-discovery platforms.
- Recommended for both prototype and production unless compliance requires data-in-VPC.
SageMaker Endpoint (If Data Residency Required)¶
- voyage-law-2 available on AWS Marketplace (prodview-bknagyko2vl7a).
- $0.22/million tokens + $737/mo instance cost. Data stays in VPC.
- Only if compliance team mandates no document text leaves VPC.
Bedrock Native Models (Cost Optimization, Lower Quality)¶
- voyage-law-2 is NOT available on Bedrock.
- Bedrock native options: Titan V2 ($0.02/M), Cohere Embed ($0.10/M), Nova Embed.
- 5-15% lower retrieval quality on legal text vs voyage-law-2.
- No asymmetric embedding support (Titan/Cohere are symmetric).
- Evaluate as cost optimization AFTER prototype proves value with voyage-law-2. Starting with a weaker model risks the prototype failing to produce the "wow" moment with attorneys.
Embedding Provider Decision¶
| Phase | Provider | Why |
|---|---|---|
| Prototype | Voyage AI direct API | Best quality, simplest, ~$60/case |
| Production (default) | Voyage AI direct API | Still simplest, SOC 2 compliant |
| Production (data-in-VPC) | SageMaker endpoint | +$737/mo, data never leaves VPC |
| Cost optimization (later) | Evaluate Bedrock Titan V2 | 6x cheaper, test if quality acceptable |
The EmbeddingProvider abstraction makes switching a config change
(EMBEDDING_PROVIDER=voyage_api|sagemaker|bedrock), not a code change.
See patterns/asymmetric-embeddings.md for full implementation details.
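That switch can be sketched as a small factory keyed off the environment variable (the client class names other than VoyageDirectClient are placeholders, and all bodies are stubs):

```python
import os

class VoyageDirectClient:          # named elsewhere in this ADR; body is a stub
    name = "voyage_api"

class SageMakerEmbeddingClient:    # placeholder name
    name = "sagemaker"

class BedrockTitanClient:          # placeholder name
    name = "bedrock"

_REGISTRY = {
    "voyage_api": VoyageDirectClient,
    "sagemaker": SageMakerEmbeddingClient,
    "bedrock": BedrockTitanClient,
}

def make_embedding_provider(env=None):
    """Choose the provider from EMBEDDING_PROVIDER -- a config change, not a code change."""
    env = os.environ if env is None else env
    key = env.get("EMBEDDING_PROVIDER", "voyage_api")
    if key not in _REGISTRY:
        raise ValueError(f"unknown EMBEDDING_PROVIDER: {key!r}")
    return _REGISTRY[key]()
```

Failing loudly on an unknown value matters more than it looks: a typo in the env var should break deploy, not silently fall back to a different embedding model.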
Consequences¶
- shell/vectorstore/base.py interface must support: create_index, upsert_vectors, search, delete_index
- shell/embedding/provider.py interface must support: embed_documents, embed_query (swappable provider)
- Prototype uses pgvector_store.py + VoyageDirectClient; production uses opensearch_store.py with the same client or a SageMaker client
- core/search/hybrid.py (RRF) remains the fusion logic regardless of vector store or embedding provider
- If OpenSearch Managed is chosen for production, plan the ES 7.4 -> OpenSearch migration as a separate workstream (consolidation opportunity)
- OpenSearch Serverless 60-second refresh interval must be tested against real-time import UX expectations before choosing Serverless over Managed
- If Bedrock Titan V2 is later evaluated for cost optimization, run retrieval quality comparison on a known matter against voyage-law-2 before switching