Pattern: Asymmetric Embeddings for Legal Search¶
Context¶
The documentsearch module uses vector embeddings to enable natural language search. The embedding model converts text into vectors (lists of numbers) that encode semantic meaning. Similar concepts produce vectors that are close together in high-dimensional space.
Asymmetric embedding is the technique of using DIFFERENT encoding modes for documents vs queries. This is critical for retrieval quality in legal search.
The Problem with Symmetric Embeddings¶
A symmetric model treats documents and queries identically:
```python
# Symmetric approach (wrong for search)
doc_vector = embed(
    "The Raptor structure needs the equity infusion before Q2 close. "
    "Andy and I discussed the timeline yesterday. We agreed to proceed "
    "with the transfer without flagging it to the audit committee."
)
query_vector = embed("internal discussions about Special Purpose Entities")
```
These texts share zero vocabulary. A symmetric model trained to make similar texts close together has no reason to understand that a 12-word question should map to the same region of vector space as a 40-word email. They are different lengths, different structures, different purposes. The result: moderate similarity scores at best, missed relevant documents at worst.
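Retrieval scores the query vector against each document vector with cosine similarity; a minimal sketch (plain Python, standard library only) of that scoring step:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors (real embeddings are 1024-dimensional):
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```

A symmetric model tends to leave the query vector only moderately close to the relevant email's vector; asymmetric training is what pulls the pair together.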
The Asymmetric Solution¶
Asymmetric embedding trains the model with two modes:
- Document mode (`input_type="document"`): Encode the full semantic content of the passage. Optimized for being FOUND by queries.
- Query mode (`input_type="query"`): Encode the INTENT behind the question. Optimized for FINDING relevant documents.
```python
# Asymmetric approach (correct for search)
doc_vector = voyage.embed(
    "The Raptor structure needs the equity infusion before Q2 close...",
    input_type="document"  # "I am evidence. Encode my full meaning."
)
query_vector = voyage.embed(
    "internal discussions about Special Purpose Entities",
    input_type="query"  # "I am a question. Encode what I'm looking for."
)
```
A short query about SPEs lands in the same vector space region as long emails discussing Raptor, JEDI, and off-balance-sheet structures -- even though they share no keywords. The model learned this mapping from millions of (query, relevant_document) training pairs.
How Training Works¶
The model is trained on triplets:
```
Query:             "attorney-client privilege discussions about merger risk"

Positive document: "I'd suggest we get Sarah's input on the antitrust exposure
                    before responding to their due diligence request. She flagged
                    some concerns last week that we should address internally first."

Negative document: "Please see attached the attorney fee schedule for Q3. Legal
                    department costs are tracked in the shared budget spreadsheet."

Training signal: query should be CLOSE to positive, FAR from negative
```
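This training signal can be sketched as a standard triplet margin loss over cosine similarities (illustrative only; Voyage's actual training objective is not public):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def triplet_loss(query, positive, negative, margin=0.2):
    """Zero loss only when the positive doc outscores the negative by at least `margin`."""
    return max(0.0, margin - (cosine(query, positive) - cosine(query, negative)))

# Toy vectors: the positive already outscores the negative by more than the margin,
# so this triplet contributes nothing to training.
print(triplet_loss([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]))  # → 0.0
```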
After millions of legal-domain triplets, the model learns:

- "attorney-client privilege" as a QUERY means "legal advice being sought or given" -- NOT "documents containing the word attorney"
- The positive document IS privilege (Sarah = in-house counsel, "address internally first" = legal strategy) -- no keyword overlap with the query
- The negative document contains "attorney" and "legal" but is NOT privilege
This is why voyage-law-2 outperforms general-purpose models: it was trained on
legal (query, document) pairs where the semantic relationships are domain-specific.
What Happens with the Wrong input_type¶
# WRONG: embedding a query as a document
query_vector = voyage.embed("discussions about SPEs", input_type="document")
The model interprets this as a 4-word document fragment, not a search intent. It encodes the literal content rather than the concept "find me emails where executives discussed SPE structures by any name." Result: lower recall. Documents containing "SPE" are found, but the Raptor and JEDI emails are missed because the model did not activate the search-intent pathway.
Impact: asymmetric queries retrieve 15-30% more relevant documents than symmetric queries on legal benchmarks.
Implementation in documentsearch¶
When to Use Each Mode¶
| Operation | input_type | When | Why |
|---|---|---|---|
| Document embedding | `"document"` | Ingest time (embedding pipeline) | Encodes what the document IS ABOUT |
| Query embedding | `"query"` | Search time (per request) | Encodes what the searcher WANTS TO FIND |
| Backfill embedding | `"document"` | Backfill pipeline | Same as ingest -- encoding existing documents |
| Re-embedding | `"document"` | Model upgrade | Same as ingest -- re-encoding with new model |
Hexagonal Boundary¶
The input_type distinction is an infrastructure concern, not business logic:
- `shell/embedding/voyage_client.py` -- sets `input_type` based on which method is called (`embed_documents` vs `embed_query`)
- `core/` -- never knows about `input_type`. Receives vectors as parameters. The chunking logic in `core/chunking/` produces text; the embedding call in `shell/` converts it to vectors.
```python
# shell/embedding/voyage_client.py
import voyageai  # Config (the app's settings module) supplies VOYAGE_API_KEY

class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128
    MAX_TOKENS_PER_BATCH = 120_000

    def __init__(self):
        self._client = voyageai.Client(api_key=Config.VOYAGE_API_KEY)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """
        Embed document chunks for storage. Called at ingest time.
        input_type="document" tells the model these are evidence passages.
        Auto-batches to stay within Voyage API limits.
        """
        all_embeddings = []
        for batch in self._batch_texts(texts):
            result = self._client.embed(
                texts=batch,
                model=self.MODEL,
                input_type="document",
                truncation=True,
            )
            all_embeddings.extend(result.embeddings)
        return all_embeddings

    def embed_query(self, query: str) -> list[float]:
        """
        Embed a single search query. Called at search time.
        input_type="query" tells the model this is search intent.
        ~50ms latency for a single query.
        """
        result = self._client.embed(
            texts=[query],
            model=self.MODEL,
            input_type="query",
            truncation=True,
        )
        return result.embeddings[0]

    def _batch_texts(self, texts: list[str]) -> list[list[str]]:
        """
        Split texts into batches respecting API limits:
          - Max 128 texts per request
          - Max 120,000 tokens per request
        Estimates tokens as len(text) / 4 (conservative for English).
        """
        batches = []
        current_batch = []
        current_tokens = 0
        for text in texts:
            estimated_tokens = len(text) // 4
            would_exceed_count = len(current_batch) >= self.MAX_BATCH_SIZE
            would_exceed_tokens = (
                current_tokens + estimated_tokens > self.MAX_TOKENS_PER_BATCH
            )
            if current_batch and (would_exceed_count or would_exceed_tokens):
                batches.append(current_batch)
                current_batch = []
                current_tokens = 0
            current_batch.append(text)
            current_tokens += estimated_tokens
        if current_batch:
            batches.append(current_batch)
        return batches
```
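The batching logic can be exercised standalone without an API key (the function below is the same greedy algorithm as `_batch_texts`, lifted out of the class for illustration):

```python
MAX_BATCH_SIZE = 128
MAX_TOKENS_PER_BATCH = 120_000

def batch_texts(texts: list[str]) -> list[list[str]]:
    """Greedy batching under both the text-count limit and the estimated-token limit."""
    batches, current, tokens = [], [], 0
    for text in texts:
        est = len(text) // 4  # rough chars-to-tokens estimate for English
        if current and (len(current) >= MAX_BATCH_SIZE or tokens + est > MAX_TOKENS_PER_BATCH):
            batches.append(current)
            current, tokens = [], 0
        current.append(text)
        tokens += est
    if current:
        batches.append(current)
    return batches

# 300 short chunks: the 128-text cap binds before the token cap
print([len(b) for b in batch_texts(["x" * 800] * 300)])  # → [128, 128, 44]
```

With very long texts the token cap binds instead: two 400,000-character texts (~100K estimated tokens each) land in separate batches.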
Handler Bridge (How Core Receives Vectors)¶
```python
# handlers/embedding_index.py (ingest)
def _process_document(message, session):
    # core/ produces text chunks (no AWS knowledge)
    chunks = chunk_document(text, metadata)

    # shell/ converts to vectors (knows about Voyage AI)
    vectors = voyage.embed_documents([c.full_text for c in chunks])

    # shell/ stores vectors (knows about vector store)
    vector_store.upsert_vectors(case_id, zip(chunks, vectors))
```

```python
# handlers/search_index.py (query)
def _handle_search(request):
    # shell/ embeds query (input_type="query")
    query_vector = voyage.embed_query(request.query)

    # shell/ runs search legs
    vector_results = vector_store.search(case_id, query_vector, k=100)
    bm25_results = es_ops.keyword_search(case_id, request.query, size=100)

    # core/ fuses results (pure logic, no AWS)
    ranked = rrf(vector_results, bm25_results, k=60)
```
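The `rrf` call refers to Reciprocal Rank Fusion. A minimal sketch, assuming each leg is an ordered list of document IDs (the real `core/` signature may differ):

```python
def rrf(*legs: list[str], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over legs of 1 / (k + rank_in_leg)."""
    scores: dict[str, float] = {}
    for leg in legs:
        for rank, doc_id in enumerate(leg, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks highly in both legs, so it wins the fused ranking.
print(rrf(["a", "b", "c"], ["b", "d", "a"]))  # → ['b', 'a', 'd', 'c']
```

The constant `k=60` dampens the influence of top ranks so that no single leg can dominate the fused ordering.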
Why the Checkpoint Matters¶
Voyage API calls are the most expensive step (cost and latency). The checkpoint
at EMBEDDINGS_GENERATED prevents re-calling Voyage on retries:
```
EMBEDDING_STARTED     ✓
TEXT_FETCHED          ✓   (S3: 50ms, ~$0)
CHUNKS_CREATED        ✓   (chunking: 10ms, $0)
EMBEDDINGS_GENERATED  ✓   (Voyage API: 400ms, $0.003)  <-- expensive
VECTORS_STORED        ✗   (Lambda timeout before write completes)

On retry: resumes from VECTORS_STORED. Voyage API NOT re-called.
```
With 500K documents and 5% retry rate, this saves 25K unnecessary API calls.
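A sketch of the checkpoint guard -- all names here (`STEPS`, `run_pipeline`, the callback dict) are hypothetical; the real pipeline presumably persists completed steps per document and consults them on retry:

```python
# Illustrative only: steps execute in order, skipping anything already checkpointed.
STEPS = ["TEXT_FETCHED", "CHUNKS_CREATED", "EMBEDDINGS_GENERATED", "VECTORS_STORED"]

def run_pipeline(doc_id: str, completed: set[str], actions: dict) -> list[str]:
    """Run only steps not yet checkpointed; record each step as it completes."""
    executed = []
    for step in STEPS:
        if step in completed:
            continue  # checkpointed on a prior attempt -- never re-call Voyage
        actions[step](doc_id)
        completed.add(step)
        executed.append(step)
    return executed

# Simulate a retry where the expensive embedding step already checkpointed:
calls: list[str] = []
actions = {s: (lambda doc, s=s: calls.append(s)) for s in STEPS}
done = {"TEXT_FETCHED", "CHUNKS_CREATED", "EMBEDDINGS_GENERATED"}
remaining = run_pipeline("doc-1", done, actions)
print(remaining)  # → ['VECTORS_STORED']
```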
Voyage AI on AWS¶
voyage-law-2 Is NOT on Amazon Bedrock¶
Bedrock's native embedding models are Amazon Titan, Cohere Embed, and Amazon Nova. Voyage AI is not a Bedrock model provider. This means there is no "just call Bedrock" option for legal-optimized embeddings.
Deployment Options¶
Three paths, in order of recommendation:
Option 1: Voyage AI Direct API (Recommended for Prototype AND Production)¶
Call the Voyage AI hosted API from Lambda. Simplest possible setup.
- Pros: Zero infra to manage. Latest model always available. SOC 2 compliant. No GPU instances, no SageMaker quotas, no endpoint management.
- Cons: Data leaves your VPC (document text sent to Voyage AI for embedding). External dependency. ~50ms network latency per call.
- Cost: $0.12 per million tokens (voyage-law-2).
- Best for: Prototype AND production unless compliance requires data-in-VPC.
This is the same pattern as calling any external API (Stripe, Twilio, etc.). Most SaaS e-discovery platforms already send data to external services. Voyage AI's API is SOC 2 compliant. Unless your compliance team specifically requires that document text never leave the VPC, this is the right choice.
Option 2: AWS Marketplace SageMaker Endpoint (If Data Residency Required)¶
Deploy voyage-law-2 as a SageMaker real-time inference endpoint within your VPC. The model runs on your GPU instances.
- `voyage-law-2` AWS Marketplace listing: https://aws.amazon.com/marketplace/pp/prodview-bknagyko2vl7a
- Deployable via CloudFormation, SageMaker Console, AWS CLI, or CDK
- Official deployment notebook: github.com/voyage-ai/voyageai-aws
- Pros: Data stays in your VPC. No external API dependency. Lower latency (~20ms in-VPC). Inherits AWS compliance (SOC 2, HIPAA-eligible).
- Cons: GPU instance cost (ml.g6.xlarge ~$1.01/hr = ~$737/mo always-on). Instance quota increase required (default = 0). Model updates require redeployment. Operational overhead (endpoint monitoring, scaling).
- Cost: $0.22 per million tokens + $737/mo instance cost. Throughput: 12.6M tokens/hr.
- Best for: Production ONLY if compliance requires data never leaving VPC.
Option 3: Bedrock with Alternative Model (If Vendor Lock-In to AWS Required)¶
If the organization mandates Bedrock-only services, use a Bedrock-native embedding model instead of Voyage AI:
| Bedrock Model | Dimensions | Legal Quality | Cost/M tokens |
|---|---|---|---|
| Amazon Titan Text V2 | 256/512/1024 | General-purpose, NOT legal-tuned | $0.02 |
| Cohere Embed English v3 | 1024 | Good general, NOT legal-specific | $0.10 |
| Amazon Nova Embed | 1024 | Newest general-purpose | Check pricing |
- Pros: Fully managed. No endpoints. No external API. Cheapest (Titan: $0.02/M). Native AWS integration.
- Cons: 5-15% lower retrieval quality on legal text vs voyage-law-2. Not trained on legal corpus. May miss the "stay the course" email that voyage-law-2 finds. Risks the prototype not producing the "wow" moment.
- Best for: Cost optimization AFTER proving value with voyage-law-2. Or if AWS-only is a hard organizational constraint.
Recommended Deployment for Nextpoint¶
| Phase | Method | Why |
|---|---|---|
| Prototype | Voyage AI direct API | Fastest setup. Best retrieval quality. ~$60 per case. |
| Production | Voyage AI direct API | Still simplest. SOC 2 compliant. No infra to manage. |
| Production (if data-in-VPC required) | SageMaker endpoint | Data stays in VPC. +$737/mo. |
| Cost optimization (later) | Evaluate Titan V2 | If Titan retrieval quality is acceptable, 6x cheaper. |
Start with the best model (voyage-law-2 via direct API). Prove the value. Then optimize cost. Starting with a weaker model risks the prototype not producing the "wow" moment with attorneys.
The VoyageClient abstraction in shell/embedding/ handles all three targets.
Switching between direct API, SageMaker, or Bedrock is a config change
(EMBEDDING_PROVIDER=voyage_api|sagemaker|bedrock), not a code change.
Embedding Provider Abstraction¶
```python
# shell/embedding/provider.py
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...


class VoyageDirectClient(EmbeddingProvider):
    """Calls Voyage AI hosted API. Prototype and production default."""
    # input_type="document" / "query" handled internally


class VoyageSageMakerClient(EmbeddingProvider):
    """Calls SageMaker endpoint running voyage-law-2 in VPC."""
    # Same input_type semantics, different transport


class BedrockEmbeddingClient(EmbeddingProvider):
    """Calls Bedrock InvokeModel for Titan/Cohere/Nova."""
    # No asymmetric input_type (Bedrock models are symmetric)
    # Lower retrieval quality on legal text
```
The handler doesn't know which provider is active:
```python
# handlers/embedding_index.py
from shell.embedding.provider import get_embedding_provider

provider = get_embedding_provider()  # reads EMBEDDING_PROVIDER config
vectors = provider.embed_documents(chunk_texts)
```
CDK for SageMaker Endpoint (If Needed)¶
```typescript
// lib/common-resources-stack.ts (only if SageMaker path chosen)
import { SageMakerEndpoint } from 'generative-ai-cdk-constructs';

const voyageEndpoint = new SageMakerEndpoint(this, 'VoyageLaw2Endpoint', {
  modelId: 'voyage-law-2',
  instanceType: 'ml.g6.xlarge',
  instanceCount: 1,
  vpcConfig: {
    vpc: props.vpc,
    subnets: props.privateSubnets,
    securityGroups: [props.lambdaSecurityGroup]
  }
});

voyageEndpoint.grantInvoke(embeddingLambda);
```
voyage-law-2 vs Newer Models¶
MongoDB acquired Voyage AI in 2025. Newer models are available:
| Model | Dims | Context | Legal Benchmark | Status |
|---|---|---|---|---|
| voyage-law-2 | 1024 | 16K | Legal-optimized, MTEB #1 for legal | AWS Marketplace, direct API |
| voyage-3-large | 1024 | 32K | General-purpose, strong on legal | AWS Marketplace, direct API |
| voyage-3.5 | 1024 | 32K | +8.26% vs OpenAI on average | AWS Marketplace, direct API |
| voyage-4-large | 1024 | 32K | MoE, shared embedding space | Check availability |
Note on Voyage 4 family: Voyage 4 models share a common embedding space. You can embed documents with voyage-4-large (highest accuracy) and query with voyage-4-lite (fastest) without re-indexing. This would eliminate the trade-off between ingest accuracy and query latency. Worth evaluating when available.
The embedding_model column on search_chunks and the EmbeddingProvider
abstraction make model upgrades a re-embedding operation, not a code change.
The architecture is designed for this evolution.
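The `embedding_model` column makes stale vectors queryable. A sketch of the upgrade selection -- SQLite and this two-column schema are used purely for illustration; the real store and schema differ:

```python
import sqlite3

# Toy search_chunks table: embedding_model records which model produced each vector.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE search_chunks (chunk_id TEXT, embedding_model TEXT)")
conn.executemany(
    "INSERT INTO search_chunks VALUES (?, ?)",
    [("c1", "voyage-law-2"), ("c2", "voyage-law-2"), ("c3", "voyage-3.5")],
)

CURRENT_MODEL = "voyage-3.5"
# A model upgrade is a data migration: re-embed only chunks with stale vectors.
stale = [
    row[0]
    for row in conn.execute(
        "SELECT chunk_id FROM search_chunks WHERE embedding_model != ?",
        (CURRENT_MODEL,),
    )
]
print(stale)  # → ['c1', 'c2']
```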
Key Numbers¶
| Metric | Value | Impact |
|---|---|---|
| Embedding dimensions | 1024 | ~4.6 KB/vector with HNSW overhead |
| Document embedding latency | 100-500ms per batch of 128 | Ingest: ~250-1280 chunks/second per Lambda |
| Query embedding latency | ~50ms (single query) | Adds 50ms to every search request |
| Asymmetric retrieval lift | 15-30% more relevant docs | Justification for input_type discipline |
| Cost (direct API) | $0.12 per million tokens | 500K docs x 5 chunks x 200 tokens = ~$60/case |
| Cost (SageMaker) | $0.22 per million tokens + instance | Higher per-token but data stays in VPC |
| SageMaker throughput | 12.6M tokens/hr on ml.g6.xlarge | Backfill 500K docs in ~8 hours |
| Max tokens per text | 16,000 (voyage-law-2) | 512-token chunk target is well within limit |
| Max batch size | 128 texts per request | Auto-batching handles transparently |
| Query cost at scale | ~$0.012/day at 1000 searches/day | Negligible |
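The per-case cost figure in the table is straightforward arithmetic; a sketch (real chunk and token counts vary by corpus):

```python
def case_embedding_cost(docs: int, chunks_per_doc: int, tokens_per_chunk: int,
                        usd_per_million_tokens: float) -> float:
    """Total embedding cost for one case at a per-million-token rate."""
    total_tokens = docs * chunks_per_doc * tokens_per_chunk
    return total_tokens / 1_000_000 * usd_per_million_tokens

# 500K docs x 5 chunks x 200 tokens = 500M tokens at $0.12/M (direct API)
print(case_embedding_cost(500_000, 5, 200, 0.12))  # → 60.0
```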
Anti-Patterns¶
- Using input_type="document" for queries: 15-30% recall loss.
- Using input_type="query" for documents: Encodes intent, not content. Vectors are less informative for similarity comparison.
- Omitting input_type entirely: Falls back to symmetric mode. Loses the asymmetric training benefit.
- Mixing models across document/query: A document embedded with voyage-law-2 and queried with voyage-3-large will produce poor results -- the vectors are in different spaces. The `embedding_model` column prevents this.
- Embedding metadata separately from content: Prepend metadata to chunk text BEFORE embedding so the vector captures both WHO and WHAT. Don't store metadata only as structured fields.
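The last anti-pattern -- prepend metadata before embedding -- might look like this (the field names and header format are hypothetical, not the documentsearch chunking schema):

```python
def chunk_text_for_embedding(sender: str, recipient: str, date: str, body: str) -> str:
    """Prepend WHO/WHEN metadata so the vector encodes sender context and content together."""
    header = f"From: {sender}\nTo: {recipient}\nDate: {date}\n\n"
    return header + body

text = chunk_text_for_embedding(
    "andy.fastow@example.com", "jeff.skilling@example.com", "2001-03-14",
    "The Raptor structure needs the equity infusion before Q2 close.",
)
print(text.splitlines()[0])  # → From: andy.fastow@example.com
```

With the header embedded into the chunk, a query like "emails from Fastow about Raptor" can match on both the sender and the subject matter, not just the body text.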