Pattern: Asymmetric Embeddings for Legal Search

Context

The documentsearch module uses vector embeddings to enable natural language search. The embedding model converts text into vectors (lists of numbers) that encode semantic meaning. Similar concepts produce vectors that are close together in high-dimensional space.

Asymmetric embedding is the technique of using DIFFERENT encoding modes for documents vs queries. This is critical for retrieval quality in legal search.

The Problem with Symmetric Embeddings

A symmetric model treats documents and queries identically:

# Symmetric approach (wrong for search)
doc_vector = embed(
    "The Raptor structure needs the equity infusion before Q2 close. "
    "Andy and I discussed the timeline yesterday. We agreed to proceed "
    "with the transfer without flagging it to the audit committee."
)

query_vector = embed("internal discussions about Special Purpose Entities")

These texts share zero vocabulary. A symmetric model trained to make similar texts close together has no reason to understand that a 12-word question should map to the same region of vector space as a 40-word email. They are different lengths, different structures, different purposes. The result: moderate similarity scores at best, missed relevant documents at worst.
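"Close together" here usually means cosine similarity. A toy sketch with 3-dimensional vectors (real embeddings have 1024 dimensions; the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
doc = [0.9, 0.1, 0.2]
close_query = [0.8, 0.2, 0.1]  # similar direction -> high similarity
far_query = [0.1, 0.9, 0.1]    # different direction -> low similarity

assert cosine_similarity(doc, close_query) > cosine_similarity(doc, far_query)
```

A symmetric model only places `doc` and `close_query` near each other when their surface forms overlap; the asymmetric training described next fixes that.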

The Asymmetric Solution

Asymmetric embedding trains the model with two modes:

  • Document mode (input_type="document"): Encode the full semantic content of the passage. Optimized for being FOUND by queries.
  • Query mode (input_type="query"): Encode the INTENT behind the question. Optimized for FINDING relevant documents.

# Asymmetric approach (correct for search)
doc_vector = voyage.embed(
    "The Raptor structure needs the equity infusion before Q2 close...",
    input_type="document"      # "I am evidence. Encode my full meaning."
)

query_vector = voyage.embed(
    "internal discussions about Special Purpose Entities",
    input_type="query"         # "I am a question. Encode what I'm looking for."
)

A short query about SPEs lands in the same vector space region as long emails discussing Raptor, JEDI, and off-balance-sheet structures -- even though they share no keywords. The model learned this mapping from millions of (query, relevant_document) training pairs.

How Training Works

The model is trained on triplets:

Query:              "attorney-client privilege discussions about merger risk"

Positive document:  "I'd suggest we get Sarah's input on the antitrust exposure
                     before responding to their due diligence request. She flagged
                     some concerns last week that we should address internally first."

Negative document:  "Please see attached the attorney fee schedule for Q3. Legal
                     department costs are tracked in the shared budget spreadsheet."

Training signal:    query should be CLOSE to positive, FAR from negative

After millions of legal-domain triplets, the model learns:

  • "attorney-client privilege" as a QUERY means "legal advice being sought or given" -- NOT "documents containing the word attorney".
  • The positive document IS privilege (Sarah = in-house counsel, "address internally first" = legal strategy) -- no keyword overlap with the query.
  • The negative document contains "attorney" and "legal" but is NOT privilege.

This is why voyage-law-2 outperforms general-purpose models: it was trained on legal (query, document) pairs where the semantic relationships are domain-specific.
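The training signal above is typically a margin-based triplet objective: the loss is zero only when the query is closer to the positive than to the negative by at least a margin. A toy numeric sketch of that objective (not the actual voyage-law-2 training code; the margin value is an assumption):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def triplet_loss(query, positive, negative, margin=0.2):
    """Zero when the positive beats the negative by at least `margin`."""
    return max(0.0, margin - (cosine(query, positive) - cosine(query, negative)))

# Toy 3-d vectors: the positive already points roughly where the query points,
# so the loss is zero; training drives all triplets toward this state.
q, pos, neg = [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2]
assert triplet_loss(q, pos, neg) == 0.0
```

Gradient descent on this loss is what moves "attorney-client privilege" queries toward privileged passages and away from fee schedules.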

What Happens with the Wrong input_type

# WRONG: embedding a query as a document
query_vector = voyage.embed("discussions about SPEs", input_type="document")

The model interprets this as a three-word document fragment, not a search intent. It encodes the literal content rather than the concept "find me emails where executives discussed SPE structures by any name." Result: lower recall. Documents containing "SPE" are found, but the Raptor and JEDI emails are missed because the model did not activate the search-intent pathway.

Impact: asymmetric queries retrieve 15-30% more relevant documents than symmetric queries on legal benchmarks.
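The lift is measured as recall@k: the fraction of known-relevant documents that appear in the top-k results. A sketch of the metric with invented document IDs and result lists (not actual benchmark data):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top-k results."""
    found = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return found / len(relevant)

relevant = {"raptor_email", "jedi_memo", "spe_spreadsheet"}

# Hypothetical top-5 lists for the same query under each mode:
symmetric_top5 = ["spe_spreadsheet", "fee_schedule", "budget", "raptor_email", "newsletter"]
asymmetric_top5 = ["raptor_email", "spe_spreadsheet", "jedi_memo", "budget", "newsletter"]

assert recall_at_k(asymmetric_top5, relevant, 5) > recall_at_k(symmetric_top5, relevant, 5)
```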

Implementation in documentsearch

When to Use Each Mode

Operation          | input_type | When                             | Why
Document embedding | "document" | Ingest time (embedding pipeline) | Encodes what the document IS ABOUT
Query embedding    | "query"    | Search time (per request)        | Encodes what the searcher WANTS TO FIND
Backfill embedding | "document" | Backfill pipeline                | Same as ingest -- encoding existing documents
Re-embedding       | "document" | Model upgrade                    | Same as ingest -- re-encoding with new model

Hexagonal Boundary

The input_type distinction is an infrastructure concern, not business logic:

  • shell/embedding/voyage_client.py — sets input_type based on which method is called (embed_documents vs embed_query)
  • core/ — never knows about input_type. Receives vectors as parameters. The chunking logic in core/chunking/ produces text; the embedding call in shell/ converts it to vectors.

# shell/embedding/voyage_client.py

class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128
    MAX_TOKENS_PER_BATCH = 120_000

    def __init__(self):
        self._client = voyageai.Client(api_key=Config.VOYAGE_API_KEY)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """
        Embed document chunks for storage. Called at ingest time.
        input_type="document" tells the model these are evidence passages.
        Auto-batches to stay within Voyage API limits.
        """
        all_embeddings = []
        for batch in self._batch_texts(texts):
            result = self._client.embed(
                texts=batch,
                model=self.MODEL,
                input_type="document",
                truncation=True
            )
            all_embeddings.extend(result.embeddings)
        return all_embeddings

    def embed_query(self, query: str) -> list[float]:
        """
        Embed a single search query. Called at search time.
        input_type="query" tells the model this is search intent.
        ~50ms latency for a single query.
        """
        result = self._client.embed(
            texts=[query],
            model=self.MODEL,
            input_type="query",
            truncation=True
        )
        return result.embeddings[0]

    def _batch_texts(self, texts: list[str]) -> list[list[str]]:
        """
        Split texts into batches respecting API limits:
        - Max 128 texts per request
        - Max 120,000 tokens per request
        Estimates tokens as len(text) / 4 (conservative for English).
        """
        batches = []
        current_batch = []
        current_tokens = 0

        for text in texts:
            estimated_tokens = len(text) // 4
            would_exceed_count = len(current_batch) >= self.MAX_BATCH_SIZE
            would_exceed_tokens = (current_tokens + estimated_tokens) > self.MAX_TOKENS_PER_BATCH

            if current_batch and (would_exceed_count or would_exceed_tokens):
                batches.append(current_batch)
                current_batch = []
                current_tokens = 0

            current_batch.append(text)
            current_tokens += estimated_tokens

        if current_batch:
            batches.append(current_batch)

        return batches
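One property of the batching rule worth noting: under the `len(text) // 4` token estimate and a ~512-token chunk target (assuming ~4 characters per token), a full 128-text batch stays well under the token cap, so in practice the count limit is what triggers a split:

```python
# Which batching limit binds? Quick arithmetic check using the same
# constants as VoyageClient and the ~512-token chunk target.
MAX_BATCH_SIZE = 128
MAX_TOKENS_PER_BATCH = 120_000

chars_per_chunk = 2048                                   # ~512 tokens x 4 chars
estimated_tokens_per_chunk = chars_per_chunk // 4        # 512
tokens_in_full_batch = MAX_BATCH_SIZE * estimated_tokens_per_chunk  # 65,536

# A full 128-chunk batch is only ~65K estimated tokens, far below the
# 120K token cap -- the count limit is the binding constraint.
assert tokens_in_full_batch < MAX_TOKENS_PER_BATCH
```

The token cap only binds for unusually long texts, such as full-document embedding without chunking.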

Handler Bridge (How Core Receives Vectors)

# handlers/embedding_index.py (ingest)

def _process_document(message, session):
    # core/ produces text chunks (no AWS knowledge)
    chunks = chunk_document(text, metadata)

    # shell/ converts to vectors (knows about Voyage AI)
    vectors = voyage.embed_documents([c.full_text for c in chunks])

    # shell/ stores vectors (knows about vector store)
    vector_store.upsert_vectors(case_id, zip(chunks, vectors))

# handlers/search_index.py (query)

def _handle_search(request):
    # shell/ embeds query (input_type="query")
    query_vector = voyage.embed_query(request.query)

    # shell/ runs search legs
    vector_results = vector_store.search(case_id, query_vector, k=100)
    bm25_results = es_ops.keyword_search(case_id, request.query, size=100)

    # core/ fuses results (pure logic, no AWS)
    ranked = rrf(vector_results, bm25_results, k=60)
    return ranked
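The `rrf` call is reciprocal rank fusion: each document's fused score is the sum of 1/(k + rank) across the result lists it appears in. A minimal sketch, assuming each leg returns an ordered list of document IDs, best first (the real implementation in core/ may differ in shape):

```python
def rrf(*result_lists: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_leg = ["raptor_email", "jedi_memo", "budget"]
bm25_leg = ["spe_spreadsheet", "raptor_email", "fee_schedule"]

# "raptor_email" ranks high in both legs, so it fuses to the top.
assert rrf(vector_leg, bm25_leg)[0] == "raptor_email"
```

The constant k=60 dampens the difference between adjacent ranks, so a document that is merely present in both legs beats one that tops a single leg.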

Why the Checkpoint Matters

Voyage API calls are the most expensive step (cost and latency). The checkpoint at EMBEDDINGS_GENERATED prevents re-calling Voyage on retries:

EMBEDDING_STARTED     ✓
TEXT_FETCHED          ✓  (S3: 50ms, ~$0)
CHUNKS_CREATED        ✓  (chunking: 10ms, $0)
EMBEDDINGS_GENERATED  ✓  (Voyage API: 400ms, $0.003)  <-- expensive
VECTORS_STORED        ✗  (Lambda timeout before write completes)

On retry: resumes from VECTORS_STORED. Voyage API NOT re-called.

With 500K documents and 5% retry rate, this saves 25K unnecessary API calls.
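The resume behavior is a gate over recorded checkpoints: a step runs only if its checkpoint is absent. A minimal sketch (the in-memory set stands in for whatever durable store the pipeline actually uses, e.g. a database row):

```python
def run_step(checkpoints: set[str], name: str, step) -> None:
    """Run `step` only if its checkpoint has not already been recorded."""
    if name in checkpoints:
        return  # completed on a previous attempt; skip the expensive call
    step()
    checkpoints.add(name)  # real code persists this, not just in memory

calls = []
# State after the timed-out first attempt: embeddings done, storage not.
done = {"TEXT_FETCHED", "CHUNKS_CREATED", "EMBEDDINGS_GENERATED"}

run_step(done, "EMBEDDINGS_GENERATED", lambda: calls.append("voyage"))
run_step(done, "VECTORS_STORED", lambda: calls.append("upsert"))

assert calls == ["upsert"]  # the retry never re-called Voyage
```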

Voyage AI on AWS

Deployment Options

voyage-law-2 Is NOT on Amazon Bedrock

Bedrock's native embedding models are Amazon Titan, Cohere Embed, and Amazon Nova. Voyage AI is not a Bedrock model provider. This means there is no "just call Bedrock" option for legal-optimized embeddings.

Three paths, in order of recommendation:

Option 1: Voyage AI Direct API (Recommended)

Call the Voyage AI hosted API from Lambda. Simplest possible setup.

Lambda (VPC) -> NAT Gateway -> https://api.voyageai.com/v1/embeddings
  • Pros: Zero infra to manage. Latest model always available. SOC 2 compliant. No GPU instances, no SageMaker quotas, no endpoint management.
  • Cons: Data leaves your VPC (document text sent to Voyage AI for embedding). External dependency. ~50ms network latency per call.
  • Cost: $0.12 per million tokens (voyage-law-2).
  • Best for: Prototype AND production unless compliance requires data-in-VPC.

This is the same pattern as calling any external API (Stripe, Twilio, etc.). Most SaaS e-discovery platforms already send data to external services. Voyage AI's API is SOC 2 compliant. Unless your compliance team specifically requires that document text never leave the VPC, this is the right choice.

Option 2: AWS Marketplace SageMaker Endpoint (If Data Residency Required)

Deploy voyage-law-2 as a SageMaker real-time inference endpoint within your VPC. The model runs on your GPU instances.

Lambda (VPC) -> SageMaker Endpoint (VPC) -> voyage-law-2 on ml.g6.xlarge
  • voyage-law-2 AWS Marketplace listing: https://aws.amazon.com/marketplace/pp/prodview-bknagyko2vl7a
  • Deployable via CloudFormation, SageMaker Console, AWS CLI, or CDK
  • Official deployment notebook: github.com/voyage-ai/voyageai-aws

  • Pros: Data stays in your VPC. No external API dependency. Lower latency (~20ms in-VPC). Inherits AWS compliance (SOC 2, HIPAA-eligible).

  • Cons: GPU instance cost (ml.g6.xlarge ~$1.01/hr = ~$737/mo always-on). Instance quota increase required (default = 0). Model updates require redeployment. Operational overhead (endpoint monitoring, scaling).
  • Cost: $0.22 per million tokens + $737/mo instance cost. Throughput: 12.6M tokens/hr.
  • Best for: Production ONLY if compliance requires data never leaving VPC.

Option 3: Bedrock with Alternative Model (If Vendor Lock-In to AWS Required)

If the organization mandates Bedrock-only services, use a Bedrock-native embedding model instead of Voyage AI:

Bedrock Model           | Dimensions   | Legal Quality                    | Cost/M tokens
Amazon Titan Text V2    | 256/512/1024 | General-purpose, NOT legal-tuned | $0.02
Cohere Embed English v3 | 1024         | Good general, NOT legal-specific | $0.10
Amazon Nova Embed       | 1024         | Newest general-purpose           | Check pricing

Lambda (VPC) -> Bedrock InvokeModel -> amazon.titan-embed-text-v2:0
  • Pros: Fully managed. No endpoints. No external API. Cheapest (Titan: $0.02/M). Native AWS integration.
  • Cons: 5-15% lower retrieval quality on legal text vs voyage-law-2. Not trained on legal corpus. May miss the "stay the course" email that voyage-law-2 finds. Risks the prototype not producing the "wow" moment.
  • Best for: Cost optimization AFTER proving value with voyage-law-2. Or if AWS-only is a hard organizational constraint.

Recommendation

Phase                                | Method               | Why
Prototype                            | Voyage AI direct API | Fastest setup. Best retrieval quality. ~$60 per case.
Production                           | Voyage AI direct API | Still simplest. SOC 2 compliant. No infra to manage.
Production (if data-in-VPC required) | SageMaker endpoint   | Data stays in VPC. +$737/mo.
Cost optimization (later)            | Evaluate Titan V2    | If Titan retrieval quality is acceptable, 6x cheaper.

Start with the best model (voyage-law-2 via direct API). Prove the value. Then optimize cost. Starting with a weaker model risks the prototype not producing the "wow" moment with attorneys.

The VoyageClient abstraction in shell/embedding/ handles all three targets. Switching between direct API, SageMaker, or Bedrock is a config change (EMBEDDING_PROVIDER=voyage_api|sagemaker|bedrock), not a code change.

Embedding Provider Abstraction

# shell/embedding/provider.py

from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...

class VoyageDirectClient(EmbeddingProvider):
    """Calls Voyage AI hosted API. Prototype and production default."""
    # input_type="document" / "query" handled internally

class VoyageSageMakerClient(EmbeddingProvider):
    """Calls SageMaker endpoint running voyage-law-2 in VPC."""
    # Same input_type semantics, different transport

class BedrockEmbeddingClient(EmbeddingProvider):
    """Calls Bedrock InvokeModel for Titan/Cohere/Nova."""
    # Titan is symmetric (no input_type); Cohere Embed v3 has its own
    # search_document / search_query input types
    # Lower retrieval quality on legal text

The handler doesn't know which provider is active:

# handlers/embedding_index.py
from shell.embedding.provider import get_embedding_provider

provider = get_embedding_provider()  # reads EMBEDDING_PROVIDER config
vectors = provider.embed_documents(chunk_texts)
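A sketch of what a `get_embedding_provider` factory could look like. The registry and the `FakeProvider` stand-in are assumptions for illustration, not the real shell/embedding code:

```python
import os
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...

class FakeProvider(EmbeddingProvider):
    """Stand-in for the real clients so this sketch is runnable."""
    def embed_documents(self, texts): return [[0.0] for _ in texts]
    def embed_query(self, query): return [0.0]

_PROVIDERS: dict[str, type] = {
    "voyage_api": FakeProvider,  # real code: VoyageDirectClient
    "sagemaker": FakeProvider,   # real code: VoyageSageMakerClient
    "bedrock": FakeProvider,     # real code: BedrockEmbeddingClient
}

def get_embedding_provider() -> EmbeddingProvider:
    """Resolve the active provider from config; fail fast on typos."""
    name = os.environ.get("EMBEDDING_PROVIDER", "voyage_api")
    if name not in _PROVIDERS:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {name!r}")
    return _PROVIDERS[name]()
```

Failing fast on an unknown value keeps a misconfigured deploy from silently falling back to the wrong embedding space.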

CDK for SageMaker Endpoint (If Needed)

// lib/common-resources-stack.ts (only if SageMaker path chosen)

import { SageMakerEndpoint } from '@cdklabs/generative-ai-cdk-constructs';

const voyageEndpoint = new SageMakerEndpoint(this, 'VoyageLaw2Endpoint', {
  modelId: 'voyage-law-2',
  instanceType: 'ml.g6.xlarge',
  instanceCount: 1,
  vpcConfig: {
    vpc: props.vpc,
    subnets: props.privateSubnets,
    securityGroups: [props.lambdaSecurityGroup]
  }
});

voyageEndpoint.grantInvoke(embeddingLambda);

voyage-law-2 vs Newer Models

MongoDB acquired Voyage AI in early 2025. Newer models are available:

Model          | Dims | Context | Legal Benchmark                    | Status
voyage-law-2   | 1024 | 16K     | Legal-optimized, MTEB #1 for legal | AWS Marketplace, direct API
voyage-3-large | 1024 | 32K     | General-purpose, strong on legal   | AWS Marketplace, direct API
voyage-3.5     | 1024 | 32K     | +8.26% vs OpenAI on average        | AWS Marketplace, direct API
voyage-4-large | 1024 | 32K     | MoE, shared embedding space        | Check availability

Note on Voyage 4 family: Voyage 4 models share a common embedding space. You can embed documents with voyage-4-large (highest accuracy) and query with voyage-4-lite (fastest) without re-indexing. This would eliminate the trade-off between ingest accuracy and query latency. Worth evaluating when available.

The embedding_model column on search_chunks and the EmbeddingProvider abstraction make model upgrades a re-embedding operation, not a code change. The architecture is designed for this evolution.

Key Numbers

Metric                     | Value                               | Impact
Embedding dimensions       | 1024                                | ~4.6 KB/vector with HNSW overhead
Document embedding latency | 100-500ms per batch of 128          | Ingest: ~256-1280 chunks/second per Lambda
Query embedding latency    | ~50ms (single query)                | Adds 50ms to every search request
Asymmetric retrieval lift  | 15-30% more relevant docs           | Justification for input_type discipline
Cost (direct API)          | $0.12 per million tokens            | 500K docs x 5 chunks x 200 tokens = ~$60/case
Cost (SageMaker)           | $0.22 per million tokens + instance | Higher per-token but data stays in VPC
SageMaker throughput       | 12.6M tokens/hr on ml.g6.xlarge     | Backfill 500M tokens in ~40 hours on one endpoint
Max tokens per text        | 16,000 (voyage-law-2)               | 512-token chunk target is well within limit
Max batch size             | 128 texts per request               | Auto-batching handles transparently
Query cost at scale        | ~$0.012/day at 1000 searches/day    | Negligible
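The per-case cost row works out as follows; the document count, chunks per document, and tokens per chunk are the table's own planning assumptions:

```python
# Per-case embedding cost at the voyage-law-2 direct-API price.
docs = 500_000
chunks_per_doc = 5
tokens_per_chunk = 200
price_per_million_tokens = 0.12  # $/M tokens, direct API

total_tokens = docs * chunks_per_doc * tokens_per_chunk  # 500M tokens
cost = total_tokens / 1_000_000 * price_per_million_tokens

assert total_tokens == 500_000_000
assert abs(cost - 60.0) < 1e-6  # ~$60 per case, matching the table
```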

Anti-Patterns

  • Using input_type="document" for queries: 15-30% recall loss.
  • Using input_type="query" for documents: Encodes intent, not content. Vectors are less informative for similarity comparison.
  • Omitting input_type entirely: Falls back to symmetric mode. Loses the asymmetric training benefit.
  • Mixing models across document/query: Document embedded with voyage-law-2 and queried with voyage-3-large will produce poor results -- vectors are in different spaces. The embedding_model column prevents this.
  • Embedding metadata separately from content: Prepend metadata to chunk text BEFORE embedding so the vector captures both WHO and WHAT. Don't store metadata only as structured fields.
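The last anti-pattern's fix can be sketched as follows. The metadata field names are hypothetical; the point is that the header travels with the chunk text into the embedding call:

```python
def chunk_text_for_embedding(chunk_body: str, metadata: dict[str, str]) -> str:
    """Prepend WHO/WHEN metadata so the vector captures sender, date, and
    subject alongside the body. Structured fields are still stored separately
    for filtering; this only changes what gets embedded."""
    header = " | ".join(
        f"{key}: {metadata[key]}"
        for key in ("from", "to", "date", "subject")
        if key in metadata
    )
    return f"{header}\n\n{chunk_body}" if header else chunk_body

text = chunk_text_for_embedding(
    "The Raptor structure needs the equity infusion before Q2 close.",
    {"from": "afastow", "date": "2001-03-14", "subject": "Raptor timeline"},
)
assert text.startswith("from: afastow | date: 2001-03-14")
```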