Pattern: Asymmetric Embeddings for Legal Search

Context

The documentsearch module uses vector embeddings to enable natural language search. The embedding model converts text into vectors (lists of numbers) that encode semantic meaning. Similar concepts produce vectors that are close together in high-dimensional space.

Asymmetric embedding is the technique of using DIFFERENT encoding modes for documents vs queries. This is critical for retrieval quality in legal search.

The Problem with Symmetric Embeddings

A symmetric model treats documents and queries identically:

# Symmetric approach (wrong for search)
doc_vector = embed(
    "The Raptor structure needs the equity infusion before Q2 close. "
    "Andy and I discussed the timeline yesterday. We agreed to proceed "
    "with the transfer without flagging it to the audit committee."
)

query_vector = embed("internal discussions about Special Purpose Entities")

These texts share zero vocabulary. A symmetric model trained to make similar texts close together has no reason to understand that a 12-word question should map to the same region of vector space as a 40-word email. They are different lengths, different structures, different purposes. The result: moderate similarity scores at best, missed relevant documents at worst.
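"Close together" here usually means cosine similarity. A toy sketch with 3-dimensional vectors (real embeddings have 1024 dimensions; the vectors below are invented for illustration):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embeddings:
doc = [0.9, 0.1, 0.2]
close_query = [0.8, 0.2, 0.1]  # similar direction -> high similarity
far_query = [0.1, 0.9, 0.1]    # different direction -> low similarity

assert cosine_similarity(doc, close_query) > cosine_similarity(doc, far_query)
```

A symmetric model only places `doc` and `close_query` near each other when their surface forms overlap; the asymmetric training described next fixes that.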

The Asymmetric Solution

Asymmetric embedding trains the model with two modes:

  • Document mode (input_type="document"): Encode the full semantic content of the passage. Optimized for being FOUND by queries.
  • Query mode (input_type="query"): Encode the INTENT behind the question. Optimized for FINDING relevant documents.

# Asymmetric approach (correct for search)
doc_vector = voyage.embed(
    "The Raptor structure needs the equity infusion before Q2 close...",
    input_type="document"      # "I am evidence. Encode my full meaning."
)

query_vector = voyage.embed(
    "internal discussions about Special Purpose Entities",
    input_type="query"         # "I am a question. Encode what I'm looking for."
)

A short query about SPEs lands in the same vector space region as long emails discussing Raptor, JEDI, and off-balance-sheet structures -- even though they share no keywords. The model learned this mapping from millions of (query, relevant_document) training pairs.

How Training Works

The model is trained on triplets:

Query:              "attorney-client privilege discussions about merger risk"

Positive document:  "I'd suggest we get Sarah's input on the antitrust exposure
                     before responding to their due diligence request. She flagged
                     some concerns last week that we should address internally first."

Negative document:  "Please see attached the attorney fee schedule for Q3. Legal
                     department costs are tracked in the shared budget spreadsheet."

Training signal:    query should be CLOSE to positive, FAR from negative

After millions of legal-domain triplets, the model learns:

  • "attorney-client privilege" as a QUERY means "legal advice being sought or given" -- NOT "documents containing the word attorney".
  • The positive document IS privilege (Sarah = in-house counsel, "address internally first" = legal strategy) -- no keyword overlap with the query.
  • The negative document contains "attorney" and "legal" but is NOT privilege.

This is why voyage-law-2 outperforms general-purpose models: it was trained on legal (query, document) pairs where the semantic relationships are domain-specific.
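The training signal above is typically a margin-based triplet objective: the loss is zero only when the query is closer to the positive than to the negative by at least a margin. A toy numeric sketch of that objective (not the actual voyage-law-2 training code; the margin value is an assumption):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def triplet_loss(query, positive, negative, margin=0.2):
    """Zero when the positive beats the negative by at least `margin`."""
    return max(0.0, margin - (cosine(query, positive) - cosine(query, negative)))

# Toy 3-d vectors: the positive already points roughly where the query points,
# so the loss is zero; training drives all triplets toward this state.
q, pos, neg = [1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2]
assert triplet_loss(q, pos, neg) == 0.0
```

Gradient descent on this loss is what moves "attorney-client privilege" queries toward privileged passages and away from fee schedules.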

What Happens with the Wrong input_type

# WRONG: embedding a query as a document
query_vector = voyage.embed("discussions about SPEs", input_type="document")

The model interprets this as a three-word document fragment, not a search intent. It encodes the literal content rather than the concept "find me emails where executives discussed SPE structures by any name." Result: lower recall. Documents containing "SPE" are found, but the Raptor and JEDI emails are missed because the model did not activate the search-intent pathway.

Impact: asymmetric queries retrieve 15-30% more relevant documents than symmetric queries on legal benchmarks.
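The lift is measured as recall@k: the fraction of known-relevant documents that appear in the top-k results. A sketch of the metric with invented document IDs and result lists (not actual benchmark data):

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant set found in the top-k results."""
    found = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return found / len(relevant)

relevant = {"raptor_email", "jedi_memo", "spe_spreadsheet"}

# Hypothetical top-5 lists for the same query under each mode:
symmetric_top5 = ["spe_spreadsheet", "fee_schedule", "budget", "raptor_email", "newsletter"]
asymmetric_top5 = ["raptor_email", "spe_spreadsheet", "jedi_memo", "budget", "newsletter"]

assert recall_at_k(asymmetric_top5, relevant, 5) > recall_at_k(symmetric_top5, relevant, 5)
```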

Implementation in documentsearch

When to Use Each Mode

Operation          | input_type | When                             | Why
Document embedding | "document" | Ingest time (embedding pipeline) | Encodes what the document IS ABOUT
Query embedding    | "query"    | Search time (per request)        | Encodes what the searcher WANTS TO FIND
Backfill embedding | "document" | Backfill pipeline                | Same as ingest -- encoding existing documents
Re-embedding       | "document" | Model upgrade                    | Same as ingest -- re-encoding with new model

Hexagonal Boundary

The input_type distinction is an infrastructure concern, not business logic:

  • shell/embedding/voyage_client.py — sets input_type based on which method is called (embed_documents vs embed_query)
  • core/ — never knows about input_type. Receives vectors as parameters. The chunking logic in core/chunking/ produces text; the embedding call in shell/ converts it to vectors.

# shell/embedding/voyage_client.py

class VoyageClient:
    MODEL = "voyage-law-2"
    DIMENSIONS = 1024
    MAX_BATCH_SIZE = 128
    MAX_TOKENS_PER_BATCH = 120_000

    def __init__(self):
        self._client = voyageai.Client(api_key=Config.VOYAGE_API_KEY)

    def embed_documents(self, texts: list[str]) -> list[list[float]]:
        """
        Embed document chunks for storage. Called at ingest time.
        input_type="document" tells the model these are evidence passages.
        Auto-batches to stay within Voyage API limits.
        """
        all_embeddings = []
        for batch in self._batch_texts(texts):
            result = self._client.embed(
                texts=batch,
                model=self.MODEL,
                input_type="document",
                truncation=True
            )
            all_embeddings.extend(result.embeddings)
        return all_embeddings

    def embed_query(self, query: str) -> list[float]:
        """
        Embed a single search query. Called at search time.
        input_type="query" tells the model this is search intent.
        ~50ms latency for a single query.
        """
        result = self._client.embed(
            texts=[query],
            model=self.MODEL,
            input_type="query",
            truncation=True
        )
        return result.embeddings[0]

    def _batch_texts(self, texts: list[str]) -> list[list[str]]:
        """
        Split texts into batches respecting API limits:
        - Max 128 texts per request
        - Max 120,000 tokens per request
        Estimates tokens as len(text) / 4 (conservative for English).
        """
        batches = []
        current_batch = []
        current_tokens = 0

        for text in texts:
            estimated_tokens = len(text) // 4
            would_exceed_count = len(current_batch) >= self.MAX_BATCH_SIZE
            would_exceed_tokens = (current_tokens + estimated_tokens) > self.MAX_TOKENS_PER_BATCH

            if current_batch and (would_exceed_count or would_exceed_tokens):
                batches.append(current_batch)
                current_batch = []
                current_tokens = 0

            current_batch.append(text)
            current_tokens += estimated_tokens

        if current_batch:
            batches.append(current_batch)

        return batches
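One property of the batching rule worth noting: under the `len(text) // 4` token estimate and a ~512-token chunk target (assuming ~4 characters per token), a full 128-text batch stays well under the token cap, so in practice the count limit is what triggers a split:

```python
# Which batching limit binds? Quick arithmetic check using the same
# constants as VoyageClient and the ~512-token chunk target.
MAX_BATCH_SIZE = 128
MAX_TOKENS_PER_BATCH = 120_000

chars_per_chunk = 2048                                   # ~512 tokens x 4 chars
estimated_tokens_per_chunk = chars_per_chunk // 4        # 512
tokens_in_full_batch = MAX_BATCH_SIZE * estimated_tokens_per_chunk  # 65,536

# A full 128-chunk batch is only ~65K estimated tokens, far below the
# 120K token cap -- the count limit is the binding constraint.
assert tokens_in_full_batch < MAX_TOKENS_PER_BATCH
```

The token cap only binds for unusually long texts, such as full-document embedding without chunking.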

Handler Bridge (How Core Receives Vectors)

# handlers/embedding_index.py (ingest)

def _process_document(message, session):
    # core/ produces text chunks (no AWS knowledge)
    chunks = chunk_document(text, metadata)

    # shell/ converts to vectors (knows about Voyage AI)
    vectors = voyage.embed_documents([c.full_text for c in chunks])

    # shell/ stores vectors (knows about vector store)
    vector_store.upsert_vectors(case_id, zip(chunks, vectors))

# handlers/search_index.py (query)

def _handle_search(request):
    # shell/ embeds query (input_type="query")
    query_vector = voyage.embed_query(request.query)

    # shell/ runs search legs
    vector_results = vector_store.search(case_id, query_vector, k=100)
    bm25_results = es_ops.keyword_search(case_id, request.query, size=100)

    # core/ fuses results (pure logic, no AWS)
    ranked = rrf(vector_results, bm25_results, k=60)
    return ranked
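The `rrf` call is reciprocal rank fusion: each document's fused score is the sum of 1/(k + rank) across the result lists it appears in. A minimal sketch, assuming each leg returns an ordered list of document IDs, best first (the real implementation in core/ may differ in shape):

```python
def rrf(*result_lists: list[str], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_leg = ["raptor_email", "jedi_memo", "budget"]
bm25_leg = ["spe_spreadsheet", "raptor_email", "fee_schedule"]

# "raptor_email" ranks high in both legs, so it fuses to the top.
assert rrf(vector_leg, bm25_leg)[0] == "raptor_email"
```

The constant k=60 dampens the difference between adjacent ranks, so a document that is merely present in both legs beats one that tops a single leg.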

Why the Checkpoint Matters

Voyage API calls are the most expensive step (cost and latency). The checkpoint at EMBEDDINGS_GENERATED prevents re-calling Voyage on retries:

EMBEDDING_STARTED     ✓
TEXT_FETCHED          ✓  (S3: 50ms, ~$0)
CHUNKS_CREATED        ✓  (chunking: 10ms, $0)
EMBEDDINGS_GENERATED  ✓  (Voyage API: 400ms, $0.003)  <-- expensive
VECTORS_STORED        ✗  (Lambda timeout before write completes)

On retry: resumes from VECTORS_STORED. Voyage API NOT re-called.

With 500K documents and 5% retry rate, this saves 25K unnecessary API calls.
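The resume behavior is a gate over recorded checkpoints: a step runs only if its checkpoint is absent. A minimal sketch (the in-memory set stands in for whatever durable store the pipeline actually uses, e.g. a database row):

```python
def run_step(checkpoints: set[str], name: str, step) -> None:
    """Run `step` only if its checkpoint has not already been recorded."""
    if name in checkpoints:
        return  # completed on a previous attempt; skip the expensive call
    step()
    checkpoints.add(name)  # real code persists this, not just in memory

calls = []
# State after the timed-out first attempt: embeddings done, storage not.
done = {"TEXT_FETCHED", "CHUNKS_CREATED", "EMBEDDINGS_GENERATED"}

run_step(done, "EMBEDDINGS_GENERATED", lambda: calls.append("voyage"))
run_step(done, "VECTORS_STORED", lambda: calls.append("upsert"))

assert calls == ["upsert"]  # the retry never re-called Voyage
```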

Voyage AI on AWS

Deployment Options

voyage-law-2 Is NOT on Amazon Bedrock

Bedrock's native embedding models are Amazon Titan, Cohere Embed, and Amazon Nova. Voyage AI is not a Bedrock model provider. This means there is no "just call Bedrock" option for legal-optimized embeddings.

Three paths, in order of recommendation:

Option 1: Voyage AI Direct API (Recommended)

Call the Voyage AI hosted API from Lambda. Simplest possible setup.

Lambda (VPC) -> NAT Gateway -> https://api.voyageai.com/v1/embeddings
  • Pros: Zero infra to manage. Latest model always available. SOC 2 compliant. No GPU instances, no SageMaker quotas, no endpoint management.
  • Cons: Data leaves your VPC (document text sent to Voyage AI for embedding). External dependency. ~50ms network latency per call.
  • Cost: $0.12 per million tokens (voyage-law-2).
  • Best for: Prototype AND production unless compliance requires data-in-VPC.

This is the same pattern as calling any external API (Stripe, Twilio, etc.). Most SaaS e-discovery platforms already send data to external services. Voyage AI's API is SOC 2 compliant. Unless your compliance team specifically requires that document text never leave the VPC, this is the right choice.

Option 2: AWS Marketplace SageMaker Endpoint (If Data Residency Required)

Deploy voyage-law-2 as a SageMaker real-time inference endpoint within your VPC. The model runs on your GPU instances.

Lambda (VPC) -> SageMaker Endpoint (VPC) -> voyage-law-2 on ml.g6.xlarge
  • voyage-law-2 AWS Marketplace listing: https://aws.amazon.com/marketplace/pp/prodview-bknagyko2vl7a
  • Deployable via CloudFormation, SageMaker Console, AWS CLI, or CDK
  • Official deployment notebook: github.com/voyage-ai/voyageai-aws

  • Pros: Data stays in your VPC. No external API dependency. Lower latency (~20ms in-VPC). Inherits AWS compliance (SOC 2, HIPAA-eligible).

  • Cons: GPU instance cost (ml.g6.xlarge ~$1.01/hr = ~$737/mo always-on). Instance quota increase required (default = 0). Model updates require redeployment. Operational overhead (endpoint monitoring, scaling).
  • Cost: $0.22 per million tokens + $737/mo instance cost. Throughput: 12.6M tokens/hr.
  • Best for: Production ONLY if compliance requires data never leaving VPC.

Option 3: Bedrock with Alternative Model (If Vendor Lock-In to AWS Required)

If the organization mandates Bedrock-only services, use a Bedrock-native embedding model instead of Voyage AI:

Bedrock Model           | Dimensions   | Legal Quality                    | Cost/M tokens
Amazon Titan Text V2    | 256/512/1024 | General-purpose, NOT legal-tuned | $0.02
Cohere Embed English v3 | 1024         | Good general, NOT legal-specific | $0.10
Amazon Nova Embed       | 1024         | Newest general-purpose           | Check pricing

Lambda (VPC) -> Bedrock InvokeModel -> amazon.titan-embed-text-v2:0
  • Pros: Fully managed. No endpoints. No external API. Cheapest (Titan: $0.02/M). Native AWS integration.
  • Cons: 5-15% lower retrieval quality on legal text vs voyage-law-2. Not trained on legal corpus. May miss the "stay the course" email that voyage-law-2 finds. Risks the prototype not producing the "wow" moment.
  • Best for: Cost optimization AFTER proving value with voyage-law-2. Or if AWS-only is a hard organizational constraint.

Recommendation

Phase                                | Method               | Why
Prototype                            | Voyage AI direct API | Fastest setup. Best retrieval quality. ~$60 per case.
Production                           | Voyage AI direct API | Still simplest. SOC 2 compliant. No infra to manage.
Production (if data-in-VPC required) | SageMaker endpoint   | Data stays in VPC. +$737/mo.
Cost optimization (later)            | Evaluate Titan V2    | If Titan retrieval quality is acceptable, 6x cheaper.

Start with the best model (voyage-law-2 via direct API). Prove the value. Then optimize cost. Starting with a weaker model risks the prototype not producing the "wow" moment with attorneys.

The VoyageClient abstraction in shell/embedding/ handles all three targets. Switching between direct API, SageMaker, or Bedrock is a config change (EMBEDDING_PROVIDER=voyage_api|sagemaker|bedrock), not a code change.

Embedding Provider Abstraction

# shell/embedding/provider.py

from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...

    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...

class VoyageDirectClient(EmbeddingProvider):
    """Calls Voyage AI hosted API. Prototype and production default."""
    # input_type="document" / "query" handled internally

class VoyageSageMakerClient(EmbeddingProvider):
    """Calls SageMaker endpoint running voyage-law-2 in VPC."""
    # Same input_type semantics, different transport

class BedrockEmbeddingClient(EmbeddingProvider):
    """Calls Bedrock InvokeModel for Titan/Cohere/Nova."""
    # Titan is symmetric (no input_type); Cohere Embed v3 has its own
    # search_document / search_query input types
    # Lower retrieval quality on legal text

The handler doesn't know which provider is active:

# handlers/embedding_index.py
from shell.embedding.provider import get_embedding_provider

provider = get_embedding_provider()  # reads EMBEDDING_PROVIDER config
vectors = provider.embed_documents(chunk_texts)
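A sketch of what a `get_embedding_provider` factory could look like. The registry and the `FakeProvider` stand-in are assumptions for illustration, not the real shell/embedding code:

```python
import os
from abc import ABC, abstractmethod

class EmbeddingProvider(ABC):
    @abstractmethod
    def embed_documents(self, texts: list[str]) -> list[list[float]]: ...
    @abstractmethod
    def embed_query(self, query: str) -> list[float]: ...

class FakeProvider(EmbeddingProvider):
    """Stand-in for the real clients so this sketch is runnable."""
    def embed_documents(self, texts): return [[0.0] for _ in texts]
    def embed_query(self, query): return [0.0]

_PROVIDERS: dict[str, type] = {
    "voyage_api": FakeProvider,  # real code: VoyageDirectClient
    "sagemaker": FakeProvider,   # real code: VoyageSageMakerClient
    "bedrock": FakeProvider,     # real code: BedrockEmbeddingClient
}

def get_embedding_provider() -> EmbeddingProvider:
    """Resolve the active provider from config; fail fast on typos."""
    name = os.environ.get("EMBEDDING_PROVIDER", "voyage_api")
    if name not in _PROVIDERS:
        raise ValueError(f"Unknown EMBEDDING_PROVIDER: {name!r}")
    return _PROVIDERS[name]()
```

Failing fast on an unknown value keeps a misconfigured deploy from silently falling back to the wrong embedding space.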

CDK for SageMaker Endpoint (If Needed)

// lib/common-resources-stack.ts (only if SageMaker path chosen)

import { SageMakerEndpoint } from '@cdklabs/generative-ai-cdk-constructs';

const voyageEndpoint = new SageMakerEndpoint(this, 'VoyageLaw2Endpoint', {
  modelId: 'voyage-law-2',
  instanceType: 'ml.g6.xlarge',
  instanceCount: 1,
  vpcConfig: {
    vpc: props.vpc,
    subnets: props.privateSubnets,
    securityGroups: [props.lambdaSecurityGroup]
  }
});

voyageEndpoint.grantInvoke(embeddingLambda);

voyage-law-2 vs Newer Models

MongoDB acquired Voyage AI in early 2025. Newer models are available:

Model          | Dims | Context | Legal Benchmark                    | Status
voyage-law-2   | 1024 | 16K     | Legal-optimized, MTEB #1 for legal | AWS Marketplace, direct API
voyage-3-large | 1024 | 32K     | General-purpose, strong on legal   | AWS Marketplace, direct API
voyage-3.5     | 1024 | 32K     | +8.26% vs OpenAI on average        | AWS Marketplace, direct API
voyage-4-large | 1024 | 32K     | MoE, shared embedding space        | Check availability

Note on Voyage 4 family: Voyage 4 models share a common embedding space. You can embed documents with voyage-4-large (highest accuracy) and query with voyage-4-lite (fastest) without re-indexing. This would eliminate the trade-off between ingest accuracy and query latency. Worth evaluating when available.

The embedding_model column on search_chunks and the EmbeddingProvider abstraction make model upgrades a re-embedding operation, not a code change. The architecture is designed for this evolution.

Key Numbers

Metric                     | Value                               | Impact
Embedding dimensions       | 1024                                | ~4.6 KB/vector with HNSW overhead
Document embedding latency | 100-500ms per batch of 128          | Ingest: ~256-1280 chunks/second per Lambda
Query embedding latency    | ~50ms (single query)                | Adds 50ms to every search request
Asymmetric retrieval lift  | 15-30% more relevant docs           | Justification for input_type discipline
Cost (direct API)          | $0.12 per million tokens            | 500K docs x 5 chunks x 200 tokens = ~$60/case
Cost (SageMaker)           | $0.22 per million tokens + instance | Higher per-token but data stays in VPC
SageMaker throughput       | 12.6M tokens/hr on ml.g6.xlarge     | Backfill 500M tokens in ~40 hours on one endpoint
Max tokens per text        | 16,000 (voyage-law-2)               | 512-token chunk target is well within limit
Max batch size             | 128 texts per request               | Auto-batching handles transparently
Query cost at scale        | ~$0.012/day at 1000 searches/day    | Negligible
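The per-case cost row works out as follows; the document count, chunks per document, and tokens per chunk are the table's own planning assumptions:

```python
# Per-case embedding cost at the voyage-law-2 direct-API price.
docs = 500_000
chunks_per_doc = 5
tokens_per_chunk = 200
price_per_million_tokens = 0.12  # $/M tokens, direct API

total_tokens = docs * chunks_per_doc * tokens_per_chunk  # 500M tokens
cost = total_tokens / 1_000_000 * price_per_million_tokens

assert total_tokens == 500_000_000
assert abs(cost - 60.0) < 1e-6  # ~$60 per case, matching the table
```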

Anti-Patterns

  • Using input_type="document" for queries: 15-30% recall loss.
  • Using input_type="query" for documents: Encodes intent, not content. Vectors are less informative for similarity comparison.
  • Omitting input_type entirely: Falls back to symmetric mode. Loses the asymmetric training benefit.
  • Mixing models across document/query: Document embedded with voyage-law-2 and queried with voyage-3-large will produce poor results -- vectors are in different spaces. The embedding_model column prevents this.
  • Embedding metadata separately from content: Prepend metadata to chunk text BEFORE embedding so the vector captures both WHO and WHAT. Don't store metadata only as structured fields.
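The last anti-pattern's fix can be sketched as follows. The metadata field names are hypothetical; the point is that the header travels with the chunk text into the embedding call:

```python
def chunk_text_for_embedding(chunk_body: str, metadata: dict[str, str]) -> str:
    """Prepend WHO/WHEN metadata so the vector captures sender, date, and
    subject alongside the body. Structured fields are still stored separately
    for filtering; this only changes what gets embedded."""
    header = " | ".join(
        f"{key}: {metadata[key]}"
        for key in ("from", "to", "date", "subject")
        if key in metadata
    )
    return f"{header}\n\n{chunk_body}" if header else chunk_body

text = chunk_text_for_embedding(
    "The Raptor structure needs the equity infusion before Q2 close.",
    {"from": "afastow", "date": "2001-03-14", "subject": "Raptor timeline"},
)
assert text.startswith("from: afastow | date: 2001-03-14")
```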