
Semantic Search: Executive Cost Summary

Production Corpus

| Metric | Value |
|---|---|
| Total documents | 870M |
| Total pages | 6.4B |
| NGE-enabled cases | ~10% of cases (~87M documents) |
| Discovery documents (NGE) | ~90% of NGE (~78M documents) |

Scope: ~78M documents across active NGE Discovery cases.

One-Time Backfill

| Phase | Documents | Embedding Cost | Compute | Total | Timeline |
|---|---|---|---|---|---|
| Prototype (1 case) | 50K | $18 | $2 | $20 | 1 day |
| Pilot (10 cases) | 500K | $180 | $15 | $195 | 1 day |
| Phase 1 (100 cases) | 10M | $3,600 | $300 | $3,900 | 2-3 days |
| Phase 2 (all NGE Discovery) | 78M | $28,000 | $2,300 | $30,300 | 1-2 weeks |

Ongoing Monthly (Post-Backfill)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| **Infrastructure** | | |
| Vector store (cluster/OCUs) | $3,900-5,000 | $700-2,000 |
| Storage (7.2 TB vectors) | Included | $173 |
| SQS, CloudWatch, API Gateway | $100 | $100 |
| Subtotal infrastructure | $4,000-5,100 | $973-2,273 |
| **Embedding (new documents)** | | |
| Voyage AI (~2-5M new docs/mo) | $790-1,970 | $790-1,970 |
| Lambda compute (embedding) | $70-170 | $70-170 |
| Subtotal embedding | $860-2,140 | $860-2,140 |
| **Search queries** | | |
| Voyage AI (~30-100K queries/mo) | <$1 | <$1 |
| Lambda compute (search) | $5-15 | $5-15 |
| API Gateway | $10-35 | $10-35 |
| Subtotal search | $15-50 | $15-50 |
| **Total monthly** | $4,875-7,290 | $1,848-4,463 |

Year 1 Total Cost (NGE Discovery)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| Backfill (Phase 1 + Phase 2) | $34,200 | $34,200 |
| Monthly x 12 | $58,500-87,480 | $22,176-53,556 |
| Year 1 total | $92,700-121,680 | $56,376-87,756 |

Scenario 2: Full Corpus (All 870M Documents)

Scope: All documents across all cases, including legacy non-NGE cases.

One-Time Backfill

| Scope | Documents | Embedding Cost | Compute | Total | Timeline |
|---|---|---|---|---|---|
| Full corpus | 870M | $312,000 | $30,000 | $342,000 | ~7 days (enterprise API) to ~8 months (standard API) |

Ongoing Monthly (Post-Backfill)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| **Infrastructure** | | |
| Vector store | $8,000-12,000 | $2,000-5,000 |
| Storage (80 TB vectors) | Included | $1,920 |
| SQS, CloudWatch, API Gateway | $100 | $100 |
| Subtotal infrastructure | $8,100-12,100 | $4,020-7,020 |
| **Embedding + Search** | | |
| Same as Scenario 1 | $875-2,190 | $875-2,190 |
| **Total monthly** | $8,975-14,290 | $4,895-9,210 |

Year 1 Total Cost (Full Corpus)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| Backfill (full) | $342,000 | $342,000 |
| Monthly x 12 | $107,700-171,480 | $58,740-110,520 |
| Year 1 total | $449,700-513,480 | $400,740-452,520 |

Scenario Comparison

| | NGE Discovery Only | Full Corpus |
|---|---|---|
| Documents to embed | 78M | 870M |
| Backfill cost | $34,200 | $342,000 |
| Backfill time (standard API) | ~21 days | ~8 months |
| Backfill time (enterprise API) | ~15 hours | ~7 days |
| Vector storage | 7.2 TB | 80 TB |
| Monthly (Managed OS) | $4,875-7,290 | $8,975-14,290 |
| Monthly (Serverless OS) | $1,848-4,463 | $4,895-9,210 |
| Year 1 (Managed OS) | $92,700-121,680 | $449,700-513,480 |
| Year 1 (Serverless OS) | $56,376-87,756 | $400,740-452,520 |

Full corpus is roughly 4-7x the Year 1 cost of NGE-only (4-5x on Managed OpenSearch, 5-7x on Serverless), driven almost entirely by the $342K backfill. Monthly ongoing costs roughly double because of the larger vector store footprint (80 TB vs 7.2 TB).

Standard API rate limits (300 req/min × 128 chunks/req ÷ ~15 chunks/doc ≈ 2,560 docs/min) handle the NGE Discovery backfill in ~3 weeks. Full corpus at standard rates takes ~8 months; enterprise tier or SageMaker is needed to compress that to days. Ongoing ingest (~2-5M new docs/month) fits comfortably within standard rate limits.


Build Time and Team Estimates

Engineering Time to Build

| Phase | Scope | Team | Duration | Prerequisite |
|---|---|---|---|---|
| Prototype | 1 case, pgvector, standalone UI | 1 senior backend eng + Claude Code (frontend) | 2 weeks | None |
| Production T1 | Multi-tenant, OpenSearch, domain chunking, backfill pipeline | 1-2 backend eng | 1 quarter | Prototype validates quality |
| T1+ Rails integration | Search toggle, save to folder, review queue sort | 1 frontend eng | 2-3 weeks | T1 production |
| T2 Agent service | Gap analysis, pattern ID, motion to compel agents | 1-2 backend eng | 1 quarter | T1 production |
| T2+ Cross-corpus | Transcript embeddings, multi-depo analysis | 1-2 eng | 1 quarter | T2 + Litigation suite |

Backfill Time (Wall Clock)

| Scope | Documents | Voyage Standard API | Voyage Enterprise API |
|---|---|---|---|
| Prototype (1 case) | 50K | ~20 min | <1 min |
| Pilot (10 cases) | 500K | ~3 hours | ~6 min |
| Phase 1 (100 cases) | 10M | ~2.7 days | ~2 hours |
| Phase 2 (NGE Discovery) | 78M | ~21 days | ~15 hours |
| Full corpus | 870M | ~8 months | ~7 days |

Standard API: 300 req/min × 128 chunks/req = ~38,400 chunks/min, or ~2,560 docs/min at ~15 chunks/doc. Enterprise API: 10K req/min × 128 chunks/req = ~1.28M chunks/min, or ~85K docs/min. Because each request batches up to 128 chunks, standard rate limits are more capable than the raw request count suggests.

Standard rate limits handle prototype through Phase 2 (~3 weeks for 78M docs). Enterprise tier is only needed to accelerate full corpus backfill from ~8 months to ~7 days.
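The rate-limit arithmetic behind these tables can be reproduced in a few lines. A sketch, assuming the ~15 chunks/doc and 128-chunk batch size stated above:

```python
# Backfill wall-clock math (sketch). Both constants are this document's
# stated assumptions, not measured values.
CHUNKS_PER_DOC = 15
CHUNKS_PER_REQUEST = 128

def docs_per_minute(requests_per_minute: int) -> float:
    """Effective document throughput at a given API rate limit."""
    return requests_per_minute * CHUNKS_PER_REQUEST / CHUNKS_PER_DOC

def backfill_days(total_docs: int, requests_per_minute: int) -> float:
    """Wall-clock days to embed total_docs, ignoring retries and backoff."""
    return total_docs / docs_per_minute(requests_per_minute) / (60 * 24)

print(f"{docs_per_minute(300):,.0f} docs/min at the standard tier")    # 2,560
print(f"{backfill_days(78_000_000, 300):.0f} days, NGE at standard")   # 21
print(f"{backfill_days(870_000_000, 10_000):.1f} days, full corpus at enterprise")  # 7.1
```

The same two functions reproduce every row of the backfill-time table.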

Backfill Time: Full Corpus Detail

The binding constraint for backfill is the Voyage AI API rate limit, not Lambda concurrency. At enterprise tier:

| Step | Calculation |
|---|---|
| Voyage API throughput | 10,000 requests/min × 128 texts/request |
| Chunks per minute | 1,280,000 |
| Documents per minute (at 15 chunks/doc) | ~85,000 |
| NGE Discovery (78M docs) | 78M ÷ 85K/min ≈ 15 hours |
| With 50% buffer (retries, backoff) | ~1-2 days realistic |
| Full corpus (870M docs) | 870M ÷ 85K/min ≈ 7 days |
| With 50% buffer (retries, backoff) | ~10-14 days realistic |

Lambda concurrency must match API throughput to avoid being the bottleneck. At 85K docs/min and ~2 sec/doc, you need ~2,800 concurrent Lambda invocations to saturate the API. In practice, set MaximumConcurrency=100-200 on the backfill queue and let the API rate limit be the natural throttle.
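The concurrency figure above is straightforward arithmetic; a sketch using the ~2 sec/doc processing time assumed in this section:

```python
# Lambda concurrency needed to keep pace with enterprise-tier API throughput
# (sketch; both inputs are this document's assumptions).
docs_per_min = 85_000    # enterprise-tier effective throughput
seconds_per_doc = 2      # assumed per-document embed time

docs_per_sec = docs_per_min / 60
required_concurrency = docs_per_sec * seconds_per_doc
print(round(required_concurrency))  # 2833, i.e. the ~2,800 figure above
```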

Full corpus backfill at lower concurrency (if API rate limits or cost require throttling):

| MaximumConcurrency | Throughput | NGE Discovery (78M) | Full Corpus (870M) |
|---|---|---|---|
| 10 | ~5 docs/sec | 180 days | 5.5 years |
| 50 | ~25 docs/sec | 36 days | 1.1 years |
| 100 | ~50 docs/sec | 18 days | 201 days |
| 200 | ~100 docs/sec | 9 days | 101 days |
| API-limited (~2,800) | ~85K docs/min | 1-2 days | 10-14 days |

Total Timeline: Prototype to Production

Week 1-2:     Prototype (1 eng, 1 case, validate retrieval quality)
Week 3:       Pilot (10 cases, attorney feedback)
Week 3-14:    Production T1 build (1-2 eng, multi-tenant, OpenSearch, chunking)
Week 14-15:   Phase 1 backfill (100 cases, 2-3 days)
Week 15-17:   T1+ Rails integration (search UI, review queue)
Week 17-18:   Phase 2 backfill — NGE Discovery (78M docs, 1-2 weeks)
              --- Semantic search live for all NGE Discovery cases ---
Week 18-20:   Full corpus backfill (870M docs, 10-14 days) — if approved
              --- Semantic search live for all cases ---
Week 18-30:   T2 agent service (gap analysis, pattern ID)

Prototype to Phase 1 live: ~14-15 weeks (1 quarter). Full NGE Discovery live: ~18 weeks. Full corpus live (if approved): ~20 weeks.

Recommendation

Start with NGE Discovery cases (Scenario 1). Use on-demand backfill for legacy cases.

Rationale

| Benefit | Detail |
|---|---|
| 91% cost reduction | $34K backfill vs $342K |
| Covers active users | NGE cases are where attorneys are actively working |
| On-demand handles the tail | Legacy cases are embedded when an attorney first searches them |
| De-risks the investment | Prove value at $34K before committing $342K |
| Same user experience | Attorneys searching NGE cases get full semantic search immediately |

Legacy cases that are searched get embedded on-demand (~$18 per 50K-doc case, ~$120 per 500K-doc case). Most legacy cases will never be searched again — the on-demand model avoids embedding documents nobody will ever query.


Cost-Optimized Alternative Design

The standard design prioritizes retrieval quality. The cost-optimized design applies six levers to reduce Year 1 cost by 74-91% while maintaining a viable product.

Lever Summary

| Lever | Mechanism | Impact | Trade-off |
|---|---|---|---|
| On-demand backfill | Only pre-embed top 100 cases; embed rest when attorney first searches | Backfill: $30K -> $3.9K | First search on un-embedded cases is keyword-only |
| Cheaper embedding model | Bedrock Titan V2 ($0.02/M) instead of Voyage AI ($0.12/M) for bulk | 6x cheaper per token | 5-15% lower retrieval quality on legal text |
| Fewer chunks | 1024-token chunks (~7/doc) instead of 512-token (~15/doc) | 53% fewer embeddings | Slightly less precise snippets |
| Scalar quantization | Store vectors as int8 instead of float32 | 4x storage reduction | ~1-2% recall loss |
| Serverless OpenSearch | Auto-scaled OCUs instead of dedicated cluster | ~60-75% infra savings | 60-second index refresh interval |
| Tiered storage | Hot (vector store) / Warm (S3) / Cold (not embedded) | Smaller always-on footprint | Minutes to load warm cases on first search |

Lever 1: On-Demand Backfill (Eliminates Upfront Cost)

Instead of backfilling all documents upfront, embed each case when an attorney first searches it:

Attorney searches case 123 for the first time
  -> No embeddings exist
  -> Return keyword-only results immediately
  -> Trigger background backfill for case 123
  -> "Semantic search results will be available in ~X minutes"
  -> Subsequent searches return hybrid results

| | Upfront Backfill | On-Demand Only |
|---|---|---|
| Backfill cost | $28K-$342K | $0 (cases embedded as needed) |
| Cases never searched | Embedded anyway (wasted) | Never embedded ($0) |
| First search experience | Full hybrid | Keyword-only, then hybrid after background embed |

Mitigation: Pre-embed the top 100 most active cases ($3,900) so demo and high-usage cases are ready. Everything else embeds on demand.
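The flow above can be sketched as a search handler. A minimal runnable sketch with in-memory stubs standing in for the real search paths and queue; all names (`keyword_search`, `hybrid_search`, the queue) are hypothetical, not the actual service API:

```python
# On-demand backfill flow (sketch). In production the queue would be an SQS
# backfill queue and the embedded-case check a vector-index lookup.
embedded_cases: set[str] = set()
backfill_queue: list[str] = []

def keyword_search(case_id, query):   # stand-in for the existing keyword path
    return [f"keyword hit for {query!r} in case {case_id}"]

def hybrid_search(case_id, query):    # stand-in for keyword + vector search
    return [f"hybrid hit for {query!r} in case {case_id}"]

def search(case_id: str, query: str) -> dict:
    if case_id not in embedded_cases:
        # First search on an un-embedded case: return keyword results now,
        # enqueue the case for background embedding.
        backfill_queue.append(case_id)
        return {"results": keyword_search(case_id, query), "semantic": "pending"}
    return {"results": hybrid_search(case_id, query), "semantic": "ready"}

first = search("case-123", "SPE discussions")   # keyword-only, enqueues backfill
embedded_cases.add("case-123")                  # background backfill completes
second = search("case-123", "SPE discussions")  # subsequent searches are hybrid
```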

Lever 2: Cheaper Embedding Model

| Model | Cost/M Tokens | NGE Backfill (78M docs) | Full Corpus (870M) | Quality |
|---|---|---|---|---|
| Voyage AI voyage-law-2 | $0.12 | $28,000 | $312,000 | Best (legal-tuned) |
| Bedrock Titan V2 | $0.02 | $4,700 | $52,000 | Good (general, not legal-tuned) |
| Self-hosted open-source (BGE, E5, Nomic) | ~$0.01-0.02 equiv. | $2,300-4,700 | $26,000-52,000 | Moderate |

Titan V2 is fully managed via Bedrock — no endpoints, no rate limits to negotiate, no GPU instances. The risk: it may not find the "stay the course" email that voyage-law-2 finds. Prototype must use voyage-law-2 to prove value. Evaluate Titan V2 for production bulk embedding afterward.

Lever 3: Fewer Chunks Per Document

| Strategy | Chunks/Doc | Embedding Cost Impact | Quality Impact |
|---|---|---|---|
| 512-token chunks (baseline) | ~15 | Baseline | Baseline |
| 1024-token chunks | ~7 | 53% reduction | Slightly less precise snippets |
| Document summary + key passages | ~3-5 | 67-80% reduction | Much less granular |

At 7 chunks/doc, NGE backfill drops from $28K to ~$13K with Voyage AI, or from $4,700 to ~$2,200 with Titan V2.
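These figures all follow from documents × chunks/doc × tokens/chunk × price. A sketch, assuming the ~200 tokens/chunk stated in the assumptions section:

```python
# Embedding cost model (sketch). TOKENS_PER_CHUNK is this document's
# assumption, not a measured value.
TOKENS_PER_CHUNK = 200

def embedding_cost(docs: int, chunks_per_doc: int, usd_per_m_tokens: float) -> float:
    """Total embedding spend for a corpus at a given per-million-token rate."""
    tokens = docs * chunks_per_doc * TOKENS_PER_CHUNK
    return tokens / 1_000_000 * usd_per_m_tokens

print(f"${embedding_cost(78_000_000, 15, 0.12):,.0f}")  # $28,080 (Voyage, 512-token chunks)
print(f"${embedding_cost(78_000_000, 7, 0.12):,.0f}")   # $13,104 (Voyage, 1024-token chunks)
print(f"${embedding_cost(78_000_000, 7, 0.02):,.0f}")   # $2,184  (Titan V2, 1024-token chunks)
```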

Lever 4: Vector Quantization (Cuts Storage 4-32x)

| Approach | Storage Per Vector | NGE Storage | Full Corpus Storage | Quality Loss |
|---|---|---|---|---|
| Full float32 (baseline) | 4,096 bytes | 7.2 TB | 80 TB | None |
| Scalar quantization (int8) | 1,024 bytes | 1.8 TB | 20 TB | ~1-2% |
| Binary quantization (1 bit) | 128 bytes | 225 GB | 2.5 TB | ~5-10% (rerank mitigates) |

Scalar quantization is the sweet spot — 4x storage reduction with negligible quality loss. OpenSearch supports this natively.
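The reduction ratios follow directly from bytes per vector. A sketch assuming 1024-dimension vectors (consistent with the 4,096-byte float32 row), scaling the table's 7.2 TB NGE baseline:

```python
# Quantization storage ratios (sketch). DIMS is implied by the 4,096-byte
# float32 figure; the 7.2 TB baseline is taken from the table above.
DIMS = 1024
BYTES_PER_VECTOR = {"float32": DIMS * 4, "int8": DIMS, "binary": DIMS // 8}

nge_baseline_tb = 7.2
for approach, nbytes in BYTES_PER_VECTOR.items():
    reduction = BYTES_PER_VECTOR["float32"] / nbytes
    print(f"{approach}: {nbytes} B/vector, {reduction:.0f}x reduction, "
          f"~{nge_baseline_tb / reduction * 1000:,.0f} GB for NGE")
```

The int8 and binary lines reproduce the table's 1.8 TB and 225 GB figures.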

Lever 5: OpenSearch Serverless

| | Managed | Serverless |
|---|---|---|
| Monthly infrastructure (NGE) | $4,000-5,100 | $973-2,273 |
| Operational overhead | Cluster sizing, shard mgmt | Zero |
| Trade-off | Full control | 60-second refresh interval |

Lever 6: Tiered Vector Storage

| Tier | Cases | Storage | Access |
|---|---|---|---|
| Hot | Top 100-500 active cases | OpenSearch (always loaded) | Sub-second |
| Warm | Remaining embedded cases | Vectors serialized to S3 | Load on first search (~minutes) |
| Cold | Legacy, never searched | Not embedded | Embed + load on first search |

Reduces always-on vector store from 7.2 TB (all NGE) to ~500 GB (top 100 cases).

Cost-Optimized vs Standard: NGE Discovery

| | Standard Design | Cost-Optimized Design | Savings |
|---|---|---|---|
| Backfill cost | $30,300 (all 78M docs) | $3,900 (top 100 cases, rest on-demand) | 87% |
| Monthly infrastructure | $4,000-5,100 | $500-1,200 (Serverless, scalar quantized) | 75-80% |
| Monthly embedding | $860-2,140 | $430-1,070 (7 chunks/doc) | 50% |
| Monthly search | $15-50 | $15-50 | 0% |
| Total monthly | $4,875-7,290 | $945-2,320 | 68-81% |
| Year 1 total | $92,700-121,680 | $15,240-31,740 | 74-84% |

Cost-Optimized vs Standard: Full Corpus (Titan V2)

| | Standard Design | Cost-Optimized (Titan V2) | Savings |
|---|---|---|---|
| Backfill cost | $342,000 | $24,500 (Titan V2, 7 chunks/doc) | 93% |
| Monthly infrastructure | $8,100-12,100 | $1,500-3,500 (Serverless, scalar quantized) | 70-82% |
| Year 1 total | $449,700-513,480 | $42,500-66,500 | 87-91% |

The prototype must validate BOTH retrieval quality AND production cost. Proving value with a $0.12/M model and only then discovering that production requires $0.12/M pricing at scale is worse than testing the cheaper model early. The prototype should run both models side-by-side so the go/no-go decision is informed by real quality comparisons AND real cost projections.

| Phase | What | Models | Cost |
|---|---|---|---|
| Prototype | 1 case, validate retrieval quality AND cost model | Both: voyage-law-2 AND Titan V2, same case, same queries | ~$20 + ~$3 |
| Pilot (10 cases) | Attorney feedback on both result sets | Both models, side-by-side comparison | ~$195 + ~$32 |
| Go/No-Go decision | Is Titan V2 quality acceptable for production? | Compare retrieval results across pilot queries | $0 (analysis only) |
| **If Titan V2 passes:** | | | |
| Phase 1 production | Top 100 cases, on-demand rest, Serverless OS, scalar quant | Titan V2 for bulk, voyage-law-2 for query | $650 + ~$1K-2.3K/mo |
| Phase 2 | All NGE Discovery | Titan V2 | ~$4,700 |
| Full corpus | If justified | Titan V2 | ~$24,500 |
| **If Titan V2 fails:** | | | |
| Phase 1 production | Top 100 cases, on-demand rest, Serverless OS, scalar quant | Voyage AI voyage-law-2 | $3,900 + ~$1K-2.3K/mo |
| Phase 2 | All NGE Discovery | Voyage AI voyage-law-2 | ~$28,000 |
| Full corpus | If justified, likely requires enterprise pricing negotiation | Voyage AI voyage-law-2 | ~$312,000 (negotiate) |

The prototype costs ~$23 to test both models on the same case. That $3 for Titan V2 buys the answer to a $280K question (the difference between $28K and $312K at scale). Not testing both early would be negligent.

What the Prototype Comparison Looks Like

Same case, same 20 test queries, both models:

Query: "internal discussions about Special Purpose Entities"

voyage-law-2 results:        Titan V2 results:
  #1: Fastow email (Raptor)    #1: Fastow email (Raptor)
  #2: Skilling reply            #2: Board presentation (SPE)
  #3: Board presentation        #3: Skilling reply
  #4: JEDI memo                 #4: ??? (did it find the JEDI memo?)
  ...                           ...

Evaluation:
  - Do both models find the same key documents?
  - Does Titan V2 miss any documents voyage-law-2 found?
  - Are the misses at the top (critical) or bottom (boundary)?
  - Would an attorney notice the difference?

If Titan V2 misses 1-2 boundary documents across 20 queries, that's likely acceptable. If it misses the "stay the course" email, the kind of document that is the whole point of semantic search, it's not.

This comparison takes 1 day during the prototype. The cost to run it is $3. The cost of NOT running it is discovering 3 months later that production requires the expensive model.
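One way to score the comparison is a simple top-k overlap metric. A sketch (the metric and the result lists are illustrative placeholders, not the team's actual evaluation protocol):

```python
# Per-query overlap between the legal-tuned model's top-k results and the
# cheaper model's top-k (sketch; result lists are hypothetical).
def topk_overlap(results_a: list[str], results_b: list[str], k: int = 10) -> float:
    """Fraction of model A's top-k documents that model B also returns."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / len(a) if a else 1.0

voyage = ["fastow_email", "skilling_reply", "board_presentation", "jedi_memo"]
titan  = ["fastow_email", "board_presentation", "skilling_reply", "other_doc"]
print(topk_overlap(voyage, titan, k=4))  # 0.75, i.e. it missed the JEDI memo
```

Averaged across the 20 test queries, plus a manual look at *which* documents differ (top-of-list misses matter far more than boundary misses), this gives the go/no-go decision a number to anchor on.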

Cost Drivers (What Moves the Numbers)

| Factor | Impact | How to Reduce |
|---|---|---|
| Voyage AI token cost ($0.12/M) | Largest variable cost | Evaluate Bedrock Titan V2 ($0.02/M) post-prototype. 6x cheaper, 5-15% lower quality. |
| Vector store infrastructure | Largest fixed cost | OpenSearch Serverless vs Managed. Serverless ~60% cheaper at moderate volume. |
| New document import volume | Drives ongoing embedding cost | Only embed documents in NGE cases. Monitor via PSM/Athena. |
| Chunks per document | Multiplies embedding cost | Optimize chunking strategy. Fewer, larger chunks = lower cost but potentially lower retrieval quality. |
| OpenSearch vs ES consolidation | Potential offset | If OpenSearch replaces ES 7.4, existing ES cluster cost (~$X/mo) offsets new vector store cost. |

Voyage AI Pricing: What We Know vs What We Need to Validate

All cost estimates in this document use $0.12/million tokens — the published standard rate for voyage-law-2. This is the ONLY publicly available pricing.

| Pricing Tier | Rate | Rate Limits | Effective Throughput | Status |
|---|---|---|---|---|
| Standard (published) | $0.12/M tokens | 300 req/min × 128 texts/req | ~2,560 docs/min | Available now |
| Enterprise (negotiated) | Unknown | ~10K req/min × 128 texts/req | ~85,000 docs/min | Requires sales contact |

Note: Each API request can batch up to 128 chunks. At 15 chunks/doc, 300 req/min = 38,400 chunks/min = ~2,560 documents/min.

What This Means for Backfill Estimates

The backfill TIME estimates assume enterprise-tier rate limits (~10K req/min). The backfill COST estimates use standard pricing ($0.12/M). These two assumptions may not be compatible:

| Scenario | Cost Impact | Time Impact |
|---|---|---|
| Standard pricing + standard rate limits | Cost as estimated ($0.12/M) | Slower: 2,560 docs/min. Phase 2 in ~21 days; full corpus in ~8 months. |
| Enterprise pricing + enterprise rate limits | Likely discounted (volume pricing) | Fast: 85K docs/min. Phase 2 in ~15 hours; full corpus in ~7 days. |
| SageMaker (self-hosted, no rate limit) | $0.22/M tokens + $737/mo per instance | Controlled by instance count, not API limits; scale by adding instances. |

The embedding Lambda's SQS event source mapping is configured with batchSize: 10 and maxBatchingWindow: 60s — following the same pattern documentloader uses for document processing. This requires new dedicated queues (live_embedding_queue, backfill_embedding_queue) with their own SNS subscriptions, DLQs, and event source mappings configured in CDK. The pattern is proven; the infrastructure is new.

Multiple DOCUMENT_PROCESSED messages arrive in one Lambda invocation. The Lambda chunks all documents, then batches the chunks into efficient Voyage API requests (up to 128 chunks per request).

10 documents per Lambda invocation
× 15 chunks per document
= 150 chunks
÷ 128 chunks per API request
= 2 API requests (128 + 22)

This means live ingest achieves the same batching efficiency as backfill.
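The batching arithmetic above can be expressed as a one-line helper (a sketch; the 15 chunks/doc figure is this document's assumption):

```python
import math

# Chunks pooled across one Lambda invocation, split into <=128-chunk
# Voyage API requests (sketch).
def api_requests(docs_per_invocation: int, chunks_per_doc: int = 15,
                 batch_limit: int = 128) -> int:
    total_chunks = docs_per_invocation * chunks_per_doc
    return math.ceil(total_chunks / batch_limit)

print(api_requests(10))  # 2: 150 chunks split as 128 + 22
```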

| Operation | Docs per Lambda | Chunks per API Request | Effective Docs/Min (Standard 300 req/min) |
|---|---|---|---|
| Backfill | Batched | Up to 128 | ~2,560 |
| Live ingest (batchSize=10) | Up to 10 | Up to 128 | ~2,560 |
| Search query | N/A | 1 (single query) | 300 queries/min |

Ongoing Ingest: Standard Rate Limits Are Sufficient

At ~2,560 docs/min with SQS batching:

- 2,560 docs/min × 60 min × 24 hr ≈ 3.7M docs/day
- ~110M docs/month capacity

New document imports (~2-5M docs/month) use ~2-5% of available throughput. Standard rate limits handle ongoing ingest with massive headroom.

Enterprise tier is NOT needed for ongoing ingest — only for accelerating full corpus backfill.

Backfill Time at Standard vs Enterprise Rate Limits

| Scope | Documents | Standard (2,560 docs/min) | Enterprise (85K docs/min) |
|---|---|---|---|
| Prototype (1 case) | 50K | ~20 min | <1 min |
| Pilot (10 cases) | 500K | ~3 hours | ~6 min |
| Phase 1 (100 cases) | 10M | ~2.7 days | ~2 hours |
| Phase 2 (NGE Discovery) | 78M | ~21 days | ~15 hours |
| Full corpus | 870M | ~236 days (~8 months) | ~7 days |

Phase 1 and Phase 2 are workable at standard rate limits (~3 days and ~3 weeks respectively). Full corpus backfill at standard rates takes ~8 months — enterprise tier or SageMaker needed if that timeline is unacceptable.

Action Item

Before committing to Phase 2 or full corpus backfill:

  1. Contact Voyage AI / MongoDB sales — negotiate enterprise pricing and rate limits for ~2.6T tokens (full corpus) or ~234B tokens (NGE only)
  2. Get volume discount — at this scale, expect significant discount off $0.12/M (possibly $0.06-0.08/M — speculative)
  3. Confirm rate limits — 10K req/min or higher needed for sub-2-week full corpus backfill
  4. Evaluate SageMaker as alternative — $0.22/M tokens but no rate limits, data stays in VPC, scale by adding GPU instances

For the prototype and Phase 1, standard pricing and rate limits are sufficient. Phase 1 (10M docs) costs ~$3,600 and completes in ~3 days even at standard rate limits. No enterprise negotiation is needed to start.

Assumptions and Validation Needed

| Assumption | Value Used | How to Validate |
|---|---|---|
| Voyage AI price | $0.12/M tokens (standard published) | Negotiate enterprise pricing for volume |
| Voyage AI rate limit (backfill time) | 10K req/min (enterprise, assumed) | Confirm with Voyage AI sales |
| NGE-enabled = ~10% of cases | ~87M docs | Query core DB for nge_enabled case count + doc volumes |
| Discovery = ~90% of NGE docs | ~78M docs | Query per-case DBs for document type distribution |
| ~15 chunks per document | Based on 7.4 avg pages | Sample 100 documents, run chunking, measure actual |
| ~200 tokens per chunk | Conservative estimate | Sample chunked documents, count actual tokens |
| ~2-5M new documents/month | Estimate | Pull from PSM/Athena: DOCUMENT_PROCESSED events per month |
| ~30-100K search queries/month | Estimate | No current semantic search; estimate from keyword search volume |

These estimates could shift 20-30% in either direction once validated against actual production data. The Voyage AI pricing is the single largest uncertainty — enterprise volume discounts could reduce the backfill cost by 30-50%.
