
Semantic Search: Executive Cost Summary

Production Corpus

| Metric | Value |
|---|---|
| Total documents | 870M |
| Total pages | 6.4B |
| NGE-enabled cases | ~10% of cases (~87M documents) |
| Discovery documents (NGE) | ~90% of NGE (~78M documents) |

Scope: ~78M documents across active NGE Discovery cases.

One-Time Backfill

| Phase | Documents | Embedding Cost | Compute | Total | Timeline |
|---|---|---|---|---|---|
| Prototype (1 case) | 50K | $18 | $2 | $20 | 1 day |
| Pilot (10 cases) | 500K | $180 | $15 | $195 | 1 day |
| Phase 1 (100 cases) | 10M | $3,600 | $300 | $3,900 | 2-3 days |
| Phase 2 (all NGE Discovery) | 78M | $28,000 | $2,300 | $30,300 | 1-2 weeks |

Ongoing Monthly (Post-Backfill)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| **Infrastructure** | | |
| Vector store (cluster/OCUs) | $3,900-5,000 | $700-2,000 |
| Storage (7.2 TB vectors) | Included | $173 |
| SQS, CloudWatch, API Gateway | $100 | $100 |
| Subtotal infrastructure | $4,000-5,100 | $973-2,273 |
| **Embedding (new documents)** | | |
| Voyage AI (~2-5M new docs/mo) | $790-1,970 | $790-1,970 |
| Lambda compute (embedding) | $70-170 | $70-170 |
| Subtotal embedding | $860-2,140 | $860-2,140 |
| **Search queries** | | |
| Voyage AI (~30-100K queries/mo) | <$1 | <$1 |
| Lambda compute (search) | $5-15 | $5-15 |
| API Gateway | $10-35 | $10-35 |
| Subtotal search | $15-50 | $15-50 |
| **Total monthly** | $4,875-7,290 | $1,848-4,463 |

Year 1 Total Cost (NGE Discovery)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| Backfill (Phase 1 + Phase 2) | $34,200 | $34,200 |
| Monthly x 12 | $58,500-87,480 | $22,176-53,556 |
| Year 1 total | $92,700-121,680 | $56,376-87,756 |

Scenario 2: Full Corpus (All 870M Documents)

Scope: All documents across all cases, including legacy non-NGE cases.

One-Time Backfill

| Scope | Documents | Embedding Cost | Compute | Total | Timeline |
|---|---|---|---|---|---|
| Full corpus | 870M | $312,000 | $30,000 | $342,000 | ~7 days (enterprise API) to ~8 months (standard API) |

Ongoing Monthly (Post-Backfill)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| **Infrastructure** | | |
| Vector store | $8,000-12,000 | $2,000-5,000 |
| Storage (80 TB vectors) | Included | $1,920 |
| SQS, CloudWatch, API Gateway | $100 | $100 |
| Subtotal infrastructure | $8,100-12,100 | $4,020-7,020 |
| **Embedding + Search** | | |
| Same as Scenario 1 | $875-2,190 | $875-2,190 |
| **Total monthly** | $8,975-14,290 | $4,895-9,210 |

Year 1 Total Cost (Full Corpus)

| | OpenSearch Managed | OpenSearch Serverless |
|---|---|---|
| Backfill (full) | $342,000 | $342,000 |
| Monthly x 12 | $107,700-171,480 | $58,740-110,520 |
| Year 1 total | $449,700-513,480 | $400,740-452,520 |

Scenario Comparison

| | NGE Discovery Only | Full Corpus |
|---|---|---|
| Documents to embed | 78M | 870M |
| Backfill cost | $34,200 | $342,000 |
| Backfill time (standard API) | ~21 days | ~8 months |
| Backfill time (enterprise API) | ~15 hours | ~7 days |
| Vector storage | 7.2 TB | 80 TB |
| Monthly (Managed OS) | $4,875-7,290 | $8,975-14,290 |
| Monthly (Serverless OS) | $1,848-4,463 | $4,895-9,210 |
| Year 1 (Managed OS) | $92,700-121,680 | $449,700-513,480 |
| Year 1 (Serverless OS) | $56,376-87,756 | $400,740-452,520 |

Full corpus is roughly 4-7x the Year 1 cost of NGE-only (4-5x on Managed OpenSearch, 5-7x on Serverless), driven almost entirely by the $342K backfill. Monthly ongoing costs roughly double because of the larger vector store footprint (80 TB vs 7.2 TB).

Standard API rate limits (300 req/min × 128 chunks/req ÷ ~15 chunks/doc ≈ 2,560 docs/min) handle the NGE Discovery backfill in ~3 weeks. Full corpus at standard rates takes ~8 months; enterprise tier or SageMaker is needed to compress that to days. Ongoing ingest (~2-5M new docs/month) fits comfortably within standard rate limits.


Build Time and Team Estimates

Engineering Time to Build

| Phase | Scope | Team | Duration | Prerequisite |
|---|---|---|---|---|
| Prototype | 1 case, pgvector, standalone UI | 1 senior backend eng + Claude Code (frontend) | 2 weeks | None |
| Production T1 | Multi-tenant, OpenSearch, domain chunking, backfill pipeline | 1-2 backend eng | 1 quarter | Prototype validates quality |
| T1+ Rails integration | Search toggle, save to folder, review queue sort | 1 frontend eng | 2-3 weeks | T1 production |
| T2 Agent service | Gap analysis, pattern ID, motion to compel agents | 1-2 backend eng | 1 quarter | T1 production |
| T2+ Cross-corpus | Transcript embeddings, multi-depo analysis | 1-2 eng | 1 quarter | T2 + Litigation suite |

Backfill Time (Wall Clock)

| Scope | Documents | Voyage Standard API | Voyage Enterprise API |
|---|---|---|---|
| Prototype (1 case) | 50K | ~20 min | <1 min |
| Pilot (10 cases) | 500K | ~3 hours | ~6 min |
| Phase 1 (100 cases) | 10M | ~2.7 days | ~2 hours |
| Phase 2 (NGE Discovery) | 78M | ~21 days | ~15 hours |
| Full corpus | 870M | ~8 months | ~7 days |

Standard API: 300 req/min × 128 chunks/req = ~38,400 chunks/min, or ~2,560 docs/min at ~15 chunks/doc. Enterprise API: 10K req/min × 128 chunks/req = ~1.28M chunks/min, or ~85K docs/min. Because each request batches up to 128 chunks, standard rate limits are more capable than the raw request count suggests.

Standard rate limits handle prototype through Phase 2 (~3 weeks for 78M docs). Enterprise tier is only needed to accelerate full corpus backfill from ~8 months to ~7 days.
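The rate-limit arithmetic behind these tables can be reproduced in a few lines. A sketch, assuming the ~15 chunks/doc and 128-chunk batch size stated above:

```python
# Backfill wall-clock math (sketch). Both constants are this document's
# stated assumptions, not measured values.
CHUNKS_PER_DOC = 15
CHUNKS_PER_REQUEST = 128

def docs_per_minute(requests_per_minute: int) -> float:
    """Effective document throughput at a given API rate limit."""
    return requests_per_minute * CHUNKS_PER_REQUEST / CHUNKS_PER_DOC

def backfill_days(total_docs: int, requests_per_minute: int) -> float:
    """Wall-clock days to embed total_docs, ignoring retries and backoff."""
    return total_docs / docs_per_minute(requests_per_minute) / (60 * 24)

print(f"{docs_per_minute(300):,.0f} docs/min at the standard tier")    # 2,560
print(f"{backfill_days(78_000_000, 300):.0f} days, NGE at standard")   # 21
print(f"{backfill_days(870_000_000, 10_000):.1f} days, full corpus at enterprise")  # 7.1
```

The same two functions reproduce every row of the backfill-time table.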

Backfill Time: Full Corpus Detail

The binding constraint for backfill is the Voyage AI API rate limit, not Lambda concurrency. At enterprise tier:

| Step | Calculation |
|---|---|
| Voyage API throughput | 10,000 requests/min × 128 texts/request |
| Chunks per minute | 1,280,000 |
| Documents per minute (at 15 chunks/doc) | ~85,000 |
| NGE Discovery (78M docs) | 78M ÷ 85K/min ≈ 15 hours |
| With 50% buffer (retries, backoff) | ~1-2 days realistic |
| Full corpus (870M docs) | 870M ÷ 85K/min ≈ 7 days |
| With 50% buffer (retries, backoff) | ~10-14 days realistic |

Lambda concurrency must match API throughput to avoid being the bottleneck. At 85K docs/min and ~2 sec/doc, you need ~2,800 concurrent Lambda invocations to saturate the API. In practice, set MaximumConcurrency=100-200 on the backfill queue and let the API rate limit be the natural throttle.
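The concurrency figure above is straightforward arithmetic; a sketch using the ~2 sec/doc processing time assumed in this section:

```python
# Lambda concurrency needed to keep pace with enterprise-tier API throughput
# (sketch; both inputs are this document's assumptions).
docs_per_min = 85_000    # enterprise-tier effective throughput
seconds_per_doc = 2      # assumed per-document embed time

docs_per_sec = docs_per_min / 60
required_concurrency = docs_per_sec * seconds_per_doc
print(round(required_concurrency))  # 2833, i.e. the ~2,800 figure above
```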

Full corpus backfill at lower concurrency (if API rate limits or cost require throttling):

| MaximumConcurrency | Throughput | NGE Discovery (78M) | Full Corpus (870M) |
|---|---|---|---|
| 10 | ~5 docs/sec | 180 days | 5.5 years |
| 50 | ~25 docs/sec | 36 days | 1.1 years |
| 100 | ~50 docs/sec | 18 days | 201 days |
| 200 | ~100 docs/sec | 9 days | 101 days |
| API-limited (~2,800) | ~85K docs/min | 1-2 days | 10-14 days |

Total Timeline: Prototype to Production

Week 1-2:     Prototype (1 eng, 1 case, validate retrieval quality)
Week 3:       Pilot (10 cases, attorney feedback)
Week 3-14:    Production T1 build (1-2 eng, multi-tenant, OpenSearch, chunking)
Week 14-15:   Phase 1 backfill (100 cases, 2-3 days)
Week 15-17:   T1+ Rails integration (search UI, review queue)
Week 17-18:   Phase 2 backfill — NGE Discovery (78M docs, 1-2 weeks)
              --- Semantic search live for all NGE Discovery cases ---
Week 18-20:   Full corpus backfill (870M docs, 10-14 days) — if approved
              --- Semantic search live for all cases ---
Week 18-30:   T2 agent service (gap analysis, pattern ID)

Prototype to Phase 1 live: ~14-15 weeks (1 quarter). Full NGE Discovery live: ~18 weeks. Full corpus live (if approved): ~20 weeks.

Recommendation

Start with NGE Discovery cases (Scenario 1). Use on-demand backfill for legacy cases.

Rationale

| Benefit | Detail |
|---|---|
| 91% cost reduction | $34K backfill vs $342K |
| Covers active users | NGE cases are where attorneys are actively working |
| On-demand handles the tail | Legacy cases are embedded when an attorney first searches them |
| De-risks the investment | Prove value at $34K before committing $342K |
| Same user experience | Attorneys searching NGE cases get full semantic search immediately |

Legacy cases that are searched get embedded on-demand (~$18 per 50K-doc case, ~$120 per 500K-doc case). Most legacy cases will never be searched again — the on-demand model avoids embedding documents nobody will ever query.


Cost-Optimized Alternative Design

The standard design prioritizes retrieval quality. The cost-optimized design applies six levers to reduce Year 1 cost by 74-91% while maintaining a viable product.

Lever Summary

| Lever | Mechanism | Impact | Trade-off |
|---|---|---|---|
| On-demand backfill | Only pre-embed top 100 cases; embed rest when attorney first searches | Backfill: $30K -> $3.9K | First search on un-embedded cases is keyword-only |
| Cheaper embedding model | Bedrock Titan V2 ($0.02/M) instead of Voyage AI ($0.12/M) for bulk | 6x cheaper per token | 5-15% lower retrieval quality on legal text |
| Fewer chunks | 1024-token chunks (~7/doc) instead of 512-token (~15/doc) | 53% fewer embeddings | Slightly less precise snippets |
| Scalar quantization | Store vectors as int8 instead of float32 | 4x storage reduction | ~1-2% recall loss |
| Serverless OpenSearch | Auto-scaled OCUs instead of dedicated cluster | ~60-75% infra savings | 60-second index refresh interval |
| Tiered storage | Hot (vector store) / Warm (S3) / Cold (not embedded) | Smaller always-on footprint | Minutes to load warm cases on first search |

Lever 1: On-Demand Backfill (Eliminates Upfront Cost)

Instead of backfilling all documents upfront, embed each case when an attorney first searches it:

Attorney searches case 123 for the first time
  -> No embeddings exist
  -> Return keyword-only results immediately
  -> Trigger background backfill for case 123
  -> "Semantic search results will be available in ~X minutes"
  -> Subsequent searches return hybrid results

| | Upfront Backfill | On-Demand Only |
|---|---|---|
| Backfill cost | $28K-$342K | $0 (cases embedded as needed) |
| Cases never searched | Embedded anyway (wasted) | Never embedded ($0) |
| First search experience | Full hybrid | Keyword-only, then hybrid after background embed |

Mitigation: Pre-embed the top 100 most active cases ($3,900) so demo and high-usage cases are ready. Everything else embeds on demand.
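The flow above can be sketched as a search handler. A minimal runnable sketch with in-memory stubs standing in for the real search paths and queue; all names (`keyword_search`, `hybrid_search`, the queue) are hypothetical, not the actual service API:

```python
# On-demand backfill flow (sketch). In production the queue would be an SQS
# backfill queue and the embedded-case check a vector-index lookup.
embedded_cases: set[str] = set()
backfill_queue: list[str] = []

def keyword_search(case_id, query):   # stand-in for the existing keyword path
    return [f"keyword hit for {query!r} in case {case_id}"]

def hybrid_search(case_id, query):    # stand-in for keyword + vector search
    return [f"hybrid hit for {query!r} in case {case_id}"]

def search(case_id: str, query: str) -> dict:
    if case_id not in embedded_cases:
        # First search on an un-embedded case: return keyword results now,
        # enqueue the case for background embedding.
        backfill_queue.append(case_id)
        return {"results": keyword_search(case_id, query), "semantic": "pending"}
    return {"results": hybrid_search(case_id, query), "semantic": "ready"}

first = search("case-123", "SPE discussions")   # keyword-only, enqueues backfill
embedded_cases.add("case-123")                  # background backfill completes
second = search("case-123", "SPE discussions")  # subsequent searches are hybrid
```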

Lever 2: Cheaper Embedding Model

| Model | Cost/M Tokens | NGE Backfill (78M docs) | Full Corpus (870M) | Quality |
|---|---|---|---|---|
| Voyage AI voyage-law-2 | $0.12 | $28,000 | $312,000 | Best (legal-tuned) |
| Bedrock Titan V2 | $0.02 | $4,700 | $52,000 | Good (general, not legal-tuned) |
| Self-hosted open-source (BGE, E5, Nomic) | ~$0.01-0.02 equiv. | $2,300-4,700 | $26,000-52,000 | Moderate |

Titan V2 is fully managed via Bedrock — no endpoints, no rate limits to negotiate, no GPU instances. The risk: it may not find the "stay the course" email that voyage-law-2 finds. Prototype must use voyage-law-2 to prove value. Evaluate Titan V2 for production bulk embedding afterward.

Lever 3: Fewer Chunks Per Document

| Strategy | Chunks/Doc | Embedding Cost Impact | Quality Impact |
|---|---|---|---|
| 512-token chunks (baseline) | ~15 | Baseline | Baseline |
| 1024-token chunks | ~7 | 53% reduction | Slightly less precise snippets |
| Document summary + key passages | ~3-5 | 67-80% reduction | Much less granular |

At 7 chunks/doc, NGE backfill drops from $28K to ~$13K with Voyage AI, or from $4,700 to ~$2,200 with Titan V2.
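These figures all follow from documents × chunks/doc × tokens/chunk × price. A sketch, assuming the ~200 tokens/chunk stated in the assumptions section:

```python
# Embedding cost model (sketch). TOKENS_PER_CHUNK is this document's
# assumption, not a measured value.
TOKENS_PER_CHUNK = 200

def embedding_cost(docs: int, chunks_per_doc: int, usd_per_m_tokens: float) -> float:
    """Total embedding spend for a corpus at a given per-million-token rate."""
    tokens = docs * chunks_per_doc * TOKENS_PER_CHUNK
    return tokens / 1_000_000 * usd_per_m_tokens

print(f"${embedding_cost(78_000_000, 15, 0.12):,.0f}")  # $28,080 (Voyage, 512-token chunks)
print(f"${embedding_cost(78_000_000, 7, 0.12):,.0f}")   # $13,104 (Voyage, 1024-token chunks)
print(f"${embedding_cost(78_000_000, 7, 0.02):,.0f}")   # $2,184  (Titan V2, 1024-token chunks)
```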

Lever 4: Vector Quantization (Cuts Storage 4-32x)

| Approach | Storage Per Vector | NGE Storage | Full Corpus Storage | Quality Loss |
|---|---|---|---|---|
| Full float32 (baseline) | 4,096 bytes | 7.2 TB | 80 TB | None |
| Scalar quantization (int8) | 1,024 bytes | 1.8 TB | 20 TB | ~1-2% |
| Binary quantization (1 bit) | 128 bytes | 225 GB | 2.5 TB | ~5-10% (rerank mitigates) |

Scalar quantization is the sweet spot — 4x storage reduction with negligible quality loss. OpenSearch supports this natively.
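The reduction ratios follow directly from bytes per vector. A sketch assuming 1024-dimension vectors (consistent with the 4,096-byte float32 row), scaling the table's 7.2 TB NGE baseline:

```python
# Quantization storage ratios (sketch). DIMS is implied by the 4,096-byte
# float32 figure; the 7.2 TB baseline is taken from the table above.
DIMS = 1024
BYTES_PER_VECTOR = {"float32": DIMS * 4, "int8": DIMS, "binary": DIMS // 8}

nge_baseline_tb = 7.2
for approach, nbytes in BYTES_PER_VECTOR.items():
    reduction = BYTES_PER_VECTOR["float32"] / nbytes
    print(f"{approach}: {nbytes} B/vector, {reduction:.0f}x reduction, "
          f"~{nge_baseline_tb / reduction * 1000:,.0f} GB for NGE")
```

The int8 and binary lines reproduce the table's 1.8 TB and 225 GB figures.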

Lever 5: OpenSearch Serverless

| | Managed | Serverless |
|---|---|---|
| Monthly infrastructure (NGE) | $4,000-5,100 | $973-2,273 |
| Operational overhead | Cluster sizing, shard mgmt | Zero |
| Trade-off | Full control | 60-second refresh interval |

Lever 6: Tiered Vector Storage

| Tier | Cases | Storage | Access |
|---|---|---|---|
| Hot | Top 100-500 active cases | OpenSearch (always loaded) | Sub-second |
| Warm | Remaining embedded cases | Vectors serialized to S3 | Load on first search (~minutes) |
| Cold | Legacy, never searched | Not embedded | Embed + load on first search |

Reduces always-on vector store from 7.2 TB (all NGE) to ~500 GB (top 100 cases).

Cost-Optimized vs Standard: NGE Discovery

| | Standard Design | Cost-Optimized Design | Savings |
|---|---|---|---|
| Backfill cost | $30,300 (all 78M docs) | $3,900 (top 100 cases, rest on-demand) | 87% |
| Monthly infrastructure | $4,000-5,100 | $500-1,200 (Serverless, scalar quantized) | 75-80% |
| Monthly embedding | $860-2,140 | $430-1,070 (7 chunks/doc) | 50% |
| Monthly search | $15-50 | $15-50 | 0% |
| Total monthly | $4,875-7,290 | $945-2,320 | 68-81% |
| Year 1 total | $92,700-121,680 | $15,240-31,740 | 74-84% |

Cost-Optimized vs Standard: Full Corpus (Titan V2)

| | Standard Design | Cost-Optimized (Titan V2) | Savings |
|---|---|---|---|
| Backfill cost | $342,000 | $24,500 (Titan V2, 7 chunks/doc) | 93% |
| Monthly infrastructure | $8,100-12,100 | $1,500-3,500 (Serverless, scalar quantized) | 70-82% |
| Year 1 total | $449,700-513,480 | $42,500-66,500 | 87-91% |

The prototype must validate BOTH retrieval quality AND production cost. Proving value with a $0.12/M model and only then discovering that production requires $0.12/M pricing at scale is worse than testing the cheaper model early. The prototype should run both models side-by-side so the go/no-go decision is informed by real quality comparisons AND real cost projections.

| Phase | What | Models | Cost |
|---|---|---|---|
| Prototype | 1 case, validate retrieval quality AND cost model | Both: voyage-law-2 AND Titan V2, same case, same queries | ~$20 + ~$3 |
| Pilot (10 cases) | Attorney feedback on both result sets | Both models, side-by-side comparison | ~$195 + ~$32 |
| Go/No-Go decision | Is Titan V2 quality acceptable for production? | Compare retrieval results across pilot queries | $0 (analysis only) |
| **If Titan V2 passes:** | | | |
| Phase 1 production | Top 100 cases, on-demand rest, Serverless OS, scalar quant | Titan V2 for bulk, voyage-law-2 for query | $650 + ~$1K-2.3K/mo |
| Phase 2 | All NGE Discovery | Titan V2 | ~$4,700 |
| Full corpus | If justified | Titan V2 | ~$24,500 |
| **If Titan V2 fails:** | | | |
| Phase 1 production | Top 100 cases, on-demand rest, Serverless OS, scalar quant | Voyage AI voyage-law-2 | $3,900 + ~$1K-2.3K/mo |
| Phase 2 | All NGE Discovery | Voyage AI voyage-law-2 | ~$28,000 |
| Full corpus | If justified, likely requires enterprise pricing negotiation | Voyage AI voyage-law-2 | ~$312,000 (negotiate) |

The prototype costs ~$23 to test both models on the same case. That $3 for Titan V2 buys the answer to a $280K question (the difference between $28K and $312K at scale). Not testing both early would be negligent.

What the Prototype Comparison Looks Like

Same case, same 20 test queries, both models:

Query: "internal discussions about Special Purpose Entities"

voyage-law-2 results:        Titan V2 results:
  #1: Fastow email (Raptor)    #1: Fastow email (Raptor)
  #2: Skilling reply            #2: Board presentation (SPE)
  #3: Board presentation        #3: Skilling reply
  #4: JEDI memo                 #4: ??? (did it find the JEDI memo?)
  ...                           ...

Evaluation:
  - Do both models find the same key documents?
  - Does Titan V2 miss any documents voyage-law-2 found?
  - Are the misses at the top (critical) or bottom (boundary)?
  - Would an attorney notice the difference?

If Titan V2 misses 1-2 boundary documents across 20 queries, that's likely acceptable. If it misses the "stay the course" email, the kind of document that is the whole point of semantic search, it's not.

This comparison takes 1 day during the prototype. The cost to run it is $3. The cost of NOT running it is discovering 3 months later that production requires the expensive model.
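One way to score the comparison is a simple top-k overlap metric. A sketch (the metric and the result lists are illustrative placeholders, not the team's actual evaluation protocol):

```python
# Per-query overlap between the legal-tuned model's top-k results and the
# cheaper model's top-k (sketch; result lists are hypothetical).
def topk_overlap(results_a: list[str], results_b: list[str], k: int = 10) -> float:
    """Fraction of model A's top-k documents that model B also returns."""
    a, b = set(results_a[:k]), set(results_b[:k])
    return len(a & b) / len(a) if a else 1.0

voyage = ["fastow_email", "skilling_reply", "board_presentation", "jedi_memo"]
titan  = ["fastow_email", "board_presentation", "skilling_reply", "other_doc"]
print(topk_overlap(voyage, titan, k=4))  # 0.75, i.e. it missed the JEDI memo
```

Averaged across the 20 test queries, plus a manual look at *which* documents differ (top-of-list misses matter far more than boundary misses), this gives the go/no-go decision a number to anchor on.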

Cost Drivers (What Moves the Numbers)

| Factor | Impact | How to Reduce |
|---|---|---|
| Voyage AI token cost ($0.12/M) | Largest variable cost | Evaluate Bedrock Titan V2 ($0.02/M) post-prototype. 6x cheaper, 5-15% lower quality. |
| Vector store infrastructure | Largest fixed cost | OpenSearch Serverless vs Managed. Serverless ~60% cheaper at moderate volume. |
| New document import volume | Drives ongoing embedding cost | Only embed documents in NGE cases. Monitor via PSM/Athena. |
| Chunks per document | Multiplies embedding cost | Optimize chunking strategy. Fewer, larger chunks = lower cost but potentially lower retrieval quality. |
| OpenSearch vs ES consolidation | Potential offset | If OpenSearch replaces ES 7.4, existing ES cluster cost (~$X/mo) offsets new vector store cost. |

Voyage AI Pricing: What We Know vs What We Need to Validate

All cost estimates in this document use $0.12/million tokens — the published standard rate for voyage-law-2. This is the ONLY publicly available pricing.

| Pricing Tier | Rate | Rate Limits | Effective Throughput | Status |
|---|---|---|---|---|
| Standard (published) | $0.12/M tokens | 300 req/min × 128 texts/req | ~2,560 docs/min | Available now |
| Enterprise (negotiated) | Unknown | ~10K req/min × 128 texts/req | ~85,000 docs/min | Requires sales contact |

Note: Each API request can batch up to 128 chunks. At 15 chunks/doc, 300 req/min = 38,400 chunks/min = ~2,560 documents/min.

What This Means for Backfill Estimates

The backfill TIME estimates assume enterprise-tier rate limits (~10K req/min). The backfill COST estimates use standard pricing ($0.12/M). These two assumptions may not be compatible:

| Scenario | Cost Impact | Time Impact |
|---|---|---|
| Standard pricing + standard rate limits | Cost as estimated ($0.12/M) | Slower: 2,560 docs/min. Phase 2 in ~21 days; full corpus in ~8 months. |
| Enterprise pricing + enterprise rate limits | Likely discounted (volume pricing) | Fast: 85K docs/min. Phase 2 in ~15 hours; full corpus in ~7 days. |
| SageMaker (self-hosted, no rate limit) | $0.22/M tokens + $737/mo per instance | Controlled by instance count, not API limits; scale by adding instances. |

The embedding Lambda's SQS event source mapping is configured with batchSize: 10 and maxBatchingWindow: 60s — following the same pattern documentloader uses for document processing. This requires new dedicated queues (live_embedding_queue, backfill_embedding_queue) with their own SNS subscriptions, DLQs, and event source mappings configured in CDK. The pattern is proven; the infrastructure is new.

Multiple DOCUMENT_PROCESSED messages arrive in one Lambda invocation. The Lambda chunks all documents, then batches the chunks into efficient Voyage API requests (up to 128 chunks per request).

10 documents per Lambda invocation
× 15 chunks per document
= 150 chunks
÷ 128 chunks per API request
= 2 API requests (128 + 22)

This means live ingest achieves the same batching efficiency as backfill.
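The batching arithmetic above can be expressed as a one-line helper (a sketch; the 15 chunks/doc figure is this document's assumption):

```python
import math

# Chunks pooled across one Lambda invocation, split into <=128-chunk
# Voyage API requests (sketch).
def api_requests(docs_per_invocation: int, chunks_per_doc: int = 15,
                 batch_limit: int = 128) -> int:
    total_chunks = docs_per_invocation * chunks_per_doc
    return math.ceil(total_chunks / batch_limit)

print(api_requests(10))  # 2: 150 chunks split as 128 + 22
```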

| Operation | Docs per Lambda | Chunks per API Request | Effective Docs/Min (Standard 300 req/min) |
|---|---|---|---|
| Backfill | Batched | Up to 128 | ~2,560 |
| Live ingest (batchSize=10) | Up to 10 | Up to 128 | ~2,560 |
| Search query | N/A | 1 (single query) | 300 queries/min |

Ongoing Ingest: Standard Rate Limits Are Sufficient

At ~2,560 docs/min with SQS batching:

- 2,560 docs/min × 60 min × 24 hr ≈ 3.7M docs/day
- ~110M docs/month capacity

New document imports (~2-5M docs/month) use ~2-5% of available throughput. Standard rate limits handle ongoing ingest with massive headroom.

Enterprise tier is NOT needed for ongoing ingest — only for accelerating full corpus backfill.

Backfill Time at Standard vs Enterprise Rate Limits

| Scope | Documents | Standard (2,560 docs/min) | Enterprise (85K docs/min) |
|---|---|---|---|
| Prototype (1 case) | 50K | ~20 min | <1 min |
| Pilot (10 cases) | 500K | ~3 hours | ~6 min |
| Phase 1 (100 cases) | 10M | ~2.7 days | ~2 hours |
| Phase 2 (NGE Discovery) | 78M | ~21 days | ~15 hours |
| Full corpus | 870M | ~236 days (~8 months) | ~7 days |

Phase 1 and Phase 2 are workable at standard rate limits (~3 days and ~3 weeks respectively). Full corpus backfill at standard rates takes ~8 months — enterprise tier or SageMaker needed if that timeline is unacceptable.

Action Item

Before committing to Phase 2 or full corpus backfill:

  1. Contact Voyage AI / MongoDB sales — negotiate enterprise pricing and rate limits for ~2.6T tokens (full corpus) or ~234B tokens (NGE only)
  2. Get volume discount — at this scale, expect significant discount off $0.12/M (possibly $0.06-0.08/M — speculative)
  3. Confirm rate limits — 10K req/min or higher needed for sub-2-week full corpus backfill
  4. Evaluate SageMaker as alternative — $0.22/M tokens but no rate limits, data stays in VPC, scale by adding GPU instances

For the prototype and Phase 1, standard pricing and rate limits are sufficient. Phase 1 (10M docs) costs ~$3,600 and completes in ~3 days even at standard rate limits. No enterprise negotiation is needed to start.

Assumptions and Validation Needed

| Assumption | Value Used | How to Validate |
|---|---|---|
| Voyage AI price | $0.12/M tokens (standard published) | Negotiate enterprise pricing for volume |
| Voyage AI rate limit (backfill time) | 10K req/min (enterprise, assumed) | Confirm with Voyage AI sales |
| NGE-enabled = ~10% of cases | ~87M docs | Query core DB for nge_enabled case count + doc volumes |
| Discovery = ~90% of NGE docs | ~78M docs | Query per-case DBs for document type distribution |
| ~15 chunks per document | Based on 7.4 avg pages | Sample 100 documents, run chunking, measure actual |
| ~200 tokens per chunk | Conservative estimate | Sample chunked documents, count actual tokens |
| ~2-5M new documents/month | Estimate | Pull from PSM/Athena: DOCUMENT_PROCESSED events per month |
| ~30-100K search queries/month | Estimate | No current semantic search; estimate from keyword search volume |

These estimates could shift 20-30% in either direction once validated against actual production data. The Voyage AI pricing is the single largest uncertainty — enterprise volume discounts could reduce the backfill cost by 30-50%.
