NPCase Database Optimization Analysis¶

Status: Initial research. Not presented or decided. Directional analysis.

Overview¶

Analysis of 20K+ NPCase databases for storage optimization in EDRM system. Current architecture uses S3 for documents, Elasticsearch for search, and Aurora RDS MySQL for metadata and relationships. Each case gets its own MySQL schema ({RDS_DBNAME}_case_{case_id}) with ~100 tables — all on a single Aurora writer instance in each region (~20 concurrent active imports).

Key insight: NGE writes to Legacy MySQL schemas by design (no big bang migration). Not everything in these schemas needs to be in MySQL. The question per table is: "Does Rails need to read this data?" If not, it can live elsewhere.

Throughput connection: This optimization primarily targets storage cost and schema complexity. Import throughput is addressed separately by: jitter fixes (deadlock cascades), bulk INSERT batching, early deduplication, and Rails → RDS Proxy (frees Aurora connection slots, enabling higher Lambda max_processing_daemons per batch).

Core Tables (Central Database — Keep in RDS)¶

These tables reside in the central database to manage accounts, users, cases, and system-wide jobs. They require high consistency and are foundational to multi-tenant architecture.

accounts, users, account_users, user_licenses
roles, actions, roles_actions
npcases, npcases_users, npcase_status_logs
batches, batch_parts, batch_sources, batch_processing_events
bates_patterns, custom_fields, standard_fields
delayed_jobs, processing_jobs_old, tracked_delayed_jobs
workers, elasticsearch_indexers, processing_alerts
ai_jobs, ai_job_chunks, ai_job_operations, ai_processing_metrics
production_templates, custom_report_types, grid_templates
identity_providers, identity_provider_accounts, identity_provider_users
billing_invoices, subscription_pass_through_settings
sessions, throttles, throttles_levels

NPCase-Specific Tables: Storage Analysis¶

Keep in RDS — Core Document & Case Structure¶

Table	Reason
`exhibits`	Core document metadata, heavily joined, complex queries
`depositions`, `deposition_volumes`	Hierarchical parent-child structure
`labels`, `custodians`, `categories`	Small reference tables, frequently joined
`attachments` (metadata)	Highly relational, links to exhibits. Binaries already in S3
`document_relations`	Graph-like queries, referential integrity needed
`custodian_exhibits`, `deposition_volume_exhibits`	Pure relational many-to-many
`batches`, `batch_parts`, `batch_sources`	ACID transactions critical for import workflow
`custom_fields`, `tag_dedupe`, `doc_dedupe`	Schema definition + data integrity during import
`process_checkpoints`	Checkpoint pipeline requires atomic operations
`tags`, `taggings`	Must stay in RDS. NGE uses TagDedupe (MySQL atomic dedup). Legacy uses ES for existence checks (intentional, prevents worker race conditions). Making ES the primary store would create circular dependency. ES remains a derived view for search only.

Migrate from RDS — Document Content (Phase 1)¶

Table	Current Issue	Target	Benefit
`pages` (`content` LONGTEXT)	Full transcript pages stored in RDS	DynamoDB	Document-centric access, better scaling, biggest storage win

DynamoDB structure:

{
  "deposition_volume_id": 12345,
  "page_number": 1,
  "content": "Q. Please state your name...",
  "line_count": 25,
  "created_at": "2024-01-15T10:30:00Z"
}

This is the highest-ROI migration. pages content is the largest blob per case, access is document-centric (by deposition_volume_id + page_number), and it aligns with ADR-010 (deposition processing modernization).

User Annotations — Nuanced (Needs Further Analysis)¶

Corrected understanding: Annotations are dual-stored for NGE cases: - Nutrient (PSPDFKit) Postgres — source of truth for rendering. Stores full annotation content. - Our MySQL image_markups — metadata/pointers with annotation_id FK to Nutrient UUID, plus page_number, user_id, comment_root. Thin records for Rails queries.

Table	NGE Cases	Legacy Cases	Migration Assessment
`image_markups`	Thin pointers (annotation_id + metadata)	Full annotation data (coordinates, content)	Defer. NGE records are small. Legacy records shrink as cases migrate to NGE.
`highlights`	Thin pointers to Nutrient	Full highlight data	Defer. Same as above.
`comments`, `page_notes`	May have Nutrient annotation IDs	Full comment data	Needs verification. Check if NGE comments also dual-store.

Recommendation: Defer annotation migration. As cases convert to NGE, MySQL annotation records become thin pointers naturally. The storage problem shrinks without migration effort. Focus on pages content instead.

Consider Migration — Audit & Activity Logs (Phase 2)¶

Table	Current Issue	Target	Benefit
`confidentiality_logs`, `privilege_logs`, `relevancy_logs`	High-volume append-only audit data	DynamoDB or CloudWatch	Time-series queries, lower cost
`user_activities`, `user_secured_activities`	Analytics-focused, time-series	CloudWatch Logs	Native time-series, retention policies
`batch_processing_events`	High-volume append-only import events	CloudWatch Logs	Already have PSM via Athena for this
`data_usage_summaries`	Time-series metrics	DynamoDB	Simple queries, time-based access

What Directly Impacts Import Throughput¶

These tables are NOT in the import write path — they're created during review, not import: - pages (depositions, not documents) - comments, page_notes (user-generated during review) - image_markups, highlights (annotations during review) - Audit logs (during review actions)

The import write path is: exhibits + attachments + tags + taggings + doc_dedupe + tag_dedupe + process_checkpoints + batch_sources — ALL stay in RDS.

Import throughput is addressed by:

Change	Impact	Mechanism
Add jitter to @retry_on_db_conflict	HIGH	Prevents deadlock cascades during concurrent bulk inserts
Add jitter to SQS requeue backoff	HIGH	Prevents retry storms from synchronized failures
Rails → RDS Proxy	HIGH	Frees Aurora connection slots → can increase Lambda max_processing_daemons per batch
Bulk INSERT batching	HIGH	Reduce transactions-per-document
Early deduplication	HIGH	Skip pipeline entirely for duplicate files
Off-hours dynamic concurrency	MEDIUM	More Aurora IOPS available when fewer cases active

Revised Migration Priority Matrix¶

Priority	Tables	Target	Storage Impact	Throughput Impact	Complexity
High	`pages` content	DynamoDB	HIGH (biggest blob)	NONE (not in import path)	Medium
High	Rails → RDS Proxy	N/A	NONE	HIGH (more connections → higher concurrency)	Medium
Medium	Audit logs	CloudWatch/DynamoDB	MEDIUM	NONE	Low
Defer	`image_markups`, `highlights`	Shrinks naturally via NGE migration	LOW (NGE records are thin)	NONE	High
Keep	`exhibits`, `tags`, `batches`, dedup tables	RDS	N/A	N/A	N/A
Skip	`tags` → ES primary	N/A	N/A	N/A	Wrong direction

Expected Impact¶

Storage Optimization (Phase 1: `pages` + Phase 2: audit logs)¶

Per-case storage reduction: 40-60% (revised down — annotations deferred, tags stay)
Cost savings: Meaningful but needs validation against actual per-case storage profiles

Throughput Optimization (Rails → RDS Proxy + jitter + batching)¶

Connection headroom: RDS Proxy pools Rails connections → more Aurora connections available for Lambda
Higher concurrency: Can increase max_processing_daemons from default 6 → 10-12 per batch
Fewer deadlocks: Jitter prevents synchronized retries
Fewer wasted writes: Early dedup skips pipeline for duplicates

Estimates Needing Validation¶

What % of per-case storage is pages content vs exhibits metadata?
Are the 20K databases similar size, or do a few large cases dominate?
DynamoDB costs (read/write capacity + storage) need to be netted against RDS savings

Implementation Phases (Revised)¶

Phase 1: Throughput Quick Wins (1-2 weeks)¶

Add jitter to @retry_on_db_conflict (one-line change)
Add jitter to SQS requeue backoff (one-line change)
Increase batch-processing queue visibility timeout to 990s (config change)

Phase 2: Rails → RDS Proxy (2-4 weeks)¶

prepared_statements: false in database.yml
Move SET statements to RDS Proxy initialization query
Apply session pinning filters
Deploy and monitor DatabaseConnectionsCurrentlySessionPinned
If connection headroom improves: increase max_processing_daemons See: rails-rds-proxy-migration.md

Phase 3: `pages` Content → DynamoDB (3-6 months)¶

Build DynamoDB access layer for deposition pages
Dual-write during transition
Migrate read operations
Aligns with ADR-010 (deposition processing modernization)

Phase 4: Audit Logs → CloudWatch/DynamoDB (6-9 months)¶

Audit logs to CloudWatch/DynamoDB
Usage summaries to DynamoDB

Connection to Existing Architecture¶

Document	Relationship
ADR-003 (per-case databases)	This analysis shows which tables justify per-case schemas and which don't
ADR-008 (ES upgrade)	Tags stay in MySQL; ES is derived view for search. No tag migration to ES.
ADR-010 (deposition processing)	`pages` migration to DynamoDB directly supports deposition modernization
BACKLOG: Aurora write throughput	Jitter + bulk INSERT + RDS Proxy address import throughput directly
BACKLOG: Rails → RDS Proxy	Frees connection slots → higher Lambda concurrency → faster imports
Divergence map: TagDedupe/DocDedupe	Dedup tables must stay in RDS (require atomic constraints)
Divergence map: Annotations	Dual-stored (Nutrient + MySQL). NGE records are thin pointers.

Ask the Architecture ×

Ask questions about Nextpoint architecture, patterns, rules, or any module. Powered by Claude Opus 4.6.