Skip to content

NPCase Database Optimization Analysis

Status: Initial research. Not presented or decided. Directional analysis.

Overview

Analysis of 20K+ NPCase databases for storage optimization in EDRM system. Current architecture uses S3 for documents, Elasticsearch for search, and Aurora RDS MySQL for metadata and relationships. Each case gets its own MySQL schema ({RDS_DBNAME}_case_{case_id}) with ~100 tables — all on a single Aurora writer instance in each region (~20 concurrent active imports).

Key insight: NGE writes to Legacy MySQL schemas by design (no big bang migration). Not everything in these schemas needs to be in MySQL. The question per table is: "Does Rails need to read this data?" If not, it can live elsewhere.

Throughput connection: This optimization primarily targets storage cost and schema complexity. Import throughput is addressed separately by: jitter fixes (deadlock cascades), bulk INSERT batching, early deduplication, and Rails → RDS Proxy (frees Aurora connection slots, enabling higher Lambda max_processing_daemons per batch).

Core Tables (Central Database — Keep in RDS)

These tables reside in the central database to manage accounts, users, cases, and system-wide jobs. They require high consistency and are foundational to multi-tenant architecture.

accounts, users, account_users, user_licenses
roles, actions, roles_actions
npcases, npcases_users, npcase_status_logs
batches, batch_parts, batch_sources, batch_processing_events
bates_patterns, custom_fields, standard_fields
delayed_jobs, processing_jobs_old, tracked_delayed_jobs
workers, elasticsearch_indexers, processing_alerts
ai_jobs, ai_job_chunks, ai_job_operations, ai_processing_metrics
production_templates, custom_report_types, grid_templates
identity_providers, identity_provider_accounts, identity_provider_users
billing_invoices, subscription_pass_through_settings
sessions, throttles, throttles_levels

NPCase-Specific Tables: Storage Analysis

Keep in RDS — Core Document & Case Structure

Table Reason
exhibits Core document metadata, heavily joined, complex queries
depositions, deposition_volumes Hierarchical parent-child structure
labels, custodians, categories Small reference tables, frequently joined
attachments (metadata) Highly relational, links to exhibits. Binaries already in S3
document_relations Graph-like queries, referential integrity needed
custodian_exhibits, deposition_volume_exhibits Pure relational many-to-many
batches, batch_parts, batch_sources ACID transactions critical for import workflow
custom_fields, tag_dedupe, doc_dedupe Schema definition + data integrity during import
process_checkpoints Checkpoint pipeline requires atomic operations
tags, taggings Must stay in RDS. NGE uses TagDedupe (MySQL atomic dedup). Legacy uses ES for existence checks (intentional, prevents worker race conditions). Making ES the primary store would create circular dependency. ES remains a derived view for search only.

Migrate from RDS — Document Content (Phase 1)

Table Current Issue Target Benefit
pages (content LONGTEXT) Full transcript pages stored in RDS DynamoDB Document-centric access, better scaling, biggest storage win

DynamoDB structure:

{
  "deposition_volume_id": 12345,
  "page_number": 1,
  "content": "Q. Please state your name...",
  "line_count": 25,
  "created_at": "2024-01-15T10:30:00Z"
}

This is the highest-ROI migration. pages content is the largest blob per case, access is document-centric (by deposition_volume_id + page_number), and it aligns with ADR-010 (deposition processing modernization).

User Annotations — Nuanced (Needs Further Analysis)

Corrected understanding: Annotations are dual-stored for NGE cases: - Nutrient (PSPDFKit) Postgres — source of truth for rendering. Stores full annotation content. - Our MySQL image_markups — metadata/pointers with annotation_id FK to Nutrient UUID, plus page_number, user_id, comment_root. Thin records for Rails queries.

Table NGE Cases Legacy Cases Migration Assessment
image_markups Thin pointers (annotation_id + metadata) Full annotation data (coordinates, content) Defer. NGE records are small. Legacy records shrink as cases migrate to NGE.
highlights Thin pointers to Nutrient Full highlight data Defer. Same as above.
comments, page_notes May have Nutrient annotation IDs Full comment data Needs verification. Check if NGE comments also dual-store.

Recommendation: Defer annotation migration. As cases convert to NGE, MySQL annotation records become thin pointers naturally. The storage problem shrinks without migration effort. Focus on pages content instead.

Consider Migration — Audit & Activity Logs (Phase 2)

Table Current Issue Target Benefit
confidentiality_logs, privilege_logs, relevancy_logs High-volume append-only audit data DynamoDB or CloudWatch Time-series queries, lower cost
user_activities, user_secured_activities Analytics-focused, time-series CloudWatch Logs Native time-series, retention policies
batch_processing_events High-volume append-only import events CloudWatch Logs Already have PSM via Athena for this
data_usage_summaries Time-series metrics DynamoDB Simple queries, time-based access

What Directly Impacts Import Throughput

These tables are NOT in the import write path — they're created during review, not import: - pages (depositions, not documents) - comments, page_notes (user-generated during review) - image_markups, highlights (annotations during review) - Audit logs (during review actions)

The import write path is: exhibits + attachments + tags + taggings + doc_dedupe + tag_dedupe + process_checkpoints + batch_sources — ALL stay in RDS.

Import throughput is addressed by:

Change Impact Mechanism
Add jitter to @retry_on_db_conflict HIGH Prevents deadlock cascades during concurrent bulk inserts
Add jitter to SQS requeue backoff HIGH Prevents retry storms from synchronized failures
Rails → RDS Proxy HIGH Frees Aurora connection slots → can increase Lambda max_processing_daemons per batch
Bulk INSERT batching HIGH Reduce transactions-per-document
Early deduplication HIGH Skip pipeline entirely for duplicate files
Off-hours dynamic concurrency MEDIUM More Aurora IOPS available when fewer cases active

Revised Migration Priority Matrix

Priority Tables Target Storage Impact Throughput Impact Complexity
High pages content DynamoDB HIGH (biggest blob) NONE (not in import path) Medium
High Rails → RDS Proxy N/A NONE HIGH (more connections → higher concurrency) Medium
Medium Audit logs CloudWatch/DynamoDB MEDIUM NONE Low
Defer image_markups, highlights Shrinks naturally via NGE migration LOW (NGE records are thin) NONE High
Keep exhibits, tags, batches, dedup tables RDS N/A N/A N/A
Skip tags → ES primary N/A N/A N/A Wrong direction

Expected Impact

Storage Optimization (Phase 1: pages + Phase 2: audit logs)

  • Per-case storage reduction: 40-60% (revised down — annotations deferred, tags stay)
  • Cost savings: Meaningful but needs validation against actual per-case storage profiles

Throughput Optimization (Rails → RDS Proxy + jitter + batching)

  • Connection headroom: RDS Proxy pools Rails connections → more Aurora connections available for Lambda
  • Higher concurrency: Can increase max_processing_daemons from default 6 → 10-12 per batch
  • Fewer deadlocks: Jitter prevents synchronized retries
  • Fewer wasted writes: Early dedup skips pipeline for duplicates

Estimates Needing Validation

  • What % of per-case storage is pages content vs exhibits metadata?
  • Are the 20K databases similar size, or do a few large cases dominate?
  • DynamoDB costs (read/write capacity + storage) need to be netted against RDS savings

Implementation Phases (Revised)

Phase 1: Throughput Quick Wins (1-2 weeks)

  1. Add jitter to @retry_on_db_conflict (one-line change)
  2. Add jitter to SQS requeue backoff (one-line change)
  3. Increase batch-processing queue visibility timeout to 990s (config change)

Phase 2: Rails → RDS Proxy (2-4 weeks)

  1. prepared_statements: false in database.yml
  2. Move SET statements to RDS Proxy initialization query
  3. Apply session pinning filters
  4. Deploy and monitor DatabaseConnectionsCurrentlySessionPinned
  5. If connection headroom improves: increase max_processing_daemons See: rails-rds-proxy-migration.md

Phase 3: pages Content → DynamoDB (3-6 months)

  1. Build DynamoDB access layer for deposition pages
  2. Dual-write during transition
  3. Migrate read operations
  4. Aligns with ADR-010 (deposition processing modernization)

Phase 4: Audit Logs → CloudWatch/DynamoDB (6-9 months)

  1. Audit logs to CloudWatch/DynamoDB
  2. Usage summaries to DynamoDB

Connection to Existing Architecture

Document Relationship
ADR-003 (per-case databases) This analysis shows which tables justify per-case schemas and which don't
ADR-008 (ES upgrade) Tags stay in MySQL; ES is derived view for search. No tag migration to ES.
ADR-010 (deposition processing) pages migration to DynamoDB directly supports deposition modernization
BACKLOG: Aurora write throughput Jitter + bulk INSERT + RDS Proxy address import throughput directly
BACKLOG: Rails → RDS Proxy Frees connection slots → higher Lambda concurrency → faster imports
Divergence map: TagDedupe/DocDedupe Dedup tables must stay in RDS (require atomic constraints)
Divergence map: Annotations Dual-stored (Nutrient + MySQL). NGE records are thin pointers.
Ask the Architecture ×

Ask questions about Nextpoint architecture, patterns, rules, or any module. Powered by Claude Opus 4.6.