NPCase Database Optimization Analysis¶
Status: Initial research. Not presented or decided. Directional analysis.
Overview¶
Analysis of 20K+ NPCase databases for storage optimization in EDRM system. Current architecture
uses S3 for documents, Elasticsearch for search, and Aurora RDS MySQL for metadata and
relationships. Each case gets its own MySQL schema ({RDS_DBNAME}_case_{case_id}) with ~100
tables — all on a single Aurora writer instance in each region (~20 concurrent active imports).
Key insight: NGE writes to Legacy MySQL schemas by design (no big bang migration). Not everything in these schemas needs to be in MySQL. The question per table is: "Does Rails need to read this data?" If not, it can live elsewhere.
Throughput connection: This optimization primarily targets storage cost and schema complexity.
Import throughput is addressed separately by: jitter fixes (deadlock cascades), bulk INSERT
batching, early deduplication, and Rails → RDS Proxy (frees Aurora connection slots, enabling
higher Lambda max_processing_daemons per batch).
Core Tables (Central Database — Keep in RDS)¶
These tables reside in the central database to manage accounts, users, cases, and system-wide jobs. They require high consistency and are foundational to the multi-tenant architecture.
accounts, users, account_users, user_licenses
roles, actions, roles_actions
npcases, npcases_users, npcase_status_logs
batches, batch_parts, batch_sources, batch_processing_events
bates_patterns, custom_fields, standard_fields
delayed_jobs, processing_jobs_old, tracked_delayed_jobs
workers, elasticsearch_indexers, processing_alerts
ai_jobs, ai_job_chunks, ai_job_operations, ai_processing_metrics
production_templates, custom_report_types, grid_templates
identity_providers, identity_provider_accounts, identity_provider_users
billing_invoices, subscription_pass_through_settings
sessions, throttles, throttles_levels
NPCase-Specific Tables: Storage Analysis¶
Keep in RDS — Core Document & Case Structure¶
| Table | Reason |
|---|---|
| exhibits | Core document metadata, heavily joined, complex queries |
| depositions, deposition_volumes | Hierarchical parent-child structure |
| labels, custodians, categories | Small reference tables, frequently joined |
| attachments (metadata) | Highly relational, links to exhibits. Binaries already in S3 |
| document_relations | Graph-like queries, referential integrity needed |
| custodian_exhibits, deposition_volume_exhibits | Pure relational many-to-many |
| batches, batch_parts, batch_sources | ACID transactions critical for import workflow |
| custom_fields, tag_dedupe, doc_dedupe | Schema definition + data integrity during import |
| process_checkpoints | Checkpoint pipeline requires atomic operations |
| tags, taggings | Must stay in RDS. NGE uses TagDedupe (MySQL atomic dedup). Legacy uses ES for existence checks (intentional, prevents worker race conditions). Making ES the primary store would create a circular dependency. ES remains a derived view for search only. |
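The MySQL-side atomicity that keeps the dedup tables in RDS can be sketched as a duplicate-key-guarded insert. Table, column, and index names below are illustrative, not the actual schema:

```sql
-- Sketch, assuming a unique index on (npcase_id, normalized_name).
-- The unique key makes MySQL the arbiter: exactly one of N concurrent
-- import workers "creates" the tag; the rest see a duplicate.
INSERT INTO tag_dedupe (npcase_id, normalized_name, tag_id)
VALUES (42, 'hot-doc', 1001)
ON DUPLICATE KEY UPDATE tag_id = tag_id;
-- ROW_COUNT() = 1 -> this worker inserted the row (it owns tag creation);
-- 0 -> the row already existed unchanged (another worker won the race).
```

An ES-first existence check cannot give this guarantee, since the index is refreshed asynchronously from the same MySQL data it would be gating.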
Migrate from RDS — Document Content (Phase 1)¶
| Table | Current Issue | Target | Benefit |
|---|---|---|---|
| pages (content LONGTEXT) | Full transcript pages stored in RDS | DynamoDB | Document-centric access, better scaling, biggest storage win |
DynamoDB structure:

```json
{
  "deposition_volume_id": 12345,
  "page_number": 1,
  "content": "Q. Please state your name...",
  "line_count": 25,
  "created_at": "2024-01-15T10:30:00Z"
}
```
This is the highest-ROI migration. pages content is the largest blob per case, access is
document-centric (by deposition_volume_id + page_number), and it aligns with ADR-010
(deposition processing modernization).
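A minimal sketch of the proposed access pattern, assuming a table keyed on deposition_volume_id (partition) and page_number (sort). Attribute names follow the item shape above; the table name and client calls are assumptions, not a final design:

```ruby
require "time"

# Builds the DynamoDB item for one transcript page. Kept as a pure
# function so the key schema (deposition_volume_id = partition key,
# page_number = sort key) is explicit and easy to test.
def build_page_item(deposition_volume_id:, page_number:, content:, line_count:)
  {
    "deposition_volume_id" => deposition_volume_id, # partition key
    "page_number"          => page_number,          # sort key
    "content"              => content,
    "line_count"           => line_count,
    "created_at"           => Time.now.utc.iso8601
  }
end

# Hypothetical write path (table name assumed):
#   dynamodb = Aws::DynamoDB::Client.new
#   dynamodb.put_item(table_name: "deposition_pages",
#                     item: build_page_item(...))
# Reads are a Query on deposition_volume_id, optionally ranged on
# page_number -- matching the document-centric access noted above.
```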
User Annotations — Nuanced (Needs Further Analysis)¶
Corrected understanding: Annotations are dual-stored for NGE cases:
- Nutrient (PSPDFKit) Postgres — source of truth for rendering. Stores full annotation content.
- Our MySQL image_markups — metadata/pointers with annotation_id FK to Nutrient UUID,
plus page_number, user_id, comment_root. Thin records for Rails queries.
| Table | NGE Cases | Legacy Cases | Migration Assessment |
|---|---|---|---|
| image_markups | Thin pointers (annotation_id + metadata) | Full annotation data (coordinates, content) | Defer. NGE records are small. Legacy records shrink as cases migrate to NGE. |
| highlights | Thin pointers to Nutrient | Full highlight data | Defer. Same as above. |
| comments, page_notes | May have Nutrient annotation IDs | Full comment data | Needs verification. Check if NGE comments also dual-store. |
Recommendation: Defer annotation migration. As cases convert to NGE, MySQL annotation
records become thin pointers naturally. The storage problem shrinks without migration effort.
Focus on pages content instead.
Consider Migration — Audit & Activity Logs (Phase 2)¶
| Table | Current Issue | Target | Benefit |
|---|---|---|---|
| confidentiality_logs, privilege_logs, relevancy_logs | High-volume append-only audit data | DynamoDB or CloudWatch | Time-series queries, lower cost |
| user_activities, user_secured_activities | Analytics-focused, time-series | CloudWatch Logs | Native time-series, retention policies |
| batch_processing_events | High-volume append-only import events | CloudWatch Logs | Already have PSM via Athena for this |
| data_usage_summaries | Time-series metrics | DynamoDB | Simple queries, time-based access |
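If DynamoDB is chosen for the audit tables, a time-series key design along these lines keeps per-case, per-log-type range queries cheap. All attribute names below are hypothetical:

```json
{
  "pk": "case_12345#confidentiality_log",
  "sk": "2024-01-15T10:30:00Z#evt_8f2a",
  "user_id": 678,
  "action": "marked_confidential",
  "exhibit_id": 9912
}
```

Partitioning on case + log type and sorting on timestamp mirrors how these logs are appended and how they would be read back (recent activity for one case).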
What Directly Impacts Import Throughput¶
These tables are NOT in the import write path — they're created during review, not import:
- pages (depositions, not documents)
- comments, page_notes (user-generated during review)
- image_markups, highlights (annotations during review)
- Audit logs (during review actions)
The import write path is: exhibits + attachments + tags + taggings + doc_dedupe +
tag_dedupe + process_checkpoints + batch_sources — ALL stay in RDS.
Import throughput is addressed by:
| Change | Impact | Mechanism |
|---|---|---|
| Add jitter to @retry_on_db_conflict | HIGH | Prevents deadlock cascades during concurrent bulk inserts |
| Add jitter to SQS requeue backoff | HIGH | Prevents retry storms from synchronized failures |
| Rails → RDS Proxy | HIGH | Frees Aurora connection slots → can increase Lambda max_processing_daemons per batch |
| Bulk INSERT batching | HIGH | Reduce transactions-per-document |
| Early deduplication | HIGH | Skip pipeline entirely for duplicate files |
| Off-hours dynamic concurrency | MEDIUM | More Aurora IOPS available when fewer cases active |
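The two jitter rows above can be as small as randomizing the retry delay. A sketch of full jitter (constant names and defaults assumed, not taken from the codebase):

```ruby
# Full jitter: sleep a uniformly random fraction of the capped exponential
# backoff, so workers that fail together desynchronize instead of retrying
# (and deadlocking) in lockstep.
def retry_delay_with_jitter(attempt, base: 0.5, cap: 30.0)
  rand * [cap, base * (2**attempt)].min
end

# attempt 0 -> up to 0.5s, attempt 3 -> up to 4s, attempt 10 -> capped at 30s
```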
Revised Migration Priority Matrix¶
| Priority | Tables | Target | Storage Impact | Throughput Impact | Complexity |
|---|---|---|---|---|---|
| High | pages content | DynamoDB | HIGH (biggest blob) | NONE (not in import path) | Medium |
| High | Rails → RDS Proxy | N/A | NONE | HIGH (more connections → higher concurrency) | Medium |
| Medium | Audit logs | CloudWatch/DynamoDB | MEDIUM | NONE | Low |
| Defer | image_markups, highlights | Shrinks naturally via NGE migration | LOW (NGE records are thin) | NONE | High |
| Keep | exhibits, tags, batches, dedup tables | RDS | N/A | N/A | N/A |
| Skip | tags → ES primary | N/A | N/A | N/A | Wrong direction |
Expected Impact¶
Storage Optimization (Phase 1: pages + Phase 2: audit logs)¶
- Per-case storage reduction: 40-60% (revised down; annotations deferred, tags stay in RDS)
- Cost savings: Meaningful but needs validation against actual per-case storage profiles
Throughput Optimization (Rails → RDS Proxy + jitter + batching)¶
- Connection headroom: RDS Proxy pools Rails connections → more Aurora connections available for Lambda
- Higher concurrency: Can increase max_processing_daemons from the default 6 to 10-12 per batch
- Fewer deadlocks: Jitter prevents synchronized retries
- Fewer wasted writes: Early dedup skips pipeline for duplicates
Estimates Needing Validation¶
- What % of per-case storage is pages content vs exhibits metadata?
- Are the 20K databases similar size, or do a few large cases dominate?
- DynamoDB costs (read/write capacity + storage) need to be netted against RDS savings
Implementation Phases (Revised)¶
Phase 1: Throughput Quick Wins (1-2 weeks)¶
- Add jitter to @retry_on_db_conflict (one-line change)
- Add jitter to SQS requeue backoff (one-line change)
- Increase batch-processing queue visibility timeout to 990s (config change)
Phase 2: Rails → RDS Proxy (2-4 weeks)¶
- Set prepared_statements: false in database.yml
- Move SET statements to the RDS Proxy initialization query
- Apply session pinning filters
- Deploy and monitor DatabaseConnectionsCurrentlySessionPinned
- If connection headroom improves: increase max_processing_daemons

See: rails-rds-proxy-migration.md
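The first two Phase 2 steps amount to a small config change. A sketch (the endpoint variable is a hypothetical name, not the actual setting):

```yaml
# config/database.yml (sketch). RDS Proxy pins a session when it sees
# per-connection state such as prepared statements, so disable them.
production:
  adapter: mysql2
  host: <%= ENV["RDS_PROXY_ENDPOINT"] %>  # point Rails at the proxy, not Aurora
  prepared_statements: false
```

SET statements (time zone, sql_mode, etc.) move into the proxy's initialization query for the same reason: a session-level SET issued by Rails pins the connection and defeats pooling.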
Phase 3: pages Content → DynamoDB (3-6 months)¶
- Build DynamoDB access layer for deposition pages
- Dual-write during transition
- Migrate read operations
- Aligns with ADR-010 (deposition processing modernization)
Phase 4: Audit Logs → CloudWatch/DynamoDB (6-9 months)¶
- Audit logs to CloudWatch/DynamoDB
- Usage summaries to DynamoDB
Connection to Existing Architecture¶
| Document | Relationship |
|---|---|
| ADR-003 (per-case databases) | This analysis shows which tables justify per-case schemas and which don't |
| ADR-008 (ES upgrade) | Tags stay in MySQL; ES is derived view for search. No tag migration to ES. |
| ADR-010 (deposition processing) | pages migration to DynamoDB directly supports deposition modernization |
| BACKLOG: Aurora write throughput | Jitter + bulk INSERT + RDS Proxy address import throughput directly |
| BACKLOG: Rails → RDS Proxy | Frees connection slots → higher Lambda concurrency → faster imports |
| Divergence map: TagDedupe/DocDedupe | Dedup tables must stay in RDS (require atomic constraints) |
| Divergence map: Annotations | Dual-stored (Nutrient + MySQL). NGE records are thin pointers. |