
Engineering Backlog

Cross-cutting implementation gaps identified during architecture reviews. Grouped by theme, linked to relevant patterns and ADRs.

#1 Pain Point: Large import throughput. Priority stack by throughput impact:

| Item | Impact | Effort |
| --- | --- | --- |
| Add jitter to @retry_on_db_conflict | HIGH — prevents deadlock cascades | One-line change |
| Add jitter to SQS requeue backoff | HIGH — prevents retry storms | One-line change |
| Profile Aurora write throughput (bulk INSERT, early dedup) | HIGH | Investigation |
| Heartbeat for checkpoint pipeline | MEDIUM — prevents duplicate processing | Medium |
| Visibility timeout → 990s | MEDIUM — prevents rare duplicates | Config change |
| Rails → RDS Proxy (connection pinning) | MEDIUM — partial benefit (core DB only) | Medium |
| Direct-to-Aurora for new modules | MEDIUM — future modules | Per-ADR decision |

Observability

Elasticsearch

  • CRITICAL: Fix ES hot partition problem — indexes growing to 1TB+. New index creation is only checked at case creation time (30GB shard threshold). New cases all land on the latest index and grow simultaneously → 300GB-1TB monster indexes. A large import on one case degrades search for ALL co-located cases. Fix: periodic shard size check (background job) or ILM-based rollover. This is the biggest ES operational issue. Related: elasticsearch-cluster-management, ADR-008
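
The periodic shard-size check could be a scheduled job over index stats. A minimal sketch of the decision logic, where the stats shape is a simplified stand-in for the ES /_stats response and the 30GB threshold mirrors the existing case-creation-time check:

```python
# Decide which indexes need rollover based on average primary-shard size.
# The input shape is a simplified stand-in for the ES /_stats API response.
SHARD_ROLLOVER_BYTES = 30 * 1024**3  # 30GB per primary shard, per the existing threshold

def indexes_needing_rollover(index_stats: dict) -> list:
    """Return index names whose average primary-shard size exceeds the threshold.

    index_stats maps index name -> {"primary_size_bytes": int, "primary_shards": int}.
    """
    over = []
    for name, stats in index_stats.items():
        per_shard = stats["primary_size_bytes"] / max(stats["primary_shards"], 1)
        if per_shard > SHARD_ROLLOVER_BYTES:
            over.append(name)
    return over
```

A background job would feed this from the cluster stats API and trigger rollover (or simply create the next index) for anything returned; ILM-based rollover achieves the same without custom code.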

  • Migrate FilterSearch from MySQL JOINs to ES queries (prerequisite for ADR-011). Document browse/filter uses MySQL JOINs on tags/taggings for ALL cases (NGE and Legacy). Must migrate to ES queries before stopping MySQL tag writes. Compatible with AG Grid infinite scroll. Related: rails/app/models/filter_search.rb, ADR-008, ADR-011

Throughput / Scaling

  • Evaluate direct-to-Aurora (via RDS Proxy) for new NGE modules instead of routing through the Rails API. Currently only documentloader uses RDS Proxy directly. All other NGE modules (extractor, uploader, exporter, exchanger) route through Rails HTTP APIs → Rails ActiveRecord → Aurora. Rails becomes the throughput bottleneck: Puma thread limits, the ActiveRecord connection pool, and HTTP overhead per call. With 20K+ npcase databases on one Aurora cluster, the shared writer instance is already the aggregate throughput constraint; adding Rails as an intermediary compounds it. Trade-off: direct-to-Aurora eliminates the Rails bottleneck but duplicates model logic in Python. Mitigations: shared SQLAlchemy models in the shared_libs Python package, RDS Proxy handles connection pooling, and per-case schema resolution already exists in documentloader. Decision needed per ADR: for each new module (ADRs 005-010), decide direct-to-Aurora (like documentloader) or through-Rails (like current modules). Recommend direct-to-Aurora for write-heavy modules (bates stamping ADR-006, bulk ops ADR-005). Related: lambda-sqs-integration, database-session
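
For illustration only, per-case connection resolution through RDS Proxy might look like the following. The proxy endpoint and the npcase_<id> schema naming are assumptions, not the real documentloader configuration:

```python
# Hypothetical sketch: resolve which schema a new module should target when
# talking to Aurora through RDS Proxy, mirroring the per-case schema
# resolution the backlog item says documentloader already has.
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseConnection:
    proxy_endpoint: str
    schema: str

def resolve_case_connection(case_id: int,
                            proxy_endpoint: str = "nge-proxy.example.internal") -> CaseConnection:
    # Assumed convention: one schema per case, e.g. npcase_12345.
    return CaseConnection(proxy_endpoint=proxy_endpoint, schema=f"npcase_{case_id}")
```

A module would then open a pooled connection to the proxy endpoint and switch to the resolved schema; this is the piece that keeps model logic out of Rails while still centralizing connection management.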

  • Profile and optimize aggregate Aurora write throughput for large imports. 20K+ npcase databases share one Aurora writer instance, so large imports from multiple cases simultaneously compete for write IOPS. Investigate: Aurora I/O-Optimized pricing tier, bulk INSERT batching (reduce transactions-per-document), off-hours dynamic concurrency scaling, and early deduplication to skip the pipeline for duplicate files. Related: concurrency-patterns
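
Bulk INSERT batching mostly comes down to amortizing transactions. A sketch of the chunking half; the cursor usage is shown as a comment because the real DB layer may differ:

```python
# Group rows into fixed-size batches so each batch becomes one INSERT and one
# commit, instead of one transaction per document.
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[tuple], size: int = 500) -> Iterator[list]:
    """Yield rows in lists of at most `size`."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Usage sketch (cursor/connection names assumed, e.g. a PyMySQL cursor):
# for chunk in batched(document_rows):
#     cursor.executemany("INSERT INTO documents (id, name) VALUES (%s, %s)", chunk)
#     connection.commit()  # one transaction per batch, not per document
```

The batch size is a tuning knob; profiling should pick it against Aurora's observed write IOPS rather than guessing.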

  • Migrate Rails to RDS Proxy for connection pooling. Currently Rails connects directly to Aurora. RDS Proxy would pool connections for the core DB (accounts, users, batches), but PerCaseModel connections will be pinned (schema switching causes pinning). Requires: prepared_statements: false, moving SET statements to the proxy init query, and accepting pinning for case-specific connections. Partial benefit — core connections multiplex, case connections don't. Full benefit requires reducing PerCaseModel reliance (longer-term). Related: rails-rds-proxy-migration

  • Legacy-only: ES tag existence checks cause throttling on large load file imports. Legacy uses ES queries per-row, per-custom-field-value to check tag existence — intentional, to prevent race conditions from parallel workers. This causes millions of ES queries on large imports (100K+ rows × 50+ columns). NGE already solved this: documentloader uses a TagDedupe table (SHA256 hash + PK constraint) for atomic MySQL dedup, plus INSERT IGNORE for taggings — zero ES queries in the write path. The problem goes away as cases convert to NGE. For new modules (ADR-005 bulk ops): follow documentloader's TagDedupe pattern, not Legacy's ES check pattern. Related: documentloader shell/tags_ops.py, shell/taggings_ops.py, ADR-005
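
A sketch of the TagDedupe idea for new modules to copy; the table and column names below are illustrative, not documentloader's real schema:

```python
# SHA256 over the tag tuple gives a stable key; the PK constraint on that key
# makes INSERT IGNORE an atomic "first writer wins" check in MySQL, so the
# write path needs zero ES queries.
import hashlib

def tag_dedupe_key(case_id: int, field: str, value: str) -> str:
    """Stable SHA256 key for a (case, field, value) tag tuple."""
    return hashlib.sha256(f"{case_id}:{field}:{value}".encode()).hexdigest()

# Write-path sketch (cursor assumed to be a MySQL DB-API cursor):
# key = tag_dedupe_key(case_id, field, value)
# cursor.execute("INSERT IGNORE INTO tag_dedupe (hash_pk) VALUES (%s)", (key,))
# if cursor.rowcount == 1:   # we won the race: this worker creates the tag
#     cursor.execute("INSERT INTO tags (...) VALUES (...)")
# cursor.execute("INSERT IGNORE INTO taggings (...) VALUES (...)")
```

The race-condition safety Legacy gets from ES checks comes instead from the database's uniqueness guarantee, which parallel workers can rely on without coordination.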

Resilience

  • Review reservedConcurrency across all NGE Lambdas. Ensure dual concurrency control (ESM MaximumConcurrency + ReservedConcurrency) is applied consistently. Invariant: MaximumConcurrency <= ReservedConcurrency. Prevents cross-module starvation when one module spikes. Related: concurrency-patterns, lambda-sqs-integration, article review
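
The invariant is easy to assert in a config-review or deploy-time check. A sketch (the minimum of 2 reflects the allowed range for SQS event source mapping MaximumConcurrency):

```python
# Deploy-time guard for the dual concurrency invariant on each Lambda:
# ESM MaximumConcurrency must be >= 2 (SQS ESM minimum) and must not exceed
# the function's ReservedConcurrency.
def check_dual_concurrency(esm_max: int, reserved: int) -> None:
    if esm_max < 2:
        raise ValueError("ESM MaximumConcurrency minimum is 2")
    if esm_max > reserved:
        raise ValueError(
            f"Invariant violated: MaximumConcurrency ({esm_max}) "
            f"> ReservedConcurrency ({reserved})"
        )
```

Running this over every module's settings in CI turns a quiet misconfiguration into a failed build.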

  • Increase batch-processing queue visibility timeout to 990s. The current 900s doesn't account for the 60s batch window. Formula: Lambda_timeout + BatchWindow + 30s buffer = 990s. Related: lambda-sqs-integration, article review
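
The formula as a trivial helper, usable as a config-validation assertion alongside the concurrency checks:

```python
# Visibility timeout must cover the worst case: full Lambda timeout plus the
# batch window the ESM may hold messages for, plus a safety buffer.
def required_visibility_timeout(lambda_timeout_s: int, batch_window_s: int,
                                buffer_s: int = 30) -> int:
    return lambda_timeout_s + batch_window_s + buffer_s
```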

  • Audit maxReceiveCount across all NGE SQS queues (ensure >= 5). Low maxReceiveCount + throttling bursts = valid messages in the DLQ. AWS recommends >= 5. Related: lambda-sqs-integration, article review

  • Add heartbeat extension for documentloader checkpoint pipeline. Prevents duplicate processing when long-running checkpoints exceed the SQS visibility timeout. Related: checkpoint-pipeline, sqs-handler
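
A minimal heartbeat sketch assuming a boto3-style SQS client; the real checkpoint pipeline's interfaces and timings may differ:

```python
# While checkpoint work is running, periodically call ChangeMessageVisibility
# so the message doesn't become visible (and get redelivered) mid-processing.
import threading

class VisibilityHeartbeat:
    """Context manager that extends a message's visibility on an interval."""

    def __init__(self, sqs_client, queue_url: str, receipt_handle: str,
                 extend_to_s: int = 900, interval_s: float = 300.0):
        self._args = dict(QueueUrl=queue_url, ReceiptHandle=receipt_handle,
                          VisibilityTimeout=extend_to_s)
        self._client = sqs_client
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # wait() returns False on timeout, True once stop is set.
        while not self._stop.wait(self._interval):
            self._client.change_message_visibility(**self._args)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Usage would wrap the long-running checkpoint step: `with VisibilityHeartbeat(sqs, url, handle): run_checkpoint()`. The interval should be comfortably below the current visibility timeout.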

  • Add full jitter to @retry_on_db_conflict decorator. Pure exponential backoff without jitter causes a thundering herd during MySQL deadlock storms on bulk imports. Change: delay = random.uniform(0, base_delay * (2 ** attempt)). Related: retry-and-resilience, article review
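
A hypothetical full-jitter version of the decorator, following the formula above. The exception types, defaults, and signature are illustrative, not the real decorator's:

```python
# Full jitter: each retry sleeps a uniform random amount in [0, cap), where
# the cap grows exponentially. Colliding workers spread out instead of
# retrying in lockstep.
import functools
import random
import time

def retry_on_db_conflict(max_attempts: int = 5, base_delay: float = 0.1,
                         retryable=(Exception,)):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
        return wrapper
    return decorator
```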

  • Add full jitter to SQS requeue backoff. Same thundering herd risk at the message level: all messages that fail simultaneously retry at the same intervals. Change: delay = random.uniform(0, min(base_delay * (2 ** retry_count), max_delay)). Related: retry-and-resilience, article review
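
The requeue formula as a helper; the default base and cap here are placeholder assumptions, not the queue's current settings:

```python
# Capped full-jitter delay for requeued messages: the exponential term is
# clamped to max_delay, then a uniform draw spreads simultaneous failures out.
import random

def requeue_delay(retry_count: int, base_delay: float = 1.0,
                  max_delay: float = 300.0) -> float:
    return random.uniform(0, min(base_delay * (2 ** retry_count), max_delay))
```

The result would feed SQS's DelaySeconds (or message-level delay) when re-enqueueing.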

  • Add timeout + jittered retry to NgePageService HTTP calls. The sync path has no timeout and no retry. Set connection and request timeouts based on p99.9 latency, and add retry with jittered backoff for transient failures. Related: retry-and-resilience, circuit-breaker, article review
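
One way to sketch this is to decouple the retry logic from the HTTP client. TransientHTTPError and the timeout values below are illustrative stand-ins, not NgePageService's actual types:

```python
# Generic jittered-retry wrapper: the caller supplies a zero-arg callable that
# already carries its own timeout, so retry policy and transport stay separate.
import random
import time

class TransientHTTPError(Exception):
    """Stand-in for timeouts/5xx responses worth retrying."""

def call_with_jittered_retry(call, max_attempts: int = 3, base_delay: float = 0.25):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientHTTPError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Usage sketch: always pass an explicit timeout on the underlying call, e.g.
#   call_with_jittered_retry(lambda: session.get(url, timeout=(3.05, 10)))
# with the (connect, read) values derived from observed p99.9 latency.
```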

  • Add per-case backpressure for large batch imports. A single large case import can starve other cases; needs admission control or per-case rate limiting. Related: concurrency-patterns, sqs-operations
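
A minimal in-process sketch of per-case admission control. The in-flight cap is an arbitrary placeholder, and a real implementation would likely sit behind SQS requeueing (rejected work goes back on the queue with delay) rather than in memory:

```python
# Cap concurrent in-flight batch work per case so one giant import can't
# monopolize workers shared across cases.
import threading
from collections import defaultdict

class PerCaseLimiter:
    def __init__(self, max_inflight_per_case: int = 4):
        self._limit = max_inflight_per_case
        self._inflight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, case_id: int) -> bool:
        with self._lock:
            if self._inflight[case_id] >= self._limit:
                return False  # caller should requeue with delay, not process
            self._inflight[case_id] += 1
            return True

    def release(self, case_id: int) -> None:
        with self._lock:
            self._inflight[case_id] -= 1
```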

Event Architecture

  • Add event schema validation (Pydantic models) for new modules. There is no enforced schema on SNS messages today — implicit coupling between producers and consumers. As the module count grows (ADRs 005-010), schema drift risk increases. Related: sns-event-publishing, article review
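
Until the Pydantic models exist, the shape can be sketched with a stdlib dataclass; the event and field names below are invented for illustration:

```python
# One declared, versioned shape per event type that both producer and consumer
# validate against. A real version would be a Pydantic BaseModel; this stdlib
# stand-in shows the same idea with a type check at construction time.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DocumentProcessedV1:
    event_type: str
    case_id: int
    document_id: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )
```

Producers serialize only validated instances; consumers parse into the same class, so drift fails loudly at the boundary instead of silently downstream.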

  • Create event-catalog.md listing all event types, producers, and consumers. There is no central registry of events across modules. Needed for debugging event flows and onboarding new modules. Related: sns-event-publishing, article review

  • Evaluate Step Functions for ADR-005 (bulk ops saga) and ADR-010 (deposition orchestration). Pure choreography works for the linear pipeline, but bulk operations need compensation/rollback (saga) and depositions have sequential dependencies. Hybrid approach: choreography by default, orchestration for sagas. Related: adr/005, adr/010, article review

Data Integrity

  • Ensure bates number increment is idempotent (fencing token pattern) for ADR-006. Bates numbers are a legal correctness concern — duplicates are compliance failures. Apply Kleppmann's fencing token pattern: BatesPattern.next_number is the natural fencing token, and stamp writes must be idempotent (skip if an annotation with that bates number already exists). Related: adr/006, article review
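
The skip-if-already-stamped rule in miniature, with an in-memory dict standing in for the real annotations table:

```python
# The number handed out by BatesPattern.next_number acts as the fencing token:
# whatever holds that number wins, and any retry or duplicate delivery that
# arrives with the same number becomes a no-op.
def stamp_document(document_id: int, bates_number: int, annotations: dict) -> bool:
    """Idempotently record a bates stamp; returns True only on first write."""
    if bates_number in annotations:
        return False  # already stamped: duplicate delivery or retry, skip
    annotations[bates_number] = document_id
    return True
```

In the real service the existence check and insert would be a single atomic database operation (e.g. a unique constraint on the bates number), not two steps.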

  • Verify InnoDB lock_wait_timeout vs Lambda timeout alignment for stamp service. If the Lambda pauses beyond innodb_lock_wait_timeout, the SELECT ... FOR UPDATE on BatesPattern can release, allowing duplicate number assignment. Ensure Lambda timeout < lock_wait_timeout. Related: adr/006, article review

  • Ensure bates stamp Lambda uses the Aurora writer endpoint for SELECT ... FOR UPDATE. Aurora Write Forwarding does NOT support SELECT ... FOR UPDATE, so the bates increment must go through the primary writer instance, not reader endpoints. If the stamp Lambda uses a reader for general reads, it must explicitly switch to the writer for the atomic increment. Related: adr/006, article review

Idempotency

  • Evaluate Lambda Powertools @idempotent decorator for new modules. Standardize the idempotency implementation instead of hand-rolling it per module. Related: idempotent-handlers

  • Standardize idempotency test pattern. Call the handler twice with the same input and verify no duplicates or side effects. Apply across all modules. Related: idempotent-handlers
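
The pattern in miniature, with `handler` and `store` as illustrative stand-ins for a module's real entry point and datastore:

```python
# The standardized test shape: invoke the handler twice with the identical
# event and assert the observable state is the same as after one invocation.
def handler(event: dict, store: set) -> None:
    store.add(event["document_id"])  # idempotent here because sets dedupe

def test_handler_is_idempotent():
    store = set()
    event = {"document_id": 123}
    handler(event, store)
    handler(event, store)  # second delivery of the same event
    assert store == {123}  # no duplicate side effects
```

Each module's version would swap in its real handler and assert against its real datastore (row counts, written objects, published events).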

Claude Code Skills

  • Add eval cases to architecture repo skills. Skill-creator (March 2026 update) now supports eval mode, benchmark mode, and blind A/B testing. Define test prompts + expected outputs for each skill:
  • exploring-module: test with a known module, verify it finds correct patterns and integration points
  • reviewing-architecture: test with a module that violates hexagonal boundaries, verify detection
  • writing-reference-impl: test with a documented module, verify output matches the reference format. Related: article review

  • Run trigger optimization (Improve mode) on skill frontmatter descriptions. Skills may not trigger when users use natural language instead of explicit skill names (e.g., "check this module" vs "use the exploring-module skill"). Improve mode optimizes the frontmatter description for trigger accuracy (60/40 train/holdout split, 3 runs, up to 5 iterations). Related: article review

  • Benchmark skills across model versions. When Anthropic ships model updates, run the benchmarks to catch silent regressions. Skills that encode specific output formats (reference-impl template, review checklist) are vulnerable to model updates changing output structure without warning. Related: article review
