
Engineering Backlog

Cross-cutting implementation gaps identified during architecture reviews. Grouped by theme, linked to relevant patterns and ADRs.

#1 Pain Point: Large import throughput. Priority stack by throughput impact:

| Item | Impact | Effort |
| --- | --- | --- |
| Add jitter to @retry_on_db_conflict | HIGH — prevents deadlock cascades | One-line change |
| Add jitter to SQS requeue backoff | HIGH — prevents retry storms | One-line change |
| Profile Aurora write throughput (bulk INSERT, early dedup) | HIGH | Investigation |
| Heartbeat for checkpoint pipeline | MEDIUM — prevents duplicate processing | Medium |
| Visibility timeout → 990s | MEDIUM — prevents rare duplicates | Config change |
| Rails → RDS Proxy (connection pinning) | MEDIUM — partial benefit (core DB only) | Medium |
| Direct-to-Aurora for new modules | MEDIUM — future modules | Per-ADR decision |

Observability

Elasticsearch

  • CRITICAL: Fix ES hot partition problem — indexes growing to 1TB+. New index creation is only checked at case creation time (30GB shard threshold). New cases all land on the latest index and grow simultaneously → 300GB-1TB monster indexes. A large import on one case degrades search for ALL co-located cases. Fix: periodic shard size check (background job) or ILM-based rollover. This is the biggest ES operational issue. Related: elasticsearch-cluster-management, ADR-008
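
The periodic shard-size check could be a scheduled job over index stats. A minimal sketch of the decision logic, where the stats shape is a simplified stand-in for the ES /_stats response and the 30GB threshold mirrors the existing case-creation-time check:

```python
# Decide which indexes need rollover based on average primary-shard size.
# The input shape is a simplified stand-in for the ES /_stats API response.
SHARD_ROLLOVER_BYTES = 30 * 1024**3  # 30GB per primary shard, per the existing threshold

def indexes_needing_rollover(index_stats: dict) -> list:
    """Return index names whose average primary-shard size exceeds the threshold.

    index_stats maps index name -> {"primary_size_bytes": int, "primary_shards": int}.
    """
    over = []
    for name, stats in index_stats.items():
        per_shard = stats["primary_size_bytes"] / max(stats["primary_shards"], 1)
        if per_shard > SHARD_ROLLOVER_BYTES:
            over.append(name)
    return over
```

A background job would feed this from the cluster stats API and trigger rollover (or simply create the next index) for anything returned; ILM-based rollover achieves the same without custom code.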

  • Migrate FilterSearch from MySQL JOINs to ES queries (prerequisite for ADR-011). Document browse/filter uses MySQL JOINs on tags/taggings for ALL cases (NGE and Legacy). Must migrate to ES queries before stopping MySQL tag writes. Compatible with AG Grid infinite scroll. Related: rails/app/models/filter_search.rb, ADR-008, ADR-011

Throughput / Scaling

  • Evaluate direct-to-Aurora (via RDS Proxy) for new NGE modules instead of routing through the Rails API. Currently only documentloader uses RDS Proxy directly. All other NGE modules (extractor, uploader, exporter, exchanger) route through Rails HTTP APIs → Rails ActiveRecord → Aurora. Rails becomes the throughput bottleneck: Puma thread limits, the ActiveRecord connection pool, and HTTP overhead per call. With 20K+ npcase databases on one Aurora cluster, the shared writer instance is already the aggregate throughput constraint; adding Rails as an intermediary compounds it. Trade-off: direct-to-Aurora eliminates the Rails bottleneck but duplicates model logic in Python. Mitigations: shared SQLAlchemy models in the shared_libs Python package, RDS Proxy handles connection pooling, and per-case schema resolution already exists in documentloader. Decision needed per ADR: for each new module (ADRs 005-010), decide direct-to-Aurora (like documentloader) or through-Rails (like current modules). Recommend direct-to-Aurora for write-heavy modules (bates stamping ADR-006, bulk ops ADR-005). Related: lambda-sqs-integration, database-session
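
For illustration only, per-case connection resolution through RDS Proxy might look like the following. The proxy endpoint and the npcase_<id> schema naming are assumptions, not the real documentloader configuration:

```python
# Hypothetical sketch: resolve which schema a new module should target when
# talking to Aurora through RDS Proxy, mirroring the per-case schema
# resolution the backlog item says documentloader already has.
from dataclasses import dataclass

@dataclass(frozen=True)
class CaseConnection:
    proxy_endpoint: str
    schema: str

def resolve_case_connection(case_id: int,
                            proxy_endpoint: str = "nge-proxy.example.internal") -> CaseConnection:
    # Assumed convention: one schema per case, e.g. npcase_12345.
    return CaseConnection(proxy_endpoint=proxy_endpoint, schema=f"npcase_{case_id}")
```

A module would then open a pooled connection to the proxy endpoint and switch to the resolved schema; this is the piece that keeps model logic out of Rails while still centralizing connection management.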

  • Profile and optimize aggregate Aurora write throughput for large imports. 20K+ npcase databases share one Aurora writer instance, so large imports from multiple cases simultaneously compete for write IOPS. Investigate: Aurora I/O-Optimized pricing tier, bulk INSERT batching (reduce transactions-per-document), off-hours dynamic concurrency scaling, and early deduplication to skip the pipeline for duplicate files. Related: concurrency-patterns
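
Bulk INSERT batching mostly comes down to amortizing transactions. A sketch of the chunking half; the cursor usage is shown as a comment because the real DB layer may differ:

```python
# Group rows into fixed-size batches so each batch becomes one INSERT and one
# commit, instead of one transaction per document.
from itertools import islice
from typing import Iterable, Iterator

def batched(rows: Iterable[tuple], size: int = 500) -> Iterator[list]:
    """Yield rows in lists of at most `size`."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk

# Usage sketch (cursor/connection names assumed, e.g. a PyMySQL cursor):
# for chunk in batched(document_rows):
#     cursor.executemany("INSERT INTO documents (id, name) VALUES (%s, %s)", chunk)
#     connection.commit()  # one transaction per batch, not per document
```

The batch size is a tuning knob; profiling should pick it against Aurora's observed write IOPS rather than guessing.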

  • Migrate Rails to RDS Proxy for connection pooling. Currently Rails connects directly to Aurora. RDS Proxy would pool connections for the core DB (accounts, users, batches), but PerCaseModel connections will be pinned (schema switching causes pinning). Requires: prepared_statements: false, moving SET statements to the proxy init query, and accepting pinning for case-specific connections. Partial benefit — core connections multiplex, case connections don't. Full benefit requires reducing PerCaseModel reliance (longer-term). Related: rails-rds-proxy-migration

  • Legacy-only: ES tag existence checks cause throttling on large load file imports. Legacy uses ES queries per-row, per-custom-field-value to check tag existence — intentional, to prevent race conditions from parallel workers. This causes millions of ES queries on large imports (100K+ rows × 50+ columns). NGE already solved this: documentloader uses a TagDedupe table (SHA256 hash + PK constraint) for atomic MySQL dedup, plus INSERT IGNORE for taggings — zero ES queries in the write path. The problem goes away as cases convert to NGE. For new modules (ADR-005 bulk ops): follow documentloader's TagDedupe pattern, not Legacy's ES check pattern. Related: documentloader shell/tags_ops.py, shell/taggings_ops.py, ADR-005
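
A sketch of the TagDedupe idea for new modules to copy; the table and column names below are illustrative, not documentloader's real schema:

```python
# SHA256 over the tag tuple gives a stable key; the PK constraint on that key
# makes INSERT IGNORE an atomic "first writer wins" check in MySQL, so the
# write path needs zero ES queries.
import hashlib

def tag_dedupe_key(case_id: int, field: str, value: str) -> str:
    """Stable SHA256 key for a (case, field, value) tag tuple."""
    return hashlib.sha256(f"{case_id}:{field}:{value}".encode()).hexdigest()

# Write-path sketch (cursor assumed to be a MySQL DB-API cursor):
# key = tag_dedupe_key(case_id, field, value)
# cursor.execute("INSERT IGNORE INTO tag_dedupe (hash_pk) VALUES (%s)", (key,))
# if cursor.rowcount == 1:   # we won the race: this worker creates the tag
#     cursor.execute("INSERT INTO tags (...) VALUES (...)")
# cursor.execute("INSERT IGNORE INTO taggings (...) VALUES (...)")
```

The race-condition safety Legacy gets from ES checks comes instead from the database's uniqueness guarantee, which parallel workers can rely on without coordination.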

Resilience

  • Review reservedConcurrency across all NGE Lambdas. Ensure dual concurrency control (ESM MaximumConcurrency + ReservedConcurrency) is applied consistently. Invariant: MaximumConcurrency <= ReservedConcurrency. Prevents cross-module starvation when one module spikes. Related: concurrency-patterns, lambda-sqs-integration, article review
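
The invariant is easy to assert in a config-review or deploy-time check. A sketch (the minimum of 2 reflects the allowed range for SQS event source mapping MaximumConcurrency):

```python
# Deploy-time guard for the dual concurrency invariant on each Lambda:
# ESM MaximumConcurrency must be >= 2 (SQS ESM minimum) and must not exceed
# the function's ReservedConcurrency.
def check_dual_concurrency(esm_max: int, reserved: int) -> None:
    if esm_max < 2:
        raise ValueError("ESM MaximumConcurrency minimum is 2")
    if esm_max > reserved:
        raise ValueError(
            f"Invariant violated: MaximumConcurrency ({esm_max}) "
            f"> ReservedConcurrency ({reserved})"
        )
```

Running this over every module's settings in CI turns a quiet misconfiguration into a failed build.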

  • Increase batch-processing queue visibility timeout to 990s. The current 900s doesn't account for the 60s batch window. Formula: Lambda_timeout + BatchWindow + 30s buffer = 990s. Related: lambda-sqs-integration, article review
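
The formula as a trivial helper, usable as a config-validation assertion alongside the concurrency checks:

```python
# Visibility timeout must cover the worst case: full Lambda timeout plus the
# batch window the ESM may hold messages for, plus a safety buffer.
def required_visibility_timeout(lambda_timeout_s: int, batch_window_s: int,
                                buffer_s: int = 30) -> int:
    return lambda_timeout_s + batch_window_s + buffer_s
```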

  • Audit maxReceiveCount across all NGE SQS queues (ensure >= 5). Low maxReceiveCount + throttling bursts = valid messages in the DLQ. AWS recommends >= 5. Related: lambda-sqs-integration, article review

  • Add heartbeat extension for documentloader checkpoint pipeline. Prevents duplicate processing when long-running checkpoints exceed the SQS visibility timeout. Related: checkpoint-pipeline, sqs-handler
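
A minimal heartbeat sketch assuming a boto3-style SQS client; the real checkpoint pipeline's interfaces and timings may differ:

```python
# While checkpoint work is running, periodically call ChangeMessageVisibility
# so the message doesn't become visible (and get redelivered) mid-processing.
import threading

class VisibilityHeartbeat:
    """Context manager that extends a message's visibility on an interval."""

    def __init__(self, sqs_client, queue_url: str, receipt_handle: str,
                 extend_to_s: int = 900, interval_s: float = 300.0):
        self._args = dict(QueueUrl=queue_url, ReceiptHandle=receipt_handle,
                          VisibilityTimeout=extend_to_s)
        self._client = sqs_client
        self._interval = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self):
        # wait() returns False on timeout, True once stop is set.
        while not self._stop.wait(self._interval):
            self._client.change_message_visibility(**self._args)

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc):
        self._stop.set()
        self._thread.join()
```

Usage would wrap the long-running checkpoint step: `with VisibilityHeartbeat(sqs, url, handle): run_checkpoint()`. The interval should be comfortably below the current visibility timeout.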

  • Add full jitter to @retry_on_db_conflict decorator. Pure exponential backoff without jitter causes a thundering herd during MySQL deadlock storms on bulk imports. Change: delay = random.uniform(0, base_delay * (2 ** attempt)). Related: retry-and-resilience, article review
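
A hypothetical full-jitter version of the decorator, following the formula above. The exception types, defaults, and signature are illustrative, not the real decorator's:

```python
# Full jitter: each retry sleeps a uniform random amount in [0, cap), where
# the cap grows exponentially. Colliding workers spread out instead of
# retrying in lockstep.
import functools
import random
import time

def retry_on_db_conflict(max_attempts: int = 5, base_delay: float = 0.1,
                         retryable=(Exception,)):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return fn(*args, **kwargs)
                except retryable:
                    if attempt == max_attempts - 1:
                        raise
                    time.sleep(random.uniform(0, base_delay * (2 ** attempt)))
        return wrapper
    return decorator
```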

  • Add full jitter to SQS requeue backoff. Same thundering herd risk at the message level: all messages that fail simultaneously retry at the same intervals. Change: delay = random.uniform(0, min(base_delay * (2 ** retry_count), max_delay)). Related: retry-and-resilience, article review
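
The requeue formula as a helper; the default base and cap here are placeholder assumptions, not the queue's current settings:

```python
# Capped full-jitter delay for requeued messages: the exponential term is
# clamped to max_delay, then a uniform draw spreads simultaneous failures out.
import random

def requeue_delay(retry_count: int, base_delay: float = 1.0,
                  max_delay: float = 300.0) -> float:
    return random.uniform(0, min(base_delay * (2 ** retry_count), max_delay))
```

The result would feed SQS's DelaySeconds (or message-level delay) when re-enqueueing.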

  • Add timeout + jittered retry to NgePageService HTTP calls. The sync path has no timeout and no retry. Set connection and request timeouts based on p99.9 latency, and add retry with jittered backoff for transient failures. Related: retry-and-resilience, circuit-breaker, article review
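
One way to sketch this is to decouple the retry logic from the HTTP client. TransientHTTPError and the timeout values below are illustrative stand-ins, not NgePageService's actual types:

```python
# Generic jittered-retry wrapper: the caller supplies a zero-arg callable that
# already carries its own timeout, so retry policy and transport stay separate.
import random
import time

class TransientHTTPError(Exception):
    """Stand-in for timeouts/5xx responses worth retrying."""

def call_with_jittered_retry(call, max_attempts: int = 3, base_delay: float = 0.25):
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientHTTPError:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, base_delay * (2 ** attempt)))

# Usage sketch: always pass an explicit timeout on the underlying call, e.g.
#   call_with_jittered_retry(lambda: session.get(url, timeout=(3.05, 10)))
# with the (connect, read) values derived from observed p99.9 latency.
```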

  • Add per-case backpressure for large batch imports. A single large case import can starve other cases; needs admission control or per-case rate limiting. Related: concurrency-patterns, sqs-operations
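
A minimal in-process sketch of per-case admission control. The in-flight cap is an arbitrary placeholder, and a real implementation would likely sit behind SQS requeueing (rejected work goes back on the queue with delay) rather than in memory:

```python
# Cap concurrent in-flight batch work per case so one giant import can't
# monopolize workers shared across cases.
import threading
from collections import defaultdict

class PerCaseLimiter:
    def __init__(self, max_inflight_per_case: int = 4):
        self._limit = max_inflight_per_case
        self._inflight = defaultdict(int)
        self._lock = threading.Lock()

    def try_acquire(self, case_id: int) -> bool:
        with self._lock:
            if self._inflight[case_id] >= self._limit:
                return False  # caller should requeue with delay, not process
            self._inflight[case_id] += 1
            return True

    def release(self, case_id: int) -> None:
        with self._lock:
            self._inflight[case_id] -= 1
```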

Event Architecture

  • Add event schema validation (Pydantic models) for new modules. There is no enforced schema on SNS messages today — implicit coupling between producers and consumers. As the module count grows (ADRs 005-010), schema drift risk increases. Related: sns-event-publishing, article review
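
Until the Pydantic models exist, the shape can be sketched with a stdlib dataclass; the event and field names below are invented for illustration:

```python
# One declared, versioned shape per event type that both producer and consumer
# validate against. A real version would be a Pydantic BaseModel; this stdlib
# stand-in shows the same idea with a type check at construction time.
from dataclasses import dataclass, fields

@dataclass(frozen=True)
class DocumentProcessedV1:
    event_type: str
    case_id: int
    document_id: int

    def __post_init__(self):
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, f.type):
                raise TypeError(
                    f"{f.name} must be {f.type.__name__}, got {type(value).__name__}"
                )
```

Producers serialize only validated instances; consumers parse into the same class, so drift fails loudly at the boundary instead of silently downstream.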

  • Create event-catalog.md listing all event types, producers, and consumers. There is no central registry of events across modules. Needed for debugging event flows and onboarding new modules. Related: sns-event-publishing, article review

  • Evaluate Step Functions for ADR-005 (bulk ops saga) and ADR-010 (deposition orchestration). Pure choreography works for the linear pipeline, but bulk operations need compensation/rollback (saga) and depositions have sequential dependencies. Hybrid approach: choreography by default, orchestration for sagas. Related: adr/005, adr/010, article review

Data Integrity

  • Ensure bates number increment is idempotent (fencing token pattern) for ADR-006. Bates numbers are a legal correctness concern — duplicates are compliance failures. Apply Kleppmann's fencing token pattern: BatesPattern.next_number is the natural fencing token, and stamp writes must be idempotent (skip if an annotation with that bates number already exists). Related: adr/006, article review
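
The skip-if-already-stamped rule in miniature, with an in-memory dict standing in for the real annotations table:

```python
# The number handed out by BatesPattern.next_number acts as the fencing token:
# whatever holds that number wins, and any retry or duplicate delivery that
# arrives with the same number becomes a no-op.
def stamp_document(document_id: int, bates_number: int, annotations: dict) -> bool:
    """Idempotently record a bates stamp; returns True only on first write."""
    if bates_number in annotations:
        return False  # already stamped: duplicate delivery or retry, skip
    annotations[bates_number] = document_id
    return True
```

In the real service the existence check and insert would be a single atomic database operation (e.g. a unique constraint on the bates number), not two steps.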

  • Verify InnoDB lock_wait_timeout vs Lambda timeout alignment for stamp service. If the Lambda pauses beyond innodb_lock_wait_timeout, the SELECT ... FOR UPDATE on BatesPattern can release, allowing duplicate number assignment. Ensure Lambda timeout < lock_wait_timeout. Related: adr/006, article review

  • Ensure bates stamp Lambda uses the Aurora writer endpoint for SELECT ... FOR UPDATE. Aurora Write Forwarding does NOT support SELECT ... FOR UPDATE, so the bates increment must go through the primary writer instance, not reader endpoints. If the stamp Lambda uses a reader for general reads, it must explicitly switch to the writer for the atomic increment. Related: adr/006, article review

Idempotency

  • Evaluate Lambda Powertools @idempotent decorator for new modules. Standardize the idempotency implementation instead of hand-rolling it per module. Related: idempotent-handlers

  • Standardize idempotency test pattern. Call the handler twice with the same input and verify no duplicates or side effects. Apply across all modules. Related: idempotent-handlers
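
The pattern in miniature, with `handler` and `store` as illustrative stand-ins for a module's real entry point and datastore:

```python
# The standardized test shape: invoke the handler twice with the identical
# event and assert the observable state is the same as after one invocation.
def handler(event: dict, store: set) -> None:
    store.add(event["document_id"])  # idempotent here because sets dedupe

def test_handler_is_idempotent():
    store = set()
    event = {"document_id": 123}
    handler(event, store)
    handler(event, store)  # second delivery of the same event
    assert store == {123}  # no duplicate side effects
```

Each module's version would swap in its real handler and assert against its real datastore (row counts, written objects, published events).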

Claude Code Skills

  • Add eval cases to architecture repo skills. Skill-creator (March 2026 update) now supports eval mode, benchmark mode, and blind A/B testing. Define test prompts + expected outputs for each skill:
  • exploring-module: test with a known module, verify it finds correct patterns and integration points
  • reviewing-architecture: test with a module that violates hexagonal boundaries, verify detection
  • writing-reference-impl: test with a documented module, verify output matches the reference format. Related: article review

  • Run trigger optimization (Improve mode) on skill frontmatter descriptions. Skills may not trigger when users use natural language instead of explicit skill names (e.g., "check this module" vs "use the exploring-module skill"). Improve mode optimizes the frontmatter description for trigger accuracy (60/40 train/holdout split, 3 runs, up to 5 iterations). Related: article review

  • Benchmark skills across model versions. When Anthropic ships model updates, run the benchmarks to catch silent regressions. Skills that encode specific output formats (reference-impl template, review checklist) are vulnerable to model updates changing output structure without warning. Related: article review
