
Article Review: Group 15 — Architecture Review Methodology

Articles Reviewed

  1. "How I Review Tech Architecture as a Senior Developer For Better Results" — Vinod Pal / Medium (Feb 15, 2026) — Practitioner's framework for architecture reviews after 100+ reviews: six principles with concrete failure examples and a Before/During/After checklist.

Key Concepts

The Six Principles

The article presents six architecture review principles, each illustrated with a diagram showing a hidden flaw that teams typically miss:

1. Start With Failure Scenarios

Review the unhappy path first. A food delivery architecture with synchronous external calls (Stripe, Inventory API, SendGrid) and no fallback looks solid until any dependency returns a 500. All API threads block, and new orders can't start.

The question to ask: What happens when [dependency] returns an error? Teams design the happy path and forget that systems also need to work when dependencies fail. Write down three failure scenarios before starting the review.
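A minimal sketch of that discipline: every external call is bounded by a timeout, retried briefly, and then routed to a fallback instead of blocking a thread. The wrapper, its name, and its retry parameters are illustrative, not from the article.

```python
import time

def call_with_fallback(call, fallback, timeout_s=2.0, retries=2):
    """Hypothetical wrapper: bound a dependency call with a timeout and
    retries, then degrade to a fallback instead of blocking indefinitely."""
    for attempt in range(retries + 1):
        try:
            return call(timeout=timeout_s)      # dependency must honor the timeout
        except Exception:
            if attempt < retries:
                time.sleep(0.1 * 2 ** attempt)  # brief exponential backoff
    return fallback()                           # e.g. queue the order for later
```

The point of the sketch is the review question it encodes: every dependency call site should be able to answer "what is my timeout, and what do I return when the call fails?"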

2. Consistency First, Performance Second

A product catalog with Redis cache (1h TTL) and a batch update service that writes directly to the database — never invalidating Redis. After a batch price update, users see stale cached prices for up to an hour.

Why it's sneaky: Cache hit rate metrics look great (95%), so nobody tracks data freshness. Different teams own the cache and the batch job, and neither knows about the other's assumptions.

Fix options: Write-through cache, cache invalidation on write, pub/sub invalidation, or short TTL matched to update frequency. The rule: any path that writes data must also invalidate cached copies.
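The "any write path must invalidate" rule can be sketched with in-process dicts standing in for Redis and the database; the class and method names are hypothetical.

```python
class Catalog:
    """Sketch of principle 2: every path that writes data also invalidates
    cached copies. `cache` stands in for Redis, `db` for the primary store."""
    def __init__(self):
        self.db, self.cache = {}, {}

    def read_price(self, sku):
        if sku not in self.cache:          # cache miss: fall through to the DB
            self.cache[sku] = self.db[sku]
        return self.cache[sku]

    def batch_update(self, prices):
        self.db.update(prices)             # write to the source of truth...
        for sku in prices:
            self.cache.pop(sku, None)      # ...and invalidate every cached copy
```

In the article's failure, the batch updater was a separate service that only did the first half of `batch_update`; the review question is whether any writer exists that skips the invalidation step.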

3. Design Around the Narrowest Point

An order processing system with a queue that buffers requests but a single consumer processing sequentially. Normal load (1K/hr) works fine. Peak load (8K/hr) causes messages to pile up in memory.

The common argument: The queue buffers the spike. True — but buffering isn't processing. You're just delaying the problem.

Fix: Scale consumers horizontally. Auto-scale based on queue depth and processing latency. Add priority queues for high-value items.
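The "scale consumers horizontally" fix reduces to sizing the pool from queue depth. This sketch, with illustrative parameter names and bounds, computes a consumer count that drains the backlog within a target window:

```python
def desired_consumers(queue_depth, per_consumer_rate, target_drain_s=60,
                      min_consumers=1, max_consumers=20):
    """Sketch: size the consumer pool so the current backlog drains within
    target_drain_s. per_consumer_rate is messages/second per consumer."""
    needed = -(-queue_depth // (per_consumer_rate * target_drain_s))  # ceil div
    return max(min_consumers, min(max_consumers, needed))
```

At the article's peak load, an 8,000-message backlog with a consumer that handles one message every two seconds needs five consumers to drain within an hour, versus one consumer at normal load. The bounds keep autoscaling from flapping to zero or running away.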

4. Measure the Blast Radius

A microservices platform where every service calls an Auth Service for every request. Auth Service is a hidden single point of failure — if it goes down, 100% of the platform is unavailable.

The exercise: Draw blast radius circles in every review. What does this component touch? What depends on it? Auth touches seven services = seven failure paths.

Fix: Auto-restart with health checks, client-side JWT validation (reduces Auth traffic 95%), local permission caching (5min TTL). Blast radius shrinks from 100% to 5%.
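Local permission caching with a short TTL can be sketched as follows; the class, its stale-on-outage behavior, and all names are assumptions layered on the article's fix, not a real Auth client.

```python
import time

class PermissionCache:
    """Sketch of the blast-radius fix: cache Auth decisions locally with a
    short TTL so callers survive a brief Auth Service outage."""
    def __init__(self, fetch, ttl_s=300):
        self.fetch, self.ttl_s, self._entries = fetch, ttl_s, {}

    def allowed(self, user, action):
        key, now = (user, action), time.monotonic()
        hit = self._entries.get(key)
        if hit and now - hit[1] < self.ttl_s:
            return hit[0]                        # fresh local answer: no Auth call
        try:
            decision = self.fetch(user, action)  # remote Auth Service (hypothetical)
        except Exception:
            if hit:
                return hit[0]                    # stale-but-available beats an outage
            raise
        self._entries[key] = (decision, now)
        return decision
```

The tradeoff to surface in review: a revoked permission can remain honored for up to the TTL, which is the price of shrinking the blast radius.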

5. Preventing Failure Is Only Half the Design

A financial system with a circuit breaker pattern for external payment API calls. Looks well-designed. But the circuit breaker has no fallback strategy — when the circuit opens, all payment requests simply fail.

Key insight: A circuit breaker without a fallback just moves the failure point. You protected the system but broke the business.

Fix: Add fallback logic — queue orders for later processing, use a backup provider, or degrade gracefully with stored credit.
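A breaker whose open state degrades to a fallback (for example, queueing the order) rather than failing outright might look like this sketch; the thresholds and names are illustrative.

```python
import time

class CircuitBreaker:
    """Sketch of principle 5: when the circuit opens, route requests to a
    fallback instead of simply failing them."""
    def __init__(self, call, fallback, max_failures=3, reset_after_s=30):
        self.call, self.fallback = call, fallback
        self.max_failures, self.reset_after_s = max_failures, reset_after_s
        self.failures, self.opened_at = 0, None

    def request(self, *args):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after_s:
                return self.fallback(*args)          # open: degrade, don't just fail
            self.opened_at, self.failures = None, 0  # half-open: probe again
        try:
            result = self.call(*args)
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()    # trip the breaker
            raise
```

Without the `fallback` branch this is exactly the anti-pattern the article describes: the breaker protects the caller's threads but every payment still fails.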

6. Scalability Requires Isolation

An insurance claims system adopting microservices — but all services share one database. Shared databases create tight coupling. One service's schema change or slow query affects all others.

Common arguments debunked:

  • "Simpler to start with one database" → Harder to split later
  • "We'll separate when we scale" → Data migrations under load are risky and slow
  • "The database can handle it" → The database becomes the bottleneck

Tradeoff acknowledged: Multiple databases increase operational overhead, and cross-service queries become API calls or events. But the long-term resilience benefit outweighs the short-term simplicity.

The Review Ritual (Before/During/After Checklist)

Before the review:

  1. Read the design doc twice — first for understanding, second for questions
  2. List every external dependency (APIs, databases, third-party services) — check each for timeouts and fallbacks
  3. Identify the critical path — what code runs for every request? That's where bugs hide

During the review:

  1. Start with failure scenarios — walk the team through outages
  2. Draw the data flow — follow one request end-to-end, count every transformation
  3. Check the math — requests/sec × data size × operations. Add 30% headroom. Does it fit the infrastructure?
  4. Ask about monitoring — what metrics exist? What alerts fire? When do they know something broke?
  5. Question every piece of state — where is it? How long does it live? What happens if you lose it?
  6. Map the blast radius — if this fails, what else fails? Draw circles. Big circles need redesign.
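The "check the math" step above can be mechanized as a tiny helper. The formula (requests/sec × data size × operations, plus 30% headroom) is the article's; the function name and the returned units are assumptions.

```python
def capacity_check(requests_per_s, bytes_per_request, ops_per_request,
                   headroom=0.30):
    """Sketch of the capacity math: multiply out the load, add headroom,
    and compare each figure against provisioned limits yourself."""
    factor = 1 + headroom
    return {
        "ops_per_s": requests_per_s * ops_per_request * factor,
        "bytes_per_s": requests_per_s * bytes_per_request * factor,
    }
```

For example, 100 req/s at 2 KB and 3 operations each needs headroom-adjusted capacity for 390 ops/s and roughly 260 KB/s; if the provisioned database tops out below that, the design fails the review before any diagram is drawn.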

After the review:

  1. Write action items immediately — specific tasks, owners assigned, deadlines set
  2. Share notes within 24 hours — what needs fixing, why it matters, how to fix it
  3. Follow up — check if changes happened, verify fixes work

Five Diagnostic Questions

  1. What's the fallback when X fails? — Forces resilience thinking
  2. Show me the capacity math — catches undersized infrastructure
  3. How do you deploy this? — reveals missing automation and risky processes
  4. What can't you see right now? — exposes monitoring gaps
  5. Where does the state live? — finds scaling problems early

Red Flags

  • "We'll add monitoring later" → means they won't
  • "This usually works" → means it fails under load
  • "It's temporary" → means it becomes permanent
  • "Only one person understands this" → knowledge silo, document now

Mapping to Our Architecture Repo & Claude Code Config

What We Do Right

  1. Failure-first review is embedded in our skill — Our reviewing-architecture skill checklist starts with error handling patterns (RecoverableException, PermanentFailureException, SilentSuccessException), idempotency checks, and DLQ behavior. This matches the article's principle 1 (start with failure scenarios). We review the unhappy path by default.

  2. Blast radius is contained by design — Our event-driven architecture with SNS/SQS means modules are decoupled. If documentloader fails, documentexporter continues operating on already-loaded documents. The blast radius of any single module failure is limited to its processing pipeline, not the entire platform. Per-case databases ({RDS_DBNAME}_case_{case_id}) further isolate blast radius — one case's database issue doesn't affect others.

  3. Narrowest-point scaling is addressed — Our patterns/lambda-sqs-integration.md documents SQS-based autoscaling with Lambda concurrency. Multiple Lambda invocations process from the same queue in parallel, auto-scaling based on queue depth. This directly implements the article's "scale consumers horizontally" fix.

  4. Data isolation principle matches our multi-tenant model — The article's principle 6 (scalability requires isolation / don't share databases) is exactly what our per-case database pattern achieves. Each case gets its own MySQL schema, preventing cross-case interference. This is more granular isolation than most architectures achieve.

  5. Circuit breaker pattern exists — Our reference-implementations/documentexchanger.md documents a circuit breaker for uploader communication, matching principle 5. Importantly, it includes fallback behavior (retry with backoff, DLQ routing) — avoiding the "circuit breaker without fallback" anti-pattern the article warns about.

Improvements Identified

1. MEDIUM: Add Failure Scenario Prompts to reviewing-architecture Skill

The article's three-phase review checklist (Before/During/After) is more structured than our current skill. Specifically, the "Before" phase (list dependencies, identify critical path) and the "During" phase (draw data flow, check math, map blast radius) could enhance our skill's Process section.

Concrete additions to .claude/skills/reviewing-architecture/SKILL.md:

  • Add a "Dependencies & External Calls" checklist item: for each external dependency, verify timeout configuration, fallback behavior, and circuit breaker presence
  • Add a "Blast Radius" checklist item: map what fails when this module fails; list all downstream consumers of this module's events
  • Add a "Capacity Math" checklist item: verify Lambda concurrency limits, SQS batch sizes, and RDS connection pool sizes against expected throughput

2. MEDIUM: Add Consistency/Cache Checklist Item

The article's principle 2 (consistency over performance, cache invalidation) maps to a gap in our review checklist. Our modules use Elasticsearch as a read cache and MySQL as the source of truth. The documentloader writes to both, but we should verify:

  • Every write path that updates MySQL also triggers an ES reindex
  • Stale ES data has a bounded staleness window
  • Monitoring exists for ES-to-MySQL divergence

This is particularly relevant for the Legacy Rails platform where Redis caching and Elasticsearch indexing can diverge from MySQL state.
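One way to verify the ES-to-MySQL divergence point is a periodic reconciliation sweep. This sketch compares ids and a version field between the two stores; the field name and data shapes are assumptions for illustration, not our actual schema.

```python
def divergence_report(mysql_rows, es_docs, version_field="updated_at"):
    """Sketch of an ES-to-MySQL freshness check: treat MySQL as the source
    of truth and flag documents the search index is missing or has stale."""
    missing = [rid for rid in mysql_rows if rid not in es_docs]
    stale = [rid for rid, row in mysql_rows.items()
             if rid in es_docs
             and es_docs[rid].get(version_field) != row[version_field]]
    return {"missing_from_es": missing, "stale_in_es": stale}
```

Run on a sample per case, the two list lengths become the "ES-to-MySQL divergence" metric the checklist item asks for; a nonzero trend is the alert condition.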

3. LOW: Add the Five Diagnostic Questions to review-architecture Slash Command

The article's five questions (fallback, capacity math, deployment, monitoring gaps, state location) are excellent review prompts. They could be added to the /review-architecture slash command output as a "Questions to Investigate" section that prints after the automated checklist runs.

4. INFO: Red Flags as Anti-Patterns Documentation

The four red flags ("we'll add monitoring later", "this usually works", "it's temporary", "only one person understands this") are worth capturing in a patterns/anti-patterns.md or rules/review-red-flags.md file. These are social signals during reviews that indicate deeper architectural problems.

Actionable Changes

| Change | Target | Priority |
| --- | --- | --- |
| Add dependency/blast-radius/capacity checks to review skill | .claude/skills/reviewing-architecture/SKILL.md | MEDIUM |
| Add cache consistency checklist item | .claude/skills/reviewing-architecture/SKILL.md | MEDIUM |
| Add five diagnostic questions to review output | .claude/commands/review-architecture.md | LOW |
| Document red flag anti-patterns | patterns/ or rules/ | LOW |

Summary

The article presents a practitioner-tested architecture review framework that closely aligns with what we already do — failure-first review, blast radius containment via event-driven decoupling, horizontal consumer scaling via SQS/Lambda, and data isolation via per-case databases. Our reviewing-architecture skill covers most of the checklist, but could be strengthened with explicit dependency mapping, blast radius analysis, capacity math verification, and cache consistency checks. The five diagnostic questions and red flags are immediately useful additions to our review tooling.
