Article Review: Group 13 — Agent Framework Comparison & Composable Infrastructure

Articles Reviewed

  1. "5 Agent Frameworks. One Pattern Won." — Yanli Liu / Level Up Coding (Mar 2026) — Evaluated AutoGen, LangGraph, CrewAI, DeerFlow, and Anthropic patterns for a compliance-sensitive financial research pipeline. Thesis: composable infrastructure beats monolithic frameworks.

Key Concepts

The Autonomy-Control Spectrum

Five frameworks represent five bets on how agents should coordinate, arranged from most autonomous to most controlled:

| Framework | Coordination Model | Context Strategy | Est. Cost/Pipeline | Best For |
|---|---|---|---|---|
| CrewAI | Role-based delegation | Shared (all see all) | $3-$8+ | Weekend prototypes |
| AutoGen | Broadcast conversation | Shared (N×M tokens) | $4-$8+ | Microsoft stack teams |
| Anthropic | DIY patterns (no framework) | Your problem to manage | $0.50-$2 | Strong infra teams |
| DeerFlow | LangGraph + middleware | Active management (summarize + offload) | $0.30-$0.80 | Compliance-sensitive pipelines |
| LangGraph | Directed state graph | Selective state (still accumulates) | $1-$3 | Complex branching logic |

Key insight: the more control a framework gives you over coordination, the less you pay per decision. Production is pulling hard toward the control end of the spectrum.

Why Each Framework Breaks

  • AutoGen — O(N×M) token cost per round: 4 agents × 20 messages = 80 message-reads per round. Over 5 rounds that is 400 reads, roughly 200K tokens spent on coordination alone (~$0.60). The project also fractured into AG2/Semantic Kernel with no clear migration path — a governance risk, not just a technical one.
  • LangGraph — 200+ lines of boilerplate before first agent runs. State schemas change when you refactor the graph, invalidating existing checkpoints. Control is real, but so is rigidity.
  • CrewAI — Role-based delegation is non-deterministic. Same input routes to different agents across runs. CrewAI bolted on Flows (deterministic pipelines) after production teams hit unpredictable behavior. Their own blog says "start with 100% human review, work down to 50%."
  • Anthropic (bare patterns) — Maximum control, zero lock-in, but you own everything: state persistence, error recovery, agent lifecycle, context management. Months of plumbing before first agent runs a real task.
  • DeerFlow — Clearest composable implementation, but released Feb 2026. Middleware abstractions are clean but battle-testing is thin.
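The AutoGen broadcast arithmetic above can be sketched as a back-of-envelope cost model. The message size (~500 tokens, implied by the article's figures) and the $3 per million input tokens price are assumptions, not numbers from the article:

```python
# Back-of-envelope model of broadcast coordination cost (AutoGen-style).
# Every agent re-reads every message each round, so reads grow as O(N*M).
def broadcast_token_cost(agents: int, messages_per_round: int, rounds: int,
                         tokens_per_message: int = 500,
                         usd_per_million_tokens: float = 3.0) -> tuple[int, float]:
    reads_per_round = agents * messages_per_round          # 4 * 20 = 80
    total_tokens = reads_per_round * rounds * tokens_per_message
    return total_tokens, total_tokens / 1_000_000 * usd_per_million_tokens

tokens, cost = broadcast_token_cost(agents=4, messages_per_round=20, rounds=5)
print(tokens, round(cost, 2))   # 200000 0.6
```

Note that adding a fifth agent or a sixth round grows the bill multiplicatively, which is why this model breaks down first in production.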

Three Pillars of Composable Architecture

The article defines "composable" specifically as three capabilities:

Pillar 1: Progressive Skill Loading (The Library Card Principle)

Three-tier loading model:

  • Tier 1 — Metadata: Name + one-line description per skill. ~100 tokens each. 40 skills = 4,000 tokens. Always in context.
  • Tier 2 — Skill Body: Full SKILL.md instructions + procedures. 500-2,000 tokens. Loaded when skill triggers.
  • Tier 3 — Resources: Scripts, templates, API schemas. Variable size. Loaded only when explicitly referenced in instructions.

Impact: $0.012/call with progressive loading vs $0.24/call monolithic. Over 1,000 calls/day: $12 vs $240.

Pillar 2: Filesystem-First State Management

State lives on disk, not in tokens. When a sub-agent finishes analyzing a 10-K filing:

  1. It writes structured findings to /workspace/ (JSON summaries, not raw text).
  2. SummarizationMiddleware compresses the conversation history.
  3. The next agent reads the summary file, not the raw conversation.

Result: no agent ever holds the full 10-K in context. The filesystem carries the data; the context window carries the thinking.

Pillar 3: Middleware as Modular Pipeline

Every LLM call passes through a configurable middleware stack. Each middleware runs before() and after() hooks:

| Middleware | What It Does | Why It Matters |
|---|---|---|
| Summarization | Compresses conversation after sub-tasks | Token Tax killer |
| Memory | Injects relevant prior context from persistent store | Cross-session continuity |
| Sandbox | Routes code execution to Docker/K8s | Compliance gate |
| Clarification | Intercepts ambiguous requests | Human-in-the-loop without manual instrumentation |
| Skill | Progressive 3-tier loading | Library Card Principle in action |
| Tool | MCP server connections, per-agent routing | Composable tool sets |
| ThreadData | Injects thread-scoped data (uploads, working dirs) | Filesystem-first state |
| Logging | Structured event logging at every boundary | Audit trail (SR 11-7) |
| Metrics | Token usage, latency, cost per call | Cost governance |

Different agents in the same pipeline get different middleware stacks. Research agent prioritizes Summarization + Memory. Code execution agent prioritizes Sandbox + ThreadData.
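The before()/after() hook model described above can be sketched as an onion-style pipeline; the class names follow the article's description, but the signatures are assumptions, not DeerFlow's real interfaces:

```python
# Sketch: middleware pipeline wrapping an LLM call with before()/after() hooks.
class Middleware:
    def before(self, request: dict) -> dict:
        return request
    def after(self, response: dict) -> dict:
        return response

class LoggingMiddleware(Middleware):
    """Records a structured event at each boundary (audit-trail style)."""
    def __init__(self):
        self.events = []
    def before(self, request):
        self.events.append(("before", request["prompt"][:40]))
        return request
    def after(self, response):
        self.events.append(("after", response["text"][:40]))
        return response

def run_call(middlewares, request, llm):
    """before() hooks run in order, after() hooks in reverse (onion model)."""
    for m in middlewares:
        request = m.before(request)
    response = llm(request)
    for m in reversed(middlewares):
        response = m.after(response)
    return response

log = LoggingMiddleware()
resp = run_call([log], {"prompt": "summarize the 10-K"},
                llm=lambda req: {"text": "summary: ..."})
```

Because each stack is just a list, giving the research agent Summarization + Memory and the code agent Sandbox + ThreadData is a configuration change, not a code change.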

The Token Tax

The cost of carrying data in context that should live on disk:

  • Monolithic: 137,000 input tokens per LLM call (system prompt 2K + 40 full skills 40K + raw 10-K 80K + history 15K). 12-call pipeline = 1,644,000 tokens = $4.93.
  • Composable: 10,500 input tokens per LLM call (system prompt 2K + 40 skill metadata 4K + 1 active skill 1.5K + disk summary 3K). 12-call pipeline = 126,000 tokens = $0.38.

13x cheaper. But cost isn't the strongest argument — quality is. LLM accuracy degrades as context fills with irrelevant information. A 137K-token input where 120K tokens are unused skill descriptions and raw filing text doesn't just cost more. It thinks worse.
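The Token Tax arithmetic reproduces directly; the $3 per million input tokens price is an assumption that makes the article's dollar figures come out:

```python
# Reproduces the article's Token Tax comparison.
PRICE = 3.0 / 1_000_000  # USD per input token (assumed pricing)

monolithic = 2_000 + 40_000 + 80_000 + 15_000   # 137,000 tokens/call
composable = 2_000 + 4_000 + 1_500 + 3_000      # 10,500 tokens/call

calls = 12
cost_mono = monolithic * calls * PRICE
cost_comp = composable * calls * PRICE
print(round(cost_mono, 2), round(cost_comp, 2), round(cost_mono / cost_comp))
```

The ratio is driven almost entirely by the two big monolithic line items (full skill bodies and the raw filing), which are exactly what Pillars 1 and 2 move out of context.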

What Breaks in Practice

  1. Summarization drops material information — A 60K-token 10-K compressed to 3K loses buried risk factors, related-party transactions, contingent liabilities. Fix: structured extraction before summarization (pull specific fields into JSON, then summarize narrative sections).

  2. External data sources rate-limit you — SEC EDGAR allows 10 req/s; Yahoo Finance, Bloomberg, and news APIs all have their own limits. Three sub-agents hitting EDGAR simultaneously get throttled within seconds. Fix: rate limiting and response caching at the MCP server level, not the agent level.

  3. Agent-generated calculations have no ground truth — A debt-to-equity ratio computed from a summary might use total debt/total equity or long-term debt/shareholders' equity. Both are "correct" by some definition. Fix: assertion checks — run the same calculation in a deterministic script (ratio-calculator.py) from structured JSON, compare against the agent's output, and flag if they diverge by more than 1%.

  4. Context overflow on complex entities — A single-product company researches fine; a conglomerate with 60+ operating companies overflows the Researcher agent's context. Fix: decompose into per-segment sub-tasks, each with its own context window.
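The assertion-check fix from item 3 can be sketched as follows. The field names and the chosen ratio definition are illustrative assumptions; ratio-calculator.py in the article presumably does something along these lines:

```python
# Sketch: deterministic assertion check for an agent-generated ratio.
def debt_to_equity(filing: dict) -> float:
    """One fixed definition: total debt / total shareholders' equity."""
    return filing["total_debt"] / filing["total_equity"]

def check_agent_ratio(filing: dict, agent_value: float,
                      tolerance: float = 0.01) -> bool:
    """True if the agent's figure is within 1% of the deterministic one."""
    expected = debt_to_equity(filing)
    return abs(agent_value - expected) / expected <= tolerance

filing = {"total_debt": 4_200.0, "total_equity": 3_000.0}   # $M, illustrative
assert check_agent_ratio(filing, 1.40)        # matches the fixed definition
assert not check_agent_ratio(filing, 1.62)    # likely a different definition: flag it
```

The point is not that 1.62 is wrong in the abstract, but that it diverges from the pipeline's pinned definition, so a human reviews it.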

The Compliance Lens

Most framework comparisons stop at developer experience. For regulated industries:

  • Sandbox isolation — CrewAI runs tools in your application process with no isolation. AutoGen offers optional Docker but doesn't enforce it. DeerFlow ships three sandbox modes (local, Docker, K8s) routed through SandboxMiddleware by default.

  • Audit trail gap — Conversation-based frameworks bury decision-making in chat logs. Regulators (OCC Bulletin 2011-12, Federal Reserve SR 11-7) require reproducibility: given the same inputs, the system produces the same routing decisions and outputs. DeerFlow's LoggingMiddleware with structured events at every middleware boundary is designed for this.

  • Key question: Not "how fast can it ship?" but "where does the code execute, what gets logged, and who reviewed the output?"

Mapping to Our Architecture Repo & Claude Code Config

What We Do Right

  1. Progressive skill loading is already our model — Our Claude Code skills (exploring-module, writing-reference-impl, reviewing-architecture) use exactly the three-tier pattern: metadata in CLAUDE.md triggers (Tier 1), full SKILL.md loaded on match (Tier 2), reference files from patterns/ and rules/ loaded on demand (Tier 3). This article validates our approach as the emerging best practice.

  2. Filesystem-first state in checkpoint pipelines — Our patterns/checkpoint-pipeline.md documents how documentloader writes intermediate state to DynamoDB/S3 between steps, not carrying raw document data through the entire pipeline. This is the filesystem-first principle applied to backend processing.

  3. Shell boundary as middleware — Our hexagonal architecture (core/ + shell/) is a two-layer middleware pattern. Every external interaction passes through shell/, which handles logging, error translation, retries, and AWS SDK calls. The core/ never touches infrastructure directly. This maps to DeerFlow's middleware pipeline wrapping every LLM call.

  4. Composable tool routing in pr-review — Our reference-implementations/pr-review.md uses 5 parallel specialized agents (security, correctness, edge-cases, architecture, Nextpoint patterns) + verifier aggregator. Each agent gets only the rules relevant to its review type. This matches the composable tool set concept.

  5. Compliance-aware by design — eDiscovery inherently requires audit trails (chain of custody for documents), sandboxed execution (per-case database isolation), and deterministic processing (idempotent handlers). Our existing patterns address these requirements at the infrastructure level.

Improvements Identified

1. MEDIUM: Formalize Middleware Pattern for Event Processing

The article's middleware pipeline concept could enhance our Lambda handler pattern. Currently, our handlers mix concerns: parsing, routing, idempotency checks, error handling, and business logic. A formalized middleware stack approach could separate:

  • IdempotencyMiddleware — Check for duplicate processing before executing
  • ContextMiddleware — Set up ContextVar-based request context (case_id, job_id, batch_id)
  • ErrorTranslationMiddleware — Convert exceptions to SQS behavior (Recoverable → requeue, Permanent → DLQ)
  • MetricsMiddleware — Track processing time, success/failure rates

This aligns with patterns/lambda-sqs-integration.md but makes the layers explicit and reorderable.

2. LOW: Add Cost Governance Metrics to Pattern Documentation

The Token Tax analysis provides a concrete framework for evaluating architectural decisions by token cost. When designing new Claude Code skills or expanding existing ones:

  • Track the estimated token footprint per skill (metadata + body + resources).
  • Set a budget: total skill metadata should stay under 5K tokens.
  • Monitor whether reference-implementation docs pulled into skills are getting too large.

3. INFO: DeerFlow as Reference Architecture for Future AI Features

If Nextpoint builds more AI-powered features beyond nextpoint-ai (transcript summarization), DeerFlow's architecture provides a production-ready reference for:

  • Multi-agent pipelines with isolated context windows
  • Compliance-grade audit trails at every processing boundary
  • Rate-limited external API access via MCP servers
  • Deterministic validation of AI-generated outputs

Not actionable now, but worth tracking as DeerFlow matures.

Actionable Changes

| Change | Target | Priority |
|---|---|---|
| Consider middleware pattern for Lambda handlers | patterns/lambda-sqs-integration.md | MEDIUM |
| Add token budget guidance for skills | CLAUDE.md Session Hygiene | LOW |
| Track DeerFlow maturity for future AI features | Backlog | INFO |

Summary

The article validates that composable infrastructure — progressive skill loading, filesystem-first state, and modular middleware — beats monolithic agent frameworks for production workloads. Our architecture already implements all three pillars: skills with tiered loading, checkpoint pipelines with persistent state, and hexagonal boundaries as middleware layers. The compliance lens (audit trails, sandbox isolation, reproducibility) directly maps to eDiscovery's regulatory requirements. Main takeaway: our patterns are aligned with emerging best practices; the opportunity is to formalize the middleware concept in our Lambda handler pattern documentation.
