# Reference Implementation: pr-review

## Overview
Automated PR review service for Bitbucket Cloud. Receives webhooks on
pullrequest:created and pullrequest:updated, gathers PR diff + full file
context + linked Jira tickets, dispatches parallel specialized review agents
via Bedrock Claude, aggregates/deduplicates findings, and posts a structured
review comment back to the PR.
Not an eDiscovery pipeline module — this is a developer tooling service that reviews code across all Nextpoint repositories.
## Architecture

```
pr-review/
├── service/
│   ├── core/                    # Pure business logic — NO boto3, NO requests
│   │   ├── prompts.py           # All agent prompt templates (string constants)
│   │   ├── review.py            # Agent orchestration, findings parsing, metadata extraction
│   │   ├── nextpoint_rules.py   # Repo→rules mapping, per-repo review config flags
│   │   └── models.py            # TypedDict definitions (Finding, AgentResult, BedrockCaller)
│   ├── shell/                   # Infrastructure — all external I/O
│   │   ├── bedrock.py           # Bedrock converse API (Claude invocation)
│   │   ├── bitbucket_client.py  # Bitbucket Cloud REST API (Bearer token auth)
│   │   ├── jira_client.py       # Jira Cloud REST API (Basic auth)
│   │   └── gather.py            # PR data collection (metadata, diff, files, Jira tickets)
│   ├── handlers/
│   │   └── webhook.py           # Lambda entry point — parse, verify, route
│   ├── rules/                   # Bundled architecture rules (from nextpoint-architecture/rules/)
│   │   ├── api-design.rules.md
│   │   ├── cdk-infrastructure.rules.md
│   │   ├── dependency-management.rules.md
│   │   └── ...                  # ~10 rule files, synced from architecture repo
│   ├── config.py                # Centralized config from Secrets Manager + env vars
│   ├── requirements.txt         # Python dependencies
│   └── cdk/
│       └── stack.py             # CDK stack — Lambda + API Gateway + Secrets Manager
├── repos.yml                    # List of repos with PR review webhooks
├── scripts/
│   ├── gather-pr-data.sh        # Local data-gathering script (uses bkt CLI)
│   └── post-review.sh           # Local review posting script
├── kiro/
│   └── review-pr.md             # Manual review skill for Claude Code
└── CLAUDE.md                    # Module-specific Claude Code config
```
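As a rough illustration of the `handlers/webhook.py` responsibilities listed above (parse, verify, route), a minimal parse-and-route sketch might look like the following. The function name `route` and the `X-Event-Key` header lookup are illustrative assumptions, not the module's actual API, and signature verification is omitted here.

```python
import json

# Bitbucket event keys this service acts on; all others are acknowledged
# and ignored.
HANDLED_EVENTS = {"pullrequest:created", "pullrequest:updated"}


def route(event: dict) -> dict:
    """event: the API Gateway proxy event delivered to the Lambda."""
    event_key = event.get("headers", {}).get("X-Event-Key", "")
    if event_key not in HANDLED_EVENTS:
        # Other webhook events (e.g. comment activity) are not reviewed.
        return {"statusCode": 200, "body": "ignored"}
    payload = json.loads(event.get("body", "{}"))
    pr_id = payload["pullrequest"]["id"]
    return {"statusCode": 202, "body": f"review started for PR {pr_id}"}
```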
## Pattern Mapping

| Pattern | Implementation | Standard NGE Pattern |
|---|---|---|
| Entry point | API Gateway → Lambda (webhook) | SNS → SQS → Lambda |
| Event source | Bitbucket webhook (`pullrequest:created`/`updated`) | SNS events from other modules |
| Agent orchestration | ThreadPoolExecutor — parallel Bedrock calls | N/A (new pattern) |
| External API | Bitbucket REST API + Jira REST API | S3, RDS, Elasticsearch |
| AI integration | Bedrock converse API (Claude Opus 4.6) | Bedrock (nextpoint-ai uses similar) |
| Secrets | AWS Secrets Manager via `config.py` | Same — `shell/utils/aws_secrets.py` |
| Infrastructure | CDK Python (single stack, `service/cdk/stack.py`) | CDK TypeScript (two-stack pattern) |
| Error handling | Generic try/except with HTTP status codes | Exception hierarchy (Recoverable/Permanent/Silent) |
| Database | None — stateless service | SQLAlchemy + per-case MySQL |
| Hexagonal boundaries | `core/` (prompts, review, rules) + `shell/` (Bedrock, Bitbucket, Jira, gather) + `handlers/` (webhook) | Same — `core/` + `shell/` separation |
| Testing | Manual testing via kiro skill | pytest + moto + autouse fixtures |
| Idempotency | Bot comment marker deduplication | Checkpoint-based duplicate detection |
| Concurrency control | `reservedConcurrentExecutions: 2` | `reservedConcurrency` on orchestrator Lambdas |
## Key Design Decisions

### Multi-Agent Review Architecture
Orchestration pattern: Hierarchical — orchestrator delegates to specialized child agents working independently, then aggregates results. Compare with Plan-and-Execute (sequential steps, used by checkpoint pipelines) and ReAct (dynamic reasoning loop, not used in NGE).
Five specialized agents run in parallel, each with a focused mandate:
- Security — injection, secrets, auth bypass, XSS, CSRF
- Correctness — logic errors, null deref, resource leaks, API misuse
- Edge Cases — race conditions, environment-dependent behavior, regressions
- Architecture — design consistency, Jira fulfillment, test coverage, repo health
- Nextpoint Patterns — architecture rule violations specific to Nextpoint repos
A Verifier agent aggregates all findings, removes false positives, deduplicates, ranks by severity, and produces the final formatted review.
Agent count scales with PR complexity:
- Small (<50 diff lines): 2 agents (security-correctness combined + architecture)
- Medium/Large (50+ diff lines): all 4-5 agents
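The fan-out/fan-in described above could be sketched as follows. This is an illustrative sketch, not `review.py`'s actual code: `call_bedrock` stands in for the `BedrockCaller` callable injected from `shell/bedrock.py`, and the prompt/finding shapes are simplified.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agents(agent_prompts: dict, diff: str, call_bedrock) -> list:
    """Dispatch every specialized agent prompt to Bedrock in parallel,
    returning all findings so the verifier can aggregate them afterward."""
    with ThreadPoolExecutor(max_workers=len(agent_prompts)) as pool:
        futures = {
            name: pool.submit(call_bedrock, prompt.format(diff=diff))
            for name, prompt in agent_prompts.items()
        }
        # Block until every agent has returned; the verifier only runs on
        # the complete set of findings.
        return [{"agent": name, "findings": f.result()} for name, f in futures.items()]
```

Because the agents are independent, wall-clock time is roughly the slowest single Bedrock call rather than the sum of all five.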
### Nextpoint Architecture Integration
Architecture rules are bundled locally within service/rules/ in the Lambda
deployment package (not fetched at runtime from the architecture repo). This
ensures reviews work even if the architecture repo is unavailable. Pipeline
exclude patterns filter rules files from diff analysis to avoid noisy reviews.
The `nextpoint_rules.py` module maps repository slugs to the bundled rules via a `RepoConfig` class with granular per-repo flags:
- `RepoConfig` class — boolean flags per review dimension: `security`, `correctness`, `edge_cases`, `architecture`, `nextpoint_patterns`, `jira_fulfillment`, `test_coverage`, `repo_health`
- Config tiers — `_NGE_CONFIG` (all checks), `_LEGACY_CONFIG` (skip hexagonal), `_JAVA_CONFIG` (skip Python rules), `_MINIMAL_CONFIG` (security + correctness only). Each repo maps to a tier in the `REPO_CONFIGS` registry
- Agent dispatch uses flags — agents are conditionally included/excluded per repo based on the `RepoConfig` flags (`review.py`)
- Language auto-detection — `detect_languages()` parses `diff --git` headers to identify Python/Ruby/Java/TypeScript; `get_language_rules()` injects language-specific rules alongside architectural rules
- `_REPO_RULE_OVERRIDES` — per-repo rule customization (e.g., Java modules skip Python-specific rules and get Java/Kotlin rules instead)
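The flag-and-tier structure above might look roughly like this. The field names mirror the bullet list, but the repo slugs are hypothetical and the real `nextpoint_rules.py` may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoConfig:
    """One boolean flag per review dimension, as described above."""
    security: bool = True
    correctness: bool = True
    edge_cases: bool = True
    architecture: bool = True
    nextpoint_patterns: bool = True
    jira_fulfillment: bool = True
    test_coverage: bool = True
    repo_health: bool = True


_NGE_CONFIG = RepoConfig()  # all checks on
_MINIMAL_CONFIG = RepoConfig(
    edge_cases=False, architecture=False, nextpoint_patterns=False,
    jira_fulfillment=False, test_coverage=False, repo_health=False,
)

# Hypothetical slugs — each real repo maps to one tier.
REPO_CONFIGS = {
    "some-nge-module": _NGE_CONFIG,
    "some-legacy-repo": _MINIMAL_CONFIG,
}


def enabled_agents(slug: str) -> list:
    """Conditionally include agents per repo based on the RepoConfig flags."""
    cfg = REPO_CONFIGS.get(slug, _NGE_CONFIG)
    flags = {
        "security": cfg.security, "correctness": cfg.correctness,
        "edge_cases": cfg.edge_cases, "architecture": cfg.architecture,
        "nextpoint_patterns": cfg.nextpoint_patterns,
    }
    return [name for name, on in flags.items() if on]
```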
Rule injection into agents:
- `AGENT_ARCHITECTURE` receives `{nextpoint_context}` — condensed rules for general awareness
- `AGENT_NEXTPOINT_PATTERNS` receives `{nextpoint_rules}` — full rule set for dedicated pattern violation detection
Token budgets: agents get 4096 max tokens, verifier gets 8192.
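One way those budgets could be applied is when assembling the Bedrock converse request. The request shape below follows the converse API; the model id is a placeholder and `build_converse_request` is illustrative, not `shell/bedrock.py`'s actual interface.

```python
# Per-role maximum output tokens, as stated above.
MAX_TOKENS = {"agent": 4096, "verifier": 8192}


def build_converse_request(model_id: str, prompt: str, role: str = "agent") -> dict:
    """Build the keyword arguments for a Bedrock converse call with the
    role-appropriate token budget."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_TOKENS[role]},
    }
```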
### Full File Context Gathering

`gather.py` extends the base PR data with the full source of changed files (up to 15 files, 50KB each). This enables agents to find pre-existing bugs in adjacent code — not just issues in the diff. Binary files and lock files are skipped.
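The selection rules above (15-file cap, 50KB cap, skip binary and lock files) can be sketched as a filter. The helper name and the specific skip lists are illustrative assumptions, not `gather.py`'s actual API.

```python
MAX_FILES = 15
MAX_BYTES = 50 * 1024
# Illustrative skip lists — binary extensions and common lock files.
SKIP_SUFFIXES = (".lock", ".png", ".jpg", ".gz", ".zip")
SKIP_NAMES = {"Gemfile.lock", "package-lock.json", "yarn.lock"}


def select_context_files(changed_files: list) -> list:
    """changed_files: list of (path, size_in_bytes) for the PR's changed files.
    Returns the paths whose full source will be given to the agents."""
    selected = []
    for path, size in changed_files:
        name = path.rsplit("/", 1)[-1]
        if name in SKIP_NAMES or path.endswith(SKIP_SUFFIXES) or size > MAX_BYTES:
            continue
        selected.append(path)
        if len(selected) == MAX_FILES:
            break
    return selected
```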
### Bot Comment Deduplication

Reviews are posted with a `<!-- pr-review-bot -->` HTML comment marker. On subsequent pushes, the existing bot comment is updated rather than creating duplicates. Old unmarked comments (from before dedup) are cleaned up.
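The find-or-update logic behind the marker could be sketched as below. The `client` argument stands in for `shell/bitbucket_client.py`; its method names here are assumptions for illustration.

```python
BOT_MARKER = "<!-- pr-review-bot -->"


def post_review(client, pr_id: int, review_body: str) -> str:
    """Idempotently post a review: update the existing marked comment if one
    exists, otherwise create a new one."""
    body = f"{BOT_MARKER}\n{review_body}"
    for comment in client.list_comments(pr_id):
        if BOT_MARKER in comment["body"]:
            client.update_comment(pr_id, comment["id"], body)
            return "updated"
    client.create_comment(pr_id, body)
    return "created"
```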
### Complexity-Based Agent Scaling
PR complexity is estimated from diff line count. Small PRs get fewer agents to reduce Bedrock costs and latency. This keeps review time under the 5-minute Lambda timeout for most PRs.
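Applying the thresholds stated earlier (small PRs get a combined security-correctness agent plus architecture; larger PRs get all five), the selection reduces to a simple function. The function name and agent labels are illustrative.

```python
SMALL_PR_LINES = 50  # diff-line threshold separating small from medium/large


def agents_for(diff_line_count: int) -> list:
    """Pick the agent roster from estimated PR complexity (diff line count)."""
    if diff_line_count < SMALL_PR_LINES:
        return ["security_correctness", "architecture"]
    return ["security", "correctness", "edge_cases", "architecture", "nextpoint_patterns"]
```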
## Divergences from Standard NGE Patterns

| Area | Standard NGE | pr-review | Reason |
|---|---|---|---|
| Directory layout | `core/` + `shell/` + `handlers/` | Same — follows hexagonal pattern | Dependency injection: `BedrockCaller` callable injected from shell into core |
| Stack pattern | CommonResourcesStack + module | Single stack | No shared VPC/RDS/SNS dependencies |
| Error handling | RecoverableException hierarchy | HTTP status codes | No SQS retry semantics — webhook is fire-and-forget |
| Database | Per-case MySQL via SQLAlchemy | None | Stateless — all state is in Bitbucket/Jira |
| Testing | pytest + moto + autouse fixtures | Manual via kiro skill | Prototype stage — automated tests are a gap |
| Concurrency | SQS-based with reservedConcurrency | Lambda reserved (2) | API Gateway → Lambda, not SQS-driven |
## Unique Patterns (Not in Standard NGE)
These patterns are introduced by pr-review and could inform future services:
- Multi-agent orchestration — Parallel Bedrock calls with ThreadPoolExecutor, findings aggregation, false positive elimination via verifier agent
- Architecture-aware review — Repo→rules mapping that injects domain-specific rules into AI agent prompts based on the target repository
- Webhook signature verification — HMAC-SHA256 validation of Bitbucket payloads
- Comment deduplication — HTML marker pattern for idempotent comment updates
- Complexity-based scaling — Adjusting agent count based on diff size to balance cost vs. thoroughness
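The signature-verification bullet above can be made concrete with a short sketch. The `sha256=<hex>` prefix and the header carrying the digest are assumptions to confirm against the actual handler; the constant-time comparison is the essential part.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Validate a webhook payload against its HMAC-SHA256 signature header."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the strings first differ (timing attack).
    return hmac.compare_digest(expected, header_value)
```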
## Future: Closed-Loop Auto-Fix (Evaluation Item)
In March 2026, Anthropic shipped three Claude Code features that together form an autonomous PR pipeline: Code Review (multi-agent inline comments), Auto Mode (safety-classified autonomous operation), and Cloud Auto-Fix (event-driven CI failure and review comment resolution). These are GitHub-native and don't work with Bitbucket directly. However, the closed-loop pattern is applicable to pr-review.
### What Nextpoint pr-review Already Has vs What's New
| Capability | pr-review (current) | Claude Code March 2026 | Gap |
|---|---|---|---|
| Architecture-aware review | Yes — injects Nextpoint rules per repo | No — generic code review | pr-review is stronger |
| Specialized agents | 5 parallel (security, correctness, edge-cases, architecture, patterns) | Fleet of agents (generic) | pr-review is stronger |
| Platform | Bitbucket (webhooks -> Lambda) | GitHub only | pr-review works for our stack |
| Auto-fix from review comments | No | Yes — reads comments, pushes fix commits | Gap to evaluate |
| CI failure auto-fix | No | Yes — subscribes to CI events, pushes fixes | Gap to evaluate |
| Review -> fix -> re-review loop | No | Yes — closed loop | Gap to evaluate |
### Proposed Enhancement: Add Auto-Fix Agent
Extend pr-review with a fix agent that reads review findings and pushes corrections:
Current flow (open loop):

```
Bitbucket webhook -> pr-review -> post review comments -> developer fixes manually
```

Proposed flow (closed loop):

```
Bitbucket webhook -> pr-review -> post review comments
                                          |
                                          v
                                   auto-fix agent
                                   reads findings
                                   pushes fix commit
                                          |
                                          v
                                 pr-review re-reviews
                              (new push triggers webhook)
                                          |
                                          v
                              clean?        -> done
                              new findings? -> loop (max 2 iterations)
```
Which findings are auto-fixable:
| Finding Type | Auto-fixable? | Why |
|---|---|---|
| core/ imports from shell/ (boundary violation) | Yes | Mechanical — move import, inject dependency |
| Missing type hints | Yes | Mechanical — add types |
| Black/isort formatting | Yes | Mechanical — run formatter |
| Missing @retry_on_db_conflict | Likely | Pattern is well-defined |
| Event naming not past tense | Likely | Rename with known convention |
| Security vulnerability (SQL injection) | No | Requires understanding intent |
| Architectural redesign suggestion | No | Requires human judgment |
| Business logic error | No | Requires domain knowledge |
Safeguards:
- Max 2 auto-fix iterations per PR (prevent infinite loops)
- Auto-fix only for findings tagged "mechanical" or "formatting" by pr-review
- Human approval required before auto-fix commits are merged
- All auto-fix commits clearly attributed (Auto-fix by pr-review agent)
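The eligibility table and safeguards above suggest a gate like the following sketch, assuming findings carry a tag assigned by pr-review; the field names are hypothetical.

```python
# Only findings pr-review tags as mechanical/formatting are auto-fix candidates.
AUTO_FIXABLE_TAGS = {"mechanical", "formatting"}
MAX_FIX_ITERATIONS = 2  # loop guard from the safeguards above


def fixable_findings(findings: list, iteration: int) -> list:
    """Return the findings the auto-fix agent may act on, or nothing once
    the iteration cap is reached."""
    if iteration >= MAX_FIX_ITERATIONS:
        return []
    return [f for f in findings if f.get("tag") in AUTO_FIXABLE_TAGS]
```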
### Implementation Notes

- The auto-fix agent would be a new Lambda triggered by pr-review's own findings (SNS event after the review posts), not by Bitbucket webhooks
- Uses the Bitbucket API to check out the branch, apply fixes, and push
- pr-review's existing webhook handles the re-review automatically (a push to the branch triggers `pullrequest:updated`)
- Architecture-aware context (Nextpoint rules) carries over — the fix agent knows HOW to fix because it has the same rules the review agent used to find the issue
### Platform Consideration

If Nextpoint migrates to GitHub in the future, the `anthropics/claude-code-action@v1` GitHub Action could replace the Lambda-based architecture with a lighter-weight setup (~$5/month for 50 PRs). Until then, the Bitbucket webhook -> Lambda pattern remains correct, and the auto-fix enhancement builds on it.
## Design Pattern: Generator-Evaluator Separation (Anthropic Harness Paper)
Anthropic's March 2026 harness design paper validates and extends pr-review's architecture. Their finding: separating generation from evaluation is essential because LLMs exhibit self-evaluation bias — they consistently overpraise their own work, even when quality is obviously mediocre.
### How pr-review Already Implements This

pr-review's 5 specialized agents + verifier aggregator IS a Generator-Evaluator pattern:

- Generators: the 5 parallel agents (security, correctness, edge-cases, architecture, Nextpoint patterns) each produce independent findings
- Evaluator: the verifier aggregator reviews all 5 outputs, resolves conflicts, removes duplicates, and produces the final structured review

The key: no single agent evaluates its own work. Each agent's findings are evaluated by a separate verifier that has no investment in the generated output.
### Where pr-review Could Improve (From the Paper)

- Iterative generation for complex PRs: the current system is single-pass. For complex PRs (50+ file changes, architectural shifts), an iteration loop could catch more issues:
  - Pass 1: standard 5-agent review
  - Verifier identifies gaps: "Security agent missed auth bypass in file X"
  - Pass 2: targeted re-review of flagged areas only
  - Max 2 iterations (cost control)
- Sprint contracts for auto-fix: if implementing the auto-fix agent above, the Generator (fix agent) and Evaluator (pr-review re-review) should negotiate explicit "done" criteria before the fix agent begins — preventing fix attempts that create new issues.
- Harness simplification testing: every component encodes an assumption about what the model can't do alone. When evaluating model upgrades (e.g., Mythos):
  - Can 3 agents replace 5? (Does one agent catch security + correctness?)
  - Can the verifier be simplified to a lighter-weight check?
  - Does the architecture-aware context injection remain necessary?
See `article-reviews/group-25-harness-design-generator-evaluator-pattern.md` for the full paper analysis and Nextpoint implications.