Reference Implementation: pr-review

Overview

Automated PR review service for Bitbucket Cloud. Receives webhooks on pullrequest:created and pullrequest:updated, gathers PR diff + full file context + linked Jira tickets, dispatches parallel specialized review agents via Bedrock Claude, aggregates/deduplicates findings, and posts a structured review comment back to the PR.

Not an eDiscovery pipeline module — this is a developer tooling service that reviews code across all Nextpoint repositories.

Architecture

pr-review/
├── service/
│   ├── core/                        # Pure business logic — NO boto3, NO requests
│   │   ├── prompts.py               # All agent prompt templates (string constants)
│   │   ├── review.py                # Agent orchestration, findings parsing, metadata extraction
│   │   ├── nextpoint_rules.py       # Repo→rules mapping, per-repo review config flags
│   │   └── models.py                # TypedDict definitions (Finding, AgentResult, BedrockCaller)
│   ├── shell/                       # Infrastructure — all external I/O
│   │   ├── bedrock.py               # Bedrock converse API (Claude invocation)
│   │   ├── bitbucket_client.py      # Bitbucket Cloud REST API (Bearer token auth)
│   │   ├── jira_client.py           # Jira Cloud REST API (Basic auth)
│   │   └── gather.py                # PR data collection (metadata, diff, files, Jira tickets)
│   ├── handlers/
│   │   └── webhook.py               # Lambda entry point — parse, verify, route
│   ├── rules/                       # Bundled architecture rules (from nextpoint-architecture/rules/)
│   │   ├── api-design.rules.md
│   │   ├── cdk-infrastructure.rules.md
│   │   ├── dependency-management.rules.md
│   │   └── ...                      # ~10 rule files, synced from architecture repo
│   ├── config.py                    # Centralized config from Secrets Manager + env vars
│   ├── requirements.txt             # Python dependencies
│   └── cdk/
│       └── stack.py                 # CDK stack — Lambda + API Gateway + Secrets Manager
├── repos.yml                        # List of repos with PR review webhooks
├── scripts/
│   ├── gather-pr-data.sh            # Local data-gathering script (uses bkt CLI)
│   └── post-review.sh               # Local review posting script
├── kiro/
│   └── review-pr.md                 # Manual review skill for Claude Code
└── CLAUDE.md                        # Module-specific Claude Code config

Pattern Mapping

| Pattern | Implementation | Standard NGE Pattern |
|---|---|---|
| Entry point | API Gateway → Lambda (webhook) | SNS → SQS → Lambda |
| Event source | Bitbucket webhook (pullrequest:created/updated) | SNS events from other modules |
| Agent orchestration | ThreadPoolExecutor — parallel Bedrock calls | N/A (new pattern) |
| External API | Bitbucket REST API + Jira REST API | S3, RDS, Elasticsearch |
| AI integration | Bedrock converse API (Claude Opus 4.6) | Bedrock (nextpoint-ai uses similar) |
| Secrets | AWS Secrets Manager via config.py | Same — shell/utils/aws_secrets.py |
| Infrastructure | CDK Python (single stack, service/cdk/stack.py) | CDK TypeScript (two-stack pattern) |
| Error handling | Generic try/except with HTTP status codes | Exception hierarchy (Recoverable/Permanent/Silent) |
| Database | None — stateless service | SQLAlchemy + per-case MySQL |
| Hexagonal boundaries | core/ (prompts, review, rules) + shell/ (Bedrock, Bitbucket, Jira, gather) + handlers/ (webhook) | Same — core/ + shell/ separation |
| Testing | Manual testing via kiro skill | pytest + moto + autouse fixtures |
| Idempotency | Bot comment marker deduplication | Checkpoint-based duplicate detection |
| Concurrency control | reservedConcurrentExecutions: 2 | reservedConcurrency on orchestrator Lambdas |

Key Design Decisions

Multi-Agent Review Architecture

Orchestration pattern: Hierarchical — orchestrator delegates to specialized child agents working independently, then aggregates results. Compare with Plan-and-Execute (sequential steps, used by checkpoint pipelines) and ReAct (dynamic reasoning loop, not used in NGE).

Five specialized agents run in parallel, each with a focused mandate:

  1. Security — injection, secrets, auth bypass, XSS, CSRF
  2. Correctness — logic errors, null deref, resource leaks, API misuse
  3. Edge Cases — race conditions, environment-dependent behavior, regressions
  4. Architecture — design consistency, Jira fulfillment, test coverage, repo health
  5. Nextpoint Patterns — architecture rule violations specific to Nextpoint repos

A Verifier agent aggregates all findings, removes false positives, deduplicates, ranks by severity, and produces the final formatted review.
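The fan-out/aggregate shape of this orchestration can be sketched as below. This is a hypothetical illustration, not the actual review.py code — function names like run_agents and the injected call_agent callable (the BedrockCaller-style dependency that shell/ injects into core/) are assumptions:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical agent names matching the five mandates described above.
AGENTS = ["security", "correctness", "edge_cases", "architecture", "nextpoint_patterns"]

def run_agents(call_agent, diff, agents=AGENTS):
    """Fan out one Bedrock call per agent in parallel and collect
    every finding. `call_agent(name, diff)` is an injected callable
    so core logic stays free of boto3."""
    findings = []
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        futures = {pool.submit(call_agent, name, diff): name for name in agents}
        for future in as_completed(futures):
            findings.extend(future.result())
    return findings

def verify(call_agent, findings):
    """Single verifier pass over the combined findings: dedupe,
    drop false positives, rank by severity."""
    return call_agent("verifier", findings)
```

Because the caller is injected, tests can pass a stub in place of a real Bedrock client.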

Agent count scales with PR complexity:

  • Small (<50 diff lines): 2 agents (security-correctness combined + architecture)
  • Medium/Large (50+ diff lines): all 4-5 agents

Nextpoint Architecture Integration

Architecture rules are bundled locally within service/rules/ in the Lambda deployment package (not fetched at runtime from the architecture repo). This ensures reviews work even if the architecture repo is unavailable. Pipeline exclude patterns filter rules files from diff analysis to avoid noisy reviews.

The nextpoint_rules.py module maps repository slugs to the bundled rules via a RepoConfig class with granular per-repo flags:

  • RepoConfig class — boolean flags per review dimension: security, correctness, edge_cases, architecture, nextpoint_patterns, jira_fulfillment, test_coverage, repo_health
  • Config tiers — _NGE_CONFIG (all checks), _LEGACY_CONFIG (skip hexagonal), _JAVA_CONFIG (skip Python rules), _MINIMAL_CONFIG (security + correctness only). Each repo maps to a tier in the REPO_CONFIGS registry
  • Agent dispatch uses flags — agents are conditionally included/excluded per repo based on the RepoConfig flags (review.py)
  • Language auto-detection — detect_languages() parses diff --git headers to identify Python/Ruby/Java/TypeScript. get_language_rules() injects language-specific rules alongside architectural rules
  • _REPO_RULE_OVERRIDES — per-repo rule customization (e.g., Java modules skip Python-specific rules, get Java/Kotlin rules instead)
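The diff-header language detection described above can be sketched roughly as follows. This is an illustrative reconstruction, not the actual detect_languages() implementation — the extension map and regex are assumptions:

```python
import re

# Assumed extension-to-language map for the four languages named above.
_EXT_TO_LANG = {".py": "python", ".rb": "ruby", ".java": "java", ".ts": "typescript"}

def detect_languages(diff: str) -> set[str]:
    """Scan `diff --git a/<path> b/<path>` headers and map each
    changed file's extension to a review language."""
    langs = set()
    for match in re.finditer(r"^diff --git a/\S+ b/(\S+)$", diff, re.MULTILINE):
        path = match.group(1)
        for ext, lang in _EXT_TO_LANG.items():
            if path.endswith(ext):
                langs.add(lang)
    return langs
```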

Rule injection into agents:

  • AGENT_ARCHITECTURE receives {nextpoint_context} — condensed rules for general awareness
  • AGENT_NEXTPOINT_PATTERNS receives {nextpoint_rules} — full rule set for dedicated pattern violation detection

Token budgets: agents get 4096 max tokens, verifier gets 8192.

Full File Context Gathering

gather.py extends the base PR data with full source of changed files (up to 15 files, 50KB each). This enables agents to find pre-existing bugs in adjacent code — not just issues in the diff. Binary files and lock files are skipped.
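The filtering described above might look roughly like this. A minimal sketch, assuming hypothetical names (select_context_files, the constants, and the injected read_file callable); gather.py's actual lock-file list and helpers may differ:

```python
# Limits stated in the text above; constant names are assumptions.
MAX_FILES = 15
MAX_FILE_BYTES = 50 * 1024
SKIP_SUFFIXES = ("Gemfile.lock", "package-lock.json", "yarn.lock", "Pipfile.lock")

def select_context_files(changed_paths, read_file):
    """Return {path: source} for up to MAX_FILES changed files,
    skipping lock files, binaries, and oversized files.
    `read_file(path)` returns bytes or None."""
    context = {}
    for path in changed_paths:
        if len(context) >= MAX_FILES:
            break
        if path.endswith(SKIP_SUFFIXES):
            continue  # lock files add noise, not signal
        data = read_file(path)
        if data is None or len(data) > MAX_FILE_BYTES:
            continue
        try:
            context[path] = data.decode("utf-8")  # binary files fail here
        except UnicodeDecodeError:
            continue
    return context
```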

Bot Comment Deduplication

Reviews are posted with a <!-- pr-review-bot --> HTML comment marker. On subsequent pushes, the existing bot comment is updated rather than creating duplicates. Old unmarked comments (from before dedup) are cleaned up.
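The marker-based upsert is a small amount of logic; a sketch under assumed names (upsert_review_comment and the injected create/update API callables are illustrative, not the real bitbucket_client.py API):

```python
BOT_MARKER = "<!-- pr-review-bot -->"

def upsert_review_comment(comments, body, create, update):
    """Idempotent comment posting: if a comment carrying the bot
    marker already exists, update it in place; otherwise create a
    new one. `comments` is a list of {"id": ..., "body": ...} dicts;
    create/update are injected Bitbucket API callables."""
    full_body = f"{BOT_MARKER}\n{body}"
    for comment in comments:
        if BOT_MARKER in comment["body"]:
            return update(comment["id"], full_body)
    return create(full_body)
```

Because the marker is an HTML comment, it is invisible in the rendered PR view but survives round-trips through the API.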

Complexity-Based Agent Scaling

PR complexity is estimated from diff line count. Small PRs get fewer agents to reduce Bedrock costs and latency. This keeps review time under the 5-minute Lambda timeout for most PRs.
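The threshold logic can be sketched as below, assuming hypothetical names (select_agents and the two agent lists); only the <50-line threshold comes from the text above:

```python
# Agent sets from the complexity tiers described earlier;
# the combined-agent name is an assumption.
SMALL_PR_AGENTS = ["security_correctness", "architecture"]
FULL_AGENTS = ["security", "correctness", "edge_cases", "architecture", "nextpoint_patterns"]

def select_agents(diff: str, threshold: int = 50) -> list[str]:
    """Count added/removed lines in the unified diff and pick the
    reduced agent set for small PRs to cut Bedrock cost and keep
    runtime under the Lambda timeout."""
    changed = sum(
        1 for line in diff.splitlines()
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    )
    return SMALL_PR_AGENTS if changed < threshold else FULL_AGENTS
```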

Divergences from Standard NGE Patterns

| Area | Standard NGE | pr-review | Reason |
|---|---|---|---|
| Directory layout | core/ + shell/ + handlers/ | Same — follows hexagonal pattern | Dependency injection: BedrockCaller callable injected from shell into core |
| Stack pattern | CommonResourcesStack + module | Single stack | No shared VPC/RDS/SNS dependencies |
| Error handling | RecoverableException hierarchy | HTTP status codes | No SQS retry semantics — webhook is fire-and-forget |
| Database | Per-case MySQL via SQLAlchemy | None | Stateless — all state is in Bitbucket/Jira |
| Testing | pytest + moto + autouse fixtures | Manual via kiro skill | Prototype stage — automated tests are a gap |
| Concurrency | SQS-based with reservedConcurrency | Lambda reserved (2) | API Gateway → Lambda, not SQS-driven |

Unique Patterns (Not in Standard NGE)

These patterns are introduced by pr-review and could inform future services:

  1. Multi-agent orchestration — Parallel Bedrock calls with ThreadPoolExecutor, findings aggregation, false positive elimination via verifier agent
  2. Architecture-aware review — Repo→rules mapping that injects domain-specific rules into AI agent prompts based on the target repository
  3. Webhook signature verification — HMAC-SHA256 validation of Bitbucket payloads
  4. Comment deduplication — HTML marker pattern for idempotent comment updates
  5. Complexity-based scaling — Adjusting agent count based on diff size to balance cost vs. thoroughness
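Of the patterns above, webhook signature verification is standard enough to sketch concretely. A minimal example of HMAC-SHA256 payload validation with a constant-time comparison — the function name and the exact `sha256=<hexdigest>` header format are assumptions, not confirmed details of the handler:

```python
import hashlib
import hmac

def verify_webhook_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Recompute the HMAC-SHA256 of the raw request body and compare
    it to the signature header in constant time, so an attacker
    cannot forge webhook deliveries or exploit timing differences."""
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

Note that verification must run on the raw request bytes, before any JSON parsing or re-serialization.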

Future: Closed-Loop Auto-Fix (Evaluation Item)

In March 2026, Anthropic shipped three Claude Code features that together form an autonomous PR pipeline: Code Review (multi-agent inline comments), Auto Mode (safety-classified autonomous operation), and Cloud Auto-Fix (event-driven CI failure and review comment resolution). These are GitHub-native and don't work with Bitbucket directly. However, the closed-loop pattern is applicable to pr-review.

What Nextpoint pr-review Already Has vs What's New

| Capability | pr-review (current) | Claude Code March 2026 | Gap |
|---|---|---|---|
| Architecture-aware review | Yes — injects Nextpoint rules per repo | No — generic code review | pr-review is stronger |
| Specialized agents | 5 parallel (security, correctness, edge-cases, architecture, patterns) | Fleet of agents (generic) | pr-review is stronger |
| Platform | Bitbucket (webhooks -> Lambda) | GitHub only | pr-review works for our stack |
| Auto-fix from review comments | No | Yes — reads comments, pushes fix commits | Gap to evaluate |
| CI failure auto-fix | No | Yes — subscribes to CI events, pushes fixes | Gap to evaluate |
| Review -> fix -> re-review loop | No | Yes — closed loop | Gap to evaluate |

Proposed Enhancement: Add Auto-Fix Agent

Extend pr-review with a fix agent that reads review findings and pushes corrections:

Current flow (open loop):
  Bitbucket webhook -> pr-review -> post review comments -> developer fixes manually

Proposed flow (closed loop):
  Bitbucket webhook -> pr-review -> post review comments
                                         |
                                         v
                                    auto-fix agent
                                    reads findings
                                    pushes fix commit
                                         |
                                         v
                                    pr-review re-reviews
                                    (new push triggers webhook)
                                         |
                                         v
                                    clean? -> done
                                    new findings? -> loop (max 2 iterations)

Which findings are auto-fixable:

| Finding Type | Auto-fixable? | Why |
|---|---|---|
| core/ imports from shell/ (boundary violation) | Yes | Mechanical — move import, inject dependency |
| Missing type hints | Yes | Mechanical — add types |
| Black/isort formatting | Yes | Mechanical — run formatter |
| Missing @retry_on_db_conflict | Likely | Pattern is well-defined |
| Event naming not past tense | Likely | Rename with known convention |
| Security vulnerability (SQL injection) | No | Requires understanding intent |
| Architectural redesign suggestion | No | Requires human judgment |
| Business logic error | No | Requires domain knowledge |

Safeguards:

  • Max 2 auto-fix iterations per PR (prevent infinite loops)
  • Auto-fix only for findings tagged "mechanical" or "formatting" by pr-review
  • Human approval required before auto-fix commits are merged
  • All auto-fix commits clearly attributed (Auto-fix by pr-review agent)
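The iteration cap and tag gate could be enforced by a small loop controller. A sketch of the proposed (not yet built) design — run_fix_loop, the finding shape, and the injected review/apply_fixes callables are all assumptions:

```python
# Limits from the safeguards above.
MAX_FIX_ITERATIONS = 2
FIXABLE_TAGS = {"mechanical", "formatting"}

def run_fix_loop(review, apply_fixes):
    """Proposed loop control: re-review after each fix push and stop
    when the PR is clean or the iteration cap is hit. `review()`
    returns a list of findings (dicts with a `tags` list);
    `apply_fixes(findings)` pushes a fix commit for the safe subset."""
    for _ in range(MAX_FIX_ITERATIONS):
        findings = review()
        fixable = [f for f in findings if FIXABLE_TAGS & set(f.get("tags", []))]
        if not fixable:
            return findings  # clean, or nothing we can safely auto-fix
        apply_fixes(fixable)
    return review()  # final state after hitting the cap
```

In the real service the push itself would re-trigger the pullrequest:updated webhook, so the "re-review" step is implicit rather than a direct call.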

Implementation Notes

  • The auto-fix agent would be a new Lambda triggered by pr-review's own findings (SNS event after review posts), not by Bitbucket webhooks
  • Uses Bitbucket API to checkout the branch, apply fixes, and push
  • pr-review's existing webhook handles the re-review automatically (push to branch triggers pullrequest:updated)
  • Architecture-aware context (Nextpoint rules) carries over — the fix agent knows HOW to fix because it has the same rules the review agent used to find the issue

Platform Consideration

If Nextpoint migrates to GitHub in the future, the anthropics/claude-code-action@v1 GitHub Action could replace the Lambda-based architecture with a lighter-weight setup (~$5/month for 50 PRs). Until then, the Bitbucket webhook -> Lambda pattern remains correct, and the auto-fix enhancement builds on it.

Design Pattern: Generator-Evaluator Separation (Anthropic Harness Paper)

Anthropic's March 2026 harness design paper validates and extends pr-review's architecture. Their finding: separating generation from evaluation is essential because LLMs exhibit self-evaluation bias — they consistently overpraise their own work, even when quality is obviously mediocre.

How pr-review Already Implements This

pr-review's 5 specialized agents + verifier aggregator IS a Generator-Evaluator pattern:

  • Generators: 5 parallel agents (security, correctness, edge-cases, architecture, Nextpoint patterns) each produce independent findings
  • Evaluator: Verifier aggregator reviews all 5 outputs, resolves conflicts, removes duplicates, and produces the final structured review

The key: no single agent evaluates its own work. Each agent's findings are evaluated by a separate verifier that has no investment in the generated output.
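The separation can be made explicit in code: the evaluator call receives only the generators' outputs, never any generator's own transcript. A hypothetical sketch (evaluate_findings and the call_model callable are illustrative names):

```python
def evaluate_findings(call_model, verifier_prompt, agent_findings):
    """Generator-evaluator separation: the verifier sees each
    agent's findings labeled by source but has no stake in any of
    them, avoiding the self-evaluation bias the paper describes.
    `agent_findings` maps agent name -> list of finding strings."""
    combined = "\n".join(
        f"[{agent}] {finding}"
        for agent, findings in agent_findings.items()
        for finding in findings
    )
    return call_model(verifier_prompt, combined)
```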

Where pr-review Could Improve (From the Paper)

  1. Iterative generation for complex PRs — The current system is single-pass. For complex PRs (50+ file changes, architectural shifts), an iteration loop could catch more issues:
     • Pass 1: Standard 5-agent review
     • Verifier identifies gaps: "Security agent missed auth bypass in file X"
     • Pass 2: Targeted re-review of flagged areas only
     • Max 2 iterations (cost control)

  2. Sprint contracts for auto-fix — If implementing the auto-fix agent above, the Generator (fix agent) and Evaluator (pr-review re-review) should negotiate explicit "done" criteria before the fix agent begins — preventing fix attempts that create new issues.

  3. Harness simplification testing — Every component encodes an assumption about what the model can't do alone. When evaluating model upgrades (e.g., Mythos):
     • Can 3 agents replace 5? (Does one agent catch security + correctness?)
     • Can the verifier be simplified to a lighter-weight check?
     • Does the architecture-aware context injection remain necessary?

See article-reviews/group-25-harness-design-generator-evaluator-pattern.md for the full paper analysis and Nextpoint implications.
