# Reference Implementation: pr-review

## Overview
Automated PR review service for Bitbucket Cloud. Receives webhooks on
pullrequest:created and pullrequest:updated, gathers PR diff + full file
context + linked Jira tickets, dispatches parallel specialized review agents
via Bedrock Claude, aggregates/deduplicates findings, and posts a structured
review comment back to the PR.
Not an eDiscovery pipeline module — this is a developer tooling service that reviews code across all Nextpoint repositories.
## Architecture

```
pr-review/
├── service/
│   ├── core/                    # Pure business logic — NO boto3, NO requests
│   │   ├── prompts.py           # All agent prompt templates (string constants)
│   │   ├── review.py            # Agent orchestration, findings parsing, metadata extraction
│   │   ├── nextpoint_rules.py   # Repo→rules mapping, per-repo review config flags
│   │   └── models.py            # TypedDict definitions (Finding, AgentResult, BedrockCaller)
│   ├── shell/                   # Infrastructure — all external I/O
│   │   ├── bedrock.py           # Bedrock converse API (Claude invocation)
│   │   ├── bitbucket_client.py  # Bitbucket Cloud REST API (Bearer token auth)
│   │   ├── jira_client.py       # Jira Cloud REST API (Basic auth)
│   │   └── gather.py            # PR data collection (metadata, diff, files, Jira tickets)
│   ├── handlers/
│   │   └── webhook.py           # Lambda entry point — parse, verify, route
│   ├── rules/                   # Bundled architecture rules (from nextpoint-architecture/rules/)
│   │   ├── api-design.rules.md
│   │   ├── cdk-infrastructure.rules.md
│   │   ├── dependency-management.rules.md
│   │   └── ...                  # ~10 rule files, synced from architecture repo
│   ├── config.py                # Centralized config from Secrets Manager + env vars
│   ├── requirements.txt         # Python dependencies
│   └── cdk/
│       └── stack.py             # CDK stack — Lambda + API Gateway + Secrets Manager
├── repos.yml                    # List of repos with PR review webhooks
├── scripts/
│   ├── gather-pr-data.sh        # Local data-gathering script (uses bkt CLI)
│   └── post-review.sh           # Local review posting script
├── kiro/
│   └── review-pr.md             # Manual review skill for Claude Code
└── CLAUDE.md                    # Module-specific Claude Code config
```
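As a rough illustration of the `handlers/webhook.py` responsibilities listed above (parse, verify, route), a minimal parse-and-route sketch might look like the following. The function name `route` and the `X-Event-Key` header lookup are illustrative assumptions, not the module's actual API, and signature verification is omitted here.

```python
import json

# Bitbucket event keys this service acts on; all others are acknowledged
# and ignored.
HANDLED_EVENTS = {"pullrequest:created", "pullrequest:updated"}


def route(event: dict) -> dict:
    """event: the API Gateway proxy event delivered to the Lambda."""
    event_key = event.get("headers", {}).get("X-Event-Key", "")
    if event_key not in HANDLED_EVENTS:
        # Other webhook events (e.g. comment activity) are not reviewed.
        return {"statusCode": 200, "body": "ignored"}
    payload = json.loads(event.get("body", "{}"))
    pr_id = payload["pullrequest"]["id"]
    return {"statusCode": 202, "body": f"review started for PR {pr_id}"}
```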
## Pattern Mapping

| Pattern | Implementation | Standard NGE Pattern |
|---|---|---|
| Entry point | API Gateway → Lambda (webhook) | SNS → SQS → Lambda |
| Event source | Bitbucket webhook (`pullrequest:created`/`updated`) | SNS events from other modules |
| Agent orchestration | ThreadPoolExecutor — parallel Bedrock calls | N/A (new pattern) |
| External API | Bitbucket REST API + Jira REST API | S3, RDS, Elasticsearch |
| AI integration | Bedrock converse API (Claude Opus 4.6) | Bedrock (nextpoint-ai uses similar) |
| Secrets | AWS Secrets Manager via `config.py` | Same — `shell/utils/aws_secrets.py` |
| Infrastructure | CDK Python (single stack, `service/cdk/stack.py`) | CDK TypeScript (two-stack pattern) |
| Error handling | Generic try/except with HTTP status codes | Exception hierarchy (Recoverable/Permanent/Silent) |
| Database | None — stateless service | SQLAlchemy + per-case MySQL |
| Hexagonal boundaries | `core/` (prompts, review, rules) + `shell/` (Bedrock, Bitbucket, Jira, gather) + `handlers/` (webhook) | Same — `core/` + `shell/` separation |
| Testing | Manual testing via kiro skill | pytest + moto + autouse fixtures |
| Idempotency | Bot comment marker deduplication | Checkpoint-based duplicate detection |
| Concurrency control | `reservedConcurrentExecutions: 2` | `reservedConcurrency` on orchestrator Lambdas |
## Key Design Decisions

### Multi-Agent Review Architecture
Orchestration pattern: Hierarchical — orchestrator delegates to specialized child agents working independently, then aggregates results. Compare with Plan-and-Execute (sequential steps, used by checkpoint pipelines) and ReAct (dynamic reasoning loop, not used in NGE).
Five specialized agents run in parallel, each with a focused mandate:
- Security — injection, secrets, auth bypass, XSS, CSRF
- Correctness — logic errors, null deref, resource leaks, API misuse
- Edge Cases — race conditions, environment-dependent behavior, regressions
- Architecture — design consistency, Jira fulfillment, test coverage, repo health
- Nextpoint Patterns — architecture rule violations specific to Nextpoint repos
A Verifier agent aggregates all findings, removes false positives, deduplicates, ranks by severity, and produces the final formatted review.
Agent count scales with PR complexity:
- Small (<50 diff lines): 2 agents (security-correctness combined + architecture)
- Medium/Large (50+ diff lines): all 4-5 agents
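The fan-out/fan-in described above could be sketched as follows. This is an illustrative sketch, not `review.py`'s actual code: `call_bedrock` stands in for the `BedrockCaller` callable injected from `shell/bedrock.py`, and the prompt/finding shapes are simplified.

```python
from concurrent.futures import ThreadPoolExecutor


def run_agents(agent_prompts: dict, diff: str, call_bedrock) -> list:
    """Dispatch every specialized agent prompt to Bedrock in parallel,
    returning all findings so the verifier can aggregate them afterward."""
    with ThreadPoolExecutor(max_workers=len(agent_prompts)) as pool:
        futures = {
            name: pool.submit(call_bedrock, prompt.format(diff=diff))
            for name, prompt in agent_prompts.items()
        }
        # Block until every agent has returned; the verifier only runs on
        # the complete set of findings.
        return [{"agent": name, "findings": f.result()} for name, f in futures.items()]
```

Because the agents are independent, wall-clock time is roughly the slowest single Bedrock call rather than the sum of all five.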
### Nextpoint Architecture Integration
Architecture rules are bundled locally within service/rules/ in the Lambda
deployment package (not fetched at runtime from the architecture repo). This
ensures reviews work even if the architecture repo is unavailable. Pipeline
exclude patterns filter rules files from diff analysis to avoid noisy reviews.
The `nextpoint_rules.py` module maps repository slugs to the bundled rules via a `RepoConfig` class with granular per-repo flags:
- `RepoConfig` class — boolean flags per review dimension: `security`, `correctness`, `edge_cases`, `architecture`, `nextpoint_patterns`, `jira_fulfillment`, `test_coverage`, `repo_health`
- Config tiers — `_NGE_CONFIG` (all checks), `_LEGACY_CONFIG` (skip hexagonal), `_JAVA_CONFIG` (skip Python rules), `_MINIMAL_CONFIG` (security + correctness only). Each repo maps to a tier in the `REPO_CONFIGS` registry
- Agent dispatch uses flags — agents are conditionally included/excluded per repo based on the `RepoConfig` flags (`review.py`)
- Language auto-detection — `detect_languages()` parses `diff --git` headers to identify Python/Ruby/Java/TypeScript; `get_language_rules()` injects language-specific rules alongside architectural rules
- `_REPO_RULE_OVERRIDES` — per-repo rule customization (e.g., Java modules skip Python-specific rules and get Java/Kotlin rules instead)
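The flag-and-tier structure above might look roughly like this. The field names mirror the bullet list, but the repo slugs are hypothetical and the real `nextpoint_rules.py` may differ in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RepoConfig:
    """One boolean flag per review dimension, as described above."""
    security: bool = True
    correctness: bool = True
    edge_cases: bool = True
    architecture: bool = True
    nextpoint_patterns: bool = True
    jira_fulfillment: bool = True
    test_coverage: bool = True
    repo_health: bool = True


_NGE_CONFIG = RepoConfig()  # all checks on
_MINIMAL_CONFIG = RepoConfig(
    edge_cases=False, architecture=False, nextpoint_patterns=False,
    jira_fulfillment=False, test_coverage=False, repo_health=False,
)

# Hypothetical slugs — each real repo maps to one tier.
REPO_CONFIGS = {
    "some-nge-module": _NGE_CONFIG,
    "some-legacy-repo": _MINIMAL_CONFIG,
}


def enabled_agents(slug: str) -> list:
    """Conditionally include agents per repo based on the RepoConfig flags."""
    cfg = REPO_CONFIGS.get(slug, _NGE_CONFIG)
    flags = {
        "security": cfg.security, "correctness": cfg.correctness,
        "edge_cases": cfg.edge_cases, "architecture": cfg.architecture,
        "nextpoint_patterns": cfg.nextpoint_patterns,
    }
    return [name for name, on in flags.items() if on]
```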
Rule injection into agents:
- `AGENT_ARCHITECTURE` receives `{nextpoint_context}` — condensed rules for general awareness
- `AGENT_NEXTPOINT_PATTERNS` receives `{nextpoint_rules}` — full rule set for dedicated pattern violation detection
Token budgets: agents get 4096 max tokens, verifier gets 8192.
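One way those budgets could be applied is when assembling the Bedrock converse request. The request shape below follows the converse API; the model id is a placeholder and `build_converse_request` is illustrative, not `shell/bedrock.py`'s actual interface.

```python
# Per-role maximum output tokens, as stated above.
MAX_TOKENS = {"agent": 4096, "verifier": 8192}


def build_converse_request(model_id: str, prompt: str, role: str = "agent") -> dict:
    """Build the keyword arguments for a Bedrock converse call with the
    role-appropriate token budget."""
    return {
        "modelId": model_id,
        "messages": [{"role": "user", "content": [{"text": prompt}]}],
        "inferenceConfig": {"maxTokens": MAX_TOKENS[role]},
    }
```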
### Full File Context Gathering

`gather.py` extends the base PR data with the full source of changed files (up to 15 files, 50KB each). This enables agents to find pre-existing bugs in adjacent code — not just issues in the diff. Binary files and lock files are skipped.
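The selection rules above (15-file cap, 50KB cap, skip binary and lock files) can be sketched as a filter. The helper name and the specific skip lists are illustrative assumptions, not `gather.py`'s actual API.

```python
MAX_FILES = 15
MAX_BYTES = 50 * 1024
# Illustrative skip lists — binary extensions and common lock files.
SKIP_SUFFIXES = (".lock", ".png", ".jpg", ".gz", ".zip")
SKIP_NAMES = {"Gemfile.lock", "package-lock.json", "yarn.lock"}


def select_context_files(changed_files: list) -> list:
    """changed_files: list of (path, size_in_bytes) for the PR's changed files.
    Returns the paths whose full source will be given to the agents."""
    selected = []
    for path, size in changed_files:
        name = path.rsplit("/", 1)[-1]
        if name in SKIP_NAMES or path.endswith(SKIP_SUFFIXES) or size > MAX_BYTES:
            continue
        selected.append(path)
        if len(selected) == MAX_FILES:
            break
    return selected
```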
### Bot Comment Deduplication

Reviews are posted with a `<!-- pr-review-bot -->` HTML comment marker. On subsequent pushes, the existing bot comment is updated rather than creating duplicates. Old unmarked comments (from before dedup) are cleaned up.
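The find-or-update logic behind the marker could be sketched as below. The `client` argument stands in for `shell/bitbucket_client.py`; its method names here are assumptions for illustration.

```python
BOT_MARKER = "<!-- pr-review-bot -->"


def post_review(client, pr_id: int, review_body: str) -> str:
    """Idempotently post a review: update the existing marked comment if one
    exists, otherwise create a new one."""
    body = f"{BOT_MARKER}\n{review_body}"
    for comment in client.list_comments(pr_id):
        if BOT_MARKER in comment["body"]:
            client.update_comment(pr_id, comment["id"], body)
            return "updated"
    client.create_comment(pr_id, body)
    return "created"
```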
### Complexity-Based Agent Scaling
PR complexity is estimated from diff line count. Small PRs get fewer agents to reduce Bedrock costs and latency. This keeps review time under the 5-minute Lambda timeout for most PRs.
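Applying the thresholds stated earlier (small PRs get a combined security-correctness agent plus architecture; larger PRs get all five), the selection reduces to a simple function. The function name and agent labels are illustrative.

```python
SMALL_PR_LINES = 50  # diff-line threshold separating small from medium/large


def agents_for(diff_line_count: int) -> list:
    """Pick the agent roster from estimated PR complexity (diff line count)."""
    if diff_line_count < SMALL_PR_LINES:
        return ["security_correctness", "architecture"]
    return ["security", "correctness", "edge_cases", "architecture", "nextpoint_patterns"]
```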
## Divergences from Standard NGE Patterns

| Area | Standard NGE | pr-review | Reason |
|---|---|---|---|
| Directory layout | `core/` + `shell/` + `handlers/` | Same — follows hexagonal pattern | Dependency injection: `BedrockCaller` callable injected from shell into core |
| Stack pattern | CommonResourcesStack + module | Single stack | No shared VPC/RDS/SNS dependencies |
| Error handling | RecoverableException hierarchy | HTTP status codes | No SQS retry semantics — webhook is fire-and-forget |
| Database | Per-case MySQL via SQLAlchemy | None | Stateless — all state is in Bitbucket/Jira |
| Testing | pytest + moto + autouse fixtures | Manual via kiro skill | Prototype stage — automated tests are a gap |
| Concurrency | SQS-based with reservedConcurrency | Lambda reserved (2) | API Gateway → Lambda, not SQS-driven |
## Unique Patterns (Not in Standard NGE)
These patterns are introduced by pr-review and could inform future services:
- Multi-agent orchestration — Parallel Bedrock calls with ThreadPoolExecutor, findings aggregation, false positive elimination via verifier agent
- Architecture-aware review — Repo→rules mapping that injects domain-specific rules into AI agent prompts based on the target repository
- Webhook signature verification — HMAC-SHA256 validation of Bitbucket payloads
- Comment deduplication — HTML marker pattern for idempotent comment updates
- Complexity-based scaling — Adjusting agent count based on diff size to balance cost vs. thoroughness
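The signature-verification bullet above can be made concrete with a short sketch. The `sha256=<hex>` prefix and the header carrying the digest are assumptions to confirm against the actual handler; the constant-time comparison is the essential part.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, raw_body: bytes, header_value: str) -> bool:
    """Validate a webhook payload against its HMAC-SHA256 signature header."""
    expected = "sha256=" + hmac.new(secret, raw_body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking where the strings first differ (timing attack).
    return hmac.compare_digest(expected, header_value)
```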
## Future: Closed-Loop Auto-Fix (Evaluation Item)
In March 2026, Anthropic shipped three Claude Code features that together form an autonomous PR pipeline: Code Review (multi-agent inline comments), Auto Mode (safety-classified autonomous operation), and Cloud Auto-Fix (event-driven CI failure and review comment resolution). These are GitHub-native and don't work with Bitbucket directly. However, the closed-loop pattern is applicable to pr-review.
### What Nextpoint pr-review Already Has vs What's New
| Capability | pr-review (current) | Claude Code March 2026 | Gap |
|---|---|---|---|
| Architecture-aware review | Yes — injects Nextpoint rules per repo | No — generic code review | pr-review is stronger |
| Specialized agents | 5 parallel (security, correctness, edge-cases, architecture, patterns) | Fleet of agents (generic) | pr-review is stronger |
| Platform | Bitbucket (webhooks -> Lambda) | GitHub only | pr-review works for our stack |
| Auto-fix from review comments | No | Yes — reads comments, pushes fix commits | Gap to evaluate |
| CI failure auto-fix | No | Yes — subscribes to CI events, pushes fixes | Gap to evaluate |
| Review -> fix -> re-review loop | No | Yes — closed loop | Gap to evaluate |
### Proposed Enhancement: Add Auto-Fix Agent
Extend pr-review with a fix agent that reads review findings and pushes corrections:
Current flow (open loop):

```
Bitbucket webhook -> pr-review -> post review comments -> developer fixes manually
```

Proposed flow (closed loop):

```
Bitbucket webhook -> pr-review -> post review comments
                                          |
                                          v
                                   auto-fix agent
                                   reads findings
                                   pushes fix commit
                                          |
                                          v
                                 pr-review re-reviews
                              (new push triggers webhook)
                                          |
                                          v
                              clean?        -> done
                              new findings? -> loop (max 2 iterations)
```
Which findings are auto-fixable:
| Finding Type | Auto-fixable? | Why |
|---|---|---|
| core/ imports from shell/ (boundary violation) | Yes | Mechanical — move import, inject dependency |
| Missing type hints | Yes | Mechanical — add types |
| Black/isort formatting | Yes | Mechanical — run formatter |
| Missing @retry_on_db_conflict | Likely | Pattern is well-defined |
| Event naming not past tense | Likely | Rename with known convention |
| Security vulnerability (SQL injection) | No | Requires understanding intent |
| Architectural redesign suggestion | No | Requires human judgment |
| Business logic error | No | Requires domain knowledge |
Safeguards:
- Max 2 auto-fix iterations per PR (prevent infinite loops)
- Auto-fix only for findings tagged "mechanical" or "formatting" by pr-review
- Human approval required before auto-fix commits are merged
- All auto-fix commits clearly attributed (Auto-fix by pr-review agent)
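The eligibility table and safeguards above suggest a gate like the following sketch, assuming findings carry a tag assigned by pr-review; the field names are hypothetical.

```python
# Only findings pr-review tags as mechanical/formatting are auto-fix candidates.
AUTO_FIXABLE_TAGS = {"mechanical", "formatting"}
MAX_FIX_ITERATIONS = 2  # loop guard from the safeguards above


def fixable_findings(findings: list, iteration: int) -> list:
    """Return the findings the auto-fix agent may act on, or nothing once
    the iteration cap is reached."""
    if iteration >= MAX_FIX_ITERATIONS:
        return []
    return [f for f in findings if f.get("tag") in AUTO_FIXABLE_TAGS]
```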
### Implementation Notes

- The auto-fix agent would be a new Lambda triggered by pr-review's own findings (SNS event after the review posts), not by Bitbucket webhooks
- Uses the Bitbucket API to check out the branch, apply fixes, and push
- pr-review's existing webhook handles the re-review automatically (a push to the branch triggers `pullrequest:updated`)
- Architecture-aware context (Nextpoint rules) carries over — the fix agent knows HOW to fix because it has the same rules the review agent used to find the issue
### Platform Consideration

If Nextpoint migrates to GitHub in the future, the `anthropics/claude-code-action@v1` GitHub Action could replace the Lambda-based architecture with a lighter-weight setup (~$5/month for 50 PRs). Until then, the Bitbucket webhook -> Lambda pattern remains correct, and the auto-fix enhancement builds on it.
## Design Pattern: Generator-Evaluator Separation (Anthropic Harness Paper)
Anthropic's March 2026 harness design paper validates and extends pr-review's architecture. Their finding: separating generation from evaluation is essential because LLMs exhibit self-evaluation bias — they consistently overpraise their own work, even when quality is obviously mediocre.
### How pr-review Already Implements This

pr-review's 5 specialized agents + verifier aggregator IS a Generator-Evaluator pattern:

- Generators: the 5 parallel agents (security, correctness, edge-cases, architecture, Nextpoint patterns) each produce independent findings
- Evaluator: the verifier aggregator reviews all 5 outputs, resolves conflicts, removes duplicates, and produces the final structured review

The key: no single agent evaluates its own work. Each agent's findings are evaluated by a separate verifier that has no investment in the generated output.
### Where pr-review Could Improve (From the Paper)

- Iterative generation for complex PRs: the current system is single-pass. For complex PRs (50+ file changes, architectural shifts), an iteration loop could catch more issues:
  - Pass 1: standard 5-agent review
  - Verifier identifies gaps: "Security agent missed auth bypass in file X"
  - Pass 2: targeted re-review of flagged areas only
  - Max 2 iterations (cost control)
- Sprint contracts for auto-fix: if implementing the auto-fix agent above, the Generator (fix agent) and Evaluator (pr-review re-review) should negotiate explicit "done" criteria before the fix agent begins — preventing fix attempts that create new issues.
- Harness simplification testing: every component encodes an assumption about what the model can't do alone. When evaluating model upgrades (e.g., Mythos):
  - Can 3 agents replace 5? (Does one agent catch security + correctness?)
  - Can the verifier be simplified to a lighter-weight check?
  - Does the architecture-aware context injection remain necessary?
See `article-reviews/group-25-harness-design-generator-evaluator-pattern.md` for the full paper analysis and Nextpoint implications.