Editor’s note: David Loker is a speaker for ODSC AI West this October 28th-30th. Check out his talk, Context Engineering for AI Code Reviews with MCP, LLMs, and Open-Source DevOps Tooling, there!
TL;DR: Your AI can generate a React component in seconds, but ask it to fix the bug in a 30-line PR and it hallucinates issues that don’t exist. The problem isn’t the model — it’s the context, or the lack thereof. This post shares a compact technique called Outside-Diff Impact Slicing that looks beyond the patch to catch bugs at caller/callee boundaries. You’ll run one Python script using OpenAI’s Responses API with GPT-5-mini and get structured, evidence-backed findings ready to paste into a PR.
Note: This works best for focused PRs (10–50 changed lines). For larger changes, see “Try this next” at the end.
The real problem: diffs hide the contracts
Here’s the thing about code review: the diff view lies to you. It shows what changed, but not what those changes might break. For example, when you add a parameter to a function, the diff won’t show you the twelve call sites that are now passing the wrong number of arguments. Or when you change a return type, the diff won’t highlight the upstream code expecting the old format.
Most AI code review tools make the same mistake of sending the LLM a patch and asking it to “find bugs.” But the most critical bugs aren’t in the patch. They’re at the boundaries between changed code and unchanged code. That’s where contracts get violated.
Outside-Diff Impact Slicing fixes this by asking a simple question: “What’s one hop away from this change?” Specifically:
- Callers: What code calls the functions/classes I just modified?
- Callees: What functions/classes does my changed code call?
These boundaries are where the interesting and critical bugs live, especially those with the highest potential to cause downtime. One important refinement: extract calls from the changed lines themselves, not from the entire changed file. If line 55 calls DatabaseConnection, you care about that contract. You don’t care about the unrelated validate_input call on line 200.
The technique: six focused steps
The full script is ~400 lines (available on GitHub), but the core technique breaks into six pieces. I’ll show you the interesting parts.
Step 1: Parse the diff for exact line numbers
This step needs surgical precision. Instead of simply saying “this file changed,” you want line-level granularity: “lines 55–57 in reporting/recreate.py changed.”
# Shared imports for the snippets in this post
import ast
import pathlib
import subprocess
from typing import Dict, Set

def changed_lines(repo=".") -> Dict[str, Set[int]]:
    """Extract changed line numbers from git diff."""
    diff = subprocess.check_output(
        ["git", "-C", repo, "diff", "--unified=0", "--no-color", "HEAD~1"]
    ).decode()
    current = None
    changes: Dict[str, Set[int]] = {}
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current = line[6:]  # Extract filename
        elif line.startswith("@@") and current:
            # Parse hunk header: @@ -10,3 +27,8 @@
            # We want the "+27,8" part (new file line numbers)
            parts = [p for p in line.split() if p.startswith("+")]
            if not parts:
                continue
            hunk = parts[0]  # "+27,8" or "+42"
            start = int(hunk.split(",")[0][1:])
            count = int(hunk.split(",")[1]) if "," in hunk else 1
            changes.setdefault(current, set()).update(range(start, start + count))
    return changes
Why it matters: Line-level granularity lets you focus your analysis. If only line 55 changed, you don’t care about the function call on line 200.
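For instance, a run against a repo where only lines 55–57 of reporting/recreate.py changed would return something like this (illustrative output):
>>> changed_lines(".")
{'reporting/recreate.py': {55, 56, 57}}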
Step 2: Extract calls from changed lines only
Focus matters. Instead of grabbing all function calls in a changed file, only extract calls from the specific lines that changed.
def calls_in_lines(path: str, lines: Set[int]) -> Set[str]:
    """Extract function/class calls within specific line numbers."""
    src = pathlib.Path(path).read_text(encoding="utf-8")
    tree = ast.parse(src)
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only grab calls that occur on changed lines
            if hasattr(node, 'lineno') and node.lineno in lines:
                calls.add(node.func.id)
    return calls
Why it matters: If a file has 200 lines with 50 function calls, but only 3 lines changed with 1 call, you analyze 1 contract instead of 50: less noise, more signal.
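As a quick illustration using the example that appears later in this post, if line 73 of a worker module is the only changed line and it instantiates DatabaseConnection, only that one call comes back:
>>> calls_in_lines("src/workers/data_sync.py", {73})
{'DatabaseConnection'}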
Step 3: Identify signature changes for caller analysis
For the caller direction, you only care about functions/classes whose signatures changed — not every function that happens to contain a changed line.
def symbols_with_signature_changes(path: str, lines: Set[int]) -> Set[str]:
    """Find functions/classes whose SIGNATURES were changed (def line itself)."""
    src = pathlib.Path(path).read_text(encoding="utf-8")
    tree = ast.parse(src)
    changed_signatures = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Check if the definition line itself was changed
            if node.lineno in lines:
                changed_signatures.add(node.name)
            # For classes, also check if __init__ signature changed
            if isinstance(node, ast.ClassDef):
                for item in node.body:
                    if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)) and item.name == "__init__":
                        if item.lineno in lines:
                            changed_signatures.add(node.name)
    return changed_signatures
Why this matters: If you change line 100 inside process_data() (not the def line), you don’t need to check all callers of process_data(), because the function signature didn’t change. But if you change line 18 from def validate_email(email): to def validate_email(email, strict=True):, you DO need to check callers.
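A quick illustration using the validation module from the example later in this post (line 40 is assumed to be an ordinary body line, not a def line):
>>> symbols_with_signature_changes("src/utils/validation.py", {18})  # def line changed
{'validate_email'}
>>> symbols_with_signature_changes("src/utils/validation.py", {40})  # body line changed
set()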
Step 4: Build the one-hop slice
Now we can use a call graph to find the impact files in both directions:
- Callees: Files defining what your changed lines call (from calls_in_lines)
- Callers: Files calling functions whose signatures you changed (from symbols_with_signature_changes)
The full implementation builds a simple call graph (callgraph_for_files) tracking which files define/call which symbols, then uses it to find impact files. See GitHub for the complete one_hop_slice() function.
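The real one_hop_slice() on GitHub has more plumbing, but a minimal sketch looks roughly like this (the callgraph shape here, a dict of per-file "defines" and "calls" symbol sets, is an assumption for illustration):
def one_hop_slice(changed_calls: set, changed_signatures: set, callgraph: dict):
    """Sketch: callgraph maps file path -> {"defines": set of symbols, "calls": set of symbols}."""
    callee_files, caller_files = set(), set()
    for path, symbols in callgraph.items():
        # Callees: files that define something the changed lines call
        if symbols["defines"] & changed_calls:
            callee_files.add(path)
        # Callers: files that call a symbol whose signature changed
        if symbols["calls"] & changed_signatures:
            caller_files.add(path)
    return callee_files, caller_files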
Step 5: Structured markdown format with XML-style tags
Here’s a surprise: markdown beats JSON for LLM input. It’s clearer, more token-efficient, and easier for the model to parse.
The context is structured into three sections:
- Git Diff: wrapped in <diff> tags showing what changed
- Changed Code: each file in <file name="…" lines="…" type="changed"> tags with code snippets
- Impact Code: split into <callees> and <callers> subsections, each file tagged with type="impact"
Why this format works: The XML-style tags let the LLM clearly distinguish “changed code” from “reference contracts.” The type="changed" vs type="impact" distinction is critical for preventing hallucinations where the model cites the wrong file. Markdown with code blocks is also more token-efficient than nested JSON structures.
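Put together, the context handed to the model looks roughly like this (abridged and illustrative, reusing the DatabaseConnection example from later in this post; the exact layout in the script may differ slightly):
## Git Diff
<diff>
diff --git a/src/workers/data_sync.py b/src/workers/data_sync.py
@@ -73,1 +73,1 @@
+conn = DatabaseConnection(config["db_host"], config["db_port"], timeout=30)
</diff>

## Changed Code
<file name="src/workers/data_sync.py" lines="73" type="changed">
conn = DatabaseConnection(config["db_host"], config["db_port"], timeout=30)
</file>

## Impact Code
<callees>
<file name="src/db/connection.py" type="impact">
class DatabaseConnection:
    def __init__(self, connection_string: str, pool_size: int = 10): ...
</file>
</callees>
<callers>
...
</callers>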
Step 6: The prompt that makes it work
The prompt has one critical job: make it crystal clear that findings should reference changed files, not impact files. With early versions of the prompt, the model kept citing the impact file (the contract definition) instead of the buggy changed code. Here’s the prompt that solved it:
prompt = (
    "You are a senior code reviewer analyzing a PR for bugs. "
    "You will receive structured markdown with THREE sections:\n\n"
    "1. Git Diff: Shows what changed (in <diff> tags)\n"
    "2. Changed Code: Snippets from modified files (type=\"changed\")\n"
    "3. Impact Code: Both CALLEES (definitions the changed code calls) and CALLERS "
    "(code that calls the changed symbols). These show contracts/signatures and usage patterns.\n\n"
    "YOUR TASK: Find real bugs in the CHANGED CODE. Look for:\n"
    "- CONTRACT MISMATCHES: Wrong parameter count, signature changes\n"
    "- LOGIC ERRORS: Off-by-one, incorrect conditionals, missing edge cases\n"
    "- CONCURRENCY: Race conditions, missing synchronization\n"
    "- RESOURCE MANAGEMENT: Leaks, missing cleanup\n"
    "- ERROR HANDLING: Unhandled exceptions, silent failures\n"
    "- SECURITY: Injection risks, missing validation\n\n"
    "CRITICAL: Your findings MUST reference the CHANGED files (type=\"changed\"), "
    "NOT the impact files. Impact files show contracts for reference only.\n\n"
    "Focus on real bugs, not style. If nothing critical, return empty bugs array.\n\n"
    + review_context
)
Why this works:
- The “CRITICAL” instruction: Explicitly stating that findings must reference changed files (not impact files) cut wrong-file citations from ~40% to nearly zero. Without this instruction, the model naturally gravitates toward citing the contracts it sees in the impact section.
- Concrete bug categories: Listing specific types (contract-mismatch, logic-error, etc.) guides the model toward real issues rather than style complaints or vague “could be better” suggestions.
- Three-section structure: By clearly labeling the diff, changed code, and impact code with XML-style tags, the model can easily distinguish “what changed” from “what the changes interact with,” shifting the review focus from the change itself to its impact.
Implementation note: The full script uses OpenAI’s Responses API with GPT-5-mini and structured outputs to guarantee JSON schema compliance. This ensures you get consistent, parseable results every time. See the full code on GitHub for API details.
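As a rough sketch of that call (the Finding/Findings Pydantic models here are hypothetical stand-ins for the schema the full script defines, and the exact structured-output helper depends on your openai SDK version):
from openai import OpenAI
from pydantic import BaseModel

class Finding(BaseModel):
    changed_file: str
    changed_lines: str
    bug_category: str
    summary: str
    comment: str
    diff_fix_suggestion: str

class Findings(BaseModel):
    bugs: list[Finding]

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.responses.parse(
    model="gpt-5-mini",
    input=prompt,          # the prompt assembled in Step 6
    text_format=Findings,  # structured outputs: the model must emit this schema
)
findings = response.output_parsed  # a Findings instance, or parse response.output_text yourself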
Does it actually work? A real example
I tested this on a PR where someone refactored a database connection helper. Here’s what the script found:
The changed code (src/workers/data_sync.py, line 73):
# Refactored to use new connection pooling
conn = DatabaseConnection(config["db_host"], config["db_port"], timeout=30)
The impact code (src/db/connection.py, the contract):
class DatabaseConnection:
    def __init__(self, connection_string: str, pool_size: int = 10):
        """Initialize connection from a connection string like 'host:port'."""
        self.connection_string = connection_string
        self.pool_size = pool_size
        # ...
The finding:
{
  "changed_file": "src/workers/data_sync.py",
  "changed_lines": "73",
  "bug_category": "contract-mismatch",
  "summary": "DatabaseConnection called with wrong parameter types and count",
  "comment": "The changed code calls DatabaseConnection(config['db_host'], config['db_port'], timeout=30) with three arguments (two positional strings and a keyword arg). The impact code shows DatabaseConnection.__init__ expects a single connection_string parameter (format 'host:port') and an optional pool_size integer. This will raise TypeError at runtime. The 'timeout' parameter doesn't exist in the signature.",
  "diff_fix_suggestion": "--- a/src/workers/data_sync.py\n+++ b/src/workers/data_sync.py\n@@ -73,1 +73,1 @@\n-conn = DatabaseConnection(config['db_host'], config['db_port'], timeout=30)\n+conn = DatabaseConnection(f\"{config['db_host']}:{config['db_port']}\")"
}
Why a human might miss this: The developer saw “DatabaseConnection” in the old code, knew it changed, but didn’t look up the new signature in src/db/connection.py. When reviewing the diff, you see what looks like reasonable arguments (host, port, timeout) and your brain doesn’t flag it. The contract violation is invisible until you cross-reference the actual definition, which is exactly what Outside-Diff Impact Slicing automates.
The technique also works in the other direction (finding bugs in callers when you change a function signature):
The changed code (src/utils/validation.py, line 18):
def validate_email(email: str, domain_whitelist: List[str]):
    """Validate email format. Now requires a domain whitelist for security."""
    # ... implementation checks if email domain is in whitelist
The impact code (caller in src/api/auth.py, line 95):
# This caller wasn't updated when validate_email signature changed
if validate_email(user_input):
    send_confirmation(user_input)
The finding:
{
  "changed_file": "src/utils/validation.py",
  "changed_lines": "18",
  "bug_category": "contract-mismatch",
  "summary": "validate_email signature changed to require domain_whitelist but caller missing it",
  "comment": "The changed code modified validate_email to require a second parameter 'domain_whitelist' (a required List[str]). However, the impact code shows a caller in src/api/auth.py:95 that only passes one argument: validate_email(user_input). This will raise TypeError at runtime: validate_email() missing 1 required positional argument: 'domain_whitelist'.",
  "diff_fix_suggestion": "--- a/src/api/auth.py\n+++ b/src/api/auth.py\n@@ -95,1 +95,1 @@\n-if validate_email(user_input):\n+if validate_email(user_input, ALLOWED_EMAIL_DOMAINS):"
}
This demonstrates both directions: callees (what changed code calls) and callers (what calls the changed code).
Why this technique works
Three ingredients make this effective:
- Graph awareness beyond the diff: instead of only reading the patch, you check contracts at the boundaries. That’s where critical integration bugs live.
- Line-level precision: extract calls only from changed lines, not entire files. Simple refinement, significant noise reduction.
- Structured input + explicit constraints: the markdown format with XML tags gives the LLM clear structure. The “CRITICAL” instruction about changed vs. impact files prevents the most common hallucination (citing the wrong file).
Limitations: This works best for small, focused PRs (10–50 lines). Larger PRs blow out the context window. For larger PRs you need token budgeting: rank snippets by relevance and clip low-priority code.
Try this next
Once you have the basic technique working, here are extensions worth exploring:
- Richer code graphs with Tree-sitter: the script uses Python’s ast module, which only works for Python. Swap in Tree-sitter to handle JavaScript, TypeScript, Go, Rust or any language with a grammar. You’ll get more accurate definitions and cross-file references.
- Add linter output to the context: run ruff, mypy, or eslint on your impact files and pipe their findings (ideally as JSON or SARIF) into the review context. Static analysis catches different bugs than LLMs — combine them.
- Token budgeting for large PRs: for PRs with 200+ changed lines, you’ll blow your context window. Build a ranking system: score snippets by (distance from changed lines × call frequency × past bug density), then keep only the top N (see the sketch after this list). CodeRabbit does this, and it is the difference between “works on toy PRs” and “works in production.”
- MCP integration for org-specific context: if you’re using Model Context Protocol servers, you can fetch ticket descriptions, CI logs, feature requirements docs, architectural diagrams, or internal style guides and append them to your review context. The LLM can then check “does this change actually fix JIRA-1234?” or “does this follow our error handling conventions?”
- Post findings as PR comments: use the GitHub API (gh pr comment) or your platform’s API to post findings directly to the PR. Include the changed_lines in your API call to anchor comments inline. Now your review bot feels like a real teammate that finds bugs you would otherwise miss.
- Measure and iterate: track precision (what % of findings are real bugs?) and recall (what % of real bugs did it find?) over time. If a category has high false positives, tighten the prompt. If it misses obvious bugs, add targeted checks.
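About that ranking sketch: a minimal scoring function along the lines of the token-budgeting bullet might look like this (the weights and exact inputs are placeholders, not tuned values):
def snippet_score(distance_from_change: int, call_frequency: int, past_bug_density: float) -> float:
    """Hypothetical relevance score: closer, hotter, historically buggier snippets rank higher."""
    proximity = 1.0 / (1 + distance_from_change)  # lines or hops away from the changed lines
    return proximity * max(call_frequency, 1) * (1.0 + past_bug_density)
Rank your candidate impact snippets by this score, keep the top N that fit your token budget, and drop the rest.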
Run it yourself
Prerequisites: Python 3.10+, pip install openai, at least one commit to diff against.
Full script: github.com/coderabbitai/odsc-west-2025/review_demo.py (includes helpers omitted here for brevity)
Quick start:
git clone https://github.com/coderabbitai/odsc-west-2025
cd your-project-with-a-diff
python /path/to/review_demo.py
# Paste your OpenAI API key when prompted
The script outputs JSON findings you can paste into a PR comment or pipe to another tool.
See it live at ODSC West 2025
I’ll be presenting “Context Engineering for AI Code Reviews with MCP, LLMs, and Open-Source DevOps Tooling” at ODSC AI West. The talk covers the full system: graph awareness, multi-linter evidence, repo history, agent guidelines, custom rules, and MCP integration for org-specific context. Hope to see you there!
About the author
David Loker is the Director of AI at CodeRabbit, where he leads development of agentic AI systems for code review and developer workflows. He has published at NeurIPS, ICML, and AAAI, and has been building large-scale AI systems since 2007.