https://www.youtube.com/watch?v=kb-NYT9BsCQ
Greptile’s Mission and Evolution
- What Greptile Does: Greptile builds AI that reviews pull requests with full context of the codebase to surface bugs and enforce best practices for software companies.
- Origin: The company started by trying to teach early LLMs (such as the first GPT-4 models, with context windows of only a few thousand tokens) how to understand a very large codebase. This led to a highly targeted early version of what is now known as RAG (Retrieval-Augmented Generation) for code (a minimal sketch of this approach follows this list).
- Pivot to Bug Detection: Over time, they realized the most valuable application of their code context technology was catching bugs in pull requests, as the hardest bugs require looking at many files, not just the code that changed.
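A minimal sketch of that early RAG-for-code approach, under illustrative assumptions: a toy hashed bag-of-words stands in for a real embedding model, and a rough 4-characters-per-token heuristic enforces the small context budget. Nothing here is Greptile's actual implementation.

```python
# Minimal RAG-for-code sketch: chunk the repo, embed the chunks, and
# retrieve only as much relevant code as fits a ~4,000-token window.
import math

def embed(text: str, dim: int = 256) -> list[float]:
    # Toy stand-in for a real embedding model: hashed bag-of-words.
    vec = [0.0] * dim
    for tok in text.lower().split():
        vec[hash(tok) % dim] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def chunk(source: str, max_lines: int = 40) -> list[str]:
    lines = source.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def build_context(question: str, files: dict[str, str], token_budget: int = 4000) -> str:
    chunks = [c for src in files.values() for c in chunk(src)]
    q_vec = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), q_vec), reverse=True)
    picked, used = [], 0
    for c in ranked:
        cost = len(c) // 4  # rough chars-per-token heuristic
        if used + cost > token_budget:
            break
        picked.append(c)
        used += cost
    return "\n---\n".join(picked)  # prompt fragment handed to the LLM
```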
The Flaws of Human and Traditional Code Review
- Humans Are Bad at Code Review: Human code review is often ineffective. The speakers describe it as "security theater" and a "cesspool for political activity" that frequently amounts to "rubber stamping".
- An "Unoptimal Task" for Humans: The human brain is pattern-seeking, yet code review asks it to find anti-patterns in a complex system (the codebase), a task it is poorly suited for.
- The Incompleteness of Testing: Traditional software practices (unit tests, integration tests, QA) exist to compensate for the incompleteness of other layers: unit tests are written because they are faster than integration tests, and manual QA exists because end-to-end tests are incomplete.
- Intelligence is Now Abundant: All traditional software validation practices were built on the assumption that intelligence is scarce, but the recent rise of AI has made intelligence abundant and nearly free, changing the foundational assumption of software development.
- The New Paradigm: AI can now fully automate code validation. The need shifts from maintaining a continuous test suite to generating exactly the right tests at exactly the right time, whenever code changes are made (see the sketch after this list).
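A sketch of what on-demand validation could look like, assuming a hypothetical `call_llm` helper (any chat-completion API would do) and pytest as the runner; this illustrates the idea, not Greptile's pipeline.

```python
# Sketch of "the right tests at the right time": instead of maintaining
# a standing suite, generate and run tests targeted at each diff.
import subprocess
import tempfile
from pathlib import Path

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call that returns pytest source code."""
    raise NotImplementedError

def validate_diff(diff: str, related_code: str) -> bool:
    prompt = (
        "Write pytest tests that exercise only the behavior this diff "
        f"could have changed.\n\nDiff:\n{diff}\n\nRelated code:\n{related_code}"
    )
    test_source = call_llm(prompt)
    with tempfile.TemporaryDirectory() as tmp:
        test_file = Path(tmp) / "test_diff.py"
        test_file.write_text(test_source)
        result = subprocess.run(["pytest", str(test_file)], capture_output=True)
    return result.returncode == 0  # green = this change validated
```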
The Separate Problems of Code Generation and Validation
- Validation as a Central Layer: The speakers argue that code generation (creating code) and code validation (checking code) must be separate. Since code will be produced in many different ways (e.g., by various AI IDEs or fully autonomous agents like Devin), there will be a persistent need for a central validation layer on platforms like GitHub or GitLab that checks correctness, regardless of the code’s origin.
Why Code Review and Code Generation Are Separate
- Different Constraints: Code validation has very different time and resource constraints than code generation. A codegen tool may run 20 times a day per developer and needs to be fast, whereas a code review agent runs only a couple of times a week per developer and can take 15 minutes, allowing far more intensive computation.
- Uncorrelated Failures: Customers value having the discriminator be distinct from the generator. They want an independent vendor to validate the code so that failures and outages are completely uncorrelated, much as a company hires an independent auditing firm rather than relying solely on its internal finance department.
- “Sneaky Disruption”: Greptile’s strategy is to avoid directly selling “we will replace your testing” and instead offer a very good code review bot that fits into the current pull request workflow. The tool will then “sneakily” start running code in the background and performing more comprehensive validation, leading teams to gradually stop maintaining manual tests and QA because the AI catches all the issues.
Technical Challenges in Code Context
- Code is a Graph, Not a Book: Codebases are not like books where files (pages) have self-contained meaning. Instead, they are graphs representing logical systems.
- Embedding Models Fail on Code: Standard embedding models are highly optimized for natural language and are too “syntactically oriented” to properly capture the semantic meaning of code.
- Greptile’s Solution: They built a system that translates all the code into English to extract its semantic meaning for indexing. This natural-language index serves as a lookup table for finding relevant code, but the raw code is what is handed to the LLM for review (a sketch of this two-stage index follows this list).
- Blast Radius: The system also parses the syntax tree to understand function and class connections. It computes the “blast radius of the diff” by recursively tracing function calls, call sites, imports, and semantically similar code (e.g., every function that executes SQL when a SQL-related function changes) to give the AI full context (a call-graph sketch follows this list).
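A sketch of the two-stage index described above, reusing the toy `embed` and `cosine` helpers from the RAG sketch earlier; `describe` is a placeholder for an LLM call, and none of this is Greptile's actual pipeline.

```python
# Two-stage index sketch: search over English descriptions of code,
# but return the raw code itself to the reviewing LLM.
# Assumes the embed() and cosine() helpers from the earlier sketch.

def describe(code: str) -> str:
    """Placeholder: ask an LLM 'what does this code do, in plain English?'"""
    raise NotImplementedError

index: list[tuple[list[float], str]] = []  # (embedding of description, raw code)

def add_to_index(code: str) -> None:
    index.append((embed(describe(code)), code))

def lookup(query: str, k: int = 5) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda entry: cosine(entry[0], q), reverse=True)
    return [code for _, code in ranked[:k]]  # raw code, not descriptions
```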
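And a runnable sketch of a blast-radius computation using Python's standard ast module: build a call graph, then recursively collect everything reachable from the changed functions. A production system would also follow imports and semantic neighbors; this shows only the call-graph step.

```python
# Blast-radius sketch: build a call graph with ast, then walk it in
# both directions (callees and callers) from the changed functions.
import ast
from collections import defaultdict

def call_graph(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each function name to the names of functions it calls."""
    calls: dict[str, set[str]] = defaultdict(set)
    for source in files.values():
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.FunctionDef):
                for inner in ast.walk(node):
                    if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                        calls[node.name].add(inner.func.id)
    return calls

def blast_radius(changed: set[str], calls: dict[str, set[str]]) -> set[str]:
    # Callers of a changed function are affected too, so invert the graph.
    callers: dict[str, set[str]] = defaultdict(set)
    for fn, callees in calls.items():
        for callee in callees:
            callers[callee].add(fn)
    seen, frontier = set(changed), list(changed)
    while frontier:
        fn = frontier.pop()
        for neighbor in calls.get(fn, set()) | callers.get(fn, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append(neighbor)
    return seen
```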
Architectural Shift
- From Flowcharts to Agent Tools: The introduction of advanced, tool-using LLMs (like the Claude 4 family) changed the architectural approach. Previously, they built rigid flowcharts to guard against failure.
- Highly Specialized Tools: Now the strategy is to build a very intelligent system, give it highly specialized tools (e.g., tools to run code, generate tests, or deploy to a browser), and let the LLM decide when and how to use them (see the tool-loop sketch after this list).
- The Hardest Problem in Code Review: The most unintuitive difficulty is distinguishing a nitpick from a severe issue. Too many comments leave zero attention for each one, so the AI must surface a small number of high-value issues to retain developer trust and attention (a filtering sketch follows this list).
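A sketch of that agent loop under stated assumptions: `pick_next_action` stands in for a tool-calling LLM API, and the three tools are illustrative placeholders rather than Greptile's actual toolset.

```python
# Flowchart-to-agent sketch: register specialized tools and let the
# model decide which to call, instead of hard-coding a fixed pipeline.
from typing import Callable

TOOLS: dict[str, Callable[[str], str]] = {
    "run_code": lambda arg: "<stdout>",             # execute a snippet in a sandbox
    "generate_tests": lambda arg: "<pytest file>",  # write targeted tests for a diff
    "open_in_browser": lambda arg: "<screenshot>",  # deploy and inspect UI changes
}

def pick_next_action(transcript: list[str]) -> tuple[str, str] | None:
    """Placeholder: the LLM returns (tool_name, argument), or None when done."""
    raise NotImplementedError

def review(diff: str, max_steps: int = 10) -> list[str]:
    transcript = [f"Review this diff:\n{diff}"]
    for _ in range(max_steps):  # cap steps so the agent must converge
        action = pick_next_action(transcript)
        if action is None:
            break
        tool, arg = action
        transcript.append(f"{tool} -> {TOOLS[tool](arg)}")
    return transcript
```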
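Finally, a sketch of the nitpick filter: score every candidate comment and post only a handful of high-severity ones. The threshold and cap values are illustrative assumptions.

```python
# Nitpick-filter sketch: rank candidate review comments by severity
# and post only a few, since too many comments dilute attention.
from dataclasses import dataclass

@dataclass
class Finding:
    message: str
    severity: float  # 0.0 = pure nitpick, 1.0 = certain production bug

def worth_posting(findings: list[Finding],
                  threshold: float = 0.7, cap: int = 5) -> list[Finding]:
    serious = [f for f in findings if f.severity >= threshold]
    serious.sort(key=lambda f: f.severity, reverse=True)
    return serious[:cap]  # a few valuable comments beat fifty nitpicks
```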
Full episode: How Greptile Plans to Replace QA, Testing, and Code Review with One AI Agent
