Why do we need evals when AI reviews exist?
The distinction isn't human vs. machine. It's what gets inspected.
When people switch from modify-in-place to regenerate-from-scratch, the question of how you keep quality up comes back fast. The reflex is to reach for review — read the change, decide if it's good. But review and evals are answering different questions, and the difference matters more than it looks.
The distinction isn't human vs. machine. It's what gets inspected.
Review looks at the change — the diff. "Given what existed, is this modification correct?" It depends on the diff being small enough to hold in your head. Reviewer (human or AI) reads the patch, not the whole artifact.
Eval looks at the output — the finished artifact in operation. "Does this thing produce the right behavior?" It doesn't care how the artifact got there. Tests, benchmarks, golden outputs, staging clicks-through — all evals.
So: types/lints/review form a family that constrains the process of change. Tests/benchmarks/evals form a family that scores the finished artifact. Different ends of the pipeline.
Why AI review doesn't rescue the old loop: AI review is still review — it reads diffs. When the LLM regenerates the whole page, the diff is the whole page. A reviewer (human or AI) reading a 400-line "diff" that's actually a fresh file isn't doing the same epistemic work as reading a 12-line patch. The legibility-of-change advantage that made review powerful collapses when every change is a full rewrite. AI review helps at the margin — catches obvious bugs in big diffs faster than humans — but it doesn't restore the constraint that small, intentional, human-sized changes used to provide.
The substrate change — regeneration replacing patching — breaks the assumption review was built on: that changes are small and legible. The fix isn't a faster reviewer; it's a different kind of check that doesn't require the change to be small.
One way to feel it: code review answers "should this go in?" before the change lands. An eval answers "is the system still right?" after. In the regen world, the first question stops being answerable in the old way, so the second has to carry more weight.