Harshal Gajjar

Harshal Gajjar is an AI Forward-Deployed Engineer at C3 AI, based in the San Francisco Bay Area. Harshal leads Agentic AI harness development for the Forward-Deployed Engineering organisation at C3 AI, and since January 2026 has been building a stealth-mode startup in the Agentic AI space. Harshal cofounded Shram.io in 2024, where he led the pivot from a Jira-competitor product to an AI assistant that reached #2 Product of the Day on Product Hunt.

Harshal holds an M.S. in Computer Science (Machine Learning specialisation) from Georgia Tech and a B.Tech in Computer Science from IIT Dharwad, where he was part of the institute's foundational class. He spent three summers at Wolfram Research in Boston — first as a summer researcher in 2018, then as an instructor for high-school students in 2019 and 2020 — and was a Wolfram Student Ambassador throughout his undergrad.

Outside of work, Harshal is a long-distance cyclist and a vertical and horizontal caver, active with the San Francisco Bay Chapter (SFBC) grotto. In 2019 he was part of the Hubballi Bicycle Club Guinness World Record for the longest single line of bicycles.

Contact Harshal at mail@harshalgajjar.com.

Why do we need evals when AI reviews exist?

The distinction isn't human vs. machine. It's what gets inspected.

When people switch from modify-in-place to regenerate-from-scratch, the question of how you keep quality up comes back fast. The reflex is to reach for review — read the change, decide if it's good. But review and evals are answering different questions, and the difference matters more than it looks.

Review looks at the change: the diff. It asks, "Given what existed, is this modification correct?" It depends on the diff being small enough to hold in your head. The reviewer (human or AI) reads the patch, not the whole artifact.

Eval looks at the output: the finished artifact in operation. It asks, "Does this thing produce the right behavior?" It doesn't care how the artifact got there. Tests, benchmarks, golden outputs, staging click-throughs: all of these are evals.
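To make "doesn't care how the artifact got there" concrete, here is a minimal sketch of a golden-output eval. Everything in it is hypothetical and made up for illustration: the slug function, the golden cases, and the names `run_eval`, `slugify_patched`, `slugify_regenerated`, and `slugify_broken`. The point is only that the eval scores behavior and never sees a diff.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical golden cases: (input, expected output) pairs.
GOLDEN_CASES: List[Tuple[str, str]] = [
    ("Hello World", "hello-world"),
    ("  Evals, not reviews  ", "evals-not-reviews"),
    ("already-a-slug", "already-a-slug"),
]


def run_eval(artifact: Callable[[str], str],
             cases: List[Tuple[str, str]] = GOLDEN_CASES) -> float:
    """Return the fraction of golden cases the artifact gets right."""
    passed = sum(1 for given, expected in cases if artifact(given) == expected)
    return passed / len(cases)


def slugify_patched(text: str) -> str:
    """Stand-in for a version that evolved through small patches."""
    return "-".join(re.findall(r"[a-z0-9]+", text.lower()))


def slugify_regenerated(text: str) -> str:
    """Stand-in for a version rewritten from scratch; different code, same behavior."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return "-".join(cleaned.split())


def slugify_broken(text: str) -> str:
    """Stand-in for a regeneration that silently dropped the lowercasing step."""
    return "-".join(re.findall(r"[A-Za-z0-9]+", text))


if __name__ == "__main__":
    for fn in (slugify_patched, slugify_regenerated, slugify_broken):
        print(fn.__name__, run_eval(fn))
```

The two working versions share almost no code, yet the eval treats them identically. The broken one fails the eval no matter how plausible its source looks.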

So: types/lints/review form a family that constrains the process of change. Tests/benchmarks/evals form a family that scores the finished artifact. Different ends of the pipeline.

Why AI review doesn't rescue the old loop: AI review is still review — it reads diffs. When the LLM regenerates the whole page, the diff is the whole page. A reviewer (human or AI) reading a 400-line "diff" that's actually a fresh file isn't doing the same epistemic work as reading a 12-line patch. The legibility-of-change advantage that made review powerful collapses when every change is a full rewrite. AI review helps at the margin — catches obvious bugs in big diffs faster than humans — but it doesn't restore the constraint that small, intentional, human-sized changes used to provide.
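One rough way to see the collapse is to measure how much of the new file a reviewer would have to read as "changed". The sketch below uses Python's standard `difflib`. The `changed_fraction` helper and the toy 100-line files are assumptions for illustration, not a measurement of any real codebase.

```python
import difflib


def changed_fraction(old: str, new: str) -> float:
    """Fraction of lines in `new` that do not line up with any line in `old`."""
    old_lines = old.splitlines()
    new_lines = new.splitlines()
    if not new_lines:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    # Matching blocks are the runs of lines the two versions share, in order.
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    return 1 - unchanged / len(new_lines)


# Toy files: a 100-line original, a one-line patch, and a full regeneration.
original = "\n".join(f"line {i}" for i in range(100))
patched = original.replace("line 42", "line forty-two")
regenerated = "\n".join(f"row {i}" for i in range(100))

if __name__ == "__main__":
    print("patched:", round(changed_fraction(original, patched), 2))
    print("regenerated:", round(changed_fraction(original, regenerated), 2))
```

In the patched case the reviewer has one line to reason about. In the regenerated case every line counts as new, so the "diff" carries no information about intent.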

The substrate change — regeneration replacing patching — breaks the assumption review was built on: that changes are small and legible. The fix isn't a faster reviewer; it's a different kind of check that doesn't require the change to be small.

One way to feel it: code review answers "should this go in?" before the change lands. An eval answers "is the system still right?" after. In the regen world, the first question stops being answerable in the old way, so the second has to carry more weight.
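A minimal sketch of what "carrying more weight" could look like in practice: a gate that accepts a regenerated candidate only if its eval pass rate is at least the baseline's. The names `pass_rate` and `accept`, the toy cases, and the three lambdas are all hypothetical. A real gate would likely need thresholds, noise handling, and a much larger case set.

```python
from typing import Callable, Sequence, Tuple

Case = Tuple[int, int]  # (input, expected output)


def pass_rate(fn: Callable[[int], int], cases: Sequence[Case]) -> float:
    """Fraction of cases where fn(input) equals the expected output."""
    return sum(1 for x, want in cases if fn(x) == want) / len(cases)


def accept(candidate: Callable[[int], int],
           baseline: Callable[[int], int],
           cases: Sequence[Case]) -> bool:
    """Accept the candidate only if it does not regress against the baseline."""
    return pass_rate(candidate, cases) >= pass_rate(baseline, cases)


# Toy behavior under test: doubling an integer.
CASES: Sequence[Case] = [(0, 0), (2, 4), (3, 6)]

baseline = lambda x: x * 2
good_regen = lambda x: x + x    # different code, same behavior
bad_regen = lambda x: x ** 2    # right for 0 and 2, wrong for 3

if __name__ == "__main__":
    print("good regeneration accepted:", accept(good_regen, baseline, CASES))
    print("bad regeneration accepted:", accept(bad_regen, baseline, CASES))
```

Note that the gate never compares the candidate's source to the baseline's. It asks only whether the system is still right.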

#agents #llms #evals