Evals Are the New Product Spec

Dan Toma·June 16, 2026·4 min read

Key Takeaway

When the same prompt can give different answers, a fixed spec cannot define good. Evals do, by scoring real examples on every change. Build the eval first, then let the score pick the model. It is the new product spec.

A product spec used to be a document. You wrote down what the software should do, engineers built it, and you checked their work against the page. That contract held for decades because software was deterministic. The same input gave the same output, every time.

AI broke that contract. The same prompt can give a different answer twice in a row. So the question Ankur Goyal of Braintrust raised on Lenny's Newsletter this week lands hard. If you cannot specify exact behavior, what replaces the spec? His answer: evals. "Evals are the modern version of a PRD."

Why a Document No Longer Works

When you build with a model, you are not writing rules. You are setting a direction and then checking, constantly, whether the output is good enough. Good is not a line of code. It is a judgment, and judgment does not fit in a requirements doc.

An eval is how you make that judgment repeatable. It is a set of test cases with a definition of what a good answer looks like, run automatically every time you change the prompt, the model, or the data. Instead of "the assistant should be helpful," you encode fifty real examples and a way to score whether the new version handled them better or worse than the last.
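To make that concrete, here is a minimal sketch of an eval in Python. Everything in it is an assumption for illustration: the `EvalCase` type, the substring check standing in for "what a good answer looks like," and the two toy functions standing in for an old and a new version of an assistant. A real eval would call your actual system and would likely use a richer scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    prompt: str        # a real input collected from usage
    must_include: str  # hypothetical, simplistic definition of a good answer


def score(system: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output contains the required phrase."""
    passed = sum(
        case.must_include.lower() in system(case.prompt).lower()
        for case in cases
    )
    return passed / len(cases)


# Toy stand-ins for two versions of the product (assumptions, not real models).
def old_version(prompt: str) -> str:
    return "Please contact support."


def new_version(prompt: str) -> str:
    if "refund" in prompt.lower():
        return "You can request a refund within 30 days."
    return "Please contact support."


CASES = [
    EvalCase("How do I get a refund?", "refund"),
    EvalCase("Who do I talk to about billing?", "support"),
]

baseline = score(old_version, CASES)
candidate = score(new_version, CASES)
print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
```

The point of the sketch is the comparison at the end: the same cases, the same scorer, two versions, and a number that says whether the change helped.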

From building Madison AI, I can tell you this is the actual work, and it is not glamorous. Anyone can wire a model into a workflow and get something that demos well. The gap between a demo and a product is entirely about whether you can prove it stays good when you change it. Without evals, every change is a guess and every deploy is a prayer.

This is also why so many AI pilots stall at the demo and never reach production. The demo is the easy 80 percent. The last 20 percent, the part where it has to be reliably good across the weird, real, messy inputs your customers actually send, is invisible without evals. Teams that skip them do not find out their product is fragile until a customer does, in public.

Evals Are How You Scale Taste

Goyal framed evals as a way to scale expert judgment across a team, and that phrase is the whole point. One expert can eyeball ten outputs and tell you which are good. That expert cannot eyeball ten thousand, and cannot be in the room for every change a growing team ships. The eval is how their standard gets encoded once and applied by everyone, automatically, on every change.

This is the same problem I described when the benchmark numbers that vendors quote turned out to be nearly useless for real decisions. A public benchmark measures generic capability. Your eval measures whether the model is good at your job, on your data, for your customer. Those are different questions, and only one of them pays your bills.

It is also why some teams are pulling ahead while others stall. When Intercom doubled its engineering output, the headline was the speed. The underlying discipline was knowing what good looked like well enough to let AI move fast without breaking the product. Speed without a definition of good is just a faster mess.

What This Means If You Are Building With AI

Stop writing specs that describe behavior you cannot guarantee. Start collecting examples. The most valuable asset in an AI product is a growing library of real cases, each labeled with whether the output was good and why. That library is your spec now, and unlike a document, it gets stronger every week.

Put the eval before the model choice. The teams that thrash spend months arguing over which model to use. The teams that ship build the eval first, then run every candidate model through it and let the score decide. The question is never "is this model good." It is "is this model good at my task," and only an eval answers it.
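As a rough sketch of "let the score decide," the Python below runs two candidates through one eval set and picks the higher scorer. The task (routing support tickets), the labels, and both `model_a` and `model_b` are hypothetical stand-ins; in practice each candidate would be a call to a real model behind the same interface.

```python
from typing import Callable

Model = Callable[[str], str]

# Hypothetical eval set: (customer message, expected routing label).
EVAL_SET = [
    ("I was charged twice", "billing"),
    ("The app crashes on login", "bug"),
    ("Can you add dark mode?", "feature"),
]


def model_a(text: str) -> str:
    # Stand-in for a weak candidate: always answers "billing".
    return "billing"


def model_b(text: str) -> str:
    # Stand-in for a stronger candidate with simple keyword rules.
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "crash" in lowered:
        return "bug"
    return "feature"


def accuracy(model: Model, eval_set: list[tuple[str, str]]) -> float:
    """Fraction of eval cases where the model's label matches the expected one."""
    return sum(model(x) == y for x, y in eval_set) / len(eval_set)


def pick_model(candidates: dict[str, Model], eval_set: list[tuple[str, str]]):
    """Score every candidate on the same eval set and return the best by name."""
    scores = {name: accuracy(m, eval_set) for name, m in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


best, scores = pick_model({"model_a": model_a, "model_b": model_b}, EVAL_SET)
print(best, scores)
```

Because every candidate faces the same cases and the same scorer, the argument over which model to use becomes a comparison of numbers on your own task.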

Treat your eval set as a living asset, not a one-time setup. Every time the product fails in a new way, that failure becomes a new test case, so the same mistake can never ship twice. Over a year, that library quietly becomes the most defensible thing you own, a precise, hard-won definition of good that a competitor cannot copy by reading your marketing.
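One way to picture a failure becoming a test case is the small Python sketch below. The `Case` fields, the `add_failure` helper, and the example failure are all assumptions made up for illustration; the only idea taken from the text is that each new failure is recorded once, with a reason, and stays in the library.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Case:
    prompt: str      # the input that exposed the failure
    bad_output: str  # what the product actually said
    why: str         # the human judgment: why this output was not good


def add_failure(library: list[Case], case: Case) -> list[Case]:
    """Return the library with the case appended, skipping known prompts."""
    if any(existing.prompt == case.prompt for existing in library):
        return library
    return library + [case]


library: list[Case] = []
failure = Case(
    prompt="Cancel my subscription",
    bad_output="Here is how to upgrade your plan!",
    why="Ignored the cancellation request",
)

library = add_failure(library, failure)
library = add_failure(library, failure)  # a repeat report adds nothing new

# Storing the library as plain JSON keeps it easy to review and version.
serialized = json.dumps([asdict(c) for c in library], indent=2)
print(serialized)
```

Keeping the "why" next to each case matters as much as the case itself, since that is where the definition of good is written down.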

And make someone own the definition of good. Evals are where your product's taste lives, the judgment that separates an answer your customer trusts from one that is merely plausible. That ownership cannot be handed to the model. It is the human part of the job, and it is becoming the part that matters most.

The PRD is not dead because planning died. It is dead because the thing worth specifying changed. You are no longer writing down what the software will do. You are writing down how you will know it is good, and then proving it, on every change, for good. That is the new product spec. The teams that learn to write it will quietly out-ship everyone still arguing over a document.

Source: How Braintrust uses AI agents, evals, and CI to ship better software

FAQ

What is an eval in AI product development?

An eval is a set of test cases paired with a definition of a good answer, run automatically whenever you change a prompt, model, or data source. It scores whether the new version performs better or worse than the last. It replaces the fixed specification, which stops working once model output varies.

Why are evals replacing product requirement documents?

AI output is not deterministic, so you cannot specify exact behavior in advance. Instead of describing what the software must do, you define how you will judge whether it is good and test against real examples continuously. As one practitioner put it, evals are the modern version of a PRD.

How do I start building evals for an AI product?

Collect real examples of your product task and label each output as good or not, with a reason. Turn that library into automated test cases, then run every prompt or model change against it before shipping. Assign a person to own the definition of good, since that judgment is the core of the product.
