The Numbers AI Vendors Use Are Wrong
Key Takeaway: The benchmark scores that AI vendors publish measure isolated task performance. They don't measure what happens when a human and an AI work together on your actual problem, in your actual context.
Every AI Sales Deck Has the Same Slide
There's a moment in almost every AI product demo that plays out the same way. The vendor pulls up a benchmark chart. Their model's bar is the tallest. The score is presented with decimal-point precision: 87.4%, 93.1%, top of the leaderboard. The implication is clear: this model is objectively better.
That slide is almost never relevant to the decision you're actually making.
A recent piece in MIT Technology Review made this argument formally: AI benchmarks are broken, and the field needs something fundamentally different. The core problem is methodological. Current benchmarks test isolated task performance in controlled conditions. They don't test how well an AI performs when a human is involved, when the context is specific, or when the work extends over time.
That gap between benchmark score and real-world utility is where most AI disappointments live.
What Benchmarks Measure (and Don't)
Standard AI benchmarks work like standardized tests. Present the model with a question, measure whether it got the right answer, repeat thousands of times. Aggregate the score. Report the result.
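In code, that loop is almost trivially simple, which is part of the problem. A minimal sketch, where the model callable and the task format are illustrative assumptions rather than any real benchmark's API:

```python
# Minimal sketch of how a standard benchmark produces its headline number.
# The `model` callable and the task format are illustrative, not a real API.

def benchmark_score(model, tasks):
    """Fraction of isolated questions the model answers exactly right."""
    correct = 0
    for task in tasks:
        answer = model(task["prompt"])         # one question, no context
        correct += answer == task["expected"]  # exact-match grading
    return correct / len(tasks)
```

Exact-match grading is only the most common pattern; some benchmarks use multiple choice or automated graders, but the aggregation logic is the same.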
That loop tells you something about the model's raw capability on a defined task. It tells you almost nothing about whether the model will be useful in your company's specific workflow, with your team's working style, connected to your data sources, in front of your actual customers.
The piece's authors argue for what they call Human-AI, Context-Specific Evaluation: assessment methods that measure how well the AI performs over time, within team workflows, on problems that actually reflect the deployment environment.
That framing sounds academic until you try to buy AI and realize you've been selecting based on data that doesn't map to your use case.
Three Real Evaluation Problems
The first problem is abstraction. Benchmark tasks are designed to be universal, so they end up testing nothing specific. A model that scores 93% on a general reasoning benchmark might score 60% on the reasoning tasks your finance team actually performs.
The second problem is isolation. Real work is collaborative and sequential. A task that seems simple in isolation (summarize this document) becomes complex when the document is ambiguous, the summarization needs to feed into a specific decision framework, and the output needs to be reviewed and adjusted by a human before it's useful. Benchmarks don't test any of that chain.
The third problem is stability. Benchmarks capture a snapshot. Models change, APIs are updated, prompts that worked last quarter stop working this quarter. A score achieved in controlled conditions six months ago may bear no relationship to what you'll experience today.
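One way to see whether a score is still current is to re-run the same fixed task set on a schedule and compare against the last recorded result. A rough sketch, where run_eval and the tolerance value are placeholders for whatever scoring harness you use, not any real tool's API:

```python
from datetime import date

# Sketch of a recurring drift check: re-run the same fixed task set and
# compare against the last recorded score. `run_eval` and the tolerance
# are illustrative placeholders.

DRIFT_TOLERANCE = 0.05  # flag drops of more than five percentage points

def check_drift(run_eval, tasks, history):
    """Append today's score to the history and report any regression."""
    score = run_eval(tasks)
    regressed = bool(history) and history[-1]["score"] - score > DRIFT_TOLERANCE
    history.append({"date": date.today().isoformat(), "score": score})
    return regressed
```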
What Smart Buyers Are Doing Instead
The companies that make better AI buying decisions share a common approach: they test on their own data, not vendor data.
The evaluation process that works is simple in principle, expensive in time. Take three to five representative tasks from the actual workflow where you're deploying AI. Run each candidate model on those tasks, blinded. Evaluate outputs against your quality standard. Measure not just accuracy but consistency, edge case handling, and the cost of the errors that do occur.
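That process can be expressed as a small harness. A sketch, assuming candidates maps vendor names to callables and rubric is your own grading function; every name here is a hypothetical stand-in, not a real product's API:

```python
import random
import statistics

# Sketch of a blinded, own-data evaluation. `candidates` maps vendor names
# to callables and `rubric` is your own grading function; all hypothetical.

def run_blinded_eval(candidates, tasks):
    """Collect outputs for human review with the vendor identity hidden."""
    submissions = []
    for task in tasks:  # three to five tasks from the real workflow
        outputs = [
            {"vendor": name, "task": task["id"], "output": fn(task["input"])}
            for name, fn in candidates.items()
        ]
        random.shuffle(outputs)  # reviewers never learn which vendor is which
        submissions.extend(outputs)
    return submissions

def summarize(submissions, rubric):
    """Per-vendor mean, spread, and worst case from reviewer scores."""
    by_vendor = {}
    for sub in submissions:
        by_vendor.setdefault(sub["vendor"], []).append(rubric(sub["output"]))
    return {
        vendor: {
            "mean": statistics.mean(scores),
            "stdev": statistics.pstdev(scores),  # consistency across tasks
            "worst": min(scores),                # severity of the misses
        }
        for vendor, scores in by_vendor.items()
    }
```

The spread and worst-case columns are deliberate: the mean alone hides exactly the consistency and edge case problems the process is meant to surface.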
That process takes longer than looking at a benchmark chart. It also produces a result you can actually use.
The deeper implication is about internal capability. Companies that want to make good AI product decisions need at least one person who understands how to design a meaningful evaluation. That's a new skill requirement that most teams don't have yet.
Why Vendors Don't Fix This Themselves
Vendors have limited incentive to replace benchmarks they score well on with evaluations they can't control. This is a known dynamic in any market where quality is hard to verify before purchase.
The solution isn't to distrust vendors. It's to stop letting vendors define the evaluation criteria. That's your job.
The business cost of buying the wrong AI tool is not just the license fee. It's the integration time, the change management investment, the technical debt when you have to switch, and the productivity loss when the tool doesn't perform as pitched. A rigorous evaluation process costs a fraction of that.
The benchmark slide will still be in the deck. You just don't have to let it end the conversation.
FAQ
Are any AI benchmarks actually useful for business decisions?
Some benchmarks are more useful than others. Domain-specific benchmarks that test performance on tasks close to your actual use case are meaningfully better than general capability scores. The issue is that the most commonly cited benchmarks are the general ones. Always ask vendors whether they have benchmarks specific to your use case or industry.
How long does a proper AI evaluation take?
A minimum viable evaluation for a business use case typically takes two to four weeks if done rigorously. That includes defining the evaluation tasks, running the tests, gathering feedback from the people who'll actually use the tool, and analyzing the results. Rushing this process is one of the most common causes of failed AI deployments.
What questions should I ask an AI vendor about their benchmarks?
Ask specifically: which benchmark the scores were measured on, when the evaluation was conducted, who conducted it, and whether there are independent evaluations from customers in similar industries. Vendors who've done solid work will have good answers. Those who haven't will deflect to the chart.
