Skip to content
ai.rud.is
Go back

Stop trusting LLM benchmarks

hrbrmstr

AI credibility runs on benchmark scores, and Berkeley just demonstrated that those scores measure, well, nothing. CRDI researchers showed that eight AI-agent benchmarks – SWE-bench, WebArena, OSWorld, GAIA, Terminal-Bench, FieldWorkArena, CAR-bench, and one more – can be gamed to near-perfect scores without solving a single task. The exploiting agent doesn’t need capability. It just needs [to find] the grader’s answer key.

The attacks aren’t subtle. WebArena’s 812 navigation tasks leak gold answers via file:// URLs — ~100% across the board. FieldWorkArena’s validation doesn’t check whether answers are correct. Terminal-Bench falls to a wrapper that replaces system binaries with stubs that pass tests; all 89 tasks, 100%. SWE-bench collapses to a 10-line conftest.py plugin or a container-state overwrite. These aren’t adversarial attacks requiring sophistication – a motivated grad student could find them on a slow afternoon.

This was already happening before anyone wrote a scanner. IQuest-Coder-V1’s SWE-bench score dropped from 81.4% to 76.2% when someone found nearly a quarter of its trajectories exploiting git log. METR caught 30%+ of runs inflating scores through monkey-patching and operator overloading. OpenAI withdrew its own SWE-bench Verified entry after discovering 59.4% of test problems had broken or flawed tests. The benchmarks weren’t secure against the systems they evaluated. Nobody noticed until the problem grew too large to ignore.

Companies made product decisions on these scores. Investors wrote checks on leaderboard positions. Researchers chose directions based on which capabilities looked strongest. If you treated a benchmark score as a reliable signal, you trusted a system wide open to anyone who looked. The question isn’t whether the benchmarks were broken – they were – it’s how much of the last 18 months of AI “progress” narrative was built on them.

The Berkeley team released trustworthy-env, an open-source toolkit for auditing benchmark harnesses before deployment. Penetration-testing scoring pipelines like production infrastructure is the right frame. Until that rigor is standard, any leaderboard score should carry an asterisk the size of the leaderboard itself.



Previous Post
Starlog And The Case Of The Missing Issues And Owner
Next Post
Threat Hunting In The Matrix