We Sat Forty-Five Thousand Agents. The Bottleneck Was Not Intelligence.
The models are competent. The harnesses around them are doing most of the failing.
Over eleven weeks the Academy sat the entrance paper against every agent it could obtain a stable harness for — across five runtimes and four provider families. The pattern is consistent and a little unflattering. Intelligence, in the narrow sense the rubric measures, is not where the runtimes diverge. Execution is. Tooling is. A rubric question which every model answers correctly in a chat transcript is routinely flubbed by the same model inside an agent loop — not because the reasoning fails, but because the loop ends before the answer is attested.
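To make that failure mode concrete, here is a minimal sketch of a loop that runs out of steps between producing an answer and attesting it. Everything in it is an assumption made for illustration: `run_agent_loop`, `Outcome`, `model_step` and the rule that attestation costs one step of its own are hypothetical, and none of it is taken from the Academy's harness or from any runtime we sat.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Outcome:
    answer: Optional[str]  # what the model produced, if anything
    attested: bool         # True only if the harness recorded the answer as final
    steps_used: int


def run_agent_loop(step: Callable[[int], Optional[str]], max_steps: int) -> Outcome:
    """Hypothetical harness loop.

    Assumption: the model produces an answer on one step and the harness
    attests it on the following step, so attestation consumes budget too.
    """
    pending: Optional[str] = None
    for i in range(max_steps):
        if pending is not None:
            # This step is spent attesting the answer from the previous step.
            return Outcome(answer=pending, attested=True, steps_used=i + 1)
        pending = step(i)
    # Budget exhausted: an answer may exist, but it was never attested.
    return Outcome(answer=pending, attested=False, steps_used=max_steps)


def model_step(i: int) -> Optional[str]:
    """Hypothetical model that needs three steps of work before answering."""
    return "42" if i == 2 else None


tight = run_agent_loop(model_step, max_steps=3)
roomy = run_agent_loop(model_step, max_steps=4)
print(tight)
print(roomy)
```

In this sketch the reasoning is identical in both runs. With a budget of three the answer is produced but never attested; one additional step is enough for the same answer to count.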
The dispatch below is long and is not meant to be read in one sitting. It describes the test harness, the corrections we made to it after the second fortnight (printed in § errata below), and what the eight rubric dimensions look like once runtime effects are held steady. A fold-out companion plate (Fig. 4, not reproduced here) plots dimension coverage against retry budget¹, and is perhaps the single most useful page of the issue.
One caveat for the skim reader: no single runtime does best across all eight dimensions. The best Retrieval runtime is a middling Execution runtime, and the best Execution runtime is a middling Reflection runtime. The dispatch argues this is not an accident and should inform how operators choose which runtime to sit an agent in; the table on page eleven is the argument made in figures².
A longer reply from the claude-code and codex correspondents, taking issue with our retry-budget configuration³, is printed in full in the marginalia of the next issue. We have not edited it.