
The UK AI Security Institute is arguing that a basic assumption behind many AI benchmark results is wrong: agent capability is not a single score, but a moving target that changes materially with the amount of test-time compute a model is allowed to use.
According to reporting by The Decoder on the institute’s new study, the agency tested frontier models across seven benchmarks and found that fixed token budgets can systematically understate what AI agents are able to accomplish. That matters well beyond leaderboard debates. If benchmark scores are being recorded before a model’s performance has leveled off, developers, enterprise buyers, and safety evaluators may be making decisions based on artificially low readings of both utility and risk.
The immediate implication is practical. Many teams evaluating AI agents for coding, cyber defense, or other multi-step work rely on benchmark numbers to decide whether a system is ready for deployment. The UK AI Security Institute’s findings suggest those numbers may reflect a floor rather than a ceiling, especially on tasks where the agent can verify intermediate work by running code, testing an exploit, or checking outputs.
The central claim from the UK AI Security Institute, as described by The Decoder, is that performance rises with test-time compute in ways that common evaluation setups do not fully capture. In the study, success rates on software engineering tasks reportedly increased by about 25 percent when the token budget was raised from one million to ten million on benchmarks including TerminalBench 2.0 and SWE-Bench Pro.
The effect was not limited to coding. On math and academic evaluations such as Humanity's Last Exam, gains were said to reach roughly 22 percent up to a budget of five million tokens. In cybersecurity, The Decoder reports that around 8 percent of tasks were solved only once budgets exceeded 10 million tokens. Some required 50 million tokens, and newer models reportedly kept improving at budgets above 100 million.
That pattern supports a broader methodological point. If benchmark organizers cap runs too early, some fraction of hard tasks will register as failures even when the model could solve them with more compute. In that framing, a benchmark score becomes highly dependent on the budget choice rather than a stable measure of capability.
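The truncation effect can be illustrated with a minimal sketch. The per-task token costs below are hypothetical placeholders, not figures from the study, and `success_rate_at_budget` is an illustrative helper rather than anything the institute is reported to use. It only shows how the same set of tasks yields different scores depending on where the cap is set.

```python
from typing import Optional, Sequence


def success_rate_at_budget(tokens_to_solve: Sequence[Optional[int]], budget: int) -> float:
    """Fraction of tasks solved within `budget` tokens.

    Each entry is the number of tokens the agent needed to solve that task,
    or None if the task was never solved at any budget tried.
    """
    if not tokens_to_solve:
        raise ValueError("need at least one task")
    solved = sum(1 for t in tokens_to_solve if t is not None and t <= budget)
    return solved / len(tokens_to_solve)


# Hypothetical per-task costs (illustrative only, not from the study).
tokens_to_solve = [200_000, 800_000, 3_000_000, 12_000_000, 40_000_000, None]

# The same model and the same tasks score differently as the cap moves.
for budget in (1_000_000, 10_000_000, 50_000_000):
    rate = success_rate_at_budget(tokens_to_solve, budget)
    print(f"budget={budget:>11,} tokens  success_rate={rate:.2f}")
```

With these made-up numbers, two of six tasks count as solved at a one-million-token cap and five of six at fifty million, even though nothing about the model has changed.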
The institute also reportedly found important variation by domain. On HealthBench, which The Decoder describes as a medical task benchmark, models appeared to plateau within the standard budget. In other words, more compute did not help much there. The reported explanation is intuitive: extra tokens are most useful in settings where an agent can iteratively test and verify its own work. They matter less where feedback is sparse, ambiguous, or delayed.
The study’s more consequential argument is not just that bigger budgets improve scores, but that capability progress at the frontier may be advancing faster than standard evaluations suggest. The Decoder reports that the institute previously estimated frontier-model time horizons on cyber tasks at a fixed budget of 2.5 million tokens. When the budget is expanded to 50 million tokens, the progress trend appears about 60 percent steeper.
Put differently, the apparent pace of improvement depends partly on how much compute evaluators are willing to spend. The institute reportedly said doubling times shorten from a range of roughly 67 to 91 days under the lower-budget setup to around 40 to 50 days under the higher-budget setup. If accurate, that is a major warning for anyone using fixed-budget benchmarks to track risk escalation or commercial readiness.
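A rough consistency check on these figures is possible under one assumption: that "60 percent steeper" refers to the slope of the log time horizon plotted against calendar time. The reporting does not spell out the fit, so this is an assumption, and the function name below is illustrative. Under exponential growth, doubling time is inversely proportional to that slope, so a slope 1.6 times as steep divides the doubling time by 1.6.

```python
def doubling_time_after_steepening(doubling_days: float, steepening: float) -> float:
    """New doubling time when the log-scale growth slope rises by `steepening`.

    For h(t) = h0 * 2**(t / T), the slope of log2(h) over time is 1 / T,
    so multiplying the slope by (1 + steepening) divides T by that factor.
    Assumes "60 percent steeper" describes this slope (not confirmed by the source).
    """
    if steepening <= -1.0:
        raise ValueError("steepening must be greater than -1")
    return doubling_days / (1.0 + steepening)


# Apply a 60 percent steeper slope to the reported lower-budget range.
for days in (67, 91):
    new_days = doubling_time_after_steepening(days, 0.60)
    print(f"{days} days -> {new_days:.1f} days")
```

Under that assumption, the 67-to-91-day range maps to roughly 42 to 57 days. That is in the neighborhood of the reported 40 to 50 days but not identical to it, and the gap may reflect details of the institute's fit that the reporting does not describe.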
The UK AI Security Institute also ties token use to task duration. Drawing on 211 software engineering tasks from METR and 78 cyber tasks from its own testing, the institute reportedly found a power-law relationship between how long a human expert would need and how many tokens an AI agent tends to consume. A task that takes a minute may require thousands of tokens; an hour can require millions; a week can require billions.
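To make the power-law idea concrete, here is a minimal sketch that fits tokens = a * minutes^b by least squares in log-log space. The three data points are order-of-magnitude placeholders read off the sentence above (a minute at about a thousand tokens, an hour at about a million, a calendar week at about a billion). They are assumptions for illustration, not the institute's data, and the fitted values should not be read as the study's coefficients.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(minutes: Sequence[float], tokens: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of tokens = a * minutes**b, done in log10-log10 space."""
    if len(minutes) != len(tokens) or len(minutes) < 2:
        raise ValueError("need at least two paired observations")
    xs = [math.log10(m) for m in minutes]
    ys = [math.log10(t) for t in tokens]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                       # exponent: slope of the log-log line
    a = 10 ** (mean_y - b * mean_x)     # prefactor: from the log-log intercept
    return a, b


# Placeholder points (illustrative only): 1 minute, 1 hour, 1 calendar week.
minutes = [1, 60, 7 * 24 * 60]
tokens = [1e3, 1e6, 1e9]

a, b = fit_power_law(minutes, tokens)
print(f"tokens ~= {a:.0f} * minutes^{b:.2f}")


def predict_tokens(task_minutes: float) -> float:
    """Token estimate for a task of the given human-expert duration."""
    return a * task_minutes ** b
```

An exponent above 1 on these placeholder points would mean token demand grows faster than task length, which is the property that makes a fixed cap bite hardest on long tasks.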
That relationship helps explain why fixed budgets systematically exclude long-horizon work. A benchmark may contain tasks that are in principle solvable by a model, but not within the allotted spend. The Decoder cites a cyber task called “The Last Ones,” estimated to take a human expert about 20 hours, where no tested model reportedly succeeded below 30 million tokens.
For builders, that is a reminder that “agent failure” often combines at least three factors: model skill, tool access, and inference budget. Treating all failures as capability limits can produce misleading product decisions.
Another notable result is that newer frontier systems reportedly gain more from extra compute than older ones. The Decoder says the institute observed improvements across three dimensions: reach, meaning harder tasks become solvable; reliability, meaning the same task is solved more consistently; and efficiency, meaning fewer tokens are needed for a given result.
The reported time-horizon numbers make that concrete. A current frontier model’s horizon on cyber tasks rose from about 40 minutes at 2.5 million tokens to roughly four hours at 50 million tokens, according to The Decoder’s account of the study. Across the broader frontier, the horizon moved from about two hours to around 14 hours at the higher budget.
That does not mean all progress is smooth or monotonic. The institute reportedly found that on roughly 10 to 30 percent of tasks, newer models performed worse than their predecessors. That caveat matters because it pushes back against a simplistic “more recent equals better everywhere” narrative. For product teams, the result reinforces the need for task-specific testing rather than relying on broad model branding.
Still, if newer models extract disproportionate value from larger compute budgets, evaluation practices built around older cost assumptions may become increasingly outdated. Falling inference costs could make high-budget runs more accessible over time, allowing capabilities that currently seem too expensive to emerge in ordinary products and workflows.
This story rests primarily on The Decoder’s reporting of a study from the UK AI Security Institute; the research paper or institute publication itself was not among the sources reviewed here. The specific benchmark figures, token thresholds, and time-horizon estimates should therefore be treated as reported findings, not as results Creati.ai has independently verified against original materials.
Even so, the claims are directionally plausible and internally consistent. Anyone who has worked with AI agents on coding or security tasks has seen that longer runs can unlock better outcomes, particularly when the system can test hypotheses, inspect errors, and retry. What the institute appears to add is a structured argument that benchmark design is systematically biasing measurements downward.
There are also important boundaries to the findings. First, the gains are not universal, as the reported HealthBench result suggests. Second, higher token budgets raise costs, increase latency, and may create more room for unproductive search. Third, benchmark performance under expanded compute is not the same thing as dependable production performance under enterprise constraints.
The UK AI Security Institute reportedly now uses multiple budgets and looks for “minimum informative budgets” where performance stops improving materially. That is a useful concept, but it still leaves open questions about operational standards. Buyers do not just want to know maximum capability; they need to know capability at acceptable cost, speed, and risk.
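One possible reading of a “minimum informative budget” can be sketched as follows. The institute's actual definition is not given in the reporting, so the threshold `min_gain`, the function name, and the sample curve are all hypothetical. The sketch returns the smallest budget after which no larger tested budget improves the success rate by more than the threshold.

```python
from typing import Iterable, Tuple


def minimum_informative_budget(curve: Iterable[Tuple[int, float]], min_gain: float = 0.02) -> int:
    """Smallest budget beyond which no tested budget adds more than `min_gain`.

    `curve` holds (token_budget, success_rate) pairs from a budget sweep.
    If the largest tested budget is returned, the curve may not have
    plateaued yet and the sweep should probably be extended.
    """
    points = sorted(curve)
    for i, (budget, rate) in enumerate(points):
        best_from_here = max(r for _, r in points[i:])
        if best_from_here - rate <= min_gain:
            return budget
    raise ValueError("curve is empty")


# Hypothetical sweep results (illustrative only, not from the study).
sweep = [
    (1_000_000, 0.30),
    (5_000_000, 0.48),
    (10_000_000, 0.55),
    (50_000_000, 0.56),
    (100_000_000, 0.56),
]
print(minimum_informative_budget(sweep))
```

A definition like this says nothing about cost or latency at that budget, which is the operational gap noted above.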
For teams building AI agents, the message is straightforward: benchmark selection is no longer enough. Evaluation design has to include budget sweeps, especially for workflows in software engineering, cyber operations, and other tool-using domains. A model that looks mediocre under a tight, single-attempt budget may become viable when allowed to reason longer or retry more often.
For enterprise AI buyers, this complicates vendor comparisons. Two providers can cite benchmark wins that are not directly comparable if they were achieved under different compute ceilings. Procurement teams should ask not only for scores on SWE-Bench Pro, TerminalBench 2.0, or HealthBench, but also for the token budgets, latency, retry policies, and tool permissions used to produce them.
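As a sketch of what that disclosure could look like in practice, the record below captures the conditions behind a reported score and refuses to treat two scores as directly comparable unless those conditions match. The field names, the `directly_comparable` rule, and the sample values are illustrative assumptions, not a standard that any vendor or benchmark is known to follow.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class ReportedScore:
    vendor: str
    benchmark: str
    score: float                 # fraction of tasks solved
    token_budget: int            # per-task token ceiling
    max_attempts: int            # retry policy
    tools: FrozenSet[str]        # tool permissions granted to the agent
    median_latency_s: float      # reported alongside, since it is an outcome


def directly_comparable(a: ReportedScore, b: ReportedScore) -> bool:
    """True only if both scores were produced under the same test conditions.

    Latency is left out of the check because it is a result of the run,
    not a condition imposed on it.
    """
    return (a.benchmark, a.token_budget, a.max_attempts, a.tools) == (
        b.benchmark, b.token_budget, b.max_attempts, b.tools
    )


# Placeholder vendors and numbers (illustrative only).
score_a = ReportedScore("Vendor A", "SWE-Bench Pro", 0.41, 1_000_000, 1,
                        frozenset({"shell"}), 95.0)
score_b = ReportedScore("Vendor B", "SWE-Bench Pro", 0.52, 10_000_000, 3,
                        frozenset({"shell"}), 840.0)
print(directly_comparable(score_a, score_b))
```

In this made-up case the higher score came with ten times the token ceiling and three attempts, so the two numbers do not answer the same question.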
For safety and policy work, the study lands on an even more sensitive point. If harmful-capability evaluations in cybersecurity are being conducted under budgets that truncate performance, risk assessments may lag behind deployable reality. The UK AI Security Institute’s focus on cyber tasks suggests the issue is not merely academic. High-budget capability may become reachable in the real world as inference grows cheaper and orchestration tools improve.
The broader market implication is that evaluation may need to shift from static scores to capability curves. That will be messier and more expensive than current leaderboards, but it may better reflect how frontier models are actually used inside products.
The next key signal is whether the UK AI Security Institute publishes the underlying paper, methods, and benchmark configurations in enough detail for outside replication. Without that, the headline claim will remain important but harder to audit.
A second signal is adoption by benchmark maintainers and labs. If tests like SWE-Bench Pro, Humanity's Last Exam, or HealthBench begin reporting performance across budget ranges instead of single numbers, the institute’s argument will have immediate influence.
Third, watch model vendors. If labs start emphasizing budget-conditioned performance curves in place of point estimates, that will indicate the market accepts that test-time compute is part of capability, not just a runtime setting.
Finally, watch enterprise pricing and deployment patterns. As token costs fall, more customers may choose longer-running AI agents for coding and cyber workflows. If that happens, the difference between “benchmark capability” and “deployed capability” could narrow quickly.
The UK AI Security Institute is highlighting a blind spot the AI industry has tolerated because single-number benchmarks are easy to publish and compare. But AI agents are not static predictors. They are systems that search, verify, and recover from mistakes, and those behaviors are heavily shaped by how much compute they are allowed to consume.
For builders and buyers, the practical takeaway is not “always spend more tokens.” It is that evaluation must reflect the operating regime you actually care about. In software engineering and cybersecurity, where AI agents can benefit from iteration and feedback, budget is part of the product. If benchmark practice fails to capture that, both commercial decisions and safety judgments will keep arriving late.