
A new benchmark from Tencent Hunyuan and Tsinghua University argues that today’s AI search agents are not mainly held back by retrieval quality or tool use. The bigger failure point, according to the researchers’ reported results, is that models often do not stop to ask a clarifying question when a user request is vague, underspecified, or wrong.
That matters because the industry is moving quickly to package large models as research assistants, browser agents, and answer engines. If the benchmark holds up, it suggests a practical design problem for teams building AI search products: more searches and longer reasoning chains do not necessarily improve outcomes when the system never confirms what the user actually meant. In some cases, the researchers say, repeated searching performs worse than simply making a guess.
The new benchmark, called DiscoBench, is designed to test whether a model can detect ambiguity during multi-step information seeking, ask the user a useful follow-up question, and then recover the right research path. As described by The Decoder, the dataset includes 211 tasks with 463 ambiguous points spread across eleven domains, including sports, film, music, science, politics, and video games.
The researchers frame this as a gap in existing agent evaluation. Benchmarks such as GAIA and BrowseComp generally assume the user query is already complete and precise. DiscoBench instead focuses on a common production scenario: a user asks for something that could refer to multiple entities, different time periods, unclear ranking criteria, or even a false factual premise. In that setting, a model can execute a clean search workflow and still head in the wrong direction from the first decision.
According to the reported methodology, each task is broken into checkpoints where the agent can keep searching, ask for clarification, or answer. The benchmark uses Tavily for search and a Gemini 3 Flash-based user simulator that returns predefined clues when the agent asks a helpful follow-up question. The dataset is mostly in Chinese, which the researchers say reflects common patterns on the Chinese-language web.
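The checkpoint loop described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual harness: the `Checkpoint`, `SimulatedUser`, and `run_checkpoint` names are hypothetical, keyword matching is an assumed stand-in for the LLM-based user simulator, and `search_fn` is a placeholder for whatever search backend is plugged in.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    SEARCH = "search"
    ASK = "ask"
    ANSWER = "answer"


@dataclass
class Checkpoint:
    # Hypothetical structure: one ambiguous point plus the clue that resolves it.
    ambiguity: str
    clue: str
    keywords: tuple  # words a "helpful" question is assumed to contain


@dataclass
class SimulatedUser:
    checkpoint: Checkpoint

    def reply(self, question: str) -> str:
        # Release the predefined clue only when the question targets the ambiguity.
        q = question.lower()
        if any(k in q for k in self.checkpoint.keywords):
            return self.checkpoint.clue
        return "I'm not sure what you mean."


def run_checkpoint(policy, checkpoint, search_fn, max_steps=5):
    """Let the agent search, ask, or answer until it answers or runs out of steps."""
    user = SimulatedUser(checkpoint)
    history = []  # list of (action, text, observation)
    for _ in range(max_steps):
        action, text = policy(history)
        if action is Action.SEARCH:
            history.append((action, text, search_fn(text)))
        elif action is Action.ASK:
            history.append((action, text, user.reply(text)))
        else:
            history.append((action, text, None))
            break
    return history


# Toy policy (an assumption for the demo): ask once, then answer using the clue.
def ask_then_answer(history):
    if not history:
        return Action.ASK, "Which season do you mean?"
    return Action.ANSWER, "Answer based on: " + history[-1][2]


cp = Checkpoint(
    ambiguity="season not specified",
    clue="The 2019 season.",
    keywords=("season", "year"),
)
trace = run_checkpoint(ask_then_answer, cp, search_fn=lambda query: [])
```

In this toy run the agent asks one targeted question, receives the predefined clue, and then answers. A question that misses the ambiguity would get the generic fallback reply instead.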
That language and tooling context is important for interpretation. DiscoBench is not a universal measure of all search tasks on all web ecosystems, and the use of an LLM-based simulator means the interaction loop is structured rather than fully open-ended. Still, the benchmark is notable because it isolates a product behavior that many user-facing AI systems struggle with: knowing when not to proceed.
The headline result is modest absolute performance. The Decoder reports that among eleven recently released models, the best end-to-end score without an explicit ambiguity hint was 43.1 percent from Doubao Seed 2.0 Pro. Gemini 3.1 Pro Preview followed at 40.8 percent, with Claude Opus 4.7 at 39.8 percent.
Those numbers are low enough to make the broader point hard to ignore. Even strong frontier models appear to struggle once ambiguity is introduced into a chained search task. The benchmark authors argue that the main issue is not that models cannot search, but that they assume too much and ask too little.
The behavior analysis cited by The Decoder is especially revealing. Systems that searched and then asked a follow-up question reportedly achieved a 93.4 percent success rate. Models that guessed directly reached 56.5 percent. Models that searched repeatedly but still failed to ask, labeled “SearchHeavyGuess,” fell to 51.9 percent, 4.6 points below direct guessing. In the researchers’ interpretation, that pattern suggests some models are effectively sensing uncertainty but not converting it into a user interaction.
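One way to reproduce this kind of breakdown on your own agent logs is to label each trace by its action pattern and compute a success rate per label. The sketch below is an assumption-laden illustration: only the “SearchHeavyGuess” label comes from the reported study, while the other label names, the `heavy_threshold` cutoff, and the toy data are invented for the example.

```python
from collections import defaultdict


def label_trace(actions, heavy_threshold=3):
    """Label an ordered list of 'search' / 'ask' / 'answer' actions.

    heavy_threshold is an assumed cutoff for "searched repeatedly".
    """
    if "ask" in actions:
        first_ask = actions.index("ask")
        if "search" in actions[:first_ask]:
            return "SearchThenAsk"
        return "AskWithoutSearch"
    if actions.count("search") >= heavy_threshold:
        return "SearchHeavyGuess"
    return "DirectGuess"


def success_by_label(runs):
    """runs: iterable of (actions, correct) pairs. Returns label -> success rate."""
    totals = defaultdict(lambda: [0, 0])  # label -> [wins, count]
    for actions, correct in runs:
        bucket = totals[label_trace(actions)]
        bucket[0] += int(correct)
        bucket[1] += 1
    return {label: wins / count for label, (wins, count) in totals.items()}


# Toy data, not benchmark results.
runs = [
    (["search", "ask", "search", "answer"], True),
    (["search", "ask", "answer"], True),
    (["answer"], True),
    (["search", "answer"], False),
    (["search", "search", "search", "answer"], False),
    (["search", "search", "search", "search", "answer"], True),
]
rates = success_by_label(runs)
```

Splitting traces this way makes it possible to see whether extra searching is paying off or merely delaying a guess.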
This helps explain why additional tool use does not automatically translate into better outcomes. A model can perform many searches, inspect many pages, and still remain anchored to the wrong interpretation of the original prompt. In practical terms, builders cannot treat search depth as a substitute for clarification behavior.
The timing matters because AI search is moving beyond demos into commercial workflows. Teams are shipping research copilots, customer support assistants, and browser automation products that increasingly depend on multi-step retrieval. For those systems, DiscoBench points to a failure mode that is easy to miss in conventional evaluation: the model looks active and competent while pursuing the wrong objective.
That has direct implications for enterprise AI deployments. In internal knowledge systems, ambiguity shows up constantly in project names, document versions, customer names, policy references, and date ranges. In external search products, the issue appears in comparisons, rankings, and brand or entity disambiguation. If a system treats every prompt as complete, it may produce confident but irrelevant work while still appearing highly responsive.
For builders of AI agents, the benchmark suggests a design shift. Clarification should not be treated as a fallback for obvious confusion. It may need to become a first-class capability with explicit thresholds, state tracking, and product UX that makes asking follow-up questions feel natural rather than obstructive. The data cited by The Decoder also implies that prompt-level reminders can help ambiguity detection, but not enough to fix end-to-end task completion on their own.
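As a rough illustration of what “explicit thresholds” and “state tracking” could look like in practice, the sketch below keeps a score per candidate interpretation and asks the user only when the top two are too close to call. Everything here is hypothetical: the `ClarificationState` fields, the margin rule, and the default values are assumptions for the example, not anything described in the study.

```python
from dataclasses import dataclass, field


@dataclass
class ClarificationState:
    # Candidate interpretations of the request, each with an assumed confidence score.
    candidates: dict
    questions_asked: int = 0
    resolved: list = field(default_factory=list)  # interpretations the user confirmed


def should_ask(state, margin_threshold=0.2, max_questions=2):
    """Ask only if the top two interpretations are close and the budget allows it."""
    if state.questions_asked >= max_questions or len(state.candidates) < 2:
        return False
    top, second = sorted(state.candidates.values(), reverse=True)[:2]
    return (top - second) < margin_threshold


def record_answer(state, chosen):
    """Fold the user's answer back into the state so later steps can rely on it."""
    state.questions_asked += 1
    state.resolved.append(chosen)
    state.candidates = {chosen: 1.0}


state = ClarificationState(candidates={"2019 season": 0.50, "2021 season": 0.45})
ask_first = should_ask(state)      # close scores, so the agent should ask
record_answer(state, "2021 season")
ask_again = should_ask(state)      # only one candidate remains, so it should not
```

The question budget is there to keep clarification from becoming obstructive. The margin rule is one simple choice among many, and a real system would need to calibrate the scores it compares.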
That distinction matters for roadmap planning. Better system prompts may increase the frequency of questions, but a useful deployed agent also needs to ask the right question at the right moment and then incorporate the answer into the rest of the workflow. Detection, phrasing, and follow-through appear to be separate capabilities.
The strongest claims here come from a benchmark study as described by The Decoder, not from a peer-reviewed publication. That does not invalidate the findings, but it does mean readers should treat the performance rankings and behavioral conclusions as researcher-reported until the underlying paper, data, and evaluation details are more broadly scrutinized.
Several limitations stand out from the available evidence. First, DiscoBench is mostly written in Chinese, so results may not transfer cleanly to English-language search behavior or enterprise document workflows. Second, the benchmark relies on Tavily and a simulated user built with Gemini 3 Flash. That setup is reasonable for controlled testing, but it is not the same as measuring full production systems with real users, different search stacks, or custom orchestration.
Third, the model list and versions are as reported by The Decoder, including Claude Opus 4.7, GPT 5.4, Gemini 3.1 Pro Preview, DeepSeek V4 Pro, GLM 5.1, Qwen3.6 Max, Kimi K2.6, MiniMax M2.7, MiMo v2.5 Pro, Hunyuan 3.0 Preview, and Doubao Seed 2.0 Pro. Some of those naming conventions may reflect the benchmark authors’ internal or regional labeling, and the source material does not provide a full model card-style accounting of configuration choices.
Still, some patterns look robust even with those caveats. The authors report that without search access, performance collapses, which supports the idea that the tasks require live retrieval rather than memorized knowledge. They also report that when ambiguity is removed from the queries, accuracy rises by roughly 26.8 to 40.2 percentage points depending on the model. If replicated, that is a strong signal that ambiguity handling itself is the bottleneck.
The Decoder's article also situates DiscoBench within a broader line of criticism around AI search reliability. It cites LiveBrowseComp as evidence that models can over-rely on prior knowledge and cites Halluhard for hallucination issues in source verification. Those are adjacent studies, not direct validations of DiscoBench, but they reinforce the view that browsing competence remains fragile.
The findings arrive as vendors push different approaches to AI-assisted research. Anthropic has said Claude Opus 4.8 is tuned to flag uncertainty more often, according to The Decoder’s summary of the update. If that claim holds in independent testing, it would line up closely with the weakness DiscoBench is trying to expose.
Perplexity, meanwhile, has been exploring Search as Code, an approach that lets models express search workflows as Python programs rather than relying only on prebuilt search API patterns. That may help with planning and verification, but DiscoBench suggests a separate question remains unresolved: can the system recognize when the missing information is not on the web at all, but still in the user’s head?
For teams evaluating AI agents, this creates a more nuanced procurement checklist. Comparing benchmark scores on search-heavy tasks is no longer enough. Buyers may need to test whether a product can pause, identify the ambiguity type, ask a compact clarifying question, and resume the task without resetting context. In regulated or high-stakes domains, that capability may be more important than raw retrieval speed.
The next signal to watch is whether Tencent Hunyuan and Tsinghua University publish broader documentation, code, or public examples for DiscoBench. Independent replication will matter, especially across English-language tasks and with real user studies.
It will also be worth watching whether model providers start reporting clarification metrics alongside retrieval and reasoning benchmarks. A useful standard might include ambiguity detection, question quality, recovery rate after clarification, and failure modes by domain.
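A reporting standard along those lines could be computed from per-episode logs. The sketch below is one possible formulation, offered as an assumption rather than an established metric set: the `EpisodeRecord` fields, the metric definitions, and the sample records are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EpisodeRecord:
    domain: str
    ambiguous: bool          # the query contained an ambiguity
    asked: bool              # the agent asked a clarifying question
    question_helpful: bool   # the question targeted the actual ambiguity
    final_correct: bool      # the final answer was correct


def _rate(numerator, denominator):
    return len(numerator) / len(denominator) if denominator else None


def clarification_metrics(records):
    ambiguous = [r for r in records if r.ambiguous]
    asked = [r for r in ambiguous if r.asked]
    helpful = [r for r in asked if r.question_helpful]
    recovered = [r for r in helpful if r.final_correct]
    failures = Counter(r.domain for r in ambiguous if not r.final_correct)
    return {
        "detection_rate": _rate(asked, ambiguous),    # asked when it should have
        "question_quality": _rate(helpful, asked),    # asked the right thing
        "recovery_rate": _rate(recovered, helpful),   # used the answer correctly
        "failures_by_domain": dict(failures),
    }


# Toy records, not benchmark data.
records = [
    EpisodeRecord("sports", True, True, True, True),
    EpisodeRecord("sports", True, True, False, False),
    EpisodeRecord("film", True, False, False, False),
    EpisodeRecord("film", True, False, False, True),
    EpisodeRecord("music", False, False, False, True),
]
metrics = clarification_metrics(records)
```

Keeping detection, question quality, and recovery as separate rates shows where in the clarification chain a system actually breaks down.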
On the product side, look for changes in AI agent interfaces. If vendors begin making clarification a visible, intentional part of the user experience rather than an occasional interruption, that would suggest the market is taking this category of failure seriously.
Finally, keep an eye on whether systems like Claude Opus 4.8, Gemini 3.1 Pro, or GPT 5.4 show measurable gains on ambiguity-heavy tasks in independent testing. The competitive edge in AI search may increasingly come from restraint and dialogue, not just from more tools.
DiscoBench is a useful reminder that many AI product failures start before retrieval, not after it. Teams often optimize for better search connectors, bigger context windows, and more elaborate agent loops. But if the model accepts an ambiguous brief and runs with it, the whole stack can produce polished irrelevance.
For builders, the practical takeaway is simple: treat clarification as core infrastructure. The winning systems in AI search may be the ones that know when to stop, ask one sharp question, and only then continue. That is less flashy than autonomous browsing, but for enterprise AI and user trust, it is probably the more important capability.