OpenAI has introduced GeneBench-Pro, a new benchmark designed to test whether AI systems can do more than execute standard analysis scripts in biology. According to the company, the benchmark targets the harder part of computational research: making judgment calls under ambiguity, revising assumptions as evidence changes, and deciding when an answer is reliable enough for a downstream scientific or clinical decision.

The release matters because many AI evaluations still reward recall, coding fluency, or success on tightly specified tasks. OpenAI is arguing that real-world biology work looks different. In its description of GeneBench-Pro, the company says scientists often face messy data, incomplete signals, and multiple defensible analysis paths. That makes genomics and translational research a useful stress test for AI agents that claim to support high-value expert workflows.

What OpenAI released

OpenAI describes GeneBench-Pro as an expanded successor to GeneBench, covering harder tasks across genomics, quantitative biology, and translational medicine. The benchmark contains 129 questions, each framed as a self-contained analysis problem. Models receive a short prompt, dataset files, and access to a constrained workspace with Python and a standard scientific stack, including tools such as PLINK 2.0.

The company says each problem is built around what it calls “research taste,” meaning the sequence of analytical judgments required to decide what the data can support, which methods are appropriate, and when an initial plan should be changed. That is a notable framing shift from many AI benchmarks, which tend to focus on whether a model can reproduce a known procedure rather than determine the right procedure in the first place.

To support outside inspection, OpenAI says it is open-sourcing 10 representative problems on Hugging Face and plans to provide a 50-question subset to Artificial Analysis for third-party benchmarking. A separate case-studies page outlines example tasks, including treatment-effect estimation in a synthetic oncology registry, evaluation of an apparent lncRNA dependency from CRISPRi data, and disease-effect estimation using cis-MVMR. Those examples are meant to show the range of workflows bundled into GeneBench-Pro rather than a narrow focus on one biology subdomain.

Why OpenAI says this benchmark is different

The main technical claim behind GeneBench-Pro is that it avoids common weaknesses in long-horizon scientific benchmarks. OpenAI says historical real-world datasets can create grading problems because multiple reasonable analytical choices may lead to slightly different answers, while poorly designed tasks can also let models pass despite serious methodological errors.

Its solution was to generate benchmark problems synthetically while controlling the full data-generating process. According to OpenAI, that lets the benchmark creators know the causal structure, tune the difficulty, verify that correct approaches succeed, and test through ablations that plausible-but-wrong approaches fail. The company also says it audited draft problems for information leakage and unintended shortcuts.
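To make the idea concrete, here is a minimal sketch of a synthetic problem with a known data-generating process. It is not OpenAI's actual generator or grader: the confounded-treatment setup, the constants, and the function names (`simulate`, `naive_estimate`, `adjusted_estimate`, `grade`) are all hypothetical, chosen only to illustrate how a benchmark author who controls the causal structure can check that a sound approach passes and a plausible-but-wrong one fails.

```python
import random
from statistics import mean

# Hypothetical ground truth, known only to the benchmark author.
TRUE_EFFECT = 2.0   # causal effect of treatment on the outcome
CONFOUNDING = 3.0   # effect of the confounder on the outcome


def simulate(n=20000, seed=0):
    """Generate (confounder, treatment, outcome) rows from a known process."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0       # binary confounder
        p_treat = 0.8 if z else 0.2              # confounder drives treatment
        t = 1 if rng.random() < p_treat else 0
        y = TRUE_EFFECT * t + CONFOUNDING * z + rng.gauss(0, 1)
        rows.append((z, t, y))
    return rows


def naive_estimate(rows):
    """Plausible-but-wrong: raw difference in means, ignoring the confounder."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return mean(treated) - mean(control)


def adjusted_estimate(rows):
    """Sound approach here: stratify on the confounder, then size-weight."""
    total = len(rows)
    estimate = 0.0
    for level in (0, 1):
        stratum = [(t, y) for z, t, y in rows if z == level]
        treated = [y for t, y in stratum if t == 1]
        control = [y for t, y in stratum if t == 0]
        estimate += (len(stratum) / total) * (mean(treated) - mean(control))
    return estimate


def grade(estimate, truth=TRUE_EFFECT, tol=0.15):
    """Deterministic scoring: pass only if the answer is within tolerance."""
    return abs(estimate - truth) <= tol


rows = simulate()
print("adjusted passes:", grade(adjusted_estimate(rows)))
print("naive passes:", grade(naive_estimate(rows)))
```

Because the author set the true effect, the grader can be a simple tolerance check rather than a comparison against one canonical script. In this toy setup the naive estimate is biased upward by the confounder, so running it through `grade` plays the role of an ablation confirming that the shortcut does not earn credit.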

That design choice matters for AI evaluation. In coding, deterministic grading is relatively straightforward because code either passes tests or does not. In scientific analysis, especially in computational biology, success is often about inference quality rather than exact reproduction of a canonical sequence of steps. OpenAI is effectively trying to build a benchmark that preserves the ambiguity of research work while still allowing deterministic scoring.

The company also says 82 of the 129 questions were reviewed by external domain experts, including graduate students, postdoctoral researchers, industry scientists, and professors. Reviewers assessed realism, identifiability of the target answer, and whether the methods and estimators were appropriate, with feedback used to revise the problems. That does not make the benchmark neutral by default, but it suggests OpenAI is trying to preempt criticism that the tasks reflect only internal assumptions.

The performance numbers, and their limits

OpenAI’s headline result is that its model GPT-5.6 Sol achieved a 28.7% pass rate on GeneBench-Pro at the highest reasoning level, rising to 31.5% with Pro mode enabled. The company contrasts that with what it says was a below-5% score from GPT-5 at the time OpenAI began building the earlier GeneBench benchmark.

OpenAI also says test-time compute matters sharply. At the lowest reasoning level, GPT-5.6 Sol reportedly scores only in the single digits, while at the highest reasoning level it solves nearly six times as many questions as GPT-5.2 while using about two-thirds as many tokens. That claim, if borne out independently, would be relevant to product teams trying to balance latency and cost against quality in expert-agent deployments.

The company further argues that GPT systems appear stronger than leading open-source alternatives on this kind of quantitative scientific reasoning. In the post, OpenAI specifically mentions GLM 5.2 as a leading open-source comparison and says the gap on GeneBench-Pro is larger than one would expect from coding benchmarks alone.

But these are vendor-reported results from an OpenAI-designed benchmark. OpenAI acknowledges that frontier GPT models were used during development to evaluate and harden problems, and says it initially suspected this might bias the benchmark against GPT models relative to other families. The company’s conclusion is that competitors still only matched, at best, the corresponding GPT model available at the time. Even so, until Artificial Analysis or other outside groups publish independent runs, the strongest comparative claims should be treated as provisional.

What this means for AI builders and enterprise buyers

For builders, GeneBench-Pro highlights a practical problem in AI agents: benchmark success in coding or question answering may not transfer cleanly to domains where the task is deciding what analysis to run. Teams building scientific assistants, healthcare research tools, or internal lab copilots often find that the hard failure modes happen upstream of execution. A model may write correct Python yet choose the wrong estimand, ignore a confounder, or overstate confidence from weak data.

OpenAI is positioning GeneBench-Pro as a way to measure exactly those failure modes. If that framing gains traction, it could push more AI evaluation toward system-level judgment tests rather than narrower unit tests. That would matter not just in biology, but across enterprise AI settings where ambiguity, partial observability, and workflow revisions are common.

For enterprise buyers in biotech and pharma, the release is more useful as a signal than as a procurement shortcut. OpenAI itself says current AI agents remain too unreliable to replace human experts. At the same time, the company argues that the economics are becoming hard to ignore: reviewers estimated a typical GeneBench-Pro problem might take a human expert 20 to 40 hours, while model inference costs are only several dollars per problem. Those numbers are OpenAI’s framing, not an independently validated ROI model, but they point to where buyers may see value first: triage, exploratory analysis, or draft analytic work that remains under expert supervision.

The benchmark also fits a broader push toward AI agents that can operate in domain-specific software environments, not just chat windows. By using a realistic workspace with Python and bioinformatics packages, GeneBench-Pro aligns with how many builders now think about deployable agents: tool-using systems that work across files, code, and iterative reasoning loops.

Evidence, validation, and open questions

The evidence base here is primarily OpenAI’s own announcement and case-study materials. That means the core facts about benchmark design, dataset structure, the 129-question size, the use of synthetic generation, and the reported GPT-5.6 Sol scores come from the vendor itself.

Some elements are stronger than others. The existence of the benchmark, the planned release of 10 problems on Hugging Face, and the forthcoming 50-question subset for Artificial Analysis are concrete and checkable. The external expert-review process is also a meaningful credibility signal, though the announcement, at least in the material reviewed here, does not give a full public breakdown of reviewer outcomes.

The comparative model rankings, the significance of the gap versus coding benchmarks, and the implication that the benchmark may be saturated by year-end are interpretive claims from OpenAI. They may prove directionally correct, but they are not yet independent market consensus. Likewise, the cost comparison between human expert labor and AI inference is best read as an illustrative framing, not as a deployment-ready business case.

What to watch next

The first concrete signal will be whether the Hugging Face release gives outside researchers enough material to probe GeneBench-Pro’s construction, grading logic, and susceptibility to shortcutting. If independent teams can reproduce OpenAI’s general findings, the benchmark will carry more weight.

A second signal is the planned handoff to Artificial Analysis. Third-party runs across GPT models and non-OpenAI systems will matter more than internal comparisons, especially if they reveal narrower or wider gaps than OpenAI reports.

Third, watch whether other labs respond with comparable benchmarks in wet-lab biology, drug discovery, or clinical research analytics. If GeneBench-Pro becomes a reference point, competitors may need to show not just strong coding or general reasoning scores but domain-specific judgment under uncertainty.

Finally, the most important product signal is whether benchmark gains map to usable tools. If future OpenAI or partner products start showing robust performance in genomics, translational medicine, or broader computational biology workflows, GeneBench-Pro will look less like a research artifact and more like an early readiness test for enterprise AI in science.

Creati.ai perspective

GeneBench-Pro is notable less because of the current pass rates than because of what it tries to measure. OpenAI is making the case that the next bottleneck for AI in expert work is not raw execution but judgment: choosing the right path, revising it when evidence changes, and knowing when not to overclaim. That is a more demanding standard than most benchmark culture has used so far.

For the market, this is a useful development even if the numbers remain vendor-reported for now. AI builders need harder evaluation targets for research-grade workflows, and enterprise buyers need better ways to separate polished demos from systems that can survive ambiguous, high-stakes analysis. Whether GeneBench-Pro becomes a standard will depend on outside validation, but it captures an important shift in AI from producing answers to exercising disciplined analytic reasoning.


OpenAI launches GeneBench-Pro to test whether AI can make research-grade judgment calls in computational biology

OpenAI unveiled GeneBench-Pro, a genomics benchmark meant to measure higher-order scientific reasoning as AI labs push into biology workflows.