New healthcare benchmark finds top AI agents fail most real workflows

May 20, 2026

By AI, Created 5:50 PM UTC, May 20, 2026, /AGP/ – actAVA.ai released CHI-Bench, an open-source benchmark that tests frontier AI agents across 75 long-horizon healthcare workflows and finds the best system passes only 28% of cases. The results raise questions about whether agents are ready for prior authorization, utilization review, and care management work that can affect patient access and payer decisions.

Why it matters:
- Healthcare workflows often involve multiple handoffs, policy checks, and documentation steps.
- A missed step can lead to denied authorization, delayed treatment, or audit findings.
- CHI-Bench suggests leading AI agents are not yet reliable enough for end-to-end clinical operations.

What happened:
- actAVA.ai released CHI-Bench, described as the first long-horizon healthcare benchmark for AI agents.
- The benchmark tested 30 frontier agents from Anthropic, OpenAI, Google, x.AI, DeepSeek, and Z.ai.
- The benchmark covered 75 workflows in prior authorization, utilization review, and care management.
- Code, data, and the live leaderboard are available at the benchmark site.

The details:
- Each trial ran for 60 to 80 steps across four to six clinical stages.
- The benchmark exposed agents to 21 healthcare apps through more than 200 MCP tools and a 1,279-document operations handbook.
- CHI-Bench evaluated trajectory, artifacts, and world state using deterministic unit tests and an LLM judge.
- The evaluation checked evidence grounding, consent, and cross-stage consistency.
- Anthropic’s Claude Code with Opus 4.6 posted the best overall result at 28% pass@1.
- OpenAI’s Codex with GPT-5.5 followed at 21% pass@1.
- By domain, utilization review reached 41%, care management 32%, and prior-authorization paperwork 29%.
- No agent scored above 20% when the same case was run three times.
- In endurance testing with 25 cases in one session, the best system completed under 4%.
- In a fully end-to-end test, no task passed when one AI submitted a prior-auth request and a second AI acted as the utilization management (UM) reviewer.
- CHI-Bench is open under Apache 2.0 on GitHub, and the leaderboard accepts community submissions.
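
The gap between a single-run score and a repeated-run score can be illustrated with a minimal Python sketch. It assumes, hypothetically, that the repeated-run figure counts a case only when every one of its runs passes; the article does not say how CHI-Bench computes that number. The case names and outcomes below are invented for illustration and are not benchmark data.

```python
from typing import Dict, List


def pass_at_1(results: Dict[str, List[bool]]) -> float:
    """Mean single-run success rate, averaged over cases."""
    per_case = [sum(runs) / len(runs) for runs in results.values()]
    return sum(per_case) / len(per_case)


def pass_all_k(results: Dict[str, List[bool]], k: int) -> float:
    """Fraction of cases whose first k runs ALL passed.

    Assumption: this stricter rule stands in for the article's
    "same case run three times" score; the real definition may differ.
    """
    for case, runs in results.items():
        if len(runs) < k:
            raise ValueError(f"{case} has fewer than {k} runs")
    passed = sum(all(runs[:k]) for runs in results.values())
    return passed / len(results)


# Hypothetical outcomes for four made-up cases, three runs each.
results = {
    "case_a": [True, True, True],
    "case_b": [True, False, True],
    "case_c": [False, False, False],
    "case_d": [True, True, False],
}

print(f"pass@1:        {pass_at_1(results):.2f}")
print(f"all 3 of 3:    {pass_all_k(results, 3):.2f}")
```

In this toy data, three of the four cases pass at least once, but only one passes every time, so the repeated-run score falls well below the single-run average. That is the pattern the low repeatability figures point to.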

Between the lines:
- The benchmark is aimed at a core gap in current agent claims: long workflow reliability, not short task completion.
- The results imply that agents may still struggle with policy-driven healthcare settings where errors compound across stages.
- The low repeatability scores suggest consistency remains a bigger problem than single-run performance.
- The coalition behind CHI-Bench included Johns Hopkins, Wellstar, Yale, Stanford, CMU, Oxford, USC, UCSD, and researchers Caiming Xiong, Sanmi Koyejo, Eric P. Xing, and Philip S. Yu.
- Haolin Chen said the workflows are long, role-composed, and gated by policy, with one wrong site-of-service flip cascading into multiple failures.
- Weiran Yao said CHI-Bench was built to test whether an agent can carry a real case end-to-end without error.
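
A back-of-the-envelope calculation shows why errors compounding across a long workflow matter. The sketch below assumes, purely for illustration, that every step succeeds independently with the same probability. That is a simplification introduced here, not a model used by CHI-Bench, and both function names are hypothetical.

```python
def end_to_end_success(step_reliability: float, steps: int) -> float:
    """Chance that all steps succeed, assuming independent, equally reliable steps."""
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed to reach a target end-to-end success rate."""
    return target ** (1.0 / steps)


# The article reports trials of 60 to 80 steps; the 98% per-step figure is hypothetical.
for steps in (60, 80):
    overall = end_to_end_success(0.98, steps)
    needed = required_step_reliability(0.90, steps)
    print(f"{steps} steps: 98% per step gives {overall:.1%} end to end; "
          f"90% end to end needs {needed:.2%} per step")
```

Under these assumptions, an agent that is right 98% of the time at each step still finishes fewer than one in three 60-step cases cleanly.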

What’s next:
- actAVA.ai is inviting community submissions to the public leaderboard.
- The open benchmark could become a reference point for measuring whether future healthcare agents improve on long-horizon reliability.
- Broader adoption will likely depend on whether new systems can raise pass rates and reduce variation across repeated runs.

The bottom line:
- The first public long-horizon healthcare benchmark paints a blunt picture: even the best frontier agents are failing most real-world clinical workflows today.

Disclaimer: This article was produced by AGP Wire with the assistance of artificial intelligence based on original source content and has been refined to improve clarity, structure, and readability. This content is provided on an “as is” basis. While care has been taken in its preparation, it may contain inaccuracies or omissions, and readers should consult the original source and independently verify key information where appropriate. This content is for informational purposes only and does not constitute legal, financial, investment, or other professional advice.
