HealthBench Does Not Evaluate Patient Safety
HealthBench is a recently released benchmark for evaluating large language models in healthcare. This post summarizes what HealthBench is and reviews its strengths and weaknesses, including a gap in how it evaluates patient safety.
What is HealthBench?
HealthBench is a new healthcare LLM benchmark released by OpenAI in May 2025. It includes 5,000 health conversations between a model and a user, where the user could be a layperson or a healthcare professional, and the conversation can be single-turn (user message only) or multi-turn (alternating user and model messages ending with a user message). Given a conversation, the LLM being evaluated must respond to the last user message. This LLM’s response is then graded according to a conversation-specific rubric.
The conversation-specific rubrics were created by 262 physicians across 26 specialties and 60 countries, yielding 48,562 individual rubric criteria across the whole dataset.
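To make the setup concrete, here is a minimal sketch of how one such example might be represented and graded. The field names ("prompt", "criterion", "points") and the two helper callables are my own illustrative assumptions, not HealthBench's actual schema; the scoring rule (points earned over achievable positive points, clipped at zero) follows the paper's high-level description of rubric grading.

```python
from typing import Callable

# Illustrative HealthBench-style example (field names are assumptions).
example = {
    "prompt": [
        {"role": "user", "content": "My father is suddenly slurring his words and can't lift his right arm."},
    ],
    "rubric": [
        {"criterion": "Advises calling emergency services immediately", "points": 10},
        {"criterion": "Mentions that the symptoms are consistent with a stroke", "points": 5},
    ],
}

def evaluate_example(
    respond: Callable[[list], str],             # model under test: conversation -> reply
    criterion_met: Callable[[str, str], bool],  # grader model: (criterion, reply) -> met?
    example: dict,
) -> float:
    """Grade one conversation: get the model's reply to the last user
    message, then judge each rubric criterion with a grader model."""
    reply = respond(example["prompt"])
    earned = sum(c["points"] for c in example["rubric"] if criterion_met(c["criterion"], reply))
    possible = sum(c["points"] for c in example["rubric"] if c["points"] > 0)
    return max(0.0, earned / possible)  # fraction of achievable rubric points

# Toy usage with stand-in callables (a real run would call LLM APIs):
score = evaluate_example(
    respond=lambda conv: "Call 911 now; these are signs of a possible stroke.",
    criterion_met=lambda criterion, reply: True,
    example=example,
)
print(f"Example score: {score:.2f}")  # -> 1.00 with the stand-ins above
```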
The conversations are categorized into seven themes: emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth.
The rubric criteria are partitioned into five axes: accuracy, completeness, communication quality, context awareness, and instruction following.
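Because each criterion is tagged with one of these axes, scores can also be rolled up per axis rather than only per conversation. The sketch below shows one way to do that; the per-criterion layout ({"axis", "points", "met"}) and the helper function are illustrative assumptions, not the benchmark's real data format.

```python
from collections import defaultdict

def axis_breakdown(graded_criteria: list[dict]) -> dict[str, float]:
    """Roll up graded rubric criteria by axis.
    Assumes each entry looks like {"axis": "accuracy", "points": 5, "met": True}."""
    earned: dict[str, float] = defaultdict(float)
    possible: dict[str, float] = defaultdict(float)
    for c in graded_criteria:
        if c["points"] > 0:
            possible[c["axis"]] += c["points"]
        if c["met"]:
            earned[c["axis"]] += c["points"]
    # Clip at zero so met negative-point criteria can't push an axis below 0.
    return {axis: max(0.0, earned[axis] / possible[axis]) for axis in possible}

# Toy usage:
print(axis_breakdown([
    {"axis": "accuracy", "points": 5, "met": True},
    {"axis": "completeness", "points": 8, "met": False},
    {"axis": "accuracy", "points": -4, "met": False},
]))
# -> {'accuracy': 1.0, 'completeness': 0.0}
```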