Data Science Collective

Advice, insights, and ideas from the Medium data science community


HealthBench Does Not Evaluate Patient Safety


HealthBench is a recently released benchmark for evaluating large language models in healthcare. This blog post summarizes what HealthBench is and reviews its strengths and weaknesses, including a patient safety gap.

What is HealthBench?

HealthBench is a new healthcare LLM benchmark released by OpenAI in May 2025. It includes 5,000 health conversations between a model and a user, where the user could be a layperson or a healthcare professional, and the conversation can be single-turn (user message only) or multi-turn (alternating user and model messages ending with a user message). Given a conversation, the LLM being evaluated must respond to the last user message. This LLM’s response is then graded according to a conversation-specific rubric.
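To make the grading mechanism concrete, here is a minimal sketch of rubric-based scoring. The specifics are assumptions for illustration, not HealthBench's exact implementation: it assumes each rubric criterion carries a point value (possibly negative for undesirable behaviors), that a grader judges whether each criterion is met, and that the final score is earned points over the maximum achievable positive points.

```python
# Hedged sketch of rubric-based response grading (illustrative only;
# the exact HealthBench scoring code is not shown in this post).

def grade_response(met_flags, point_values):
    """Score one model response against one conversation-specific rubric.

    met_flags: list of bools, whether each criterion was judged met.
    point_values: list of numbers, points per criterion (assumed to
    allow negative values for harmful or undesirable behaviors).
    """
    earned = sum(p for met, p in zip(met_flags, point_values) if met)
    max_positive = sum(p for p in point_values if p > 0)
    if max_positive == 0:
        return 0.0
    # Clip to [0, 1] so negative criteria cannot push the score below zero.
    return max(0.0, min(1.0, earned / max_positive))

# Example: three criteria worth 5, 3, and -2 points; the response
# meets the first criterion and triggers the negative one.
score = grade_response([True, False, True], [5, 3, -2])  # (5 - 2) / 8 = 0.375
```

Aggregating such per-conversation scores across the 5,000 conversations would then yield an overall benchmark score.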

The conversation-specific rubrics were created by 262 physicians across 26 specialties and 60 countries, yielding 48,562 individual rubric criteria across the whole dataset.

The conversations are categorized into seven themes: emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth.

The rubric criteria are partitioned into five axes: accuracy, completeness, communication quality, context awareness, and instruction following.



Written by Rachel Draelos, MD, PhD

CEO at Cydoc | Physician Scientist | MD + Computer Science PhD | AI/ML Innovator
