Data Science Collective

Advice, insights, and ideas from the Medium data science community


HealthBench Does Not Evaluate Patient Safety


HealthBench is a recently released benchmark for evaluating large language models in healthcare. This blog post summarizes what HealthBench is and reviews its strengths and weaknesses, including a patient safety gap.

What is HealthBench?

HealthBench is a new healthcare LLM benchmark released by OpenAI in May 2025. It includes 5,000 health conversations between a model and a user, where the user could be a layperson or a healthcare professional, and the conversation can be single-turn (user message only) or multi-turn (alternating user and model messages ending with a user message). Given a conversation, the LLM being evaluated must respond to the last user message. This LLM’s response is then graded according to a conversation-specific rubric.
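To make the grading mechanism concrete, here is a minimal sketch of rubric-based scoring. The specifics are assumptions for illustration, not HealthBench's exact implementation: it assumes each rubric criterion carries a point value (possibly negative for undesirable behaviors), that a grader judges whether each criterion is met, and that the final score is earned points over the maximum achievable positive points.

```python
# Hedged sketch of rubric-based response grading (illustrative only;
# the exact HealthBench scoring code is not shown in this post).

def grade_response(met_flags, point_values):
    """Score one model response against one conversation-specific rubric.

    met_flags: list of bools, whether each criterion was judged met.
    point_values: list of numbers, points per criterion (assumed to
    allow negative values for harmful or undesirable behaviors).
    """
    earned = sum(p for met, p in zip(met_flags, point_values) if met)
    max_positive = sum(p for p in point_values if p > 0)
    if max_positive == 0:
        return 0.0
    # Clip to [0, 1] so negative criteria cannot push the score below zero.
    return max(0.0, min(1.0, earned / max_positive))

# Example: three criteria worth 5, 3, and -2 points; the response
# meets the first criterion and triggers the negative one.
score = grade_response([True, False, True], [5, 3, -2])  # (5 - 2) / 8 = 0.375
```

Aggregating such per-conversation scores across the 5,000 conversations would then yield an overall benchmark score.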

The conversation-specific rubrics were created by 262 physicians across 26 specialties and 60 countries, yielding 48,562 individual rubric criteria across the whole dataset.

The conversations are categorized into seven themes: emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth.

The rubric criteria are partitioned into five axes: accuracy, completeness, communication quality, context awareness, and instruction following.



Written by Rachel Draelos, MD, PhD

CEO at Cydoc | Physician Scientist | MD + Computer Science PhD | AI/ML Innovator
