New BRIDGE Benchmark Evaluates AI in Real-World Clinical Practice

Researchers from Mass General Brigham have introduced BRIDGE, a multilingual benchmark designed to test LLM capabilities using real-world medical data.

Compiled by Sergey Kostenchuk. Published 2026-06-18. Updated 2026-06-19.

Researchers from Mass General Brigham have developed BRIDGE, a multilingual benchmark that evaluates LLMs on real electronic health records (EHRs). Models reach accuracy of up to 92% on standard exams, but the best LLMs score only 44.8% on BRIDGE.

🌍 The result reveals a critical gap between what models know in exam settings and how well they handle the nuances of real patient communication, which argues for evaluating them on unstructured clinical data.

👤 Developers and users of medical AI should not rely on high scores on standard tests: real-world performance in diagnosis and triage is currently far lower than those scores suggest.

Source 1: https://www.massgeneralbrigham.org/en/about/newsroom/press-releases/evaluating-ai-performance-for-everyday-patient-care
Source 2: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard