Researchers from Mass General Brigham have developed BRIDGE, a multilingual benchmark designed to evaluate how LLMs perform in real-world clinical practice. Modern models score up to 92% on standard medical exams, but BRIDGE results show that their accuracy drops to 44.8% when they work with real Electronic Health Records (EHR).

What Happened
A research team from Mass General Brigham tested 95 language models across 14 clinical specialties using the new BRIDGE benchmark. The study found that the models handle unstructured clinical data and the nuances of live patient communication significantly worse than their scores on standard tests would suggest.
Context
Existing medical benchmarks primarily test academic knowledge by mimicking the format of medical exams. This approach does not account for the specifics of working with real Electronic Health Records (EHR) or for the complexity of unstructured medical text, which is the norm in everyday practice.
Why It Matters for the Industry
For the industry, the findings point to a needed shift from testing on textbook questions to verifying performance on complex, real-world clinical data. The identified gap could inform new safety standards, and it opens a niche for specialized solutions focused on high accuracy when handling unstructured medical information.
Why It Matters for Users
Developers and users of medical AI systems should exercise caution: high model scores on standard tests do not guarantee reliability in real-world diagnostic or triage tasks. Current benchmark leaders may fall short when deployed in a real clinical environment.
What Is Not Yet Known / Limitations
The study did not identify differences in fundamental problem understanding among the presented roles. Discussion so far has focused on the implications of the identified gap.
Author
Look at AI, Editorial Team
