Researchers from Mass General Brigham have developed BRIDGE, a multilingual benchmark designed to evaluate the capabilities of LLMs in real-world clinical practice. Modern models score up to 92% on standard medical exams, but BRIDGE testing shows that their accuracy drops to 44.8% when they work with real Electronic Health Records (EHR).

What Happened

A research team from Mass General Brigham conducted large-scale testing of 95 language models across 14 different clinical specialties using the new BRIDGE tool. The study revealed that the models' ability to process unstructured clinical data and the nuances of live patient communication is significantly lower than their academic performance on standard tests.

Context

Existing medical benchmarks primarily focus on testing academic knowledge by mimicking the format of medical exams. However, this methodology fails to account for the specifics of working with real Electronic Health Records (EHR) and the complexity of unstructured medical text, which is the standard in everyday practice.

Why It Matters for the Industry

For the industry, this signals a need to shift from testing on textbook questions to verifying performance on complex, real-world clinical data. The identified gap makes the case for new safety standards and opens a niche for specialized solutions focused on high accuracy when handling unstructured medical information.

Why It Matters for Users

Developers and users of medical AI systems should exercise caution: high model scores on standard tests do not guarantee reliability in real-world diagnostic or triage tasks. Current benchmark leaders may fall short when deployed in a real clinical environment.

What Is Not Yet Known / Limitations

The study did not identify differences in how the presented roles understand the fundamental problem. Discussion so far focuses on the implications of the identified gap.

Author

Look at AI, Editorial Team