🩺 New BRIDGE Benchmark Evaluates AI in Real-World Clinical Practice
Researchers from Mass General Brigham have developed BRIDGE, a multilingual benchmark that evaluates LLMs on real electronic health records (EHRs). Models reach up to 92% accuracy on standard exams, but the best LLMs score only 44.8% on BRIDGE.
🌍 This reveals a critical gap between what AI models know academically and how well they handle the nuances of live patient communication, and it argues for a shift toward testing on unstructured data.
👤 Developers and users of medical AI should not rely on high scores in standard tests: real-world effectiveness in diagnosis and triage is currently far lower than those scores suggest.
Source 1: https://www.massgeneralbrigham.org/en/about/newsroom/press-releases/evaluating-ai-performance-for-everyday-patient-care
Source 2: https://huggingface.co/spaces/YLab-Open/BRIDGE-Medical-Leaderboard
