General-Purpose LLMs Outperform Specialized Medical AI Systems in...

A new study published in Nature Medicine has shown that leading general-purpose large language models, such as GPT-5.2, Gemini 3.1 Pro, and Claude Opus 4.6, score higher on medical knowledge and clinical alignment than specialized tools like OpenEvidence and UpToDate Expert AI.

What Happened

On medical benchmarks, Gemini 3.1 Pro scored 97.4% on the MedQA test, 7.8 percentage points ahead of the specialized system OpenEvidence, which scored 89.6%. General-purpose models also kept their lead on real clinical queries (RCQ), which calls into question how effective current specialized RAG (Retrieval-Augmented Generation) systems in medicine really are.
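To put the two MedQA figures in perspective, the short sketch below converts them into a percentage-point gap and an error-rate ratio. Only the two scores come from the study as reported here; the function name `compare_scores` and the error-rate framing are our own illustration, and the calculation assumes both systems were evaluated on the same question set.

```python
def compare_scores(general_pct: float, specialized_pct: float) -> dict:
    """Compare two benchmark accuracies given in percent.

    Hypothetical helper for illustration; not part of the study.
    """
    # Absolute gap in percentage points.
    gap_points = round(general_pct - specialized_pct, 1)
    # Share of questions each system answered incorrectly.
    general_error = round(100 - general_pct, 1)
    specialized_error = round(100 - specialized_pct, 1)
    # How many times more errors the specialized system made.
    error_ratio = round(specialized_error / general_error, 1)
    return {
        "gap_points": gap_points,
        "general_error": general_error,
        "specialized_error": specialized_error,
        "error_ratio": error_ratio,
    }


# Scores reported in the article: Gemini 3.1 Pro vs. OpenEvidence on MedQA.
result = compare_scores(97.4, 89.6)
print(result)
```

Read this way, a 7.8-point gap corresponds to an error rate of 2.6% versus 10.4%, or roughly four times fewer wrong answers on this benchmark.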

Context

It was long assumed that high accuracy in medicine required specialized tools built on a RAG architecture and narrow medical databases. The results suggest, however, that at the current stage the training scale and overall quality of the weights in general-purpose models add more accuracy than specialized approaches do.

Why It Matters for the Industry

For the industry, this means medical software development strategies need to be revised. Instead of building models on narrow datasets, the focus is shifting toward leading general-purpose APIs combined with high-quality prompt engineering and output control methods. This could lead to market consolidation around giants like Google, OpenAI, and Anthropic, while startups will have to find their competitive advantage not in "knowledge," but in deep integration into clinical processes or access to unique data.

Why It Matters for Users

Ordinary users and physicians should approach applications that position themselves exclusively as "medical AI" with caution. In several cases, a standard top-tier chatbot may prove to be a more accurate and understandable assistant for searching medical information than specialized services.

What Remains Unknown / Limitations

Experts differ on the implications: some see the results as a threat to current development paradigms, while others see a positive signal that ready-made APIs lower the barrier to entry for new vertical AI solutions.

Sources

Nature Medicine

Author

Look at AI, Editorial Staff