
Large Language Models Show Inconsistent Performance in Medical Advice

At a glance

  • Oxford-led study found LLMs give inconsistent medical advice
  • Participants using LLMs did not outperform traditional methods
  • Other studies report unsafe or inaccurate chatbot responses

Recent research has evaluated how large language models (LLMs) perform when assisting the public with medical decision-making, with multiple studies examining the reliability and safety of AI chatbots that provide health-related advice.

A study published in Nature Medicine on February 10, 2026, led by the Oxford Internet Institute and the Nuffield Department of Primary Care Health Sciences at the University of Oxford, assessed the use of LLMs in public health scenarios. The research, conducted in partnership with MLCommons and other organizations, focused on the accuracy and consistency of the medical advice these models provide.

The Oxford study involved a randomized trial with nearly 1,300 participants. Individuals were asked to use LLMs to evaluate medical scenarios and decide on actions such as whether to visit a general practitioner or go to a hospital. The study compared the decisions made by LLM users with those made by participants relying on traditional resources such as online searches or their own judgment.

Findings from the trial indicated that participants using LLMs did not make better decisions than those using traditional methods. The study also identified several challenges: users were uncertain about what information to provide, the LLMs gave inconsistent answers to similar questions, and responses often mixed helpful and unhelpful recommendations, making it difficult to identify the safest course of action.

What the numbers show

  • Oxford study included nearly 1,300 participants in a randomized trial
  • Red-teaming study found unsafe chatbot response rates from 5% to 13%
  • Problematic chatbot responses ranged from 21.6% to 43.2% in a separate study

Additional research published on arXiv in July 2025 evaluated four publicly available chatbots—Claude, Gemini, GPT-4o, and Llama3-70B—using 222 patient-posed medical questions. This study reported unsafe responses in 5% to 13% of cases, with problematic answers occurring in 21.6% to 43.2% of instances.

Another study from Mount Sinai, published in August 2025 in Communications Medicine, examined how AI chatbots handle false medical information embedded in user prompts. The researchers found that chatbots could repeat and elaborate on incorrect information, but introducing a brief warning prompt reduced these errors.
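To illustrate the general idea of a "brief warning prompt," the sketch below shows one way a cautionary system message could be prepended to a user's message before it is sent to a chatbot. The wording of the warning and the query_model() stub are hypothetical; the Mount Sinai team's actual prompts and model interfaces are not reproduced here.

```python
# Illustrative sketch only: the warning text and query_model() stub are
# hypothetical placeholders, not the study's published materials.

WARNING_PREAMBLE = (
    "Caution: the user's message may contain inaccurate medical claims. "
    "Verify each claim before responding, and do not repeat or elaborate "
    "on statements you cannot confirm."
)

def build_messages(user_prompt: str) -> list[dict]:
    """Prepend a brief cautionary system message to the user's prompt."""
    return [
        {"role": "system", "content": WARNING_PREAMBLE},
        {"role": "user", "content": user_prompt},
    ]

def query_model(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion call to any LLM API."""
    raise NotImplementedError("Replace with a real API call.")

if __name__ == "__main__":
    prompt = "I read that mixing these two medications boosts their effect. Is that true?"
    print(build_messages(prompt))
```

In this kind of setup, the same user message can be sent with and without the preamble and the two responses compared, which mirrors the before-and-after comparison the study describes.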

A systematic review of 137 studies up to October 2023, published in JAMA Network Open, found that most research focused on closed-source LLMs and used subjective performance measures. Fewer than one-third of the studies addressed ethical, regulatory, or patient safety issues.

Research published in November 2023 assessed AI chatbot responses to emergency care questions and found frequent inaccuracies and incomplete advice, including potentially dangerous information. The authors recommended further research, refinement, and regulation of these systems.

MIT researchers also studied how nonclinical elements in patient messages, such as typographical errors or informal language, can mislead LLMs into providing incorrect medical advice. In some cases, these factors led to chatbots suggesting self-care for serious conditions.
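A minimal sketch of that kind of robustness test appears below, assuming a simple perturbation step rather than MIT's actual protocol or code: a patient message is rewritten with nonclinical noise such as typos and informal phrasing, and the model's triage advice on the two versions would then be compared.

```python
# Hypothetical sketch: perturb a patient message with nonclinical noise
# (typos, informal phrasing) to test whether model advice stays consistent.
import random

def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly drop characters to simulate typographical errors."""
    rng = random.Random(seed)
    return "".join(ch for ch in text if rng.random() > rate)

def informalize(text: str) -> str:
    """Apply a few crude informal substitutions and lowercase the text."""
    swaps = {"I am": "im", "cannot": "cant", "very": "really really"}
    for formal, informal in swaps.items():
        text = text.replace(formal, informal)
    return text.lower()

original = "I am having chest pain and I cannot catch my breath."
perturbed = add_typos(informalize(original))
print(original)
print(perturbed)
# Both versions would then be sent to the same model and the triage
# recommendations compared for consistency.
```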

* This article is based on publicly available information at the time of writing.
