Popular AI chatbots often fail to recognize false health claims when they are couched in authoritative, medical-sounding language, leading to dubious advice that could be dangerous to the public, such as a recommendation that people insert garlic cloves into their rectums, according to a January study in the journal The Lancet Digital Health. Another study, published in February in the journal Nature Medicine, found that chatbots were no better than a regular internet search.
The results add to a growing body of evidence suggesting that such chatbots are not reliable sources of health information, at least for the general public, experts told LiveScience.
“The core problem is that LLMs don’t fail the way doctors fail,” Dr. Mahmud Omar, a researcher at Mount Sinai Medical Center and co-author of The Lancet Digital Health study, told LiveScience in an email. “A doctor who is unsure will pause, double-check, order another test. An LLM gives the wrong answer with exactly the same confidence as the right one.”
“Garlic Rectal Insertion for Immune Support”
LLMs are designed to respond to written input, such as a medical query, with natural-sounding text. ChatGPT and Gemini, along with medically focused LLMs such as Ada Health and ChatGPT Health, are trained on massive amounts of data, including much of the medical literature, and achieve near-perfect scores on medical licensing exams.
And people use them extensively: Although most LLMs carry a warning that they should not be relied upon for medical advice, over 40 million people ask ChatGPT medical questions every day.
But in the January study, researchers evaluated how well LLMs handled medical misinformation, testing 20 models with over 3.4 million prompts drawn from public forum and social media conversations, real hospital discharge notes edited to contain a single false recommendation, and fabricated claims presented as endorsed by doctors.
“About one out of three times they encountered medical misinformation, they just went along with it,” Omar said. “The finding that surprised us wasn’t the overall receptivity. It was the pattern.”
When false medical claims were presented in informal Reddit-style language, the models were quite skeptical, getting it wrong about 9% of the time. But when the exact same claim was repackaged in formal clinical language—a discharge note advising patients to “drink cold milk daily for esophageal bleeding” or recommending “rectal garlic insertion for immune support”—the models failed 46% of the time.
The reason for this may be structural; as LLMs are trained on text, they have learned that clinical language means authority, but they do not test whether a claim is true. “They’re assessing whether it sounds like something a reliable source would say,” Omar said.
But when misinformation was framed using logical fallacies—”a senior clinician with 20 years’ experience supports this” or “everyone knows this works”—the models became more skeptical. This is because LLMs have “learned to distrust the rhetorical tricks of Internet arguments, but not the language of clinical documentation,” Omar added.
For that reason, Omar believes that LLMs cannot be trusted to evaluate and relay medical information.
No better than an internet search
In the Nature Medicine study, researchers asked how well chatbots help people make medical decisions, such as whether to see a doctor or visit an emergency room. It concluded that LLMs did not provide more insight than a traditional Internet search, in part because participants did not always ask the right questions, and the answers they received often combined good and bad recommendations, making it difficult to decide what to do.
That’s not to say that everything chatbots relay is rubbish.
AI chatbots “can make some pretty good recommendations, so they’re (at least) somewhat reliable,” Marvin Kopka, an AI researcher at the Technical University of Berlin who was not involved in the research, told LiveScience via email.
The problem is that people without expertise have “no way to judge whether the result they get is correct or not,” Kopka said.
A chatbot can, for example, assess whether a severe headache after a night at the cinema is meningitis, which warrants a visit to the emergency room, or something more benign, according to the study. But users have no way to know whether that advice is sound, and a wrongly recommended wait-and-see approach can be dangerous. “While it’s likely to be helpful in many situations, it can be actively harmful in others,” Kopka said.
The findings suggest that chatbots are not a good tool for the public to use for healthcare decisions.
That doesn’t mean chatbots can’t be useful in medicine, Omar said, “just not the way people use them today.”
Bean, A. M., Payne, R. E., Parsons, G., Kirk, H. R., Ciro, J., Mosquera-Gómez, R., M, S. H., Ekanayaka, A. S., Tarassenko, L., Rocher, L., & Mahdi, A. (2026). Reliability of LLMs as medical assistants to the general public: a randomized pre-registered trial. Nature Medicine, 32(2), 609–615. https://doi.org/10.1038/s41591-025-04074-y