ChatGPT Health, OpenAI’s new health-focused chatbot, frequently underestimates the severity of medical emergencies, according to a study published last week in the journal Nature Medicine.
In the study, researchers tested ChatGPT Health’s ability to triage medical cases, assessing their severity based on real-life scenarios.
Previous research has shown that ChatGPT can pass medical exams, and nearly two-thirds of doctors reported using some form of AI in 2024. But other research has shown that chatbots, including ChatGPT, do not provide reliable medical advice.
ChatGPT Health is separate from OpenAI’s general ChatGPT chatbot. The program is free, but users must register specifically for the health program, which currently has a waiting list. OpenAI says ChatGPT Health runs on a more secure platform so users can safely upload personal medical information.
According to OpenAI, more than 40 million people worldwide use ChatGPT to ask healthcare questions, and nearly 2 million weekly ChatGPT messages are about insurance. In a detailed description of ChatGPT Health on its website, OpenAI says the tool is “not intended for diagnosis or treatment.”
In the study, researchers fed 60 medical scenarios to ChatGPT Health. The chatbot’s responses were compared with those of three physicians, who reviewed the same scenarios and rated each based on medical guidelines and clinical experience.
Each of the scenarios had 16 variations, changing things like the patient’s race or sex.
The variations were designed to “produce exactly the same result,” according to the study’s lead author, Dr. Ashwin Ramaswamy, an instructor of urology at Mount Sinai Hospital in New York City. This meant that an emergency case involving a man would still have to be classified as an emergency if the patient was a woman. The study found no significant differences in outcomes based on demographic changes.
The researchers found that ChatGPT Health “underclassified” 51.6% of emergency cases. That is, instead of recommending that the patient go to the emergency room, the chatbot recommended seeing a doctor within 24 to 48 hours.
The emergencies included one patient with diabetic ketoacidosis, a life-threatening complication of diabetes, and one patient with respiratory failure. Both conditions can be fatal if left untreated.
“Any doctor, and anyone with any training, would say that patient needs to go to the emergency department,” Ramaswamy said.
In cases such as impending respiratory failure, the chatbot appeared to be “waiting for the emergency to become undeniable” before recommending the emergency room, he said.
The study found that emergencies with unambiguous symptoms, such as stroke, were classified correctly 100% of the time.
An OpenAI spokesperson said the company welcomed research examining the use of AI in healthcare but said the new study does not reflect how ChatGPT Health is typically used or how it is designed to work. The chatbot is designed for people to ask follow-up questions that provide more context in medical situations, rather than to give a one-size-fits-all answer to a medical scenario, the spokesperson said.
ChatGPT Health is available only to a limited number of users, and OpenAI is still working to improve the security and reliability of the model before the chatbot becomes more widely available, the spokesperson said.
Compared with the doctors in the study, the chatbot also overclassified 64.8% of non-urgent cases, recommending a doctor’s appointment when one was not necessary. For example, the chatbot told a patient with a three-day sore throat to see a doctor within 24 to 48 hours, when home care would have been sufficient.
“It doesn’t make sense to me why recommendations are made in some areas and not others,” Ramaswamy said.
In scenarios involving suicidal ideation or self-harm, the chatbot’s responses were also inconsistent.
When users express suicidal intentions, ChatGPT is supposed to refer them to 988, the Suicide & Crisis Lifeline. ChatGPT Health works the same way, the OpenAI spokesperson said.
In the study, however, ChatGPT Health sometimes referred users to 988 when it was not needed and failed to refer them when it was.
Ramaswamy called the chatbot’s behavior “paradoxical.”
“It was inverted relative to clinical risk,” he said. “It was a little bit the other way around.”
‘A medical therapist’
Dr. John Mafi, an associate professor of medicine and primary care physician at UCLA Health who was not involved in the research, said more testing is needed on chatbots that can make health decisions.
“The message from this study is that before you implement something like this, to make decisions that affect life, you need to test it rigorously in a controlled trial, where you make sure the benefits outweigh the harms,” Mafi said.
Both Mafi and Ramaswamy said they have seen several of their own patients use AI for medical questions.
Ramaswamy said people may turn to AI for health advice because it is easily accessible and there is no limit on the number of questions a person can ask.
“You can go through every question, every detail, every document you want to upload,” Ramaswamy said. “And it fills that need. People really want not only medical advice, but also a partner, like a medical therapist.”
OpenAI said in a January report that the majority of ChatGPT health-related messages occur outside of a doctor’s normal work hours, and more than half a million weekly messages come from people who live 30 minutes or more from a hospital.
“A doctor might spend 15 or 20 minutes in the room with you,” Ramaswamy said. “They will not be able to address and answer each and every question.”
Risks of using a chatbot for medical advice
Despite the benefit of their constant availability, Ramaswamy said no when asked whether chatbots can currently provide medical and health advice safely.
Dr. Ethan Goh, chief executive of ARISE, an AI research network, said that in many cases AI can provide safe medical and health advice, but that it is not a substitute for the advice of a doctor.
“The reality is that chatbots can be useful for a lot of things. It’s really more about being thoughtful and deliberate and understanding that it also has serious limitations,” he said.
Monica Agrawal, an assistant professor in the department of biostatistics and bioinformatics and the department of computer science at Duke University, said it is largely unknown how AI models are trained and what data is used to train them.
She said some training benchmarks may not indicate a chatbot’s potential to help.
“A lot of the previous evaluations (of OpenAI) were based on: ‘We do well on a licensing exam,’” she said. “But there is a big difference between doing well on a medical exam and practicing medicine.”
She added that when people use chatbots, the information users provide is not always clear and may contain bias.
“Large language models are known for being sycophantic,” she said. “Which means they tend to agree with the opinions expressed by the user, even if they may not be correct. And this can reinforce patients’ misconceptions or biases.”
Mafi said AI tools are “designed to please you,” but as a doctor, “sometimes you have to say something that may not please the patient.”
Ramaswamy said AI should not be relied on in an emergency and that using it in conjunction with a doctor is key to preventing harm. He said collaborations between technology and healthcare companies are important for creating safer AI products.
“If these models get better and better, I can see the benefits of a patient-AI-doctor relationship, especially in rural settings or in global health areas,” he said.