A study published in Nature Medicine on February 24 has raised concerns about whether ChatGPT Health, OpenAI’s consumer-facing health tool, reliably directs users to emergency care in serious medical cases.
The study, conducted by researchers at the Icahn School of Medicine at Mount Sinai, tested ChatGPT Health on 60 clinical scenarios across 21 medical specialties, ranging from minor conditions to genuine emergencies. Three independent physicians established the correct level of urgency for each case using guidelines from 56 medical societies. The scenarios were then tested under 16 different contextual conditions, including variations in race, gender, social dynamics, and barriers to care such as lack of insurance, producing 960 total interactions with ChatGPT Health.
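The arithmetic behind the study's design is a fully crossed grid: every scenario is paired with every contextual condition. A minimal sketch of that design, using hypothetical placeholder labels (the actual scenario and condition names are not published in this article), shows how 60 scenarios and 16 conditions yield 960 interactions:

```python
from itertools import product

# Hypothetical reconstruction of the study's fully crossed design:
# 60 clinical scenarios, each tested under 16 contextual conditions
# (e.g., variations in race, gender, insurance status). Labels are
# placeholders, not the study's actual case names.
scenarios = [f"scenario_{i:02d}" for i in range(1, 61)]
conditions = [f"condition_{j:02d}" for j in range(1, 17)]

# One model interaction per (scenario, condition) pair.
interactions = list(product(scenarios, conditions))
print(len(interactions))  # → 960
```

Crossing the factors this way lets the researchers attribute triage differences to the contextual condition rather than to the underlying clinical case.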
The results showed that ChatGPT Health failed to appropriately direct users to emergency care in more than half of serious medical cases. The tool under-triaged 52 percent of cases that physicians deemed true emergencies, directing patients with conditions such as diabetic ketoacidosis and impending respiratory failure toward a 24-to-48-hour evaluation instead of the emergency department. The system also misclassified 35 percent of non-urgent cases.
The study found that ChatGPT Health’s performance followed an “inverted U-shaped” pattern. “ChatGPT Health performed well in textbook emergencies such as stroke or severe allergic reactions,” said Dr. Ashwin Ramaswamy, one of the study’s corresponding authors. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.”
The study also exposed troubling inconsistencies in ChatGPT Health’s crisis intervention system. The tool was designed to direct users to the 988 Suicide and Crisis Lifeline in high-risk situations, but researchers found that these alerts appeared more reliably when users described no specific method of self-harm than when they articulated a concrete plan. Dr. Girish Nadkarni, Mount Sinai’s Chief AI Officer and the study’s other corresponding author, noted that the finding was “beyond inconsistency,” as “the system’s alerts were inverted relative to clinical risk.”
The study’s findings are particularly concerning given the rapid consumer adoption of ChatGPT Health. OpenAI launched the tool in January 2026, and the company reported that roughly 40 million people were using ChatGPT daily for health-related questions. The nonprofit patient safety organization ECRI ranked misuse of AI chatbots in healthcare as the top health technology hazard for 2026, warning that the tools “can provide false or misleading information that could result in significant patient harm.”
The Mount Sinai team found no statistically detectable effects from patient race, gender, or barriers to care on triage outcomes, although the study’s confidence intervals did not rule out clinically meaningful differences. The researchers plan to continue evaluating updated versions of ChatGPT Health and other consumer AI tools, with future research expanding into pediatric care, medication safety, and non-English-language use.