Why the early tests of ChatGPT in medicine miss the mark

ChatGPT has rocketed into health care like a medical prodigy. The artificial intelligence tool correctly answered more than 80% of board exam questions, showing an impressive depth of knowledge in a field that takes even elite students years to master.

But in the hype-heavy days that followed, experts at Stanford University began to ask the AI questions drawn from real situations in medicine — and got very different results. Almost 60% of its answers either disagreed with human specialists or provided information that wasn’t clearly relevant.

The discordance was unsurprising since the specialists’ answers were based on a review of patients’ electronic health records — a data source ChatGPT, whose knowledge is derived from the internet, has never seen. However, the results pointed to a bigger problem: The early testing of the model only examined its textbook knowledge, and not its ability to help doctors make faster, better decisions in real-life situations.


“We’re evaluating these technologies the wrong way,” said Nigam Shah, a professor of biomedical informatics at Stanford University who led the research. “What we should be asking and evaluating is the hybrid construct of the human plus this technology.”

The latest version of OpenAI’s large language model, known as GPT-4, is undeniably powerful, and a considerable improvement over prior versions. But data scientists and clinicians are urging caution in the rollout of such tools, and calling for more independent testing of their ability to reliably perform specific tasks in medicine.


“We still need to figure out what the evidence bar is to decide where they are useful and where they are not,” said Philip Payne, director of the informatics institute at Washington University in St. Louis. “We’re going to have to reassess what the definition of intelligence is in terms of these models.”

For tasks that involve summarizing large bodies of research and information, GPT-4 has demonstrated a high degree of competence. But it is unclear whether it can engage in tasks that require deeper critical thinking and help clinicians deliver care in messier circumstances, when information is often incomplete. “I don’t think we’ve demonstrated these models are going to solve for that,” Payne said.

For now, most experimental uses being pursued by health systems and private companies are focused on automating documentation tasks, such as filling out medical records or summarizing instructions provided to patients when they are discharged from the hospital.

While those uses are lower risk than using GPT to provide advice about treating a cancer patient, mistakes can still lead to patient harms, such as inflated bills or missed follow-up care if a discharge note is summarized incorrectly.

“We shouldn’t feel reassured by claims that these tools are only intended to help physicians” with administrative tasks, said Mark Sendak, a clinical data scientist at Duke University’s Institute for Health Innovation. He said GPT’s performance on “back of house” tasks for billing, communications, and hospital operations should also be carefully evaluated, but he is doubtful that such evaluations will be carried out consistently.

“One of the challenges is that the speed at which industry moves is faster than we can move to equip health systems,” Sendak said.

Stanford’s study was designed to evaluate the ability of GPT-4 and its predecessor model to deliver expert advice to doctors on questions that arose in the course of treating patients at Stanford Health Care. Researchers drilled the model with 64 clinical questions — such as differences in blood glucose levels following use of certain pain medicines — that had previously been assessed by a team of experts at Stanford. The AI model’s responses were then evaluated by 12 doctors who assessed whether its answers were safe and agreed with those provided by Stanford’s experts.

In more than 90% of the cases, GPT-4’s responses were deemed safe, meaning they were not so incorrect as to possibly cause harm; among the responses deemed harmful, some included hallucinated citations. Overall, about 40% of its answers agreed with the clinical experts, according to preliminary results that have not been peer-reviewed. For about a quarter of the AI’s responses, the information was too general or tangential to determine whether it was in line with what physicians would have said.

Despite its struggles, GPT-4 performed much better than its prior version, GPT-3.5, which only agreed with the team of experts in 20% of the cases. “That’s a serious improvement in the technology’s capability — I was blown away,” said Shah.

At the rate of its improvement, Shah said, the model will soon be able to replace services designed to aid clinicians by performing manual reviews of medical literature. That might eventually help doctors working in contexts like tumor boards, where physicians review records and literature to determine how to treat cancer patients. To get there, Shah said, GPT should be tested on exactly that task in a controlled experiment comparing a GPT-guided tumor board with one following a standard process.

“Then you track whether they reach consensus faster, does their throughput go up,” Shah said. “And if throughput goes up, does the quality of their decisions get better, worse, or the same?”

This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.

Source: STAT