ChatGPT in medicine: STAT answers readers’ burning questions about AI

Artificial intelligence is often described as a black box: an unknowable, mysterious force that operates inside the critical world of health care. If it’s hard for experts to wrap their heads around at times, it’s almost impossible for patients or the general public to grasp.

While AI-powered tools like ChatGPT are swiftly gaining steam in medicine, patients rarely have any say — or even any insight — into how these powerful technologies are being used in their own care.


To get a handle on the most pressing concerns among patients, STAT asked our readers what they most wanted to know about generative AI’s use in medicine. Their submissions ranged from fundamental questions about how the technology works to concerns about bias and error creeping further into our health systems.

It’s clear that the potential of large language models, which are trained on massive amounts of data and can generate answers to myriad prompts, is vast. It goes beyond ChatGPT and the ability for humans and AI to talk to each other. AI tools can help doctors predict medical harm on a broader scale, leading to better patient outcomes. They’re currently being used for medical note-taking, and analysis of X-rays and mammograms. Health tech companies are eager to tout their AI-powered algorithms at every turn.

But the potential for harm is equally vast as long as AI tools go unregulated. Inaccurate, biased training data deepen health disparities. Algorithms not properly vetted deliver incorrect information on patients in critical condition. And insurers use AI algorithms to cut off care for patients before they’re fully recovered.


When it comes to generative artificial intelligence, there are certainly more questions than answers right now. STAT asked experts in the field to tackle some of our readers’ thoughtful questions, revealing the good, the bad, and the ugly sides of AI.

As a patient, how can I best avoid any product, service or company using generative AI? I want absolutely nothing to do with it. Is my quest to avoid it hopeless? 

Experts agreed that avoiding generative AI entirely would be very, very difficult. At the moment, there aren’t laws governing how it’s used, nor explicit regulations forcing health companies to disclose that they’re using it.

“Without being too alarmist, the window where everyone has the ability to completely avoid this technology is likely closing,” John Kirchenbauer, a Ph.D. student researching machine learning and natural language processing at the University of Maryland, told STAT. Companies are already exploring using generative AI to handle simple customer service requests or frequently asked questions, and health providers are likely looking to the technology to automate some communication with patients, said Cobun Zweifel-Keegan, managing director of the International Association of Privacy Professionals.

But there are steps patients can take to at least ensure they’re informed when providers or insurers are using it.

Despite a lack of clear limits on the use of generative AI, regulatory agencies like the Federal Trade Commission “will not look kindly if patients are surprised by the use of automated systems,” so providers will likely start proactively disclosing if they’re incorporating generative AI into their messaging systems, Zweifel-Keegan said.

“If you have concerns about generative AI, look out for these disclosures and always feel empowered to ask questions of your provider,” Zweifel-Keegan said, adding that patients can report any concerning practices to their state attorney general, the FTC and the Department of Health and Human Services.

If a health system hasn’t made any such disclosure, patients can still ask them how they are or are not using generative AI. “In my opinion, the best path forward is to support and advocate for policies that mandate disclosure of automated systems both within your health care institution and from elected officials,” said Kellie Owens, assistant professor of medical ethics at NYU’s Grossman School of Medicine.

I’ve heard about some of the pitfalls with gender bias in generative AI models like ChatGPT. How can health care providers ensure this won’t be the case, especially as it pertains to women’s health and its outcomes?

It’s impossible to totally eradicate bias in generative AI models today — especially since models are often trained on datasets that reflect historical bias themselves, experts said. If they’re developed and trained in the so-called black box, it’s hard for researchers and external auditors to catch these biases.

“If women’s health care is underrepresented in the medical literature in terms of diagnoses, outcomes, etc. then it could very possibly be a domain in which the reliability of generative AI models will lag far behind other areas,” Kirchenbauer said.

So while patients and providers should expect models to be skewed, there are steps they can take to better understand — and potentially counteract — that bias, experts said.

Before buying or adopting generative AI, providers should ask vendors for detailed reports on whether they’ve pressure-tested their systems for privacy, security and bias, and whether they’ve performed independent audits. “In short, if answers are not satisfactory, avoid the system,” Zweifel-Keegan said.

Health organizations must also recognize the technology’s limitations, Zweifel-Keegan added. “General-purpose generative AI is not designed to provide factual answers to queries. It is trained to write plausible sentences, which only sometimes happen to be accurate.”

They can also work on correcting disparities in health record information so that AI systems have a more representative training dataset, including by using “model cards” to explain an automated system, what data it was trained on, and scenarios in which it might perform poorly, said Owens.

They can also open the models up for rigorous external audit by researchers, said Rory Mir, the Electronic Frontier Foundation’s associate director of community organizing. “[T]he only way to address bias is opening the process at every step — from open data sets, to transparent data preparation, to the training models themselves,” Mir said. “There are no ‘moats’ in AI, where one company can protect the public from their own creation. This needs to be a collaborative and global research process.”

Health systems may be best positioned to interrogate their vendors about the models they’re buying or considering — and while patients can push them to ask those tough questions, in the absence of clear regulations surrounding generative AI, providers will need to assure them they are routinely performing these audits.

Medical records are notorious for containing errors and misdiagnoses. If generative AI is trained on error-prone records, how will that impact its output? 

Poor data means poor output. As computer scientists say: garbage in, garbage out. Unfortunately, we don’t yet know a lot about the specific ways error-riddled data affects the performance of AI, and the resulting patient outcomes. It’s an area that desperately needs more research.

“Because we cannot presume to know all of the ways that errors may emerge, we must comprehensively study the effects of these tools,” Owens said. “Just as we have clear guidelines for testing the efficacy of drugs in clinical trials, we need to build clear standards for assessing the efficacy and risks of generative AI in health care.”

Kirchenbauer noted that researchers still aren’t certain how often AI simply regurgitates outputs from its training data as opposed to coming up with novel answers. So measuring how errors manifest is difficult.

Still, even if experts don’t completely understand the harms of error-filled data, companies can and should put up guardrails. Marinka Zitnik, a biomedical informatics professor at Harvard, pointed to the need for human experts to validate and review the algorithms’ output.

On that note, with AI being used more in medical transcription, do you expect that subtle errors might make their way into medical records? 

Yes. AI transcription tools are by no means perfect. They may mishear certain complex medical terms or nuanced conversations. That’s why it’s important to have humans checking through medical records.

Unfortunately, people tend to believe that software is always accurate. This is called “automation bias,” in other words, an over-reliance on automated systems. Marzyeh Ghassemi, a computer science professor at MIT, cautioned against this.

“Our past work has shown that humans are poor judges of model error, even when they think that they aren’t,” Ghassemi said. “We never want clinicians to turn off their critical thinking skills.”

Will utilization of generative AI become a specialty on its own? Or will it be accessible and flexible enough to answer anybody’s inquiries, even those without experience?

Ideally, everybody will be able to use it, regardless of familiarity with the technology. The interface should be simple and straightforward; the path to gleaning useful information should be clear.

We’re not in that reality yet. Even preeminent AI experts are still uncovering how exactly the models work. It’s very possible that AI-whisperer becomes a legitimate specialty. Some researchers STAT spoke with pointed to an emerging field of “prompt engineering,” where researchers give AI models different prompts and compare their outputs.

“I could imagine a scenario similar to the way we currently use search engines — they are useful (if often problematic/biased) for everyone, and at the same time we have entire expert communities of search engine optimization that use these tools quite differently,” Owens said.

Ghassemi said the ideal user of generative AI will depend on the setting. In a hospital setting, for example, only a subject matter expert will be able to pick up on key errors in the AI’s output.

“The more interesting case is for rarer settings, where you have to be an expert in the topic area in order to discern that the model is making subtle, but important, errors,” Ghassemi said. “In these cases, a subject expert can just re-phrase or re-ask the question, [such as], ‘that’s not right, can you try again?’ But others cannot and will not.”

Can I trust that generative AI will give me a good summary of research in fields I’m not knowledgeable in? 

Not entirely. A tool like ChatGPT will provide a helpful starting point, but it does tend to “hallucinate,” or make things up. This is because it’s built to predict likely responses, not to assess what is true or false. That means it’s essential to always verify the information ChatGPT provides.

“Generative AI may provide a nice overall summary of a topic, but it would be unwise to assume that any one piece of information is true,” Owens said.

It also may not be up to date on recent advancements in a particular medical field (right now, ChatGPT only pulls information through 2021), and it may perpetuate biases present in the training data.

“If you’re looking for an overview of something that’s well-established and not contentious, e.g., how gravity works, it will likely do really well,” Ghassemi added. “However, anything that is more niche, or that requires digestion of many papers with conflicting results, is likely to require expert evaluation.”

I’m concerned about advancements in generative AI in health care being made without a strong foundation for database integrity, representation of diverse populations, cybersecurity, regulation, and privacy. What meaningful efforts are being made to establish an effective framework for ethics, representation, and regulation?

“I share the same concerns,” said Owens, the NYU ethicist. She pointed out that the Coalition for Health AI recently published a “Blueprint for Trustworthy AI Implementation Guidance and Assurance for Healthcare,” which outlines ways to implement trustworthy AI frameworks, including establishing assurance labs to which tool developers and health systems can submit AI algorithms.

Those assurance labs could also build a registry that’s like a “ for AI tools” and set AI validation standards. “Making these efforts truly meaningful will require clear ideas about who is responsible for ensuring these frameworks are implemented, and consequences for non-compliance,” she said.

Federal regulation of AI tools is top of mind for many; at a Senate subcommittee hearing last week, Sam Altman, the CEO of the company behind ChatGPT, said that to mitigate the risks of increasingly powerful AI models, the U.S. government could consider a combination of testing and licensing requirements for AI models above a certain threshold of capabilities.

However, it’s a myth that “Congress has yet to propose legislation ‘to protect individuals or thwart the development of A.I.’s potentially dangerous aspects,’” wrote Anna Lenhart, a policy fellow at George Washington University, in her introduction to a compilation of legislative proposals on AI that haven’t yet made it to President Biden’s desk. Many people have overlooked these existing proposals, she said, which include establishing a new agency that would oversee digital platforms or data and privacy.

If ChatGPT is used to write a report, can there be a watermark that indicates the true author, ChatGPT? Patients should be made aware when ChatGPT or other AI are used to write notes in their medical record. This is a patient safety issue.

“While today, to the best of our knowledge, OpenAI has not introduced a watermark into any of their models (i.e. ChatGPT, GPT-4), there are no technical barriers to them doing so,” said Kirchenbauer, who developed a watermarking technique with his colleagues at the University of Maryland.

“Watermarking leverages the fact that often, there are many different ways to write the same thing and our procedure subtly biases the generative model to choose one way of writing it and our detection algorithm harnesses this fact to check for whether or not these soft rules were followed a statistically surprising amount of the time,” he said.
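
For readers who want to see the idea in miniature, here is a toy sketch in Python. It is an illustration only, not the University of Maryland team's code: the vocabulary, the `GAMMA` value, and the function names `is_green`, `generate` and `detect` are all made up for this example, and a random word picker stands in for a real language model. The "soft rule" is that each word should land on a pseudo-random "green" list determined by the word before it; the detector counts how often that happened and asks how surprising the count would be by chance.

```python
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(50)]  # hypothetical toy vocabulary
GAMMA = 0.5  # assumed share of words treated as "green" at each step


def is_green(prev_token: str, token: str) -> bool:
    """Pseudo-randomly put `token` on the green list, keyed on the previous token."""
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] / 256 < GAMMA


def generate(length: int, rng: random.Random, watermark: bool) -> list[str]:
    """Stand-in for a language model: picks words at random, optionally preferring green ones."""
    tokens = ["<start>"]
    for _ in range(length):
        # Pretend these eight words are all equally reasonable ways to continue.
        candidates = rng.sample(VOCAB, 8)
        if watermark:
            green = [t for t in candidates if is_green(tokens[-1], t)]
            candidates = green or candidates  # soft rule: fall back if nothing is green
        tokens.append(rng.choice(candidates))
    return tokens


def detect(tokens: list[str]) -> tuple[float, float]:
    """Return (z-score, one-sided p-value) under the assumption of no watermark."""
    n = len(tokens) - 1
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    z = (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    p = 0.5 * math.erfc(z / math.sqrt(2))  # normal approximation to the binomial tail
    return z, p


rng = random.Random(0)
marked = generate(200, rng, watermark=True)
plain = generate(200, rng, watermark=False)
print("watermarked:", detect(marked))
print("unmarked:   ", detect(plain))
```

In this sketch, text written with the rule switched on should score far above what chance allows, while ordinary text should hover near zero. The detector needs only the text and the green-list rule, not the generator itself.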

While there are ways to evaluate text and detect whether it was generated with AI, such as the algorithms that plagiarism-checking programs like Turnitin use, “these techniques are generally far less reliable than watermarking, and also do not produce interpretable statistical confidence estimates (p-values) like watermarking detection does,” said Kirchenbauer.

There’s also a question of whether the attribution would stick if someone cut and pasted text from a tool like ChatGPT. Kirchenbauer and his colleagues have also examined if mixing watermarked and non-watermarked text could ruin the detection, if other AI generators could scrub away the watermark, and if humans could paraphrase enough of the text to make the watermark unrecognizable. The results of that work, to be published in a forthcoming study, showed that the watermark can still be detected, albeit at reduced levels, in those situations.

Watermarks could be used to scan a document like a medical record and report a statistical estimate as to whether that text was generated by a model that used a specific watermark. “That said, a key thing to note here is that watermarking is a ‘proactive’ technique, and must be deployed by the model owner,” said Kirchenbauer, meaning that OpenAI or the other AI toolmakers would have to commit to using a watermarking technique for this to be effective.

This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.

Source: STAT