
Health care companies are racing to incorporate generative AI tools into their product pipelines and IT systems after the technology displayed an ability to perform many tasks faster, cheaper — and sometimes better — than humans.
But the rush to harness the power of so-called large language models, which are trained on vast troves of data, is outpacing efforts to assess their value. AI experts are still trying to understand, and explain, how and why they work better than prior systems, and what blind spots might undermine their usefulness in medicine.
It remains unclear, for example, how well these models will perform, and what privacy and ethical quandaries will arise, when they’re exposed to new types of data, such as genetic sequences, CT scans, and electronic health records. Even knowing exactly how much data must be fed into a model to achieve peak performance on a given task is still largely guesswork.
“We have no satisfying mathematical or theoretical account of why precisely these models have to be as big as they are,” said Zachary Lipton, a professor of computer science at Carnegie Mellon University. “Why do they seem to get better as we increase them from millions of parameters to half a trillion parameters? These are all wildly open technical questions.”
STAT reporters put such questions to AI experts to help explain the history and underpinnings of large language models and other forms of generative AI, which is designed to produce answers in response to a prompt. How accurate those responses are depends, in large part, on the data used to train them. STAT also asked experts to debunk the many misconceptions swirling around these systems as health care companies seek to apply them to new tasks. Here’s what they think you should know before betting a patient’s health, or hope of profit, on the first impressions of ChatGPT.
What generative AI models are actually doing when they produce an answer
In short, they’re doing math.
More precisely, they are performing the same kind of auto-complete that has been built into email programs and automated translation tools for many years.
“The AI is identifying and reproducing patterns,” University of Michigan computer scientists Jenna Wiens and Trenton Chang wrote in response to STAT’s questions. “Many generative models for text are, at the core, based on predicting the probability that each word comes next, using probability as a proxy for how ‘reasonable’ an answer is.”
Heather Lane, senior architect of the data science team at athenahealth, told STAT “it’s sort of like it’s playing a big, complex game of ‘Mad Libs’ or a crossword puzzle — by looking at a few words and hints, it’s picking words that are statistically likely to go with them, but without a ‘real understanding’ of what it’s doing.” The AI models form their idea of what’s “statistically likely” from vast amounts of data (including Wikipedia, Reddit, books, and the rest of the internet), and learn what “looks good” from rounds of human feedback on their answers.
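For readers who want a concrete picture, the toy sketch below shows, in miniature, what “predicting the next word” means: the model scores each candidate continuation, converts the scores into probabilities, and picks a likely word. The prompt, vocabulary, and scores here are invented for illustration; real models compute these scores with billions of learned parameters.

```python
# Toy illustration of next-word prediction: score each candidate word,
# turn the scores into probabilities, and pick a likely continuation.
# The vocabulary and scores are made up for illustration only.
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

prompt = "The patient was prescribed"
candidates = ["antibiotics", "a", "rest", "yesterday"]
raw_scores = [2.1, 1.3, 0.4, -1.0]   # invented model outputs ("logits")

probs = softmax(raw_scores)
for word, p in sorted(zip(candidates, probs), key=lambda x: -x[1]):
    print(f"P({word!r} | {prompt!r}) = {p:.2f}")

# Greedy decoding: append the single most likely word.
next_word = candidates[probs.index(max(probs))]
print(prompt, next_word)
```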
That’s a far cry from the way humans think, and is certainly much less efficient and more limited than the reasoning systems that define how our brains process information and solve problems. If you think large language models are getting anywhere close to artificial general intelligence — the holy grail of AI research — you are misinformed.
How they got so much better than prior versions of generative AI
It’s mostly because they were trained on far more data than previous versions of generative AI, but several factors have converged in recent years to create the powerful models we have today.
“When talking about starting a fire, you need oxygen, fuel, and heat,” Elliot Bolton, a research engineer at Stanford who works on generative AI, told STAT in an email. Likewise, in the last few years, the development of a technology called “transformers” (the “T” in “GPT”), combined with huge models trained on huge amounts of data with a huge amount of computing power, has produced the impressive results we see today.
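The transformer piece can itself be sketched in a few lines. The example below is a bare-bones illustration of scaled dot-product attention, the core operation that lets every word in a sequence weigh information from every other word; the shapes and random numbers are arbitrary, and real models stack many such layers with learned weights.

```python
# Minimal sketch of scaled dot-product attention, the core operation in
# transformer models (the "T" in "GPT"). Shapes and values are arbitrary.
import numpy as np

def attention(Q, K, V):
    """Each position mixes information from every position, weighted by
    how well its query matches the other positions' keys."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                                        # weighted average of values

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                                       # 4 tokens, 8-dim embeddings
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))
print(attention(Q, K, V).shape)                               # (4, 8)
```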
“People forget it was only 12 years ago that if someone trained (an AI) on all of Wikipedia, this was a breathtakingly large study,” said Lipton. “But now when people train a language model, they train it on all the text on the internet, or something like that.”
Because models like OpenAI’s GPT-4 and Google’s PaLM 2 have been trained on so much data, they are more readily able to recognize and reproduce patterns. Still, their fluidity in generating complicated outputs — such as songs and snippets of computer code — was surprising to AI researchers who didn’t expect such a huge leap from completing text messages to writing essays on late 19th century Impressionism.
“It turns out that these larger models, trained with massively more computational resources, on way, way, way more data, have these remarkable abilities,” Lipton said. The models can also be updated with new or different forms of data and built into existing products, such as Microsoft’s Bing search engine.
They might seem smart, but they’re far from intelligent
Even though language models are learning language in a manner roughly akin to how a toddler does, said Lane, these models need way more training data than a child does. They also fail on spatial reasoning and math tasks because their language capabilities aren’t rooted in any understanding of the world or causality.
“It’s very easy to make the models look silly,” Lipton, of Carnegie Mellon, added. “They are ultimately text processing engines. They don’t know that there is a world that the text references.”
But as more people begin to use them, he said, there are a lot of unknowns about how they will affect human intelligence, especially as more people lean on them to perform tasks they used to struggle through on their own, like writing or summarizing information.
“My biggest fear,” he said, “is that they will somehow stunt us so that we cease to be as creative as we are.”
There are ways to address ChatGPT’s problem of making things up
Because these generative AI models are just predicting text that is both likely and convincing, the models don’t have any basis for understanding what is true and false.
“It does not know that it’s lying to you, because it fundamentally doesn’t know the difference between the truth and a lie,” said Lane. “This is no different than dealing with a human being who’s incredibly charming and who sounds very convincing, but whose words literally have no tie to reality.”
That’s why it’s important to ask a few simple questions before using a model for a specific task: Who built it? Did they train it with data likely to contain relevant and reliable information for the intended use? If questionable sources are baked in, what biases and misinformation might result?
This is a particularly important exercise in health care, where inaccurate information can produce a whole host of negative outcomes. “I don’t want my doctor trained on Reddit, I don’t know about you,” said Nigam Shah, a professor of biomedical informatics at Stanford.
That doesn’t mean it’s impossible to improve the accuracy of models whose training may have included biased or false information. The builders of generative AI systems can use a technique known as reinforcement learning from human feedback, in which human reviewers judge the model’s replies so it learns which responses are more accurate and useful.
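In miniature, that feedback step can look like the toy sketch below: a simple reward model is trained to score a human-preferred reply above a rejected one, a pairwise objective similar in spirit to what production systems use. The features and comparisons are invented for illustration; real systems score full text with large neural networks and then tune the language model against those scores.

```python
# Toy sketch of the "human feedback" step: a reward model learns to score
# the reply a human preferred higher than the one they rejected, using a
# pairwise (Bradley-Terry-style) objective. Features and data are invented.
import numpy as np

w = np.zeros(3)  # reward model weights over 3 made-up reply features

# Each example: (features of preferred reply, features of rejected reply).
comparisons = [
    (np.array([0.9, 0.2, 0.1]), np.array([0.3, 0.8, 0.5])),
    (np.array([0.7, 0.1, 0.3]), np.array([0.2, 0.9, 0.6])),
]

lr = 0.5
for _ in range(200):
    for preferred, rejected in comparisons:
        margin = w @ preferred - w @ rejected
        # Gradient of -log(sigmoid(margin)): push the preferred reply's
        # score up and the rejected reply's score down.
        grad = -(1 - 1 / (1 + np.exp(-margin))) * (preferred - rejected)
        w -= lr * grad

print("learned reward weights:", w.round(2))
```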
That technique was used in building GPT-4, but the model’s maker, OpenAI, has not disclosed what data were used to train it. Google has created a large language model known as Med-PaLM 2 that is trained on medical information to make it more relevant for health care-related uses.
“As generative AI models advance, it is likely the ‘hallucinations’ will decrease,” said Ron Kim, senior vice president of IT Architecture at Merck.
Doomsday probably won’t happen, but guardrails are necessary
The hype around ChatGPT has given rise to renewed concerns about AI stealing everyone’s jobs or somehow running wild.
But many researchers in the field strike a much more optimistic tone when it comes to the technology and its potential in health care. Thomas Fuchs, who chairs the Department of Artificial Intelligence and Human Health at Mount Sinai in New York, said that in the broadest sense doomsday scenarios are “extremely unlikely,” and fearful speculation isn’t a reason to impede the potential of artificial intelligence to democratize access to high-quality care, develop better drugs, ease the burden on a limited supply of physicians, and more.
“In health care, patients today are dying not because of AI, but because of the lack of AI,” he said.
Though there have been many examples of algorithms being used in health care inappropriately, experts hope that with the proper guardrails, GPTs can be used responsibly. There aren’t regulations specific to generative AI just yet, but there’s a growing movement pushing for rules.
“We’re going to have to, at least at this stage, enumerate use cases…where it is reasonable and low risk to use generative AI for a specific purpose,” said John Halamka, the president of Mayo Clinic Platform who also co-leads the Coalition for Health AI, which has discussed what guardrails might be appropriate. He said that while GPT-based tools might be good at helping draft an insurance denial appeal letter or at helping a non-native English speaker clean up a scientific paper, other use cases should be off limits.
“Things like asking [generative AI] to do a clinical summary or to provide a doctor with assistance to diagnosis, those would not be use cases we would probably choose today,” he said.
But as the technology improves — and is more capable of such tasks — humans will have to decide whether relying on AI too much will impair their abilities to think through problems and write their own answers.
“What if it turns out that what we really needed was smart people agonizing over what they meant to say,” Lipton said. “And not just letting GPT-4 infill something that someone might have plausibly said in the past?”
This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.