Making sense of AI research in medicine, in one slide

BOSTON — Scientific journals have become something of a Mad Libs game for GPT: Artificial intelligence can now detect _____, or speedily tell the difference between _____ and _____.  But which of these studies are actually important? How can clinicians sort them out from one another?

At a recent AI conference, Atman Health chief medical officer and Brigham and Women’s associate physician Rahul Deo boiled the issue down in a single slide: the riskiest, most impactful studies draw far less attention these days than the rest of the research.


The highest-impact AI models would be those that figure out how to replace the most complex physician tasks with automation. Those are followed by studies that take steps toward that goal, including models that predict patient risk, clinical decision support models, and language models that automate rote office tasks. And then, at the bottom, is the “everything else” category: studies that might seem impressive, but don’t actually move the needle.

“I think the field did [need] a little provocation, otherwise you just find yourself building what I consider to be the stuff on the bottom, which are just things to impress your peers with different flashy papers,” he told STAT, with a bit of a laugh.

“I think there is a tendency towards [doing] some stuff in a bubble that never has the ability to get out and actually impact anything,” he added.


Here’s Deo’s hierarchy of research in medical AI, ranked from the riskiest and most impactful to the least.

Moving complex provider tasks to automated systems

Replacing the key work that doctors do with machine learning-powered labor would be a massive shift in medicine — it could potentially bring down health care costs, or widen access to care in areas where doctors are in short supply. “Of course, the risk of that is massive,” which is why it would need to be heavily regulated by the Food and Drug Administration, he said. Also, “that the model better be really good,” he added; trusting a machine learning model to output and execute a medical decision without a person standing in the way is a “complete high-stakes game.”

If an AI replaced human doctors, health care could become far more scalable, bringing care to people who aren’t currently receiving it. That’s the idea behind Martin Shkreli’s Dr. Gupta.AI, tests of ChatGPT on U.S. medical licensure exam questions, and many other trials of language models in medicine. But right now, there are big hurdles to that dream, especially in areas such as replicating human empathy and reading human cues like body language. A doctor might pick up on those signals and rephrase a question or give additional personalized context, which is difficult for an AI to do without being prompted.

Current AI language models are also largely incapable of reasoning and logic. “They don’t have a first-principles model that underlies what’s going on, for the most part, so they could do something that you could think, ‘a medical student would never make that mistake,’” said Deo. “That’s the challenge with at least the models and the architecture that’s there right now, but it may always be — to some degree — there is that risk.”

There aren’t good examples of studies in this area yet, mostly because there are many technical hurdles to overcome before it’s possible. But while empathy, reasoning, and decision-making are very difficult, Deo believes that once the AI is receiving reliable input data, everything else in medicine is eminently doable — “Most things are either algorithmic or you wish they were,” he said.

Rapid iterative learning of optimal care approaches

Training an AI to do a doctor’s decision-making would require an enormous amount of data — which right now, largely doesn’t exist, according to Deo.

“If you look at most of the evidence in most fields — and cardiology is probably one of the best ones — a huge amount of it is just expert opinion. There’s like no data,” said Deo, “because it’s very, very expensive to acquire in the setting of randomized clinical trials.” In other fields with large, reliable data sets  — like language, or in an app like Uber — an AI can systematically learn from data streams, but health outcomes data is much harder to obtain and much harder to train on.

“Why does this group not do so well compared to this group?” asked Deo. “There’s probably hundreds of thousands of questions like that.” But at the pace at which we can conduct clinical trials, “it’s going to take a thousand years to be able to get to that point,” he said.

Even when clinical trials of disease treatments are conducted, there are data gaps. It’s sometimes uncertain what exactly caused a particular outcome because trials, like large-scale scrapes of medical records, collect data on only a limited number of variables. That makes it very hard to say why one subgroup of patients fared worse than another: the cause could be a biomarker that wasn’t measured, or a social determinant of health, like where someone lives or what access they have to food and transportation. The AI can learn only from the input and the output, not from any inferred causes in the middle, which is what has led to bias in AI algorithms in the past.

Training an AI to make good decisions on current health outcomes data is “incredibly difficult, if not impossible, because it’s biased,” said Deo; “it’s missing this [and that]; this missing-ness is biased; all of these sort of statistical nightmares that make it very, very complex.” Without studies on ways to fill this gap, AI doctors will be missing a key part of their “med school” curriculum.

Rahul Deo’s outline of different categories of AI models and where they fit in the medical AI field, presented at the 2023 MIT-MGB AI Cures conference in Cambridge, Mass. Courtesy Rahul Deo

Another set of eyes: pre-reading, over-reading diagnostic studies

Studies of clinical decision support algorithms fall into Deo’s next category. These AI tools don’t make decisions themselves, but can act as an extra pair of eyes. They have already gained traction in the clinic, including computer-assisted mammography that is now used in patient care.

“[It’s] this kind of idea that the doctor will still be doing exactly what they’ve always been doing. But we have a machine that’s there to say, ‘Buddy, you might have missed this,’ or ‘Hey, you may want to look at this one before that one,’” said Deo.

The pathway for reimbursement for these tools isn’t clear. Additionally, if the model is a black box, and the physician can’t exactly see what the algorithm is picking up on, it’s hard to trust the algorithm. While time might be saved and outcomes might be better, there are added financial and liability costs to implementing these kinds of algorithms in the real world, as was revealed with Epic’s controversial sepsis algorithm. “Less risk, but maybe less overall benefit,” as Deo put it.

Novel markers of risk

A side category of clinical decision support algorithms consists of models that point out which people might be at higher risk for something, often using wearables or other devices to collect digital clues about a person’s risk.

However, the direct-to-consumer nature of these technologies poses a big workflow problem that health care systems have yet to figure out: “Now you’re like, OK, ‘I go to my doctor and I tell them that my watch told me this’ or ‘My toilet seat told me this,’” said Deo. Studies on “digital biomarkers” and “hospital at home” monitoring using AI models might be innovative and require less inference than other sorts of AI studies. But they can’t impact the health care system if there’s no infrastructure for integrating this kind of data into traditional health systems.

Alleviating drudgery

With AI, doctors now have even more tools to eliminate work like answering patient portal messages. These “low-stakes” tasks don’t require much risk in exchange for time savings, but developing these capabilities doesn’t advance the state of AI. Still, these use cases for AI are popular, with tech companies like Microsoft and Epic getting into the business with pilot programs at large university health systems.

But with these new capabilities come questions about the line between a “low-stakes” and “high-stakes” activity. Patients might not mind an AI helping them schedule an appointment or reminding them what they can or cannot take with their medication, but hearing that an AI is writing medical visit notes is alarming to some people.

Tools to alleviate drudgery have been around for a long time, said Deo. Machines, for example, automatically calculate the width, axis, and angles of electrocardiograms and spit out all the associated statistics, work that nobody does by hand anymore. He pointed out that health systems are always defining acceptable amounts of risk in different areas: dictation contains errors, whether it relies on voice-to-text tools or on people to transcribe. And often, clinicians will ask if it’s OK for a scribe or a medical student to take a medical history, jot notes, or start an exam, and doctors don’t redo all the work the medical student did; they just pick the places that are the most important, and double-check those.

“There’s a lot of people who are floating around with medical expertise who are contributing to some of this stuff already, and I’m sure that not everything is verified verbatim,” said Deo. “People are choosing those lower-stakes places because they know the chance of adoption is greater because people are less worried about the type of liability that comes. But it’s not zero. It’s just less.”

Impressing journal editors, peer reviewers, study section members

At the very bottom of Deo’s risk-reward hierarchy are all of the rest of the AI studies. While people are excited to see AI being brought into their field and for it to be applied to familiar problems, the novelty wears off after a while, said Deo. Things start to fall apart for many of these models when the rubber hits the road. What are people using right now, and would they change what they’re doing? What’s the risk if the model had false positives, or false negatives? How will it meet FDA criteria? Who will pay for it? Does it save money?

Many of the models built for academic curiosity’s sake don’t consider these questions, and will perish because so many obstacles stand in the way of getting anything into clinical practice, said Deo.

“It’s just dull from an academic standpoint and the paper’s not going to get any loftier,” said Deo, which is “a basic challenge with how our research is funded and promotions are done, that this is not seen as being academic.”

Deo’s call to action is for researchers to choose things that, if they have a measurable impact on patient outcomes, can both integrate into the clinical workflow and be prospectively validated at partner institutions to prove that the model isn’t overfit to a specific population.

If researchers don’t start training expensive models with the idea of downstream validation in mind, “then that really doesn’t have a huge amount of value because it’s going to have to be redone from scratch,” said Deo. Without a way for others to use the model, “it becomes just at best proof-of-concept.”

This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.

Source: STAT