In public, hospitals rave about artificial intelligence. They trumpet the technology in press releases, plaster its use on billboards, and sprinkle AI into speeches touting its ability to detect diseases earlier and make health care faster, better, and cheaper.
But on the front lines, the hype is smashing into a starkly different reality.
Caregivers complain AI models are unreliable and of limited value. Tools designed to warn of impending illnesses are inconsistent and sometimes difficult to interpret. Even evaluating them for accuracy, and susceptibility to bias, is still an unsettled science.
A new report aims to drag these tensions into the open through interviews with physicians and data scientists struggling to implement AI tools in health care organizations nationwide. Their unvarnished reviews, compiled by researchers at Duke University, reveal a yawning gap between the marketing of AI and the months, sometimes years, of toil it takes to get the technology to work the right way in the real world.
“I don’t think we even really have a great understanding of how to measure an algorithm’s performance, let alone its performance across different race and ethnic groups,” one interview subject told researchers. The interviews, kept anonymous to allow for candor, were conducted at a dozen health care organizations, including insurers and large academic hospitals such as Mayo Clinic, Kaiser Permanente, New York Presbyterian, and University of California San Francisco.
The research team, dubbed the Health AI Partnership, has leveraged the findings to build an online guide to help health systems overcome implementation barriers that most organizations now stumble through alone. It’s a desperately needed service at a time when adoption of AI for decision-making in medicine is outpacing efforts to oversee its use. (The Health AI Partnership’s research is funded by the Gordon and Betty Moore Foundation, which also supports STAT’s reporting on AI in health care.)
“We need a safe space where people can come and discuss these problems openly,” said Suresh Balu, an associate dean of innovation at Duke’s medical school who helped lead the research. “We wanted to create something that was simple and effective to help put AI into practice.”
The challenges uncovered by the project point to a dawning realization about AI’s use in health care: building the algorithm is the easiest part of the work. The real difficulty lies in figuring out how to incorporate the technology into the daily routines of doctors and nurses, and the complicated care-delivery and technical systems that surround them. AI must be finely tuned to those environments and evaluated within them, so that its benefits and costs can be clearly understood and compared.
As it stands, health systems are not set up to do that work — at least not across the board. Many are hiring more data scientists and engineers. But those specialists often work in self-contained units that help build or buy AI models and then struggle behind the scenes to keep them working properly.
“Each health system is kind of inventing this on their own,” said Michael Draugelis, a data scientist at Hackensack Meridian Health System in New Jersey. He noted that the problems are not just technical, but also legal and ethical, requiring a broad group of experts to help address them.
The Health AI Partnership’s findings highlight the need for a more systematic approach to that work, especially amid the rush to harness the power of large language models such as ChatGPT. The process shouldn’t start with a press release, experts said, but with a deeper consideration of what problems AI can help solve in health care and how to surround them with effective oversight.
STAT spoke with data scientists, lawyers, bioethicists, and other experts from within the partnership about the biggest challenges that emerged during the research, and how they are attacking them on the front lines. Here’s a closer look.
“So, we’ll have situations where faculty will have a connection with a company and they’ll also have some leadership role in their department or division. And they will bring a new technology… and we only find out about it later that it hasn’t gone through the appropriate risk assessment.”
Many AI projects are undermined by ad hoc decision-making at the earliest stages, when hospitals are weighing whether to use an internally developed AI tool — or one built by an outside vendor — to improve some aspect of care or operations. A clinician’s financial or personal ties to a particular company can interfere with objective assessments of its benefits and risks.
And even when no such conflicts exist, it can still be difficult to tell whether a tool developed by an outside vendor will work within a given health system.
“Certain types of clinicians, like radiologists, do things their own way,” said Mark Lifson, an AI systems engineer at Mayo Clinic. “How are you going to find one solution that works for all of them?”
Many AI products, even if approved by the Food and Drug Administration, don’t come with detailed documentation that would help health systems assess whether they will work on their patients or within their IT systems, where data must flow easily between record-keeping software and AI models. Many vendors of commercially available AI systems do not disclose how their products were trained or the gender, age, and racial makeup of the testing data. In many cases, it is also unclear whether the data they employ will map to those routinely collected by health care providers.
Left with little clarity, most hospitals have defaulted to building their own AI tools whenever possible, so they can be reasonably assured of their reliability and safety. But they still struggle with the more fundamental question of whether a given problem is best solved by AI or a lower-tech intervention.
New York Presbyterian appointed a governance committee to review proposed projects and help determine when an AI tool seems like the best solution. “There are instances where software or simple decision logic can be a viable solution with greater transparency and less resource use than AI or machine learning,” said Ashley Beecy, a medical director of AI operations at New York Presbyterian Hospital. Hackensack Meridian has a similar setup, tasking a team of data scientists, engineers, and other specialists with vetting proposed projects and limiting the use of AI to situations where problems cannot be solved by simpler methods.
“We have lots of people asking us: ‘This is my idea. Can you please confirm [that regulatory approvals are not required]?’ …That’s definitely an area of lots and lots of interest, because there’s such a big jump in resources required, time, cost, if you are on the other side of the line, and you are regulated.”
Hospitals are ideal environments for building AI models to solve problems that arise in patient care. But of the more than 500 AI products that have been cleared by the FDA, none of the approvals went to health systems, which are more focused on patient care than on pushing AI tools through regulatory pipelines. Instead of submitting to that process, providers find ways to work around it, tweaking how they use AI models, and the guardrails around them, to avoid regulation.
But adoption of AI by large health systems is challenging regulatory lines that divide providers from device makers. Deploying AI across hospital networks that span multiple states increases risks of misuse or harm that are easier to prevent in smaller organizations. At a certain scale, the difference between a hospital and a commercial device company starts to become blurry. “There is a huge messy middle,” said Danny Tobey, a lawyer at DLA Piper who participated in the Health AI Partnership’s research. “That’s a regulatory gap, and I think we’re going to see new theories of health care regulation come out of that.”
The reason for the regulatory line in the first place was that the FDA did not want to meddle with doctors’ decision making. “But that was assuming in most cases that it’s one physician in a room with a patient,” said Keo Shaw, another lawyer at the firm who participated in the research. “But obviously with AI, you can make a lot of decisions. That can happen very quickly.”
None of that means health systems should be paralyzed by the gray area. Lifson said they can do their own risk analyses and put controls in place to limit the possibility of bad outcomes stemming from the use of AI. “This is reminiscent of what medical device development has had to deal with forever,” he said. “There’s always some new technology…challenging the status quo.”
Lifson’s team at Mayo uses the FDA’s existing medical device framework to develop checks on the use of AI, so that its application allows physicians to experiment without exposing patients to undue risks. In cases where AI systems are encroaching on the definition of regulated devices, hospitals spin out companies that are better poised to pursue FDA approvals, generate clinical evidence, and create quality controls. The goal of the Health AI Partnership, in part, is to provide a resource to help health systems step through that process.
“It gives you something you can point to that provides you with some examples and case studies,” Lifson said. “Even if it’s in text form, it’s still helpful to see there are other people thinking about the very same thing.”
“I don’t think we even really have a great understanding of how to measure an algorithm’s performance, let alone its performance across different race and ethnic groups. . . . There does need to be some infrastructure for defining what we mean by this algorithm works and assessing whether it works as well for group A and group B.”
The evaluation of AI systems often leaves hospitals with a limited understanding of how they will perform in live clinical situations, when data are messier, incomplete, and sometimes skewed by bias.
Most studies of AI systems rely on a series of statistical measures meant to gauge their accuracy in predicting which patients will experience a certain outcome, such as rapid deterioration or death.
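To make the idea concrete, here is a minimal sketch of the kind of group-by-group check the interview subject describes, comparing how often a model catches true cases (sensitivity) and correctly clears non-cases (specificity) for "group A" and "group B." The function name, the record layout, and the eight sample records are all made up for illustration; none of them come from the Health AI Partnership's guide or from any hospital's data.

```python
from collections import defaultdict


def rates_by_group(records):
    """Compute sensitivity and specificity separately for each patient group.

    `records` is an iterable of (group, predicted, actual) tuples, where
    `predicted` and `actual` are booleans. This layout is an assumption
    made for the example, not a standard format.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for group, predicted, actual in records:
        if actual:
            key = "tp" if predicted else "fn"
        else:
            key = "fp" if predicted else "tn"
        counts[group][key] += 1

    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        result[group] = {
            # Share of patients with the outcome that the model flagged.
            "sensitivity": c["tp"] / positives if positives else None,
            # Share of patients without the outcome that the model cleared.
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return result


# Invented records: (group, model flagged the patient, outcome occurred).
records = [
    ("A", True, True), ("A", True, True), ("A", False, True), ("A", False, False),
    ("B", True, True), ("B", False, True), ("B", False, True), ("B", True, False),
]
rates = rates_by_group(records)
```

In this toy data the model flags two of three true cases in group A but only one of three in group B, the sort of gap that a single overall accuracy figure would hide. Numbers like these describe only how well the predictions match outcomes, not whether care improved.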
But interviews conducted with caregivers, data scientists, and bioethicists exposed widespread discomfort with the amount of faith placed in numbers that don’t necessarily reflect the reliability of a given AI system, or its ultimate impact. In some instances, an AI could lead to faster tests, or speed the delivery of certain medicines, but still not save any more lives.
“There can be a sense in which an algorithm might work well for different groups of patients, but might not actually make a change to their outcomes,” said Melissa McCradden, a bioethicist and AI specialist at The Hospital for Sick Children in Toronto.
One problem with putting so much emphasis on statistical performance, McCradden said, is that it disregards so many other factors that may bear on an AI’s impact, such as a clinician’s judgment or the individual values and preferences of patients. “All of those things together can change the outcome,” she said.
That doesn’t necessarily mean that every AI intervention should be subjected to a randomized controlled trial. But it does underscore the need for a deeper exploration of whether accurate detection of an illness happened in a timely way — and spurred the right kind of follow-up care — to ultimately change a patient’s trajectory. Hospitals could also evaluate such tools by asking patients themselves about whether the AI influenced their decisions and behaviors.
While there is far from a universal standard for AI testing, the best practices for that work are becoming clearer as more health systems begin to share their work. “We have enough experience across sites to know what should happen,” said Mark Sendak, a data scientist at Duke who helped lead the Health AI Partnership. “We want to take that a step further by now asking, ‘How do we make this easy for people?’”
“I think our users are not going to be good at figuring out what’s going wrong. They all see that the computer is telling them something funky. So, I do think that when it comes to providing clinical IT support, having an infrastructure that is 24/7 is key, and it’s expensive, and it’s hard to build out.”
When AI systems go off the rails, troubleshooting the problem is rarely intuitive. It requires tracing errors to their source within complex software systems, where simple changes in the way data are recorded can throw off a model’s calculations.
Steve Miff, CEO of the Parkland Center for Clinical Innovation, the data science arm of the Parkland public health system in Dallas, Texas, recalled getting a text message on a Saturday from a trauma surgeon working with an AI model designed to predict mortality risk for patients in the emergency department.
The message simply asked, “Does the model take the weekend off?”
Physicians found that the AI, meant to help triage patients and gauge their stability, had suddenly crashed. Miff said the issue was traced to a password change that altered the flow of data into the model. It was quickly fixed, he said, but the problem highlighted the bigger challenge of maintaining the system in a fast-changing environment.
“We have to be on top of it, because this is something that they’re incorporating into their decision-making,” Miff said. “When it stops working, it’s disruptive.”
Establishing that kind of surveillance is particularly difficult in health systems where IT specialists, data scientists, and clinicians work in separate departments that don’t always communicate about how their decisions might affect the performance of an AI model.
Draugelis, the data scientist at Hackensack Meridian, said those barriers point to the need for an engineering culture around AI systems, to ensure that data are always entered the right way, and that controls are put in place to flag errors and quickly fix them.
“If that really takes hold, then you have a team that represents all the collective skills needed to deliver these things,” he said. “That is what’s going to have to occur as we talk about AI and all these new ways of delivering care.”
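One simple form such a control could take is an automated check that runs before a model scores a patient, flagging a data feed that has gone quiet or a record that is missing inputs. The sketch below is only an illustration of that idea under assumed names: the function `input_alerts`, the 30-minute threshold, the field names, and the sample values are all hypothetical and do not describe the system at Parkland or Hackensack Meridian.

```python
from datetime import datetime, timedelta, timezone


def input_alerts(record, last_received, now, required_fields,
                 max_gap=timedelta(minutes=30)):
    """Return a list of human-readable warnings about a model's inputs.

    An empty list means neither check found a problem. The 30-minute
    default is an arbitrary value chosen for this example.
    """
    alerts = []

    # Check 1: has the upstream feed stopped delivering data?
    gap = now - last_received
    if gap > max_gap:
        alerts.append("stale feed: no new data for " + str(gap))

    # Check 2: is any input the model depends on absent or empty?
    for field in required_fields:
        if record.get(field) is None:
            alerts.append("missing field: " + field)

    return alerts


# Invented example: the feed went silent three hours ago and the latest
# record lacks two of the three inputs the hypothetical model expects.
now = datetime(2023, 1, 7, 12, 0, tzinfo=timezone.utc)
last_received = now - timedelta(hours=3)
record = {"heart_rate": 88, "lactate": None}
alerts = input_alerts(record, last_received, now,
                      ["heart_rate", "lactate", "age"])
```

Here `alerts` contains three warnings, one for the stale feed and one for each missing field. A check like this would not explain why data stopped flowing, but it could surface the failure to the engineering team before a clinician has to ask.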
This story is part of a series examining the use of artificial intelligence in health care and practices for exchanging and analyzing patient data. It is supported with funding from the Gordon and Betty Moore Foundation.