Any digital health conference features its share of machine learning evangelism. Technology executives give fervent testimonials about its power to save lives and money, to predict episodes of severe illness, to help hospitals root out inefficiency.
This year’s gathering of the Healthcare Information and Management Systems Society (HIMSS) in Las Vegas was no different. But in between the glowing anecdotes, an insistent counternarrative emerged: Machine learning needs a watchdog.
Throughout the four-day conference, the largest annual event in health care technology, industry leaders called for better ways to evaluate the usefulness of machine learning algorithms, audit them for bias, and put in place regulations designed to ensure reliability, fairness, and transparency.
“Isn’t it amazing that there isn’t a framework to figure this out?” said Bala Hota, chief analytics officer at Rush Medical College in Chicago. “It feels as if this is an area of maturity this field is three years away from. We don’t have the right regulatory approach or framework to do this at scale.”
Health care organizations and entrepreneurs are collectively spending billions of dollars to develop, implement, and refine machine learning models in medicine. But several speakers said those investments are being jeopardized by a lack of standards to evaluate these tools or guardrails to protect patients against errant results and unintended consequences. Even prominent developers of clinical algorithms said the potential harms merit a more stringent regulatory approach.
“We should think of any machine learning algorithm that is predicting a condition for somebody as a lab test,” said Tanuj Gupta, a physician and vice president at Cerner Corp., the electronic health record vendor, which has developed a number of algorithms being deployed in hospitals. “If it’s off, and you potentially cause some morbidity and mortality issue, it’s a problem.”
Those concerns run counter to the current hands-off approach to regulating such products, especially those that operate within electronic health records. The Food and Drug Administration, which reviews algorithms used to interpret medical images and data from wearables, does not provide equivalent scrutiny to many tools hospitals use within their record-keeping software to guide diagnosis and treatment. A recent STAT investigation found that multiple algorithms developed by Epic, the nation’s largest electronic health record vendor, are delivering inaccurate or irrelevant information to clinicians about the care of seriously ill patients, including a product designed to predict the onset of sepsis, a life-threatening complication of infection.
The extent of the problems that arise when algorithms go astray depends on their intended use and their specific flaws. A study of Epic’s sepsis algorithm, applied to previously hospitalized patients at the University of Michigan, found that it missed two-thirds of sepsis cases and that a clinician would have to respond to 109 alarms to identify a single septic patient. Implemented in a live clinical setting without careful tuning, such an algorithm could not only fail to flag seriously ill patients, but also contribute to alarm fatigue and divert attention from those who need care more urgently.
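The study’s figures imply a steep alert burden. A minimal sketch of the arithmetic, assuming the numbers as reported (the helper name `implied_ppv` is ours, not the study’s):

```python
# Back-of-the-envelope reading of the Michigan figures as reported above.
# This is illustrative arithmetic, not the study's own analysis.

def implied_ppv(alarms_per_case: int) -> float:
    """If N alarms must be answered per true case identified,
    the implied positive predictive value is 1/N."""
    return 1 / alarms_per_case

sensitivity = 1 - 2 / 3    # "missed two-thirds of cases" -> roughly 33% caught
ppv = implied_ppv(109)     # 109 alarms per identified septic patient -> under 1%

print(f"sensitivity ~ {sensitivity:.0%}, implied PPV ~ {ppv:.1%}")
```

In other words, even when the model fires, fewer than one alarm in a hundred corresponds to the patient it was built to find, which is the mechanism behind the alarm-fatigue concern.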
Hospital informatics experts are generally aware of these pitfalls, but there is no universal approach to evaluating proprietary algorithms and auditing them for blind spots and biases. Still, several speakers at HIMSS emphasized, the algorithms have significant potential benefits. Used carefully, they have shown the ability to identify serious medical issues, including life-threatening heart problems, before they arise, giving clinicians time to intervene. They can also help hospitals improve daily operations by predicting how many beds will be needed, when operating rooms will be available, and which patients are likely to be readmitted or to miss their appointments.
“The good news is your optimism for AI is justified, but there are caveats,” said John Halamka, a physician and president of Mayo Clinic Platform, the hospital system’s data and analytics arm. “We need as a society to define transparency, to define how we evaluate an algorithm’s fitness for purpose.”
Halamka said he supports an approach proposed by Duke University and others that would require algorithms to be labeled much like food products, so clinical users can understand the data used to develop and vet them and gauge their relevance to their own institutions.
“Shouldn’t we as a society demand a nutrition label on our algorithms saying this is the race, ethnicity, the gender, the geography, the income, the education that went into the creation of this algorithm?” he asked. “Oh, and here’s … some statistical measure of how well it works for a given population. You say, ‘Oh well, this one’s likely to work for the patient in front of me.’ That’s how we get to maturity.”
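As a rough sketch, the “nutrition label” Halamka describes might be represented as a structured record like the one below. The field names and example values are hypothetical illustrations drawn from his list, not part of any actual standard or product:

```python
from dataclasses import dataclass

@dataclass
class AlgorithmLabel:
    """Illustrative 'nutrition label' for a clinical algorithm.

    Fields mirror the attributes Halamka names (demographics of the
    development data, plus a population-specific performance measure).
    This is a sketch, not a proposed or existing standard.
    """
    name: str
    development_population: dict  # e.g. race/ethnicity, gender, geography,
                                  # income, education of the training data
    performance: dict             # statistical measure per population

# Hypothetical example; every value here is made up for illustration.
label = AlgorithmLabel(
    name="Example sepsis predictor",
    development_population={"geography": "Midwest US", "gender": "52% female"},
    performance={"overall": "AUROC 0.78"},
)
```

The point of such a record is the last step Halamka describes: a clinician could compare `development_population` against the patient in front of them before trusting the score.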
Machine learning, a subset of artificial intelligence, is not a new technology. But its early approaches, dating to the 1950s, have in recent years been augmented by novel model architectures and computing power that allow algorithms, in some realms, to surpass the capabilities of physicians and even open a new dimension of medical knowledge.
Those capabilities have fueled an explosion of investment, research, and implementation of algorithms whose inner workings and potential impacts are not always apparent to clinicians and regulators.
One of the most pressing questions is who should create and enforce such standards. The FDA and the Government Accountability Office have created high-level frameworks for regulating artificial intelligence, but those proposals do not address the specific dilemmas created by the algorithmic products already making their way into care.
“The urgency [for effective oversight] is very high,” said Michael Matheny, a physician and professor of bioinformatics at Vanderbilt University. “Having a standardized process to go through when you’re acquiring new technologies and implementing them would very much help patient safety and reduce unintended consequences.”
During one session at HIMSS last Thursday, Matheny and a colleague suggested that the regulatory gap is exposing consumers and providers to harms that, if realized, could lead to disinvestment in artificial intelligence. The field has stumbled that way twice before: in the so-called AI winters of the 1960s and 1980s, the technology failed to meet expectations and fell dormant.
In the current environment, Matheny said, health care organizations are applying different levels of rigor to evaluations meant to ensure reliability and fairness. “You see some implementations where they are taking an algorithm, assessing it, and recalibrating it,” Matheny said. “But then you have others that either don’t know to do that … or deploy it without careful performance adjustments.”
Mayo Clinic, Duke, Stanford, and other large institutions have suggested creating a national algorithm certification body and regional testing laboratories that could bring greater consistency to the evaluation process. Certification from such an organization could become a de facto prerequisite for health care algorithms, just as accreditation by the Joint Commission has become a key signal of hospital quality.
Halamka said regulators would still have a role to play in assessing safety. “But the FDA is probably not the right group to look at efficacy or bias,” he said. “And I believe that will fall to a public-private collaboration of government, academia, and industry. In 2021 I believe we’ll see these kinds of collaborations come together.”
Katie Palmer contributed reporting.