As the United States braces for a bleak winter, hospital systems across the country are ramping up their efforts to develop AI systems to predict how likely their Covid-19 patients are to fall severely ill or even die. Yet most of the efforts are being developed in silos and trained on limited datasets, raising crucial questions about their reliability.
Dozens of institutions and companies — including Stanford, Mount Sinai, and the electronic health records vendors Epic and Cerner — have been working since the spring on models that are essentially designed to do the same thing: crunch large amounts of patient data and turn out a risk score for a patient’s chances of dying or needing a ventilator.
In the months since launching those efforts, though, transparency about the tools, including the data they’re trained on and their impact on patient care, has been mixed. Some institutions have not published any results showing whether their models work. And among those that have published findings, the research has raised concerns about the generalizability of a given model, especially one that is tested and trained only on local data.
A study published this month in Nature Machine Intelligence revealed that a Covid-19 deterioration model successfully deployed in Wuhan, China, yielded results that were no better than a roll of the dice when applied to a sample of patients in New York.
Several of the datasets also fail to include diverse sets of patients, putting some of the models at high risk of contributing to biased and unequal care for Covid-19, which has already taken a disproportionate toll on Black and Indigenous communities and other communities of color. That risk is clear in an ongoing review published in the BMJ: After analyzing dozens of Covid-19 prediction models designed around the world, the authors concluded that all of them were highly susceptible to bias.
“I don’t want to call it racism, but there are systemic inequalities that are built in,” said Benjamin Glicksberg, head of Mount Sinai’s center for Covid-19 bioinformatics and an assistant professor of genetics at the Icahn School of Medicine. Glicksberg is helping to develop a Covid-19 prediction tool for the health system.
Those shortcomings raise an important question: Do the divided efforts come at the cost of a more comprehensive, accurate model — one that is built with contributions from all of the research groups currently working in isolation on their own algorithms?
There are obstacles, of course, to such a unified approach: It would require spending precious time and money coordinating approaches, as well as coming up with a plan to merge patient data that may be stored, protected, and codified in different ways. Moreover, while the current system isn’t perfect, it could still produce helpful local tools that could later be supplemented with additional research and data, several experts said.
“Sure, maybe if everyone worked together we’d come up with a single best one, but if everyone works on it individually, perhaps we’ll see the best one win,” said Peter Winkelstein, executive director of the Jacobs Institute for Healthcare Informatics at the University at Buffalo and the vice president and chief medical informatics officer of Kaleida Health. Winkelstein is collaborating with Cerner to develop a Covid-19 prediction algorithm on behalf of his health system.
But determining the best algorithm will mean publishing data that includes the models’ performance and impact on care, and so far, that isn’t happening in any uniform fashion.
Many of these efforts were first launched in the spring, as the first surge of coronavirus cases began to overwhelm hospitals, sending clinicians and developers scrambling for solutions to help predict which patients could become the sickest and which were teetering on the edge of death. Almost simultaneously, similar efforts sprang up across dozens of medical institutions to analyze patient data for this purpose.
Yet the institutions’ process of verifying those tools could not be more varied: While some health systems have started publishing research on preprint servers or in peer-reviewed journals as they continue to hone and shape their tools, others have declined to publish while they test and train the models internally, and still others have deployed their tools without first sharing any research.
Take Epic, for instance: The EHR vendor took a tool it had been using to predict critical outcomes in non-Covid patients and repurposed it for use on those with Covid-19 without first sharing any public research on whether or how well the model worked for this purpose. James Hickman, a software developer on Epic’s data science team, told STAT in a statement that the model was initially trained on a large dataset and validated by more than 50 health systems. “To date, we have tested the [tool’s] performance on a combined total of over 29,000 hospital admissions of Covid-19 patients across 29 healthcare organizations,” Hickman said. None of the data has been shared publicly.
Epic offered the model to clinics already using its system, including Stanford University’s health system. But Stanford instead decided to try creating its own algorithm, which it is now testing head-to-head with Epic’s.
“Covid patients do not act like your typical patient,” said Tina Hernandez-Boussard, associate professor of medicine and director of faculty development in biomedical informatics at Stanford. “Because the clinical manifestation is so different, we were interested to see: Can you even use that Epic tool, and how well does it work?”
Other systems are now trying to answer that same question about their own models. At the University at Buffalo, where Winkelstein works, he and his colleagues are collaborating with Cerner to create a deterioration model by testing it in “silent mode” on all admitted patients who are suspected or confirmed to have Covid-19. This means that while the tool and its results will be seen by health care workers, its outcomes won’t be used to make any clinical decisions. They have not yet shared any public-facing studies showing how well the tool works.
“Given Covid-19, where we need to know as much as we can as quickly as possible, we’re jumping in and using what we’ve got,” Winkelstein said.
The biggest challenge with trying to take a unified approach to these tools “is data sharing and interoperability,” said Andrew Roberts, director of data science at Cerner Intelligence who leads the team that is collaborating with Winkelstein. “I don’t think the industry is quite there yet.”
Further south in the heart of New York City, Glicksberg is leading efforts to publish research on Mount Sinai’s prediction model. In November, he and his colleagues published positive but preliminary results in the Journal of Medical Internet Research. That study suggested its tool could pinpoint at-risk patients and identify characteristics linked to that risk, such as age and high blood sugar. Unlike many of the other existing tools, the Mount Sinai algorithm was trained on a diverse pool of patient data drawn from hospitals including those in Brooklyn, Queens, and Manhattan’s Upper East Side.
The idea was to ensure the model works “outside of this little itty bitty hospital you have,” he said.
So far, two Covid-19 prediction models have received clearance from the Food and Drug Administration for use during the pandemic. But some of the models currently being used in clinics haven’t been cleared and don’t need to be greenlit, because they are not technically considered human research and they still require a health care worker to interpret the results.
“I think there’s a dirty little secret which is if you’re using a local model for decision support, you don’t have to go through any regulatory clearance or peer-reviewed research at all,” said Andrew Beam, assistant professor of epidemiology at the Harvard T.H. Chan School of Public Health.
The FDA did not respond to STAT’s request for comment. All of the models that have landed FDA clearance thus far have been developed not by academic institutions, but by startups. And as with academic medical centers, these companies have taken divergent approaches to publishing research on their products.
In September, Bay Area-based clinical AI system developer Dascena published results from a study testing its model on a small sample of 197 patients across five health systems. The study suggested the tool could accurately pinpoint 16% more at-risk patients than a widely used scoring system. The following month, Dascena received conditional, pandemic-era approval from the FDA for the tool.
In June, another startup — predictive analytics company CLEW Medical, based in Israel — received the same FDA clearance for a Covid-19 deterioration tool it said it had trained on retrospective data from 76,000 ICU patients over 10 years. None of the patients had Covid-19, however, so the company is currently testing it on 500 patients with the virus at two U.S. health systems.
Beam, the Harvard researcher, said he was especially skeptical about these models, since the startups behind them tend to have far more limited access to patient data than academic medical centers do.
“I think, as a patient, if you were just dropping me into any health system that was using one of these tools, I’d be nervous,” Beam said.
This is part of a yearlong series of articles exploring the use of artificial intelligence in health care that is partly funded by a grant from the Commonwealth Fund.