Researchers analyzed 4,063 papers on tools to help guide care. Only 12 were replications.

Clinical decision support is one of digital health’s great promises. Faced with a surplus of information about a patient’s history and symptoms, algorithms built into electronic health records can provide important alerts and reminders, automated prescription suggestions, and even diagnostic support — hopefully, helping patients receive the right care.

But those systems don’t always hold up after their initial testing. Recent work pointed to flaws in an algorithm that predicts the risk of sepsis, integrated into Epic’s electronic health record platform. A STAT investigation found those shortcomings extend to other Epic algorithms, including those hospitals use to predict how long patients will be hospitalized or who will miss appointments.

A critical check on such systems is the replication of the original research. But those kinds of gut checks are few and far between, according to new work from Enrico Coiera, director of the Center for Health Informatics at Macquarie University in Sydney. Over six months, he and colleague Huong Ly Tong dredged up all the journal-published papers they could find analyzing the outcomes of clinical decision support systems. They found 4,063 — and of those, only 12 were replications.
Part of the problem is that the field has yet to build a culture that values scientific best practices, said Coiera, whose systematic review was published in the Journal of the American Medical Informatics Association. STAT spoke with him about how to right that ship, and whether patients are at risk as unreplicated algorithms proliferate.

What drew you to this research question, of how much of computerized decision support research has been independently tested?
We wrote a perspective probably three or four years ago in the same journal asking, “Do we have a replication crisis?” And with that review, it was very clear, A, that we didn’t know, and B, just based on first principles it didn’t look like we were going to have a good story to tell.

For the first time, we’ve actually sampled the literature and we’ve come up with a really robust estimate of the frequency today of replication in a critical part of the literature. We had to start somewhere, so we picked something that we knew would be clinically significant: If a clinical support system doesn’t work, patients get harmed.

We focused on trials where they’ve taken technology into the real world, so it really was rubber-hits-the-road sort of work that we were looking at. And we basically don’t replicate: Three in 1,000 papers is extremely low, even by the benchmarks of other disciplines that said they were in crisis. Reporting that number really scares us, because we think, “Oh my goodness, we must have missed something.” But we’d periodically go back and search again, and it is what it is. So we have a problem.

What have the reactions been so far?

The first reaction that’s really strong is, “Well, you can’t do replication in digital health because it’s all so special.” To which I say, I’m sorry, this is not the case. It’s a good excuse, but it won’t hold. And also, if it was the case, my God, what a disaster it would be, because that would mean there was no science to what we’re doing.

It’s good practice to make things as similar as possible within the trials when you replicate, and that’s really hard in the real world. If I take any digital system and put it in one hospital or another hospital, the actual thing is different; by the time I fit the technology into the workflows and the baseline ecosystem or technology, I’m going to have to adapt it just to fit. It’s never an identical intervention.

But it turns out there’s nothing special in what we do. If you look at ecology, for example — they go and sample an ecosystem, and it’s just as crazy. It’s a near universal challenge. The question is whether those differences are a foundational barrier to science, or whether they’re manageable. And I think most people who look at this area would say, look, they’re manageable.

Half of the 12 papers you found replicated a study that found an increase in mortality in pediatric settings using computerized provider order entry (CPOE). What could account for that?

The response was triggered by the discipline. They said, you know, this is the wrong kind of answer, so let’s really test the hell out of this. So that’s a kind of bias. I think there’s always a genuine positive motivation to say, look, if CPOE really is dangerous, we really need to know, because it’s used everywhere. There might have been some concern that this will harm our industry. But I think, by and large, most of those studies were done by good scientists who really just wanted to look after patients.

But yes, who’s going to test a positive study? You know, it’s different; they’ll just go and implement it straight away.

We don’t have a culture that recognizes or rewards replication work, so there’s no good practice around what good replication looks like. Whereas I think in psychology, for example, you’ve seen great progress since the early concerns. But that took a lot of really public failures of studies.

What will it take for health informatics to get to the same place? 

We need major cultural change. That means that amongst the scientists, the principal investigators need to train their junior researchers in replication. There’s no question that we are slowly but surely improving the quality of science, but for some reason we’ve totally missed the importance of replication compared to even other disciplines. And it’s just a thing that needs to be added to that mix.

The journals need to absolutely change their attitude and welcome replication. This is not unique to informatics: Journals value novelty and interesting results. They don’t value someone coming up and doing some homework and checking if a previous result is true or not.

One of the responses to the last paper was, “Oh, my God, all you’ll do is end up flooding the literature with cheap, useless replications” — well, that’s not a problem right now. If it were the case that every Ph.D. had done replications as part of training, yes, we’d have many hundreds of them. Fantastic. You could have a special open-access journal called Informatics Replication, for goodness’ sake, to put the less major ones in. These are just non-problems.

But it is not something that can happen overnight. Look at the patient safety story. It took us at least a decade to get enough traction, from being the crazy people in the corner to people saying, “Oh my goodness, EHRs, they do burn physicians out.” So it might go faster, but I’m imagining we’re in a 10-year journey here.

What’s the most important message you hope the field will take away from your result?

In patient safety, we know whenever there’s an incident, there are multiple contributing factors. The clinician may have made a mistake or been distracted, there might have been an error in the notes, something else may have happened — they all come together to cause the harm. So I think to connect the dots — “Here’s a research study that says this kind of technology is beneficial, we implement it in this setting, and it causes harm” — it’s just a long causal chain.

For whatever reason, we haven’t focused on replication. Now we can, and the way out of the hole is really easy. What we find as we crawl out of the hole is the challenge: Will we find that there are things that we believe that aren’t true? That’s quite possible.

Source: STAT