As an ICU physician, Leo Anthony Celi knows the immense power health data can hold. If it’s harnessed thoughtfully, it could speed diagnoses and drive better care. And if it isn’t wielded carefully, it can make matters worse.
That’s why he’s become a prominent advocate for open data sharing as a way to make medical research not only more democratic, but also more robust. Celi has organized hackathons to tap into a trove of deidentified ICU data from Boston’s Beth Israel Deaconess Medical Center, the Medical Information Mart for Intensive Care, or MIMIC. His computational physiology research group at the Massachusetts Institute of Technology runs MIMIC, a leading real-world medical database available for free to researchers. Now, he’s taking on another role as the editor of the new open access medical journal PLOS Digital Health.
The nonprofit publisher Public Library of Science, one of the first to provide open access to its journals, recently announced the expansion of its roster, which now also includes journals dedicated to global public health and sustainability. Submissions for the new digital health journal will open this month, and the inaugural issue is expected to come out by the summer.
STAT spoke with Celi about his plans for the journal, as well as the broader challenges facing machine learning in medicine. The following interview has been lightly edited and condensed.
What are some of PLOS’s goals for launching its own digital health publication?
One of the issues that we’ve been lamenting is the fact that the medical knowledge system revolves around rich countries. The way medicine is practiced is based on guidelines that are handed over by professional societies in the United States and western Europe, based on research performed in a handful of countries, and typically involving the white Caucasian male. With digitization, we’re hoping each country will have an opportunity to create its own medical knowledge system.
This is something we’re hoping to address with the new journal. There needs to be an increase in diversity of the authors, not just in gender, but also representation from different countries. What we’re hoping is to put those into policies, so not just, “You would like this to happen,” but, “It’s not going to be published here unless there is diversity among your authors.” It’s not unusual to see papers where all the authors are from Harvard talking about some project in Uganda, and that is not acceptable. We’re hoping to be a role model; someone has to take the first step.
That’s an issue created by authorship, but there’s also a demographic divide in the data available for research. What are the impediments to the proliferation of open access clinical databases that reflect diverse patients?
Oh, it’s the politics and the economics. Those remain the biggest barriers to advancing in this field. The politics is, for the most part, influenced by the publish-or-perish culture in academia; the [principal investigators] think they’re the owners of these datasets and no one else should be able to use them. And the economics is that health care organizations think that there is revenue that can be generated from this data. Now, they could sell the data. But who’s going to curate? Unless you sell your clinicians who understand how the data was collected, it’s pointless to sell the data.
Data is like crude oil; it’s useless unless it gets processed and curated. The curation part is tedious and underappreciated; no one wants to do it because it’s boring, unless they’re doing it to answer a research question that they’re interested in.
We have this event, a datathon; the very first was in January 2014. The idea was to promote the MIMIC dataset. … By increasing interest, what you’re also doing is you’re able to crowdsource the data curation part.
Are you seeing progress in other providers opening up their data?
Our partners are stepping up. The Amsterdam UMC database has been available now for a year; they’re having a datathon this month to promote the crowdsourcing of curation, which everyone acknowledges is much more difficult than machine learning. We have our partners in Bern, Switzerland; they have the HiRID dataset that’s also available on our PhysioNet platform.
Our partners in Madrid hospitals have done that with Covid-19 data, so we have been working closely with them over the last year. Albert Einstein Hospital has gotten approval from their hospital leadership to also make their database publicly available to not just their students, but collaborators from other countries. So I think it’s getting there.
How do those organizations contend with the potential risk to patient privacy?
It helps that we have more than a decade of experience showing that you can mitigate the risk. The risk is never going to be zero: We just finished a paper that says it’s impossible to fully anonymize the data. But keeping the data locked because we can’t anonymize it is also harmful, especially in low- and middle-income countries, where the way they practice medicine is based on the way we practice it here. So it’s all a matter of balancing the risk with the potential benefit of data sharing.
Do you see the path toward individual institutions opening up their data being challenged at all by the increasing number of privatized companies, conglomerates of providers, that have assembled in the last several years to monetize that data?
I think people will realize that data is just a piece of what they need. The curation is going to be hard, and if these datasets are not open, they’re going to end up with flawed models that can easily lead to more patient harm. We’ve seen lots of examples, including during the pandemic, where the quality of these algorithms that are black box, that are proprietary, is questionable.
I think this idea of data being proprietary also extends to algorithms being proprietary. We are very much against the business model of selling and buying algorithms because we think that that is potentially harmful to patients.
Algorithms have expiration dates; their accuracy is bound by space and time. Even algorithms we built at Beth Israel you can’t apply at Brigham and Women’s, even though they’re just across the street, because of idiosyncrasies in the way we practice at Beth Israel.
What’s an example of that kind of outcome in a privatized database?
One of the most notorious examples was from two years ago, when they were using this algorithm by Optum to predict which patients are going to develop complicated trajectories of their disease in the future. If you could identify them, you could assign them to a case management team and follow them closely. And what they used as a proxy for complicated trajectory is health care cost. But cost itself isn’t a race-neutral variable. When you don’t understand how [algorithms] were made, you’re bound to make those mistakes and you’re bound to either propagate the existing biases that are in the dataset or even magnify them.
This, to us, is one of the biggest reasons why we made our dataset publicly available. When you sign up to apply for access to datasets, you agree to a data sharing agreement that says that once you’ve published your manuscript, you’ll share your notebooks. That’s been crucial in terms of accelerating the research that comes out of the databases, because people are not starting from scratch; people are mostly standing on the shoulders of previous researchers. There’s also a quality control that happens; we discover some bugs in the code and then they get corrected. MIMIC is a testament that this is feasible, that it accelerates progress in terms of knowledge discovery and validation, and perhaps this should be the rule rather than the exception.
What’s next for MIMIC?
The plan is to, one, start linking MIMIC with other data sources, because we really need to have a better sense of what happens to the patient after discharge. We hope we can link it to the all-payer claims database, but that is obviously going to be tricky because it’s not completely publicly available; you apply and it costs you money. So what we’ll do is map them, and then we’ll release the map to people who have access to both datasets. Then you could tell, for example, how long they were in the nursing home after being discharged. You could tell how long they needed dialysis, because there will be charges for dialysis.
Step two is bringing the algorithms to the bedside. The goal of MIMIC is not just to have publications, but to have an impact. So we’re partnering with Beth Israel Deaconess Medical Center and focusing on the challenges in getting these algorithms incorporated into the workflow. We’ll be looking at human-computer interaction, implementation science, and human factors engineering to make sure we’re not adding to the burden of the frontline workers, and to make sure that they appreciate and also understand the limitations of these algorithms.