As more machine learning tools reach patients, developers are starting to get smart about the potential for bias to seep in. But a growing body of research shows that even carefully trained models, ones built to ignore race, can breed inequity in care.
Researchers at the Massachusetts Institute of Technology and IBM Research recently showed that algorithms based on clinical notes — the free-form text providers jot down during patient visits — could predict the self-identified race of a patient, even when the data had been stripped of explicit mentions of race. It’s a clear sign of a big problem: Race is so deeply embedded in clinical information that straightforward approaches like race redaction won’t cut it when it comes to making sure algorithms aren’t biased.
“People have this misconception that if they just include race as a variable or don’t include race as a variable, it’s enough to deem a model to be fair or unfair,” said Suchi Saria, director of the machine learning and health care lab at Johns Hopkins University and CEO of Bayesian Health. “And the paper’s making clear that, actually, it’s not just the explicit mention of race that matters. Race information can be inferred from all the other data that exists.”
In the paper, which has not yet been peer-reviewed, researchers assembled clinical nursing notes from Beth Israel Deaconess Medical Center and Columbia University Medical Center for patients who self-reported their race as either white or Black. After removing racial terms, they trained four different machine learning models to predict the patient’s race based solely on the notes. The models performed astonishingly well. Every model achieved an area under the curve (AUC) of greater than 0.7, with the best models in the range of 0.8. AUC measures how well a model separates two classes: 0.5 is no better than chance, and 1.0 is perfect.
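For readers who want to see what an area under the curve (AUC) figure captures, here is a minimal sketch in plain Python. It uses the pairwise definition: the share of (positive, negative) pairs in which the model scores the positive example higher, with ties counted as half. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the study.

```python
def auc(labels, scores):
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly.

    labels: iterable of 0/1 class labels.
    scores: iterable of model scores, higher meaning "more likely class 1".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (made-up numbers): 8 of the 9 pairs are ranked correctly.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print(round(auc(toy_labels, toy_scores), 3))  # prints 0.889
```

A model that guesses at random lands near 0.5 on this measure, which is why values above 0.7 on race-redacted notes stand out.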
On its own, the fact that machine learning models can pick up on a patient’s self-reported race isn’t so surprising — for example, the models picked up on words associated with comorbidities that are more prevalent in Black patients, and skin conditions that are diagnosed more frequently in white patients. In some cases, that might not be harmful. “I’d argue there are use cases where you want to incorporate race as a variable,” said Saria, who was not involved in the study.
But the fact that the models picked up on subtle racial differences baked into physicians’ notes illustrates just how difficult, if not impossible, it is to design a race-agnostic algorithm. Race is imprinted all over medical data: not just in the words physicians use, but in the vital signs and medical images they collect using devices designed with “typical” patients in mind. “There’s no way you could erase the race from the dataset,” said MIT computational physiologist and study co-author Leo Anthony Celi. “Don’t even try; it’s not going to work.”
To emphasize that point, the researchers tried to hobble their race-predicting models by removing the 25 words that were most predictive of either race. But even with those tip-off words stripped from the clinical notes, the best-performing model’s AUC fell only from 0.83 to 0.73.
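The word-removal step can be pictured with a short sketch. This is an assumption about the general shape of such a redaction, not the researchers’ code: the function `redact_terms` is hypothetical, and `term_a` and `term_b` are placeholders, since the article does not list the actual words.

```python
import re


def redact_terms(note, terms, mask="[REDACTED]"):
    """Replace whole-word, case-insensitive matches of `terms` in `note` with `mask`."""
    if not terms:
        return note
    # Longest terms first so a longer phrase wins over a shorter one it contains.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, note)


# Placeholder terms only; the real list of predictive words is not given here.
note = "Pt notes term_a on exam; TERM_B unchanged since last visit."
print(redact_terms(note, ["term_a", "term_b"]))
```

The study’s finding is that this kind of filter leaves most of the signal intact, because race is correlated with many other words that remain in the note.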
The results echoed another paper in the researchers’ series, recently published in The Lancet Digital Health, that examined machine learning models similarly trained to predict self-reported race from CT scans and X-rays. The predictions also remained good even when the images were blurred and grayed-out to the point that radiologists couldn’t identify anatomical features — a result that still defies explanation.
“The models are able to see things that we human beings are not able to appreciate or assess,” said lead author Judy Gichoya, a radiologist and machine learning researcher at Emory University. “It’s not just the detection of race that was surprising. It’s because it’s difficult to identify even when that is happening.”
Compounding the problem, human experts looking at the same redacted notes and images couldn’t detect a patient’s race.
“I think that’s the biggest concern,” said Marzyeh Ghassemi, leader of the Healthy ML group at MIT and co-author on both papers. “If I, as a third party who buys software or a practitioner in a hospital, run this bad model over race-redacted notes, it could give me much worse performance for all the Black patients in my dataset, and I would have no idea.”
Because there are no requirements for clinical AI tools to report their performance in different subgroups (most report a single, aggregated performance rate), “it’s going to fall on model users to do these internal audits,” said Ghassemi. In a synthetic experiment, she and her colleagues also showed how a model trained on race-redacted notes could perpetuate care disparities, recommending analgesia for acute pain less frequently for Black patients than for white patients.
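A bare-bones version of such an internal audit might look like the sketch below, which breaks a single aggregate accuracy figure into per-group figures. It assumes the user has self-reported group labels alongside the model’s predictions; the function name `accuracy_by_group`, the group labels `A` and `B`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return (overall_accuracy, {group: accuracy}) for aligned sequences."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    if not total:
        raise ValueError("no examples to audit")
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group


# Toy data (made up): the aggregate number hides a gap between groups.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall, per_group = accuracy_by_group(y_true, y_pred, groups)
print(overall)    # prints 0.75
print(per_group)  # prints {'A': 1.0, 'B': 0.5}
```

In this toy case a single reported rate of 75% conceals the fact that the model is right every time for one group and only half the time for the other. A real audit would use a clinically meaningful metric and enough patients per group for the comparison to be reliable.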
If machine learning developers can’t rely on end users to raise the alarm on the performance of algorithms in different settings, Saria said, “the question is, how do we now more thoughtfully think about evaluating whether the information being inferred is leading to disparate allocation and inequitable care?”
The field will likely need to adopt a top-down approach to vet the safety, efficacy, and fairness of clinical algorithms. Otherwise, human experts will only find the biases they think to search for. “This is where the end-to-end checklist-type work comes in,” said Saria. “It lays out, start to end, what are the many sources of bias that can emerge? How do you look for it? How do you find the signal and questions that allow you to identify it?”
Only then can developers deploy nuanced fixes, such as collecting more data from underrepresented patient groups, calibrating the model’s inputs, or developing clinical policies that ask providers to factor in an algorithm’s poor performance when making decisions.
“The most important point is nothing simple is going to work here,” said Ghassemi, including removing race-associated words or punishing an algorithm for using race as a variable. “Health care data is generated by humans operating or working on or caring for other humans, and it’s going to contain the exhaust of that process.”
The results of both studies underscore the need to collect more self-reported demographic information: not just race, but also features like socioeconomic status, gender identity and sexual orientation, and a variety of social determinants of health. “That information is important for us to make sure that you don’t have unintended consequences of your algorithm,” said Celi. Indeed, this research wouldn’t have been possible without clear records of patients’ self-reported race.
Until that auditing is standardized, though, Celi urges caution. “We’re not ready for AI — no sector really is ready for AI — until they’ve figured out that the computers are learning things that they’re not supposed to learn.”