The algorithms carry out an array of crucial tasks: helping emergency rooms nationwide triage patients, predicting who will develop diabetes, and flagging patients who need more help to manage their medical conditions.
But instead of making health care delivery more objective and precise, a new report finds, these algorithms — some of which have been in use for many years — are often making it more biased along racial and economic lines.
Researchers at the University of Chicago found that pervasive algorithmic bias is infecting countless daily decisions about how patients are treated by hospitals, insurers, and other businesses. Their report points to a gaping hole in oversight that is allowing deeply flawed products to seep into care with little or no vetting, in some cases perpetuating inequitable treatment for more than a decade before being discovered.
“I don’t know how bad this is yet, but I think we’re going to keep uncovering a bunch of cases where algorithms are biased and possibly doing harm,” said Heather Mattie, a professor of biostatistics and data science at Harvard University who was not involved in the research. She said the report points out a clear double standard in medicine: While health care institutions carefully scrutinize clinical trials, no such process is in place to test algorithms commonly used to guide care for millions of people.
“Unless you do it yourself, there is no checking for bias from experts in the field,” Mattie said. “For algorithms that are going to be deployed in a wider population, there should be some checks and balances before they are implemented.”
The report, the culmination of more than three years of research, sets forth a playbook for addressing these biases, calling on health care organizations to take an inventory of their algorithms, screen them for bias, and either adjust or abandon them altogether if flaws cannot be fixed.
“There is a clear market failure,” said Ziad Obermeyer, an emergency medicine physician and co-author of the report. “These algorithms are in very widespread use and affecting decisions for millions and millions of people, and nobody is catching it.”
The researchers found that bias is common both in traditional clinical calculators and checklists and in more complex algorithms that use statistics and artificial intelligence to make predictions or automate certain tasks. Some of the flawed products guide millions of transactions a day, such as the Emergency Severity Index, which is used to assess patients in most of the nation’s emergency departments. The researchers’ review was not exhaustive. It was limited by the willingness of health care organizations to expose their algorithms to an audit. But the variety and magnitude of the problems they discovered point to a systemic problem.
The report flags bias in algorithms to determine the severity of knee osteoarthritis; measure mobility; predict the onset of illnesses such as diabetes, kidney disease and heart failure; and identify which patients will fail to show up for appointments or may benefit from additional outreach to manage their conditions.
The researchers found that the Emergency Severity Index, which groups patients based on the urgency of their medical needs, performs poorly in assessing Black patients, a conclusion that mirrors findings in prior research.
Obermeyer said the index suffers from a flaw found in many of the algorithms: It relies on certain proxies that are by degrees different from the thing clinicians are trying to measure, introducing imperceptible gaps where biases often hide. The tool uses a variety of factors to make triage decisions, such as vital signs and the resources patients may require when receiving care. But Obermeyer and his colleagues found its use fails Black patients in multiple ways, underestimating the severity of their problems in some instances and in others suggesting they are sicker than they are.
“It’s very natural to make shortcuts, and to use heuristics, like, ‘This person’s blood pressure is fine so they don’t have sepsis,’” Obermeyer said, referring to a life-threatening complication of infection. “But it’s very easy for those shortcuts to go wrong.”
He and other researchers who examined the use of the index at Brigham and Women’s Hospital in Boston said it was unclear what factors introduced bias, but they sought to build a machine learning model aimed at improving its accuracy across all patients. The findings will be described in more detail in a forthcoming paper.
“Our general approach was being curious about what was going on, and not to label a group of providers or a process as bad,” said Michael Wilson, an emergency medicine physician at Brigham and Women’s who helped conduct the study. “This is an endemic problem whenever you have subjectivity. We wanted to measure for bias and correct it.”
The Emergency Severity Index was developed by physicians in the late 1990s, including one who worked at Brigham and Women’s. It is now owned and managed by the Emergency Nurses Association (ENA), a trade group that purchased the rights to the algorithm in 2019. The association’s website says the tool is used to triage patients in about 80% of hospitals in the United States.
“Although ENA takes seriously the report’s focus on bias in algorithms, it is important to note that potential bias is user dependent based on a person’s interpretation of what an algorithm presents,” the association’s president, Ron Kraus, said in a statement to STAT. “Since acquiring ESI in 2019, ENA has continually looked at avenues to evolve the way triage is performed — including through the use of technology, such as AI — to identify the right course of treatment for each patient based solely on their acuity — not their race or the cost of care.”
The research effort, based in the Center for Applied Artificial Intelligence at the University of Chicago’s Booth School of Business, was established after an initial study uncovered racial bias in a widely used algorithm developed by the health services giant Optum to identify patients most in need of extra help with their health problems. The researchers found that the algorithm, which used cost predictions to measure health need, was routinely giving preference to white patients over people of color who had more severe problems. Of the patients it targeted for stepped-up care, only 18% were Black, compared with 82% who were white. When the algorithm was revised to predict the risk of illness instead of cost, the percentage of Black patients it flagged more than doubled.
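The mechanism is easier to see with a toy example. The Python sketch below is not the Optum algorithm; the eight patients, the group labels "A" and "B", the field names, and the helper flagged_share are all invented for illustration. It assumes only that patients in group B generate lower costs than patients in group A at a similar level of illness, then compares who gets flagged when the same patients are ranked by cost and when they are ranked by a health measure.

```python
def flagged_share(patients, score_key, k, group):
    """Share of `group` among the k patients ranked highest on `score_key`."""
    top = sorted(patients, key=lambda p: p[score_key], reverse=True)[:k]
    return sum(p["group"] == group for p in top) / k


# Made-up data (an assumption, not from the report): group B patients have
# more chronic conditions but lower past costs than group A patients.
PATIENTS = [
    {"group": "A", "conditions": 4, "cost": 12000},
    {"group": "A", "conditions": 2, "cost": 9000},
    {"group": "A", "conditions": 2, "cost": 7000},
    {"group": "A", "conditions": 1, "cost": 3000},
    {"group": "B", "conditions": 6, "cost": 8000},
    {"group": "B", "conditions": 5, "cost": 6000},
    {"group": "B", "conditions": 3, "cost": 4000},
    {"group": "B", "conditions": 1, "cost": 1000},
]

# Flag the top four patients for extra help under each ranking.
share_by_cost = flagged_share(PATIENTS, "cost", 4, "B")
share_by_health = flagged_share(PATIENTS, "conditions", 4, "B")

print(f"Group B share when ranking by cost:   {share_by_cost:.2f}")
print(f"Group B share when ranking by health: {share_by_health:.2f}")
```

In this invented data set, group B fills one of the four flagged slots under the cost ranking and three of four under the health ranking. Nothing about the ranking procedure changes between the two runs; only the quantity being ranked does.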
The study struck insurers — and the broader health care industry — like a lightning bolt, momentarily illuminating the extent of racial bias in a methodology used to allocate scarce health care resources across the United States. The researchers announced plans to broaden their inquiry, and invited organizations across health care to submit algorithms for review.
Health insurers became the primary users of the research team’s services, which dozens of organizations, including larger providers and health technology startups, also used to assess bias.
Among the insurers to reach out to the researchers was Harvard Pilgrim Health Care, a nonprofit health plan in Massachusetts that wanted to assess the potential for bias in its efforts to identify members who might benefit from additional outreach and care. A preliminary review suggested that one algorithm, a model developed by a third party to predict cost, places people with chronic conditions such as diabetes at a lower priority level than patients with higher-cost conditions such as cancer. Since diabetes is experienced at a high rate among Black patients, that could lead to a biased output.
Alyssa Scott, vice president of medical informatics at Harvard Pilgrim, said algorithmic flaws arise from the use of financial forecasts in decisions about who should qualify for additional outreach. Those forecasts, while accurate, often reflect historic imbalances in access to care and use of medical services, causing bias to bubble up in ways that are difficult to detect. “If you are not aware of that, implicit bias arises that is not intended at all,” Scott said.
Harvard Pilgrim is continuing to analyze its algorithms, including those that focus on chronic condition identification, to assess bias and develop a framework for eliminating it in existing and future algorithms. “Right now we’re in the phase of trying to brainstorm and get extra input to determine whether our methodology is valid,” Scott said. “If we do find there’s bias in our algorithms, we’ll make adjustments to accommodate for the imbalance.”
Another business that worked with the researchers, a Palo Alto, Calif.-based startup called SymphonyRM, found that an algorithm it was developing to identify patients in need of a heart consultation was not performing accurately for Black and Asian patients. The company, which advises providers on patients who need additional outreach and care, adjusted the thresholds of its model to increase outreach to those groups and is planning to conduct a follow-up study to examine outcomes.
Chris Hemphill, vice president of applied AI for SymphonyRM, said bias can be the product of what seem to be tiny technical choices. For example, by adjusting a model to prevent false alarms, one might fail to identify all the people in need of additional care. An adjustment in the opposite direction — to ensure that everyone at risk of a negative outcome is identified — can produce more false alarms and unnecessary care.
Between those two extremes lie biases that are difficult to detect without careful auditing by an independent party. “If you’re not doing this audit, if you’re not looking for bias, then you can pretty much guarantee that you’re releasing biased algorithms,” Hemphill said. “You can have a model that’s performing really well overall, but then when you start breaking it down by gender and ethnicity, you start seeing different levels of performance.”
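The kind of breakdown Hemphill describes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SymphonyRM's model or audit method: the records, the group labels "A" and "B", and the function rates_by_group are hypothetical. For each group, it computes the share of patients who need care and are flagged (sensitivity) and the share who do not need care but are flagged anyway (the false alarm rate), at a given score threshold.

```python
from collections import defaultdict


def rates_by_group(records, threshold):
    """Return {group: (sensitivity, false_alarm_rate)} at a score threshold.

    Each record is a tuple (group, score, needs_care). A rate is None when
    a group has no patients in the relevant category.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for group, score, needs_care in records:
        flagged = score >= threshold
        if needs_care:
            counts[group]["tp" if flagged else "fn"] += 1
        else:
            counts[group]["fp" if flagged else "tn"] += 1

    result = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        sensitivity = c["tp"] / positives if positives else None
        false_alarms = c["fp"] / negatives if negatives else None
        result[group] = (sensitivity, false_alarms)
    return result


# Made-up scores (an assumption): the model scores group B patients lower
# than group A patients with the same underlying need.
RECORDS = [
    ("A", 0.9, True), ("A", 0.8, True), ("A", 0.7, False), ("A", 0.2, False),
    ("B", 0.6, True), ("B", 0.4, True), ("B", 0.3, False), ("B", 0.1, False),
]

for threshold in (0.75, 0.35):
    print(threshold, rates_by_group(RECORDS, threshold))
```

With these invented numbers, a strict threshold of 0.75 produces no false alarms in either group but misses every group B patient who needs care. Lowering the threshold to 0.35 catches all of them, at the price of a false alarm in group A. Neither effect would be visible in a single accuracy figure computed across all patients.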
But as it stands now, oversight of algorithms is heavily reliant on self-enforcement by companies that are free to decide whether to expose their products to outside review. The Food and Drug Administration reviews some algorithmic products prior to their release, but the agency tends to focus scrutiny on products that rely on artificial intelligence algorithms in image-based disciplines such as radiology, cardiology, and neurologic care. That leaves unexamined a wide swath of checklists, calculators, and other tools used by providers and insurers.
Obermeyer said there is a clear need for additional regulation, but innovation in the use of health care data is outpacing the ability of regulators to develop performance benchmarks akin to those used to evaluate drugs and traditional devices.
“These algorithms don’t affect someone’s health. They reveal it,” he said. “I don’t think we’ve come to terms with how to regulate the production of information, making sure that the information is good and accurate and what we want.”
This is part of a series of articles exploring the use of artificial intelligence in health care that is partly funded by a grant from the Commonwealth Fund.