The great hope of artificial intelligence in breast cancer is that it can distinguish harmless lesions from those likely to become malignant. By scanning millions of pixels, AI promises to help physicians find an answer for every patient far sooner, offering them freedom from anxiety or a better chance against a deadly disease.
But the Food and Drug Administration’s decision to grant clearances to these products without requiring them to publicly disclose how extensively their tools have been tested on people of color threatens to worsen already gaping disparities in breast cancer outcomes, in a disease that is 46% more likely to be fatal for Black women.
Oncologists said testing algorithms on diverse populations is essential because of variations in the way cancers manifest themselves among different groups. Black women, for instance, are more likely to develop aggressive triple-negative tumors, and are often diagnosed earlier in life at more advanced stages of disease.
“These companies should disclose the datasets they’re using and the demographics that they’re using. Because if they don’t, then we essentially have to take their word for it that these new technologies apply equally,” said Salewa Oseni, a surgical oncologist at Massachusetts General Hospital. “Unfortunately we’ve found that’s not always the case.”
A STAT investigation found that just one of 10 AI products cleared by the FDA for use in breast imaging breaks down the racial demographics of the data used to validate the algorithm’s performance in public summaries filed by manufacturers. The other nine explain only that they have been tested on various numbers of scans from proprietary datasets or mostly unnamed institutions.
The data used to support AI products — first to train the system to learn, and then to test and validate its performance — is a crucial marker of whether they are effective for a wide range of patients. But the companies often treat that information as proprietary, part of the secret recipe that sets their products apart from rivals in an increasingly competitive market.
As it begins to regulate AI, the FDA is still trying to draw the line between the commercial interest in confidentiality and the public interest in disclosure, and whether it can, or should, force manufacturers to be fully transparent about their datasets to ensure AI algorithms are safe and effective.
“It’s an extremely, extremely important topic for us,” said Bakul Patel, director of the FDA’s Digital Health Center of Excellence. “As you saw in our action plan, we want to have that next level of conversation: What should that expectation be for people to bring trustworthiness in these products?”
In addition to the agency’s January action plan, which calls for the development of standard processes to root out algorithmic bias, the FDA issued guidance in 2017 calling on all makers of medical devices — whether traditional tools or AI — to publicly report on the demographics of populations used to study their products. But to date, that level of detail is not being provided in public summaries of AI products posted to the agency’s website.
So far, just 7 of 161 AI products cleared in recent years include any information about the racial composition of their datasets. Nonetheless, those products have been cleared to help detect or diagnose a wide array of serious conditions, including heart disease, strokes, and respiratory illnesses.
The lack of disclosure in breast imaging raises particularly pressing questions — not only because of the variations in the biology, social factors, and signals of disease among different patients, but also because there is a long history of diagnostic tests and risk models in breast cancer care not performing as well for people of color.
Those failures do not mean that AI products will result in the same inequities. In fact, the notion driving their development is to do away with human biases that so often undermine a patient’s care.
But how to ensure that they eliminate bias — and do not perpetuate it — is a matter of intense debate among AI developers, researchers, and clinicians, who have not reached a consensus about whether and how regulation needs to change.
And as that debate over the future of AI plays out, the tools in question continue to make their way into the market, forcing hospitals and other providers to grapple with unanswered questions about how well they work in different populations.
“The entire sector is trying to study this issue,” said Daniel Mollura, a diagnostic radiologist and founder of RAD-AID, a nonprofit seeking to increase access to radiology in underserved communities in the U.S. and around the globe. The organization sees a possible value in AI to address the shortage of radiologists, but it is also urging a conservative approach.
“We’ve known about the generalizability problem for a long time,” he said. “What we’re trying to find out now is where the bias comes from, what is the impact on performance, and how do you remedy it?”
Judging by public FDA filings, AI products cleared for use in breast imaging are being tested on relatively small groups of patients. The datasets range from 111 patients to over 1,000, in some cases drawn from a single U.S. site and in others from multiple sites across the globe.
But executives at AI companies said these data represent only a fraction of the cases used to train and test their products over many years. They asserted studies done at the behest of the FDA simply reinforce what they already know: Their products will perform accurately in diverse populations.
“It’s the tip of the tip of the iceberg,” Matthieu Leclerc-Chalvet, chief executive of the French AI company Therapixel, said of the 240-patient study requested by the FDA to validate its product, called MammoScreen, which received clearance in March 2020.
Prior to that study, he said, the company’s device — which is designed to help clinicians identify potentially cancerous lesions on mammography images — was trained on data from 14 health care sites that were selected to ensure representation from people of color. He said the tool was also tested on patients on the East and West coasts to measure its accuracy across different populations.
Leclerc-Chalvet declined to identify the providers of the data, saying the information is proprietary. He said the FDA did not specifically request that the validation study include patients of different races and ethnicities, adding that the agency was more focused on the product’s ability to help radiologists accurately distinguish between harmless lesions and those that spiraled into cancer.
That study, published by the Radiological Society of North America, found that the diagnostic accuracy of radiologists improved when they were using MammoScreen. The study relied on 240 mammography images collected from a hospital in California between 2013 and 2016. The demographics of the dataset were broken down by age and level of breast density, but not by race.
Manufacturers of other breast imaging products reported similar experiences with the FDA’s regulatory process. Nico Karssemeijer, co-founder of ScreenPoint Medical, said the company’s product — which identifies potentially cancerous lesions on mammograms and tomosynthesis images, a type of low-dose X-ray used in breast imaging — was trained on more than 1 million images from 10 countries in Europe, the U.S., and Asia.
In support of its application for FDA clearance, the Netherlands-based company also submitted a clinical study of 240 cases from two unidentified U.S. sites, finding that its product, called Transpara, improved the accuracy of radiologists.
Karssemeijer said the demographic information of the testing and validation sets was submitted to the agency for its review, but that information didn’t make its way into the public filings. To Karssemeijer, public disclosure of data through the FDA’s process is not essential because the company can answer those questions through follow-up studies conducted by clients in the U.S. and elsewhere.
Karssemeijer said the company has published five peer-reviewed studies on its product and given more than 20 presentations at major radiological conferences, which he argues is “more important than the FDA clearance.”
Not everyone has reached the same conclusion as Karssemeijer, though — clinicians and developers of AI are divided about what path a tool needs to take before it is widely used in patient care.
A group of researchers from Massachusetts General Hospital and the Massachusetts Institute of Technology, for example, have decided to conduct studies to validate the performance of a breast cancer risk prediction algorithm in multiple centers in the U.S. and around the globe before they even consider commercializing the tool.
In a recent paper published in Science Translational Medicine, the researchers report the results of that testing on patients in Sweden and Taiwan, as well as its performance among Black patients in the U.S. While the study found that the AI model, named Mirai, performed with similar accuracy across racial groups, the researchers are still doing more studies at other centers internationally.
They are also publicly identifying those sites: Novant Health in North Carolina, Emory University in Georgia, Maccabi Health in Israel, TecSalud in Mexico, Apollo Health System in India, and Barretos, a public hospital system in Brazil.
Regina Barzilay, a researcher leading Mirai’s development at MIT, said the researchers determined that widespread validation is crucial to ensuring it could perform equally across populations. In examining their model, the researchers found the AI could predict the race of the patient just from analyzing a mammography image, indicating that it was picking up on nuances in the breast tissue that are relevant to assessing the risks facing different patients.
In breast cancer care, the failure to include diverse populations in research has repeatedly undermined how well products and algorithms work for people of color. Extensive research shows that such technologies can exacerbate disparities. Often, that research suggests, those inequities are because diverse groups of people were not included in data used to test the products prior to their release.
One recent study found that a common genetic test used to assess breast cancer risk in patients — and identify candidates for adjuvant chemotherapy — has lower prognostic accuracy for Black patients. It found that Black patients were more likely to die than white patients with a comparable score.
The study noted that the racial and ethnic demographics of the tumor data used to develop the test, called 21-gene Oncotype DX Breast Recurrence Score, were not reported. But only 6% of patients enrolled in a trial to evaluate the test were Black.
Meanwhile, multiple studies have shown that another breast cancer tool, this one used to inform screening recommendations and clinical trial protocols, underestimated the risk of breast cancer in Black patients and overestimated the risk in Asian patients.
The Gail model uses a range of demographic and clinical factors — such as age, race, family history of cancer, and number of past breast biopsies — to assess a patient’s risk over five years. After studies pointed to inequities in performance, the model was adjusted to improve its generalizability in diverse populations.
Still another model, known as Tyrer-Cuzick, was recently found to overestimate risk in Hispanic patients. The paper by researchers from MIT and Massachusetts General Hospital also found that their algorithm, Mirai, significantly outperformed Tyrer-Cuzick in accurately assessing risks for African American patients.
Connie Lehman, chief of breast imaging at Massachusetts General Hospital and a co-author of the study, said the entire field has suffered for decades from a failure to include diverse groups of patients in research.
“Whether it’s AI or traditional models, we have always been complacent in developing models in European Caucasian women with breast cancer,” she said. “Even when we saw it wasn’t predictive in African American, Hispanic, and Asian women, we were complacent.”
Evidence of inequity did not lead regulators to pull those models from the market, though, because they are the only tools available. But Lehman said the problems with them should serve as a rallying cry to the developers of AI products who are now trying to use data to improve performance.
She sees huge potential in AI. But she also sees where it could go wrong, as was the case with an earlier generation of computer-aided detection (CAD) software for breast cancer. Despite FDA clearances and government reimbursements that allowed providers to collect hundreds of millions of dollars a year for using the products, they ultimately failed to improve care.
“Maybe we can say we learned lessons from CAD, we learned lessons from traditional risk models,” Lehman said. “Maybe we can say we’re not going to repeat those mistakes again. We are going to hold ourselves to a higher standard.”
This is part of a yearlong series of articles exploring the use of artificial intelligence in health care that is partly funded by a grant from the Commonwealth Fund.