Although artificial intelligence is entering health care with great promise, clinical AI tools are prone to bias and real-world underperformance from inception to deployment, including the stages of dataset acquisition, labeling or annotating, algorithm training, and validation. These biases can reinforce existing disparities in diagnosis and treatment.
To explore how well bias is being identified in the FDA review process, we looked at virtually every health care AI product approved between 1997 and October 2022. Our audit of data submitted to the FDA to clear clinical AI products for the market reveals major flaws in how this technology is being regulated.
The FDA approved 521 AI products between 1997 and October 2022: 500 under the 510(k) pathway, meaning the new algorithm mimics an existing technology; 18 under the de novo pathway, meaning the algorithm does not mimic existing models but comes packaged with controls that make it safe; and three under premarket approval. Since the FDA only includes summaries for the first two pathways, we analyzed the rigor of the submission data underlying 518 approvals to understand how well the submissions considered the ways bias can enter the equation.
In submissions to the FDA, companies are asked generally to share performance data that demonstrates the effectiveness of their AI product. One of the major challenges for the industry is that the 510(k) process is far from formulaic, and one must decipher the FDA’s ambiguous stance on a case-by-case basis. The agency has not historically asked for any buckets of supporting data explicitly; in fact, there are products with 510(k) approval for which no data were offered about potential sources of bias.
We see four areas in which bias can enter an algorithm used in medicine. This is based on best practices in computer science for training any sort of algorithm and the awareness that it’s important to consider what degree of medical training is possessed by the people who are creating or translating the raw data into something that can train an algorithm (the data annotators, in AI parlance). These four areas that can skew the performance of any clinical algorithm — patient cohorts, medical devices, clinical sites, and the annotators themselves — are not being systematically accounted for (see the table below).
Percentages of 518 FDA-approved AI products that submitted data covering sources of bias
| Source of bias | Aggregate reporting | Stratified reporting |
|---|---|---|
| Patient cohort | less than 2% conducted multi-race/gender validation | less than 1% of approvals included performance figures across gender and race |
| Medical device | 8% conducted multi-manufacturer validation | less than 2% reported performance figures across manufacturers |
| Clinical site | less than 2% conducted multisite validation | less than 1% of approvals included performance figures across sites |
| Annotators | less than 2% reported annotator/reader profiles | less than 1% reported performance figures across annotators/readers |
Aggregate reporting means a vendor states that it tested across different variables but offers only a single overall performance figure, not performance for each variable. Stratified reporting offers more insight: the vendor gives performance separately for each variable (cohort, device, or other variable).
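The difference between the two kinds of reporting can be made concrete with a minimal sketch. The records, the `site` field, and the function names below are hypothetical and chosen only for illustration; sensitivity stands in for whatever metric a vendor would actually report.

```python
from collections import defaultdict

def sensitivity(records):
    """Share of truly positive cases that the model flagged; None if there are none."""
    positives = [r for r in records if r["label"] == 1]
    if not positives:
        return None
    return sum(r["pred"] == 1 for r in positives) / len(positives)

def stratified_sensitivity(records, key):
    """Sensitivity computed separately for each value of `key` (e.g. site, device)."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    return {group: sensitivity(rows) for group, rows in groups.items()}

# Hypothetical validation set: all cases are true positives, drawn from two sites.
records = (
    [{"site": "A", "label": 1, "pred": 1}] * 9
    + [{"site": "A", "label": 1, "pred": 0}] * 1
    + [{"site": "B", "label": 1, "pred": 1}] * 3
    + [{"site": "B", "label": 1, "pred": 0}] * 2
)

aggregate = sensitivity(records)                    # one number for the whole set
by_site = stratified_sensitivity(records, "site")   # one number per site
print(f"aggregate: {aggregate:.2f}")
for site, value in sorted(by_site.items()):
    print(f"site {site}: {value:.2f}")
```

In this made-up example the aggregate figure is 0.80, which hides the fact that the model reaches 0.90 at site A but only 0.60 at site B. That gap is what stratified reporting is meant to surface.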
It is the extreme exception, not the rule, for a clinical AI product to be submitted with data that backs up its effectiveness.
A proposal for baseline submission criteria
We propose new mandatory transparency minimums that must be included for the FDA to review an algorithm. These span performance across dataset sites and patient populations; performance metrics across patient cohorts, including ethnicity, age, gender, and comorbidities; and the different devices the AI will run on. This granularity should be provided for both the training and the validation datasets. Results about the reproducibility of an algorithm in conceptually identical conditions using external validation patient cohorts should also be provided.
It also matters who is doing the data labeling and with what tools. Basic qualification and demographic information on the annotators — are they board-certified physicians, medical students, foreign board-certified physicians, or non-medical professionals employed by a private data labeling company? — should also be included as part of a submission.
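One way to picture these minimums is as a completeness check run before review. The sketch below is purely illustrative: the dictionary layout and key names are our assumptions, not an FDA format, and it only checks whether each item was reported, not whether the reported figures are adequate.

```python
# Hypothetical checklist derived from the minimums proposed above.
REQUIRED_STRATA = ("site", "ethnicity", "age", "gender", "comorbidities", "device")
REQUIRED_DATASETS = ("training", "validation")

def missing_items(submission):
    """Return a list of proposed minimums that the submission does not report."""
    missing = []
    performance = submission.get("stratified_performance", {})
    for dataset in REQUIRED_DATASETS:
        reported = performance.get(dataset, {})
        for stratum in REQUIRED_STRATA:
            if not reported.get(stratum):
                missing.append(f"{dataset}: performance by {stratum}")
    if not submission.get("external_validation"):
        missing.append("external validation reproducibility results")
    if not submission.get("annotator_profiles"):
        missing.append("annotator qualifications and demographics")
    return missing

# Hypothetical submission that reports only per-site validation performance.
submission = {
    "stratified_performance": {
        "validation": {"site": {"A": 0.90, "B": 0.60}},
    },
}
gaps = missing_items(submission)
for gap in gaps:
    print("missing:", gap)
```

A reviewer-facing platform could run a check like this on every submission, so that gaps are visible before anyone reads the performance claims.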
Proposing a baseline performance standard is a profoundly complex undertaking. The intended use of each algorithm drives the necessary performance threshold level — higher-risk situations need a higher standard for performance — and is therefore hard to generalize. While the industry works toward a better understanding of performance standards, developers of AI must be transparent about the assumptions being made in the data.
Beyond recommendations: tech platforms and whole-industry conversations
It takes as much as 15 years to develop a drug, five years to develop a medical device, and, in our experience, six months to develop an algorithm, which is designed to go through numerous iterations not only for those six months but also for its entire life cycle. In other words, algorithms don’t get anywhere near the rigorous traceability and auditability that go into developing drugs and medical devices.
If an AI tool is going to be used in decision-making processes, it should be held to similar standards as physicians who not only undergo initial training and certification but also lifelong education, recertification, and quality assurance processes during the time they are practicing medicine.
Recommendations from the Coalition for Health AI (CHAI) raise awareness about the problem of bias and effectiveness in clinical AI, but technology is needed to actually enforce them. Identifying and overcoming the four buckets of bias requires a platform approach with visibility and rigor at scale — thousands of algorithms are piling up at the FDA for review — that can compare and contrast submissions against predicates as well as evaluate de novo applications. Binders of reports won’t help version control of data, models, and annotation.
What can this approach look like? Consider the progression of software design. In the 1980s, it took considerable expertise to create a graphical user interface (the visual representation of software), and it was a solitary, siloed experience. Today, platforms like Figma abstract the expertise needed to code an interface and, equally important, connect the ecosystem of stakeholders so everyone sees and understands what’s happening.
Clinicians and regulators should not be expected to learn to code, but rather be given a platform that makes it easy to open up, inspect and test the different ingredients that make up an algorithm. It should be easy to evaluate algorithmic performance using local data and retrain on-site if need be.
CHAI calls out the need to look into the black box that is AI through a sort of metadata nutrition label that lists essential facts so clinicians can make informed decisions about the use of a particular algorithm without being machine learning experts. That can make it easy to know what to look at, but it doesn’t account for the inherent evolution, or devolution, of an algorithm. Doctors need more than a snapshot of how it worked when it was first developed: They need continual human interventions augmented by automated check-ins even after a product is on the market. A Figma-like platform should make it easy for humans to manually review performance. The platform could automate part of this, too, by comparing physicians’ diagnoses against the algorithm’s predictions for the same cases.
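As a rough sketch of what such an automated check-in could look like, the snippet below tracks how often the algorithm agrees with the physician over a rolling window of recent cases and flags the tool for human review when agreement drops. The class name, window size, and threshold are hypothetical; real values would depend on the intended use and risk level.

```python
from collections import deque

class AgreementMonitor:
    """Rolling agreement between physician diagnoses and model predictions."""

    def __init__(self, window=100, threshold=0.85):
        self.window = deque(maxlen=window)  # keeps only the most recent cases
        self.threshold = threshold          # assumed minimum acceptable agreement

    def record(self, physician_dx, model_dx):
        self.window.append(physician_dx == model_dx)

    def agreement(self):
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def needs_review(self):
        # Flag only once the window is full, to avoid alarms on a handful of cases.
        full = len(self.window) == self.window.maxlen
        return full and self.agreement() < self.threshold

# Hypothetical stream of (physician diagnosis, model prediction) pairs.
monitor = AgreementMonitor(window=4, threshold=0.75)
cases = [
    ("malignant", "malignant"),
    ("benign", "malignant"),
    ("benign", "benign"),
    ("malignant", "benign"),
]
for physician_dx, model_dx in cases:
    monitor.record(physician_dx, model_dx)

print(monitor.agreement(), monitor.needs_review())
```

Disagreement does not by itself show that the algorithm is wrong. It is a trigger for the manual review described above, not a replacement for it.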
In technical terms, what we’re describing is called a machine learning operations (MLOps) platform. Platforms in other fields, such as Snowflake, have shown the power of this approach and how it works in practice.
Finally, this discussion about bias in clinical AI tools must encompass not only big tech companies and elite academic medical centers, but also community and rural hospitals, Veterans Affairs hospitals, startups, groups advocating for under-represented communities, medical professional associations, and the FDA’s international counterparts.
No one voice is more important than others. All stakeholders must work together to forge equity, safety, and efficacy into clinical AI. The first step toward this goal is to improve transparency and approval standards.
Enes Hosgor is the founder and CEO of Gesund, a company driving equity, safety, and transparency in clinical AI. Oguz Akin is a radiologist and director of Body MRI at Memorial Sloan Kettering in New York City and a professor of radiology at Weill Cornell Medical College.