At every stage of the Covid-19 pandemic, national reporting of racial and ethnic disparities in Covid-19 testing, diagnosis, disease severity, treatment, and vaccination by clinicians, public health organizations, and the media has been marred by frustrating data deficiencies.
How bad has the problem been? Far beyond bad.
In epidemiology research, missing more than 5% of data in a category is significant because, at that level, the missing data can no longer be treated as statistically random, which makes any findings drawn from the analysis suspect. A whopping 56% of confirmed Covid-19 infections were missing race and ethnicity when first reported in July 2020. In a systematic review published in 2021, researchers had to exclude one-fifth of cross-sectional studies looking at Covid-19 disparities because data on race/ethnicity were missing for more than 20% of cases.
Alarms about the lack of demographic data for Covid-19 cases and deaths have been sounded over and over by physicians, scientists, and advocates since early 2020. Shocked to find that more than half of race or ethnicity data for confirmed Covid-19 infections were missing nationally — even months into the pandemic — many took to the media to articulate just how dangerous this was.
Seeking something better, volunteers scrambled to take up the mantle. Concerned civilians, worried scientists, private entities, and journalists struggled to fill the data vacuum. Despite urgent advocacy — published everywhere from the most prestigious academic journals to multiple media outlets — not much improved. During the first phase of the rollout of Covid-19 vaccines, from mid-December 2020 to mid-January 2021, race/ethnicity data were still missing in more than 40% of reported U.S. Covid-19 cases.
Epidemiology requires good data. Without it, epidemiologists can’t build an understanding of a disease’s spread and impact; public health experts can’t control or mitigate how it unfurls; and health policy professionals can’t formulate effective plans to address the crisis.
Covid-19 cuts along social lines. Though the virus was once theorized to be the great equalizer (it could take down anyone, no matter how young or rich), that myth was quickly busted. The lines of the disease carved deep into the lives of vulnerable populations to cause unequal pandemic suffering, hitting the have-nots hardest.
As a group of health equity scholars, we and our colleagues knew that a better understanding of how historical inequities might tie to contemporary Covid-19 disparities could help inform solutions that protect everyone. For months, we discussed the right way to study the throughlines between government-sponsored segregation, neighborhood disinvestment, and the higher rates of Covid-19 being seen in Black and brown Americans. Across country lines, time zones, and disciplines, we tinkered with our research question again and again.
We didn’t complete the project. We couldn’t, because we found that, even in 2022, the degree of missing data on race and ethnicity in federal Covid-19 databases was still simply too high. In a national dataset of more than 50 million Covid-19 cases assembled by the Centers for Disease Control and Prevention, more than 17 million did not have race/ethnicity data. That’s 34% of cases. By comparison, just 1% of cases were missing data on age and sex.
Since we were stymied from pursuing our original research questions, we pivoted to investigate the completeness of racial/ethnic data over the course of the pandemic. After obtaining deidentified patient-level data from the CDC’s national case surveillance, which includes all Covid-19 cases and associated demographic characteristics shared with the CDC, we mapped the missing-ness of racial and ethnic data by state, over time, to see if data collection and reporting improved with lessons learned throughout the pandemic.
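To make the idea of mapping missing-ness concrete, here is a minimal sketch of how the share of cases lacking race/ethnicity could be tallied by state and month. It is an illustration only, not the code used for this analysis: the field names (`state`, `month`, `race_ethnicity`), the set of values treated as missing, and the toy records are all assumptions.

```python
from collections import defaultdict

# Assumption: codes that count as "missing". Real surveillance files may
# use different or additional codes.
MISSING_VALUES = {"", "missing", "unknown", "na"}


def is_missing(value):
    """Return True if a field is absent or holds a placeholder code."""
    return value is None or str(value).strip().lower() in MISSING_VALUES


def missingness_by_group(records, field, keys=("state", "month")):
    """Share of records in each (state, month) group where `field` is missing."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        if is_missing(rec.get(field)):
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}


# Hypothetical toy records, invented for illustration only.
records = [
    {"state": "UT", "month": "2020-07", "race_ethnicity": "Hispanic/Latino"},
    {"state": "UT", "month": "2020-07", "race_ethnicity": "Unknown"},
    {"state": "ND", "month": "2020-07", "race_ethnicity": None},
    {"state": "ND", "month": "2020-07", "race_ethnicity": "Missing"},
    {"state": "ND", "month": "2022-01", "race_ethnicity": "White"},
]

shares = missingness_by_group(records, "race_ethnicity")
for (state, month), share in sorted(shares.items()):
    print(f"{state} {month}: {share:.0%} missing race/ethnicity")
```

Grouping by both state and month is what lets the same tally answer two questions at once: where the gaps are, and whether they narrowed as the pandemic went on.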
They didn’t. Three years into the pandemic, the degree to which data on race or ethnicity are still missing is shameful, though it varies widely between states.
Across the nation from 2020 to 2022, data on race and ethnicity were missing from 34% of all reported Covid-19 diagnoses. There was substantial variability from state to state, ranging from 8.7% missing race/ethnicity in Utah to 100% in North Dakota. Visualized by county, there was also significant variability within states.
Reporting about Covid-related deaths was slightly better, with 15% of reports overall missing race/ethnicity data. Yet egregious limitations remained: three states, North Dakota, South Dakota, and West Virginia, provided no data on deaths by race or ethnicity to the CDC. These states, whose populations are more than 80% white, were ill-equipped to fully understand which of their residents were dying of Covid-19.
One study from Fulton County, Ga., found that even conservative adjustments for statistical biases associated with missing race and ethnicity data increased the incidence of Covid-19 by 130% for Black people, 170% for Hispanic people, and 160% for “other” (including Indigenous, Native Hawaiian, and Pacific Islander) people. In other words, the degree of missing data significantly skews conclusions about who is affected by Covid-19; unadjusted and incomplete data risk significantly lowballing the magnitude of inequities.
The pervasive missing-ness of racial and ethnic data amounts to nothing short of data genocide. This categorical erasure of viral transmission networks, lives lost, and missed vaccination opportunities in data collected from Black, Hispanic, Asian, and Indigenous communities represents a dismissal that may reverberate inequities for generations to come.
Deficient data collection early in the pandemic might have been excusable. Though public health preparedness could have been more robust, it’s understandable that the epidemic’s initial tidal wave of upheaval led to inadequate documentation and record-gathering. But that explanation has expired. In fact, the missing-ness of Covid-19 data reported in early 2022 was worse than ever.
Hospitals, health care providers, and laboratories route public health surveillance data to the CDC via local, state, territorial, and tribal public health agencies. Federal mandates requiring the collection and reporting of basic data on race, age, sex, and ZIP code were in place as early as August 2020. It’s clear they haven’t worked for race and ethnicity.
Knowing basic population-level information about Covid-19 is vital, especially amid rationed care contexts. When scarce resources must be allocated to communities at greatest risk, the ability to slow disease spread and redress inequity depends on the ability to effectively target interventions. Vulnerable Americans have felt this intimately and painfully. An inability to calculate sociodemographic disparities is a fundamental obstacle to health equity. And, as the pandemic has clearly shown, concentrated harms facing marginalized populations inevitably spill over to affect the whole of society.
Surveillance of acute viral pandemics and chronic diseases alike is essential, but public health decision-making cannot remain tethered to strained data supply chains. The CDC recently announced an agency-wide structural overhaul, including the creation of an Office of Health Equity and an Office of Public Health Data, Surveillance, and Technology, to advance the agency’s plan to “build the data infrastructure necessary to connect all levels of public health with the critical data needed for action.”
Creating robust alternatives need not start from scratch. In Minnesota, a new cross-sector partnership produced near-complete weekly data on the pandemic’s impact across a plethora of sociodemographic characteristics including race, ethnicity, homelessness, and incarceration, as well as geospatial rurality and social vulnerability indices. This initiative required a concerted effort to coalesce the state’s largest health systems, community-based organizations, public health stakeholders, and government agencies for homelessness and criminal justice. Models like the Minnesota Electronic Health Record Consortium can serve as a nidus for reforming hyperlocal collection and reporting of comprehensive sociodemographic data, and for real-time monitoring of public health interventions.
The pervasiveness of missing data on who gets Covid-19 and who dies from it is a troubling public health failure. Three years into this pandemic, the fact that federal data still cannot confidently describe the current global priority says something about our nation’s capacity for public health. It also speaks to our country’s ability to emerge from this crisis and brace itself for the next one.
Jennifer W. Tsai is an emergency medicine physician and health equity researcher in New Haven, Conn., and a 2022 STAT Wunderkind. Rohan Khazanchi is an internal medicine and pediatrics physician in Boston, and a health services and health equity researcher. Emily Laflamme is an epidemiologist who focuses on structural causes of health inequities and was a senior analyst at the American Medical Association Center for Health Equity during the development of this research and essay. The authors acknowledge Fernando De Maio and Leila Morsy for their crucial intellectual and technical contributions to this article. The views expressed in this article are the authors’ alone and do not necessarily represent the views or policies of the institutions they work for or the American Medical Association.