Facing data gaps on trans populations, researchers turn to health records for answers

Electronic health records are enabling a new generation of health care researchers to study thousands and millions of patients at a time. But scientists can’t easily tap into that promise for transgender and gender diverse populations — because in many cases, the records simply don’t reflect their identities.

Up until 2015, most digital health records only allowed physicians to enter a patient’s sex, and the field offered just three options: male, female, or other. By ignoring the complex relationship between sex and gender, those limited choices can harm trans and gender diverse patients both psychologically — as they are repeatedly misgendered or deadnamed by providers — and physically, if a record fails to recommend screenings for cervical or prostate cancer, for example.

Crucially, they also leave population health researchers without information to build evidence-based clinical practices that support the health of the entire community of trans and gender diverse people, whose gender identity differs from their sex assigned at birth.

“I want to make sure I’m offering my patients the care that is correspondent with evidence that will support them in living long and healthy lives,” said Emily DeMartino, a family nurse practitioner who focuses on care for trans and gender diverse communities. But there are still huge gaps in providers’ understanding of how best to care for those patients.

The best solution — for patients and the researchers and providers who want to support their health — is to make sex, gender, and pronoun data as integral to medical records as date of birth. But in the absence of perfect data, researchers have been developing tools that help identify trans and gender diverse patients based on signals hidden in their electronic health records.

For now, both strategies will be necessary to improve care — which researchers say is critical. Trans and gender diverse people continue to experience high rates of depression and anxiety, violence victimization, and suicidal ideation.

But both are running into serious headwinds. Building gender identity into EHRs requires buy-in from every part of the health care system, from software vendors to primary care receptionists to federal regulators. And without adequate funding and careful execution, research based on existing records will continue to erase the experiences of certain trans and gender diverse populations.

“We do need to develop algorithms that work without relying on EHRs updating,” said Clair Kronk, a biomedical informatician at the University of Cincinnati College of Medicine. “Because to be quite honest, charts are not going to be rolled out with these new things overnight, and providers aren’t going to use them overnight.”

In recent years, federal incentives have made it much more common for EHRs to include standardized fields for sexual orientation and gender identity (SOGI, for short). “As more and more people are using EHRs and as more and more people really start to collect these data, it’s really going to enable us to understand what’s happening in the context of primary care,” said Chris Grasso, associate vice president for health informatics and data services at Boston-based Fenway Health, a pioneering LGBTQIA+ focused health provider.

But those fields are still underused. While most medical schools now train doctors in the art and importance of asking patients how they self-identify, providers as a whole are adopting those practices slowly.

The plodding pace inspired a group of researchers at the Vanderbilt University Medical Center to try to find another way. In research published in 2019, they set out to identify transgender patients in the center’s electronic health records without relying on dedicated gender identity fields.

“Our hope was that we could develop a cohort of patients that would enable population-level statistics to then answer questions like ‘what should the laboratory reference values be in a group of patients,’ or ‘how do quality metrics differ for patients over time,’” said Jesse Ehrenfeld, who led the research and is now a senior associate dean at the Medical College of Wisconsin School of Medicine. “There’s no end of questions that we just haven’t been able to ask.”

Without those dedicated fields for sexual orientation and gender identity, they turned to two other signals: administrative billing codes and keywords that are common in records for transgender patients.

By searching for records that had at least one qualifying billing code (say, for gender identity disorder) and at least one qualifying keyword, the researchers were able to accurately identify records of trans patients, accidentally turning up cisgender patients just 3% of the time.
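
The two-signal rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the Vanderbilt team's implementation: the code list, the keyword list, and the `Record` structure are all hypothetical placeholders chosen for the example.

```python
from dataclasses import dataclass, field

# Hypothetical lists for illustration only. They are NOT the code or
# keyword lists used in the 2019 study.
QUALIFYING_CODES = {"F64.0", "F64.9", "302.85"}
QUALIFYING_KEYWORDS = {"transgender", "gender dysphoria"}


@dataclass
class Record:
    """A simplified stand-in for one patient's health record."""
    patient_id: str
    billing_codes: set = field(default_factory=set)
    notes: str = ""


def matches_rule(record):
    """True only if the record has BOTH a qualifying code and a keyword."""
    has_code = bool(record.billing_codes & QUALIFYING_CODES)
    text = record.notes.lower()
    has_keyword = any(keyword in text for keyword in QUALIFYING_KEYWORDS)
    return has_code and has_keyword


def build_cohort(records):
    """Return the IDs of records that satisfy the two-signal rule."""
    return [r.patient_id for r in records if matches_rule(r)]


records = [
    Record("p1", {"F64.0"}, "Patient identifies as transgender."),
    Record("p2", {"F64.0"}, "Follow-up visit for hypertension."),
    Record("p3", {"I10"}, "Patient's partner is transgender."),
]
cohort = build_cohort(records)
```

Requiring both signals is what keeps the false-positive rate low in a rule like this: the third record mentions a keyword but carries no qualifying code, so it is left out. The same requirement also excludes any patient who has never been assigned one of the listed codes.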

Many public health researchers have adopted similar strategies. But these research approaches have constraints that limit their ability to reflect the diversity of gender minority experiences. Many of the diagnostic codes that researchers use focus on patients who have received gender-affirming care like hormone therapy, said Kellan Baker, a health services researcher at Johns Hopkins Bloomberg School of Public Health.

“That’s a specific group of trans people, and there are many many more people who aren’t able to get that care,” said Baker. “How do we reach those folks and how do we learn about their experience?”

Keywords, too, limit the sub-populations that make it into studies. When researchers build their cohorts by matching keywords, their vocabulary is limited both by their own familiarity with the full spectrum of gender identities and by the fact that those terms are constantly evolving. Providers still commonly default to binary gender identities (male-to-female, female-to-male) even as an increasing number of people identify as nonbinary.

“There is not one transgender phenotype,” said Kronk. “Gender diversity is shaped by a wide array of environmental and genetic factors.”

The algorithms are also just not that smart. “Most of the algorithms right now that are in test stages would completely mistake somebody saying ‘my partner is transgender’ with ‘the patient is transgender,’ or even ‘the patient is not transgender,’” said Kronk. That’s what led to some of the false positives that snuck through Ehrenfeld’s algorithm. Some newer strategies try to parse those semantic distinctions, but they’re still far from full-blown natural language processing algorithms.
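
The failure Kronk describes is easy to reproduce. The sketch below compares a bare keyword match with a slightly more careful rule that skips sentences containing a negation or a reference to someone other than the patient. The cue lists are hypothetical and deliberately small, and the second function is a crude heuristic for illustration, not a natural language processing system or any published algorithm.

```python
import re

KEYWORD = re.compile(r"\btransgender\b", re.IGNORECASE)

# Hypothetical cue lists, for illustration only.
NEGATION = re.compile(r"\b(not|no|denies)\b", re.IGNORECASE)
OTHER_PERSON = re.compile(
    r"\b(partner|spouse|mother|father|sibling|friend)\b", re.IGNORECASE
)


def naive_match(note):
    """Flag the note if the keyword appears anywhere in it."""
    return bool(KEYWORD.search(note))


def context_aware_match(note):
    """Flag the note only if some sentence contains the keyword
    without a negation cue or a mention of another person."""
    for sentence in re.split(r"[.!?]+", note):
        if not KEYWORD.search(sentence):
            continue
        if NEGATION.search(sentence) or OTHER_PERSON.search(sentence):
            continue
        return True
    return False


notes = [
    "The patient is transgender.",
    "Patient reports that my partner is transgender.",
    "The patient is not transgender.",
]
naive_flags = [naive_match(n) for n in notes]
context_flags = [context_aware_match(n) for n in notes]
```

The bare match flags all three notes, while the sentence-level rule flags only the first. The trade-off is that the stricter rule will also skip a true statement such as a note saying the patient is transgender and lives with a partner, which is one reason simple cue lists fall short of real language understanding.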

There is a way out of this imperfect, less-than-representative system, of course. The gold standard for both patient care and population research on trans and gender diverse folks is simple: just ask. “Unless you’re actually talking to people and getting information from them about their gender identity, ultimately it is guesswork,” said Baker. A two-step question, asking for both sex assigned at birth and gender identity, is often used as a best practice.

But that’s not likely to happen any time soon. While EHR vendors like Epic continue to add features that support care for trans and gender diverse patients, adoption of those features will continue to grow slowly without national mandates. Existing federal rules only require that EHRs certified for meaningful use provide structured SOGI fields, not that the fields actually get filled in.

And patients, of course, aren’t obligated to disclose that information when asked. “Going into any health care experience is a really vulnerable moment, and when we’re asking people to also be vulnerable and disclose their gender identity for abstract, far-away research that may someday benefit gender diverse people, that’s a really big ask,” said DeMartino.

Without full buy-in from the health care system, researchers will need to develop smarter algorithms that can be both sensitive — capturing the widest diversity of trans and gender diverse experiences — and specific, avoiding those whose sex assigned at birth and gender identity are the same. Right now, with more simplistic algorithms, “you can cast a wider net,” said Baker, “but it means you’ll be pulling in cis people as well.”
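
To make the trade-off concrete, here is a small sketch of how sensitivity and specificity would be computed for two patient-finding rules, one narrow and one wide. All of the patient IDs and counts are invented for the example, and the sketch assumes a trusted reference standard (for instance, self-reported gender identity) is available for comparison, which in practice is the hard part.

```python
def evaluate(flagged, reference, all_patients):
    """Compare an algorithm's flagged set against a reference standard.

    flagged      -- patient IDs the algorithm identified
    reference    -- patient IDs known (by assumption) to be trans or
                    gender diverse
    all_patients -- every patient ID in the dataset
    """
    true_pos = len(flagged & reference)
    false_pos = len(flagged - reference)
    false_neg = len(reference - flagged)
    true_neg = len(all_patients - flagged - reference)
    return {
        # Share of trans and gender diverse patients the rule found.
        "sensitivity": true_pos / (true_pos + false_neg),
        # Share of cisgender patients the rule correctly left out.
        "specificity": true_neg / (true_neg + false_pos),
    }


# Invented toy data: ten patients, four in the reference standard.
all_patients = set("abcdefghij")
reference = set("abcd")

narrow_rule = evaluate(set("ab"), reference, all_patients)
wide_rule = evaluate(set("abcdef"), reference, all_patients)
```

In this toy data the narrow rule never flags a cisgender patient but finds only half of the reference group (sensitivity 0.5, specificity 1.0). The wide rule finds everyone in the reference group but also pulls in two cisgender patients (sensitivity 1.0, specificity about 0.67), which is the wider net Baker describes.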

Kronk imagines a patient identification algorithm that pulls in diagnostic codes, standardized medical terms, free text search, and particular drug and procedure combinations. “But that’s a much more complicated algorithm that needs to be very, very rigorously tested, multisite,” she said. “If it comes from one clinic with 100 patients I don’t care if it was 100% accurate, I don’t believe it.”

At some point, machine learning algorithms could begin to infer information about a patient’s gender identity that the patient never directly disclosed to a doctor. Existing research can typically only capture patients who have chosen to talk to their doctor about their gender identity. But it’s possible to imagine a (bad) algorithm that detects a record with a distant surgical history of hysterectomy, parses the current clinician’s notes filled with masculine pronouns, and autofills the patient’s gender identity. If that happens, researchers will enter even murkier ethical waters.

“In all biomedical research, there needs to be an appropriate balance between the desire to gain new insight and knowledge and protect the privacy and safety and rights of those involved, particularly in studies where patients’ data is being used and the patient may not have explicitly given permission for it,” said Ehrenfeld. That’s especially true when the patients have historically been mistreated by the medical system.

Patients whose EHRs get rolled into these databases may have signed consent forms to allow their data’s use in research, but if that consent was rolled into reams of intake papers, they might not be aware of what they’ve signed. In other cases, patients don’t consent, and their records are only used if they’ve been scrubbed of personally identifiable information. Institutional review boards exist to make sure that data is used appropriately, but “you do put the patient in significant danger if you’re not careful,” said Kronk.

“If the classifier is inferring something about a patient, I think that’s where it gets a little icky,” said Noémie Elhadad, an associate professor of biomedical informatics at Columbia University who has used natural language processing to research other populations. “I really wish I didn’t have to do NLP on patient data,” she said. “But I do think that it’s a necessary evil,” in order to answer questions about patient populations that have been underserved and even harmed by the medical system.

The researchers who spoke with STAT were not aware of any research that applies that kind of natural language processing to EHRs in order to identify trans populations. But some suggested that the time is nearing, and when it comes, many say that trans researchers must lead the way.

“Whenever you’re doing population-level statistics the most important thing to do is understand your population,” said Kronk. “I think we’re going to get somewhere in the next two to five years because there will be more trans researchers involved in this.”

DeMartino, the family nurse practitioner, sees the way poor EHR data is impeding clinical practices that could help her better treat patients. But that’s not the only barrier: “I think about lack of funding, and gaps in the research pipeline,” she said. “I think about the structural barriers that trans and gender diverse people have to participating in research, leading research, leading health care.”

In the meantime, she’ll continue caring for patients to the best of her ability, helping them through everything from gender-affirming hormone therapy to rashes, bumps, bruises, and scrapes. But she looks forward to the day that health care research and technology adapt to her patients’ needs, rather than her patients adapting to the system.

“That’s what I find really frustrating,” she said. “I have to figure out all these hacks and tips and tricks so my patients can exist safely and happily.”

Source: STAT