Opinion: Generative AI’s three possibly insurmountable challenges for health care

Talk about AI in medicine often focuses on the most exciting possible innovations like precision diagnostics, clinical prediction systems, and analytics-driven drug discovery. However, with the arrival of large language models like GPT-4, Bard, and LLaMA, there is growing enthusiasm for how AI might reshape the more mundane aspects of clinical practice: clinical documentation and electronic health records. And it’s obvious why. As a patient, I hate the experience of having to talk to my doctors as they peer at me just over the laptop screen (all the while typing furiously). It really takes the feeling of care out of health care. And, of course, doctors hate EHRs, probably more than patients do. There’s no end to the complaints about increasing documentation demands, poor interface design, and incessant alerts.

It’s no wonder that doctors dream of a hands-free world where a device — like a Dr. Echo — sits in the corner, listening to everything said and then auto-generating the clinical notes, discharge summaries, prior authorization letters, and so on. If large language models can help providers be more present and focused on patient care, then that seems like a clear win. This is exactly what Microsoft, OpenAI, and Epic are hoping for in their new AI EHR collaboration, already underway at Stanford, UW-Madison, and UC San Diego.

Nevertheless, it’s important to take stock of what could be lost in this technological transition. While I’m often a patient who hates the experience of talking to my doctor through a laptop, I’m also a researcher who has devoted a large part of his career to better understanding what happens when new technologies are added to clinical spaces. From this perspective, I see three major challenges that have to be overcome before large language models can really serve as clinical scribes.


The truth challenge. The primary purpose of EHRs is to ensure that accurate records are available to support continuity of care. It’s essential that clear, correct, and complete information goes into EHRs, and there are already too many studies showing how anything less leads to increased medical error rates.

There are two major chokepoints where hands-free AI could lead to inaccurate medical records. The first is in the speech-to-text technology. When the university where I teach pivoted online for Covid-19, I suddenly found myself pre-recording lectures. The AI transcription systems we used were always making errors, swapping out one word for another. And I was raised in Iowa, giving me the most standard Midwestern accent possible. Digital assistants that rely on speech-to-text AI are famous for inaccurately capturing any of the accents we don’t usually hear on CNN or BBC News.


Even if we fix the transcription problem, AI-enabled EHRs would still have to generate accurate information for history and presentation notes, discharge summaries, or prior authorization letters. This is a huge issue for large language models, which are prone to what researchers often call “hallucinations.” What that really means is they make things up. Large language models generate text through next-word prediction. By applying deep learning architectures to large collections of text, large language models essentially learn which word is most likely to follow from any given previous word or words.

But whatever is most common may well not be true for any particular patient. This is likely to be an even greater issue for unique cases or rare diseases. There may not be enough relevant information in the data the AI was trained on. In such cases, the language model is likely to make up information that looks true, but isn’t. Ensuring that all notes are true is a huge challenge that limited speech-to-text AIs and current hallucination-prone large language models have yet to overcome.
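
To make the next-word mechanism concrete, here is a minimal sketch in Python. It is a toy word-pair counter, not anything resembling a real large language model, and the tiny corpus and the function names (`train_bigrams`, `predict_next`, `complete`) are invented purely for illustration. It shows how a system that always picks the most common continuation will write the most common note, whatever the patient in the room actually said.

```python
from collections import Counter, defaultdict

# Hypothetical toy "training data": invented note fragments, not real records.
CORPUS = [
    "patient denies chest pain",
    "patient denies chest pain",
    "patient denies chest pain",
    "patient reports chest pain",
]


def train_bigrams(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if it has none."""
    followers = counts.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


def complete(counts, start, max_words=5):
    """Extend `start` one word at a time, always taking the likeliest next word."""
    words = [start]
    while len(words) < max_words:
        nxt = predict_next(counts, words[-1])
        if nxt is None:
            break
        words.append(nxt)
    return " ".join(words)


model = train_bigrams(CORPUS)
note = complete(model, "patient")
print(note)
```

In this sketch the generated note reads "patient denies chest pain" because "denies" follows "patient" three times out of four in the invented corpus. A patient who did report chest pain would get the same sentence. Real models condition on far more context than one previous word, so this is only a caricature, but it illustrates why the statistically likely text and the true text can come apart.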

The time challenge. AI and large language model enthusiasts routinely celebrate the potential of these new technologies to liberate doctors from the drudgeries of modern medicine. They hope that this newfound freedom will result in more time with patients. To be blunt, these folks need to have a serious conversation with the people who own and run most hospitals, because those people seem to have a very different idea about how this will play out.

Many new AI systems are tested and sold on economic and efficiency outcomes. That is, new AI systems are mostly sold based on the extent to which they make care faster or cheaper, not more pleasant for provider and patient. It’s impossible to imagine that in the current economic context of clinical care, LLM adoption will lead hospital administrators to support the idea of providers having more time with patients. You can already see this dynamic playing out with old-school human scribes. One 2018 study found that human scribes made it so clinics could squeeze in 8.8% more patients every hour. On the whole, scribe studies focus on economic and consumer satisfaction outcomes rather than health benefits. I can’t see why we should expect research, marketing, and procurement of AI scribes to be any different.

The thought challenge. Research on EHR use and clinical documentation shows that doctors make better decisions when they read and consult their clinical notes. Having direct access to EHR data supports better clinical decision-making. Looking at the screen, notes, and displays gives providers a chance to think about and synthesize relevant patient information. When doctors no longer engage directly with clinical notes, an active thinking process is replaced with passively waiting for alerts.

But as reliance on alerts increases, alert fatigue sets in as doctors stop paying attention to those alerts. This has already been identified as a serious challenge for expert systems, and one that may become more problematic as LLM-enhanced EHRs roll out. Just as important, some data show that the act of writing notes, however annoying, can also improve clinical thought. Taking the time to write a note forces a doctor to make choices about how they capture the clinical presentation. These are key elements of diagnostic decision-making, elements that ought not to be so casually discarded.

When it comes to adding LLMs, the hallucination problem is often presented as the main issue to overcome. But, unfortunately, that’s only one-third of the overall picture. For clinical documentation and EHRs to do their job, truth, time, and thought all have to come together in the right mix. Even if we reliably solve the hallucination problem, LLMs can’t deliver on current promises if we just drop them into the current medical system. And they might just make things worse. Tackling persistent issues with the very structure of health care delivery has to be the priority if the goal really is to create a better experience for both patients and doctors.

S. Scott Graham is an associate professor at the University of Texas at Austin. He is the author of The Doctor and The Algorithm and The Politics of Pain Medicine.

Source: STAT