Mm-hm, uh-huh: AI-powered medical scribes still can’t understand our conversations

“Your vision is good?” asked the doctor. “Mm-hm,” replied the patient. “And your dentures fit fine?” “Yep,” the patient said. “No problems with them?” the doctor followed up. “Mm,” the patient said, indicating everything was OK.

The back-and-forth would have made perfect sense to the two people talking in the clinic. But to the automatic speech recognition tool tasked with transcribing it and turning it into visit notes, the “mm-hms” and mumbles became a garbled mess. “Your vision is good?” was caught clearly, but the patient’s reply was documented, nonsensically, as “is it,” making the machine’s version of the encounter all but unintelligible.

While medical providers are trying to decrease physician burnout by turning to tools sold by Microsoft and others to transcribe patient-provider conversations and write visit notes, a recent study found that speech-to-text engines meant to transcribe medical conversations do not accurately record clinically relevant “non-lexical conversational sounds,” or NLCS. The difference between “uh-huh” and “uh-uh” is subtle — but very important — in a clinical context, especially when taking a medical history. However, artificial intelligence tools are still not very good at telling them apart.


For companies like 3M, Microsoft-owned Nuance, Google, and Amazon, figuring out how to correctly record and interpret the more casual parts of conversation is a key hurdle to clear as they seek to develop clinical intelligence platforms.

“If I go to see my doctor, they ask me questions, I say, ‘Mm-hm.’ I don’t say ‘yes.’ And if you cannot pick that up, you can cause some serious problems,” said Kai Zheng, professor of informatics and emergency medicine at University of California, Irvine. “It may not be a substantial amount out of all words — as we reported [in the] paper, it’s a small amount — but they carry very important information.”


The mistakes the transcription tools make show just how much unconscious interpretation humans do during a conversation — and how hard that still is for machines to understand.

“One reason those non-lexical sound is difficult to process is also they are generally low volume. Like ‘mm-hm,’ as opposed to you actually speak something, which means that in real-world scenarios the situation could be much worse,” said Zheng.

When STAT transcribed Zheng’s remarks using Trint, a popular consumer transcription tool, it mis-transcribed the word “non-” and the critical word “volume,” which exemplifies another problem Zheng pointed out: that these speech recognition engines often don’t transcribe non-native English speakers correctly. These errors further complicate the use of the tools in real-world scenarios.

The study by Zheng and his colleagues, published recently in the Journal of the American Medical Informatics Association, looked at a dataset of 36 primary care visits that were recorded between 2007 and 2009 and have previously been used for a variety of research projects. To rule out errors that an automatic speech recognition (ASR) engine might make because of a non-native English speaker or a poor recording with background noise, the research team re-recorded each doctor–patient conversation with native English speakers using a good microphone. (The re-enactors did not know the NLCS would be evaluated.)

The researchers transcribed these recordings with Google’s Cloud Speech-to-Text kit and Amazon’s Transcribe Medical engine, using the platforms’ respective clinical conversation models. They counted all of the NLCS in the transcripts, more than 3,000 of them, and pinpointed which ones conveyed clinically relevant information, such as answers to questions like, “Are you allergic to aspirin?” Those were distinguished from sounds that simply showed the listener was paying attention, served as filler, or signaled a question.

Out of the 76 sounds that conveyed clinically important information, 87% of them were replaced with an erroneous word using the Google ASR, versus 34% for Amazon’s. However, 65% of the clinically relevant NLCS were deleted using Amazon’s model, as opposed to 8% with the Google tool.
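The arithmetic behind those percentages can be sketched in a few lines of Python. This is a rough illustration rather than the researchers' own analysis: the names `TOTAL_NLCS`, `REPORTED`, and `summarize` are hypothetical, the counts are rounded approximations derived from the reported percentages, and the "intact" share assumes that any sound neither substituted nor deleted was transcribed correctly, which the article does not state.

```python
# Figures reported in the article: 76 clinically relevant NLCS, with the
# share substituted (replaced by a wrong word) or deleted by each engine.
TOTAL_NLCS = 76
REPORTED = {
    "Google": {"substituted": 0.87, "deleted": 0.08},
    "Amazon": {"substituted": 0.34, "deleted": 0.65},
}


def summarize(total, rates):
    """Return the implied error share and approximate counts for one engine."""
    error_share = rates["substituted"] + rates["deleted"]
    return {
        "error_share": round(error_share, 2),
        # Assumption: anything neither substituted nor deleted was kept intact.
        "intact_share": round(1 - error_share, 2),
        # Counts are approximations rounded from the reported percentages.
        "approx_substituted": round(total * rates["substituted"]),
        "approx_deleted": round(total * rates["deleted"]),
    }


SUMMARY = {engine: summarize(TOTAL_NLCS, rates) for engine, rates in REPORTED.items()}

for engine, row in SUMMARY.items():
    print(engine, row)
```

Read this way, the two engines fail in different ways, mostly substitution for Google and mostly deletion for Amazon, but under the stated assumption each gets only a small fraction of the clinically relevant sounds right.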

“This deletion of ‘mm-hm’ in a very important location could have pretty serious effects downstream, especially when physicians, in the future, might be accustomed to the summarization and it’s possible for them to miss this bit of information among the many other data elements that they might have to be accountable for,” said Tran, an M.D.-Ph.D. student at UC Irvine and primary author on the study. (A transcription tool, ironically, deleted “Mm-hm” in Tran’s quote.)

An exact transcription of Tran’s words underscores another issue that he and colleagues raise in the paper. People often backtrack in conversation, as they remember new information that’s relevant to the discussion or clarify their point in the natural flow of a comment. “Physicians are sort of accust—begin to—if you—if you think about in the future, they might be accustomed to the summarization and it’s possible for them to, like, miss this bit of information among the many other element—data elements that they might have to be accountable for,” the transcription read.

Though it looks odd on paper, such backtracking is so natural in human conversations that it often goes unnoticed by speakers and listeners. For clinical AI platforms, though, it poses a further problem. A doctor might ask if a patient has a certain medication, and the patient might say, “‘No, I don’t have it,’ and then two seconds later the patient might say, ‘Oh, actually, I have it,’” said Zheng. “So you actually need to figure that out — not only transcribing that, but also need to figure out what’s the actual accurate information to be documented.”

It will be a long time before an AI platform can transcribe a patient visit and produce the corresponding documentation automatically, according to Zheng. Because automated speech recognition engines map sounds onto words by using rhythm, intonation, context, and other variables, Tran said he believes that performance on NLCS can be improved if developers focus on optimizing for these sounds specifically.

Zheng hesitated to put an estimate on when full automation might be possible and noted that even then, he thinks that only 80–90% of the tasks involved in making comprehensive notes from a provider visit could be done completely automatically. Currently, Nuance has several hospital partners using its DAX clinical AI platform, but humans are still checking the transcripts before finalizing them, which creates a several-hour delay. The company has said it is piloting its fully automated DAX Express product in several health systems and is hoping to ramp up use of the product later this year.

It’s unclear if this automation could ever be done entirely without humans, according to Zheng. There are so many complications, from too many speakers being in the room, to bad sound quality, to the platform having to translate how laypeople speak into medical terms, that it would be difficult to fully trust an automatically generated document. There are legal and liability concerns, too, that would come into play with an AI fully responsible for notes, even if steps along the way do still safely save physicians time. But even achieving near-perfect transcription, Zheng said, could cut the pressure on doctors in half, saving them time and improving patients’ experience.

Source: STAT