Opinion: Three ways to test medical AI for safety

This article is adapted from “The AI Revolution in Medicine: GPT-4 and Beyond,” by Peter Lee, Carey Goldberg, and Isaac Kohane, published by Pearson.

“Thrashing.” That’s what old-school computer scientists call it when an operating system is running so many tasks at once that just switching among them basically crashes it. And that’s how I felt last fall when I tested GPT-4, the far more powerful successor to ChatGPT, on medical challenges for the first time. I was caught in a stuttering stasis between two competing, nearly overwhelming realizations.


On the one hand, here I was having a sophisticated medical conversation with an artificial intelligence platform that, I knew for a fact, knew nothing explicitly about medicine. (In contrast, I spent much of my youth earning an M.D., as well as a Ph.D. in computer science.) All it did was compute the next word in a sequence of words in our conversation. That such a “know-nothing” process could engage in a conversation about a diagnostic dilemma, hormonal regulation, and organ development in a way that 99 percent of practicing physicians could not keep up with was mind-blowing.

On the other hand, just as mind-blowing was the anxious realization that millions of families would soon have access to this impressive medical expertise, and I could not figure out how we could guarantee or certify that advice from GPT-4 would be safe or effective.

Hence my thrash state, rapidly alternating between wonder and concern. I’ve not stopped since. But I have, at least, managed to form more coherent thoughts than I did in my first few days of conversing with GPT-4, and this conclusion is foremost in my mind:


The new wave of more general AI promises tremendous potential boons for medicine. But its potential risks are also so significant that for the foreseeable future, it cannot be used in medical settings without direct human supervision.

So the most urgent question is: How do we test this so we can start using it as safely as possible?

Consider an initial experiment I ran with GPT-4. It began with a true story: A few years ago, I was called down to the newborn nursery for a consultation in my specialty, pediatric endocrinology, about a baby born full-term 18 hours earlier. The pediatrician who had called me had observed that the baby had hypospadias (urine would come out of a hole on the underside of the phallus instead of at its tip) and had a “small penis.” My exam confirmed the hypospadias and a stretched phallic length of just 1.6 cm but no other irregularities. The baby had a typically ridged scrotum, but I could not feel the usual masses of testicles inside it.

I fed all this to GPT-4 and asked for a diagnosis. I’ll omit the endocrinological esoterica here and just say that it offered four possibilities, including congenital adrenal hyperplasia (CAH) and androgen insensitivity syndrome, both genetic disorders that affect hormones.

I asked about the next steps in the diagnosis, and it responded just as I would, with all my years of study and experience. Then I shared with GPT-4 that an ultrasound of the baby’s abdomen revealed a uterus, and I also relayed some hormonal levels. GPT-4 responded:

Based on these new results, I have to revise my differential diagnosis and now consider CAH as the most likely diagnosis for this baby. The elevated 17-hydroxyprogesterone, ACTH [a hormone made by the pituitary that stimulates the production of steroids by the adrenal gland] and 11-deoxycortisol levels are indicative of a defect in the steroidogenic pathway that leads to cortisol deficiency and androgen excess.

It pointed to a less common form of the disorder, and it suggested the correct tests to confirm that diagnosis and the correct hormone therapy to follow.

It was right. Further testing bore out exactly what it surmised.

GPT-4 had just successfully diagnosed a disorder so rare it affects fewer than 1 in every 100,000 babies. Impressive.

But there’s a risk with AI in medicine. GPT-4, like ChatGPT, has a problem with “hallucinations,” falsehoods that it appears to simply make up and present as facts, without indicating that it is going beyond its knowledge set. To grapple with this problem, I see three possible approaches, but no quick-and-easy answer without a human “in the loop.”

The trial: Medical staffers and regulators are all very familiar with clinical trials for drugs and devices; the Food and Drug Administration routinely examines trials run on AI products and has already approved more than 500 AI-augmented devices after assessing the evidence via testing, much as it does for other new medical devices. For GPT-4, though, the trial method has issues because what it can do is so general that its domain of expertise cannot be fully evaluated. The tasks of making diagnoses, choosing treatments, and managing care are so vast that no trial can offer confidence that it will not make an unanticipated and dangerous conclusion or suggestion with the next patient.

The trainee: GPT-4 gets more than 90 percent of questions on medical licensing exams correct. So, maybe we could ascertain that GPT-4 is safe for medical participation as we do a medical trainee? Well, many already complain that the hoops we have trainees jump through do not fully vet doctors-to-be. Also, baked into the training path are assumptions of a shared value system and the ability to make everyday decisions informed by common sense and not merely by medical training. There is no such common ground currently with GPT-4. So no, not good enough.

If all this sounds disappointing given all GPT-4’s capabilities, it needn’t be. Even if it does not act autonomously, GPT-4’s potential for improving health care appears off the charts — for supplementing rather than replacing healthcare providers.

The torchbearer: Even without further study, we can see that GPT-4 excels at one aspect of medicine: superhuman clinical performance. Think of the medical hero/villain of the TV series “House.” With the new AI, the super-doctor “torchbearer” can now move beyond fabled clinical colleagues or TV archetypes. It could become an everyday phenomenon.

Let’s take the case of a boy we’ll call John, one I encountered through the privilege of my work over the last decade with the Undiagnosed Disease Network. John was healthy until well past the toddler stage, then stopped meeting developmental milestones and steadily lost essential functions such as speech and walking. A medical odyssey finally brought his parents to one of the clinical centers associated with the Undiagnosed Disease Network, where doctors identified a gene that they judged responsible, one required to synthesize many neurotransmitters. So John was given a cocktail of the missing neurotransmitters, and within a few months, he was walking and talking. That success validated the genetic diagnosis.

I posed the details of his case to GPT-4, along with initial genetic findings, and it responded — correctly — with the gene likeliest to be at fault. Does this mean that GPT-4 or its kin could be part of a completely computational pipeline to develop a genetic diagnosis for undiagnosed patients? It certainly seems so.

Notably, I don’t know how GPT-4 figured it out. I have no way of knowing which cases GPT-4 will excel at and which it will fail on. So unlike in “House,” a computational torchbearer must be configured as a dedicated team player, not a solo showboat. But what a team player it can be! GPT-4 has shown that it can not only help with diagnoses but also write requests for insurance authorization, document patient-doctor interactions for the medical record, summarize complex medical research, educate patients at just the right literacy level in multiple languages, and much more. Always, of course, with humans prompting it and checking its responses, but still.

There are good reasons why health care professionals and patients alike are worried about AI in medicine. There were also, and continue to be, reasons for concern about the introduction of genomics into medicine. We learned two things from this earlier introduction. First, extended public discussion to establish a broad societal consensus is critical. Second, while this discussion evolves, we need to be fastidious about safety, but also aware that, with new technologies that can dramatically help patients, unnecessary delays cause unnecessary harm.

Isaac “Zak” Kohane, M.D., Ph.D., is inaugural chair of the Department of Biomedical Informatics at Harvard Medical School and co-author of “The AI Revolution in Medicine: GPT-4 and Beyond.”

Source: STAT