What Can Your Organization Learn from the Use Cases of Large Language Models in Medicine and Healthcare?

It has become conventional wisdom that biotech and healthcare are the pace cars in implementing AI use cases with innovative business models and value-creation mechanisms. Other industry sectors should keep a close eye on the critical milestones and pitfalls of the biotech/healthcare space – with an eye toward which platform, product, and service innovations and architectures may have a portable value proposition within your industry. The Stanford Institute for Human-Centered AI (HAI) is doing great work fielding research in medicine and healthcare environments, with quantifiable results that offer a window into AI as a general applied technology during this vast but shallow early phase of “AI for the enterprise” implementation across industry sectors. Details here.

Doctors Receptive to AI Collaboration in Simulated Clinical Case without Introducing Bias

“Doctors worked with a prototype AI assistant and adapted their diagnoses based on AI’s input…this study shows that doctors who work with AI do so collaboratively. It’s not at all adversarial.”

While many healthcare practitioners believe generative language models like ChatGPT will one day be commonplace in medical evaluations, it’s unclear how these tools will fit into the clinical environment. A new study points to a future where human physicians and generative AI collaborate to improve patient outcomes.  In a mock medical environment with role-playing patients reporting chest pains, doctors accepted the advice of a prototype ChatGPT-like medical agent. They even willingly adapted their diagnoses based on the AI’s advice. The upshot was better outcomes for the patients. 

In the trial:

  • Fifty licensed doctors reviewed videos of white male and Black female actors describing their chest pain symptoms, along with the patients’ electrocardiograms, and made triage-, risk-, and treatment-based assessments of the patients.
  • In the study’s next step, the doctors were presented with ChatGPT-based recommendations derived from the same conversations and asked to reevaluate their assessments.

The study found that the doctors were not just receptive to AI advice but willing to reconsider their analyses based on it. The findings go against the conventional wisdom that doctors may be resistant, or even antagonistic, to introducing AI in their workflows. “This study shows that doctors who work with AI do so collaboratively. It’s not at all adversarial,” said Ethan Goh, a healthcare AI researcher at Stanford’s Clinical Excellence Research Center (CERC) and the study’s first author. “When the AI tool is good, the collaboration produces better outcomes.”

The study was published as a preprint on medRxiv and has been accepted for presentation at a peer-reviewed conference, the AMIA Informatics Summit, in Boston this March.

Milestone Moment

“It’s no longer a question of whether LLMs will replace doctors in the clinic — they won’t — but how humans and machines will work together…”

Goh is quick to point out that the AI tools used in the study are only prototypes and are not yet ready or approved for clinical application. However, he said the results, like the prospects for future collaboration between doctors and AI, are encouraging. “The overall point is when we do have those tools, someday, they could prove useful in augmenting the doctors and improving outcomes. And, far from resisting such tools, physicians seem willing, even welcoming, of such advances,” Goh said. In a survey following the trial, most doctors said they fully anticipate that large language model (LLM)-based tools will play a significant role in clinical decision-making.

As such, the authors write that this particular study is “a critical milestone” in the progress of LLMs in medicine. With this study, medicine moves beyond evaluating whether generative LLMs belong in the clinical environment to asking how they will fit into that environment and support human physicians in their work rather than replace them, Goh said. “It’s no longer a question of whether LLMs will replace doctors in the clinic — they won’t — but how humans and machines will work together to make medicine better for everyone,” Goh said.

Generating Medical Errors: GenAI and Erroneous Medical References

“…much of the future of GenAI in medicine—and its regulation—hinges on the ability to substantiate claims. A new study finds that large language models used widely for medical assessments cannot back up their claims.”

Large language models (LLMs) are infiltrating the medical field. One in 10 doctors already uses ChatGPT daily, and patients have taken to ChatGPT to diagnose themselves. The Today Show featured the story of a 4-year-old boy, Alex, whose chronic illness was diagnosed by ChatGPT after more than a dozen doctors failed to do so. This rapid, much-celebrated adoption has come despite substantial uncertainties about the safety, effectiveness, and risk of generative AI (GenAI). U.S. Food and Drug Administration Commissioner Robert Califf has publicly stated that the agency is “struggling” to regulate GenAI.

The reason is that GenAI sits in a gray area between two existing forms of technology:

  • On one hand, the FDA does not regulate sites like WebMD that strictly report known medical information from credible sources. 
  • On the other hand, the FDA carefully evaluates medical devices that interpret patient information and make predictions in medium-to-high-risk domains. To date, the FDA has approved over 700 AI medical devices.

However, because LLMs combine existing medical information with potential ideas beyond it, the critical question is whether such models produce accurate references to substantiate their responses. Such references enable doctors and patients to verify a GenAI assessment and guard against the high prevalence of “hallucinations.” For every 4-year-old Alex, where the creativity of an LLM may produce a diagnosis that physicians missed, many more patients may be led astray by hallucinations. In other words, much of the future of GenAI in medicine—and its regulation—hinges on the ability to substantiate claims.

Evaluating References in LLMs 

Unfortunately, very little evidence exists about LLMs’ ability to substantiate claims. In a new preprint study, we develop an approach to verify how well LLMs can cite medical references and whether these references support the claims generated by the models.   The short answer: poorly. For the most advanced model (GPT-4 with retrieval augmented generation), 30% of individual statements are unsupported, and nearly half of its responses are not fully supported. 

The study evaluates the quality of source verification in medical queries to LLMs, scoring each model on three metrics over X questions:

  • Source URL validity measures the proportion of generated URLs that return a valid webpage.
  • Statement-level support measures the percentage of statements supported by at least one source in the same response.
  • Response-level support measures the percentage of responses that have all of their statements supported.
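To make these three metrics concrete, below is a minimal sketch, in Python, of how such a scoring pass might be implemented. It is not the study’s code: the supports() judgment is a naive substring stand-in for the expert or model-based adjudication a real evaluation requires, and score_response() assumes the generated statements and their cited source texts have already been extracted from each model response.

```python
# A minimal sketch (not the study's code) of the three source-verification
# metrics described above. supports() is a naive placeholder judgment;
# a real evaluation would rely on expert or model-based adjudication.
from urllib.request import urlopen


def url_is_valid(url: str, timeout: float = 5.0) -> bool:
    """Source URL validity: does the cited URL resolve to a real webpage?"""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, ValueError):  # network errors, HTTP errors, bad URLs
        return False


def supports(statement: str, source_text: str) -> bool:
    """Placeholder check: does the source text back up the statement?"""
    return statement.lower() in source_text.lower()  # naive stand-in


def score_response(statements: list[str], sources: dict[str, str]) -> dict:
    """Score one model response: its generated statements and cited sources."""
    valid_urls = [u for u in sources if url_is_valid(u)]
    supported = [
        any(supports(s, sources[u]) for u in valid_urls) for s in statements
    ]
    return {
        # proportion of cited URLs that return a valid webpage
        "source_url_validity": len(valid_urls) / max(len(sources), 1),
        # share of statements supported by at least one source in the response
        "statement_level_support": sum(supported) / max(len(statements), 1),
        # whether every statement in the response is supported
        "response_fully_supported": bool(statements) and all(supported),
    }
```

Aggregating score_response() across a full question set would yield the three aggregate metrics described above.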

‘A Long Way to Go’

“As LLMs grow in their capabilities and usage, regulators and doctors should consider how these models are evaluated, used, and integrated.”

Many commentators have declared the end of health care as we know it, given the apparent ability of LLMs to pass U.S. Medical Licensing Exams. However, healthcare practice involves more than answering a multiple-choice test. It involves substantiating, explaining, and assessing claims with reliable scientific sources. And on that score, GenAI still has a long way to go. 

Promising research directions include more domain-informed work, such as adapting retrieval-augmented generation (RAG) to medical applications. Source verification should be regularly evaluated to ensure that models provide credible and reliable information. At least under the FDA’s current approach – which distinguishes between medical knowledge bases and diagnostic tools regulated as medical devices – widely used LLMs pose a problem: many of their responses cannot be consistently and fully supported by existing medical sources. As LLMs grow in their capabilities and usage, regulators and doctors should consider how these models are evaluated, used, and integrated.
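As a rough illustration of that research direction, the sketch below shows the basic retrieval-augmented pattern: retrieve candidate source documents first, then require the model to answer from, and cite, only those sources. The keyword-overlap retriever, the in-memory corpus of URL-to-text entries, and the ask_llm callable are illustrative placeholders rather than anything from the study; a medical deployment would use a curated literature index and a far stronger retriever.

```python
# A minimal sketch, under stated assumptions, of retrieval-augmented
# generation (RAG) over a citable corpus. The retriever, corpus, and
# ask_llm callable are illustrative placeholders, not the study's code.
from typing import Callable


def retrieve(query: str, corpus: dict[str, str], k: int = 3) -> list[str]:
    """Rank documents by naive keyword overlap and return the top-k URLs."""
    terms = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda url: len(terms & set(corpus[url].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def answer_with_sources(
    query: str, corpus: dict[str, str], ask_llm: Callable[[str], str]
) -> dict:
    """Prepend retrieved passages to the prompt and ask for cited answers."""
    urls = retrieve(query, corpus)
    context = "\n\n".join(f"[{u}] {corpus[u]}" for u in urls)
    prompt = (
        "Answer using ONLY the sources below and cite each source's URL "
        "after every claim it supports.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return {"answer": ask_llm(prompt), "sources": urls}
```

The output of answer_with_sources() could then be fed into a source-verification pass like the one sketched earlier, pairing generation with the kind of regular evaluation the authors call for.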

What Next?

Consider the following when applying the lessons learned from these two HAI studies in the healthcare sector to your own organization or industry:

  • What is your industry’s equivalent of a “clinical environment for healthcare practitioners providing medical evaluations”?  An equivalent, for example, might be the customer success managers who manage your top 20 company/client relationships (measured by revenue and/or lifetime total value of the customer).
    • Whatever the specific example in your sector, does your organization have access to field studies that speak to the comfort level, antagonistic attitudes, adoption rate, and efficacy of AI augmentation tools deployed “in the wild” amongst high-performance employees in your market?   
    • Does your organization, department, or division have the wherewithal and resources to be a source of such ethnographic, human subject-centered research efforts?
    • Is there research available that can inform business model innovation and value creation – and provide benchmarks and quantifiable metrics around the success or failure of prototype AI product or service launches within your organization, amongst competitors, or in your industry sector?   
  • The Stanford HAI researchers understood the healthcare industry’s attitudinal baseline, i.e., the “conventional wisdom that doctors may be resistant, or even antagonistic, to introducing AI in their workflows.”  Is there an equivalent conventional wisdom within your industry about core employees’ attitudes towards “introducing AI in their workflows?” 
    • If so, how can you design follow-on research that tests the efficacy of AI prototypes introduced into your workflows, based on an understanding of the comfort level of the human subjects participating in the research?
    • If not, what industry organizations or internal research teams are exploring research about the adoption rate of generative AI and prototype AI assistants amongst highly skilled professionals in your industry (who may have an adversarial mindset towards using AI for evaluation and assessment tasks closely tied to their subject matter expertise, unique training, and job security)? 
  • What role will the trust and accuracy of LLM outputs play in the adoption rate, integration, regulation, and overall success of generative AI within your organization or industry subsector? 
    • What is the equivalent in your industry sector of how the “future of GenAI in medicine—and its regulation—hinges on the ability of LLMs to substantiate claims accurately?” For example, in law, the accuracy of LLMs might be measured against the accuracy and precision of the work usually done by law clerks and paralegals.
      • In this legal example, what is the probability that a legal team takes an inaccurate, hallucinated LLM output into a high-stakes corporate law meeting or into court in a criminal case – and what would the impact be if it did?
    • Right now, the risk assessment for all industries, not just the legal profession, should be:
      1. The probability that this type of LLM inaccuracy will occur: HIGH.
      2. The impact if this inaccuracy is brought into a real-world, high-stakes setting: HIGH. In some cases, it could be life or death or, at the very least, professionally catastrophic.
    • Is there an equivalent real-world, high-risk narrative for highly skilled practitioners—like doctors and lawyers—in your professional arena?
        • If so, how can you position this risk narrative – this industry-wide “pain point”, “barrier to entry”, or “problem set” – at the center of your organization’s risk assessment of AI?     

Additional OODA Loop Resources 

Technology Convergence and Market Disruption: Rapid technological advancements are changing market dynamics and user expectations. See Disruptive and Exponential Technologies.

The New Tech Trinity: Artificial Intelligence, BioTech, Quantum Tech: These converging technologies will make monumental shifts in the world. The new Tech Trinity will redefine our economy, threaten and fortify our national security, and revolutionize our intelligence community. None of us are ready for this. This convergence requires a deepened commitment to foresight, preparation, and planning on a level that is not occurring anywhere. See The New Tech Trinity.

Benefits of Automation and New Technology: Automation, AI, robotics, and Robotic Process Automation are improving business efficiency. New sensors, especially quantum ones, are revolutionizing the healthcare and national security sectors. Advanced WiFi, cellular, and space-based communication technologies enhance distributed work capabilities. See: Advanced Automation and New Technologies

Rise of the Metaverse: The immersive digital universe is expected to reshape internet interactions, education, social networking, and entertainment. See Future of the Metaverse.

About the Author

Daniel Pereira

Daniel Pereira is research director at OODA. He is a foresight strategist, creative technologist, and an information communication technology (ICT) and digital media researcher with 20+ years of experience directing public/private partnerships and strategic innovation initiatives.