Large language models have already changed how people write, search, and summarize information. Now Google is making a sharper bet that the same technology, carefully adapted, can support real clinical work. In a recent episode of NEJM AI Grand Rounds, Google researchers Dr. Alan Karthikesalingam and Vivek Natarajan described how their team is modifying and evaluating LLMs for medical use cases, moving the conversation from “Can an LLM answer a question?” to “Can it perform safely inside clinical workflows?”
Why this matters: healthcare doesn’t need clever chat—it needs reliable systems
Medicine is an unusually unforgiving environment for generative AI. Most industries can tolerate occasional errors or vague outputs; clinicians cannot. A model that sounds confident while being wrong is more than a quality issue—it’s a safety event waiting to happen. That’s why the most important signal in the NEJM AI discussion isn’t that Google is experimenting with LLMs (everyone is), but that it is emphasizing adaptation and evaluation for clinical applications, according to the episode.
In practice, “clinical LLMs” are less about producing eloquent paragraphs and more about doing high-friction, time-consuming work: distilling lengthy notes into structured summaries, drafting patient-friendly explanations, assisting documentation, or helping clinicians quickly retrieve relevant context from a chart. These are tasks where language is the interface—but the underlying requirement is precision, traceability, and alignment with clinical intent.
The hard part isn’t capability—it’s calibration
LLMs are general-purpose systems trained on broad internet-scale corpora. Clinical environments introduce three constraints that force a rethink:
1) Ground truth is messy. Medicine is filled with ambiguity: evolving diagnoses, incomplete histories, and differing standards across institutions. Evaluation can’t rely on simplistic right/wrong grading; it needs clinically meaningful benchmarks and expert review.
2) Context is everything. A correct answer in one patient can be incorrect in another due to comorbidities, medications, pregnancy status, allergies, or local practice. LLM outputs must be anchored in patient-specific data and up-to-date evidence—not generic patterns.
3) Risk isn’t evenly distributed. An LLM mistake in a discharge summary may be inconvenient; an error in anticoagulation guidance can be catastrophic. Clinical adoption will likely come in tiers, starting with lower-risk administrative support and gradually moving toward higher-stakes decision support under strict guardrails.
What Google appears to be exploring, again as described on NEJM AI Grand Rounds, is the systematic work of reshaping the model and the testing process so performance is measured the way clinicians experience it: does it reduce cognitive load, preserve nuance, and avoid harmful failure modes?
Implications for clinicians: less clerical work, new verification duties
If LLMs are integrated thoughtfully, they could reduce the administrative burden that has fueled burnout for years. The near-term opportunity is not replacing clinicians; it’s compressing the “paperwork tax” of medicine—chart review, documentation, prior authorizations, referral letters, and after-visit summaries.
But the tradeoff is a new kind of professional responsibility: verification. Clinicians will need to review AI-generated text for subtle errors, missing contraindications, or misleading phrasing. That review is itself vulnerable to “automation bias,” where a polished answer is trusted too readily. Health systems will have to train users on how to interrogate model outputs and build interfaces that encourage healthy skepticism (for example, highlighting uncertainty, surfacing sources, and showing the patient data used).
There’s also an operational shift: quality improvement teams may start monitoring AI behavior the way they monitor lab turnaround times or readmission rates—because model performance can drift as documentation practices change, guidelines update, or patient populations shift.
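To make the monitoring idea concrete, here is a minimal Python sketch of one way a quality team might watch for drift. It is an illustration under stated assumptions, not something described in the episode: the class name ReviewDriftMonitor, the baseline rate, the tolerance, and the window size are all hypothetical, and the signal being tracked (whether a clinician had to correct an AI draft) is just one plausible choice.

```python
from collections import deque


class ReviewDriftMonitor:
    """Hypothetical monitor: tracks how often clinicians had to correct AI drafts.

    All thresholds are illustrative assumptions, not clinical recommendations.
    """

    def __init__(self, baseline_rate, tolerance, window_size):
        self.baseline_rate = baseline_rate  # correction rate seen during validation
        self.tolerance = tolerance          # allowed increase before alerting
        self.window = deque(maxlen=window_size)  # keeps only the most recent reviews

    def record(self, needed_correction):
        """Store the outcome of one clinician review (True = draft was corrected)."""
        self.window.append(bool(needed_correction))

    def current_rate(self):
        """Return the correction rate over the current window, or None if empty."""
        if not self.window:
            return None
        return sum(self.window) / len(self.window)

    def is_drifting(self):
        """Alert only once the window is full, to avoid noisy early readings."""
        if len(self.window) < self.window.maxlen:
            return False
        return self.current_rate() > self.baseline_rate + self.tolerance


if __name__ == "__main__":
    monitor = ReviewDriftMonitor(baseline_rate=0.10, tolerance=0.05, window_size=20)
    for corrected in [False] * 18 + [True] * 2:
        monitor.record(corrected)
    print(monitor.current_rate(), monitor.is_drifting())
```

A rolling window is used here so that recent behavior dominates the reading, much as a dashboard for readmission rates would focus on the latest period. A real deployment would need statistically sound thresholds and a defined owner for each alert.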
Implications for patients: clearer communication—if privacy and equity are handled well
For patients, LLMs could meaningfully improve access to understandable information. A model that can translate clinician language into plain English, generate multilingual explanations, or produce tailored instructions could reduce confusion and improve adherence—especially for complex chronic conditions.
However, patients also bear the downside if safeguards are weak. Privacy is a first-order concern: clinical LLMs are only as trustworthy as the policies around data handling, logging, retention, and access controls. Equity is another: models trained on uneven data can underperform for historically underserved populations, amplifying disparities in communication quality or clinical recommendations.
Done right, LLMs could become a “universal interpreter” between medical systems and human needs. Done poorly, they could create a new layer of opaque decision-making that patients cannot contest.
The industry takeaway: evaluation is becoming the product
The competitive moat in clinical LLMs may not be the base model itself—open and proprietary options are proliferating—but the clinical-grade evaluation stack: benchmarks, human-in-the-loop review, domain-specific safety testing, and workflow trials that prove real-world value. The NEJM AI conversation highlights that serious teams are thinking less about spectacle and more about evidence.
Expect the next wave of progress to look less like viral chat demos and more like controlled deployments: pilots in radiology workflows, note summarization for inpatient teams, or patient messaging tools with strict escalation pathways to humans. Health systems will ask for measurable endpoints—time saved, error rates, patient comprehension, clinician satisfaction—paired with governance frameworks that define accountability when the model is wrong.
What comes next: from “LLM in the chart” to “LLM under governance”
The forward-looking question isn’t whether LLMs will enter clinical environments—it’s how quickly the field can standardize guardrails that make them safe, auditable, and worth trusting. Over the next 12–24 months, the most consequential advances will likely be hybrid systems: LLMs that generate language, coupled with retrieval tools that cite institutional policies and medical references, and wrapped in oversight mechanisms that constrain high-risk behavior.
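The hybrid pattern described above (generation, retrieval with citations, and an oversight layer) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not Google's design or any vendor's API: the term list HIGH_RISK_TERMS, the keyword-overlap retrieval, the document IDs, and the generate callable standing in for a model are all hypothetical.

```python
from dataclasses import dataclass, field

# Illustrative only: a real system would use a clinically reviewed risk policy.
HIGH_RISK_TERMS = {"dose", "dosing", "anticoagulation", "warfarin"}


@dataclass
class Draft:
    text: str
    sources: list = field(default_factory=list)  # IDs of documents cited
    escalate: bool = False                        # True = route to a human
    reason: str = ""


def tokenize(text):
    """Lowercase words with surrounding punctuation stripped."""
    return {word.strip(".,?!;:").lower() for word in text.split()}


def retrieve(question, documents, top_k=2):
    """Rank documents by naive keyword overlap (a stand-in for real retrieval)."""
    query = tokenize(question)
    scored = []
    for doc_id, body in documents.items():
        overlap = len(query & tokenize(body))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def answer(question, documents, generate):
    """Oversight wrapper: escalate high-risk or unsupported questions to a human."""
    if tokenize(question) & HIGH_RISK_TERMS:
        return Draft(text="", escalate=True, reason="high-risk topic")
    sources = retrieve(question, documents)
    if not sources:
        return Draft(text="", escalate=True, reason="no supporting source")
    context = "\n".join(documents[doc_id] for doc_id in sources)
    # `generate` is a hypothetical callable wrapping whatever model is in use.
    return Draft(text=generate(question, context), sources=sources)
```

The design choice worth noting is ordering: the oversight checks run before any text is generated, so a high-risk or unsupported question never produces a polished answer that a user might over-trust.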
As Google and others push further into clinical evaluation, the winners won’t be those who make medicine sound intelligent—they’ll be those who can prove, repeatedly, that the system improves care without introducing hidden risk.
Source: Reported based on the NEJM AI Grand Rounds episode “Google’s Exploration of Large Language Models in Medicine,” featuring Dr. Alan Karthikesalingam and Vivek Natarajan (https://ai-podcast.nejm.org/e/google-s-exploration-of-large-language-models-in-medicine/).

