Category: Research

Academic research, papers, and breakthroughs in healthcare AI

  • Virtual ERs Are Scaling Fast—This New Review Shows Where AI Helps, and Where It Can Hurt

    Emergency care is moving onto screens—and artificial intelligence is increasingly the engine behind it. A new systematic review in the International Journal of Medical Informatics maps how AI is being used in virtual emergency care, highlighting both promising gains (faster triage, smarter routing, better decision support) and persistent gaps in safety evidence, equity, and real-world deployment.

    Why AI in “virtual emergency” is suddenly a big deal

    Virtual urgent and emergency care has matured from a pandemic-era convenience into a strategic front door for many health systems. The drivers are familiar: overcrowded EDs, staffing shortages, rising acuity, and patient expectations shaped by consumer telehealth. What’s changed is the complexity of the work being asked of remote teams. Virtual emergency programs are no longer just “video visits for low-acuity issues.” They increasingly include paramedic-supported home assessments, remote patient monitoring, nurse navigation, and escalation pathways that can dispatch ambulances or coordinate direct-to-bed admissions.

    That complexity is also what makes AI attractive. In theory, algorithms can help a virtual emergency clinician make faster sense of incomplete information, detect deterioration earlier, and route the right patient to the right setting—without turning every encounter into an administrative burden. But emergency medicine is also where mistakes are least tolerated. The cost of an AI-driven miss—say, a subtle stroke flagged as “non-urgent”—is far higher than a scheduling error in primary care.

    What the review suggests AI is actually being used for

    According to the systematic review by Ravi Shankar and colleagues, published July 1, 2026, in the International Journal of Medical Informatics, AI applications in virtual emergency care cluster around a few core functions. One is triage: tools that stratify risk, prioritize queues, or recommend next steps based on symptoms, vitals, or prior history. Another is clinical decision support—algorithms that can flag red flags, suggest differentials, or support ordering and referral decisions. A third is operational: optimizing staffing, predicting volumes, and helping systems decide who can be safely managed at home versus who needs ED-level evaluation.

    Read between the lines, and the field is still in a “proof and pilot” era. Many of these tools work well in controlled settings or narrow cohorts, but fewer have robust validation across diverse populations and the messy reality of virtual encounters—variable lighting and camera quality, incomplete histories, language barriers, and missing vitals.

    The unique risks of AI when the exam is mediated by a screen

    Virtual emergency care has an inherent data problem: clinicians often have less objective information than in-person teams. Even with home devices, vitals can be absent, inaccurate, or delayed. That creates a temptation to lean harder on pattern-recognition systems trained on richer hospital data, where lab values and continuous monitoring are plentiful. If an AI model was developed on in-person ED data, it may fail quietly when applied to tele-triage inputs that are noisier and less complete.

    Bias concerns also get sharper in virtual settings. Access to bandwidth, device quality, digital literacy, and private space varies widely; these factors can correlate with socioeconomic status and race, shaping both what information is available to AI and how confidently it makes a recommendation. The result can be a new kind of inequity: not only who gets care, but who gets correctly classified as needing urgent escalation.

    Finally, virtual emergency programs are workflow-heavy by nature—handoffs to EMS, referrals to urgent care, follow-up pathways, prescriptions, and documentation. AI that generates recommendations without integrating into these workflows can create “alert fatigue at a distance,” where clinicians spend precious minutes adjudicating suggestions instead of treating patients.

    What this means for clinicians: augmentation, not autopilot

    For emergency physicians, nurses, and paramedic teams, the near-term value of AI in virtual emergency care is likely pragmatic: prioritization, documentation support, and early warning cues—not autonomous triage. The review’s focus on AI-enabled triage and decision support underscores a key principle: in emergency care, AI should reduce cognitive load and widen a clinician’s situational awareness, while leaving responsibility and final judgment with licensed professionals.

    Health systems adopting these tools should demand more than accuracy metrics. They should ask: How often does the model change a disposition decision? In which subgroups does it perform worse? What happens when the AI and clinician disagree? And how is performance monitored over time as patient populations and care pathways evolve?
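
    Two of these questions can be made concrete with very little code. The sketch below, run on made-up encounters, computes how often the final disposition differs from the clinician's initial plan and how sensitivity varies by subgroup. The function names, the record layout, and the "video" and "phone_only" subgroups are assumptions for illustration; none of them comes from the review.

```python
from collections import defaultdict


def disposition_change_rate(initial_plan, final_disposition):
    """Share of encounters where the final disposition differs from the clinician's initial plan."""
    changed = sum(1 for a, b in zip(initial_plan, final_disposition) if a != b)
    return changed / len(initial_plan)


def sensitivity_by_subgroup(records):
    """records: iterable of (subgroup, truly_urgent, model_flagged_urgent) tuples."""
    hits = defaultdict(int)
    positives = defaultdict(int)
    for subgroup, truly_urgent, flagged in records:
        if truly_urgent:
            positives[subgroup] += 1
            if flagged:
                hits[subgroup] += 1
    # Sensitivity = urgent cases the model flagged / all urgent cases, per subgroup.
    return {group: hits[group] / positives[group] for group in positives}


# Hypothetical encounters, invented purely for illustration.
records = [
    ("video", True, True), ("video", True, True), ("video", True, False),
    ("video", False, False),
    ("phone_only", True, True), ("phone_only", True, False),
    ("phone_only", False, True),
]
sens = sensitivity_by_subgroup(records)  # video: 2/3, phone_only: 1/2

change_rate = disposition_change_rate(
    ["home", "ed", "home", "home"],  # clinician plan before seeing the AI output
    ["home", "ed", "ed", "home"],    # final disposition
)
```

    A gap like the one between the two toy subgroups is the kind of signal a single headline accuracy number would hide.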

    What this means for patients: faster access—if trust holds

    For patients, AI-assisted virtual emergency care could mean shorter waits, clearer routing (self-care vs. urgent care vs. ED), and earlier detection of serious problems. It can also reduce the “bounce” effect—patients sent to the ED after a telehealth visit because the clinician lacks confidence without an exam.

    But patient trust will hinge on transparency and outcomes. People will tolerate algorithmic support if it demonstrably improves speed and safety. They will not tolerate a system that feels like a gatekeeper designed to keep them away from in-person care. Clear communication—why a patient is being escalated or not, what data was used, and what symptoms should trigger reassessment—will matter as much as model performance.

    Where virtual emergency AI goes next

    The most important next phase is evaluation in the wild. As virtual emergency programs expand, AI tools will need prospective studies that measure clinical outcomes, not just agreement with clinician labels. Expect more focus on hybrid models that combine patient-reported data, device vitals, and longitudinal EHR context—alongside safeguards like uncertainty estimation, escalation triggers, and continuous auditing.
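
    One simple form such safeguards could take is a routing rule that defers to a clinician whenever the model's risk score falls in an uncertain band or vitals are missing. The thresholds, parameter names, and routing labels below are hypothetical and not clinically validated. This is a sketch of the idea, not a tool described in the review.

```python
from typing import Optional


def route_encounter(
    risk: float,
    heart_rate: Optional[int],
    spo2: Optional[int],
    low: float = 0.2,
    high: float = 0.6,
) -> str:
    """Return a routing suggestion. Thresholds are illustrative only."""
    # Clearly high risk: escalate regardless of what else is known.
    if risk >= high:
        return "escalate"
    # Missing vitals: the model saw less information than it expects, so defer to a human.
    if heart_rate is None or spo2 is None:
        return "clinician_review"
    # Uncertain band: neither clearly low nor clearly high risk.
    if risk > low:
        return "clinician_review"
    return "routine_virtual_visit"
```

    The design choice here is that the automated path is the narrow one: anything incomplete or ambiguous falls back to a person.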

    Longer term, virtual emergency care could become a “learning system” in which triage pathways improve continuously as outcome data feeds back into them. If that happens, the competitive advantage won’t be a single triage model; it will be the governance, monitoring, and clinical integration that keep AI safe as conditions change. The systematic review in the International Journal of Medical Informatics is a timely marker: the tools are arriving, but the industry’s credibility will be determined by how rigorously it proves they help more than they harm.

    Source: Shankar R, Wang L, Hoe HS, et al. “The role of artificial intelligence in virtual emergency care: a systematic review.” International Journal of Medical Informatics (July 1, 2026), as reported by the journal’s publication page: https://www.sciencedirect.com/science/article/pii/S1386505626001516

  • Why Mentorship May Be the Missing Infrastructure in Healthcare Machine Learning

    Healthcare has no shortage of machine learning pilots, promising papers, or new foundation models—but it still lacks something more basic: enough people who know how to build, evaluate, and deploy these systems responsibly in the real world. A recent post from the Stanford Center for AI in Medicine & Imaging (AIMI) argues that mentoring is not a “nice-to-have” in machine learning education; it’s an enabling layer that determines who enters the field, how quickly they become effective, and whether they learn the habits that prevent harm.

    In other words, the news here isn’t a new model architecture—it’s a reminder that the healthcare AI talent pipeline is itself a critical system. And like any critical system, it needs design, upkeep, and accountability.

    Mentorship as the hidden bottleneck

    As described in the Stanford AIMI Blog’s piece on mentoring in machine learning, mentorship shapes how aspiring practitioners navigate the field’s steep learning curve—everything from selecting projects to interpreting results to communicating uncertainty. That might sound like career advice, but in healthcare AI it’s operationally consequential. The gap between a technically correct model and a clinically useful one is filled with decisions that are rarely taught well in a course: dataset curation tradeoffs, label quality checks, leakage pitfalls, calibration, subgroup performance, and the difference between retrospective metrics and prospective value.
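
    To illustrate one of the pitfalls listed above: leakage often arises when visits from the same patient land in both the training and the test set. Below is a minimal sketch of a patient-level split, assuming each row carries a hypothetical "patient_id" key. The function name and data layout are illustrative and do not come from the AIMI post.

```python
import random


def split_by_patient(rows, test_fraction=0.25, seed=0):
    """Split rows so that every patient appears in exactly one of train or test."""
    patient_ids = sorted({row["patient_id"] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [row for row in rows if row["patient_id"] not in test_ids]
    test = [row for row in rows if row["patient_id"] in test_ids]
    return train, test


# Four made-up patients with three visits each.
rows = [{"patient_id": pid, "visit": v} for pid in ("a", "b", "c", "d") for v in (1, 2, 3)]
train, test = split_by_patient(rows)
```

    Splitting on rows instead of patients would let the model see near-duplicates of its test cases, which tends to inflate retrospective metrics.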

    This is where mentorship functions like quality control. A mentored trainee is more likely to learn “how to think” rather than “what to run,” and to internalize a workflow where robustness checks and error analysis are default behaviors. In medicine, we already accept that apprenticeship is central to competency. The Stanford AIMI perspective effectively asks why we’d treat machine learning for medicine—an intervention that can influence diagnoses, triage, and treatment pathways—any differently.

    Why this matters now: the field is moving faster than its training norms

    Healthcare ML is accelerating in three directions at once: larger models, broader deployment ambitions, and more scrutiny. Generative AI is expanding what clinicians and patients expect from software. Health systems are experimenting with ambient documentation, decision support, and operational forecasting. Regulators and hospital governance groups are simultaneously raising the bar for transparency and monitoring.

    That combination raises the stakes for “how we make builders.” When teams are under pressure to ship, junior researchers and engineers can be pushed toward optimizing leaderboard metrics rather than clinical relevance. Mentorship counterbalances that pressure by transferring tacit knowledge: how to partner with clinicians, when not to model something, how to characterize uncertainty, and how to document limitations in a way that downstream users can actually act on.

    Equally important, mentorship determines who gets access to high-impact work. In academic medicine and in industry, opportunities often flow through informal networks. Structured mentoring can widen the funnel, bringing in people from nontraditional backgrounds—data analysts in hospitals, nurses with informatics interests, residents who code at night—who may be closest to the real pain points but furthest from ML gatekeeping.

    Implications for healthcare professionals: safer tools, better collaboration

    For clinicians, the benefits of strong ML mentorship show up as better collaboration and clearer product behavior. A mentored ML practitioner is more likely to design with clinical workflow in mind: What is the decision point? What happens when the model is wrong? Who owns the follow-up? How will performance drift be detected? These are not “engineering details.” They are patient-safety questions.

    Mentorship also helps bridge language gaps. Many clinician–data scientist collaborations fail not because the model is impossible, but because requirements are ambiguous: the outcome definition is unstable, the ground truth is contested, or the deployment context shifts midstream. Good mentors teach their mentees to translate between clinical objectives and statistical proxies, and to treat data-generating processes (documentation patterns, billing incentives, practice variation) as first-class modeling concerns.

    For healthcare organizations, investing in mentoring can reduce costly churn and rework. It’s expensive to repeatedly build models that never leave the “retrospective AUC” stage. Mentored teams are more likely to incorporate evaluation plans early—prospective validation, subgroup analysis, simulation of workflow impact—leading to fewer dead-end projects.

    Implications for patients: fewer silent failures, more trustworthy AI

    Patients rarely see the mentoring that happens behind the scenes, but they experience its absence. Under-mentored model development can produce systems that perform well on average and poorly for specific subgroups, or tools that degrade quietly after deployment because no one planned for monitoring. Mentorship encourages habits that directly reduce these risks: stress-testing on edge cases, examining error distributions, and acknowledging where data is missing or biased.

    Just as importantly, mentorship shapes the ethical reflexes of the field. It influences whether a young practitioner learns to ask: Should this be built? Who might be harmed? What recourse exists if the tool is wrong? Those questions are not automatic in an environment that rewards novelty and speed.

    What comes next: from informal advising to an institutional discipline

    The Stanford AIMI Blog post reads like a call to treat mentoring as infrastructure. The next step for the broader ecosystem—academic centers, health systems, and vendors—will be to operationalize it. That could mean formal mentoring tracks for clinician–data scientist pairs, protected time for senior reviewers to do methodological coaching, and “red team” style mentorship that teaches how to break models before patients do.

    Over the next few years, healthcare AI will likely be judged less on whether it can impress in a paper and more on whether it can hold up under messy clinical reality. The organizations that build enduring AI programs won’t just have better compute or bigger datasets. They’ll have better mentorship—because that’s how you scale judgment.

    Source: Stanford AIMI Blog, “Mentoring in Machine Learning,” https://stanfordaimi.medium.com/mentoring-in-machine-learning-3d6f3e988bd3

  • AI Scribes Are Saving Minutes, Not Miracles—And That’s Still a Big Deal for Clinicians

    Ambient AI scribes are finally producing the kind of hard-number evidence health systems have been asking for—but the early payoff looks more incremental than transformative. A new study published in JAMA found clinicians using an AI scribe spent about 13 fewer minutes per day in the electronic health record (EHR) and about 16 fewer minutes per day on documentation, as reported by Healthcare Dive.

    That may not sound like a revolution. Yet in a profession where cognitive load accumulates in five-minute fragments—an inbox message here, a refill request there—shaving even a quarter-hour off daily EHR work can meaningfully change the rhythm of a clinic day. More importantly, it helps clarify what AI scribes are (and aren’t): a workflow tool that can reduce friction, not a magic wand that eliminates charting.

    Why “modest” time savings matter in the burnout economy

    Clinician burnout has many causes, but EHR burden remains a consistent accelerant. Health systems have spent years trying to “optimize” templates and order sets, often producing marginal gains. AI scribes represent a different approach: rather than asking clinicians to become better clerks, they attempt to automate the clerical layer—turning a conversation into structured notes and suggested elements of the visit record.

    The JAMA findings (about 13 fewer minutes of EHR time and about 16 fewer minutes of documentation time per day, figures that may overlap rather than simply add up) should be interpreted in the context of how outpatient care actually works. In primary care and many specialties, the day is a sequence of compressed interactions. If documentation is the tax that must be paid after each encounter, then reducing that tax can return time to higher-value work: reviewing complex histories, calling families, coordinating care, or simply ending the day closer to on time.

    But these are associations, not proof of cause. Real-world studies can demonstrate correlation and operational impact, yet they don’t automatically answer every question executives and clinicians will ask: Who benefits most? Does time saved translate into fewer after-hours “pajama time” sessions? And do these tools improve the quality of notes, or just produce faster notes?

    From novelty to infrastructure: what health systems will evaluate next

    Many AI scribe deployments begin as pilots aimed at clinician satisfaction, recruitment, and retention. The next phase is more infrastructural: integrating ambient documentation into enterprise workflows without creating new forms of work. That means health systems will increasingly scrutinize:

    Note quality and clinical utility. Faster documentation only helps if the resulting note is accurate, clinically meaningful, and aligned with coding and compliance expectations. Overly verbose AI-generated notes can create downstream burden for other clinicians, coders, and auditors—even if the original author saved time.

    Exception handling. The real cost of automation often hides in the edges: what happens when a patient speaks softly, multiple people talk at once, or clinical nuance requires careful phrasing? If clinicians spend time correcting outputs, the time savings can evaporate—or worse, shift risk into the chart.

    Workflow integration. An AI scribe that lives outside the EHR can become a “two-screen problem.” The winners will be tools that embed into existing documentation flows, minimize clicks, and produce outputs clinicians can trust with minimal editing.

    Governance and measurement. As usage scales, health systems will need policies on where these tools are allowed, how models are updated, and how performance is monitored. Measuring “minutes saved” is a start; measuring safety events, near-misses, and patient experience will become the bar.

    Implications for clinicians and patients

    For clinicians, modest time savings can compound. Reclaiming roughly a quarter-hour per day can be the difference between finishing notes during clinic hours and catching up late in the evening. That matters for morale, turnover, and the sustainability of high-volume practices. It also changes how clinicians allocate attention in the exam room: if note-taking becomes less intrusive, the visit can feel more like a conversation than a transaction.

    For patients, the upside is subtle but real. Reduced documentation burden can translate into more eye contact, fewer awkward pauses, and more time for questions. However, ambient documentation also raises new expectations about transparency and consent. Patients may want to know when AI is listening, what is recorded, and how that information is used. Health systems will have to balance convenience with clear communication, especially in sensitive encounters.

    There’s also a broader patient-safety angle. Documentation errors are not hypothetical; they can propagate through problem lists, medication histories, and future clinical decisions. If AI scribes accelerate note production without robust review habits, they could unintentionally amplify inaccuracies. The counterpoint is that well-designed tools could improve completeness—capturing details clinicians might otherwise omit when rushing.

    What comes next: moving from time saved to outcomes earned

    The JAMA results reported by Healthcare Dive are likely to fuel more adoption, but the next wave of evidence needs to move beyond stopwatch metrics. Health systems and vendors will be pushed to show whether AI scribes reduce after-hours charting, improve clinician retention, and maintain—or improve—documentation quality. Even more compelling would be proof that they affect clinical outcomes indirectly by freeing attention for decision-making and patient education.

    In the near term, expect competition to shift from “who can draft the note” to “who can close the loop.” The most valuable systems won’t just generate prose; they’ll surface relevant history, propose orders with guardrails, and help clinicians reconcile medications and follow-ups—without turning the EHR into an even louder cockpit. If AI scribes can evolve from transcription engines into trustworthy workflow copilots, those “modest” minutes could become the foundation for a meaningful redesign of ambulatory care.

    Source: Healthcare Dive (reporting on research published in JAMA): https://www.healthcaredive.com/news/ai-artificial-intelligence-scribes-reductions-ehr-documentation-time-jama/816400/

  • The Next Bottleneck in Clinical NLP Isn’t the Model—It’s the Training Strategy

    Healthcare has no shortage of data—what it lacks is time. Clinicians and care teams still spend countless hours digging through notes to answer basic questions: What conditions does the patient have? Which medications are current? When did symptoms start? A June 2026 paper in the Journal of Biomedical Informatics argues that large language models (LLMs) can help, but only if we stop treating “the model” as the whole story and start optimizing how we train it for patient information extraction.

    The study—“A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning,” by Cheng Peng and colleagues—systematically examines how architectural choices, fine-tuning approaches, and multi-task instruction tuning influence performance on clinical extraction tasks, according to the Journal of Biomedical Informatics. In other words: it’s not just which LLM you pick; it’s how you adapt it to the messy realities of healthcare text.

    Why patient information extraction is still hard in 2026

    Information extraction sounds straightforward until you open an actual chart. Clinical notes are full of abbreviations, negations (“no evidence of…”), temporal language (“history of,” “rule out,” “since last visit”), copy-forward artifacts, and competing sources of truth (med list vs. narrative note vs. discharge summary). Even when LLMs can summarize a note impressively, converting that understanding into reliable, structured fields that downstream systems can trust remains a higher bar.

    This is why extraction is a pivotal use case. If you can consistently identify problems, medications, labs, procedures, and timelines, you unlock a cascade of workflows: pre-visit planning, registry reporting, cohort discovery for research, prior authorization support, quality measure automation, and even safety checks like medication reconciliation. But failure modes are also consequential—misclassifying a diagnosis as active when it’s ruled out, or attributing a medication to the patient that belongs to a family history section, can ripple into clinical decision support and create risk.

    What this study signals: the “how” of tuning matters as much as the “what”

    Many health systems and vendors still approach clinical NLP as a procurement question (“Which foundation model should we use?”). The thrust of this paper is that performance hinges on a more operational reality: the training recipe. The authors explicitly study model architecture and compare fine-tuning strategies alongside multi-task instruction tuning, as reported in the Journal of Biomedical Informatics. That emphasis reflects a broader shift in the field: foundation models may be generalists, but extraction is a specialist trade.

    Multi-task instruction tuning is particularly notable because it aligns with how clinical teams actually ask questions. A care manager might want problems and social determinants; a pharmacist might care about dosing and adherence; a coding specialist might focus on specificity and temporality. Training a model to follow varied, task-specific instructions can, in principle, reduce the fragility seen when a model performs well on one extraction schema but fails when the format or phrasing changes.

    From an industry standpoint, this points toward an uncomfortable truth: simply plugging an LLM into an EHR integration is not a product strategy. The defensible advantage is likely to come from curated training data, task design, evaluation rigor, and deployment controls—especially when systems must operate across departments, note styles, and patient populations.

    Implications for clinicians: less scavenger hunting, more verification

    If these training approaches translate into more accurate extraction, clinicians could see immediate workflow benefits. Instead of rereading three notes to find whether heart failure is active, the chart could surface a structured “active problem” list with traceable supporting evidence. Instead of manually reconciling a medication list from multiple sources, extraction could highlight discrepancies and point to the specific sentences that created them.

    But the job doesn’t disappear—it shifts. Clinicians may spend less time searching and more time verifying. That means interfaces matter: extracted facts need provenance (where in the note did this come from?), confidence cues, and a frictionless way to correct errors. Without that, clinicians will understandably revert to the original narrative, nullifying the promised efficiency gains.
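
    As a rough sketch of what provenance and confidence cues could look like in practice, an extracted fact might carry a pointer back to its source text along with a score. Every name below (the class, its fields, the 0.9 review threshold, the sample note) is hypothetical and not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractedFact:
    field_name: str    # e.g. "active_problem"
    value: str         # normalized value shown to the clinician
    note_id: str       # which note the fact came from
    start: int         # character offsets of the supporting text in that note
    end: int
    confidence: float  # model-reported score in [0, 1]; assumed to be available

    def evidence(self, note_text: str) -> str:
        """Return the exact span of the note that supports this fact."""
        return note_text[self.start:self.end]


def needs_review(fact: ExtractedFact, threshold: float = 0.9) -> bool:
    """Flag low-confidence facts for human verification (threshold is illustrative)."""
    return fact.confidence < threshold


# Made-up note text for illustration.
note = "Echo reviewed. Heart failure with reduced EF, stable on current regimen."
span = "Heart failure with reduced EF"
start = note.index(span)
fact = ExtractedFact("active_problem", "heart failure", "note-001", start, start + len(span), 0.82)
```

    With offsets stored alongside the value, an interface can highlight the supporting sentence instead of asking the clinician to hunt for it.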

    Implications for patients: cleaner records—and fewer invisible errors

    For patients, better extraction can mean fewer documentation mismatches that follow them across care settings. When structured data becomes more reliable, it improves care coordination and reduces repeated history-taking. It also has downstream effects: insurance approvals, referrals, and chronic disease management programs often depend on coded or structured elements that are currently inconsistently captured.

    The risk is equally patient-facing. Automated extraction can harden ambiguous text into a “fact,” and once something becomes structured, it tends to propagate—into summaries, problem lists, registries, and analytics. This is where safety practices become essential: human-in-the-loop review for high-stakes fields, audit trails, and continuous monitoring for drift as note templates and clinical language evolve.

    What to watch next: evaluation, generalization, and governance

    The paper’s focus on architecture and tuning underscores where the next competitive and clinical battleground will be: real-world generalization. Health systems will demand models that work across specialties, institutions, and documentation cultures—not just in benchmark settings. Expect increasing attention to evaluation frameworks that measure not only accuracy, but also calibration, robustness to negation/temporality, and performance across demographic subgroups.
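
    Calibration, in particular, can be checked with a short calculation. The sketch below implements expected calibration error for binary predictions, one common way to quantify calibration; it is not necessarily the metric used in the paper, and the function name and toy inputs are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average predicted probability and observed positive rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_prob = sum(p for p, _ in bucket) / len(bucket)
        positive_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_prob - positive_rate)
    return ece


# Toy predictions: the model says 0.9 four times but is right only three times.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.1], [1, 1, 1, 0, 0])
```

    A model can score well on accuracy and still be poorly calibrated, which matters when its confidence decides whether a human reviews the extraction.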

    Looking forward, the most impactful deployments will likely pair multi-task instruction-tuned extraction with governance: clear boundaries on where automation is allowed, clinician feedback loops that improve the model over time, and transparent error reporting. If the industry gets that right, LLMs could finally turn clinical text into trustworthy, actionable signals—reducing administrative burden while keeping clinical accountability where it belongs.

    Source: Peng C, Dong X, Lyu M, Paredes D, Zhang Y, Wu Y. “A study of large language models for patient information extraction: Model architecture, fine-tuning strategy, and multi-task instruction tuning.” Journal of Biomedical Informatics (June 2026). Available at: https://www.sciencedirect.com/science/article/pii/S1532046426000584?dgcid=rss_sd_all

  • Before AI Rolls Into the Clinic, Ask the Nurses: Why Kazakhstan’s New Attitude Scale Matters

    As hospitals and primary care networks race to deploy AI tools, one practical question keeps getting ignored: do the clinicians expected to use these systems actually understand them—and trust them? A new study in Frontiers in Digital Health tackles that gap by validating a short, nine-item survey designed to measure nurses’ knowledge and attitudes toward AI, and it offers early insights from primary healthcare centre nurses in Almaty, Kazakhstan.

    The research may sound incremental—another questionnaire, another psychometric analysis—but it addresses a foundational problem in healthcare AI implementation. You can’t manage what you can’t measure, and attitudes toward AI aren’t just “soft” factors. They influence adoption, workarounds, documentation quality, escalation behavior, and ultimately whether patients benefit or get harmed.

    Why a nine-item scale is bigger than it looks

    AI tools are often evaluated through model metrics, FDA-cleared indications, and workflow integration plans. Yet many deployments falter because frontline clinicians experience them as extra work, unreliable “black boxes,” or compliance mandates rather than clinical support. Nurses sit at the center of this tension. They coordinate care, triage symptoms, reconcile medications, monitor deterioration, and translate plans into action. If nurses don’t understand what an AI system is doing—or feel it threatens their professional judgment—the tool’s clinical value can evaporate, regardless of how strong the underlying algorithm is.

    According to the Frontiers in Digital Health study, the authors set out to evaluate the psychometric properties of a previously validated nine-item instrument that captures nurses’ AI knowledge and attitudes, and to report initial findings among primary care nurses in Almaty. Validation work like this is the unglamorous scaffolding of implementation science: it helps ensure that when a health system says “our staff are ready,” that statement is grounded in reliable measurement rather than anecdotes.
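
    For readers unfamiliar with psychometrics, one widely used reliability statistic is Cronbach's alpha, which measures how consistently a set of items moves together. The sketch below is a generic illustration on made-up responses. This article does not say which statistics the study reported, so treat the choice of alpha as an assumption.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """responses: list of respondents, each a list of k item scores."""
    k = len(responses[0])
    items = list(zip(*responses))  # transpose into per-item columns
    item_variance_sum = sum(pvariance(column) for column in items)
    total_variance = pvariance([sum(r) for r in responses])
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Made-up responses where every item agrees perfectly within each respondent.
alpha = cronbach_alpha([[1, 1, 1], [3, 3, 3], [5, 5, 5]])
```

    In this toy case alpha is at its maximum because the items are perfectly consistent; real survey data will score lower.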

    Why Kazakhstan—and why primary care?

    Much of the published conversation about clinical AI readiness comes from large academic centers in North America or Western Europe. That creates a distorted picture of global adoption, one drawn from settings where resources are plentiful, informatics teams are established, and AI pilots are heavily subsidized. Studying nurses in Kazakhstan’s primary care environment matters because it reflects where AI could have enormous impact and where the constraints are most real.

    Primary care is where AI’s promises are most frequently marketed: earlier risk detection, smarter triage, population health management, and administrative automation. It’s also where implementation is hardest. Primary care clinics often have lean staffing, fragmented IT, and high patient volumes. A tool that adds even a minute per visit can backfire. In that setting, nurse attitudes become a leading indicator of whether AI will streamline care—or simply become yet another layer of digital friction.

    Attitudes aren’t “feelings”—they’re patient safety signals

    Measuring attitudes toward AI isn’t about winning a popularity contest. It’s about mapping safety risks and training needs. A nurse who over-trusts an AI output may fail to escalate a subtle but dangerous change in patient status. A nurse who distrusts the system might ignore a valid alert, delay action, or document outside the tool to avoid it. Both failure modes—automation bias and algorithm aversion—are well documented in human factors research.

    A short, validated scale can help organizations segment readiness: who needs foundational AI literacy, who needs workflow coaching, and where leadership should slow down and redesign rather than push adoption. It also creates a way to measure change over time—before and after training, before and after rollout, and after adverse events—turning “culture” into something a quality team can track.

    What this means for healthcare leaders

    For executives and digital health teams, the lesson is straightforward: don’t treat nurses as downstream “end users.” Treat them as co-designers and safety partners. A validated instrument—like the one assessed in the Frontiers in Digital Health paper—can be used as a baseline diagnostic before procurement, not just as a post-rollout satisfaction survey.

    It also has procurement implications. If a clinic’s baseline AI knowledge is low, then a vendor’s interface, explainability features, and training burden become central to value—not afterthoughts. Conversely, if attitudes are positive but knowledge gaps are significant, leaders can justify targeted education rather than concluding that “staff are resistant.”

    What it means for nurses and patients

    For nurses, formal measurement can be empowering—if used correctly. It can support the case for protected training time, clearer accountability when AI is wrong, and better escalation pathways. But it can also be misused if leadership treats attitudes as compliance metrics. The right approach is to pair measurement with meaningful action: updated protocols, transparent model governance, and mechanisms for nurses to report issues without blame.

    For patients, the connection is direct. Nurses are often the first to notice when a tool is creating confusion, delaying care, or changing clinical priorities. When nurses are informed, confident, and appropriately skeptical, AI can become a genuine safety net. When they are rushed, undertrained, or excluded from decision-making, AI can magnify inequities and errors—especially in high-throughput primary care settings.

    The road ahead: from measurement to readiness-by-design

    The next phase for this line of work is to connect attitude and knowledge scores with real outcomes: adoption patterns, alert response times, documentation quality, near-miss reporting, and patient-level measures. If the scale can predict where AI deployments will struggle—or where harm is more likely—then it becomes a practical tool for governance, not just research.

    More broadly, this study is a reminder that the global AI conversation is shifting from “Can we build it?” to “Can we run it safely in everyday care?” As Kazakhstan and other health systems expand digital infrastructure, readiness measurement among nurses may become one of the most cost-effective interventions available: a small survey that helps prevent large, system-wide failure.

    Source: Frontiers in Digital Health, “Assessing nurses’ attitudes toward artificial intelligence in Kazakhstan: psychometric validation of a nine-item scale” (as reported by the journal and study authors).

  • Pancreatic cancer prognosis gets a transparency upgrade with Taiwan-scale explainable AI

    Pancreatic cancer prognosis gets a transparency upgrade with Taiwan-scale explainable AI

    Pancreatic cancer has long been a worst-case scenario for oncologists: late diagnoses, rapid progression, and survival curves that leave little room for uncertainty—yet in practice, uncertainty is everywhere. A new nationwide study from Taiwan, published in PLOS Digital Health, argues that the next leap in prognostic AI for pancreatic cancer won’t come from ever-more complex black boxes, but from models that can explain why they make a prediction—down to non-linear effects and interactions that clinicians can interrogate.

    According to the authors, the team built an explainable AI survival model using Taiwan’s national registry data, aiming to surface key prognostic variables and how they combine in patient-specific ways. That might sound incremental, but for a disease where treatment decisions are often made under severe time pressure—and where patients and families routinely ask for individualized, defensible expectations—interpretability is not a “nice to have.” It can be the difference between a tool that sits in a paper and one that changes care.

    Why prognostic AI in pancreatic cancer has hit a wall

    Most AI prognosis research faces a familiar tension: the models that perform best on paper can be the hardest to trust at the bedside. Deep learning approaches can ingest high-dimensional inputs and detect subtle patterns, but they often struggle to provide reasoning that aligns with clinical thinking. In pancreatic cancer, that trust gap is amplified by the disease’s heterogeneity. Two patients with similar stage labels can behave very differently depending on biology, comorbidities, functional status, and treatment access.

    Registry-scale datasets—especially national ones—offer a way through the data scarcity problem that plagues single-center studies. But using big data isn’t enough. If a model is trained on thousands of cases yet can’t show which features matter, when they matter, and how they interact, it risks becoming a statistical oracle rather than a clinical instrument.

    The Taiwan study’s core promise is to bridge that gap: pairing population-level breadth with explainability techniques intended to reveal non-linear relationships (where risk doesn’t increase in a straight line) and interactions (where one variable changes the meaning of another). As reported in PLOS Digital Health, the model is designed to generate patient-specific survival estimates while making the drivers of those estimates visible.

    What “explainable” could mean in day-to-day oncology

    Interpretability isn’t just an academic preference; it’s operationally important. For clinicians, “explainable” predictions can support three practical tasks:

    1) Risk communication. Pancreatic cancer care is filled with high-stakes conversations: whether to pursue aggressive chemotherapy, whether surgery is appropriate, and how to balance symptom control with life-prolonging therapy. If an AI tool can highlight the specific factors contributing to a patient’s predicted trajectory, clinicians can translate a probability into a narrative that patients can understand and challenge.

    2) Treatment planning and triage. Prognostic insight can influence how quickly patients are routed to specialized centers, clinical trials, genetic testing, or palliative care services. Explainable models may help teams justify why one patient should be escalated for multidisciplinary review while another might benefit more from supportive care focus—without relying on gut feeling alone.

    3) Error checking and bias detection. Transparency makes it easier to spot when a model is leaning too heavily on proxies for healthcare access or coding artifacts. In real-world registries, documentation patterns, missingness, and treatment selection biases can quietly shape predictions. Explanations don’t eliminate those issues, but they give clinicians and informaticians a handle for auditing them.

    Implications for patients: more than a number

    For patients, the value proposition is not simply “better accuracy.” It is actionable clarity. An individualized prognosis that comes with reasons can help patients weigh choices that extend beyond oncology—work decisions, caregiving arrangements, and personal goals. It can also improve shared decision-making by creating a structured way to discuss what is driving risk and what (if anything) might be modifiable.

    At the same time, explainable AI raises a subtle expectation problem: patients may infer that if the model can explain itself, it must be “right.” In reality, explanations can be persuasive even when they are incomplete. The clinical bar should be that explanations are faithful to the model and clinically coherent—not merely easy to visualize.

    What the industry should take from a nationwide registry approach

    The Taiwan registry-based design highlights a strategic direction for healthcare AI: models that learn from entire health systems, not boutique datasets. That matters because prognosis tools need robustness across varied hospitals, treatment patterns, and patient demographics. National data can also enable subgroup analyses that smaller studies can’t, helping identify where performance degrades, for example among older adults, rural patients, or those receiving non-standard therapies.

    But moving from research to deployment will still require careful steps. Registry variables may not map cleanly to what is available in an EHR workflow. Timelines matter (what was known at diagnosis versus after treatment begins). And prospective validation is essential: a model that looks strong retrospectively can behave differently when confronted with today’s shifting standards of care, new regimens, and evolving diagnostic pathways.

    The next chapter: from explainable predictions to decision support

    The larger opportunity is to connect explainable prognosis to clinical actions. The most useful tools won’t just say “high risk”; they’ll help answer “what now?” That could mean linking predictions to trial eligibility alerts, recommending referral to high-volume surgical centers, or flagging patients who may benefit from early palliative care integration. Future work may also combine registry data with imaging, genomics, and longitudinal lab trajectories—while preserving interpretability through careful model design and rigorous evaluation.

    If explainable AI can make pancreatic cancer prognosis both more personalized and more trustworthy, it could become a template for other aggressive cancers where time is short and decisions are complex. The Taiwan study, as reported by PLOS Digital Health, is a reminder that in clinical AI, transparency isn’t an aesthetic choice—it’s a pathway to adoption.

    Source: Tsai DR, Chiang CJ, Hsieh PC, Huang CY, Lee WC. “Explainable artificial intelligence for personalized prognosis in pancreatic cancer: A nationwide study from Taiwan.” PLOS Digital Health. https://journals.plos.org/digitalhealth/article?id=10.1371/journal.pdig.0001296

  • Medical AI’s Hardest Test Isn’t Accuracy—It’s Surviving the Realities of Low-Resource Care

    Medical AI’s Hardest Test Isn’t Accuracy—It’s Surviving the Realities of Low-Resource Care

    Medical AI keeps posting impressive results in controlled studies, but a new scoping review argues the real bottleneck is far more basic: getting these systems to work reliably, safely, and sustainably in low-resource settings. According to a paper in Frontiers in Digital Health, deployments in low- and middle-income countries (LMICs) continue to be constrained by infrastructure gaps, fragmented data environments, limited local technical capacity, and uneven governance—factors that can turn “AI-ready” prototypes into brittle tools in everyday clinics.

    Why this matters now

    AI’s promise in global health is hard to overstate. In settings with severe shortages of specialists, long travel times to tertiary hospitals, and overstretched primary care, decision support and automated triage can look like a shortcut to more equitable care. But the review highlights a central tension: many AI systems are designed around assumptions common in high-income health systems—stable internet, consistent power, interoperable records, clear accountability structures, and predictable clinical workflows.

    When those assumptions break, the risk isn’t merely that AI performs worse. The risk is that it becomes operationally irrelevant (never used), clinically unsafe (used incorrectly), or financially unsustainable (abandoned when grant funding ends). In other words, deployment is not an implementation detail—it’s the determining factor of whether AI improves outcomes or becomes another failed digital health initiative.

    The deployment gap: from model performance to system performance

    The review’s framing is a useful corrective to the industry’s “leaderboard” mindset. In low-resource settings, model accuracy is only one component of system performance, alongside uptime, maintenance, user training, workflow integration, monitoring for drift, and post-market governance.

    Three realities recur across LMIC implementations:

    Infrastructure constraints. Many clinics contend with intermittent electricity, inconsistent connectivity, limited device availability, and aging imaging equipment. AI tools that require constant cloud access—or high-end GPUs—can fail before they ever reach the patient.

    Data fragmentation and mismatch. Health data may be siloed across paper charts, inconsistent registries, and multiple donor-funded systems. Even when data exist, they may not reflect the population the model was trained on, raising the likelihood of performance degradation and bias.

    Local capacity and governance gaps. Without onsite expertise to maintain systems, troubleshoot, and evaluate performance, AI becomes dependent on external vendors or academic partners. That can slow iteration and obscure accountability when something goes wrong.

    What this means for clinicians and patients

    For healthcare professionals, the review underscores a practical point: clinical adoption hinges on trust and fit. If AI interrupts workflows, produces outputs that are hard to interpret, or cannot be relied on during peak demand, clinicians will revert to established practices. That’s rational behavior, not “resistance to innovation.”

    For patients, the stakes are higher than convenience. AI deployed without adequate safeguards can amplify existing inequities—such as under-diagnosis in rural communities, delayed referrals, or inconsistent triage decisions across facilities. Conversely, when designed for the environment, AI can expand access: supporting front-line workers with decision support, standardizing interpretation of diagnostics, and helping facilities prioritize limited resources.

    But the biggest patient-facing implication may be continuity. In low-resource settings, “pilot-itis” is a familiar problem: promising projects launch with fanfare and disappear within a year. Sustainable AI requires long-term operational planning—maintenance budgets, clear ownership, and monitoring—not just procurement.

    From “deploying AI” to building health AI ecosystems

    One of the most important takeaways from the Frontiers in Digital Health review is that successful AI in low-resource settings behaves less like a product drop-in and more like ecosystem building. That includes investing in data quality pipelines, governance frameworks, and workforce development alongside software.

    For health systems and implementers, a few strategic shifts follow naturally:

    Design for constraints, not exceptions. Offline-first architectures, edge inference, graceful degradation (safe fallback modes), and low-maintenance hardware matter as much as model selection.

    Prioritize local relevance. Tools should be trained and validated on representative data, evaluated in real clinical workflows, and adapted to local guidelines, languages, and referral pathways.

    Build capability, not dependency. Capacity-building—clinical informatics training, biomedical engineering support, and local ML expertise—reduces reliance on external partners and makes monitoring feasible.

    Governance must be explicit. Clear rules for accountability, model updates, error reporting, and data stewardship are essential, particularly where regulatory infrastructure is still developing.
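
    The "graceful degradation" idea in the list above can be sketched in a few lines. This is a minimal illustration rather than anything described in the review: the names `triage`, `TriageResult`, and `unreachable_model` are hypothetical, and the fallback shown here simply routes the case to a human instead of making an automated call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TriageResult:
    priority: str  # label handed back to the clinic workflow
    source: str    # which path produced the label


def triage(vitals: dict,
           remote_model: Optional[Callable[[dict], str]] = None) -> TriageResult:
    """Try a (hypothetical) remote model; fall back safely if it is unreachable."""
    if remote_model is not None:
        try:
            return TriageResult(remote_model(vitals), "remote-model")
        except (ConnectionError, TimeoutError):
            pass  # connectivity lost: drop to the offline path below
    # Safe fallback: make no automated call, route to a clinician instead.
    return TriageResult("clinician-review", "offline-fallback")


def unreachable_model(vitals: dict) -> str:
    # Stand-in for a cloud endpoint during an outage.
    raise ConnectionError("no network")


print(triage({"spo2": 95}, unreachable_model))
```

    The design choice is that a network failure never surfaces as an error at the point of care and never silently produces a guess. The result carries a `source` field so that later monitoring can count how often the fallback path was taken.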

    Forward look: the next era of global health AI

    The next wave of healthcare AI will be judged less by novel architectures and more by whether it can survive real-world conditions—variable power, mixed data, and human workflows under stress. The review in Frontiers in Digital Health is a reminder that equitable AI is not simply a matter of “bringing models” to LMICs; it’s a matter of building durable sociotechnical systems that can be owned and improved locally.

    As funders, governments, and vendors scale up efforts, the strongest signal of success may be boring in the best way: systems that stay online, get updated safely, are understood by clinicians, and keep delivering value after the pilot ends. The organizations that treat implementation as core R&D—rather than a last-mile chore—will define whether medical AI becomes a global health equalizer or another technology that works best where it’s needed least.

    Source: Frontiers in Digital Health, “Deploying medical AI in low-resource settings: a scoping review of challenges and strategies” (2026). https://www.frontiersin.org/articles/10.3389/fdgth.2026.1743634

  • AI ‘Co-Production’ Could Make Patient Input Less Token—and More Representative

    AI ‘Co-Production’ Could Make Patient Input Less Token—and More Representative

    Healthcare research has a diversity problem that doesn’t show up in the methods section: too often, the patients and members of the public who help shape studies are not the same people most affected by the outcomes. A new brief report in Frontiers in Digital Health describes an AI-enabled framework—called Panelyze—designed to augment Patient and Public Involvement and Engagement (PPIE) by widening participation beyond the limits of traditional panels, according to the authors.

    The idea is straightforward but consequential: use an AI-powered co-production system to help researchers gather, structure, and integrate patient and public perspectives at scale, particularly when conventional PPIE approaches struggle with recruitment, geography, time, and cost. If it works as intended, it could make “who gets heard” in health research less dependent on proximity to academic centers, free time, and prior familiarity with research processes.

    Why PPIE keeps falling short—despite good intentions

    PPIE is widely treated as a marker of rigor and legitimacy in healthcare research. Funders and ethics boards increasingly expect it. Clinicians and health systems recognize that interventions designed without lived experience can misread real-world constraints—like transportation, caregiving, stigma, language barriers, or digital access.

    Yet the operational reality is messy. Traditional PPIE often relies on standing advisory panels or periodic workshops—formats that can over-represent people who are already connected to academic networks, live near major institutions, or have the flexibility to join repeated meetings. The Frontiers in Digital Health report highlights familiar friction points: recruitment limitations, geographic constraints, and the resource intensity required to run panels that are both sustained and diverse.

    In other words, PPIE can become performative not because researchers don’t care, but because the machinery to do it well is hard to maintain. That’s the gap Panelyze aims to fill: not replacing human involvement, but expanding and systematizing it so that research teams can capture missing voices earlier and more consistently.

    What an AI “co-production” system could change

    According to Frontiers in Digital Health, Panelyze is positioned as an augmentation layer for existing PPIE—an AI framework that supports co-production, meaning patients and the public contribute to shaping research rather than simply reacting to it. That distinction matters. Many engagement efforts concentrate on feedback after the study is already largely defined. Co-production implies earlier influence: priorities, outcomes that matter, recruitment strategies, burden of participation, and interpretation of findings.

    At a practical level, an AI-supported approach can help with three persistent bottlenecks:

    1) Scale without exhausting staff. Facilitating sessions, synthesizing qualitative input, and closing the loop back to contributors requires time that research teams often underestimate. AI can assist with organizing themes, tracking concerns, and maintaining continuity across long projects—tasks that are tedious but essential for accountability.

    2) Broader reach. Digital systems can include people who are geographically dispersed or who can’t attend scheduled meetings. That doesn’t automatically create equity—access and trust still matter—but it can reduce the dependence on local networks and weekday availability.

    3) Consistency and traceability. One under-discussed failure mode in PPIE is “insight loss”: comments get captured in notes, then diluted as projects move from proposal to protocol to publication. A structured system can preserve the reasoning trail—what was suggested, what changed, and why.
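
    To make the "reasoning trail" in point 3 concrete, here is one possible record structure. It is an illustrative sketch only and not Panelyze's actual data model, which the report summary does not describe. All class and field names are assumptions.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Decision:
    stage: str       # e.g. "proposal", "protocol", "publication"
    outcome: str     # e.g. "adopted", "adapted", "declined"
    rationale: str   # why the team responded this way
    decided_on: date


@dataclass
class Contribution:
    contributor_id: str  # pseudonymous ID, not a name
    suggestion: str
    decisions: List[Decision] = field(default_factory=list)

    def record(self, stage: str, outcome: str, rationale: str, decided_on: date) -> None:
        """Append a decision so the trail shows what changed and why."""
        self.decisions.append(Decision(stage, outcome, rationale, decided_on))


def unanswered(contributions: List[Contribution]) -> List[Contribution]:
    """Contributions that no one has responded to yet (potential insight loss)."""
    return [c for c in contributions if not c.decisions]


c = Contribution("P-017", "Offer evening sessions for shift workers")
print(len(unanswered([c])))  # the suggestion has no recorded decision yet
c.record("protocol", "adopted", "Added two evening slots", date(2026, 3, 1))
print(len(unanswered([c])))  # now it has one
```

    Keeping the rationale next to each decision is what lets a team close the loop with contributors later, including when a suggestion was declined.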

    Implications for clinicians, researchers, and health systems

    For healthcare professionals, the promise of more representative PPIE is not abstract. It directly affects the usability of interventions that clinicians are asked to implement—care pathways, digital therapeutics, screening programs, consent materials, and follow-up protocols. When patient input is narrow, tools may look good on paper but fail in clinics because they ignore lived realities such as language needs, disability accommodations, or cultural perceptions of risk.

    For patients and communities, an AI-augmented engagement system could lower the threshold for participation, especially for people who have historically been under-consulted: those in rural regions, those with complex chronic illness, caregivers, and groups who may distrust institutions due to past harms. But it also raises a sensitive question: will people feel genuinely heard if an AI system sits between them and decision-makers?

    That question points to the make-or-break requirement for any such framework: transparency. If AI is used to summarize or prioritize input, communities will reasonably ask how that prioritization happens, what gets filtered out, and how bias is controlled. AI can amplify voices; it can also inadvertently standardize them—compressing nuance into neat themes that fit research workflows better than they fit reality.

    Safety, ethics, and governance: the hard part isn’t the model

    Systems like Panelyze arrive at a moment when healthcare is reckoning with algorithmic accountability. Engagement data is often sensitive, even when it’s not labeled “clinical”: narratives can reveal diagnoses, trauma histories, immigration status, or socioeconomic stressors. Any AI-enabled PPIE platform therefore inherits obligations around privacy, data minimization, consent, and secure handling.

    There’s also governance: Who owns the outputs? How are contributors credited? How are disagreements represented—especially when minority viewpoints are clinically important but statistically “rare”? And crucially, what feedback mechanisms ensure that participants can see the impact of their contributions rather than feeling mined for stories?

    What to watch next

    The next phase for AI-assisted PPIE will likely be less about shiny capability and more about evidence: Does it measurably improve diversity of participation? Does it change study designs in ways that reduce burden and improve recruitment and retention? Does it affect downstream outcomes like adoption, adherence, and satisfaction?

    Expect leading institutions to experiment with hybrid models—human facilitation plus AI tooling—while regulators, funders, and ethics committees refine expectations for transparency and auditability. If frameworks like Panelyze can demonstrate that they broaden representation without flattening nuance, they could become part of the default research stack. The long-term shift would be subtle but profound: patient involvement moving from a checkbox at proposal stage to a continuous, traceable input stream that shapes the life cycle of research.

    Source: Frontiers in Digital Health, brief research report, “Amplifying missing voices in healthcare research: an AI framework for co-production of PPIE.”

  • Every Open Source Radiology Dataset You Need for AI Research in 2026

    Every Open Source Radiology Dataset You Need for AI Research in 2026

    Radiology sits at the intersection of imaging technology and clinical decision-making, making it one of the most data-rich and AI-ready specialties in medicine. From chest X-rays to brain MRIs, from abdominal CTs to mammograms, the volume and variety of radiological imaging data are staggering. And thanks to a growing commitment to open science, an impressive collection of these datasets is now freely available to researchers worldwide.

    This guide catalogs every major open source radiology dataset available for AI research, organized by anatomical region and imaging modality. Each entry includes direct links to download portals, GitHub repositories, and associated publications. It is designed to be a living reference for anyone building AI systems for radiology.

    Why Radiology Leads in Open Data

    Radiology was among the first medical specialties to embrace digital formats, with DICOM becoming the universal standard for medical imaging decades ago. This digital-native foundation, combined with the visual pattern recognition demands of the specialty, has made radiology the testing ground for medical AI. Large-scale NIH-funded projects like The Cancer Imaging Archive (TCIA) and mandates for data sharing in federally funded research have further accelerated dataset availability.

    Chest Radiology Datasets

    Chest imaging — both X-ray and CT — represents the largest category of open radiology data, driven by the global burden of pulmonary disease and the COVID-19 pandemic.

    Dataset | Modality | Size | Annotations | Source
    CheXpert | Chest X-ray | 224,316 images | 14 pathology labels with uncertainty modeling | Stanford ML Group
    MIMIC-CXR-JPG | Chest X-ray | 377,110 images | Free-text reports + CheXpert-style labels | PhysioNet
    NIH ChestX-ray14 | Chest X-ray | 112,120 images | 14 disease labels, bounding boxes for 880 images | NIH Clinical Center
    VinDr-CXR | Chest X-ray | 18,000 images | 22 lesion categories with bounding boxes from 17 radiologists | PhysioNet
    PadChest | Chest X-ray | 160,868 images | 174 radiographic findings, 19 differential diagnoses | BIMCV
    RSNA Pneumonia Detection | Chest X-ray | 30,000 images | Bounding boxes for pneumonia opacities | Kaggle
    LUNA16 | Chest CT | 888 scans | Lung nodule locations from LIDC-IDRI | Grand Challenge
    LIDC-IDRI | Chest CT | 1,018 scans | Multi-reader nodule annotations with malignancy ratings | TCIA
    NLST (National Lung Screening Trial) | Low-dose CT | 75,000+ scans | Lung cancer screening outcomes | NCI CDAS
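
    The "uncertainty modeling" noted for CheXpert refers to its label encoding: each finding is marked 1.0 (positive), 0.0 (negative), -1.0 (uncertain), or left blank (not mentioned). The sketch below shows three simple ways of resolving the uncertain value before training, in the spirit of the U-Ones, U-Zeros, and U-Ignore approaches discussed by the dataset's authors. The function name is hypothetical, and treating blank entries as negative is a simplifying assumption rather than a requirement of the dataset.

```python
from typing import Optional

# What an uncertain (-1.0) label becomes under each policy.
# None means "mask this label out of the loss".
POLICIES = {"ones": 1.0, "zeros": 0.0, "ignore": None}


def resolve_label(raw: Optional[float], policy: str = "ignore") -> Optional[float]:
    """Map one raw CheXpert-style label to a training target."""
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    if raw is None:
        return 0.0  # assumption: blank (not mentioned) treated as negative
    if raw == -1.0:
        return POLICIES[policy]
    return float(raw)


row = {"Cardiomegaly": 1.0, "Edema": -1.0, "Pneumonia": None}
for policy in POLICIES:
    print(policy, {name: resolve_label(v, policy) for name, v in row.items()})
```

    Which policy works best can differ by finding, so it is usually treated as a choice to validate rather than a fixed rule.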

    Neuroradiology Datasets

    Brain imaging datasets support research in tumor detection, stroke assessment, neurodegenerative disease tracking, and normal brain development.

    Dataset | Modality | Size | Annotations | Source
    BraTS 2023/2024 | Multi-parametric MRI | 2,000+ cases | Glioma segmentation (enhancing, core, whole tumor, edema) | Synapse
    ADNI | MRI, PET | 2,000+ subjects | Longitudinal Alzheimer’s imaging with cognitive scores | ADNI
    CQ500 | Head CT | 491 scans | Intracranial hemorrhage, mass effect, midline shift, fractures | qure.ai
    RSNA Intracranial Hemorrhage | Head CT | 25,000+ exams | 5 hemorrhage subtypes + normal | Kaggle
    ATLAS (Anatomical Tracings of Lesions After Stroke) | T1w MRI | 1,271 scans | Manual stroke lesion tracings | NITRC
    OpenNeuro | MRI, EEG, MEG | 900+ datasets | BIDS-formatted neuroscience datasets | openneuro.org
    HCP (Human Connectome Project) | MRI (structural, functional, diffusion) | 1,200 subjects | High-resolution brain connectivity maps | humanconnectome.org

    Abdominal Radiology Datasets

    Dataset | Modality | Size | Annotations | Source
    TotalSegmentator | CT | 1,204 scans | 117 anatomical structures | GitHub
    AbdomenAtlas 1.1 | CT | 9,262 volumes | 25 organs + tumors | GitHub
    AMOS 2022 | CT + MRI | 500 CT + 100 MRI | 15 abdominal organs | Grand Challenge
    LiTS | CT | 201 scans | Liver + liver tumor segmentation | CodaLab
    KiTS23 | CT | 599 scans | Kidney + kidney tumor segmentation | GitHub
    BTCV (Beyond the Cranial Vault) | CT | 50 scans | 13 abdominal organ segmentations | Synapse
    WORD | CT | 150 scans | 16 abdominal organ segmentations | GitHub

    Mammography and Breast Imaging

    Dataset | Modality | Size | Annotations | Source
    VinDr-Mammo | Mammography | 5,000 exams (20,000 images) | BI-RADS assessment + findings with bounding boxes | PhysioNet
    CBIS-DDSM | Mammography | 2,620 scans | Mass and calcification annotations with pathology-confirmed labels | TCIA
    INbreast | Full-field digital mammography | 410 images | Contour annotations for masses, calcifications, and distortions | INESC Porto
    RSNA Screening Mammography | Mammography | 54,706 images | Cancer detection labels | Kaggle
    Duke Breast Cancer MRI | Breast MRI (DCE) | 922 patients | Pre-operative MRI with clinical and genomic data | TCIA

    Musculoskeletal Radiology

    Dataset | Modality | Size | Annotations | Source
    MURA | X-ray | 40,561 images | Normal/abnormal across 7 upper extremity types | Stanford ML Group
    VerSe 2020 | CT | 374 scans | Vertebra segmentation and labeling | GitHub
    RSNA Cervical Spine Fracture | CT | 3,000+ scans | Fracture detection and localization | Kaggle
    KneeXray (OAI) | X-ray | 36,369 images | Kellgren-Lawrence osteoarthritis grading | NIH OAI
    fastMRI | MRI (knee + brain) | 10,000+ volumes | Raw k-space data for accelerated MRI reconstruction | NYU fastMRI

    Nuclear Medicine and PET Datasets

    Dataset | Modality | Size | Annotations | Source
    AutoPET | PET/CT | 1,014 studies | Whole-body tumor segmentation from FDG-PET/CT | Grand Challenge
    HECKTOR | PET/CT | 882 patients | Head and neck tumor segmentation and outcome prediction | Grand Challenge

    Report Generation and Vision-Language Datasets

    A rapidly growing category pairs radiology images with their associated reports, enabling AI systems that can generate or summarize radiological findings.

    Dataset | Size | Content | Key Features | Source
    MIMIC-CXR Reports | 227,835 reports | Chest X-ray free-text reports | Largest radiology report dataset; paired with images | PhysioNet
    IU X-Ray | 7,470 pairs | Chest X-ray images + reports | Indiana University dataset; frequently used for report generation | Open-i
    RadNLI | 960 sentence pairs | Natural language inference for radiology | Entailment, contradiction, neutral labels for report sentences | PhysioNet

    Ultrasound Datasets

    Dataset | Focus | Size | Annotations | Source
    BUSI (Breast Ultrasound Images) | Breast | 780 images | Normal, benign, malignant classification + segmentation masks | Cairo University
    HC18 | Fetal head | 1,334 images | Head circumference measurement in 2D ultrasound | Grand Challenge
    TN3K | Thyroid | 3,493 images | Thyroid nodule segmentation | GitHub
    EchoNet-Dynamic | Cardiac | 10,030 videos | Ejection fraction + semantic segmentations | echonet.github.io

    Open Source Radiology AI Frameworks

    Several open source frameworks have become essential for working effectively with these datasets:

    • MONAI — Medical Open Network for AI: comprehensive PyTorch framework for medical imaging
    • MONAI Model Zoo — Pretrained models for common medical imaging tasks
    • fastMRI — Tools for accelerated MRI reconstruction
    • Microsoft Health Intelligence — ML toolbox for medical imaging
    • nnU-Net — Self-configuring framework for medical image segmentation
    • 3D Slicer — Open source platform for medical image informatics

    Getting Started

    For newcomers to radiology AI, the path depends on your clinical focus. For chest imaging, CheXpert and MIMIC-CXR provide the scale needed for robust model development. For segmentation tasks, TotalSegmentator and the Medical Segmentation Decathlon offer comprehensive multi-organ benchmarks. For neuroradiology, BraTS remains the gold standard for tumor segmentation, while the Human Connectome Project provides unparalleled brain connectivity data.

    Regardless of your focus area, we recommend using the MONAI framework and nnU-Net as starting points for model development — both handle much of the preprocessing, data loading, and training pipeline complexity that can slow down medical imaging research.

    The open radiology dataset ecosystem is more robust than ever, and new datasets continue to emerge from major academic medical centers and government initiatives worldwide. Bookmark this page and check back regularly — we will keep it updated as new resources become available.

  • Open Source Pathology Datasets for AI: Every Major Resource in Digital Pathology

    Digital pathology is undergoing a revolution. The marriage of whole slide imaging (WSI) with artificial intelligence is creating systems capable of detecting cancers, grading tumors, predicting molecular markers, and identifying patterns invisible to the human eye. At the heart of this revolution are open source datasets — massive collections of digitized tissue slides with expert annotations that enable researchers worldwide to develop and validate computational pathology algorithms.

    This guide provides an exhaustive catalog of every major open source pathology dataset available for AI research, with direct links to download portals and code repositories. Whether you are working on cancer detection, cell segmentation, survival prediction, or foundation model development, this is your complete reference.

    The Digital Pathology Data Landscape

    Pathology datasets present unique computational challenges. A single whole slide image (WSI) can be 100,000 × 100,000 pixels or larger (roughly 10 gigapixels), far too large for standard deep learning pipelines to process directly. In practice, slides are split into patches, often combined with multiple instance learning to aggregate patch-level evidence into a slide-level prediction. This scale, combined with the complexity of tissue morphology, makes pathology one of the most technically demanding areas of medical imaging AI.
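
    To make the scale concrete, the sketch below enumerates patch coordinates for a slide of a given size. It is a minimal illustration, not a method prescribed by any dataset in this guide: the function names `iter_patch_origins` and `count_patches` and the default 256-pixel patch size are assumptions chosen for the example, and a real pipeline would also discard background tiles.

```python
from typing import Iterator, Tuple


def iter_patch_origins(
    width: int, height: int, patch: int = 256, stride: int = 256
) -> Iterator[Tuple[int, int]]:
    """Yield (x, y) top-left corners of patches that fit fully inside the slide."""
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    # Rows first, then columns, so patches come out in raster order.
    for y in range(0, height - patch + 1, stride):
        for x in range(0, width - patch + 1, stride):
            yield (x, y)


def count_patches(width: int, height: int, patch: int = 256, stride: int = 256) -> int:
    """Closed-form patch count; equals the number of items iter_patch_origins yields."""
    if patch <= 0 or stride <= 0:
        raise ValueError("patch and stride must be positive")
    if width < patch or height < patch:
        return 0
    cols = (width - patch) // stride + 1
    rows = (height - patch) // stride + 1
    return cols * rows


if __name__ == "__main__":
    print(count_patches(100_000, 100_000))  # 390 columns x 390 rows
```

    At 256 × 256 pixels with no overlap, a 100,000 × 100,000 slide yields 390 × 390 = 152,100 candidate patches, which is why slide-level models aggregate patch features instead of ingesting the whole image. With a reader such as OpenSlide, each origin could be passed to `read_region` along with a pyramid level and patch size.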

    Datasets in this space range from small, meticulously annotated collections of a few hundred slides to massive multi-institutional archives containing tens of thousands of cases. The annotations themselves vary from coarse slide-level labels to pixel-level segmentations of individual cells and tissue structures.

    Large-Scale Whole Slide Image Archives

    These repositories serve as foundational resources, providing access to thousands of digitized slides across multiple cancer types.

    Dataset | Slides | Cancer Types | Annotations | Source
    TCGA (The Cancer Genome Atlas) | 30,000+ diagnostic slides | 33 cancer types | Slide-level diagnoses; paired genomics, transcriptomics, clinical data | GDC Portal
    CPTAC (Clinical Proteomic Tumor Analysis Consortium) | 3,500+ slides | 10+ cancer types | Paired proteomics and histopathology | NCI CPTAC
    GTEx | 17,382 tissue samples | Normal tissues (54 sites) | Gene expression paired with histology | gtexportal.org
    PAIP (Pathology AI Platform) | Varies by challenge year | Liver, colon, pancreas | Tumor segmentation and viable tumor burden estimation | Grand Challenge

    Breast Cancer Pathology Datasets

    Breast cancer pathology has the richest ecosystem of open datasets, reflecting its clinical importance and the maturity of AI research in this domain.

    Dataset | Size | Task | Annotations | Source
    Camelyon16 | 399 WSIs | Lymph node metastasis detection | Pixel-level metastasis annotations; landmark challenge in pathology AI | Grand Challenge
    Camelyon17 | 1,000 WSIs from 5 centers | Patient-level pN-staging | Multi-center extension with clinical staging | Grand Challenge
    BreakHis | 9,109 microscopy images | Benign vs. malignant classification | 4 magnification levels (40x, 100x, 200x, 400x); 8 tumor subtypes | UFPR
    BACH (BreAst Cancer Histology) | 400 images + 30 WSIs | 4-class classification | Normal, benign, in situ carcinoma, invasive carcinoma | Grand Challenge
    BRACS | 4,539 ROIs + 547 WSIs | 7-class breast lesion classification | Atypical ductal hyperplasia included; region and WSI-level annotations | BRACS
    TUPAC16 | 821 WSIs | Mitosis detection + proliferation scoring | Mitosis annotations; tumor proliferation speed prediction | Grand Challenge

    Colorectal and Gastrointestinal Pathology

    Dataset | Size | Task | Annotations | Source
    NCT-CRC-HE-100K | 100,000 patches | 9-class tissue classification | Adipose, background, debris, lymphocytes, mucus, smooth muscle, normal mucosa, cancer stroma, tumor epithelium | Zenodo
    GlaS (Gland Segmentation) | 165 images | Gland instance segmentation | Pixel-level gland boundaries in colon adenocarcinoma | TIA Warwick
    DigestPath 2019 | 872 images | Signet ring cell detection + colonoscopy tissue segmentation | Cell-level and tissue-level annotations | Grand Challenge
    CRAG | 213 images | Colorectal adenocarcinoma gland segmentation | Instance-level gland segmentation with grading | TIA Warwick

    Cell Detection and Segmentation Datasets

    Cell-level analysis is fundamental to computational pathology, enabling quantification of tumor-infiltrating lymphocytes, mitotic counts, and cellular composition.

    Dataset | Size | Task | Annotations | Source
    PanNuke | 7,901 patches from 19 tissue types | Nuclei instance segmentation | 5 nuclei types: neoplastic, inflammatory, connective, dead, epithelial | TIA Warwick
    CoNSeP | 41 H&E image tiles | Nuclei segmentation and classification | 24,319 annotated nuclei with 7 cell types | TIA Warwick
    MoNuSeg | 44 H&E images from 7 organs | Nuclear segmentation | 21,623 manually annotated nuclear boundaries | Grand Challenge
    Lizard | 291 H&E patches | Nuclear instance segmentation | 495,179 nuclei labeled into 6 classes across colon tissue | TIA Warwick
    NuCLS | 1,744 ROIs from breast cancer | Nuclear classification | 220,000+ nuclei with 13 cell type labels | NuCLS

    Prostate and Kidney Pathology

    Dataset | Size | Task | Annotations | Source
    PANDA (Prostate Cancer Grade Assessment) | 10,616 WSIs | Gleason grading | Slide-level ISUP grades from 2 European centers; largest prostate pathology dataset | Kaggle
    SICAPv2 | 18,783 patches from 155 WSIs | Gleason pattern classification | Patch-level Gleason pattern (3, 4, 5) and non-cancerous annotations | Mendeley Data
    HuBMAP Kidney | 20 WSIs | Functional tissue unit segmentation | Glomeruli and tubule segmentation in PAS-stained kidney | Kaggle

    Foundation Models and Pretrained Weights

    The pathology AI community has produced several open source foundation models that can be fine-tuned on smaller datasets:

    Model | Architecture | Training Data | Key Features | Source
    UNI | ViT-Large | 100M+ patches from 100K+ WSIs | General-purpose pathology feature extractor; state-of-the-art on 34 benchmarks | GitHub
    CONCH | Vision-Language | 1.17M image-text pairs | Contrastive learning from pathology images and captions | GitHub
    CTransPath | Swin Transformer | 15M patches from 30K+ slides | Semantically relevant contrastive learning for histopathology | GitHub
    Phikon | ViT-Base | TCGA + CPTAC slides | Self-supervised pathology model from Owkin | Hugging Face
    HoverNet | Custom encoder-decoder | PanNuke + CoNSeP | Simultaneous nuclei segmentation and classification | GitHub

    Tools and Frameworks

    Several open source tools make working with these datasets more accessible:

    • CLAM — Data-efficient and weakly supervised computational pathology pipeline
    • MITI Minimum Information Standard — Standardized reporting for tissue image datasets
    • QuPath — Open source whole slide image analysis software
    • deep-histopath — Deep learning framework for computational pathology
    • MONAI Pathology — Pathology-specific extensions to the MONAI framework

    Getting Started

    For researchers entering computational pathology, we recommend starting with NCT-CRC-HE-100K for tissue classification (patch-level, manageable size) or Camelyon16 for WSI-level analysis. The PANDA dataset is a good place to practice multiple instance learning on a well-documented, large-scale task. For cell-level analysis, PanNuke provides the breadth needed across tissue types.
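
    Multiple instance learning treats each slide as a "bag" of patch embeddings with a single slide-level label. The sketch below shows one common aggregation step, a simplified attention-style pooling, in plain Python. It is an illustration only: the scoring vector `w`, the function names, and the toy embeddings are hypothetical, and frameworks such as CLAM use learned, more expressive attention networks.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    bag: Sequence[Sequence[float]], w: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Collapse a bag of patch embeddings into one slide-level embedding.

    Each patch h_k gets a score w . tanh(h_k); scores are normalised with a
    softmax and used as weights in an average of the embeddings.
    `w` is a hypothetical scoring vector that would normally be learned.
    """
    if not bag:
        raise ValueError("bag must contain at least one patch embedding")
    dim = len(bag[0])
    if len(w) != dim or any(len(h) != dim for h in bag):
        raise ValueError("all embeddings and w must share one dimension")
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in bag]
    weights = softmax(scores)
    pooled = [sum(a * h[d] for a, h in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


if __name__ == "__main__":
    toy_bag = [[0.0, 0.0], [2.0, 2.0]]  # two made-up patch embeddings
    pooled, weights = attention_pool(toy_bag, w=[5.0, 5.0])
    print(pooled, weights)
```

    With an all-zero `w` the weights are uniform and the result is plain mean pooling; a non-zero `w` shifts weight toward patches whose features score highly, which is the intuition behind attention-based slide classifiers.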

    The field is moving rapidly toward foundation models, and researchers should consider leveraging pretrained weights from UNI, CONCH, or Phikon rather than training from scratch — particularly for smaller downstream datasets.

    As computational pathology matures, we anticipate more datasets incorporating immunohistochemistry, special stains, and spatial transcriptomics data, enabling multimodal models that capture the full complexity of tissue biology.

  • The Definitive Guide to Open Source Medical Imaging Datasets in 2026

    The democratization of medical imaging AI research has been driven by the availability of large, high-quality open source datasets. These publicly accessible resources have enabled researchers, clinicians, and developers worldwide to build, validate, and deploy algorithms that are transforming healthcare delivery. In this comprehensive guide, we catalog the most important downloadable medical imaging datasets available today, with direct links to their repositories and source data.

    Why Open Source Medical Imaging Datasets Matter

    Medical imaging AI has reached a critical inflection point. Models trained on proprietary datasets often lack generalizability, while those developed on diverse, well-annotated open datasets tend to perform more robustly across clinical settings. Open datasets also enable reproducibility — a cornerstone of scientific research that has historically been lacking in medical AI publications.

    The challenge, however, is that these datasets are scattered across dozens of platforms, institutional repositories, and challenge websites. This guide consolidates them into a single reference, organized by modality and clinical application.

    Multi-Modal and General Purpose Datasets

    Several landmark datasets span multiple imaging modalities or serve as foundational benchmarks for the broader medical imaging community.

    Dataset | Modality | Size | Annotations | Source
    Medical Segmentation Decathlon | CT, MRI | 2,633 scans across 10 tasks | Expert segmentations for liver, brain, hippocampus, lung, prostate, cardiac, pancreas, colon, hepatic vessels, spleen | medicaldecathlon.com
    The Cancer Imaging Archive (TCIA) | CT, MRI, PET, Pathology | 100M+ images across 150+ collections | Varies by collection: segmentations, clinical data, genomics | cancerimagingarchive.net
    Grand Challenge Datasets | Multiple | Varies per challenge | Challenge-specific expert annotations | grand-challenge.org
    AMOS 2022 | CT, MRI | 500 CT + 100 MRI scans | 15 abdominal organ segmentations | amos22.grand-challenge.org
    TotalSegmentator | CT | 1,204 CT scans | 117 anatomical structures segmented | GitHub
    AbdomenAtlas 1.1 | CT | 9,262 CT volumes | 25 organ + tumor annotations | GitHub

    Chest and Lung Imaging Datasets

    Chest X-ray and CT datasets represent the single largest category of open medical imaging data, largely driven by the COVID-19 pandemic and longstanding tuberculosis screening research.

    Dataset | Modality | Size | Annotations | Source
    CheXpert | Chest X-ray | 224,316 images from 65,240 patients | 14 pathology labels with uncertainty annotations | Stanford ML Group
    MIMIC-CXR | Chest X-ray | 377,110 images from 65,379 patients | Free-text radiology reports + NLP-extracted labels | PhysioNet
    NIH ChestX-ray14 | Chest X-ray | 112,120 images from 30,805 patients | 14 disease labels (NLP-mined) | NIH Clinical Center
    PadChest | Chest X-ray | 160,868 images from 67,625 patients | 174 radiographic findings, 19 differential diagnoses | BIMCV
    VinDr-CXR | Chest X-ray | 18,000 images | 22 local lesion labels + 6 global disease labels | PhysioNet
    LUNA16 | Chest CT | 888 CT scans | Lung nodule annotations from LIDC-IDRI | Grand Challenge
    COVID-CT | Chest CT | 349 COVID-19+ and 397 non-COVID CT slices | Binary COVID classification | GitHub
    RSNA Pneumonia Detection | Chest X-ray | 30,000 frontal chest X-rays | Bounding boxes for pneumonia opacities | Kaggle
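
    The "uncertainty annotations" listed for CheXpert change how labels must be loaded. In the CheXpert label files each finding is 1 (positive), 0 (negative), -1 (uncertain), or blank (not mentioned). The sketch below maps those raw values to training targets under three simple policies discussed in the CheXpert paper: treat uncertain as positive, treat it as negative, or ignore it. The function name `resolve_label`, the policy strings, and the choice to treat a blank cell as negative are assumptions made for illustration.

```python
from typing import Optional

POLICIES = ("ones", "zeros", "ignore")


def resolve_label(raw: str, policy: str = "ones") -> Optional[float]:
    """Map one raw CheXpert-style label cell to a training target.

    Returns 1.0 or 0.0, or None when the sample should be masked out of the
    loss for this finding (only under the "ignore" policy).
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")
    text = raw.strip()
    if text == "":
        # Assumption: a finding that is not mentioned is treated as negative.
        return 0.0
    value = float(text)
    if value == -1.0:  # uncertain
        if policy == "ones":
            return 1.0
        if policy == "zeros":
            return 0.0
        return None
    if value in (0.0, 1.0):
        return value
    raise ValueError(f"unexpected label value: {raw!r}")


if __name__ == "__main__":
    for cell in ["1.0", "0.0", "-1.0", ""]:
        print(repr(cell), [resolve_label(cell, p) for p in POLICIES])
```

    Which policy works best can differ by finding, so it is worth treating the policy as a per-label hyperparameter and not a global constant.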

    Brain and Neuroimaging Datasets

    Neuroimaging datasets have been critical for advancing our understanding of neurodegenerative diseases, brain tumors, and stroke.

    Dataset | Modality | Size | Annotations | Source
    BraTS (Brain Tumor Segmentation) | MRI | 2,000+ multi-modal MRI scans | Expert glioma segmentations (enhancing, core, whole) | Synapse
    ADNI (Alzheimer’s Disease Neuroimaging Initiative) | MRI, PET | 2,000+ subjects (longitudinal) | Clinical assessments, biomarkers, cognitive scores | ADNI
    IXI Dataset | MRI (T1, T2, PD, MRA, DTI) | 600 healthy subjects | Multi-sequence brain MRI from 3 hospitals | brain-development.org
    OASIS-3 | MRI, PET | 1,378 subjects, 2,842 MRI sessions | Longitudinal Alzheimer’s + aging data | oasis-brains.org
    ISLES (Ischemic Stroke Lesion Segmentation) | MRI | 400+ stroke cases | Stroke lesion segmentations | isles-challenge.org
    FastSurfer / FreeSurfer Datasets | MRI | Varies | Cortical parcellations and volumetrics | GitHub

    Cardiac Imaging Datasets

    Cardiac imaging AI has matured rapidly, with open datasets enabling automated segmentation and functional analysis of the heart.

    Dataset | Modality | Size | Annotations | Source
    ACDC (Automated Cardiac Diagnosis Challenge) | Cardiac MRI | 150 patients (5 subgroups) | LV, RV, myocardium segmentations | CREATIS
    M&Ms (Multi-Centre, Multi-Vendor & Multi-Disease) | Cardiac MRI | 375 patients from 6 centers | Cardiac structure segmentations across vendors | ub.edu
    EchoNet-Dynamic | Echocardiography | 10,030 echo videos | Ejection fraction labels, semantic segmentations | echonet.github.io
    CAMUS | 2D Echocardiography | 500 patients | LV endocardium, epicardium, LA segmentations | CREATIS

    Abdominal and Gastrointestinal Imaging

    Dataset | Modality | Size | Annotations | Source
    LiTS (Liver Tumor Segmentation) | CT | 201 CT scans | Liver and liver tumor segmentations | CodaLab
    KiTS (Kidney Tumor Segmentation) | CT | 599 CT scans | Kidney and kidney tumor segmentations | GitHub
    Kvasir-SEG | Endoscopy | 1,000 polyp images | Polyp segmentation masks | Simula
    CT-ORG | CT | 140 CT scans | 6 organ segmentations (liver, lungs, bladder, kidney, bones, brain) | TCIA

    Ophthalmology and Retinal Imaging

    Dataset | Modality | Size | Annotations | Source
    DRIVE | Fundus Photography | 40 retinal images | Vessel segmentation masks | Grand Challenge
    MESSIDOR-2 | Fundus Photography | 1,748 images | Diabetic retinopathy grading | ADCIS
    EyePACS | Fundus Photography | 88,702 images | 5-class diabetic retinopathy severity | Kaggle
    REFUGE | Fundus Photography | 1,200 images | Glaucoma classification, optic disc/cup segmentation | Grand Challenge
    OCTID | OCT | 500 images | Retinal disease classifications | Scholars Portal

    Musculoskeletal Imaging

    Dataset | Modality | Size | Annotations | Source
    MURA | X-ray | 40,561 musculoskeletal radiographs | Normal/abnormal binary labels across 7 body parts | Stanford ML Group
    VerSe | CT | 374 CT scans | Vertebra segmentation and labeling | GitHub
    OAI (Osteoarthritis Initiative) | MRI, X-ray | 4,796 subjects | Longitudinal knee imaging with clinical outcomes | NIH

    How to Get Started

    For researchers new to medical imaging AI, we recommend starting with well-documented, moderately sized datasets like the Medical Segmentation Decathlon or CheXpert. These offer clean annotations, established baselines, and active communities. For production-scale development, MIMIC-CXR and The Cancer Imaging Archive provide the volume needed to train robust models.

    Key considerations when selecting a dataset include licensing terms (some require a signed data use agreement or credentialed access, while others are released under open licenses), annotation quality, patient demographics, and whether the dataset includes train/test splits for fair benchmarking. Always review the associated publications and the license or data use terms before incorporating any dataset into your work.
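
    When a dataset ships without an official split, one common safeguard is to split by patient instead of by image, since several datasets above contain multiple images per patient (CheXpert, for example, lists 224,316 images from 65,240 patients). The sketch below is one way to do that deterministically with a hash. The function names, the `salt` parameter, and the `(patient_id, image_path)` record layout are assumptions for illustration, and hashing gives only an approximate test fraction.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def assign_split(patient_id: str, test_fraction: float = 0.2, salt: str = "v1") -> str:
    """Deterministically map a patient to 'train' or 'test' via a hash."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_patient(
    records: Iterable[Tuple[str, str]], test_fraction: float = 0.2
) -> Dict[str, List[str]]:
    """Group image paths so every image of a patient lands in the same split.

    `records` holds (patient_id, image_path) pairs; the layout is hypothetical.
    """
    splits: Dict[str, List[str]] = {"train": [], "test": []}
    for patient_id, image_path in records:
        splits[assign_split(patient_id, test_fraction)].append(image_path)
    return splits


if __name__ == "__main__":
    demo = [("p1", "a.png"), ("p1", "b.png"), ("p2", "c.png"), ("p3", "d.png")]
    print(split_by_patient(demo, test_fraction=0.5))
```

    Because the assignment depends only on the patient ID and the salt, re-running the split after new images are added does not move existing patients between train and test.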

    The Road Ahead

    The open medical imaging ecosystem continues to grow. Initiatives like the Medical Open Network for Artificial Intelligence (MONAI), Hugging Face’s medical imaging hub, and institutional data-sharing policies are accelerating the availability of high-quality, diverse datasets. As federated learning matures, we may see a future where models can be trained across institutions without centralizing sensitive data — but until then, these open datasets remain the foundation upon which medical imaging AI is built.
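
    For readers unfamiliar with federated learning, the core aggregation step of federated averaging (FedAvg) can be sketched in a few lines: each site trains locally, and a coordinator averages the resulting parameters weighted by local sample count. This is a toy illustration with hypothetical names and flat parameter lists, and it omits the communication, privacy, and security layers a real deployment would need.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]], client_sizes: Sequence[int]
) -> List[float]:
    """Average per-site parameter vectors, weighted by each site's sample count."""
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need one sample count per client weight vector")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all weight vectors must have the same length")
    return [
        sum(n * w[d] for w, n in zip(client_weights, client_sizes)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Two hypothetical hospitals with 100 and 300 local training samples.
    print(federated_average([[1.0, 2.0], [3.0, 6.0]], [100, 300]))
```

    Only the parameter vectors and sample counts leave each site in this scheme; the images themselves stay local, which is the property that makes the approach attractive for sensitive medical data.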

    We will continue to update this resource as new datasets become available. If you know of a dataset we’ve missed, reach out to our editorial team.

  • From Echo to EHR: Multimodal LLMs Edge Closer to a Cardiologist’s Digital Co‑Pilot

    Cardiology may be on the verge of a workflow shift: large language models that can reason across images, waveforms, and text are moving from “chatbot curiosity” to credible diagnostic support. A new paper in the Journal of Medical Systems spotlights the emerging role of multimodal large language models (MLLMs) in cardiovascular diagnostics—models designed to interpret multiple data types in tandem rather than treating each modality as a separate silo.

    That matters because cardiovascular care is fundamentally multimodal. A single patient with chest pain can generate an ECG strip, troponin labs, an echocardiogram, a coronary CTA, prior cath images, medication history, and a long narrative note—often scattered across systems and time. Humans integrate this information with impressive skill, but under real-world pressure: interruptions, time constraints, handoffs, variable documentation quality, and mounting data volume. MLLMs aim to act like an integrative layer that can “read the room” across modalities and produce structured, clinically relevant reasoning—if they can be validated and governed appropriately.

    Why multimodal now?

    Single-modality AI is already established in cardiovascular medicine. Computer vision models can quantify ejection fraction, detect cardiomegaly on chest X-rays, or segment cardiac chambers on MRI. Separate models can flag arrhythmias from ECGs, and NLP tools can extract problems and medications from notes. The limitation is that each model tends to solve one narrow task, and clinicians still do the cross-modal synthesis.

    MLLMs promise something different: a common “brain” that can fuse narrative context with quantitative signals and imaging findings, and then express outputs in a clinician-friendly format. In principle, that could look like a model that reviews an echo video alongside a patient’s BNP trend and admission note, then drafts a differential for dyspnea, highlights red flags for decompensated heart failure, and recommends what additional data would reduce uncertainty.

    According to the Journal of Medical Systems article, the research community is increasingly exploring these multimodal approaches specifically for cardiovascular diagnostics, reflecting broader momentum around foundation models in medicine. The novelty isn’t just higher accuracy on a benchmark; it’s the potential to compress the “search and synthesize” burden that dominates clinical time.

    What’s at stake for clinicians

    If MLLMs mature, they could reshape several day-to-day tasks in cardiology:

    Faster triage and prioritization. Emergency departments and telemetry floors generate constant signals—ECGs, vitals, nursing notes, labs. A multimodal system could continuously integrate these streams and escalate concerning patterns earlier, potentially improving time-to-treatment for STEMI, cardiogenic shock, or malignant arrhythmias.

    More consistent interpretation. Even with guidelines, interpretation varies. MLLMs could provide a “second reader” that checks whether a report’s conclusion aligns with measured values and image features, reducing internal contradictions (for example, a normal EF stated despite low quantitative measurements).

    Documentation and communication. Cardiologists spend substantial time creating consult notes and explaining results. A model that can ingest imaging findings plus the clinical narrative and draft a patient-specific summary may reduce clerical load—while also improving handoffs when multiple teams are involved.

    But this also introduces new responsibilities. Multimodal models can be persuasive even when wrong, and their errors can be cross-modal (e.g., over-weighting a noisy ECG artifact because a note mentions “palpitations”). Clinicians will need interfaces that show provenance—what data the system used, what it ignored, and how confident it is—rather than opaque “answer engines.”

    Implications for patients: access, speed, and trust

    For patients, the potential upside is tangible: earlier detection of deterioration, fewer missed diagnoses, and more understandable explanations of complex findings. In resource-constrained settings, multimodal tools could help generalists interpret echoes or ECGs with cardiology-level support, narrowing specialist gaps.

    Yet the patient-facing risks are equally real. Cardiovascular data is deeply personal and high-dimensional—imaging, genomics, longitudinal notes. Deploying MLLMs raises sharp questions about privacy, data governance, and whether model outputs could inadvertently reveal sensitive information. Bias is another concern: if training data under-represents certain populations, MLLMs could systematically misinterpret findings or misestimate risk in ways that widen disparities.

    The hard part: validation beyond benchmarks

    Cardiovascular diagnostics is not a single “right answer” domain; it’s probabilistic and context-dependent. That makes validation more complex than measuring accuracy on curated test sets. What healthcare systems will want to see are prospective studies showing improved outcomes or safer, faster workflows—without creating alert fatigue or new failure modes.

    Multimodal evaluation should also test robustness: Can the model handle incomplete data, mislabeled imaging series, low-quality point-of-care ultrasound, or conflicting chart narratives? And can it gracefully say “I don’t know” and suggest next steps? These are clinical behaviors, not just model metrics.

    Where this goes next

    The Journal of Medical Systems paper lands at a moment when the industry is deciding what “AI in the clinic” should look like: point solutions, or platform-like assistants that sit across departments. Cardiology could be a proving ground because the specialty already runs on multimodal evidence, standardized measurements, and high-stakes time sensitivity.

    Over the next 12–24 months, expect the conversation to shift from “Can an MLLM interpret an ECG and an image?” to “Can it integrate longitudinal records safely, in real workflows, with auditability and governance?” The winners won’t be the models with the flashiest demos. They’ll be the ones embedded into clinical systems with strong guardrails—clear uncertainty reporting, dataset transparency, human-in-the-loop oversight, and rigorous post-deployment monitoring.

    Source: Journal of Medical Systems, “Emerging Utility of Multimodal Large Language Models in Cardiovascular Diagnostics” (as reported by the journal). Available at: https://link.springer.com/article/10.1007/s10916-026-02361-w