The Definitive Guide to Open Source Medical Imaging Datasets in 2026

The democratization of medical imaging AI research has been driven by the availability of large, high-quality open source datasets. These publicly accessible resources have enabled researchers, clinicians, and developers worldwide to build, validate, and deploy algorithms that are transforming healthcare delivery. In this comprehensive guide, we catalog the most important downloadable medical imaging datasets available today, with direct links to their repositories and source data.

Why Open Source Medical Imaging Datasets Matter

Medical imaging AI has reached an inflection point. Models trained on a single institution's proprietary data often generalize poorly to new scanners and patient populations, while those developed on diverse, well-annotated open datasets tend to perform more robustly across clinical settings. Open datasets also enable reproducibility, a cornerstone of scientific research that has historically been lacking in medical AI publications.

The challenge, however, is that these datasets are scattered across dozens of platforms, institutional repositories, and challenge websites. This guide consolidates them into a single reference, organized by modality and clinical application.

Multi-Modal and General Purpose Datasets

Several landmark datasets span multiple imaging modalities or serve as foundational benchmarks for the broader medical imaging community.

Dataset | Modality | Size | Annotations | Source
Medical Segmentation Decathlon | CT, MRI | 2,633 scans across 10 tasks | Expert segmentations for liver, brain, hippocampus, lung, prostate, cardiac, pancreas, colon, hepatic vessels, spleen | medicaldecathlon.com
The Cancer Imaging Archive (TCIA) | CT, MRI, PET, Pathology | 100M+ images across 150+ collections | Varies by collection: segmentations, clinical data, genomics | cancerimagingarchive.net
Grand Challenge Datasets | Multiple | Varies per challenge | Challenge-specific expert annotations | grand-challenge.org
AMOS 2022 | CT, MRI | 500 CT + 100 MRI scans | 15 abdominal organ segmentations | amos22.grand-challenge.org
TotalSegmentator | CT | 1,204 CT scans | 117 anatomical structures segmented | GitHub
AbdomenAtlas 1.1 | CT | 9,262 CT volumes | 25 organ + tumor annotations | GitHub

Chest and Lung Imaging Datasets

Chest X-ray and CT datasets represent the largest category of open medical imaging data, driven in part by longstanding tuberculosis screening research and, more recently, the COVID-19 pandemic.

Dataset | Modality | Size | Annotations | Source
CheXpert | Chest X-ray | 224,316 images from 65,240 patients | 14 pathology labels with uncertainty annotations | Stanford ML Group
MIMIC-CXR | Chest X-ray | 377,110 images from 65,379 patients | Free-text radiology reports + NLP-extracted labels | PhysioNet
NIH ChestX-ray14 | Chest X-ray | 112,120 images from 30,805 patients | 14 disease labels (NLP-mined) | NIH Clinical Center
PadChest | Chest X-ray | 160,868 images from 67,625 patients | 174 radiographic findings, 19 differential diagnoses | BIMCV
VinDr-CXR | Chest X-ray | 18,000 images | 22 local lesion labels + 6 global disease labels | PhysioNet
LUNA16 | Chest CT | 888 CT scans | Lung nodule annotations from LIDC-IDRI | Grand Challenge
COVID-CT | Chest CT | 349 COVID-19+ and 397 non-COVID CT slices | Binary COVID classification | GitHub
RSNA Pneumonia Detection | Chest X-ray | 30,000 frontal chest X-rays | Bounding boxes for pneumonia opacities | Kaggle
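
The uncertainty annotations in CheXpert need an explicit policy before training. As a minimal sketch, assuming the encoding described in the CheXpert paper (1 for positive, 0 for negative, -1 for uncertain, blank for not mentioned), the helper below maps one raw CSV cell to a training target. The function name `resolve_label` and the choice to treat blank cells as negative are our assumptions for illustration, not part of the dataset's tooling.

```python
from typing import Optional


def resolve_label(raw: str, policy: str = "zeros") -> Optional[float]:
    """Map one raw label cell to a training target.

    Returns None when the label should be masked out of the loss.
    """
    if raw.strip() == "":
        # Assumption: findings that are not mentioned are treated as negative.
        return 0.0
    value = float(raw)
    if value != -1.0:
        return value  # already a definite 0.0 or 1.0
    # The label is uncertain (-1): apply the chosen policy.
    if policy == "ones":
        return 1.0
    if policy == "zeros":
        return 0.0
    if policy == "ignore":
        return None
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical row of raw cells for four findings.
row = ["1.0", "-1.0", "", "0.0"]
targets = [resolve_label(cell, policy="ones") for cell in row]
```

The "ones" and "zeros" policies correspond to what the CheXpert authors call U-Ones and U-Zeros. The better choice may vary by pathology, so it is worth treating the policy as a hyperparameter.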

Brain and Neuroimaging Datasets

Neuroimaging datasets have been critical for advancing our understanding of neurodegenerative diseases, brain tumors, and stroke.

Dataset | Modality | Size | Annotations | Source
BraTS (Brain Tumor Segmentation) | MRI | 2,000+ multi-modal MRI scans | Expert glioma segmentations (enhancing, core, whole) | Synapse
ADNI (Alzheimer’s Disease Neuroimaging Initiative) | MRI, PET | 2,000+ subjects (longitudinal) | Clinical assessments, biomarkers, cognitive scores | ADNI
IXI Dataset | MRI (T1, T2, PD, MRA, DTI) | 600 healthy subjects | Multi-sequence brain MRI from 3 hospitals | brain-development.org
OASIS-3 | MRI, PET | 1,378 subjects, 2,842 MRI sessions | Longitudinal Alzheimer’s + aging data | oasis-brains.org
ISLES (Ischemic Stroke Lesion Segmentation) | MRI | 400+ stroke cases | Stroke lesion segmentations | isles-challenge.org
FastSurfer / FreeSurfer Datasets | MRI | Varies | Cortical parcellations and volumetrics | GitHub

Cardiac Imaging Datasets

Cardiac imaging AI has matured rapidly, with open datasets enabling automated segmentation and functional analysis of the heart.

Dataset | Modality | Size | Annotations | Source
ACDC (Automated Cardiac Diagnosis) | Cardiac MRI | 150 patients (5 subgroups) | LV, RV, myocardium segmentations | CREATIS
M&Ms (Multi-Centre Multi-Vendor) | Cardiac MRI | 375 patients from 6 centers | Cardiac structure segmentations across vendors | ub.edu
EchoNet-Dynamic | Echocardiography | 10,030 echo videos | Ejection fraction labels, semantic segmentations | echonet.github.io
CAMUS | 2D Echocardiography | 500 patients | LV endocardium, epicardium, LA segmentations | CREATIS

Abdominal and Gastrointestinal Imaging

Dataset | Modality | Size | Annotations | Source
LiTS (Liver Tumor Segmentation) | CT | 201 CT scans | Liver and liver tumor segmentations | CodaLab
KiTS (Kidney Tumor Segmentation) | CT | 599 CT scans | Kidney and kidney tumor segmentations | GitHub
Kvasir-SEG | Endoscopy | 1,000 polyp images | Polyp segmentation masks | Simula
CT-ORG | CT | 140 CT scans | 6 organ segmentations (liver, lungs, bladder, kidney, bones, brain) | TCIA

Ophthalmology and Retinal Imaging

Dataset | Modality | Size | Annotations | Source
DRIVE | Fundus Photography | 40 retinal images | Vessel segmentation masks | Grand Challenge
MESSIDOR-2 | Fundus Photography | 1,748 images | Diabetic retinopathy grading | ADCIS
EyePACS | Fundus Photography | 88,702 images | 5-class diabetic retinopathy severity | Kaggle
REFUGE | Fundus Photography | 1,200 images | Glaucoma classification, optic disc/cup segmentation | Grand Challenge
OCTID | OCT | 500 images | Retinal disease classifications | Scholars Portal

Musculoskeletal Imaging

Dataset | Modality | Size | Annotations | Source
MURA | X-ray | 40,561 musculoskeletal radiographs | Normal/abnormal binary labels across 7 body parts | Stanford ML Group
VerSe | CT | 374 CT scans | Vertebra segmentation and labeling | GitHub
OAI (Osteoarthritis Initiative) | MRI, X-ray | 4,796 subjects | Longitudinal knee imaging with clinical outcomes | NIH

How to Get Started

For researchers new to medical imaging AI, we recommend starting with well-documented, moderately sized datasets like the Medical Segmentation Decathlon or CheXpert. These offer clean annotations, established baselines, and active communities. For production-scale development, MIMIC-CXR and The Cancer Imaging Archive provide the volume needed to train robust models.
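
A first step with the Medical Segmentation Decathlon is usually to pair each image with its segmentation. The sketch below assumes each task folder ships a `dataset.json` manifest with `labels` and `training` keys, which is the layout we understand the Decathlon archives to use; check it against your download. The manifest contents and case names shown here are hypothetical.

```python
import json
from pathlib import Path

# Hypothetical manifest, shaped like the dataset.json we assume each task ships.
EXAMPLE_MANIFEST = """
{
  "name": "Spleen",
  "labels": {"0": "background", "1": "spleen"},
  "training": [
    {"image": "./imagesTr/case_001.nii.gz", "label": "./labelsTr/case_001.nii.gz"},
    {"image": "./imagesTr/case_002.nii.gz", "label": "./labelsTr/case_002.nii.gz"}
  ]
}
"""


def list_training_pairs(manifest_text: str, root: str = "."):
    """Return the label map and (image_path, label_path) pairs from a manifest."""
    manifest = json.loads(manifest_text)
    pairs = []
    for entry in manifest["training"]:
        # Paths in the manifest are relative to the task folder.
        image = Path(root) / entry["image"]
        label = Path(root) / entry["label"]
        pairs.append((image, label))
    return manifest.get("labels", {}), pairs


labels, pairs = list_training_pairs(EXAMPLE_MANIFEST, root="Task09_Spleen")
for image, label in pairs:
    print(image.name, "<->", label.name)
```

The returned paths can then be handed to a NIfTI reader of your choice. Keeping the pairing logic separate from image loading makes it easy to check that every image has a matching label before any training starts.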

Key considerations when selecting a dataset include licensing terms (many require a signed data use agreement or credentialed access), annotation quality, patient demographics, and whether the dataset ships predefined train/test splits for fair benchmarking. Always review the associated publications and license terms before incorporating any dataset into your work.
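
When a dataset does not ship official splits, splitting by image can leak the same patient into both train and test, since collections such as CheXpert contain several images per patient. A minimal sketch of a deterministic patient-level split follows. The record layout and patient IDs are hypothetical, and hashing is only one of several reasonable ways to do this.

```python
import hashlib


def assign_split(patient_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a patient to 'train' or 'test'."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


def split_records(records, test_fraction: float = 0.2):
    """Group (patient_id, image_path) records so each patient lands in one split."""
    splits = {"train": [], "test": []}
    for patient_id, image_path in records:
        split = assign_split(patient_id, test_fraction)
        splits[split].append((patient_id, image_path))
    return splits


# Hypothetical records: some patients contribute more than one image.
records = [
    ("patient_0001", "img_a.png"),
    ("patient_0001", "img_b.png"),
    ("patient_0002", "img_c.png"),
    ("patient_0003", "img_d.png"),
    ("patient_0003", "img_e.png"),
]
splits = split_records(records)
```

Because the assignment depends only on the patient ID, rerunning the split or adding new patients later never moves an existing patient between train and test. With small datasets the realized test fraction can differ noticeably from the requested one.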

The Road Ahead

The open medical imaging ecosystem continues to grow. Initiatives like the Medical Open Network for Artificial Intelligence (MONAI), Hugging Face’s medical imaging hub, and institutional data-sharing policies are accelerating the availability of high-quality, diverse datasets. As federated learning matures, we may see a future where models can be trained across institutions without centralizing sensitive data — but until then, these open datasets remain the foundation upon which medical imaging AI is built.

We will continue to update this resource as new datasets become available. If you know of a dataset we’ve missed, reach out to our editorial team.