The Definitive Guide to Open Source Medical Imaging Datasets in 2026

The democratization of medical imaging AI research has been driven by the availability of large, high-quality open datasets. These publicly accessible resources let researchers, clinicians, and developers worldwide build and validate algorithms that are beginning to change healthcare delivery. This guide catalogs the most important downloadable medical imaging datasets available today and names the repository or host where each can be obtained.

Why Open Source Medical Imaging Datasets Matter

Medical imaging AI has reached a critical inflection point. Models trained on a single institution's proprietary data often generalize poorly, while those developed on diverse, well-annotated open datasets tend to perform more robustly across clinical settings. Open datasets also enable reproducibility, a cornerstone of scientific research that has often been lacking in medical AI publications.

The challenge, however, is that these datasets are scattered across dozens of platforms, institutional repositories, and challenge websites. This guide consolidates them into a single reference, organized by modality and clinical application.

Multi-Modal and General Purpose Datasets

Several landmark datasets span multiple imaging modalities or serve as foundational benchmarks for the broader medical imaging community.

Dataset	Modality	Size	Annotations	Source
Medical Segmentation Decathlon	CT, MRI	2,633 scans across 10 tasks	Expert segmentations for liver, brain, hippocampus, lung, prostate, cardiac, pancreas, colon, hepatic vessels, spleen	medicaldecathlon.com
The Cancer Imaging Archive (TCIA)	CT, MRI, PET, Pathology	100M+ images across 150+ collections	Varies by collection — segmentations, clinical data, genomics	cancerimagingarchive.net
Grand Challenge Datasets	Multiple	Varies per challenge	Challenge-specific expert annotations	grand-challenge.org
AMOS 2022	CT, MRI	500 CT + 100 MRI scans	15 abdominal organ segmentations	amos22.grand-challenge.org
TotalSegmentator	CT	1,204 CT scans	104 anatomical structures segmented in the original release (117 in v2)	GitHub
AbdomenAtlas 1.1	CT	9,262 CT volumes	25 organ + tumor annotations	GitHub
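
Segmentation benchmarks like the ones above are commonly scored with overlap metrics, and the Dice similarity coefficient is the usual starting point. The sketch below is a minimal, dependency-free illustration on flattened binary masks. The function name `dice_coefficient` and the choice to return 1.0 when both masks are empty are assumptions made for this example, not any dataset's official evaluation code.

```python
from typing import Sequence


def dice_coefficient(pred: Sequence[int], truth: Sequence[int]) -> float:
    """Dice = 2 * |A and B| / (|A| + |B|) for two flattened binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    # Voxels that are foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    # Total foreground voxels across the two masks.
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks count as a perfect match.
    if total == 0:
        return 1.0
    return 2.0 * intersection / total


# Toy example: 3 predicted voxels, 3 true voxels, 2 in common.
pred_mask = [1, 1, 1, 0, 0, 0]
true_mask = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```

Real volumes would first be loaded with an imaging library and flattened. Challenge leaderboards often combine Dice with a surface-distance metric, so check each benchmark's own evaluation rules before comparing numbers.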

Chest and Lung Imaging Datasets

Chest X-ray and CT datasets form one of the largest categories of open medical imaging data, reflecting the high clinical volume of chest radiography, longstanding tuberculosis screening research, and the data-sharing push during the COVID-19 pandemic.

Dataset	Modality	Size	Annotations	Source
CheXpert	Chest X-ray	224,316 images from 65,240 patients	14 pathology labels with uncertainty annotations	Stanford ML Group
MIMIC-CXR	Chest X-ray	377,110 images from 65,379 patients	Free-text radiology reports + NLP-extracted labels	PhysioNet
NIH ChestX-ray14	Chest X-ray	112,120 images from 30,805 patients	14 disease labels (NLP-mined)	NIH Clinical Center
PadChest	Chest X-ray	160,868 images from 67,625 patients	174 radiographic findings, 19 differential diagnoses	BIMCV
VinDr-CXR	Chest X-ray	18,000 images	22 local lesion labels + 6 global disease labels	PhysioNet
LUNA16	Chest CT	888 CT scans	Lung nodule annotations from LIDC-IDRI	Grand Challenge
COVID-CT	Chest CT	349 COVID-19+ and 397 non-COVID CT slices	Binary COVID classification	GitHub
RSNA Pneumonia Detection	Chest X-ray	30,000 frontal chest X-rays	Bounding boxes for pneumonia opacities	Kaggle
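
The uncertainty annotations noted for CheXpert need a handling policy before training. The sketch below illustrates three simple policies along the lines of those described in the CheXpert paper (map uncertain to negative, map uncertain to positive, or ignore). It assumes uncertain mentions are encoded as -1 and unmentioned findings as blank cells. The function name `resolve_label` and the choice to treat blanks as negative are assumptions for this example, so confirm the encoding against the dataset's own documentation.

```python
from typing import List, Optional

UNCERTAIN = -1.0  # assumed encoding of an uncertain mention


def resolve_label(raw: str, policy: str = "zeros") -> Optional[float]:
    """Map one raw CSV cell to a training target.

    Returns None when the label should be excluded from the loss.
    """
    if raw.strip() == "":
        # Assumption: a blank cell (finding not mentioned) counts as negative.
        return 0.0
    value = float(raw)
    if value != UNCERTAIN:
        return value  # already a definite 0.0 or 1.0
    if policy == "zeros":
        return 0.0
    if policy == "ones":
        return 1.0
    if policy == "ignore":
        return None
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical row of four label cells: positive, blank, uncertain, negative.
row: List[str] = ["1.0", "", "-1.0", "0.0"]
targets_zeros = [resolve_label(cell, "zeros") for cell in row]
targets_ones = [resolve_label(cell, "ones") for cell in row]
targets_ignore = [resolve_label(cell, "ignore") for cell in row]
```

Which policy works best can differ by pathology, so it is worth treating the policy as a hyperparameter rather than fixing it up front.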

Brain and Neuroimaging Datasets

Neuroimaging datasets have been critical for advancing our understanding of neurodegenerative diseases, brain tumors, and stroke.

Dataset	Modality	Size	Annotations	Source
BraTS (Brain Tumor Segmentation)	MRI	2,000+ multi-modal MRI scans	Expert glioma segmentations (enhancing tumor, tumor core, whole tumor)	Synapse
ADNI (Alzheimer’s Disease Neuroimaging)	MRI, PET	2,000+ subjects longitudinal	Clinical assessments, biomarkers, cognitive scores	ADNI
IXI Dataset	MRI (T1, T2, PD, MRA, DTI)	600 healthy subjects	Multi-sequence brain MRI from 3 hospitals	brain-development.org
OASIS-3	MRI, PET	1,378 subjects, 2,842 MRI sessions	Longitudinal Alzheimer’s + aging data	oasis-brains.org
ISLES (Ischemic Stroke Lesion)	MRI	400+ stroke cases	Stroke lesion segmentations	isles-challenge.org
FastSurfer / FreeSurfer Datasets	MRI	Varies	Cortical parcellations and volumetrics	GitHub
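
The three BraTS regions listed above are nested and are usually derived from a single label map rather than stored separately. The sketch below shows one way to do that on a flattened list of voxel labels. It assumes the label values used in earlier BraTS editions (1 for necrotic or non-enhancing core, 2 for edema, 4 for enhancing tumor). Some later editions are reported to use a different value for enhancing tumor, which is why it is a parameter here. The function name `brats_regions` is hypothetical.

```python
from typing import Dict, List


def brats_regions(labels: List[int], enhancing_label: int = 4) -> Dict[str, List[int]]:
    """Derive binary enhancing / core / whole masks from one label map."""
    necrotic, edema = 1, 2  # assumed label values; verify per release
    core_labels = {necrotic, enhancing_label}
    whole_labels = {necrotic, edema, enhancing_label}
    return {
        "enhancing": [int(v == enhancing_label) for v in labels],
        "core": [int(v in core_labels) for v in labels],
        "whole": [int(v in whole_labels) for v in labels],
    }


# Toy label map of six voxels.
voxels = [0, 1, 2, 4, 2, 0]
regions = brats_regions(voxels)
```

Because the regions are nested, every enhancing voxel is also a core voxel and every core voxel is also part of the whole tumor, which makes a quick sanity check on any derived masks.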

Cardiac Imaging Datasets

Cardiac imaging AI has matured rapidly, with open datasets enabling automated segmentation and functional analysis of the heart.

Dataset	Modality	Size	Annotations	Source
ACDC (Automated Cardiac Diagnosis)	Cardiac MRI	150 patients (5 subgroups)	LV, RV, myocardium segmentations	CREATIS
M&Ms (Multi-Centre Multi-Vendor)	Cardiac MRI	375 patients from 6 centers	Cardiac structure segmentations across vendors	ub.edu
EchoNet-Dynamic	Echocardiography	10,030 echo videos	Ejection fraction labels, semantic segmentations	echonet.github.io
CAMUS	2D Echocardiography	500 patients	LV endocardium, epicardium, LA segmentations	CREATIS
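
Ejection fraction, the label provided with EchoNet-Dynamic, is the percentage of end-diastolic volume ejected during systole: EF = 100 × (EDV − ESV) / EDV. The short sketch below computes it from two volumes. The function name and the input checks are assumptions for illustration, and it says nothing about how any dataset estimates the volumes themselves.

```python
def ejection_fraction(edv_ml: float, esv_ml: float) -> float:
    """Return ejection fraction in percent from end-diastolic and end-systolic volumes."""
    if edv_ml <= 0:
        raise ValueError("end-diastolic volume must be positive")
    if not 0 <= esv_ml <= edv_ml:
        raise ValueError("end-systolic volume must lie between 0 and EDV")
    return 100.0 * (edv_ml - esv_ml) / edv_ml


# Hypothetical volumes in millilitres.
ef = ejection_fraction(edv_ml=120.0, esv_ml=50.0)  # 100 * 70 / 120
```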

Abdominal and Gastrointestinal Imaging

Dataset	Modality	Size	Annotations	Source
LiTS (Liver Tumor Segmentation)	CT	201 CT scans	Liver and liver tumor segmentations	CodaLab
KiTS (Kidney Tumor Segmentation)	CT	599 CT scans	Kidney and kidney tumor segmentations	GitHub
Kvasir-SEG	Endoscopy	1,000 polyp images	Polyp segmentation masks	Simula
CT-ORG	CT	140 CT scans	6 organ segmentations (liver, lungs, bladder, kidney, bones, brain)	TCIA

Ophthalmology and Retinal Imaging

Dataset	Modality	Size	Annotations	Source
DRIVE	Fundus Photography	40 retinal images	Vessel segmentation masks	Grand Challenge
MESSIDOR-2	Fundus Photography	1,748 images	Diabetic retinopathy grading	ADCIS
EyePACS	Fundus Photography	88,702 images	5-class diabetic retinopathy severity	Kaggle
REFUGE	Fundus Photography	1,200 images	Glaucoma classification, optic disc/cup segmentation	Grand Challenge
OCTID	OCT	500 images	Retinal disease classifications	Scholars Portal

Musculoskeletal Imaging

Dataset	Modality	Size	Annotations	Source
MURA	X-ray	40,561 musculoskeletal radiographs	Normal/abnormal binary labels across 7 body parts	Stanford ML Group
VerSe	CT	374 CT scans	Vertebra segmentation and labeling	GitHub
OAI (Osteoarthritis Initiative)	MRI, X-ray	4,796 subjects	Longitudinal knee imaging with clinical outcomes	NIH

How to Get Started

For researchers new to medical imaging AI, we recommend starting with well-documented, widely benchmarked datasets like the Medical Segmentation Decathlon or CheXpert. These offer documented annotation protocols, established baselines, and active communities. For larger-scale development, MIMIC-CXR and The Cancer Imaging Archive provide the volume needed to train robust models.

Key considerations when selecting a dataset include licensing terms (many require registration or a signed data use agreement, and some restrict commercial use), annotation quality, patient demographics, and whether the dataset includes official train/test splits for fair benchmarking. Always review the associated publications and data use agreements before incorporating any dataset into your work.
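
When a dataset ships without an official split, one common safeguard is to split by patient rather than by image, so that images from the same patient never appear on both sides. The sketch below shows a deterministic patient-level assignment based on hashing the patient identifier. The record layout, file names, and function name `assign_split` are hypothetical, and the resulting test fraction is only approximate on small cohorts.

```python
import hashlib
from typing import Dict, List, Tuple


def assign_split(patient_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically assign a patient to 'train' or 'test'."""
    digest = hashlib.sha256(patient_id.encode("utf-8")).hexdigest()
    # Map the first 32 bits of the hash to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    return "test" if bucket < test_fraction else "train"


# Hypothetical records: (image file, patient identifier).
records: List[Tuple[str, str]] = [
    ("img_001.png", "patient_A"),
    ("img_002.png", "patient_A"),
    ("img_003.png", "patient_B"),
    ("img_004.png", "patient_C"),
]

splits: Dict[str, List[str]] = {"train": [], "test": []}
for image, patient in records:
    # All images of one patient land in the same split.
    splits[assign_split(patient)].append(image)
```

Hashing keeps the assignment stable when new patients are added later, at the cost of not controlling class balance. Where a dataset does publish an official split, use it instead so results remain comparable with published baselines.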

The Road Ahead

The open medical imaging ecosystem continues to grow. Initiatives like the Medical Open Network for Artificial Intelligence (MONAI), medical imaging datasets hosted on the Hugging Face Hub, and institutional data-sharing policies are accelerating the availability of high-quality, diverse datasets. As federated learning matures, models may increasingly be trained across institutions without centralizing sensitive data. Until then, these open datasets remain the foundation upon which medical imaging AI is built.

We will continue to update this resource as new datasets become available. If you know of a dataset we’ve missed, reach out to our editorial team.