The democratization of medical imaging AI research has been driven by the availability of large, high-quality open-source datasets. These publicly accessible resources have enabled researchers, clinicians, and developers worldwide to build, validate, and deploy algorithms that are transforming healthcare delivery. In this guide, we catalog the most important downloadable medical imaging datasets available today and name the repository or host where each can be obtained.
Why Open Source Medical Imaging Datasets Matter
Medical imaging AI has reached an inflection point. Models trained on proprietary datasets often generalize poorly, while those developed on diverse, well-annotated open datasets tend to perform more robustly across clinical settings. Open datasets also enable reproducibility, a cornerstone of scientific research that has often been lacking in medical AI publications.
The challenge, however, is that these datasets are scattered across dozens of platforms, institutional repositories, and challenge websites. This guide consolidates them into a single reference, organized by modality and clinical application.
Multi-Modal and General Purpose Datasets
Several landmark datasets span multiple imaging modalities or serve as foundational benchmarks for the broader medical imaging community.
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| Medical Segmentation Decathlon | CT, MRI | 2,633 scans across 10 tasks | Expert segmentations for liver, brain, hippocampus, lung, prostate, cardiac, pancreas, colon, hepatic vessels, spleen | medicaldecathlon.com |
| The Cancer Imaging Archive (TCIA) | CT, MRI, PET, Pathology | 100M+ images across 150+ collections | Varies by collection — segmentations, clinical data, genomics | cancerimagingarchive.net |
| Grand Challenge Datasets | Multiple | Varies per challenge | Challenge-specific expert annotations | grand-challenge.org |
| AMOS 2022 | CT, MRI | 500 CT + 100 MRI scans | 15 abdominal organ segmentations | amos22.grand-challenge.org |
| TotalSegmentator | CT | 1,204 CT scans (v1); 1,228 (v2) | 104 anatomical structures segmented (v1); 117 (v2) | GitHub |
| AbdomenAtlas 1.1 | CT | 9,262 CT volumes | 25 organ + tumor annotations | GitHub |
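Because the catalog is laid out as pipe-delimited tables, it can be queried programmatically. The following is a minimal sketch, not a complete Markdown parser: it copies three columns from a few rows of the table above and filters them by modality. The function names `parse_table` and `datasets_with_modality` are illustrative and do not come from any library.

```python
# A few rows copied from the table above (Dataset, Modality, Size columns only).
TABLE = """
| Dataset | Modality | Size |
|---|---|---|
| Medical Segmentation Decathlon | CT, MRI | 2,633 scans across 10 tasks |
| The Cancer Imaging Archive (TCIA) | CT, MRI, PET, Pathology | 100M+ images across 150+ collections |
| AMOS 2022 | CT, MRI | 500 CT + 100 MRI scans |
| AbdomenAtlas 1.1 | CT | 9,262 CT volumes |
"""


def parse_table(markdown: str) -> list[dict[str, str]]:
    """Turn a simple pipe table into one dict per row, keyed by header."""
    lines = [ln.strip() for ln in markdown.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].strip("|").split("|")]
    rows = []
    for line in lines[2:]:  # skip the header row and the |---| separator
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def datasets_with_modality(rows: list[dict[str, str]], modality: str) -> list[str]:
    """Return dataset names whose comma-separated Modality cell lists `modality`."""
    wanted = modality.lower()
    return [
        row["Dataset"]
        for row in rows
        if wanted in [m.strip().lower() for m in row["Modality"].split(",")]
    ]


rows = parse_table(TABLE)
mri_datasets = datasets_with_modality(rows, "MRI")
print(mri_datasets)
```

The sketch assumes no cell contains a literal `|` character, which holds for the tables in this guide.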
Chest and Lung Imaging Datasets
Chest X-ray and CT datasets are among the largest categories of open medical imaging data, driven in part by longstanding tuberculosis screening research and, more recently, the COVID-19 pandemic.
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| CheXpert | Chest X-ray | 224,316 images from 65,240 patients | 14 pathology labels with uncertainty annotations | Stanford ML Group |
| MIMIC-CXR | Chest X-ray | 377,110 images from 65,379 patients | Free-text radiology reports + NLP-extracted labels | PhysioNet |
| NIH ChestX-ray14 | Chest X-ray | 112,120 images from 30,805 patients | 14 disease labels (NLP-mined) | NIH Clinical Center |
| PadChest | Chest X-ray | 160,868 images from 67,625 patients | 174 radiographic findings, 19 differential diagnoses | BIMCV |
| VinDr-CXR | Chest X-ray | 18,000 images | 22 local lesion labels + 6 global disease labels | PhysioNet |
| LUNA16 | Chest CT | 888 CT scans | Lung nodule annotations from LIDC-IDRI | Grand Challenge |
| COVID-CT | Chest CT | 349 COVID-19+ and 397 non-COVID CT slices | Binary COVID classification | GitHub |
| RSNA Pneumonia Detection | Chest X-ray | 30,000 frontal chest X-rays | Bounding boxes for pneumonia opacities | Kaggle |
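Several of the chest X-ray rows above report both an image count and a patient count, and the ratio between them is worth checking before training. The short sketch below uses only the figures from the table to compute the average number of images per patient. The dictionary layout and function name are illustrative choices.

```python
# (images, patients) pairs taken from the chest X-ray table above.
COUNTS = {
    "CheXpert": (224_316, 65_240),
    "MIMIC-CXR": (377_110, 65_379),
    "NIH ChestX-ray14": (112_120, 30_805),
    "PadChest": (160_868, 67_625),
}


def images_per_patient(images: int, patients: int) -> float:
    """Average number of images contributed by each patient."""
    return images / patients


ratios = {
    name: round(images_per_patient(images, patients), 2)
    for name, (images, patients) in COUNTS.items()
}

for name, ratio in ratios.items():
    print(f"{name}: about {ratio} images per patient")
```

Every ratio is above 1, so each of these datasets contains multiple images from the same patient on average. That is the reason splits for such data are usually made at the patient level rather than the image level.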
Brain and Neuroimaging Datasets
Neuroimaging datasets have been critical for advancing our understanding of neurodegenerative diseases, brain tumors, and stroke.
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| BraTS (Brain Tumor Segmentation) | MRI | 2,000+ multi-modal MRI scans | Expert glioma segmentations (enhancing, core, whole) | Synapse |
| ADNI (Alzheimer’s Disease Neuroimaging) | MRI, PET | 2,000+ subjects longitudinal | Clinical assessments, biomarkers, cognitive scores | ADNI |
| IXI Dataset | MRI (T1, T2, PD, MRA, DTI) | 600 healthy subjects | Multi-sequence brain MRI from 3 hospitals | brain-development.org |
| OASIS-3 | MRI, PET | 1,378 subjects, 2,842 MRI sessions | Longitudinal Alzheimer’s + aging data | oasis-brains.org |
| ISLES (Ischemic Stroke Lesion) | MRI | 400+ stroke cases | Stroke lesion segmentations | isles-challenge.org |
| FastSurfer / FreeSurfer Datasets | MRI | Varies | Cortical parcellations and volumetrics | GitHub |
Cardiac Imaging Datasets
Cardiac imaging AI has matured rapidly, with open datasets enabling automated segmentation and functional analysis of the heart.
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| ACDC (Automated Cardiac Diagnosis) | Cardiac MRI | 150 patients (5 subgroups) | LV, RV, myocardium segmentations | CREATIS |
| M&Ms (Multi-Centre Multi-Vendor) | Cardiac MRI | 375 patients from 6 centers | Cardiac structure segmentations across vendors | ub.edu |
| EchoNet-Dynamic | Echocardiography | 10,030 echo videos | Ejection fraction labels, semantic segmentations | echonet.github.io |
| CAMUS | 2D Echocardiography | 500 patients | LV endocardium, epicardium, LA segmentations | CREATIS |
Abdominal and Gastrointestinal Imaging
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| LiTS (Liver Tumor Segmentation) | CT | 201 CT scans | Liver and liver tumor segmentations | CodaLab |
| KiTS (Kidney Tumor Segmentation) | CT | 599 CT scans | Kidney and kidney tumor segmentations | GitHub |
| Kvasir-SEG | Endoscopy | 1,000 polyp images | Polyp segmentation masks | Simula |
| CT-ORG | CT | 140 CT scans | 6 organ segmentations (liver, lungs, bladder, kidney, bones, brain) | TCIA |
Ophthalmology and Retinal Imaging
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| DRIVE | Fundus Photography | 40 retinal images | Vessel segmentation masks | Grand Challenge |
| MESSIDOR-2 | Fundus Photography | 1,748 images | Diabetic retinopathy grading | ADCIS |
| EyePACS | Fundus Photography | 88,702 images | 5-class diabetic retinopathy severity | Kaggle |
| REFUGE | Fundus Photography | 1,200 images | Glaucoma classification, optic disc/cup segmentation | Grand Challenge |
| OCTID | OCT | 500 images | Retinal disease classifications | Scholars Portal |
Musculoskeletal Imaging
| Dataset | Modality | Size | Annotations | Source |
|---|---|---|---|---|
| MURA | X-ray | 40,561 musculoskeletal radiographs | Normal/abnormal binary labels across 7 body parts | Stanford ML Group |
| VerSe | CT | 374 CT scans | Vertebra segmentation and labeling | GitHub |
| OAI (Osteoarthritis Initiative) | MRI, X-ray | 4,796 subjects | Longitudinal knee imaging with clinical outcomes | NIH |
How to Get Started
For researchers new to medical imaging AI, we recommend starting with well-documented, moderately sized datasets like the Medical Segmentation Decathlon or CheXpert. These offer clean annotations, established baselines, and active communities. For production-scale development, MIMIC-CXR and The Cancer Imaging Archive provide the volume needed to train robust models.
Key considerations when selecting a dataset include licensing terms (many datasets require registration or a signed data use agreement), annotation quality, patient demographics, and whether the dataset ships with predefined train/test splits for fair benchmarking. Always review the associated publications and license or data use terms before incorporating any dataset into your work.
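When a dataset does not come with predefined splits, one common precaution is to split by patient rather than by image, so that no patient contributes images to both the training and test sets. The sketch below shows one way to do this deterministically by hashing the patient identifier. It is an illustration under stated assumptions, not the procedure used by any dataset in this guide: the patient IDs, file names, 20% test fraction, and `salt` value are all hypothetical.

```python
import hashlib


def assign_split(patient_id: str, test_fraction: float = 0.2, salt: str = "v1") -> str:
    """Map a patient ID to 'train' or 'test' using a stable hash."""
    digest = hashlib.sha256(f"{salt}:{patient_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_by_patient(records, test_fraction: float = 0.2):
    """Group (patient_id, image_path) records so each patient lands in one split."""
    splits = {"train": [], "test": []}
    for patient_id, image_path in records:
        split = assign_split(patient_id, test_fraction)
        splits[split].append((patient_id, image_path))
    return splits


# Hypothetical records: 50 patients with 3 images each.
records = [
    (f"patient{p:03d}", f"patient{p:03d}_view{v}.png")
    for p in range(50)
    for v in range(3)
]

splits = split_by_patient(records)
train_patients = {pid for pid, _ in splits["train"]}
test_patients = {pid for pid, _ in splits["test"]}
print(len(train_patients), "train patients,", len(test_patients), "test patients")
```

Because the assignment depends only on the patient ID and the salt, rerunning the script or adding new images for an existing patient does not move that patient between splits. The realized test fraction is approximate and will vary with the set of IDs.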
The Road Ahead
The open medical imaging ecosystem continues to grow. Initiatives like the Medical Open Network for Artificial Intelligence (MONAI), medical imaging datasets hosted on the Hugging Face Hub, and institutional data-sharing policies are accelerating the availability of high-quality, diverse datasets. As federated learning matures, we may see a future where models can be trained across institutions without centralizing sensitive data. Until then, these open datasets remain the foundation upon which medical imaging AI is built.
We will continue to update this resource as new datasets become available. If you know of a dataset we’ve missed, reach out to our editorial team.

