Open Source Datasets for AI in Dermatology: A Complete Resource Guide

Dermatology has emerged as one of the most active frontiers for AI in healthcare, driven in large part by the visual nature of skin disease diagnosis. The field’s reliance on pattern recognition from images makes it a natural fit for deep learning — and the availability of open source datasets has been the catalyst for an explosion of research. From melanoma detection to rare disease classification, publicly accessible dermatology datasets are enabling researchers and developers to build systems that could one day match or exceed expert-level diagnostic accuracy.

This guide catalogs the major open source dermatology datasets available today, along with where to find the source data and code repositories. Whether you’re training a skin lesion classifier, building a dermoscopic segmentation model, or exploring multimodal dermatology AI, this is your starting point.

The Landscape of Dermatology AI Data

Skin imaging datasets broadly fall into three categories: clinical photographs (taken with standard cameras in clinical settings), dermoscopic images (captured with dermatoscopes, which combine magnification with polarized or non-polarized illumination), and histopathological images (microscopy slides of skin biopsies). Each modality presents different challenges for AI systems, and the best models increasingly combine information across modalities.

A critical challenge in dermatology AI is skin tone diversity. Many early datasets were heavily skewed toward lighter skin tones, leading to models that performed poorly on darker skin. Recent initiatives have begun addressing this gap, and we highlight datasets that contribute to more equitable AI development.

Skin Lesion Classification Datasets

These datasets focus on categorizing skin lesions into diagnostic categories — the most common task in dermatology AI.

Dataset | Images | Classes | Image Type | Key Features | Source
ISIC Archive | 150,000+ | Multiple (varies) | Dermoscopic + Clinical | Largest public skin lesion archive; basis for the ISIC challenges since 2016 | isic-archive.com
HAM10000 | 10,015 | 7 diagnostic categories | Dermoscopic | Curated from two sites; includes actinic keratoses, basal cell carcinoma, benign keratosis, dermatofibroma, melanoma, nevi, vascular lesions | Harvard Dataverse
Fitzpatrick17k | 16,577 | 114 conditions | Clinical photographs | Labeled with Fitzpatrick skin type (I-VI); addresses skin tone bias in dermatology AI | GitHub
PAD-UFES-20 | 2,298 | 6 skin lesion types | Clinical smartphone photos | Includes patient metadata (age, sex, body region); smartphone-captured for real-world performance | Mendeley Data
Derm7pt | 2,000 | Multiclass + 7-point checklist | Dermoscopic + Clinical pairs | Both dermoscopic and clinical images per lesion; 7-point checklist scoring for structured diagnosis | SFU
DermNet Dataset | 23,000+ | 600+ conditions | Clinical photographs | Broadest condition coverage; images sourced from DermNet NZ | Kaggle
SD-198 | 6,584 | 198 skin disease categories | Clinical photographs | Fine-grained classification benchmark | GitHub
DDI (Diverse Dermatology Images) | 656 | 78 conditions | Clinical photographs | Specifically curated for skin tone diversity; pathology-confirmed diagnoses | ddi-dataset.github.io

Dermoscopic Segmentation Datasets

Segmentation datasets provide pixel-level masks delineating lesion boundaries, enabling AI systems to precisely locate and measure skin lesions.

Dataset | Images | Annotation Type | Key Features | Source
ISIC 2018 Task 1 | 2,594 | Lesion boundary segmentation masks | Part of ISIC Challenge; gold standard for lesion segmentation | ISIC Challenge
PH2 | 200 | Lesion segmentation + dermoscopic structures | Expert annotations with asymmetry, border, color, dermoscopic structures | ADDI Project
DermIS/DermQuest | Varies | Clinical descriptions + segmentations | Historical atlas-style dataset | DermIS
ISIC 2017 Challenge | 2,750 | Segmentation + classification | Melanoma, seborrheic keratosis, benign nevi | ISIC Challenge
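
Segmentation results on datasets like these are usually scored by how much a predicted mask overlaps the annotated one. The sketch below computes two common overlap scores, the Dice coefficient and intersection-over-union (IoU, also called the Jaccard index), for a pair of binary masks. The `overlap_scores` name, the nested-list mask representation, and the choice to score two empty masks as 1.0 are assumptions made for illustration; real pipelines typically work on image arrays and may follow a specific challenge's scoring rules.

```python
from typing import Sequence, Tuple

# A 2D binary mask as nested sequences: 1 = lesion pixel, 0 = background.
Mask = Sequence[Sequence[int]]


def overlap_scores(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (dice, iou) for two binary masks of identical shape."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same shape")

    intersection = pred_area = truth_area = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            p, t = int(bool(p)), int(bool(t))
            intersection += p & t
            pred_area += p
            truth_area += t

    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks are empty. Scoring this as perfect agreement is a
        # convention chosen for this sketch, not a universal rule.
        return 1.0, 1.0

    dice = 2 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


# Toy 2x3 masks: 3 predicted pixels, 3 annotated pixels, 2 shared.
pred = [[0, 1, 1],
        [0, 1, 0]]
truth = [[0, 1, 0],
         [0, 1, 1]]
dice, iou = overlap_scores(pred, truth)
print(f"Dice={dice:.3f}  IoU={iou:.3f}")  # Dice=0.667  IoU=0.500
```

Dice is never lower than IoU for the same pair of masks, so it matters which of the two a paper or leaderboard reports when comparing numbers.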

Skin Cancer Screening Datasets

Dataset | Images | Focus | Key Features | Source
BCN20000 | 19,424 | 8 diagnostic categories | Hospital Clínic de Barcelona dataset; demographically rich metadata | arXiv (Paper)
MClass-D / MClass-ND | 100 / 100 | Melanoma vs. nevi | Benchmarking sets used in human-vs-AI studies | skinclass.de
SIIM-ISIC Melanoma Classification | 33,126 | Melanoma detection | Kaggle competition dataset with patient metadata; one of the largest melanoma-specific datasets | Kaggle

Specialized Dermatology Datasets

Dataset | Images | Focus | Key Features | Source
SkinCon | 3,230 | 48 clinical concept annotations | Concept-based annotations for explainable AI in dermatology | skincon-dataset.github.io
Monkeypox Skin Lesion Dataset | 2,000+ | Monkeypox vs. similar conditions | Created during 2022 outbreak; includes measles, chickenpox, cowpox comparisons | GitHub
Wound Imaging | 1,335 | Chronic wound classification | Diabetic foot ulcers, venous ulcers, pressure injuries | GitHub
SCIN (Skin Condition Image Network) | 10,000+ | Crowd-sourced skin conditions | Google Health initiative; diverse skin tones; self-reported conditions | GitHub

Multimodal and Text-Image Datasets

The latest generation of dermatology datasets pairs images with rich textual descriptions, enabling vision-language models and more sophisticated AI systems.

Dataset | Size | Modalities | Key Features | Source
SkinGPT-4 Training Data | 52,929 image-text pairs | Dermoscopic images + diagnostic text | Used to train SkinGPT-4 vision-language model | GitHub
DermExpert | 50,000+ pairs | Clinical images + expert descriptions | Expert-written descriptions for training diagnostic chatbots | GitHub

Addressing Bias: Skin Tone Diversity

One of the most important developments in dermatology AI has been the growing recognition that datasets must represent the full spectrum of human skin tones. Early datasets like HAM10000 were overwhelmingly composed of images from light-skinned individuals, leading to models that underperformed on darker skin. The Fitzpatrick17k and DDI datasets were explicitly created to address this gap, and the ISIC Archive has been actively expanding its diversity.

Researchers building dermatology AI systems should evaluate performance across Fitzpatrick skin types I through VI and report disaggregated metrics. This is not just a technical concern — it is an ethical imperative that directly impacts clinical equity.
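
As a minimal sketch of what disaggregated reporting can look like, the function below groups binary predictions by Fitzpatrick skin type and reports accuracy, sensitivity, and specificity for each group. The record layout, the `metrics_by_skin_type` name, and the toy records are hypothetical; they do not come from any dataset listed here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (fitzpatrick_type, true_label, predicted_label),
# where the type is 1-6 and labels are 1 = malignant, 0 = benign.
Record = Tuple[int, int, int]


def metrics_by_skin_type(records: Iterable[Record]) -> Dict[int, dict]:
    """Compute per-skin-type accuracy, sensitivity, and specificity."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
    for skin_type, y_true, y_pred in records:
        if y_true == 1:
            key = "tp" if y_pred == 1 else "fn"
        else:
            key = "fp" if y_pred == 1 else "tn"
        counts[skin_type][key] += 1

    report = {}
    for skin_type in sorted(counts):
        c = counts[skin_type]
        n = sum(c.values())
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        report[skin_type] = {
            "n": n,
            "accuracy": (c["tp"] + c["tn"]) / n,
            # None marks a metric that is undefined for this group
            # (no positive or no negative cases), rather than hiding it as 0.
            "sensitivity": c["tp"] / positives if positives else None,
            "specificity": c["tn"] / negatives if negatives else None,
        }
    return report


# Toy records, invented for illustration only.
toy_records = [
    (1, 1, 1), (1, 0, 0), (1, 1, 1), (1, 0, 0),
    (6, 1, 0), (6, 0, 0), (6, 1, 1), (6, 0, 1),
]
for skin_type, metrics in metrics_by_skin_type(toy_records).items():
    print(skin_type, metrics)
```

Reporting the group size `n` next to each metric helps readers judge how much weight a per-group number can bear, since small groups produce noisy estimates.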

Model Repositories and Pretrained Weights

Several research groups have released pretrained models alongside their datasets, enabling rapid experimentation and transfer learning.

Getting Started with Dermatology AI

For newcomers, we recommend beginning with HAM10000 for classification tasks or the ISIC 2018 dataset for segmentation. Both are well-documented, moderately sized, and have established baselines. The Fitzpatrick17k dataset is essential for anyone building systems intended for clinical deployment, as it enables fairness evaluation across skin tones.
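
A useful first step with any classification dataset is to check how the labels are distributed before training. The sketch below tallies label shares from a metadata CSV using only the Python standard library. It assumes the diagnosis label lives in a column named `dx`, which is how the HAM10000 metadata file is commonly described; check the header of the file you actually download. The `class_distribution` name is hypothetical and the sample rows are invented placeholders.

```python
import csv
import io
from collections import Counter
from typing import Dict, TextIO


def class_distribution(metadata_file: TextIO, label_column: str = "dx") -> Dict[str, float]:
    """Return the fraction of rows per label, most frequent label first."""
    reader = csv.DictReader(metadata_file)
    counts = Counter(row[label_column] for row in reader)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("metadata file has no rows")
    return {label: count / total for label, count in counts.most_common()}


# Invented placeholder rows, not real dataset records.
sample = io.StringIO(
    "image_id,dx\n"
    "img_001,nv\n"
    "img_002,nv\n"
    "img_003,mel\n"
    "img_004,nv\n"
)
for label, share in class_distribution(sample).items():
    print(f"{label}: {share:.0%}")
```

With a downloaded file, the same function can be called as `with open(path, newline="") as f: class_distribution(f)`, where `path` points to the metadata CSV on disk.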

For production-grade melanoma screening systems, the SIIM-ISIC competition dataset provides the scale and metadata richness needed for robust model development. And for researchers exploring multimodal approaches, the SkinGPT-4 training data offers a starting point for vision-language model development in dermatology.

As the field continues to evolve, we expect to see more datasets incorporating 3D skin imaging, total body photography, and longitudinal monitoring data. The foundation for equitable, effective dermatology AI starts with the data — and these open resources are making that foundation stronger every year.