Study Reveals Persistent Racial Bias in Dermatology AI Models Trained on Public Datasets

A comprehensive evaluation of 22 commercially available and research-grade dermatology AI models found that diagnostic accuracy drops by an average of 18 percentage points when the models are tested on patients with Fitzpatrick skin types V and VI, according to a study published this week in Nature Medicine.

The study, conducted by researchers at MIT, Harvard Medical School, and Emory University, tested models against a newly curated dataset of 12,000 biopsy-confirmed skin lesion images with balanced representation across all six Fitzpatrick skin types.

Key Findings

Across the 22 models tested:

Average sensitivity for melanoma on Fitzpatrick I-II skin: 91.4%
Average sensitivity for melanoma on Fitzpatrick V-VI skin: 73.2%
The gap was smallest (8 points) in models trained on diverse datasets and largest (29 points) in models trained primarily on data from European and North American populations
Three models, all trained on intentionally balanced datasets, showed no statistically significant performance difference across skin types
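
For readers who want to see how a sensitivity gap like the one above is measured, here is a minimal Python sketch. It is not the study's evaluation code: the function name, the record layout, and the toy records are all made up for illustration. Sensitivity is the share of biopsy-confirmed melanomas that a model correctly flags, computed separately for each skin-type group.

```python
from collections import defaultdict

def sensitivity_by_group(records):
    """Return melanoma sensitivity per group.

    records: iterable of (group, has_melanoma, predicted_melanoma) tuples.
    This layout is a hypothetical one chosen for the example.
    """
    true_pos = defaultdict(int)
    false_neg = defaultdict(int)
    for group, has_melanoma, predicted in records:
        if not has_melanoma:
            continue  # sensitivity only looks at confirmed positive cases
        if predicted:
            true_pos[group] += 1
        else:
            false_neg[group] += 1
    groups = set(true_pos) | set(false_neg)
    return {g: true_pos[g] / (true_pos[g] + false_neg[g]) for g in groups}

# Toy records, invented for illustration (not data from the study).
records = (
    [("I-II", True, True)] * 9 + [("I-II", True, False)] * 1
    + [("V-VI", True, True)] * 7 + [("V-VI", True, False)] * 3
    + [("V-VI", False, False)] * 5  # benign cases do not affect sensitivity
)

rates = sensitivity_by_group(records)
gap_points = (rates["I-II"] - rates["V-VI"]) * 100
print(rates, round(gap_points, 1))
```

Applied to the averages reported in the study, the same subtraction gives 91.4 - 73.2 = 18.2 percentage points.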

The Dataset Problem

“This is fundamentally a data problem, not an algorithm problem,” said Dr. Roxana Daneshjou, a dermatologist at Stanford and co-author of the study. “The most widely used public dermatology datasets are over 80% Fitzpatrick I-III. If you train on biased data, you get biased models. It’s that simple.”

The study found that even state-of-the-art foundation models, when fine-tuned on imbalanced dermatology datasets, inherit and sometimes amplify existing biases. This challenges the assumption that larger, more capable base models automatically produce fairer downstream performance.

Regulatory Response

The findings come as the FDA is developing updated guidance on demographic performance reporting for AI medical devices. Currently, manufacturers are not required to report disaggregated performance data across racial or ethnic groups, though the FDA has signaled this may change.

The authors recommend mandatory reporting of model performance across skin types for any dermatology AI seeking FDA clearance, as well as minimum performance thresholds that must be met across all demographic groups, not just in aggregate.
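
The difference between an aggregate threshold and a per-group threshold can be illustrated with a short Python sketch. The two sensitivity figures are the study's reported averages. The 80% threshold, the equal weighting of the two groups, and the function name are assumptions made for the example; neither the authors nor the FDA are reported to have proposed specific values.

```python
def failing_groups(per_group, threshold):
    """Return the groups whose score falls below the threshold."""
    return {g: score for g, score in per_group.items() if score < threshold}

# Average melanoma sensitivity (%) reported in the study.
reported = {"I-II": 91.4, "V-VI": 73.2}

# Hypothetical minimum sensitivity, chosen only for illustration.
threshold = 80.0

# Unweighted mean of the two groups: an assumption, since a real
# aggregate would depend on how many patients fall in each group.
aggregate = sum(reported.values()) / len(reported)

passes_in_aggregate = aggregate >= threshold
failing = failing_groups(reported, threshold)
passes_per_group = not failing

print(round(aggregate, 1), passes_in_aggregate, passes_per_group, failing)
```

Under these assumptions the aggregate figure clears the bar while the V-VI group does not, which is the situation a per-group requirement is meant to catch.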