June 10, 2026

Experienced dermatologists outperform AI in real-world skin cancer diagnosis

Author(s): Rose McNulty

Key Takeaways

  • Multiclass accuracy rose with dermoscopy experience, peaking at 74.2% for >10-year experts, versus 72.2% for PanDerm image-only and 56.7% for a first-generation CNN.
  • Binary classification favored AI, with a balanced accuracy of 0.82 driven by specificity (94% to 97% of benign lesions correctly cleared), while clinicians traded specificity for sensitivity to avoid missed malignancies.

Artificial intelligence models matched mid-career dermatologists but trailed seasoned experts in diagnosing skin lesions across real-world cases.

Artificial intelligence (AI) has repeatedly been shown to match or beat dermatologists at reading skin lesions, but mostly under tightly controlled conditions. A new study published in JAMA Dermatology tested how that performance holds up against the broader mix of cases seen in the clinic and found that seasoned specialists still came out ahead.

“AI systems demonstrate strong potential as diagnostic support tools, particularly for early-career clinicians,” the authors wrote. “Despite overconfident mainstream narratives about achieving clinical excellence through AI-based technology alone, it remains crucial to continue training primary care physicians to recognize skin lesions, especially cancers, and to continue educating dermatologists toward expertise. This training is important not only in dermoscopy but also in the use of AI, including how it works and its limitations.”

The researchers, led by Julien Anriot, M.D., and Luc Thomas, M.D., Ph.D., of Claude Bernard University Lyon 1 in France, compared three AI systems against 652 physicians whose experience ranged from less than 1 year to more than 10 years. Drawing on 1,117 standardized cases that paired clinical and dermoscopic images with patient history and demographics, the team ran 1,092 human test iterations and intentionally included rare and atypical tumors that often challenge clinicians.

On the primary measure, accuracy across nine diagnostic categories, experts with more than 10 years of experience led at 74.2%. The strongest AI tool, the image-only version of the foundation model PanDerm, reached 72.2%. That was enough to outperform dermatologists with less than 1 year of experience (59.1%) and statistically match clinicians with 3 to 10 years of experience. A first-generation convolutional neural network had the lowest accuracy at 56.7%, trailing every group of human readers.

For the simpler benign-versus-malignant question, the image-only model had the highest balanced accuracy at 0.82, compared with 0.65 for humans, driven largely by specificity. The model correctly cleared benign lesions 94% of the time, and a version that also incorporated clinical photos and metadata hit 97%. Human readers were more cautious, accepting lower specificity to avoid missing cancers, and the most experienced clinicians retained the best sensitivity.
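
For readers unfamiliar with the metric, balanced accuracy is conventionally defined as the average of sensitivity and specificity, which is why a model that rarely flags benign lesions can score well even if it misses more cancers. The short Python sketch below illustrates that arithmetic under that standard definition. The function name and all counts are hypothetical and chosen only for illustration; they are not taken from the study, and the study's own computation may differ in detail.

```python
def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity for a benign-vs-malignant task.

    tp/fn: malignant lesions correctly flagged / missed
    tn/fp: benign lesions correctly cleared / wrongly flagged
    """
    sensitivity = tp / (tp + fn)  # share of malignant lesions caught
    specificity = tn / (tn + fp)  # share of benign lesions cleared
    return (sensitivity + specificity) / 2


# Hypothetical counts (NOT study data), 100 malignant and 100 benign lesions.
# A specificity-leaning reader: clears most benign lesions, misses more cancers.
model_like = balanced_accuracy(tp=60, fn=40, tn=95, fp=5)

# A sensitivity-leaning reader: catches most cancers, over-flags benign lesions.
reader_like = balanced_accuracy(tp=90, fn=10, tn=50, fp=50)

print(f"specificity-leaning: {model_like:.3f}")
print(f"sensitivity-leaning: {reader_like:.3f}")
```

In this made-up example the specificity-leaning reader ends up with the higher balanced accuracy despite missing four times as many malignant lesions, which is the kind of trade-off the binary results describe.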

“The study confirmed the expected association between dermoscopy experience and diagnostic performance and quantified the gap between training levels,” the authors explained. “Thus, performance depended on the metric considered: the unimodal model achieved the highest binary balanced accuracy, largely associated with higher specificity, whereas the most experienced readers retained the highest multiclass diagnostic accuracy and the best sensitivity.”

One result surprised the investigators: adding clinical context made PanDerm less accurate instead of more accurate. The multimodal version scored 66.3% on multiclass accuracy, below the image-only configuration.

“Unlike human readers who benefit from clinical context, the AI system did not gain accuracy from additional data,” the authors wrote. “A likely explanation is a distribution shift between the close-up clinical images used for the unimodal and multimodal model training and the more distant, complex clinical images presented in the test set. Among the malignant lesions missed by both configurations, an apparent preponderance of acral localizations was noted. This could reflect underrepresentation of acral melanoma in publicly available training datasets.”

The authors caution that the results have limits. Readers were predominantly French, the patient population was largely of European origin, and darker skin phototypes were underrepresented. This generalizability concern mirrors broader questions about how AI performs across diverse populations.

That the multimodal model fell short reinforces a key point: AI progress hinges on how well data is integrated, not simply how much of it there is.

“The future likely lies in collaboration between humans and machines to optimize diagnostic performance. For novice practitioners, AI could serve as a safety net and educational tool. For experts, it could provide an efficient triage modality and a systematic second reading, particularly useful for reducing errors caused by fatigue or inattention.”
