Grand Rounds: Weekly Evidence Brief

Radiology

Edition

30-Second Takeaway

  • Commercial radiology AI validations rarely report subgroup performance, limiting bias assessment.
  • Baseline LLMs (GPT‑4.1, Llama 3.3) miss many clinically important report errors in zero-shot use.
  • MRI/US AI models predict axillary nodal metastasis with moderate accuracy and should not replace biopsy.

Week ending June 20, 2026

Selected recent evidence on AI, equity, LLM QA, and radiomics performance in radiology

Commercial radiology AI validations rarely report subgroup performance.

EUROPEAN RADIOLOGY · Jun 13, 2026

This scoping review of validations for 252 commercial radiology products found only 77 studies reporting per-subgroup performance. Overall, 392 of 545 validation studies reported demographic data, but most did not present performance by sex, age, or race/ethnicity. Reporting varied by modality and body region, with chest and bone radiograph studies more likely to include subgroup results. The authors conclude that the current literature is inadequate to estimate subgroup bias for commercial radiology AI products.

GPT‑4.1 and Llama 3.3 fail to detect many clinically relevant report errors in zero‑shot testing.

EUROPEAN RADIOLOGY · Jun 19, 2026

In 256 radiology reports and 1,024 error variants, GPT‑4.1 and Llama 3.3 70B showed detection rates that varied by error type and modality. Physiologically impossible errors were detected least often (GPT‑4.1 detection as low as 8.7% in MRI). Both models detected inappropriate recommendations best but missed many anatomical and physiologic errors. The study implies that baseline LLMs, without domain tuning, are unsafe for unsupervised QA of clinically consequential report errors.

MRI and ultrasound AI models show moderate accuracy for axillary node metastasis prediction.

EUROPEAN RADIOLOGY · Jun 19, 2026

A meta-analysis of 41 studies reported a pooled internal-validation sensitivity of 0.79, specificity of 0.78, and AUC of 0.84 for DL/HCR models. External validation showed an AUC of 0.82, with LR+ ~3.0 and LR- ~0.33, indicating limited standalone utility. Ensemble DL+HCR models and models including peritumoral features achieved higher AUCs than single-method approaches. The authors recommend that these models be used for adjunctive risk stratification, not as a replacement for histopathology, pending standardization and external validation.
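To see why likelihood ratios of about 3.0 and 0.33 suggest limited standalone utility, the short Python sketch below converts a pre-test probability into a post-test probability using the standard odds form of Bayes' rule. The likelihood ratios are the external-validation figures quoted above. The 30% pre-test probability and the function name are illustrative assumptions, not values from the meta-analysis.

```python
def post_test_probability(pre_test_prob: float, likelihood_ratio: float) -> float:
    """Convert a pre-test probability to a post-test probability via odds."""
    if not 0.0 < pre_test_prob < 1.0:
        raise ValueError("pre_test_prob must be strictly between 0 and 1")
    if likelihood_ratio < 0.0:
        raise ValueError("likelihood_ratio must be non-negative")
    pre_odds = pre_test_prob / (1.0 - pre_test_prob)   # probability -> odds
    post_odds = pre_odds * likelihood_ratio            # apply the likelihood ratio
    return post_odds / (1.0 + post_odds)               # odds -> probability


# Approximate external-validation likelihood ratios quoted in the brief.
LR_POSITIVE = 3.0
LR_NEGATIVE = 0.33

# Hypothetical pre-test probability of nodal metastasis (an assumption
# chosen for illustration, not a figure from the meta-analysis).
PRE_TEST = 0.30

after_positive = post_test_probability(PRE_TEST, LR_POSITIVE)
after_negative = post_test_probability(PRE_TEST, LR_NEGATIVE)

print(f"Pre-test probability:          {PRE_TEST:.0%}")
print(f"After a positive model result: {after_positive:.0%}")
print(f"After a negative model result: {after_negative:.0%}")
```

Under that assumed 30% pre-test probability, a positive model result raises the estimate to about 56% and a negative result lowers it to about 12%. Neither confirms nor excludes nodal disease, which is consistent with the adjunctive role the authors recommend.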

Edition context

Clinical signal

  • Ask vendors for per-subgroup performance before clinical adoption.
  • Do not rely on out-of-the-box LLMs for unattended radiology QA; validate locally first.
  • Use AI axillary lymph node (ALN) prediction as adjunctive risk stratification alongside biopsy.