Commercial radiology AI validations rarely report subgroup performance.

EUROPEAN RADIOLOGY | Jun 13, 2026

This scoping review of validation studies for 252 commercial radiology AI products found that only 77 studies reported per-subgroup performance. Overall, 392 of 545 validation studies reported demographic data, but most did not break performance down by sex, age, or race/ethnicity. Reporting varied by modality and body region, with chest and bone radiograph studies more likely to include subgroup results. The authors conclude that the current literature is inadequate for estimating subgroup bias in commercial radiology AI products.

GPT‑4.1 and Llama 3.3 fail to detect many clinically relevant report errors in zero‑shot testing.

EUROPEAN RADIOLOGY | Jun 19, 2026

Across 256 radiology reports and 1,024 error variants, detection by GPT‑4.1 and Llama 3.3 70B depended on both error type and modality. Physiologically impossible errors were detected least often (GPT‑4.1 detection as low as 8.7% in MRI). Both models detected inappropriate recommendations best but missed many anatomical and physiologic errors. The study implies that baseline LLMs are unsafe for unsupervised QA of clinically consequential report errors without domain tuning.

MRI and ultrasound AI models show moderate accuracy for axillary node metastasis prediction.

EUROPEAN RADIOLOGY | Jun 19, 2026

A meta-analysis of 41 studies reported pooled internal-validation sensitivity of 0.79, specificity of 0.78, and AUC of 0.84 for deep learning (DL) and handcrafted radiomics (HCR) models. External validation showed an AUC of 0.82, with a positive likelihood ratio (LR+) of about 3.0 and a negative likelihood ratio (LR-) of about 0.33, indicating limited standalone utility. Ensemble DL+HCR models and models including peritumoral features achieved higher AUCs than single-method approaches. The authors recommend using these models for adjunctive risk stratification, not as a replacement for histopathology, pending standardization and external validation.
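To make the "limited standalone utility" reading concrete, the sketch below converts a pretest probability into a post-test probability using the reported external-validation likelihood ratios (about 3.0 and 0.33). The 30% pretest probability is a hypothetical value chosen only for illustration and does not come from the meta-analysis. The helper names are also illustrative. Likelihood ratios derived from pooled sensitivity and specificity are only an approximation of properly pooled ratios.

```python
def post_test_probability(pretest: float, likelihood_ratio: float) -> float:
    """Update a pretest probability with a likelihood ratio (odds form of Bayes' rule)."""
    if not 0.0 < pretest < 1.0:
        raise ValueError("pretest must be strictly between 0 and 1")
    pretest_odds = pretest / (1.0 - pretest)
    post_odds = pretest_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


def likelihood_ratios(sensitivity: float, specificity: float) -> tuple[float, float]:
    """Return (LR+, LR-) computed from sensitivity and specificity."""
    lr_positive = sensitivity / (1.0 - specificity)
    lr_negative = (1.0 - sensitivity) / specificity
    return lr_positive, lr_negative


# Hypothetical pretest probability of nodal metastasis (an assumption, not from the review).
ASSUMED_PRETEST = 0.30

if __name__ == "__main__":
    # External-validation likelihood ratios as reported in the summary.
    after_positive = post_test_probability(ASSUMED_PRETEST, 3.0)
    after_negative = post_test_probability(ASSUMED_PRETEST, 0.33)
    print(f"Post-test probability after a positive result: {after_positive:.3f}")
    print(f"Post-test probability after a negative result: {after_negative:.3f}")

    # Approximate ratios implied by the pooled internal-validation sensitivity and specificity.
    lr_pos, lr_neg = likelihood_ratios(0.79, 0.78)
    print(f"Internal validation: LR+ ~ {lr_pos:.2f}, LR- ~ {lr_neg:.2f}")
```

Under the assumed 30% pretest probability, a positive model output raises the probability to roughly 56% and a negative output lowers it to roughly 12%. Neither result rules nodal disease in or out, which is consistent with the authors' recommendation to use these models adjunctively.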

Radiology

30-Second Takeaway

Selected recent evidence on AI, equity, LLM QA, and radiomics performance in radiology

