This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Informatics, is properly cited. The complete bibliographic information, a link to the original publication on https://medinform.jmir.org/, as well as this copyright and license information must be included.

Deep neural networks are showing impressive results in different medical image classification tasks. However, for real-world applications, there is a need to estimate the network’s uncertainty together with its prediction.

In this review, we investigate in what form uncertainty estimation has been applied to the task of medical image classification. We also investigate which metrics are used to describe the effectiveness of the applied uncertainty estimation.

Google Scholar, PubMed, IEEE Xplore, and ScienceDirect were screened for peer-reviewed studies, published between 2016 and 2021, that deal with uncertainty estimation in medical image classification. The search terms “uncertainty,” “uncertainty estimation,” “network calibration,” and “out-of-distribution detection” were used in combination with the terms “medical images,” “medical image analysis,” and “medical image classification.”

A total of 22 papers were chosen for detailed analysis through the systematic review process. This paper provides a table for a systematic comparison of the included works with respect to the applied method for estimating the uncertainty.

The applied methods for estimating uncertainties are diverse, but the sampling-based methods Monte Carlo dropout and Deep Ensembles are used most frequently. We conclude that future work could investigate the benefits of uncertainty estimation in collaborative settings of artificial intelligence systems and human experts.

RR2-10.2196/11936

Digital image analysis is a helpful tool to support physicians in their clinical decision-making. Originally, digital image analysis was performed by extracting handcrafted features from an input image. These features are tailored to the underlying data, meaning that for a specific disease, only specific features are searched for in the observed image. With the advent of deep learning, however, a “black box” has been established that can, in the setting of supervised learning, intrinsically learn such features from labeled data. In recent years, deep learning–based methods have vastly outperformed traditional methods that rely on handcrafted features. With the learning-based methods, the focus has shifted from manually defining image features to providing clean and correctly annotated data to the learning system. This data-centric approach, however, brings new challenges.

In a clinical setting, when such algorithms are meant to be used as diagnostic assistance tools, the user has to be able to understand how the artificial intelligence (AI) system came up with the diagnosis. One key component in this regard is a measure of confidence of the AI system in its prediction. Such a measure is important to increase trust in the AI system, and it may improve clinical decision-making [

In the results section, we categorize the reviewed works by the uncertainty estimation method they apply. We provide a table that serves as an overview of all the included studies. In the last section, we discuss the most frequently used metrics for evaluating the benefit of uncertainty estimation and give an outlook of possible future research directions with a focus on human-machine collaboration.

In a classification task, the neural network is supposed to predict how likely it is for a given input

Written as a formula, the predictive distribution is as follows:

The predictive distribution given input

Depending on the modeled uncertainty, the predictive uncertainty can be divided into aleatoric uncertainty and epistemic uncertainty. The aleatoric uncertainty describes the uncertainty inherent in the data, whereas the epistemic uncertainty captures the uncertainty of the model. The softmax output of a typical classification network is only able to capture aleatoric uncertainty [
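One common way to make this split concrete, used widely in the Bayesian deep learning literature (a sketch in generic notation, not the exact formulation of any single reviewed work), writes the predictive distribution as a marginal over the network weights w and decomposes its entropy:

```latex
p(y \mid x, \mathcal{D}) = \int p(y \mid x, w)\, p(w \mid \mathcal{D})\, \mathrm{d}w
```

```latex
\underbrace{\mathbb{H}\!\left[p(y \mid x, \mathcal{D})\right]}_{\text{total predictive uncertainty}}
\;=\;
\underbrace{\mathbb{E}_{p(w \mid \mathcal{D})}\!\left[\mathbb{H}\!\left[p(y \mid x, w)\right]\right]}_{\text{aleatoric}}
\;+\;
\underbrace{\mathbb{I}\!\left[y;\, w \mid x, \mathcal{D}\right]}_{\text{epistemic}}
```

Sampling-based methods approximate the intractable integral with Monte Carlo samples of the weights w.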

Ovadia et al [

Sampling-based methods are easy to implement as they make use of existing network architectures. The 2 most popular methods are Monte Carlo dropout (MCDO) [
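As a minimal illustration of the MCDO idea (a NumPy sketch; the toy 2-layer network, its sizes, and the dropout rate of 0.5 are illustrative assumptions, not taken from any reviewed study), dropout is kept active at test time and each stochastic forward pass is treated as one sample from the predictive distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-layer classifier with dropout on the hidden layer (4 inputs, 3 classes).
W1 = rng.normal(size=(4, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 3)); b2 = np.zeros(3)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x, p_drop=0.5):
    h = np.maximum(x @ W1 + b1, 0.0)
    # MCDO: dropout stays ON at inference time.
    mask = rng.random(h.shape) > p_drop
    h = h * mask / (1.0 - p_drop)
    return softmax(h @ W2 + b2)

def mc_dropout_predict(x, T=100):
    samples = np.stack([forward(x) for _ in range(T)])  # (T, n_classes)
    mean = samples.mean(axis=0)
    predictive_entropy = -(mean * np.log(mean + 1e-12)).sum()
    predictive_variance = samples.var(axis=0)
    return mean, predictive_entropy, predictive_variance

x = rng.normal(size=4)
mean, entropy, var = mc_dropout_predict(x)
```

The predictive mean is the class probability estimate; predictive entropy and per-class variance are the two uncertainty measures most often reported by the reviewed works.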

The field of directly modifying the network architecture for improved uncertainty estimation is quite diverse. In the derivation of MCDO, the authors compare their approach to Gaussian processes (GPs). A GP is a method to estimate a distribution over functions [

Approaches that have been included in the comparison by Ovadia et al [

Comparable to sampling multiple models, one can also compute a distribution of predictions by running the network on different augmentations of the input data. Ayhan and Berens [
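The TTA idea can be sketched as follows (the `model` and `augment` functions below are synthetic stand-ins; on real images, the augmentations would be flips, rotations, or crops, and the model a trained classifier):

```python
import numpy as np

rng = np.random.default_rng(1)

def model(x):
    """Stand-in for a trained classifier: deterministic logits from the input."""
    logits = np.array([x.mean(), x.std(), x.max()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

def augment(x):
    """One random augmentation (noise and a shift stand in for image flips/rotations)."""
    x = x + rng.normal(scale=0.05, size=x.shape)
    return np.roll(x, rng.integers(0, len(x)))

def tta_predict(x, n_aug=32):
    # A distribution of predictions from augmented copies of the same input.
    preds = np.stack([model(augment(x)) for _ in range(n_aug)])
    return preds.mean(axis=0), preds.var(axis=0)

x = rng.normal(size=8)
mean, var = tta_predict(x)
```

As with sampling over models, the spread of the predictions serves as the uncertainty estimate; the model itself stays fixed.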

For the systematic review, we searched through Google Scholar, PubMed, IEEE Xplore, and ScienceDirect to identify relevant works that apply uncertainty estimation methods to medical image classification. We limited our search to works that have appeared between January 2016 and October 2021. As search terms, we used “uncertainty,” “uncertainty estimation,” “network calibration,” and “out-of-distribution detection,” and we combined them with the terms “medical images,” “medical image analysis,” and “medical image classification.”

The selection process was conducted according to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines [

PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram.

Number of publications that apply the respective uncertainty estimation method. EDL: evidential deep learning; GP: Gaussian process; MCDO: Monte Carlo dropout; SVI: stochastic variational inference; TS: temperature scaling; TTA: test-time data augmentation.

Most of the included works evaluate the applied methods by computing an uncertainty measure (mostly predictive variance or predictive entropy). This uncertainty measure is often used to generate retained data versus accuracy evaluations.

Retained data versus accuracy plot from Filos et al [
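Such a curve can be computed by sorting the test samples by their uncertainty and evaluating accuracy on the most certain fraction; a sketch with synthetic correctness labels and uncertainty scores (all values are made up for illustration):

```python
import numpy as np

def retained_accuracy_curve(correct, uncertainty, fractions):
    """Accuracy on the `fraction` most certain samples, for each fraction."""
    order = np.argsort(uncertainty)            # most certain first
    correct_sorted = np.asarray(correct, dtype=float)[order]
    n = len(correct_sorted)
    accs = []
    for f in fractions:
        k = max(1, int(round(f * n)))
        accs.append(correct_sorted[:k].mean())
    return np.array(accs)

# Synthetic example: errors tend to come with higher uncertainty.
rng = np.random.default_rng(2)
correct = rng.random(1000) > 0.2                    # ~80% accurate overall
uncertainty = rng.random(1000) + 0.5 * (~correct)   # errors get extra uncertainty
fractions = [0.5, 0.8, 1.0]
curve = retained_accuracy_curve(correct, uncertainty, fractions)
```

If the uncertainty measure is informative, accuracy decreases as more (less certain) data are retained, with the rightmost point equal to the overall accuracy.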

Some included works focus on network calibration and try to decrease the expected calibration error (ECE) within their experiments. Some other works use the computed uncertainty measure to detect out-of-distribution (OOD) samples.
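OOD detection with an uncertainty measure is typically evaluated with AUROC, treating the uncertainty score (eg, predictive entropy) as the detector; a sketch with synthetic entropy values (the two distributions are made up):

```python
import numpy as np

def auroc(ood_scores, id_scores):
    """Rank-based AUROC: probability that an OOD sample scores higher than an
    in-distribution sample, counting ties as one half."""
    ood = np.asarray(ood_scores)[:, None]
    iid = np.asarray(id_scores)[None, :]
    return (ood > iid).mean() + 0.5 * (ood == iid).mean()

rng = np.random.default_rng(5)
id_entropy = rng.normal(0.3, 0.1, size=500).clip(0)   # in-distribution: low entropy
ood_entropy = rng.normal(0.9, 0.2, size=500).clip(0)  # OOD: high entropy
score = auroc(ood_entropy, id_entropy)
```

An AUROC near 1 means the uncertainty measure separates OOD from in-distribution samples well; 0.5 means it is uninformative.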

Overview of the selected studies.

Methods | Organs or sickness | Sensor | Network architecture | Reported metrics | Data access | Code available | Reference |

MCDO^{a}, GP^{b} | Diabetic retinopathy from fundus images | Camera | Custom CNNs^{c} | Retained data or accuracy, uncertainty or density | Public (Kaggle competition) | Yes | Leibig et al [ |

MCDO, SVI^{d} | Retina | Optical coherence tomography | ResNet-18 | Predictive variance | Public | Yes | Laves et al [ |

MCDO | Skin cancer | Camera | VGG-16, ResNet-50, DenseNet-169 | Uncertainty or density, retained data or accuracy, uncertainty, confusion matrix | Public | Yes | Mobiny et al [ |

MCDO | Brain | MRI^{e} | Modified VGGNet | Reliability diagrams, AUROC^{f} | Private | Yes | Herzog et al [ |

MCDO | Breast cancer | Mammography | VGG-19 | Uncertainty, confusion matrix | Public | No | Calderón-Ramírez et al [ |

MCDO, DUQ^{g} | COVID-19 | X-ray | WideResNet | Jensen-Shannon divergence | Public | No | Calderón-Ramírez et al [ |

MCDO, Ensembles, MFVI^{h} | Diabetic retinopathy from fundus images | Camera | VGG variants | Retained data or accuracy, retained data or AUROC, ROC^{i} | Public (Kaggle competition) | Yes | Filos et al [ |

MCDO, Ensembles, M-heads | Histopathological slides | Microscope | DenseNet | Retained data or AUROC | Public | No | Linmans et al [ |

MCDO, Ensembles, Mix-up | Histopathological slides | Microscope | ResNet-50 | ECE^{j}, AUROC, AUPRC^{k} | Private | No | Thagaard et al [ |

MCDO, Ensembles | COVID-19, histopathological slides (breast cancer) | CT^{l}, microscope | ResNet-152-V2, Inception-V3, Inception-ResNet-V2 | Predictive entropy, retained data or accuracy | Public | No | Yang and Fevens [ |

MCDO, Ensembles, TWD^{m} | Skin cancer | Camera | ResNet-152, Inception-ResNet-V2, DenseNet-201, MobileNet-V2 | Entropy, AUROC | Public (Kaggle competition, ISIC data set) | No | Abdar et al [ |

MCDO, Ensembles, others | Lung | X-ray | WideResNet | AUROC, AUPRC | Public | No | Berger et al [ |

GP | Diabetic retinopathy from fundus images | Camera | Inception-V3 | AUROC | Public (Kaggle competition) | Yes | Toledo-Cortés et al [ |

EDL^{n} + Ensembles | Chest | X-ray | DenseNet-121 | AUROC | Public | No | Ghesu et al [ |

EDL + MCDO | Breast cancer | Mammography | VGGNet | AUROC | Public + private | No | Tardy et al [ |

EDL | Chest, abdomen, and brain | X-ray, ultrasound, MRI | DenseNet-121 | AUROC, coverage or F1 score, coverage or AUROC | Public | No | Ghesu et al [ |

TS^{o}, MCDO | Polyp | Colonoscopy (camera) | ResNet-101, DenseNet-121 | ECE, predictive entropy, predictive variance | Public + private | No | Carneiro et al [ |

TS, DCA^{p} | Head CT, mammography, chest x-ray, histology | Multimodal | AlexNet, | ECE | Public | No | Liang et al [ |

TTA^{q} | Diabetic retinopathy from fundus images | Camera | ResNet-50 | Uncertainty or density, retained data or AUROC | Public (Kaggle competition) | Yes | Ayhan and Berens [ |

TTA, MCBN^{r} | Skin cancer | Camera | ResNet-50 | ECE | Private (31,000 annotated images) | No | Jensen et al [ |

TTA + MCDO | Skin cancer | Camera | EfficientNet-B0 | Predictive entropy, predictive variance, Bhattacharyya coefficient, retained data or accuracy | Public (ISIC data set) | No | Combalia et al [ |

TTA, TS, Ensembles | Diabetic retinopathy from fundus images | Camera | Modified ResNet | Reliability diagrams, AECE^{s}, retained data or AUROC | Public (Kaggle competition) | Yes | Ayhan et al [ |

^{a}MCDO: Monte Carlo dropout.

^{b}GP: Gaussian process.

^{c}CNN: convolutional neural network.

^{d}SVI: stochastic variational inference.

^{e}MRI: magnetic resonance imaging.

^{f}AUROC: area under the receiver operating characteristic curve.

^{g}DUQ: deterministic uncertainty quantification.

^{h}MFVI: mean field variational inference.

^{i}ROC: receiver operating characteristic curve.

^{j}ECE: expected calibration error.

^{k}AUPRC: area under the precision-recall curve.

^{l}CT: computed tomography.

^{m}TWD: three-way decision theory.

^{n}EDL: evidential deep learning.

^{o}TS: temperature scaling.

^{p}DCA: difference between confidence and accuracy.

^{q}TTA: test-time data augmentation.

^{r}MCBN: Monte Carlo batch normalization.

^{s}AECE: adaptive expected calibration error.

The first work that we have included is the study by Leibig et al [

Laves et al [

Mobiny et al [

Another work by Herzog et al [

In two other published works, Calderón-Ramírez et al [

Another set of studies compared MCDO to Deep Ensembles (hereafter simply denoted as Ensembles) and partly to other methods. Filos et al [

Linmans et al [

Thagaard et al [

In another work, Yang and Fevens [

Abdar et al [

In another work, Berger et al [

After having covered several works that focus on sampling-based uncertainty estimation methods, we now turn to works that operate directly on the network’s classification output to estimate uncertainties. One example is the work by Toledo-Cortés et al [

A set of other works applies EDL to estimate uncertainties. In their first work, Ghesu et al [

Comparably, Tardy et al [

Two works that we have included apply TS to medical image classification tasks. Carneiro et al [
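TS divides the logits by a single scalar temperature T fitted on a held-out validation set; a minimal sketch (the toy logits and labels are synthetic, and a grid search stands in for the LBFGS fit that is typically used):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Negative log-likelihood of the labels under temperature-scaled softmax."""
    probs = softmax(logits, T)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    """Pick the T minimizing validation NLL (grid search stand-in for LBFGS)."""
    return grid[np.argmin([nll(logits, labels, T) for T in grid])]

# Toy overconfident model: large margin on the predicted class,
# but some labels are corrupted, so the confidence is not warranted.
rng = np.random.default_rng(3)
labels = rng.integers(0, 3, size=500)
logits = rng.normal(size=(500, 3))
logits[np.arange(500), labels] += 4.0
flip = rng.random(500) < 0.2
labels[flip] = rng.integers(0, 3, size=flip.sum())

T_opt = fit_temperature(logits, labels)
```

For an overconfident network, the fitted temperature exceeds 1, softening the output probabilities; note that TS changes only confidence values, never the predicted class.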

Liang et al [

The concept of TTA was introduced by Ayhan and Berens [

Another work by Jensen et al [

Combalia et al [

In a follow-up of their original work, Ayhan et al [

Through the reviewed publications, we gained an overview of which methods for uncertainty estimation are most frequently used in the field of medical image classification. We found that the sampling-based methods MCDO and Deep Ensembles are the most frequently applied methods. With the sampling-based approaches, it is possible to compute a distribution of predictions and from there determine an uncertainty measure, usually either in the form of predictive entropy or predictive variance. These measures help to identify samples where the neural network is uncertain about its predictions.

In addition to the sampling-based uncertainty evaluations, we also observed evaluations that analyze the calibration of the neural network. Calibration evaluations in terms of reliability diagrams and ECE determine whether the neural network’s output probabilities represent the actual likelihood of a prediction being correct. In the original paper on neural network calibration [
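ECE bins predictions by confidence and takes the sample-weighted average of the |accuracy − confidence| gap per bin; a sketch with 15 equal-width bins (a common choice) and synthetic, well-calibrated predictions:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Weighted mean of |accuracy - confidence| over equal-width confidence bins."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Perfectly calibrated synthetic predictions: a prediction with confidence c
# is correct with probability c, so ECE should be close to 0.
rng = np.random.default_rng(4)
conf = rng.uniform(0.5, 1.0, size=20000)
correct = rng.random(20000) < conf
ece = expected_calibration_error(conf, correct)
```

A miscalibrated (eg, overconfident) model yields a large positive ECE; reliability diagrams plot the same per-bin accuracies against their confidences.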

Another observation we made is that combining uncertainty estimation methods can improve the results. This holds for combinations of Ensembles and MCDO [

By presenting retained data versus accuracy curves, several works [

artificial intelligence

area under the precision-recall curve

area under the receiver operating characteristic curve

convolutional neural network

difference between confidence and accuracy

expected calibration error

evidential deep learning

Gaussian process

Monte Carlo dropout

mean field variational inference

magnetic resonance imaging

out-of-distribution

Preferred Reporting Items for Systematic Reviews and Meta-Analyses

stochastic variational inference

temperature scaling

test-time data augmentation

The research is funded by the Ministerium für Soziales und Integration Baden-Württemberg, Germany.

AK, AH, and TJB are responsible for concept and design. AK and KH did the study selection. HM, EKH, JNK, SF, and CvK critically revised the manuscript and provided valuable feedback.

TJB is the owner of Smart Health Heidelberg GmbH (Handschuhsheimer Landstr. 9/1, 69120 Heidelberg, Germany; https://smarthealth.de), which develops telemedicine mobile apps (such as AppDoc, https://online-hautarzt.net, and Intimarzt, https://intimarzt.de), outside of the submitted work.