Article

Label-Free Model Evaluation with Out-of-Distribution Detection

School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(8), 5056; https://doi.org/10.3390/app13085056
Submission received: 19 February 2023 / Revised: 13 April 2023 / Accepted: 13 April 2023 / Published: 18 April 2023
(This article belongs to the Special Issue Advanced Artificial Intelligence Theories and Applications)

Abstract

In recent years, label-free model evaluation has been developed to estimate the performance of models on unlabeled test sets. However, we find that existing methods perform poorly in environments with out-of-distribution (OOD) data. To address this issue, we propose a novel automatic model evaluation method using OOD detection to reduce the impact of OOD data on model evaluation. Specifically, we use the representation of datasets to train a neural network for accuracy prediction and employ energy-based OOD detection to exclude OOD data during testing. We conducted experiments on several benchmark datasets with varying amounts of OOD data (SVHN, ISUN, ImageNet, and LSUN) and demonstrated that our method reduces the RMSE compared to existing methods by at least 1.27%. Additionally, we tested our method on transformed datasets and datasets with a high proportion of OOD data, and the results show its robustness.

1. Introduction

In the field of deep learning, evaluating model performance is crucial. Models tend to perform better on their training sets than in the deployment environment, which makes it difficult to anticipate deployment performance. Accurately evaluating a model in a deployment environment requires sampling a large amount of unlabeled test data from that environment, for two reasons: first, a large number of samples is needed to estimate model accuracy reliably; second, the model may be deployed in multiple environments, each of which requires its own data collection. Manually annotating all of these data is prohibitively expensive and unrealistic. It is therefore necessary to evaluate a model on unlabeled data from the deployment environment in a cost-effective and efficient manner before deployment. To address this issue, researchers have attempted to predict model performance on unlabeled data using methods based on prediction scores [1,2]. However, the performance of these methods is highly dependent on the choice of threshold, which makes them unstable. In recent years, many researchers have instead used dataset features to predict the accuracy of models on unlabeled datasets [3,4,5,6]. For example, Ref. [3] proposed that the distribution difference between the test set and the training set can be used to predict the accuracy of a classifier on the test set. Ref. [6] explored the relationship between pretext tasks and the original classification task and estimated classifier accuracy using rotation prediction. However, we note that existing methods do not consider that a semantic shift may occur between the test and training sets.
Since existing label-free model evaluation methods rely on the closed-world assumption, i.e., that both the test set and the training set are drawn from the in-distribution (ID), they may be unable to handle the uncertainty introduced by a semantic shift in the test data [7]. In open-world environments, however, the inputs to the model may be out-of-distribution (OOD) [8,9]. For example, a model designed to classify cats may be deployed in an environment containing images of dogs. Ref. [10] has demonstrated that the automatic model evaluation (AutoEval) approach, which depends on statistical information about the dataset [3,4], fails on OOD data. Methods that use pretext tasks, such as predicting rotation [6], tend to overestimate model performance on OOD data because of the information learned from the ID data. For classification-based methods [11], the inclusion of OOD data complicates the selection of threshold values. In short, OOD data in the deployment environment significantly degrade the accuracy of existing label-free model evaluation methods on unlabeled test sets.
To address this problem, our method improves the performance of AutoEval by using OOD detection [12,13], which identifies inputs that lie outside the training data distribution. We use OOD detection to reject OOD data in the test set and feed only the ID data into the model accuracy prediction. Specifically, when computing the features of the images in a dataset, we also generate a mask for the dataset using energy-based OOD detection. The mask hides data recognized as OOD so that they do not affect AutoEval. We conducted experiments under different settings, and the results demonstrate the effectiveness and robustness of our method.
The main contributions of this paper include: (1) We propose the use of OOD detection to improve the performance of AutoEval in a deployment environment. We demonstrate the effectiveness of the method on datasets with OOD data. (2) We experimentally select OOD thresholds that are more suitable for predicting model accuracy and explore the relationship between OOD detection and AutoEval.

2. Related Works

Model generalization prediction. The goal of the model generalization prediction task [2,14,15,16] is to predict the difference in performance of a classifier between the training set and the test set. Ref. [17] explored how to estimate the test error of classifiers on datasets with randomly assigned labels or added Gaussian noise. Ref. [18] proposed a persistent topological metric that can be used to determine the error of DNNs on unlabeled datasets. Ref. [2] used unseen unlabeled data to predict distribution generalization. However, the above methods usually assume that there is no semantic shift between the training set and the test set. In addition, unlike the objective of these works, we focus on predicting the accuracy of the classifier on a specific dataset.
Out-of-distribution detection. OOD detection methods are generally used in safety-critical environments [19,20,21], such as autonomous driving. Refs. [19,22] proposed using softmax scores to identify OOD data. Ref. [20] used temperature-scaled scores to discriminate OOD data and added input perturbations to enlarge the gap between OOD and normal data, improving OOD detection accuracy. Refs. [12,13] proposed that energy scores can distinguish OOD data from ID data more accurately and fine-tuned the model to improve accuracy. In addition, distance-based methods [23,24] use the distance between a sample and the class centers, or sample similarity, as the criterion for deciding whether the sample is OOD. Ref. [25] proposed combining feature-based and classification-based methods to avoid information loss and further improve OOD detection accuracy. Our main goal in using OOD detection is to detect abnormal test samples, i.e., images that do not belong to the categories contained in the training set.
Data-centric AutoEval. The main purpose of data-centric AutoEval [5,10,11,26,27] is to predict the accuracy of a model on an unlabeled test set. Ref. [2] presented an unsupervised framework that uses only unlabeled data and mild assumptions to estimate classifier error rates. Ref. [6] demonstrated a strong linear relationship between classification accuracy and rotation prediction accuracy. Refs. [3,4] used the training set to generate meta-datasets and predicted model accuracy from dataset representations. Ref. [11] found a relationship between the difference of confidences (DoC) of a classifier’s predictions and classifier performance on unlabeled datasets. However, none of these methods consider the effect of OOD data on accuracy prediction. We combine the AutoEval method with OOD detection to minimize the impact of OOD data and improve AutoEval performance.
In addition, we note that Refs. [26,28,29] used OOD information in the presence of covariate shift to predict classifier accuracy. In contrast, our work aims to predict classifier accuracy on datasets containing OOD data with a semantic shift rather than a covariate shift.

3. Methods

In our approach, we use the training set to create a set of meta-datasets by applying image transformations. For each dataset, we compute its representation and its accuracy under the classifier, and we use this information to train a regression model that predicts classifier accuracy. During testing, we use energy-based out-of-distribution (OOD) detection to generate a mask for the test set, which allows the regression model to focus solely on the in-distribution (ID) data. The overall framework of our method is presented in Figure 1; each component is described in detail below.

3.1. Problem Formulation

Given a training set $D_{train} = \{(x_i, y_i)\}$, where $i \in [1, \ldots, M]$, $x_i$ is a training image, $y_i$ is its class label, and $M$ is the number of images in the training set. Classifier $f$ is trained on $D_{train}$. Our goal is to predict the accuracy of classifier $f$ on the unlabeled test set $D_{test} = \{x_j\}$, where $j \in [1, \ldots, N]$, $x_j$ is a test image, and $N$ is the number of images in the test set. Note that $D_{test}$ contains OOD data, whose labels belong to a label space not present in the training set $D_{train}$. The ID data in $D_{test}$ have labels consistent with the label space of $D_{train}$.

3.2. AutoEval

The accuracy of classifier $f$ on a labeled dataset $D = \{(x, y)\}$ can be expressed as $Acc(f, x, y)$. Since $D_{test}$ has no labels, we estimate the classifier accuracy on $D_{test}$ using AutoEval. Specifically, we use a regression model $g$ to learn the relationship between $Acc(f, x, y)$ and the dataset representation $h$:
$$g(h) = Acc(f, x, y)$$
where $Acc(f, x, y)$ is the accuracy of classifier $f$ on the labeled dataset $D$, and $h$ is the representation of the dataset. $h$ can be derived from the features $F$ of a layer of classifier $f$:
$$F = \{f(x)\}, \quad x \in X$$
To reduce the computational effort, we represent $D$ using only the mean and variance of all image features and the Fréchet distance (FD) [30] between dataset $D$ and the training set $D_{train}$:
$$h = \{F_{\mu}, F_{\delta}, \mathrm{FD}(D, D_{train})\}$$
$$\mathrm{FD}(D, D_{train}) = \lVert \mu_{train} - \mu \rVert_2^2 + \mathrm{Tr}\left(\Sigma_{train} + \Sigma - 2\left(\Sigma_{train}\Sigma\right)^{1/2}\right)$$
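As a concrete illustration, the sketch below computes such a representation from an $N \times d$ matrix of image features. It is only a minimal interpretation of the formulas above; the function name and the way the statistics are concatenated into $h$ are our own assumptions.

```python
import numpy as np
from scipy import linalg

def dataset_representation(feats, mu_train, sigma_train):
    """Summarize a dataset by the mean and variance of its image features
    plus the Frechet distance (FD) to the training-set statistics."""
    mu = feats.mean(axis=0)
    sigma = np.cov(feats, rowvar=False)
    # FD(D, D_train) = ||mu_train - mu||^2 + Tr(Sigma_train + Sigma - 2 (Sigma_train Sigma)^(1/2))
    covmean = linalg.sqrtm(sigma_train @ sigma).real
    fd = np.sum((mu_train - mu) ** 2) + np.trace(sigma_train + sigma - 2.0 * covmean)
    return np.concatenate([mu, feats.var(axis=0), [fd]])
```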
In order to train the regression network $g$, we need to collect a sufficient number of labeled datasets as the meta-dataset for AutoEval.

3.3. Generate Meta-Dataset

The datasets used to train $g$ need to satisfy two conditions: (1) each sampled dataset contains a sufficiently large number of images; (2) the variation between sampled datasets is sufficiently large and diverse. However, it is not realistic to obtain a large number of labeled datasets in a real environment.
Following the method of Ref. [3], we synthesize a meta-dataset from the training set by applying data augmentation to its images. We use a total of six image-transformation methods: adjusting contrast, adjusting brightness, randomly removing some colors, inverting colors, applying affine transformations, and adding sharpening effects; these transformations are applied at random, as shown in Figure 2. Since the transformations do not alter the subject of an image, the labels of the generated images remain unchanged. In this way, we obtain many large datasets by transforming the images in the training set. These datasets are all labeled, differ from the original dataset, and contain no semantic shift. We can therefore calculate the accuracy of $f$ on each synthesized dataset and compute its representation.
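A possible implementation of this sampling step using torchvision transforms is sketched below. The six transformation families follow the description above, but the specific parameter ranges and the random-composition strategy are our own assumptions, not the authors' settings.

```python
import random
from torchvision import transforms as T

# One builder per transformation family described above; magnitudes are illustrative.
TRANSFORM_POOL = [
    lambda m: T.ColorJitter(contrast=(m, m)),                                # adjust contrast
    lambda m: T.ColorJitter(brightness=(m, m)),                              # adjust brightness
    lambda m: T.RandomGrayscale(p=m),                                        # randomly remove colors
    lambda m: T.RandomInvert(p=m),                                           # invert colors
    lambda m: T.RandomAffine(degrees=15 * m, translate=(0.1 * m, 0.1 * m)),  # affine transformation
    lambda m: T.RandomAdjustSharpness(sharpness_factor=1 + 2 * m, p=1.0),    # sharpening
]

def sample_meta_transform():
    """Draw one random composition of the six transformations; applying it to
    every training image yields one synthetic dataset of the meta-dataset."""
    return T.Compose([build(random.uniform(0.1, 1.0)) for build in TRANSFORM_POOL])
```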

3.4. Energy-Based OOD Detection

Since unlabeled test sets collected in real-world environments may exhibit a semantic shift, the unlabeled test set may contain OOD images. Clearly, classifier $f$ always misclassifies OOD data, since their true labels lie outside its label space. However, the dataset representation does not reflect the effect of OOD data well, which can lead AutoEval to give an overly optimistic evaluation of the classifier’s performance on a dataset that includes OOD data. We mitigate the influence of OOD data through OOD detection: each input image is screened by the OOD detector, and the AutoEval method is then applied only to the ID data.
To ensure detection accuracy, we adopt energy-based OOD detection, which uses an energy function instead of the softmax function to identify OOD data [12]. The method computes an energy score from the discriminative model and uses it to decide whether a sample is OOD. The density function of the discriminative model can be expressed in terms of an energy function as:
$$p(x) = \frac{e^{-E(x;f)/T}}{\int_{x} e^{-E(x;f)/T}\,dx}$$
where $Z = \int_{x} e^{-E(x;f)/T}\,dx$ is the normalizing constant. Taking the logarithm of the above formula gives:
$$\log p(x) = -\frac{E(x;f)}{T} - \log Z$$
Thus, $-E(x; f)$ is linearly aligned with the log-likelihood. Low energy means high likelihood (ID), and high energy means low likelihood (OOD). We can therefore set a threshold $\tau$ to classify:
$$G(x; \tau, f) = \begin{cases} 0 \text{ (ID)}, & \text{if } E(x; f) < \tau \\ 1 \text{ (OOD)}, & \text{if } E(x; f) \geq \tau \end{cases}$$
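For concreteness, a minimal sketch of the energy score of Ref. [12] and of the resulting binary mask used in Figure 1 is given below; the function names and the PyTorch formulation are ours.

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """Energy score from [12]: E(x; f) = -T * logsumexp_k(f_k(x) / T)."""
    return -T * torch.logsumexp(logits / T, dim=1)

def id_mask(logits: torch.Tensor, tau: float, T: float = 1.0) -> torch.Tensor:
    """Binary mask over a batch: 1 for ID samples (E < tau), 0 for OOD samples."""
    return (energy_score(logits, T) < tau).float()
```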

3.5. Threshold Selection

OOD detection is essentially a binary classification task: the energy score of each sample is compared with a threshold to judge whether the image is from the ID. The choice of threshold is therefore critical.
Our method needs an appropriate threshold to achieve good results. If the threshold is too low, ID images with high energy scores are excluded; these excluded images are counted as misclassified, so the predicted accuracy falls below the ground truth. This error becomes more serious when the proportion of OOD data is low. If the threshold is too high, OOD detection cannot correctly exclude the OOD images. Although in this case our method is still more accurate than the previous methods, this is not the effect we expect.
Ideally, the ID data in the test set and the training data have the same distribution. To avoid manually selecting a threshold for each test set, we choose the threshold by referring to the energy score distribution of the training set. Since excluding too many images reduces prediction accuracy, we recommend using the 99.5th percentile of the training-set energy score distribution as the threshold. We have experimentally demonstrated that this threshold is applicable to different datasets. The effect of different thresholds on the accuracy of AutoEval is further investigated in Section 4.4.
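A small sketch of this threshold-selection rule, reusing the energy_score helper sketched in Section 3.4; the loader, model, and function names are placeholders.

```python
import numpy as np
import torch

@torch.no_grad()
def select_threshold(model, train_loader, percentile=99.5, T=1.0, device="cuda"):
    """Set tau to the 99.5th percentile of training-set energy scores, so that
    only about 0.5% of ID-like training images would be rejected."""
    scores = []
    for images, _ in train_loader:
        logits = model(images.to(device))
        scores.append(energy_score(logits, T).cpu())
    return float(np.percentile(torch.cat(scores).numpy(), percentile))
```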
We input the representation of $D_{test}$ and the mask into $g$; the output of $g$ is the predicted accuracy of classifier $f$ on the unlabeled dataset $D_{test}$. Since the classifier’s prediction on OOD data is always wrong, the accuracy of the classifier on the whole test set can be expressed as the fraction of ID data in the test set multiplied by the accuracy predicted by AutoEval on the ID data.
In addition, we find that not only do OOD data affect the accuracy of AutoEval, but the absence of an ID category (or very few samples in that category) also affects accuracy prediction. To address this issue, we add a few samples to each ID category in the test set, with the objective of avoiding the effect of missing categories on AutoEval while disturbing the test set distribution as little as possible.
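Putting the pieces together, the following end-to-end sketch combines the helpers sketched above (dataset_representation, id_mask). It is an illustrative reading of this section; all names and interfaces are assumptions.

```python
import torch

@torch.no_grad()
def predict_test_accuracy(regressor, test_feats, test_logits,
                          mu_train, sigma_train, tau, T=1.0):
    """Mask OOD samples, predict accuracy on the ID portion with the regression
    model g, and scale by the ID fraction (OOD samples count as errors)."""
    mask = id_mask(test_logits, tau, T).bool()          # 1 = ID, 0 = OOD (Section 3.4)
    id_fraction = mask.float().mean().item()
    h = dataset_representation(test_feats[mask].numpy(), mu_train, sigma_train)
    id_acc = regressor(torch.from_numpy(h).float()).item()  # predicted ID accuracy
    return id_fraction * id_acc
```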

4. Experiments

4.1. Dataset

Training set. We train the classifier on the training set of CIFAR-10 [31], which is a small dataset consisting of 50,000 training images and 10,000 test images for recognizing common objects. It contains RGB images of 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck), and the classes in CIFAR-10 are completely mutually exclusive. In addition, we use CIFAR-10 as the original dataset to generate 800 datasets as a meta-dataset for learning the regression model.
Test set. To evaluate the efficacy of our method, we combine ID data and OOD data from different datasets into test sets of 1000 images each, consisting of 900 ID images and 100 OOD images. This configuration simulates a deployment environment with 10% OOD data.
We choose CIFAR-10.1 [32] as the ID data. CIFAR-10.1 contains 2000 test images; it is a new test set for CIFAR-10 that minimizes the distribution shift relative to CIFAR-10. In addition, to verify the reliability of AutoEval on other natural datasets, we also select images from ImageNet [33] and Caltech-256 [34] as ID data. To ensure that the test set contains all the categories of CIFAR-10, we select images from these two datasets accordingly as the ID data of the test set.
We select images from ImageNet whose categories differ from those of CIFAR-10 (e.g., guns, chairs, books, screens) as OOD data. To demonstrate that our method works on a variety of OOD data, we also select images from the digits dataset SVHN [35] and the natural image datasets LSUN [36] and ISUN [37] as OOD data. The labels of the images in these datasets do not overlap with the labels of CIFAR-10. We select 100 images from each dataset as OOD data for the test sets.

4.2. Experimental Settings

We choose ResNet-44 [38] as the backbone and train a 10-way classifier on CIFAR-10. We use a five-layer fully connected network as the regression model $g$ to fit the relationship between the accuracy of classifier $f$ on a test set and the representation of that test set. For OOD detection, we follow the settings in Ref. [12] and use WideResNet to train the image classification model on CIFAR-10; this model is fine-tuned using the 80 Million Tiny Images dataset [39].
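For reference, a plausible form of such a regression model is sketched below; the hidden width and activation are our assumptions, as the text only specifies a five-layer fully connected network.

```python
import torch.nn as nn

class AccuracyRegressor(nn.Module):
    """Five-layer fully connected regressor g: dataset representation h -> accuracy."""
    def __init__(self, in_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, h):
        return self.net(h).squeeze(-1)
```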
We use the root mean squared error (RMSE) to measure the performance of classifier accuracy estimation methods. The RMSE is the square root of the mean squared difference between the predicted accuracy and the ground-truth accuracy; a smaller RMSE indicates a better estimate, and vice versa. All experiments were conducted on an RTX 2080 Ti GPU.
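For completeness, the metric in code form (a straightforward sketch):

```python
import numpy as np

def rmse(pred_acc, true_acc):
    """Root mean squared error (in %) between predicted and ground-truth accuracies."""
    pred_acc, true_acc = np.asarray(pred_acc), np.asarray(true_acc)
    return float(np.sqrt(np.mean((pred_acc - true_acc) ** 2)))
```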

4.3. Baseline

Prediction-based Method. This approach assumes that images with a higher maximum softmax output are more likely to be correctly classified. It therefore sets a threshold on the maximum softmax score: if an image’s maximum softmax output is above the threshold, the image is considered correctly classified. To ensure that the maximum softmax output is large enough, we choose τ = 0.8 and τ = 0.9.
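A minimal sketch of this baseline, assuming logits are the classifier outputs on the test set (the function name is ours):

```python
import torch
import torch.nn.functional as F

def pred_score_accuracy(logits: torch.Tensor, tau: float = 0.9) -> float:
    """Estimate accuracy as the fraction of images whose maximum softmax
    probability exceeds the threshold tau."""
    max_prob = F.softmax(logits, dim=1).max(dim=1).values
    return (max_prob > tau).float().mean().item()
```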
Rotation Prediction. The basic assumption of this method is that if the classifier’s convolutional layers understand the image correctly, they can also perform pretext tasks well. The method adds a rotation prediction task after the convolutional layers: the more accurate the rotation predictions are, the more accurate the classification is assumed to be.
AutoEval. This method predicts the accuracy of the classifier on the entire dataset using the dataset’s representation. To ensure prediction accuracy, we use a neural network regressor rather than linear regression. The only difference between this baseline and our method is whether OOD detection is used to filter the data.

4.4. Experimental Results

This paper compares three possible methods for estimating recognition accuracy: the prediction-based method, rotation prediction, and AutoEval. We report the estimates of these methods in Table 1. For the prediction-based method, two thresholds (i.e., τ = 0.8 and τ = 0.9) are used.
For comparison with the prediction-based method, we choose 0.8 and 0.9 as thresholds. The experimental results show that the error of the prediction-based method is large when τ = 0.8; as τ increases to 0.9, the error decreases, but the results are still unsatisfactory. In addition, we find that nearly 60% of the OOD images have a large maximum softmax score. These OOD images should be counted as incorrectly classified, but they are treated as correctly classified because of their large maximum softmax scores. Therefore, the results of the prediction-based method are not reliable on datasets containing OOD data. A different threshold might yield better results, but that would amount to guessing the accuracy by chance rather than understanding whether each image is correctly classified.
For comparison with other label-free model evaluation methods, we compare our method with existing work, namely rotation prediction and AutoEval without OOD detection. We conduct experiments using the published code and compare the methods on datasets containing OOD data. The experimental results show that our method achieves an average RMSE of 2.91% on the eight datasets, much smaller than AutoEval without the OOD strategy (RMSE = 4.61%) and rotation prediction (RMSE = 4.18%). This suggests that our method outperforms existing methods in real environments containing OOD data.
For the same OOD data, our method performs better on the four datasets with ImageNet as the ID data than on the four datasets with CIFAR-10.1 as the ID data. This is because the distribution shift between CIFAR-10.1 and CIFAR-10 is smaller; the OOD classification error rate is smaller on CIFAR-10.1 than on ImageNet.
We note that rotation prediction works better than AutoEval with the OOD strategy when the OOD data come from SVHN. SVHN differs significantly from the other OOD datasets in that it consists of digit images, whereas the other datasets contain natural images. Because the classifier was trained on a natural image dataset, the auxiliary rotation prediction task has low accuracy on SVHN. For the other OOD natural images, rotation prediction may maintain a certain level of accuracy using learned cues (e.g., an animal’s eyes are above its mouth).
Furthermore, we find that the error of our method comes from two sources: (1) misclassifications by the OOD detector; (2) the inherent error of the AutoEval method. The former can be reduced by selecting a better threshold, the latter by selecting a better dataset representation.
Effect of the OOD threshold. As mentioned earlier, the choice of the OOD detection threshold has a significant impact on the prediction results. We therefore analyzed the error of our method under different thresholds. Figure 3 shows the OOD detection error and the accuracy prediction error at different thresholds. The following experiments were conducted on datasets with a 9:1 ratio of ID data to OOD data.
The experimental results demonstrate that the impact of the OOD detection threshold on the accuracy prediction error has two main sources: (1) treating ID data as OOD data, which excludes ID data and directly increases the prediction error; (2) treating OOD data as ID data, which indirectly increases the prediction error by degrading the accuracy of AutoEval. Usually both errors exist simultaneously, but their relative impact varies with the threshold. When the threshold τ is set to 0.95, OOD detection directly marks nearly 5% of the ID data in the test set as classification errors (the actual impact is much larger), which results in a larger error for our method; at this threshold, the first type of error has a greater impact on the prediction error than the second. As the threshold increases, the first type of error decreases while the second type grows, so the error of our method first decreases and then increases, until AutoEval with OOD detection degenerates into the traditional method because the threshold is too large.
Different proportions of OOD data. To demonstrate the robustness of our method on datasets with large amounts of OOD data, we conducted experiments using CIFAR-10.1/ImageNet and ImageNet/SVHN as the ID/OOD data pairs, varying the proportion of OOD data between 20% and 40% while keeping all other parameters constant.
As shown in Figure 4, our method is almost unaffected by an increase in OOD data, while the error of traditional AutoEval without OOD detection increases significantly. The rotation prediction method is less affected by the increasing proportion of OOD data but still performs much worse than our method.
However, we observed that our method does not perform as well when the percentage of OOD data is low (<5%). This may be attributed to the incorrect exclusion of some in-distribution (ID) data by the OOD detection mechanism.
Our method on VGG-16. We also conducted experiments on VGG-16; the performance of each method is shown in Table 2. All experimental settings are the same as before.
Robustness of our method on transformed test sets. To verify the robustness of our method, we also conducted experiments on real datasets with image transformations. For group A, we used cutout; for group B, we used shear and changed the hue. These transformations are not used in the process of generating the meta-dataset. As shown in Figure 5, the experimental results demonstrate that our method performs better than AutoEval on real datasets with image transformations.

5. Conclusions

In this paper, we proposed an AutoEval method that uses energy-based out-of-distribution (OOD) detection. Our method adjusts the representation of the test set with OOD detection to address the limitation of traditional AutoEval methods in handling OOD inputs, making it more suitable for deployment environments containing OOD inputs. When the proportion of OOD data is significant (>5%), our method achieves more accurate predictions than existing methods, closer to the classifier’s true accuracy. Moreover, our method requires neither retraining of the classifier and regression model nor any additional data. We explained the relationship between the OOD detection error rate and the error of our method. Additionally, we selected a threshold suitable for the AutoEval method through experiments and validated it on different datasets and classifiers, demonstrating its applicability to various OOD data. We also experimented on corrupted datasets and showed that our method is robust and outperforms existing methods. In summary, our method is more efficient, faster, and more accurate in predicting classifier accuracy in deployment environments where performance needs to be evaluated.
However, our method has limitations. We noticed that when the proportion of OOD data is very low (<5%), the performance of our method may decrease due to errors in OOD detection. To address this, we attempted to use more advanced and more accurate OOD detection methods to minimize the impact of this error on prediction accuracy. Additionally, we noticed that the energy scores and other information produced by the OOD detector can provide guidance for the accuracy-predicting regression model. In the future, we plan to utilize this information to further improve prediction accuracy.

Author Contributions

Conceptualization, F.Z. and Y.Z.; methodology, F.Z.; software, F.Z.; validation, F.Z.; formal analysis, F.Z.; investigation, F.Z.; resources, Y.Z., X.L. and Z.L.; data curation, F.Z.; writing—original draft preparation, F.Z.; writing—review and editing, F.Z. and Y.Z.; visualization, F.Z.; supervision, Y.Z., X.L. and Z.L.; project administration, Y.Z.; funding acquisition, Y.Z., X.L. and Z.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the University Synergy Innovation Program of Anhui Province under grant no. GXXT-2022-043 and by the Natural Science Foundation of Shanxi Province under grant no. 202203021211116.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Platanios, E.A.; Dubey, A.; Mitchell, T. Estimating accuracy from unlabeled data: A bayesian approach. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 20–22 June 2016; pp. 1416–1425. [Google Scholar]
  2. Donmez, P.; Lebanon, G.; Balasubramanian, K. Unsupervised Supervised Learning I: Estimating Classification and Regression Errors without Labels. J. Mach. Learn. Res. 2010, 11, 1323–1351. [Google Scholar]
  3. Deng, W.; Zheng, L. Are labels always necessary for classifier accuracy evaluation? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 15069–15078. [Google Scholar]
  4. Sun, X.; Hou, Y.; Li, H.; Zheng, L. Label-free model evaluation with semi-structured dataset representations. arXiv 2021, arXiv:2112.00694. [Google Scholar]
  5. Sun, X.; Hou, Y.; Deng, W.; Li, H.; Zheng, L. Ranking models in unlabeled new environments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 11761–11771. [Google Scholar]
  6. Deng, W.; Gould, S.; Zheng, L. What does rotation prediction tell us about classifier accuracy under varying testing environments? In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–15 April 2021; pp. 2579–2589. [Google Scholar]
  7. Yang, J.; Zhou, K.; Li, Y.; Liu, Z. Generalized out-of-distribution detection: A survey. arXiv 2021, arXiv:2110.11334. [Google Scholar]
  8. Bogdoll, D.; Nitsche, M.; Zöllner, J.M. Anomaly detection in autonomous driving: A survey. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4488–4499. [Google Scholar]
  9. Shafaei, A.; Schmidt, M.; Little, J.J. Does your model know the digit 6 is not a cat? A less biased evaluation of “outlier” detectors. arXiv 2018, arXiv:1809.04729. [Google Scholar]
  10. Garg, S.; Balakrishnan, S.; Lipton, Z.C.; Neyshabur, B.; Sedghi, H. Leveraging unlabeled data to predict out-of-distribution performance. arXiv 2022, arXiv:2201.04234. [Google Scholar]
  11. Guillory, D.; Shankar, V.; Ebrahimi, S.; Darrell, T.; Schmidt, L. Predicting with confidence on unseen distributions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1134–1144. [Google Scholar]
  12. Liu, W.; Wang, X.; Owens, J.; Li, Y. Energy-based out-of-distribution detection. Adv. Neural Inf. Process. Syst. 2020, 33, 21464–21475. [Google Scholar]
  13. Zhu, Y.; Chen, Y.; Xie, C.; Li, X.; Zhang, R.; Xue, H.; Tian, X.; Zheng, B.; Chen, Y. Boosting Out-of-distribution Detection with Typical Features. arXiv 2022, arXiv:2210.04200. [Google Scholar]
  14. Defard, T.; Setkov, A.; Loesch, A.; Audigier, R. Padim: A patch distribution modeling framework for anomaly detection and localization. In Proceedings of the Pattern Recognition, ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Proceedings, Part IV. Springer: Berlin/Heidelberg, Germany, 2021; pp. 475–489. [Google Scholar]
  15. Eilertsen, G.; Jönsson, D.; Ropinski, T.; Unger, J.; Ynnerman, A. Classifying the classifier: Dissecting the weight space of neural networks. arXiv 2020, arXiv:2002.05688. [Google Scholar]
  16. Jiang, Y.; Neyshabur, B.; Mobahi, H.; Krishnan, D.; Bengio, S. Fantastic generalization measures and where to find them. arXiv 2019, arXiv:1912.02178. [Google Scholar]
  17. Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 2021, 64, 107–115. [Google Scholar] [CrossRef]
  18. Corneanu, C.A.; Escalera, S.; Martinez, A.M. Computing the testing error without a testing set. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2677–2685. [Google Scholar]
  19. Hendrycks, D.; Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv 2016, arXiv:1610.02136. [Google Scholar]
  20. Liang, S.; Li, Y.; Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv 2017, arXiv:1706.02690. [Google Scholar]
  21. Zhang, Y.; Deng, W.; Zheng, L. Unsupervised Evaluation of Out-of-distribution Detection: A Data-centric Perspective. arXiv 2023, arXiv:2302.08287. [Google Scholar]
  22. Yu, Q.; Aizawa, K. Unsupervised out-of-distribution detection by maximum classifier discrepancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9518–9526. [Google Scholar]
  23. Chen, X.; Lan, X.; Sun, F.; Zheng, N. A boundary based out-of-distribution classifier for generalized zero-shot learning. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXIV. Springer: Berlin/Heidelberg, Germany, 2020; pp. 572–588. [Google Scholar]
  24. Lee, K.; Lee, K.; Lee, H.; Shin, J. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Adv. Neural Inf. Process. Syst. 2018, 31. Available online: https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html (accessed on 12 April 2023).
  25. Wang, H.; Li, Z.; Feng, L.; Zhang, W. Vim: Out-of-distribution with virtual-logit matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 4921–4930. [Google Scholar]
  26. Baek, C.; Jiang, Y.; Raghunathan, A.; Kolter, J.Z. Agreement-on-the-line: Predicting the performance of neural networks under distribution shift. Adv. Neural Inf. Process. Syst. 2022, 35, 19274–19289. [Google Scholar]
  27. Chen, L.; Zaharia, M.; Zou, J.Y. Estimating and explaining model performance when both covariates and labels shift. Adv. Neural Inf. Process. Syst. 2022, 35, 11467–11479. [Google Scholar]
  28. Maggio, S.; Bouvier, V.; Dreyfus-Schmidt, L. Performance Prediction Under Dataset Shift. In Proceedings of the 2022 26th International Conference on Pattern Recognition (ICPR), IEEE, Montreal, QC, Canada, 21–25 August 2022; pp. 2466–2474. [Google Scholar]
  29. Risser-Maroix, O.; Chamand, B. What can we Learn by Predicting Accuracy? In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 2390–2399. [Google Scholar]
  30. Dowson, D.; Landau, B. The Fréchet distance between multivariate normal distributions. J. Multivar. Anal. 1982, 12, 450–455. [Google Scholar] [CrossRef]
  31. Krizhevsky, A.; Hinton, G. Learning Multiple Layers of Features from Tiny Images. 2009. Available online: https://www.cs.toronto.edu/~kriz/ (accessed on 12 April 2023).
  32. Recht, B.; Roelofs, R.; Schmidt, L.; Shankar, V. Do CIFAR-10 Classifiers Generalize to CIFAR-10? arXiv 2018, arXiv:1806.00451. [Google Scholar]
  33. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, IEEE, Miami, FL, USA, 20–25 June 2009; pp. 248–255. [Google Scholar]
  34. Griffin, G.; Holub, A.; Perona, P. Caltech-256 Object Category Dataset. 2007. Available online: https://resolver.caltech.edu/CaltechAUTHORS:CNS-TR-2007-001 (accessed on 12 April 2023).
  35. Netzer, Y.; Wang, T.; Coates, A.; Bissacco, A.; Wu, B.; Ng, A.Y. Reading Digits in Natural Images with Unsupervised Feature Learning. 2011. Available online: http://ufldl.stanford.edu/housenumbers (accessed on 12 April 2023).
  36. Yu, F.; Seff, A.; Zhang, Y.; Song, S.; Funkhouser, T.; Xiao, J. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv 2015, arXiv:1506.03365. [Google Scholar]
  37. Xu, P.; Ehinger, K.A.; Zhang, Y.; Finkelstein, A.; Kulkarni, S.R.; Xiao, J. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv 2015, arXiv:1504.06755. [Google Scholar]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  39. Torralba, A.; Fergus, R.; Freeman, W.T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2008, 30, 1958–1970. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Left: The training process of the regression model. We first generate a set of transformed datasets from the training set and use the information from transformed datasets to train the regression model for predicting model accuracy. Note that in these transformed datasets, the mask of the images is always set to 1 because the labels of these images have not changed and still belong to the original distribution. Right: We use the trained regression model to predict the accuracy of the model, and the image shown in the red box indicates out-of-distribution (OOD) data. To avoid the impact of these OOD images on the prediction of the regression model, an OOD detector sets the mask of the corresponding features of the image to 0.
Figure 2. Transformed images in the meta-dataset. The source images are from the training set; each transformed image keeps the label of its source image.
Figure 3. (a): The vertical axis represents the error between the model accuracy predicted by our method and the true accuracy of the classifier, while the horizontal axis represents the OOD detection threshold. This error decreases initially and then increases as the OOD threshold increases. (b): The vertical axis represents the error of OOD detection, while the horizontal axis represents the OOD detection threshold. This error decreases as the OOD threshold increases. The ID/OOD data in the dataset comes from CIFAR-10.1/ImageNet and ImageNet/SVHN, respectively. The ratio of ID data to OOD data is 9:1.
Figure 4. The error changes of our method under different proportions of OOD data. When the proportion of OOD data is low, our method only shows a slight improvement over other methods. However, as the proportion of OOD data increases, our method’s superiority over other methods becomes more prominent. The dataset consists of ID data and OOD data from CIFAR-10.1/ImageNet (a) and ImageNet/SVHN (b).
Figure 5. The graph shows the comparison of the error rate between our proposed method and AutoEval on transformed test sets. Img represents the ImageNet dataset, and CF10.1 represents the CIFAR10.1 dataset. For group A images, we applied cutout, while for group B, we used shear and changed the hue. (−)/(+) indicates that the predicted accuracy is lower/higher than the ground-truth accuracy. As shown in the figure, our method always outperforms traditional AutoEval on transformed datasets.
Table 1. Performance of the various methods on eight datasets containing OOD data. These datasets all consist of 90% ID data and 10% OOD data.
ID Data              |            ImageNet            |            CIFAR-10
OOD Data             | ImageNet   SVHN   ISUN   LSUN  | ImageNet   SVHN   ISUN   LSUN
Pred score (τ = 0.8) |   14.11   13.61  13.43  12.78  |    9.62    9.12   8.94   8.29
Pred score (τ = 0.9) |    8.64    7.85   7.31   6.95  |    4.16    3.37   2.83   2.46
Rotation             |    3.75    2.95   6.24   5.88  |    3.29    2.08   4.99   4.26
AutoEval             |    4.90    4.21   3.75   3.85  |    5.65    4.84   4.59   5.29
Ours                 |    2.56    3.91   3.39   3.02  |    3.01    3.73   2.07   1.71
Table 2. The performance of various methods on the VGG-16 classifier. The ID data/OOD data are CIFAR-10.1/ImageNet and ImageNet/SVHN. RMSE (%) is shown: lower is better.
Method    | CIFAR-10.1/ImageNet | ImageNet/SVHN
Rotation  |        5.77         |      5.30
AutoEval  |        6.11         |      4.69
Ours      |        4.38         |      3.47