1. Introduction
In recent years, researchers have focused on applying deep learning to numerous tasks, mainly because large datasets and powerful computing resources have made high accuracy attainable in areas such as image processing. Examples of such applications include AlexNet [1], VGG [2], ResNet [3], EfficientNet [4], and VisionTransformer [5]. The range of applications is expanding: in addition to general object recognition, deep learning has also been applied in highly specialized fields, such as medical images [6,7]. However, the scale of the dataset and the number of parameters in the model are becoming increasingly important [8]. State-of-the-art models such as VisionTransformer struggle to perform well on small datasets, yet the high cost of collecting rare and specialized images makes it difficult to obtain a sufficient sample size.
Pre-training models on similar datasets that include large numbers of samples often improves performance when the target task has only limited data [9,10,11,12]. For example, VisionTransformer is pre-trained on JFT-300M [8] before the inference model is trained on ImageNet-1K [13], because ImageNet-1K contains roughly 1.2 million images, which is insufficient for training VisionTransformer from scratch. In contrast, JFT-300M contains 300 million images and therefore serves as a sufficient pre-training dataset for VisionTransformer. Thus, when training on a small- or medium-scale dataset, pre-training on a larger dataset yields higher accuracy than training from scratch.
CLIP [14] and ALIGN [15] are models trained on large datasets that are effective for several downstream tasks. CLIP trains two models from a combination of 400 million images and natural language descriptions of each image: one that extracts image features and one that extracts text features. CLIP can be used as a pre-trained model to obtain higher accuracy on small datasets. Similarly, ALIGN learns image features from over one billion image and alt-text pairs collected from the web. Instead of manually screening the samples, ALIGN accepts noisy samples in order to collect a larger dataset.
However, regardless of the pre-trained model employed, adequate accuracy cannot be obtained without fine-tuning with appropriate hyperparameters, which vary across datasets. Therefore, the best model must be found by searching for hyperparameters [16,17,18] using methods such as grid search [19], random search [20], and genetic algorithms [21]. In addition, certain hyperparameter choices can improve or worsen the learned model. To increase the accuracy of the model obtained via fine-tuning, we must combine the best parts of these multiple models.
In general, using several different models for inference can be highly effective. For example, ensembles [22] increase accuracy by averaging the outputs of several different classifiers, as sketched below, and are widely used in machine learning. In deep learning, ensembles improve accuracy [23] and help identify good fine-tuning points across several models. However, the computational cost of a single deep learning model is higher than that of other machine learning models. Therefore, increasing the accuracy of a single model is essential as an alternative to running several models during inference.
Model soups [24] provide a model whose inference cost is that of a single model while achieving accuracy comparable to that of an ensemble. The method averages the weights of models fine-tuned with different hyperparameters from the same pre-trained model. It does not increase the computational cost of the model during inference and, like an ensemble, combines the strengths of several models.
However, model soups train all models on the same training data, even though different hyperparameters are used. Typically, different training data change the behavior of a model: partitioning the data into training and validation sets changes the resulting model, particularly when the dataset is small. Ideally, the accuracy should be equally high regardless of how the dataset is divided. This problem must be overcome to extend model soups to smaller datasets.
k-fold cross-validation [25] is a popular way to avoid errors caused by partitioning a dataset into training and validation data. It divides the available data into k subsets, creates k different pairs of training and validation datasets, and trains a model on each pair. We assume that synthesizing models trained on different training and validation datasets will provide higher accuracy in model soups than synthesizing models trained with different hyperparameters alone, such as the learning rate and data augmentation intensity.
This study aimed to apply model soups to different training and validation data in divided datasets to achieve robust learning on small amounts of data. Two experiments were performed for this purpose. First, we examined how the correlation between validation accuracy and test accuracy changes when models for different training and validation datasets are synthesized using model soups. Second, this study demonstrates how synthesizing models based on different training and validation data achieves both high accuracy and reliability.
2. Preliminaries
2.1. Model Soups
There are numerous ways to obtain a better model after training multiple models. The simplest method, “Best on val”, uses the model with the highest accuracy in the validation data as the resulting model. However, this approach only uses the selected model. We hypothesize that the accuracy is improved when using more models.
Model soups [24] is a method based on a simple insight: the weights of models fine-tuned from the same pre-trained model with different hyperparameters can be interpolated to improve recognition accuracy and robustness. The study proposes two main methods: uniform soups and greedy soups. Uniform soups take the average weight of all the learned models as the weight of the new model. Considering the parameters $\theta_1, \theta_2, \ldots, \theta_k$ of several different models, the function average, which returns the element-wise mean of the given model parameters, can be expressed as follows:
$$\mathrm{average}(\theta_1, \ldots, \theta_k) = \frac{1}{k}\sum_{i=1}^{k}\theta_i \qquad (1)$$
Specifically, uniform soups apply the function average to all the models obtained with different hyperparameters.
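As a rough illustration rather than the authors' reference implementation, uniform soups can be realized in PyTorch by averaging the state dictionaries of the fine-tuned models:

```python
import torch

def average_weights(state_dicts):
    """Uniform soup: element-wise average of the parameters of all models.

    All models are assumed to be fine-tuned from the same pre-trained model,
    so their state dictionaries share identical keys and tensor shapes.
    """
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack(
            [sd[key].float() for sd in state_dicts]
        ).mean(dim=0)
    return averaged

# Usage (hypothetical): model.load_state_dict(average_weights([sd_a, sd_b, sd_c]))
```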
Greedy soups greedily adopt models with high accuracy on the validation data. Let $\mathrm{Acc}(\theta, D)$ denote the accuracy on dataset $D$ of the model with parameters $\theta$, and let $D_{val}$ be the validation data shared by the model soups. Greedy soups are then executed as in Algorithm 1.
Algorithm 1 Greedy Soups
Input: Parameters θ_1, …, θ_k (sorted in decreasing order of Acc(θ_i, D_val))
ingredients ← {}
for i = 1 to k do
  if Acc(average(ingredients ∪ {θ_i}), D_val) ≥ Acc(average(ingredients), D_val) then
    ingredients ← ingredients ∪ {θ_i}
  end if
end for
return: average(ingredients)
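The following Python sketch mirrors Algorithm 1; `val_accuracy` is a hypothetical helper that evaluates a set of averaged weights on the shared validation data, and `average_weights` is the uniform averaging function sketched above.

```python
def greedy_soup(state_dicts, val_accuracy):
    """Greedy soup over candidate models sorted by decreasing validation accuracy.

    A candidate is kept only if adding it does not decrease the validation
    accuracy of the averaged weights.
    """
    ingredients = []
    best_acc = float("-inf")  # accuracy of the current (initially empty) soup
    for candidate in state_dicts:  # assumed pre-sorted by validation accuracy
        acc = val_accuracy(average_weights(ingredients + [candidate]))
        if acc >= best_acc:
            ingredients.append(candidate)
            best_acc = acc
    return average_weights(ingredients)
```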
Prior experimental studies have shown that greedy soups are more accurate on test data and more robust to distribution-shifted data than uniform soups and the original models.
However, prior research has focused on fine-tuning on medium-sized datasets, such as ImageNet [13]. ImageNet contains approximately 1.2 million images across 1000 classes, a number of samples considered sufficient to reduce the influence of splitting the training data into training and validation data. In many cases, however, it is difficult to collect even a medium-sized dataset. For example, CIFAR10 [26] contains only 50,000 training images from 10 classes. For the stable application of model soups to these domains, the effect of different training data should be considered.
2.2. k-Fold Cross-Validation
k-fold cross-validation [25] is a common evaluation method in machine learning, as shown in Figure 1. Algorithm 2 is used for training and evaluation.
Algorithm 2 k-Fold Cross-Validation
Input: Original dataset D and number of splits k
Divide D evenly into k groups D_1, …, D_k without duplication.
for i = 1 to k do
  D_val^i ← D_i, D_train^i ← D \ D_i
  Train model θ^i on D_train^i and evaluate it on D_val^i.
end for
return: the k trained models θ^i and their accuracies Acc(θ^i, D_val^i)
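A sketch of Algorithm 2 using scikit-learn's KFold; `train_model` and `evaluate` are hypothetical callbacks, and `X` and `y` are assumed to be NumPy arrays.

```python
import numpy as np
from sklearn.model_selection import KFold

def k_fold_cross_validation(X, y, k, train_model, evaluate):
    """Split the data into k folds and train/evaluate one model per fold."""
    fold_scores = []
    kf = KFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, val_idx in kf.split(X):
        model = train_model(X[train_idx], y[train_idx])
        fold_scores.append(evaluate(model, X[val_idx], y[val_idx]))
    return float(np.mean(fold_scores)), fold_scores
```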
This method improves the reliability of machine learning evaluations. For example, in a hyperparameter search, optimizing on a single fixed pair of training and validation data may overfit to that particular split. k-fold cross-validation prevents this overfitting by optimizing over k different pairs of training and validation data.
As described above, k-fold cross-validation is more effective for smaller datasets. Therefore, dataset bias, which is a bottleneck in applying model soups, may be reduced by splitting the dataset into several subsets of training and validation data, as in k-fold cross-validation. The following sections describe the methods used to reduce this bias.
3. Methods
This study determines whether model soups can be effectively applied to models trained using different training and validation data. A distinct problem arises from k-fold cross-validation. When training a model, we separate the training and validation data. However, when several models are synthesized, the validation data of one model are included in the training data of the other models. This is a critical problem because the validation data are responsible for detecting whether a model is overfitted to the training data. Additionally, the accuracy on the validation data determines whether a model is selected. For example, for Best on val, which serves as the baseline in model soups, the model with the highest accuracy on the validation data is adopted. Furthermore, greedy soups adopt models with higher accuracy on the validation data, as shown in Algorithm 1. Therefore, when training and validation data are properly separated, the accuracy on the validation data indicates whether the synthesized model can be expected to outperform the pre-synthesized models. However, when models based on different training and validation data are synthesized, data that served as validation data are used to learn the parameters of the post-synthesized model, so even the validation data of the pre-synthesized models become less reliable for the post-synthesized model.
For instance, consider training and validation data divided into five subsets. Model A is assigned the first subset as validation data and the remaining subsets as training data; for model A, the ratio of each subset used for training is 0:1:1:1:1. Similarly, model B is assigned the second subset as validation data and the remaining subsets as training data; for model B, the ratio is 1:0:1:1:1. When the two models are synthesized, the ratio of the trained subsets for the post-synthesized model becomes 1:1:2:2:2. The first and second subsets are used less frequently than the other subsets, but their usage is not zero. Validation data, as defined by k-fold cross-validation, therefore do not exist for the post-synthesized model, and an alternative indicator is required to validate it. Here, we investigate whether the validation data of the pre-synthesized models can be used as an accuracy indicator for the post-synthesized model.
First, we show the relationship between the accuracy on the validation data and on the test data when uniform soups are applied to models with the same hyperparameters but different training and validation data. The dataset used in the experiment provides test data that are independent of the training and validation data; therefore, the accuracy on the test data determines the accuracy of all models, including the post-synthesized models. To avoid leakage, the test data cannot be used to select the models to be synthesized. Based on the above results, we then show how to merge models based on different training and validation data while utilizing the validation data as effectively as possible.
3.1. Uniform Soups for Different Training and Validation Data
We conducted experiments to illustrate the relationship between the validation and test data when models trained on different subsets are combined using model soups. The procedure is as follows:
1. Create k pairs of training and validation data through k-fold cross-validation.
2. Prepare T sets of hyperparameters.
3. Fine-tune k × T models, one for each pair of data split and hyperparameter setting.
4. For each hyperparameter setting, synthesize models by applying uniform soups to every combination of n of the k models (n = 1, ..., k); a sketch of this step is given after the list.
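A sketch of the enumeration in step 4, assuming `models[(t, i)]` holds the state dict of the model fine-tuned with hyperparameter setting t on split i and reusing the `average_weights` function sketched in Section 2.1:

```python
from itertools import combinations

def synthesize_uniform_soups(models, k, hyperparams):
    """For each hyperparameter setting, average every combination of
    n = 1, ..., k models trained on different training/validation splits."""
    soups = {}
    for t in hyperparams:
        for n in range(1, k + 1):
            for subset in combinations(range(k), n):
                state_dicts = [models[(t, i)] for i in subset]
                soups[(t, subset)] = average_weights(state_dicts)
    return soups
```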
First, let $\theta^{i}_{t}$ denote the parameters of the model trained with the $t$-th hyperparameter setting on $D^{i}_{train}$ and validated on $D^{i}_{val}$. Second, $\mathrm{Acc}(\theta, D^{i}_{val})$ is the accuracy of the model with parameters $\theta$ with respect to the validation data $D^{i}_{val}$. Third, $\mathrm{Acc}(\theta, D_{test})$ is the accuracy of the model with parameters $\theta$ on the test data $D_{test}$. Finally, $\mathrm{average}(\cdot)$ is the element-wise average of the provided model parameters, as in uniform soups. We propose two measures of validation accuracy for a model $\bar{\theta}$ synthesized from the models of a set of splits $S$:
$$\mathrm{AvgAcc}(\bar{\theta}) = \frac{1}{|S|}\sum_{i \in S}\mathrm{Acc}(\bar{\theta}, D^{i}_{val}) \qquad (2)$$
$$\mathrm{MinAcc}(\bar{\theta}) = \min_{i \in S}\mathrm{Acc}(\bar{\theta}, D^{i}_{val}) \qquad (3)$$
AvgAcc is the average of the accuracies on the validation data corresponding to the included models, and MinAcc is the minimum accuracy among them.
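Under this reading (the merged model is evaluated on the validation set of each constituent split), the two measures could be computed as follows; `accuracy(params, dataset)` is a hypothetical evaluation helper.

```python
def avg_acc(soup_params, val_sets, accuracy):
    """AvgAcc: mean accuracy of the merged model over the validation sets
    of the splits whose models were included in the soup."""
    return sum(accuracy(soup_params, d) for d in val_sets) / len(val_sets)

def min_acc(soup_params, val_sets, accuracy):
    """MinAcc: worst-case accuracy of the merged model over the same sets."""
    return min(accuracy(soup_params, d) for d in val_sets)
```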
The following points were verified for each number of synthesized models created using the above procedure.
1. The RMSE between each of AvgAcc and MinAcc on the validation data and Acc on the test data.
2. The correlation between val accuracy and test accuracy.
First, to measure how far the validation accuracy is from the test accuracy for a particular number of synthesized models n, the RMSE (root mean squared error) is calculated as in Equation (4):
$$\mathrm{RMSE}(n) = \sqrt{\frac{1}{T}\sum_{t=1}^{T}\left(\mathrm{ValAcc}(\bar{\theta}^{\,n}_{t}) - \mathrm{Acc}(\bar{\theta}^{\,n}_{t}, D_{test})\right)^{2}} \qquad (4)$$
where $\bar{\theta}^{\,n}_{t}$ is the model synthesized from n models with the t-th hyperparameter setting and ValAcc is the val accuracy, i.e., AvgAcc or MinAcc. The expected value of the difference between $\mathrm{ValAcc}(\bar{\theta}^{\,n}_{t})$ and $\mathrm{Acc}(\bar{\theta}^{\,n}_{t}, D_{test})$ is close to zero if the validation data and the test data follow the same distribution. In contrast, the difference is likely to be larger when the validation accuracy overestimates the test accuracy because of the synthesis of models trained on different training and validation data. This experiment shows how far the distribution over the validation data diverges from that of the test data, depending on the number of merged models. Furthermore, we show that MinAcc is more effective than AvgAcc, based on the hypothesis that a more stringent criterion for the model is more robust than an average criterion.
Second, by calculating the correlation between the val accuracy and test accuracy for each number of merged models, we show whether the validation data can effectively select models that enhance the test accuracy. Furthermore, we examine the correlation in regions of high accuracy, which is particularly important when applying greedy soups.
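The two verification metrics can be sketched with NumPy and SciPy as below; `val_accs` and `test_accs` are hypothetical arrays holding ValAcc (AvgAcc or MinAcc) and the test accuracy for the T models synthesized at a given n.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(val_accs, test_accs):
    """Equation (4): RMSE between validation and test accuracy over the
    models synthesized with each hyperparameter setting at a fixed n."""
    val_accs, test_accs = np.asarray(val_accs), np.asarray(test_accs)
    return float(np.sqrt(np.mean((val_accs - test_accs) ** 2)))

def val_test_correlation(val_accs, test_accs):
    """Pearson correlation between val accuracy and test accuracy."""
    return float(pearsonr(val_accs, test_accs)[0])
```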
3.2. Greedy Soups Containing Different Training and Validation Data
To apply model soups to different training and validation data, we experimented with three different extensions of greedy soups:
Greedy soups (for all subsets)
Best on val for greedy soups per subset (BGSSs)
Greedy soups for greedy soups per subset (GSGSs)
The first method applies greedy soups, based on MinAcc calculated using Equation (3), to all models without distinguishing between training and validation splits. Although mixing models trained on different data may improve the test accuracy of the synthesized model, the models may be mixed repeatedly based on unreliable validation data.
The second method applies greedy soups to each pair of training and validation data separately and then adopts the per-subset soup with the highest val accuracy. The validation data remain reliable because they are never mixed into the training data of the synthesized model; however, this method does not interpolate between models trained on different data, so the benefit in accuracy is small.
The last method applies greedy soups to each pair of training and validation data and then applies greedy soups again to the resulting per-subset models. Because the models synthesized in the first step are based on the same training and validation data, the same greedy soups as in previous studies can be applied there. The second step then applies greedy soups across models based on different training and validation data, synthesizing them using an indicator based on val accuracy. By applying greedy soups on a subset-by-subset basis first, decisions based on MinAcc are needed only in the final iterations, which maintains the reliability of the validation data. A sketch of this procedure is given below.
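A sketch of GSGSs under stated assumptions: `greedy_soup` and `average_weights` are the functions sketched in Section 2.1, `models_per_subset[i]` holds the models trained on split i (sorted by decreasing accuracy on that split's validation data), `subset_val_acc[i]` evaluates a state dict on split i's validation data, and `min_acc_eval` evaluates a state dict with MinAcc over all splits. This is an illustration of the described procedure, not the authors' code.

```python
def gsgs(models_per_subset, subset_val_acc, min_acc_eval):
    """GSGSs: greedy soups within each split, then greedy soups across the
    resulting per-split soups using MinAcc as the selection criterion."""
    # Step 1: ordinary greedy soups inside each split; the validation data
    # of a split are never part of that split's training data.
    per_split_soups = [
        greedy_soup(models, subset_val_acc[i])
        for i, models in enumerate(models_per_subset)
    ]
    # Step 2: greedy soups across splits, ranked and selected by MinAcc.
    ranked = sorted(per_split_soups, key=min_acc_eval, reverse=True)
    return greedy_soup(ranked, min_acc_eval)
```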
The val and test accuracies of the models were compared.
3.3. Experimental Setup
As in previous studies, we employed CLIP ViT-B/32 [14] as the pre-trained model. Furthermore, we fine-tuned only the final linear layer before fine-tuning the entire model (LP initialization), as described by Kumar et al. [27]. To make this correspond to the different training and validation subsets, we applied LP initialization separately for each of the k subsets and trained the model of the corresponding subset, as sketched below.
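The sketch below illustrates LP initialization in the spirit of Kumar et al. [27] with generic PyTorch code: the final linear layer is trained with the backbone frozen, and then the whole model is fine-tuned. The parameter-name prefix `head`, the learning rates, and the data loader are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def lp_then_finetune(model, train_loader, lp_epochs, ft_epochs, device="cuda"):
    """Linear-probe initialization: train only the final linear layer first,
    then unfreeze everything and fine-tune the whole model."""
    criterion = nn.CrossEntropyLoss()

    def run(optimizer, epochs):
        model.train()
        for _ in range(epochs):
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(images), labels)
                loss.backward()
                optimizer.step()

    model.to(device)
    # Phase 1: freeze the backbone, train only the classification head
    # (assumed to be the parameters whose names start with "head").
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("head")
    run(torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=1e-3), lp_epochs)
    # Phase 2: unfreeze all parameters and fine-tune end to end.
    for p in model.parameters():
        p.requires_grad = True
    run(torch.optim.AdamW(model.parameters(), lr=3e-5), ft_epochs)
    return model
```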
CIFAR10 was used as the dataset. CIFAR10 provides training data and test data; note that there are no official validation data. The training data include 50,000 samples, and the test data include 10,000 samples. Each sample is an image belonging to one of 10 classes. The images are 32 × 32 pixels but are scaled up to 224 × 224 pixels when input to the model. Both the training and test data include an equal number of samples per class: the training data include 5000 images per class, and the test data include 1000 images per class.
We split the training data defined in CIFAR10 into subsets of training and validation data, as shown in Algorithm 2. We experimented with five splits; hence, each subset contained training data with 40,000 samples and validation data with 10,000 samples. The training data are divided so that each subset contains an equal number of images per class, as sketched below. Training and validation data are assigned when each model is trained. Note that, before synthesis, the validation data assigned to a model are never used to train that model.
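A possible implementation of the class-balanced five-way split, using torchvision's CIFAR-10 loader and scikit-learn's StratifiedKFold; the exact splitting procedure used in the study may differ.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from torchvision.datasets import CIFAR10

train_set = CIFAR10(root="./data", train=True, download=True)
labels = np.array(train_set.targets)

# Five class-balanced folds: each fold provides 10,000 validation samples,
# and the remaining 40,000 samples form the training data of that split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
splits = list(skf.split(np.zeros(len(labels)), labels))  # [(train_idx, val_idx), ...]
```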
We trained models with 18 different hyperparameter settings. The searched hyperparameters comprise the learning rate, the strength of data augmentation, and whether to apply mixup [28]. This study adopts grid search. The learning rate is chosen from 0.00003, 0.00001, or 0.000003, and the strength of data augmentation is chosen from the minimal, medium, or strong categories defined in a previous study [24]. As a result, grid search covers 18 combinations of hyperparameters, enumerated below; combined with the five different training and validation splits, a total of 90 models are trained.
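The 18 configurations can be enumerated with a simple grid, as sketched below; the augmentation strengths are represented only by name, since their exact definitions follow the previous study [24].

```python
from itertools import product

learning_rates = [3e-5, 1e-5, 3e-6]
augmentations = ["minimal", "medium", "strong"]
mixup_options = [False, True]

# 3 x 3 x 2 = 18 hyperparameter combinations; with 5 training/validation
# splits this yields 90 fine-tuned models in total.
grid = [
    {"lr": lr, "augmentation": aug, "mixup": mixup}
    for lr, aug, mixup in product(learning_rates, augmentations, mixup_options)
]
assert len(grid) == 18
```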
4. Experimental Results
As described in Section 3.3, this study trained 90 models based on 18 hyperparameter settings and 5 training and validation splits. Figure 2 shows the relationship between the validation accuracy and the test accuracy of the trained models. Among these models, the highest validation accuracy is 0.9844, and the corresponding model also achieves the best test accuracy (0.9809). The lowest validation accuracy is 0.9720, and the corresponding model's test accuracy is 0.9721, the 11th lowest. These models are used in both of the following experiments.
4.1. Model Soups for Different Training and Validation Data
4.1.1. Comparison between AvgAcc and MinAcc
Table 1 shows the RMSE between the validation and test accuracies for each number of merged models.
The experimental results show that the distribution of the validation and test data shifts as the number of merged models increases. In addition, the RMSE of MinAcc is smaller than that of AvgAcc for any number of merged models, indicating that MinAcc is the more reliable validation measure.
4.1.2. Correlation of Classification Accuracy between Validation and Test Data
Figure 3 shows the accuracy of the validation and test data when n models with the same hyperparameters and different training and validation data were synthesized via uniform soups.
In the above graph, the correlation between val and test accuracies for each number of models n merged in uniform soups is shown in “corr for all” in Table 2. This result implies that the greater the number of synthesized models, the stronger the correlation between MinAcc and test accuracy.
However, to improve the accuracy of greedy soups, reliability in regions of high accuracy is important. In Figure 3, there are regions in which both MinAcc and the test accuracy are low. These are samples in which the interpolation of weights worsened the accuracy because of the large distance between the models. Owing to their low validation accuracy, such models are not selected, so they do not harm the accuracy of greedy soups. Next, we investigated how the accuracy changes for each number of synthesized models in the region with high val accuracy.
Figure 4 shows the results of the synthesized models after removing those with a val accuracy less than the threshold. The threshold is the minimum val accuracy of the models before synthesis. The correlation between the val accuracy and test accuracy in the extracted models shows that the correlation coefficient is lower for
than for
, as indicated by the “corr over thre” in
Table 2.
As shown above, synthesizing models based on different training and validation data reduces the reliability of the validation data. Conversely, although the validation accuracy is overestimated, it still allows us to confirm that synthesizing models from different subsets can increase accuracy. Therefore, val accuracy can detect an extreme decrease in test accuracy when models are synthesized.
4.2. Greedy Soups Containing Different Training and Validation Data
For greedy soups with different training and validation data, we compared the following methods:
Three types of greedy soups based on different subsets of training and validation data.
Uniform soups for all models.
Best on val, the single model with the greatest accuracy on the validation data.
The val accuracies (calculated based on MinAcc, except for Best on val) and test accuracies of these methods are shown in Table 3.
The experiment shows that GSGSs, which applies greedy soups within each subset and finally applies greedy soups across subsets, produces the best test accuracy. The highest val accuracy is achieved by applying greedy soups with all subsets mixed; however, its test accuracy is lower than that of the other greedy soup methods. This result seems to stem from the unreliability of the validation data caused by repeatedly synthesizing models that include each other's validation data in their training data. However, greedy soups still exceeded the baseline Best on val, indicating that the method eliminates cases in which synthesis would result in extremely low accuracy.
5. Discussion and Future Work
5.1. Reliability of val Accuracy
This study synthesized models assigned different subsets of training and validation data with greedy soups while preserving the reliability of the validation data as much as possible. Nevertheless, even with only about five models, the correlation between the val and test accuracies indicates that synthesizing models from different subsets reduces the reliability of val accuracy.
Some accuracy is nevertheless maintained when applying greedy soups based on this less reliable val accuracy, because val accuracy remains an adequate indicator of test accuracy in the low-accuracy region. This prevents merges that would result in accuracy lower than the baseline. However, choosing the merge that maximizes accuracy is difficult because reliability is lower in regions of higher accuracy.
In the future, we will develop and verify a method for calculating the val accuracy that is more reliable than the MinAcc. In addition, we will investigate whether there is an index with a higher correlation with test accuracy, and verify whether test accuracy can be improved by combining it with greedy soups.
5.2. Hyperparameter Optimization
The uniform soup results were above the baseline in the previous study but below the baseline in this study. This difference may be because some hyperparameter settings deviated too far from the original model. To examine whether this deterioration in accuracy is related to the combination with k-fold cross-validation, we ran uniform soups within each subset and adopted the one with the highest validation accuracy.
Table 4 confirms that the accuracy deteriorated when combined with the k-fold cross-validation.
The above results suggest that greedy soups are more effective when the region of the optimal parameters is unknown.
5.3. Generalization Performance for Highly Specialized Datasets
This research aims to apply this method to highly specialized fields, such as medical images, where it is difficult to collect multiple samples. However, this experiment was conducted within the scope of general category recognition. In the future, we will consider the type of problems that may arise when the proposed method is applied to highly specialized datasets.
First, having only a few samples may cause a considerable distance between models trained on different training and validation data. If the distance between the models is too large, the effectiveness of merging by model soups is reduced, and the result becomes equivalent to simply adopting the model with the highest validation accuracy. This problem can be mitigated by adjusting the hyperparameters to constrain the distance from the source model so that it does not grow too large.
Second, the source model may not capture the features of the target dataset. For example, CLIP [14] was trained on a collection of generic images. It is known that transfer learning from a generic source such as ImageNet becomes less effective when the target data are highly specialized and far from the source dataset. One way to address this issue is to use a source dataset that is closer to the target data; for medical images such as CT scans, for instance, it is important to train the source model on a dataset from a close domain, such as RadImageNet [29].
5.4. Model Constraints
The proposed method is essentially model agnostic. The model employed in this experiment is too large to run on low-memory devices; however, lighter pre-trained models would make the proposed method applicable there as well. Moreover, the study is not limited to a particular size or type of model. As the proposed method is applied to various tasks in the future, the best model for each task can be adopted.
5.5. Future Work
This study shows experimentally that models synthesized from models trained on multiple subsets of training and validation data are more accurate than models synthesized within each subset. However, it is unclear how efficient this approach is. In other words, would it be more efficient to explore the learning rate or the strength of data augmentation than to explore multiple subsets? In this experiment, 90 models were trained based on 18 hyperparameter settings and 5 subsets. Increasing the range of the learning rate or the density of the search might also improve accuracy. To verify this possibility, we must first consider the optimal search, as described above. In the future, we will evaluate the accuracy on the test data with respect to the total number of trained models to demonstrate the benefit of the proposed method.
6. Conclusions
In this study, we verified that merging the weights of multiple models with model soups improves accuracy when the models are trained on different training and validation data created in the manner of k-fold cross-validation. To this end, we examined how the reliability of val accuracy changes when models whose training data contain each other's validation data are synthesized by model soups.
First, for validation, we synthesized multiple models within each hyperparameter setting by applying uniform soups to a set of models trained with different hyperparameters and different training and validation data. We observed that the correlation between the validation and test accuracies decreased with each additional merged model, especially in regions where the validation accuracy was higher than the baseline. However, when regions where the validation accuracy was lower than the baseline were included, the correlation increased, indicating that val accuracy can be used to detect a significant decrease in test accuracy.
Based on the above results, we proposed two greedy soup variants that allow models containing each other's validation data in their training data to be synthesized, and compared them with the baselines. In particular, GSGSs, which first performs greedy soups on models based on the same training and validation data and then performs greedy soups across different subsets, achieves higher accuracy than the baseline. This observation shows that val accuracy can estimate the test accuracy even when models are synthesized across different training and validation data.
This research allows us to create models that are less susceptible to bias caused by differences in the assignment of training and validation data. Because this bias grows as the amount of data shrinks, the study is particularly relevant for fine-tuning on the small datasets for which model soups are effective.
However, this study has four limitations. First, the val accuracy is not sufficiently reliable when models based on different training and validation data are synthesized. We showed experimentally that the val accuracy computed by the proposed method is reliable enough to serve as an indicator of which model should be synthesized; however, it is less reliable than the conventional val accuracy. In future research, we must investigate whether there is a calculation method that is more reliable than the proposed one, even when models are synthesized across training and validation data. Second, the search space for hyperparameters is non-trivial. The optimal hyperparameters vary for each dataset, and if the search space cannot be narrowed to an optimal range, the accuracy of the learned model becomes highly variable. We must improve the efficiency of the search in model soups in future work. Third, only a single benchmark dataset was used in the experiments. This research is well suited to datasets for which it is difficult to collect numerous samples because of their specialized nature, such as medical images, and future experiments should be conducted on such datasets. Finally, although the proposed method has been shown to improve the accuracy of the synthesized model, its efficiency compared with existing methods has not been shown. In the future, we will demonstrate the effectiveness of the proposed method.