1. Introduction
Convolutional neural networks (CNNs) are currently the state-of-the-art methods for a wide range of computer vision tasks, thanks to large-scale public datasets [1,2,3] and high-performance accelerators such as graphics processing units (GPUs). For example, CNNs rank highest in benchmarks for image classification [4,5,6], object detection [7,8,9,10], semantic segmentation [11], and action recognition [12,13,14]. The general recipe for a successful CNN model is to train a large model on a large-scale dataset. However, the size of a model is usually constrained by the accelerator's memory or by the training (or inference) speed, and collecting a large-scale dataset is very expensive. To address these issues, previous works have proposed multi-task learning (MTL) [12,15,16,17]. By definition, multi-task learning trains a single model with multiple functionalities. In the case of a CNN, the backbone feature extractor is shared among multiple tasks, and each task is assigned a task-specific head. Given a single input, the shared backbone extracts a feature map, each task-specific head processes that feature map, and multiple outputs are produced for the multiple tasks. By sharing the computationally expensive backbone, multi-task learning can be quite effective in saving computational and parametric costs compared with training separate models for each task. One of the early works in multi-task learning, UberNet [18], utilizes task-specific skip connections, combines features across different layers, and computes task-specific outputs; the network is trained with all task-specific losses simultaneously. Another work, One Model to Learn Them All, showed that multi-task learning works even with heterogeneous tasks such as image classification, machine translation, image captioning, and speech recognition. Multi-task learning has also been used to improve a target task's performance by training with auxiliary tasks [19].
Most CNNs are trained in a data-driven manner, which implies a prerequisite of large-scale datasets. Dataset collection and annotation require substantial time and financial cost, and benchmark results on ImageNet [20] show that a billion-scale dataset is needed to achieve state-of-the-art performance. There have been many efforts to make dataset collection more efficient with active learning [21], yet such methods are usually applicable only to certain types of tasks, such as classification. Ultimately, we need to collect datasets manually while considering many factors, including the size of the dataset, privacy issues, copyright issues, and diversity issues. To annotate a large-scale dataset, crowd-sourcing platforms such as Amazon Mechanical Turk are used, but it is hard to guarantee the quality of the annotations, which must be checked or refined iteratively. Due to these difficulties in data collection and annotation, CNN models are usually trained with limited data. In this work, we propose to utilize a set of such limited datasets with multi-task learning. The low-data condition is common in real-world scenarios, so let us assume that there are multiple datasets collected for different tasks. Even though the datasets are collected for different purposes, they share common knowledge, such as low-level image features. Through multi-task learning, a CNN model can better learn such common knowledge from the larger, combined dataset. Thus, multi-task learning can be an effective way to utilize multiple small-scale datasets and improve each task's performance. We empirically validate this efficacy on multiple sets of small-scale benchmarks.
In this work, we propose a task-specific feature filtering module that does not require heavy task-specific heads to generate task-specific features. Specifically, the feature filtering module learns to select feature channels for each task head in a data-driven manner. Instead of passing the full feature to the task-specific heads, we apply a one-dimensional channel-wise scaling filter to each feature map. A filter is assigned to each task, and the feature filtering is performed right before the corresponding task-specific head. The computational and parametric overhead of this feature filtering module is minimal, and since we use small task-specific heads, the overhead over a single-task network is also minimal. Through extensive experiments, we validate that the proposed task-specific feature filtering module improves the performance of various multi-task models.
Our main contributions are two-fold:
We empirically validate the efficacy of multi-task learning in low-data conditions. The model learns shared knowledge through multi-task learning from multiple datasets, and eventually improves all task performances over single-task models;
We propose a simple yet effective way to generate task-specific features with a task-specific feature filtering module. It consistently and significantly improves the performance of multi-task models with minimal overhead, allowing small task-specific heads to exploit task-specific features at minimal computational cost.
The remainder of this paper is organized as follows: Section 2 introduces related works; Section 3 describes the learning methods, model configurations, and filter configurations; Section 4 presents the experimental results of the proposed method; Section 5 analyzes the experimental results; finally, Section 6 summarizes our methods and results, and Section 7 discusses the limitations and future directions of our work.
2. Related Works
In this section, we review previous works related to early-exit architectures, the network architecture we use, and the multi-task learning method we use to merge and train on multiple datasets.
2.1. Early-Exit Architecture
Szegedy et al. [22] proposed the Inception network, which adds exits that output predictions directly from branches in the middle of the network, and used them as a training aid for back-propagation. In addition, Cai et al. [23] improved object detection accuracy by proposing a cascade structure built with early exits.
However, these methods are not preferred when computation and memory are limited, because they increase the parameters and computation of the network. Unlike previous studies, Phuong et al. [24] noted that the performance at an early exit is not significantly different from the performance at the final exit. Therefore, they used an early-exit architecture during training in order to prune the model without degrading the network's accuracy when computation and memory are limited.
Inspired by these studies, we use an early-exit architecture to effectively utilize multiple datasets in situations where the datasets are small or memory is limited. In our early-exit architecture, each exit produces the prediction for a different dataset, and all exits share the backbone network.
2.2. Multi-Task Learning
Multi-task learning is used in various deep-learning-based recognition fields such as object detection and instance segmentation. The most well-known application is object detection, where object localization and object classification are performed simultaneously. Recently, Kim et al. [25] showed that object localization performance improves greatly when the object classification branch is removed and only localization is performed. Kokkinos [18] also found that multi-task learning generally degrades the performance of each individual task.
On the other hand, Deng et al. [26] improved face detection performance by jointly learning facial landmark localization and face bounding box alignment through multi-task learning.
As such, multi-task learning is an efficient way to reduce parameters and computation by sharing backbone networks, but it may increase or decrease performance depending on the relationship between the tasks. In order to address the problem that the performance of each task cannot always be preserved in multi-task learning, we propose a method that improves the performance of all tasks by using an early-exit architecture.
3. Method
In this section, we introduce the mathematical notation and the necessary background for the discussion of the proposed multi-task training of multi-exit architectures. Then, we discuss the features extracted by training different datasets at the same time, and introduce task-specific feature filtering, a method to efficiently use features in multi-task training with multi-exit architectures.
3.1. Dataset Integration
For multi-task training with multi-exit architectures, we first need to integrate the datasets. For example, to integrate the ImageNet and Places365 datasets and merge them into the same mini-batch, the total quantity of the datasets, the image sizes, and the number of channels must be reconciled. To match the image size and the number of channels, we fit the smaller images to the larger image size and channel count: one-channel (grayscale) images are expanded to three channels so they can be integrated with datasets consisting of three RGB channels, and smaller images are upsampled with bilinear interpolation to the larger image size. In addition, smaller datasets are oversampled to have the same size as the largest dataset.
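For illustration only (this is a sketch under our own assumptions, not the authors' released code), the integration step described above could be implemented in PyTorch as follows; the dataset paths, target image size, and the assumption that the Places365 subset is the smaller one are all hypothetical.

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, Dataset
from torchvision import datasets, transforms

TARGET_SIZE = 224  # assumed: the larger image size among the merged datasets

shared_transform = transforms.Compose([
    transforms.Lambda(lambda img: img.convert("RGB")),  # expand grayscale to 3 channels, keep RGB as-is
    transforms.Resize((TARGET_SIZE, TARGET_SIZE),
                      interpolation=transforms.InterpolationMode.BILINEAR),
    transforms.ToTensor(),
])

class TaskTagged(Dataset):
    """Wraps a dataset so every sample also carries the id of its source task."""
    def __init__(self, dataset, task_id):
        self.dataset, self.task_id = dataset, task_id
    def __len__(self):
        return len(self.dataset)
    def __getitem__(self, idx):
        img, label = self.dataset[idx]
        return img, label, self.task_id

ds_places = datasets.ImageFolder("data/places365_subset", transform=shared_transform)
ds_imnet = datasets.ImageFolder("data/imagenet_subset", transform=shared_transform)

# Oversample the smaller dataset so both datasets contribute equally per epoch
# (here we assume the Places365 subset is the smaller one).
ratio = max(1, len(ds_imnet) // max(1, len(ds_places)))
ds_places = ConcatDataset([ds_places] * ratio)

merged = ConcatDataset([TaskTagged(ds_places, 0), TaskTagged(ds_imnet, 1)])
loader = DataLoader(merged, batch_size=256, shuffle=True, num_workers=8)
```

The per-sample task id is kept so that, during training, each image only contributes a loss at the exit belonging to its source dataset.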
3.2. Multi-Exit Architectures
Our key assumption is that, even across different datasets, the low-level features extracted by the model's feature extractor can be shared, because models for different tasks exhibit similar characteristics [27]. Therefore, if we train with the integrated dataset, the model learns better low-level features due to the increased diversity of the training samples. To further maximize the benefit of sharing low-level features, we choose early-exit architectures: low-level features are shared among all tasks, while high-level features are not.
Early-exit architectures are layered classification architectures with exits at different depths.
Figure 1 illustrates the general form of an early-exit architecture. The early-exit architecture has multiple exits, and each exit corresponds to one task. For example, if we merge multiple classification tasks, each exit is a classification network (i.e., a multi-layer perceptron) for one task. In order to designate a specific exit for each dataset in the integrated dataset, we sort the datasets by their number of classes $K$ in ascending order and determine the exit depths according to that order. Let $T$ denote the number of integrated datasets, and let $(x_i^t, y_i^t)$ denote an image and label pair from dataset $D_t$, with $t = 1, \dots, T$ and $i = 1, \dots, N$, where $N$ is the number of images per dataset (equal across datasets after oversampling). The objective function of the multi-exit architecture, $\mathcal{L}$, can be expressed as follows:
$$\mathcal{L} = \sum_{t=1}^{T} \sum_{i=1}^{N} \ell_t\big(f_t(x_i^t),\, y_i^t\big),$$
where $f_t$ denotes the shared backbone up to the exit assigned to task $t$ together with its task-specific head, and $\ell_t$ is the cross-entropy loss for task $t$.
The final training objective is the sum of all task-specific losses. However, note that labels are only partially available for each training sample. For example, if the training sample is from ImageNet, then only the ImageNet label is available, so only the ImageNet cross-entropy loss is computed for that sample. The integrated dataset is a balanced combination of multiple datasets, so each mini-batch is expected to contain samples from all datasets in a balanced way.
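To make this setup concrete, the following minimal sketch (under our own assumptions, not the authors' released code) shows a two-exit ResNet-50 in which the Layer 3 feature feeds one single-fully-connected-layer head and the Layer 4 feature feeds the other, with the loss computed only at the exit matching each sample's source dataset.

```python
# Sketch of a two-exit ResNet-50: Layer3 features feed one task head and
# Layer4 features feed the other. The per-sample task id selects which
# exit's cross-entropy loss is computed, since each image carries only
# the label of its source dataset. All names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50

class TwoExitResNet(nn.Module):
    def __init__(self, num_classes_exit3: int, num_classes_exit4: int):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.layer1, self.layer2 = backbone.layer1, backbone.layer2
        self.layer3, self.layer4 = backbone.layer3, backbone.layer4
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Minimal task-specific heads: one fully connected layer each.
        self.head_exit3 = nn.Linear(1024, num_classes_exit3)  # Layer3 outputs 1024 channels
        self.head_exit4 = nn.Linear(2048, num_classes_exit4)  # Layer4 outputs 2048 channels

    def forward(self, x):
        x = self.layer2(self.layer1(self.stem(x)))
        f3 = self.layer3(x)
        f4 = self.layer4(f3)
        out3 = self.head_exit3(self.pool(f3).flatten(1))
        out4 = self.head_exit4(self.pool(f4).flatten(1))
        return out3, out4

def multi_exit_loss(out3, out4, labels, task_ids):
    """Cross-entropy only at the exit that matches each sample's dataset."""
    loss = out3.new_zeros(())
    mask3, mask4 = task_ids == 0, task_ids == 1
    if mask3.any():
        loss = loss + F.cross_entropy(out3[mask3], labels[mask3])
    if mask4.any():
        loss = loss + F.cross_entropy(out4[mask4], labels[mask4])
    return loss
```

In the ImageNet–Places365 setting used in Section 4, task 0 would correspond to the Layer 3 exit (Places365) and task 1 to the Layer 4 exit (ImageNet).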
3.3. Task-Specific Feature Filtering
The features learned in MTL are shared and optimized with respect to multiple tasks. The advantage is that the model learns diverse and sharable features; the disadvantage is a lack of task-specific features. Our assumption is therefore that performance can be improved by removing redundant features and generating task-specific features before the classifier, rather than naively using the shared features. As found in [28], feature filtering can significantly improve model performance. As shown in previous studies [5,28,29,30], features with relatively low activation values are unnecessary or redundant. Attention mechanisms have been proposed to exploit this observation, but they are difficult to use for removing unnecessary features: they are not effective in multi-exit architectures, and they only reweight features according to attention values without removing any. Therefore, we propose a task-specific feature filtering method that effectively removes unnecessary features at each task-specific exit. The task-specific feature filtering module, shown in Figure 2, is located in front of each exit so as not to affect the other exits, and is not located in the shared feature extractor. The filter has trainable parameters $\alpha$ and $\beta$, each containing one scalar value per channel of the input feature $F$ fed to the exit. $\alpha$ and $\beta$ sequentially perform channel-wise multiplication and addition on $F$. The resulting values are squashed to the range $[0, 1]$ by the sigmoid function $S$, and redundant features are suppressed by multiplying the result with the input features.
This process can be expressed as follows:
$$\hat{F} = F \odot S(\alpha \odot F + \beta),$$
where $\odot$ denotes channel-wise (broadcast) multiplication and $\hat{F}$ is the filtered feature passed to the task-specific head.
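A minimal sketch of this filter, assuming $\alpha$ and $\beta$ are broadcast per channel over the spatial dimensions (consistent with the description in Section 4.2), could look as follows; it is illustrative, not the authors' implementation.

```python
# Per-channel trainable alpha (scale) and beta (offset), a sigmoid, and an
# element-wise multiplication with the input feature map. One instance is
# placed in front of each exit. Initialization values are an illustrative
# choice, not taken from the paper.
import torch
import torch.nn as nn

class TaskSpecificFeatureFilter(nn.Module):
    def __init__(self, num_channels: int):
        super().__init__()
        # One scalar per channel, broadcast over the spatial dimensions.
        self.alpha = nn.Parameter(torch.ones(1, num_channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # gate = S(alpha * F + beta), applied element-wise to F
        gate = torch.sigmoid(self.alpha * feat + self.beta)
        return feat * gate
```

In the two-exit ResNet sketch above, one such filter with 1024 channels would sit before the Layer 3 head and one with 2048 channels before the Layer 4 head.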
4. Experiments
This section includes the details of the ablation study for searching the optimal multi-exit architecture and the details of the different sets of tasks. All experiments use low-data conditions, where we uniformly subsample the training data; during subsampling, the class distribution is kept the same as in the original full dataset. The validation and test sets remain unchanged. Our experiments are implemented in PyTorch 1.9.0 and run on four NVIDIA RTX 3090 GPUs (24 GB each).
4.1. Multi-Task Learning Details
This subsection explains the details of the baseline training settings. VGG [6] and ResNet-50 [31] are chosen as the main backbone architectures throughout this paper, as they are among the most widely used backbones in the computer vision research community. ResNet is composed of four Layers, each consisting of multiple convolutional blocks with convolutional layers, batch normalization layers, ReLUs, and residual connections. The four Layers of ResNet-50 output 256, 512, 1024, and 2048 channels, respectively. When ResNet-50 is used for single-task learning (in the case of a classification task), the final 2048-channel feature map is average-pooled and classified by the final fully connected layer. Conventional approaches to multi-task learning add heavy task-specific heads on top of the baseline backbone, incurring computational and parametric overheads. To reduce these overheads, we also explore minimal task-specific heads, where each head is only a single fully connected layer. Another design choice for a multi-task model is the exit location.
As discussed in Section 3, different tasks require different levels of semantics. A task that requires less semantic abstraction is assigned an earlier exit, so that the other tasks can fully exploit the remaining computation. As a comprehensive evaluation of multi-task learning in low-data conditions, we compare the effect of heavy task-specific heads and of the exit location in a multi-task model; the results of these experiments are included in the architecture search section. The final baseline for each experimental setting is chosen after this architecture search, selecting the model with the best performance across all tasks. For example, in the ImageNet–Places365 experiment, the best setting uses the Layer 4 features for ImageNet classification and the Layer 3 features for Places365 classification, with a minimal task-specific head of one fully connected layer for each.
The hyper-parameters used are as follows: the total number of epochs is 90, the batch size is 256, the base learning rate is 0.1, and the learning rate is decayed by a factor of 0.1 at epochs 30 and 60. Validation is done in a single-crop manner on the validation set, and we use overall classification accuracy as the single evaluation metric. During training, we jointly use all images and labels from the combined dataset, with only the available label for each image: images from the ImageNet dataset contribute a cross-entropy loss only at the ImageNet task-specific head, and no loss is computed at the Places365 task-specific head, and vice versa. As a result, the multi-task learning baselines improve over the single-task models in all multi-task learning settings, in all 5%, 10%, and 20% low-data settings, with minimal computational overheads over the single-task models.
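For illustration, a training loop matching these hyper-parameters might look as follows, reusing the `TwoExitResNet`, `multi_exit_loss`, and merged `loader` from the earlier sketches; the momentum and weight-decay values are assumed typical ImageNet settings and are not stated in the text.

```python
# SGD with the stated schedule: 90 epochs, batch size 256, base LR 0.1,
# decayed by 0.1 at epochs 30 and 60. Momentum/weight decay are assumptions.
import torch

model = TwoExitResNet(num_classes_exit3=365, num_classes_exit4=1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[30, 60], gamma=0.1)

for epoch in range(90):
    for images, labels, task_ids in loader:   # loader yields per-sample task ids
        out3, out4 = model(images)
        loss = multi_exit_loss(out3, out4, labels, task_ids)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```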
However, the minimal task-specific heads may not be enough to generate task-specific features as the task-specific heads are very small, and the performance improvements are mostly due to the learned common features with the combined dataset. To further utilize the task-specific features with minimal heads, we propose a task-specific feature filtering module. Details will be explained in the next section.
4.2. Task-Specific Feature Filtering Module Details
In the previous section, we explained how to compose a strong multi-task model with minimal overhead over the single-task models. However, since the task-specific heads are as minimal as one fully connected layer, the features used for each task may not be specialized for that task. Therefore, we propose a task-specific feature filtering module that generates task-specific features with minimal overhead while being effective in improving performance. In this section, we explain the proposed module in detail.
The proposed task-specific feature filtering module learns to select useful features for each task in a channel-wise manner. The module has two trainable parameters: a scaling vector and an offset vector. Both are 1D vectors with as many elements as the number of input feature channels. As the names indicate, the scaling vector is multiplied with the input features and the offset vector is then added, sequentially. The result is normalized to the range [0, 1] by a Sigmoid layer and used as the final filtering weight, which is multiplied with the original input feature map. The computations are all element-wise multiplications, additions, or Sigmoids, so the final overhead is as small as 0.0005G MACs per module. The module is shown to be very effective in multi-task learning; for example, it improves ImageNet performance by 3.5% and Places365 performance by 2% in the 5% low-data setting. All hyper-parameters and training schemes remain the same as in the baseline settings found in the previous section.
4.3. Experiment Results
To validate the efficacy of the proposed module, we train models in low-data conditions on the ImageNet and Places365 datasets using 5%, 10%, and 20% of the full training set. The compared methods are the single-task learning model (STL), the multi-task learning model (MTL), and the multi-task model with our proposed module. Our multi-task learning model consists of multiple exits, as shown in Figure 1; in this experiment, the model is constructed with exits at Layer 3 and Layer 4. To evaluate the proposed module, the module shown in Figure 2 is added to each exit of the model described above. Each attached module has its own alpha and beta parameters and serves to suppress the values of redundant features.
As shown in Table 1, the multi-task models achieve significantly higher performance than the single-task baselines, and our module further improves significantly over the multi-task models. It is worth noting that the improvement on ImageNet is very large: the multi-task model improves ImageNet accuracy by 12.891%, and the proposed module adds a further 3.046%. The performance improvements are consistent across the different data conditions.
As shown in Table 2, we also conducted experiments on the ImageNet–Oxford Pets [32] and ImageNet–Caltech-101 [33] multi-task settings. The efficacy of multi-task learning and of the proposed module is consistent across these different multi-task settings.
Based on the significant improvements over the single-task baselines, we empirically validated the efficacy of multi-task learning in low-data conditions and the efficacy of the proposed task-specific feature filtering module with minimal overheads. Therefore, in low-data conditions, multi-task learning is a very effective method to try with datasets in similar domains but different tasks, and the proposed module is a very cost-effective method to be added.
We also verified our method with additional networks: SE-ResNet and VGG-16. For SE-ResNet, we performed the Places365 task on the feature map output by Layer 3 and the ImageNet task on the feature map output by Layer 4, in the same way as for ResNet. For VGG-16, the Places365 task uses the feature map output by the 10th layer, and the ImageNet task uses the feature map output by the 13th layer. Additionally, we used a fully connected layer that expands to 4096 channels before the classifier to follow the design of the VGG network.
As shown in Table 3, the results show that our method is also efficient with these other networks, and that the proposed filtering further improves performance on top of it.
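For reference, the VGG-16 variant described above might be sketched as follows; this is an illustrative reading that assumes "10th layer" and "13th layer" refer to the 10th and 13th convolutional layers of the VGG-16 feature stack, with global average pooling before the 4096-channel fully connected expansion.

```python
# Illustrative two-exit VGG-16 sketch; exit placement and head structure
# are assumptions, not the authors' code.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class TwoExitVGG(nn.Module):
    def __init__(self, num_classes_places=365, num_classes_imnet=1000):
        super().__init__()
        self.features = vgg16().features
        self.pool = nn.AdaptiveAvgPool2d(1)
        # 4096-channel FC expansion before each classifier, as described in the text.
        self.head_places = nn.Sequential(nn.Linear(512, 4096), nn.ReLU(inplace=True),
                                         nn.Linear(4096, num_classes_places))
        self.head_imnet = nn.Sequential(nn.Linear(512, 4096), nn.ReLU(inplace=True),
                                        nn.Linear(4096, num_classes_imnet))

    def forward(self, x):
        conv_count = 0
        feat_places = feat_imnet = None
        for layer in self.features:
            x = layer(x)
            if isinstance(layer, nn.Conv2d):
                conv_count += 1
                if conv_count == 10:      # feature for the Places365 exit
                    feat_places = x
                elif conv_count == 13:    # feature for the ImageNet exit
                    feat_imnet = x
                    break                 # stop after the 13th conv for this sketch
        out_places = self.head_places(self.pool(feat_places).flatten(1))
        out_imnet = self.head_imnet(self.pool(feat_imnet).flatten(1))
        return out_places, out_imnet
```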
4.4. Multi-Exit Architecture Search
It was necessary to compare the performance of several models to find a combination of features suitable for multi-task learning in our structure. Therefore, we fixed the 10% ImageNet task at Layer 4 and measured how the performance changed as the exit for the other task was moved from Layer 1 to Layer 4. The model with the exit attached to Layer 1 improved ImageNet performance compared to the single-task model, but its Places365 performance was lower than that of single-task learning. The model with the exit attached to Layer 2 had the best ImageNet performance, but its Places365 performance was still lower than that of the single-task model. As shown in Table 4, we can empirically confirm that placing the exit at Layer 2 sacrifices that exit's task for the benefit of the task performed at Layer 4; that is, a model with an exit at Layer 2 may enrich the representation at Layer 4. However, multi-task learning does not aim to sacrifice one task for another, so we did not use the model with the exit at Layer 2. The model with the exit at Layer 3 had lower ImageNet performance than the Layer 2 model, but its Places365 performance was the best at 47.095%.
It is important to maximize the performance of both tasks with MTL. Therefore, we chose to use Layer 3 features and Layer 4 features for the two tasks, respectively. The model in which both exits were placed at Layer 4 had lower performance on all tasks than the model with an exit at Layer 3; in addition, it was not selected as our architecture because it consumed the most memory and computing resources among the models tested so far. We additionally conducted an experiment that adds one more Layer 4 to the existing ResNet, inspired by the fact that existing multi-task learning methods create task-specific features. Although this model had the highest MACs and parameters among the models we tested, its ImageNet performance was the second worst, and its Places365 performance did not exceed that of the Layer 3 model. This confirms that simply adding a large amount of computation to generate task-specific features is not helpful. As the last experiment, we evaluated the proposed filtering module and confirmed that it adds few MACs and parameters while achieving better performance than the baseline models.
6. Conclusions
In this work, we have validated the efficacy of multi-task learning in low-data conditions and proposed a cost-effective feature filtering module that generates task-specific features with minimal overhead. Our main target is low-data conditions, so we used 5%, 10%, and 20% of public benchmark training sets as the training data. Among the various design options for multi-task models, we focused on the early-exit location and the size of the task-specific heads. In contrast to conventional methods, we found that minimal task-specific heads can be more effective than heavy task-specific heads given a proper choice of early exit in the backbone feature extractor. Furthermore, we proposed a task-specific feature filtering module so that the minimal task-specific heads can exploit task-specific features. As a result, we have presented a structured approach to designing models for multi-task learning and have shown that the proposed module is effective for all tasks in all low-data conditions. A future direction of this research is the extension to a greater variety of tasks, such as segmentation and detection; multi-task learning among multiple segmentation tasks, or among segmentation and classification tasks, remains to be explored.