Article

Training Acceleration Method Based on Parameter Freezing

Hongwei Tang, Jialiang Chen, Wenkai Zhang and Zhi Guo
1 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100190, China
3 School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100190, China
4 Key Laboratory of Network Information System Technology (NIST), Institute of Electronics, Chinese Academy of Sciences, Beijing 100190, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(11), 2140; https://doi.org/10.3390/electronics13112140
Submission received: 31 March 2024 / Revised: 24 May 2024 / Accepted: 28 May 2024 / Published: 30 May 2024

Abstract

As deep learning has evolved, larger and deeper neural networks have become a popular trend in both natural language processing and computer vision tasks. With the increasing parameter size and model complexity of deep neural networks, more training data are also needed to avoid overfitting and to achieve better results. As a result, training deep neural networks takes more and more time. In this paper, we propose a training acceleration method based on gradually freezing the parameters during the training process. Specifically, by observing the convergence trend during the training of deep neural networks, we freeze part of the parameters so that they are no longer involved in subsequent training, reducing the time cost of training. Furthermore, an adaptive freezing algorithm that controls the freezing speed is proposed in accordance with the information reflected by the gradients of the parameters. Concretely, a larger gradient indicates that the loss function changes more drastically at that position, implying that there is more room for improvement with the parameter involved; a smaller gradient indicates that the loss function changes less and the learning of that part is close to saturation, with less benefit from further training. We use ViTDet as our baseline and conduct experiments on three remote sensing object detection datasets to verify the effectiveness of the method. Our method provides a minimum speedup ratio of 1.38×, while maintaining a maximum accuracy loss of only 2.5%.

1. Introduction

With the advancement of remote sensing technology, the resolution of remote sensing images has continuously improved and their coverage has become more extensive, allowing a greater amount of information to be extracted. Object detection in remote sensing images is one of the focal issues in the field of remote sensing image interpretation [1]. Its objective is to precisely classify and locate various objects in complex remote sensing images, such as airplanes, ships, vehicles, and more. This technology plays an irreplaceable role in a large number of applications.
In recent years, there has been rapid development in object detection methods based on deep learning [2]. Compared with traditional object detection algorithms, deep learning-based algorithms utilize deep neural networks trained with large amounts of data, allowing them to learn the distinctive features of objects. As a result, they achieve higher detection accuracy and greater efficiency than handcrafted feature extraction algorithms. Broadly speaking, deep learning-based algorithms can be categorized into two main categories. The first type is two-stage region proposal algorithms, including R-CNN [3], Fast R-CNN [4], and Faster R-CNN [5]. These algorithms extract candidate regions in the image, then classify and localize the objects through these regions. Although they demonstrate good performance, their complex structure and slower speed are drawbacks. The other type is one-stage object detection algorithms based on regression, such as YOLO [6], SSD [7], and RetinaNet [8]. These algorithms transform the localization and classification task into a regression problem, reducing spatial and temporal overhead. They are faster but have lower detection accuracy than two-stage algorithms.
Deep neural networks benefit greatly from their parameter size, ranging from millions to billions, as well as from stacked non-linear activation layers, which give the models much stronger capabilities for nonlinear system modeling. Larger and deeper neural networks are currently a popular trend in both natural language processing tasks and computer vision tasks [9,10]. With the increasing parameter size and model complexity of deep neural networks, more data are also needed for training to avoid overfitting and to achieve better results. In the field of remote sensing object detection, large-scale datasets with more than twenty thousand remote sensing images, such as LEVIR [11], DOTA [12], and DIOR [13], are commonly used for model training.
The expansion in the size of datasets and model parameters leads to a growing demand for time and resources during model training. Researchers need to debug models and compare and analyze experimental results. In practical applications, models need to adapt rapidly to new application scenarios and new data in order to remain effective. Therefore, time-consuming training processes that may take days or even months significantly slow down the progress of research and application. Conventional methods for accelerating the training of deep neural networks, based on parameter and model structure compression, are often difficult to design and have limited generalizability. Therefore, we aim to explore a training acceleration method that can be applied to different models by focusing on training strategies.
The main contributions of this paper are summarized as follows:
  • We design a training strategy based on freezing the parameters of models according to the convergence trend during the training of deep neural networks;
  • We implement a linear freezing algorithm, which can help save at least 19.4% of training time;
  • We present an adaptive freezing algorithm according to the information provided by the gradient, achieving a speedup ratio of at least 1.38×.

2. Related Works

2.1. Remote Sensing Object Detection

Extensive research has been devoted to object detection in optical remote sensing images, inspired by the great success of deep learning-based object detection methods in the computer vision community. Many improvements from multiple perspectives have been made in order to ameliorate the performance of deep neural networks applied to remote sensing object detection.
The excellent performance of R-CNN for natural scene object detection led to the adoption of the R-CNN pipeline in remote sensing object detection. Cheng et al. [14] proposed a rotation-invariant CNN (RICNN) model by adding a new rotation-invariant layer to the standard CNN model, which is used for the multi-class detection of geospatial objects. In order to further improve the performance of remote sensing object detection, Cheng et al. [15] imposed a rotation-invariant regularizer and a Fisher discrimination regularizer on the CNN features to train a rotation-invariant and Fisher-discriminative CNN (RIFD-CNN) model. Long et al. [16] presented an unsupervised score-based bounding box regression method for the accurate localization of geospatial objects, optimizing the bounding boxes of the objects with non-maximum suppression.
The introduction of Faster R-CNN also advanced remote sensing object detection. Based on Faster R-CNN, Li et al. [17] presented a rotation-insensitive RPN that can effectively handle the problem of rotation variations of geospatial objects by introducing multi-angle anchors into the existing RPN. In addition, a dual-channel feature combination network is designed to learn local and contextual properties to address the problem of appearance ambiguity. Xu et al. [18] proposed a deformable CNN to model the geometric variations of objects; the increase in false region proposals was reduced with non-maximum suppression constrained by aspect ratio. Zhong et al. [19] introduced a fully convolutional network based on the residual network to solve the dilemma between the translation variance in object recognition and the translation invariance in image classification. Pang et al. [20] argue that most detectors suffer from the issue of imbalance at the sample level, feature level, and objective level. At the sample level, IoU-Balanced Sampling guides the selection of samples to ensure that more hard negative samples are chosen during the training process, as the hard negative samples play a more significant role in model training. At the feature level, the Balanced Feature Pyramid resizes all the feature maps to a uniform size and then combines them, aiming to fully utilize feature maps at various scales. At the objective level, Balanced L1 Loss judges and weighs between the classification task and the localization task. With these modules, their model achieves better results. Qin et al. [21] use an Arbitrary-Oriented Region Proposal Network to generate rotational candidate regions. In order to obtain a more accurate bounding box, a multi-head network divides the bounding box regression into several tasks, such as center point location, scale prediction, etc.
Regression-based models have also been developed for remote sensing object detection. Liu et al. [22] replaced the traditional bounding box with a rotatable bounding box (RBox) embedded in the SSD framework, which is thus rotation-invariant due to its ability to estimate the orientation angles of objects. Tang et al. [23] used a regression-based object detector to detect vehicle targets, which follows a similar idea to SSD. Specifically, a set of default boxes with different scales per feature map location are employed to generate the detection bounding boxes. Furthermore, the offsets are predicted to better fit the object shape for each default box. Liu et al. [24] designed a framework for the detection of arbitrarily oriented ships. By using the YOLOv2 architecture as the underlying network, the model can directly predict rotated bounding boxes. Zhong et al. [25] propose a cascaded detection model combining two independent convolutional neural networks with different functionalities to improve the detection accuracy. Xu et al. [26] attributed the poor detection performance of objects with large aspect ratios and different scales to the problem of feature misalignment and proposed a feature alignment method for detection.
Although most of the existing deep-learning based methods have demonstrated considerable success on the task of object detection in remote sensing, they have been transferred from the methods designed for natural scene images. Indeed, remote sensing images differ significantly from natural scene images, in particular regarding rotation, scaling, and complex and cluttered backgrounds. Although existing methods have partially addressed these problems by introducing prior knowledge or designing proprietary models, the task of object detection in remote sensing images remains an important question that deserves further research.
The above methods improve the performance of remote sensing object detection, mostly by adding new modules that specifically focus on the characteristics of remote sensing images. However, more modules often increase the complexity of the models, which can affect the training speed of the model.

2.2. Deep Neural Network Training Acceleration

Common methods for deep neural network training acceleration typically focus on model design. Compression of parameters and model structures is used to reduce the training time of deep neural networks.

2.2.1. Compression of Parameters

Deep neural networks typically have a large number of parameters and always perform computations in 32-bit floating-point numbers, which is the main computational cost during the training process.
Parameter pruning involves evaluating the model parameters or parameter combinations and removing parameters that contribute little to the training process [27,28]. It can also prune connections between layers in the deep neural network [29].
Parameter quantization targets the storage of parameters by replacing 32-bit floating-point numbers with 16-bit or 8-bit floating-point numbers. In appropriate cases, binary or ternary quantization can even be applied [30,31], significantly reducing the storage space and memory usage of the parameters.
Low-rank decomposition decomposes the convolutional kernel matrix by merging dimensions and imposing low-rank constraints. It utilizes a small number of basis vectors to reconstruct the convolutional kernel matrix [32], thereby reducing storage and computational requirements.
Parameter sharing is similar to parameter pruning and takes advantage of the redundancy in model parameters. It develops a method to map all parameters to a small amount of data and performs computations using this limited data.

2.2.2. Compression of Model Structures

There are two main categories of methods for compressing the structure of deep neural networks.
The first method is lightweight model design, which involves directly redesigning components of the deep neural network to optimize its structure. Some classic examples include SqueezeNet [33], which uses smaller convolutional kernels; MobileNet [34], which splits common convolutions into depth-wise convolutions and point-wise convolutions to reduce the number of multiplications; and ShuffleNet [35], which utilizes point-wise group convolution and channel shuffle.
The other method is knowledge distillation, which transfers knowledge from a pre-trained large teacher model to a smaller student model. This allows the student model to achieve a performance similar to that of the teacher model while maintaining a smaller size. Knowledge distillation methods are typically categorized into response-based distillation [36], feature-based distillation [37], and relation-based distillation [38].
All of the above methods are effective in improving the training speed of deep neural networks. However, these methods always face the challenge of design complexity. Additionally, training acceleration methods specifically designed for model architecture often require tailored optimizations, resulting in limited generalizability. Therefore, this paper aims to explore a training acceleration method that can be applied to different models by focusing on training strategies.

2.3. Similarity Measure between Deep Neural Network Representations

Although deep learning has made significant progress in many fields, there has been a lack of in-depth research on how to describe and understand the representations learned by deep neural networks during the training process. To this end, Raghu et al. [39] proposed the Singular Vector Canonical Correlation Analysis (SVCCA) method. It measures the similarity between two intermediate layer activations in two deep neural networks by computing their linear correlation, allowing for the observation of the representations learned by deep neural network models.
Furthermore, Kornblith et al. [40] introduced the Centered Kernel Alignment (CKA) method for measuring the similarity between deep neural networks. This method calculates the alignment of kernel matrices computed from the activations of the intermediate layers in two deep neural networks. It captures the structural and topological information of the deep neural networks and effectively evaluates the representational capacity of models.
With these methods, it is possible to observe and analyze the training process of deep neural networks.

3. Pre-Experiment: Observation of Training Process

To directly represent the convergence process of deep neural networks, we utilize the CKA method to compare the similarity of models at different stages of the training process. Firstly, the model is loaded with two different sets of weights and fed the same samples to obtain two representations, X and Y, from the corresponding layers. Then, we calculate their Gram matrices as follows:
$$M = X X^{\top}, \qquad N = Y Y^{\top}. \tag{1}$$
A centering matrix H is constructed to calculate the Hilbert–Schmidt Independence Criterion (HSIC):
$$H = I_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^{\top}, \tag{2}$$
$$\mathrm{HSIC}(M, N) = \frac{1}{(n-1)^2}\,\mathrm{tr}(MHNH). \tag{3}$$
Finally, normalization is performed to obtain the CKA score:
$$\mathrm{CKA}(M, N) = \frac{\mathrm{HSIC}(M, N)}{\sqrt{\mathrm{HSIC}(M, M)\,\mathrm{HSIC}(N, N)}}. \tag{4}$$
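As a concrete illustration, the linear CKA score in Equations (1)–(4) can be computed in a few lines of PyTorch. The following is a minimal sketch; the function name and the assumption that activations are flattened into matrices of shape (samples, features) are ours, not part of the original implementation.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between activation matrices X and Y of shape (n_samples, features)."""
    n = X.shape[0]
    # Gram matrices, Equation (1)
    M = X @ X.T
    N = Y @ Y.T
    # Centering matrix H = I_n - (1/n) 1 1^T, Equation (2)
    H = torch.eye(n, dtype=X.dtype, device=X.device) \
        - torch.ones(n, n, dtype=X.dtype, device=X.device) / n

    def hsic(A: torch.Tensor, B: torch.Tensor) -> torch.Tensor:
        # Hilbert-Schmidt Independence Criterion, Equation (3)
        return torch.trace(A @ H @ B @ H) / (n - 1) ** 2

    # Normalized alignment, Equation (4)
    return hsic(M, N) / torch.sqrt(hsic(M, M) * hsic(N, N))
```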
First, we choose the image branch of VSE++, a typical multimodal retrieval model. The weights of the model are stored every few epochs. After training is complete, CKA is used to calculate the similarity between the intermediate weights and the weights obtained after complete training. The results are shown in Figure 1. From the graph, it can be observed that, as training progresses, the parameters of the shallow layers tend to converge earlier than those of the deep layers. However, the convergence process does not strictly follow a linear relationship with the training progress, making quantitative analysis difficult to perform.
Another experiment is conducted with the classic object detection model Faster R-CNN. As shown in Figure 2, the model converges faster because it loads a pre-trained ResNet-50 backbone. The same pattern is observed in this experiment.
We also directly compare the weights during training and the weight obtained after complete training. We save the weights after each epoch and calculate the difference between each weight and the weight of the last epoch by subtracting them and calculating the norm of the results. Combining the results in Figure 3 and the CKA Similarity score, it can be seen that the deeper the layers of the model in which the parameters are located, the greater the difference with the parameters after complete training, i.e., more adequate training is needed to obtain better results. On the other hand, the parameters at the shallow layers change less during the training process, which means that they have less impact on the model performance in the later stages of training.
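A minimal sketch of this weight-difference comparison is given below; the checkpoint paths and the number of epochs are hypothetical placeholders rather than the settings used in the paper.

```python
import torch

num_epochs = 30  # placeholder for the actual training length
final_state = torch.load(f"checkpoints/epoch_{num_epochs}.pth", map_location="cpu")

for epoch in range(1, num_epochs + 1):
    state = torch.load(f"checkpoints/epoch_{epoch}.pth", map_location="cpu")
    # L2 norm of the difference from the fully trained weights, per parameter tensor
    diff = {name: torch.norm(state[name].float() - final_state[name].float()).item()
            for name in final_state}
    print(epoch, sum(diff.values()))
```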

4. Parameter Freezing Algorithm

From the conclusion in Section 3, it is clear that the parameters of deep neural networks follow an order of convergence from shallow to deep during training. Our goal is to freeze the parameters based on the order of convergence so that we can accelerate the training process with as little loss of performance as possible.

4.1. Linear Freezing Algorithm

The training of deep neural networks can easily be divided into two main processes. The first process is the forward propagation phase, which is performed from the training data as input to the resultant output. Using the designed deep neural network, features are extracted from a batch of labeled samples through operations such as convolution, pooling, and full connectivity; then, the extracted features are used to compute and obtain the output of the network. What we are interested in is the backpropagation stage. Backpropagation is a process performed in the opposite direction to forward propagation. The purpose of training is to optimize the model performance. Thus, in order to make the error between the prediction value and the actual labeled value as small as possible, the loss function is calculated based on the comparison error between the prediction value and the ground truth; then, the gradient of the parameters is calculated according to the loss function. When calculating the gradient of the parameters, the value of the intermediate result of the corresponding layer’s forward propagation needs to be used. This stage usually takes more time than the forward propagation.
Therefore, according to the parameter convergence trend of the deep neural networks, after a certain amount of training, we freeze the parameters of the shallow layers so that they are no longer involved in the backpropagation process in training, thus saving this part of the computational overhead and speeding up the training. As shown in Figure 4, the Linear Freezing Algorithm (LFA) freezes a fixed number of blocks after every several epochs.
If necessary, the range of each block and when to freeze can be flexibly defined so that common deep neural networks can use this approach.
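A minimal PyTorch sketch of the LFA is shown below, assuming the backbone is an ordered list of blocks from shallow to deep; the scheduling hyper-parameters here are illustrative, not the exact values used in our experiments.

```python
import torch.nn as nn

def linear_freeze(blocks: nn.ModuleList, epoch: int,
                  freeze_every: int = 5, blocks_per_step: int = 1) -> int:
    """Freeze a fixed number of blocks every `freeze_every` epochs, shallowest first."""
    n_frozen = min(len(blocks), (epoch // freeze_every) * blocks_per_step)
    for block in list(blocks)[:n_frozen]:
        for p in block.parameters():
            p.requires_grad = False  # excluded from backpropagation
    return n_frozen
```

In practice, the optimizer's parameter groups would also be filtered (or the optimizer rebuilt) so that frozen parameters no longer incur gradient and update costs.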

4.2. Adaptive Freezing Algorithm

In the process of backpropagation, by calculating the gradient of the loss function with respect to the model parameters, it is possible to understand the direction and rate of change of the model at the current parameter values. Based on the information provided by the gradient, parameter updates are made to the model to improve its performance. Specifically, a larger gradient indicates that the loss function changes more drastically at this position, implying that there is more space for improvement with the parameter involved; a smaller gradient indicates that the loss function changes less and the learning of this part is close to saturation, with less benefit from further training.
Therefore, we propose an adaptive method to judge the progress of parameter freezing by comparing the gradients of parameters at different layers, aiming to further accelerate the training of deep neural networks.
After a certain amount of training, the number of frozen layers at timestep T is decided as follows:
$$N_f(T) = \operatorname*{arg\,min}_{N_f(T-1) \le n \le N} \lVert g_n(T) \rVert_F, \quad T \ge 1, \quad N_f(0) = 0. \tag{5}$$
g_n(T) is the gradient of layer n at timestep T; the Frobenius norms of the gradients, ||g_n(T)||_F, are gathered and compared. To avoid the effect of random initialization and errors, the upper limit of the number of frozen layers at timestep T is set as follows:
$$N_{\max}(T) = kN + (1 - k)\,N_f(T-1), \quad 0 < k \le 1, \tag{6}$$
where N is the total number of layers, and hyper-parameter k controls the freezing speed during training.
With Equation (5), the model can judge the number of frozen layers at certain timesteps in the training process. Figure 5 shows that the Adaptive Freezing Algorithm (AFA) can freeze the model at a much faster pace, leading to a better effect of acceleration. The pseudo code is shown in Algorithm 1.
Algorithm 1 Adaptive Freezing Algorithm (AFA)
Input: total number of layers N, number of frozen layers at the previous timestep N_f(T−1), timestep T, the Frobenius norms of the gradients ||g_n(T)||_F, the upper limit of frozen layers N_max(T)
Output: number of frozen layers N_f(T)
1: T ← 0
2: while training has not finished do
3:   at the end of each epoch: T ← T + 1
4:   N_f(T) ← argmin_{N_f(T−1) ≤ n ≤ N} ||g_n(T)||_F
5:   if N_f(T) > N_max(T) then
6:     N_f(T) ← N_max(T)
7:   end if
8:   for layer index n = N_f(T−1) to N_f(T) do
9:     freeze layer n
10:  end for
11: end while
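To complement Algorithm 1, the following is a minimal PyTorch sketch of one AFA step following Equations (5) and (6). It assumes `layers` is an ordered list of blocks whose parameters still hold gradients from the last backward pass; the helper itself is our illustration rather than the authors' released code.

```python
import torch

def adaptive_freeze_step(layers: list, prev_frozen: int, k: float = 0.3) -> int:
    """One AFA update: compute the new freezing boundary N_f(T) and freeze up to it."""
    N = len(layers)
    # Frobenius norm of the gradient of each still-trainable layer
    norms = []
    for n in range(prev_frozen, N):
        grads = [p.grad.flatten() for p in layers[n].parameters() if p.grad is not None]
        if grads:
            norms.append((torch.linalg.norm(torch.cat(grads)).item(), n))
    if not norms:
        return prev_frozen
    # Equation (5): the unfrozen layer with the smallest gradient norm sets the boundary
    n_f = min(norms)[1]
    # Equation (6): cap the freezing speed with hyper-parameter k
    n_max = int(k * N + (1 - k) * prev_frozen)
    n_f = min(n_f, n_max)
    # Freeze every layer up to the new boundary
    for layer in layers[:n_f]:
        for p in layer.parameters():
            p.requires_grad = False
    return n_f
```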

5. Experiments

5.1. Setup

5.1.1. Experimental Environment

The experiments were carried out in a Linux environment using the Ubuntu 20.04 operating system. The experimental device has an NVIDIA Tesla V100 GPU with 32 GB of memory; Python 3.8.0, PyTorch 1.8.0, and CUDA 11.1 with cuDNN 8 were used for the experiments.

5.1.2. Model

We selected ViTDet [41] as the baseline model in our experiments. ViTDet utilizes the Vision Transformer as its backbone, which has a larger number of parameters and therefore better showcases the acceleration effect of the freezing algorithms. We use ViT-B with 12 encoders as the backbone and define each encoder as one block.

5.1.3. Datasets

From the wide variety of datasets for remote sensing object detection, three datasets of different sizes were used. DIOR has 23,463 remote sensing images of 800 × 800 resolution with 20 categories, including airplane, airport, ship, etc. SIMD contains 15 categories, most of which are different kinds of cars; it has 5000 images selected from Google Earth. RSOD is a small dataset with only 976 images and 4 categories: aircraft, oil tank, overpass, and playground. More details of these datasets can be seen in Table 1.

5.1.4. Evaluation Metrics

When calculating precision and recall metrics, the results of the model outputs are categorized into four groups based on the ground-truth labels: true positives (TPs), true negatives (TNs), false positives (FPs), and false negatives (FNs).
Precision is a statistical measure that evaluates the model’s ability to classify objects. It represents the ratio of correctly predicted instances to all predicted instances in the detection results; it is calculated as follows:
$$\mathrm{precision} = \frac{TP}{TP + FP}. \tag{7}$$
Recall is a performance metric that measures the ability of a model to correctly identify all positive instances. Recall is defined as the proportion of true positives correctly predicted by the model out of all the actual positive instances. It can be calculated as follows:
$$\mathrm{recall} = \frac{TP}{TP + FN}. \tag{8}$$
The value of recall ranges from 0 to 1, with higher values indicating a better performance. A recall value closer to 1 means that the model can more accurately identify positive instances and has a lower rate of missing positive samples.
Average Precision (AP) summarizes the precision values along the precision–recall (PR) curve. It is computed as the area under the PR curve, as follows:
$$AP = \int_{0}^{1} p(r)\,\mathrm{d}r. \tag{9}$$
The mean Average Precision (mAP), as the average of the Average Precision (AP) values, is the most commonly used evaluation statistic in object detection. The value of mAP represents the model’s overall performance in all the categories.
As for the effect of acceleration, we compare the training time with and without parameter freezing. With T_0 as the training time without parameter freezing and T_f as the training time with freezing, the speedup is calculated as follows:
$$\mathrm{speedup} = \frac{T_0}{T_f}. \tag{10}$$
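For reference, a minimal sketch of Equations (7)–(10) is given below; the AP computation uses a simple rectangular approximation of the area under the PR curve rather than any specific benchmark's interpolation rule, and the function names are our own.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple:
    """Equations (7) and (8): precision and recall from detection counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

def average_precision(precisions, recalls) -> float:
    """Equation (9): area under the precision-recall curve (rectangular approximation)."""
    ap, prev_r = 0.0, 0.0
    for p, r in sorted(zip(precisions, recalls), key=lambda pr: pr[1]):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def speedup(t_without: float, t_with: float) -> float:
    """Equation (10): training time without freezing divided by time with freezing."""
    return t_without / t_with
```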

5.2. Results

We trained the model on each of the three datasets for a maximum of 65 epochs and froze part of the model with the Linear Freezing Algorithm (LFA) and the Adaptive Freezing Algorithm (AFA) every 5 epochs. The hyper-parameter k in Equation (6) was set to 0.3, because higher values lead to serious performance degradation. To make the results more reliable, each experiment was repeated five times, and the average was taken as the final result.
The trend of time consumption for each epoch is shown in Figure 6. In the early stages of training, the AFA freezes blocks more aggressively than the LFA, so it finishes freezing all the blocks earlier.
Table 2, Table 3 and Table 4 show all the results of our experiments. For the same freezing algorithm, the difference in acceleration ratios across datasets is small. Overall, the LFA saves 19.4% of the training time, while the AFA saves 28.6%. As for the detection performance, the training on DIOR is influenced to some extent. We believe that the model needs more training on larger datasets, so the freezing operation in the early stages of the training process may lead to inadequate training on them. Conversely, training without a freezing algorithm on small datasets such as RSOD probably suffers from overfitting at the shallow layers of the model; thus, the freezing algorithm may even improve the mAP. The same pattern appears in the effect of the freezing algorithms on the recall rate.
The confusion matrices of the results on the DIOR dataset are shown in Figure 7. The values on the diagonal are the precision, the last column is the false negative rate (FNR), and the remaining entries are the false positive rates (FPRs). As indicated by the color blocks, the precision, FPR, and FNR are similar across the three panels of Figure 7, which means the parameter freezing algorithms do not interfere with the per-category training behavior.
We compare our parameter freezing algorithms with state-of-the-art object detection approaches on all three datasets. From Table 5, Table 6 and Table 7, we can see that although the freezing algorithms may lead to some degradation in performance, they still achieve competitive or better results.
Figure 8, Figure 9 and Figure 10 show the visual detection results with different training strategies. As expected, the freezing algorithms have little effect on the model performance; the detection results in Figure 9 and Figure 10 are at the same level as those in Figure 8.

6. Limitations and Further Work

Gaining a deeper understanding of the intricate convergence process during the training of deep neural networks is crucial for optimizing freezing algorithms, especially for quantitative analysis. Our understanding of the training dynamics of these networks is still fragmented, which limits the effectiveness and efficiency of parameter freezing methods.
As we strive to improve the overall performance of deep learning models, a more refined analysis of the convergence process becomes paramount. This would not only allow us to comprehend the behavior of the networks more accurately, but also help us to identify potential bottlenecks or areas for improvement.
In our future work, we aim to explore innovative ways to dissect and analyze the training process of deep neural networks. Through a combination of novel techniques and rigorous experimentation, we hope to gain a better understanding of the fundamental principles that govern the behavior of these networks. A comprehensive analysis of the convergence process in deep neural networks holds immense potential for advancing the field of deep learning. By improving our understanding of this process, we can develop more efficient and effective freezing algorithms, paving the way for faster and more accurate training of deep neural networks.

7. Conclusions

This paper presented a training strategy based on parameter freezing for accelerating the training of deep neural networks. By observing the convergence trend during the training of deep neural networks, we freeze part of the parameters so that they are no longer involved in subsequent training, reducing the time cost of training. The information reflected by the gradients of the parameters plays a significant role in determining the speed of parameter freezing. Through various experiments, the effectiveness of the parameter freezing algorithms has been demonstrated. The results consistently showed that the freezing algorithms can save up to 28.6% of the training time with little effect on model performance.

Author Contributions

Conceptualization, H.T., J.C. and W.Z.; investigation and analysis, H.T. and W.Z.; resources, W.Z. and Z.G.; software, H.T.; validation, H.T. and J.C.; visualization, H.T.; writing—original draft preparation, H.T. and J.C.; writing—review and editing, W.Z. and Z.G.; supervision, J.C. and Z.G. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

In this paper, the DIOR dataset was downloaded from Google Drive (https://drive.google.com/drive/folders/1UdlgHk49iu6WpcJ5467iT-UqNPpx__CC, accessed on 29 March 2024), the SIMD dataset was downloaded from Github (https://github.com/ihians/simd, accessed on 29 March 2024), and the RSOD dataset was downloaded from Github (https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset-, accessed on 29 March 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sun, X.; Wang, P.; Yan, Z.; Xu, F.; Wang, R.; Diao, W.; Chen, J.; Li, J.; Feng, Y.; Xu, T.; et al. FAIR1M: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2022, 184, 116–130. [Google Scholar] [CrossRef]
  2. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object detection in 20 years: A survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  3. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  4. Girshick, R. Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  5. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149. [Google Scholar] [CrossRef]
  6. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  7. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision—ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Part I; pp. 21–37. [Google Scholar]
  8. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  9. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 2020, 33, 1877–1901. [Google Scholar]
  10. Dehghani, M.; Djolonga, J.; Mustafa, B.; Padlewski, P.; Heek, J.; Gilmer, J.; Steiner, A.; Caron, M.; Geirhos, R.; Alabdulmohsin, I.; et al. Scaling vision transformers to 22 billion parameters. arXiv 2023, arXiv:2302.05442. [Google Scholar]
  11. Zou, Z.; Shi, Z. Random access memories: A new paradigm for target detection in high resolution aerial remote sensing images. IEEE Trans. Image Process. 2017, 27, 1100–1111. [Google Scholar] [CrossRef]
  12. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 3974–3983. [Google Scholar]
  13. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  14. Cheng, G.; Zhou, P.; Han, J. Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  15. Cheng, G.; Han, J.; Zhou, P.; Xu, D. Learning rotation-invariant and fisher discriminative convolutional neural networks for object detection. IEEE Trans. Image Process. 2018, 28, 265–278. [Google Scholar] [CrossRef]
  16. Long, Y.; Gong, Y.; Xiao, Z.; Liu, Q. Accurate object localization in remote sensing images based on convolutional neural networks. IEEE Trans. Geosci. Remote Sens. 2017, 55, 2486–2498. [Google Scholar] [CrossRef]
  17. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-insensitive and context-augmented object detection in remote sensing images. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2337–2348. [Google Scholar] [CrossRef]
  18. Xu, Z.; Xu, X.; Wang, L.; Yang, R.; Pu, F. Deformable convnet with aspect ratio constrained nms for object detection in remote sensing imagery. Remote Sens. 2017, 9, 1312. [Google Scholar] [CrossRef]
  19. Zhong, Y.; Han, X.; Zhang, L. Multi-class geospatial object detection based on a position-sensitive balancing framework for high spatial resolution remote sensing imagery. ISPRS J. Photogramm. Remote Sens. 2018, 138, 281–294. [Google Scholar] [CrossRef]
  20. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards balanced learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 821–830. [Google Scholar]
  21. Qin, R.; Liu, Q.; Gao, G.; Huang, D.; Wang, Y. MRDet: A multihead network for accurate rotated object detection in aerial images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–12. [Google Scholar] [CrossRef]
  22. Liu, L.; Pan, Z.; Lei, B. Learning a rotation invariant detector with rotatable bounding box. arXiv 2017, arXiv:1711.09405. [Google Scholar]
  23. Tang, T.; Zhou, S.; Deng, Z.; Lei, L.; Zou, H. Arbitrary-oriented vehicle detection in aerial imagery with single convolutional neural networks. Remote Sens. 2017, 9, 1170. [Google Scholar] [CrossRef]
  24. Liu, W.; Ma, L.; Chen, H. Arbitrary-oriented ship detection framework in optical remote-sensing images. IEEE Geosci. Remote Sens. Lett. 2018, 15, 937–941. [Google Scholar] [CrossRef]
  25. Zhong, J.; Lei, T.; Yao, G. Robust vehicle detection in aerial images based on cascaded convolutional neural networks. Sensors 2017, 17, 2720. [Google Scholar] [CrossRef] [PubMed]
  26. Xu, T.; Sun, X.; Diao, W.; Zhao, L.; Fu, K.; Wang, H. ASSD: Feature aligned single-shot detection for multiscale objects in aerial imagery. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5607117. [Google Scholar] [CrossRef]
  27. LeCun, Y.; Denker, J.; Solla, S. Optimal brain damage. Adv. Neural Inf. Process. Syst. 1989, 2, 598–605. [Google Scholar]
  28. Wen, W.; Wu, C.; Wang, Y.; Chen, Y.; Li, H. Learning structured sparsity in deep neural networks. Adv. Neural Inf. Process. Syst. 2016, 29, 2082–2090. [Google Scholar]
  29. Han, S.; Pool, J.; Tran, J.; Dally, W.J. Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 2015, 28, 1135–1143. [Google Scholar]
  30. Courbariaux, M.; Bengio, Y.; David, J.P. Binaryconnect: Training deep neural networks with binary weights during propagations. Adv. Neural Inf. Process. Syst. 2015, 28, 3123–3131. [Google Scholar]
  31. Li, F.; Liu, B.; Wang, X.; Zhang, B.; Yan, J. Ternary weight networks. arXiv 2016, arXiv:1605.04711. [Google Scholar]
  32. Liu, B.; Wang, M.; Foroosh, H.; Tappen, M.; Penksy, M. Sparse convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 806–814. [Google Scholar]
  33. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  34. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  35. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6848–6856. [Google Scholar]
  36. Buciluǎ, C.; Caruana, R.; Niculescu-Mizil, A. Model compression. In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, 20–23 August 2006; pp. 535–541. [Google Scholar]
  37. Romero, A.; Ballas, N.; Kahou, S.E.; Chassang, A.; Gatta, C.; Bengio, Y. FitNets: Hints for thin deep nets. Proc. ICLR 2015, 2, 1. [Google Scholar]
  38. Yim, J.; Joo, D.; Bae, J.; Kim, J. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4133–4141. [Google Scholar]
  39. Raghu, M.; Gilmer, J.; Yosinski, J.; Sohl-Dickstein, J. SVCCA: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Adv. Neural Inf. Process. Syst. 2017, 30, 6078–6087. [Google Scholar]
  40. Kornblith, S.; Norouzi, M.; Lee, H.; Hinton, G. Similarity of neural network representations revisited. In Proceedings of the International Conference on Machine Learning (PMLR), Long Beach, CA, USA, 9–15 June 2019; pp. 3519–3529. [Google Scholar]
  41. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel-Aviv, Israel, 23–27 October 2022; pp. 280–296. [Google Scholar]
Figure 1. Convergence trends of VSE++ image branch.
Figure 2. Convergence trends of Faster R-CNN with pretrained model of Resnet50.
Figure 3. The difference between parameters during training and after complete training.
Figure 4. Flow chart of the Linear Freezing Algorithm (LFA).
Figure 5. Flow chart of the Adaptive Freezing Algorithm (AFA).
Figure 6. Time cost of one epoch each time the freezing algorithm is applied; the dotted line shows the time cost of one epoch without the freezing algorithm: (a) the Linear Freezing Algorithm (LFA); (b) the Adaptive Freezing Algorithm (AFA).
Figure 7. Confusion Matrix of the results on DIOR: (a) ViTDet; (b) LFA; (c) AFA.
Figure 8. Visual detection results of ViTDet: (a) aircraft; (b) oil tank; (c) playground.
Figure 9. Visual detection results of ViTDet with the Linear Freezing Algorithm (LFA): (a) aircraft; (b) oil tank; (c) playground.
Figure 10. Visual detection results of ViTDet with the Adaptive Freezing Algorithm (AFA): (a) aircraft; (b) oil tank; (c) playground.
Table 1. Details of remote sensing object detection datasets.

Dataset | Categories | Images | Instances | Image Width
DIOR | 20 | 23,463 | 192,472 | 800
SIMD | 15 | 5000 | 45,096 | 1024
RSOD | 4 | 976 | 6950 | ~1000
Table 2. Experimental results on DIOR.

Freezing Algorithm | Time Cost at Epoch 1 (s) | Time Cost at Epoch 31 (s) | Time Cost at Epoch 61 (s) | Average Time per Epoch (s) | Total Training Time (s) | AR@10 | mAP | Speedup
Without | 6904.83 | 6959.95 | 6642.45 | 6895.93 | 448,235.56 | 0.50 | 78.8 | –
LFA | 6875.81 | 5527.05 | 4075.56 | 5554.59 | 361,048.18 | 0.49 | 76.7 | 1.24×
AFA | 6891.45 | 4763.51 | 4074.37 | 5010.23 | 325,664.90 | 0.48 | 75.9 | 1.38×
Table 3. Experimental results on SIMD.

Freezing Algorithm | Time Cost at Epoch 1 (s) | Time Cost at Epoch 31 (s) | Time Cost at Epoch 61 (s) | Average Time per Epoch (s) | Total Training Time (s) | AR@10 | mAP | Speedup
Without | 2369.53 | 2405.23 | 2313.78 | 2381.71 | 154,811.11 | 0.60 | 87.1 | –
LFA | 2370.68 | 1898.20 | 1380.30 | 1909.06 | 124,088.89 | 0.61 | 88.0 | 1.25×
AFA | 2369.44 | 1637.21 | 1381.68 | 1712.61 | 111,319.42 | 0.60 | 87.8 | 1.39×
Table 4. Experimental results on RSOD.

Freezing Algorithm | Time Cost at Epoch 1 (s) | Time Cost at Epoch 31 (s) | Time Cost at Epoch 61 (s) | Average Time per Epoch (s) | Total Training Time (s) | AR@10 | mAP | Speedup
Without | 443.98 | 451.22 | 435.28 | 444.29 | 28,878.98 | 0.53 | 90.5 | –
LFA | 446.96 | 360.38 | 252.15 | 357.72 | 23,251.98 | 0.54 | 92.1 | 1.24×
AFA | 447.04 | 302.75 | 250.98 | 316.87 | 20,596.85 | 0.54 | 91.9 | 1.40×
Table 5. The mAP of different methods on the DIOR dataset.

Method | mAP
Eff-Det | 66.1
RSADet | 72.2
R2IPoints | 74.6
SFSANet | 76.6
ViTDet | 78.8
ViTDet with LFA (ours) | 76.7
ViTDet with AFA (ours) | 75.9
Table 6. The mAP of different methods on the SIMD dataset.

Method | mAP
Faster R-CNN | 70.8
YOLOX-s | 77.4
YOLOv7-tiny | 82.2
MAY | 78.2
ViTDet | 87.1
ViTDet with LFA (ours) | 88.0
ViTDet with AFA (ours) | 87.8
Table 7. The mAP of different methods on the RSOD dataset.

Method | mAP
CFA-Net | 72.8
RoI-Trans | 81.8
YOLOv7 | 84.4
URSNet | 87.2
ViTDet | 90.5
ViTDet with LFA (ours) | 92.1
ViTDet with AFA (ours) | 91.9