1. Introduction
Hyperspectral images (HSIs) have become an important tool for resource exploration and environmental monitoring because they contain numerous spectral bands and rich spatial information. Convolutional neural networks (CNNs) [1,2,3,4] have been used to extract [5] and classify HSI features, greatly improving classification performance. Deep network methods have therefore been widely applied in HSI classification.
However, the powerful feature representation ability of CNNs relies on complex model structures and large numbers of parameters. As remote sensing technology develops, image resolution improves and image sizes grow, which significantly increases computational and storage requirements [6,7]. This hinders the deployment of networks on satellites, aircraft, and other mobile platforms, greatly reducing the practical efficiency of remote sensing image analysis. Reducing the complexity of deep network models is therefore an enduring problem for deployment on resource-limited devices [8]. Neural network model compression can be used to solve this problem.
Neural network pruning is regarded as a simple yet efficient technique to compress models while maintaining their performance [9], which makes it possible to deploy lightweight remote sensing analysis models on hardware. Generally speaking, network pruning methods can be classified as manual or automatic. In traditional manual methods, pruning rules and the selection of solutions are designed by domain experts. LeCun [10] first proposed optimal brain damage (OBD), which removes low-value parameters by computing and sorting the second derivatives of the parameters. Han et al. [11] used an iterative pruning method to prune weights smaller than a manually preset per-layer threshold. Lee et al. [12] proposed an importance score for global pruning; the score is a rescaling of weight magnitude that incorporates the model-level distortion incurred by pruning, and it requires no hyperparameter tuning. Recent advances in neural tangent kernel (NTK) theory suggest that the training dynamics of sufficiently large neural networks are closely related to the spectrum of the NTK. Motivated by this finding, Wang et al. [13] pruned the connections with the least influence on the NTK spectrum. Pruning methods have also been applied to remote sensing images. Qi et al. [14] used the original network as a teacher model and guided the pruning of the model through the loss. Wang et al. [15] pruned according to the scaling factors of the BatchNorm layers. Guo et al. [16] designed a sensitivity function to evaluate the pruning effect of the channels in each layer and adaptively corrected the pruning rate of each layer. It is important to note that the criteria of manual pruning methods are not uniform: the absolute values of the network weights, the activation values of the neurons, and so on. As a result, considerable time and labor are required to design and select appropriate pruning criteria for different networks. Furthermore, the sparse network obtained by manual pruning is generally not optimal due to the limited exploration space [17].
Different from traditional manual pruning methods, automatic pruning methods can reduce the design cost [18]. As one class of automatic methods, evolution-based pruning formulates network pruning as an optimization task and can find and retain better sparse network structures in a discrete space. Zhou et al. [19] pruned medical image segmentation CNNs by encoding filters and skipping some sensitive layers. By considering the sensitivity of each layer, our previous work proposed a differential evolutionary pruning method based on layer-wise weight pruning (DENNC) [20]. In addition, a multi-objective pruning method (MONNP) [21] was proposed, which balances network accuracy and network complexity at the same time. Furthermore, MONNP generates different sparse networks to meet various hardware constraints and requirements more efficiently. Zhou et al. [22] searched for sparse networks at the knee point of the Pareto-optimal front, where the networks trade off accuracy against sparsity. Zhao et al. [23] compressed the model with filter pruning and applied multi-objective optimization of CNN model compression to remote sensing images. Wei et al. [24] proposed a channel pruning method based on differentiable neural architecture search to automatically prune CNN models, measuring the importance of each channel with a trainable score. In conclusion, evolutionary pruning methods reduce the cost of manually designing pruning rules; however, the network structures designed for hyperspectral data are becoming more and more complex, which also poses difficulties for evolutionary pruning methods.
For tasks that are difficult to optimize, introducing additional knowledge to facilitate the search process of the target task is a feasible idea. Ma et al. [25] proposed the multi-task model ESMM, which contains a main task, post-click conversion rate (CVR) prediction, and an auxiliary task, post-view click-through conversion rate (CTCVR) prediction. The CTCVR task helps the learning of CVR and avoids problems such as over-fitting and poor generalization of CVR prediction due to small samples. Ruder [26] pointed out that in multi-task learning, the hints provided by additionally constructed tasks can promote the learning of the main task. Feng et al. [27] treated a random embedding space as an additional task for the target problem, ensuring the effectiveness of the search on the target problem by simultaneously optimizing the original task and the embedding task. Evolutionary multitasking can optimize multiple tasks simultaneously so that each task promotes the others. In evolutionary multi-task optimization, effective facilitation between tasks relies on task similarity.
In HSI classification, if different HSIs come from the same sensor, their spectral information has a similar physical meaning (radiance or reflectivity) [28,29], and the similarity between the two images is high. As shown in Figure 1, HSIs obtained by the same sensor have the same spectral range, and the comparison of the spectral curves of Indian Pines and Salinas reflects the similarity between the HSIs. If the ground features of different HSIs are close, there is an underlying similarity between them. When the same network is trained on similar data, the distributions of the network parameters are close, so there are also similarities between the structural sparsification tasks on the different datasets. When dealing with HSIs, deep neural networks mainly learn the spectral characteristics of the data through the convolutional layers, whose parameters realize feature extraction. Therefore, the structural information of the neural network can be regarded as transferable knowledge and used as prior knowledge for parallel tasks. In addition, the labels of hyperspectral data are limited, and CNNs need enough data to learn features, which affects the training process. When the distributions of network parameters are close, knowledge transfer can obtain useful representation information from another image to alleviate the problem of limited labeled samples.
In this paper, a network collaborative pruning method is proposed for HSI classification based on evolutionary multi-task optimization. The main contributions of this paper are as follows:
A multi-task pruning algorithm: by exploiting the similarity between HSIs, different HSI classification networks can be pruned simultaneously, and parallel optimization improves the optimization efficiency of each task. The pruned networks can be applied to the classification of HSIs with limited labeled samples.
Model pruning based on evolutionary multi-objective optimization: the potential excellent sparse networks are searched by an evolutionary algorithm. Multi-objective optimization optimizes the sparsity and accuracy of the networks at the same time, and can obtain a set of sparse networks to meet different requirements.
Self-adaptive knowledge transfer: the sparse network structure serves as the transferred knowledge, and knowledge transfer between tasks drives the search for and update of this knowledge. A self-adaptive knowledge transfer strategy, based on the historical information of the tasks and a dormancy mechanism, is proposed to effectively prevent negative transfer.
The rest of this paper is organized as follows.
Section 2 reviews the background. The motivation of the proposed method is also introduced.
Section 3 describes the model compression methods for HSI classification in detail.
Section 4 presents the experimental study.
Section 5 presents the conclusions of this paper.
3. Methodology
This section provides a comprehensive description of the proposed network collaborative pruning method for HSI classification. First, the overall framework of the method is introduced. Second, compression of the model is achieved by an evolutionary multi-task pruning algorithm; the algorithm is introduced, and the initialization of individuals and the population, the genetic operators, and the self-adaptive knowledge transfer strategy are described in detail. Finally, the complexity of the proposed method is analyzed.
3.1. The Framework of the Proposed Network Collaborative Pruning Method for HSI Classification
The overall framework of the proposed method is shown in
Figure 5. First, different optimization tasks are constructed for two similar HSIs, i.e., there is a similarity between the two sparsification tasks. The evolutionary algorithm is used to search the potential excellent sparse network structure on the respective HSI. Genetic operators are designed according to the representation of the network structure. In the process of the parallel optimization of two tasks, interaction between tasks is needed to transfer the local sparse network structure. At the same time, in order to avoid the possible negative transfer, the self-adaptive knowledge transfer strategy is used to control the interaction strength between tasks. After completing the pruning search in different tasks and fine-tuning on the respective HSI, a set of sparse networks is obtained.
3.2. Evolutionary Multi-Task Pruning Algorithm
3.2.1. Mathematical Models of Multi-Tasks
In the evolutionary pruning algorithm, models are built on different HSIs between which the similarity is high. The models of the multiple tasks are therefore given in (2).
where T_1 represents the classification and structure sparsification task on one HSI, whose search space is Ω; the optimization of this task is achieved by searching for the pruned network weights W_1*. Similarly, T_2 represents the classification and structure sparsification task on a different HSI; its search space is also Ω, and the pruned network weights obtained by searching are W_2*.
Each task is a multi-objective optimization model, which can be expressed by (3). Generally speaking, during the search, as the sparsity of the network increases, its accuracy drops; sparsity and accuracy are thus two conflicting objectives. One objective function, f_1, represents the accuracy of the neural network on the test dataset D, and the other objective function, f_2, represents the sparsity of the network, which can be represented by its pruning rate. Specifically, the sparsity can be expressed as the ratio of the number of zero elements to the total number of elements.
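As a concrete illustration, the sparsity objective can be computed as the fraction of zeroed weights over all layers. The following is a minimal NumPy sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def pruning_rate(weight_matrices):
    """Sparsity objective: fraction of pruned (zero) weights over all layers."""
    total = sum(w.size for w in weight_matrices)
    zeros = sum(int(np.count_nonzero(w == 0)) for w in weight_matrices)
    return zeros / total

# Toy example: two layers with some weights already pruned to zero.
w1 = np.array([[0.5, 0.0], [0.0, -0.3]])   # 2 of 4 weights pruned
w2 = np.array([0.0, 0.1, 0.2, 0.0, 0.0])   # 3 of 5 weights pruned
rate = pruning_rate([w1, w2])               # 5 zeros out of 9 weights
```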
3.2.2. Overall Framework of Proposed Evolutionary Multi-Task Pruning Algorithm
The evolutionary pruning algorithm is shown in Figure 6. One-dimensional vectors are designed for the different tasks to represent different pruning schemes, which can also be regarded as a set of sparse networks. Within each of the two optimization tasks, the network structure is optimized step by step, and knowledge transfer between the tasks further improves the optimization efficiency of both. After the evolution is completed, a set of network pruning schemes that balance accuracy and sparsity is obtained. The specific implementation of the evolutionary pruning algorithm based on multi-task parallel optimization is shown in Algorithm 1.
Algorithm 1 The proposed evolutionary multi-task pruning algorithm
Input: N_p: task population size; t: generation counter; P: parent population; rmp: random mating probability; G_max: maximum number of generations
Output: a set of trade-off sparse networks for multiple HSIs
1: Step (1) Train a state-of-the-art network N
2: Step (2) Construct task T_1 and task T_2
3: Step (3) Pruning
4: Set t = 0, then initialize the population P_t
5: while (t < G_max) do
6:   P_s ← Binary Tournament Selection(P_t)
7:   Generate offspring O_t → Refer to Algorithm 2
8:   R_t ← P_t ∪ O_t
9:   Update scalar fitness in R_t
10:  Select the fittest members from R_t to form P_{t+1} by NSGA-II
11:  Self-adaptively update rmp → Refer to Algorithm 3
12:  t ← t + 1
13: end while
14: Step (4) Fine-tune the optimized results of task T_1 and task T_2
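The NSGA-II-style selection step of the algorithm relies on non-dominated sorting of the population. As a hedged sketch of that building block (our own illustration, not the authors' code; both objectives are assumed to be minimized, e.g., classification error and remaining-parameter ratio):

```python
def dominates(a, b):
    """a dominates b if a is no worse in every objective and strictly
    better in at least one (all objectives minimized)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated_sort(points):
    """Fast non-dominated sorting: return Pareto fronts as index lists,
    best front first."""
    n = len(points)
    dominated = [set() for _ in range(n)]  # indices each point dominates
    count = [0] * n                        # how many points dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i != j and dominates(points[i], points[j]):
                dominated[i].add(j)
            elif i != j and dominates(points[j], points[i]):
                count[i] += 1
        if count[i] == 0:
            fronts[0].append(i)
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominated[i]:
                count[j] -= 1
                if count[j] == 0:
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]

# Four candidate networks as (error, parameter-ratio) pairs.
fronts = non_dominated_sort([(1, 1), (2, 2), (0.5, 3), (3, 0.5)])
```

Here (2, 2) is dominated by (1, 1) and lands on the second front; the other three points are mutually non-dominated.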
3.2.3. Representation and Initialization
In this paper, we adopt a one-dimensional vector to represent a layer-by-layer differentiated pruning scheme, which also represents a unique sparse network. This reflects the sensitivity differences of the layers of the neural network more comprehensively, enabling more refined and differentiated pruning. The encoding extends well to a variety of networks: only the depth of the network is needed to define the encoding and perform pruning. On the other hand, one-dimensional vector encoding makes the design of genetic operators more convenient. Each element p_i of the vector represents the weight pruning ratio of one layer of the network, i.e., the proportion of zero elements in that layer's weight matrix. Thus, the encoding value of layer i can be represented as p_i = 1 − n_i^{nz}/n_i, where, similar to (3), n_i^{nz} represents the number of nonzero elements in layer i and n_i represents the total number of elements in this layer. In the pruning process, the weights of layer i are sorted by magnitude from small to large, and the smallest fraction p_i given by the i-th element of the one-dimensional vector is pruned. The upper and lower bounds of p_i are 1 and 0, respectively. In this way, the network weights are pruned layer by layer, and the sparse network structure corresponding to the one-dimensional vector is finally obtained. The search process tries to approach the true Pareto-optimal front. The decoding operation is the reverse of the encoding operation.
Specifically, as shown in Figure 7, consider a pruning scheme whose i-th element is a and whose j-th element is b. First, the weights of layers i and j are sorted from small to large. Suppose a fraction a of the weights in the i-th convolutional layer is pruned; the total number of parameters n_i of this layer is k_h × k_w × f, where k_h is the height of the convolution kernel, k_w is its width, and f is the number of convolution filters in this layer. Suppose a fraction b of the weights in the j-th fully connected layer is pruned; its total number of parameters n_j is the product of the number of input neurons and the number of output neurons. After determining the parameters to prune, the corresponding positions are set to zero to indicate that those parameters are pruned.
According to the depth L of the network and the population size N_p, N_p one-dimensional vectors of length L are randomly generated to form the initial population of a task. This represents N_p pruning schemes, which can also be regarded as N_p different sparse networks. The population is initialized in the same way for each task.
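Population initialization is then a matter of sampling random vectors in [0, 1)^L, one per individual; a minimal sketch (names and sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def init_population(pop_size, depth):
    """One individual = one pruning scheme: a length-L vector of
    per-layer pruning rates drawn uniformly from [0, 1)."""
    return rng.random((pop_size, depth))

# e.g., 20 pruning schemes for a 10-layer network
pop = init_population(20, 10)
```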
3.2.4. Genetic Operator
The genetic operators used in the proposed algorithm include crossover and mutation operators. When two individuals cross over, their skill factors must be compared, similar to MFEA [45]. If two randomly selected parent pruning schemes have the same skill factor, they come from the same task and cross over directly. Otherwise, they come from different tasks, and the random mating probability rmp determines whether knowledge transfer between tasks is carried out. After the crossover operation, each individual undergoes mutation. The generated offspring inherit the skill factor of their parents: if within-task crossover was performed, the skill factor of the offspring is the same as that of the parents; otherwise, each offspring randomly inherits the skill factor of one parent. The details are shown in Algorithm 2.
Algorithm 2 Genetic operations
Input: p_1, p_2: candidate parent individuals; τ_1, τ_2: the skill factors of the parents; rmp: random mating probability; rand: a random number between 0 and 1
Output: offspring individuals o_1, o_2
1: if (τ_1 = τ_2) or (rand < rmp) then
2:   (o_1, o_2) ← Crossover(p_1, p_2)
3:   for k = 1, 2 do
4:     o_k ← Mutate(o_k)
5:   end for
6:   if τ_1 = τ_2 then
7:     o_1, o_2 inherit the skill factor from their parents
8:   else
9:     if rand < 0.5 then
10:      o_k inherits τ_1 from p_1
11:    else
12:      o_k inherits τ_2 from p_2
13:    end if
14:  end if
15: else
16:  for k = 1, 2 do
17:    o_k ← Mutate(p_k)
18:    o_k inherits the skill factor from p_k
19:  end for
20: end if
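A compact Python sketch of this MFEA-style mating control flow (our own rendering, not the paper's implementation; `crossover` and `mutate` are passed in as task-specific operators):

```python
import random

def genetic_step(p1, p2, tau1, tau2, rmp, crossover, mutate, rng=random):
    """Crossover within a task, or across tasks with probability rmp;
    otherwise each parent is mutated alone. Returns (child, skill) pairs."""
    if tau1 == tau2 or rng.random() < rmp:
        c1, c2 = crossover(p1, p2)
        c1, c2 = mutate(c1), mutate(c2)
        if tau1 == tau2:
            skills = (tau1, tau2)                       # same task: inherit directly
        else:
            # cross-task offspring randomly inherit one parent's skill factor
            skills = tuple(rng.choice((tau1, tau2)) for _ in range(2))
        return (c1, skills[0]), (c2, skills[1])
    # no transfer: mutate each parent within its own task
    return (mutate(p1), tau1), (mutate(p2), tau2)
```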
Both the between-task and within-task crossover operators are designed as the same single-point crossover. The i-th values of parents p_1 and p_2 are swapped to generate two new individuals o_1 and o_2. As shown in Figure 8, when individuals cross over at a certain position, the values at that position of the two vectors are swapped directly. Because the pruning rate and the sparse structure correspond one-to-one, the exchange is applied directly to the weight matrices of the network as well.
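The single-position swap of Figure 8 can be sketched as follows (illustrative code; the random position choice is our assumption):

```python
import random

def single_point_swap(p1, p2, i=None, rng=random):
    """Swap the pruning rate at position i between two parent vectors,
    producing two offspring."""
    if i is None:
        i = rng.randrange(len(p1))  # pick the crossover position at random
    c1, c2 = list(p1), list(p2)
    c1[i], c2[i] = c2[i], c1[i]
    return c1, c2

# Swapping position 1 of two 3-layer pruning schemes.
o1, o2 = single_point_swap([0.1, 0.2, 0.3], [0.4, 0.5, 0.6], i=1)
```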
A polynomial mutation [57] is applied once the crossover operation is complete. Figure 9 depicts the mechanism of the designed mutation operator. Taking one individual as an example, its i-th value changes, with a preset mutation probability, from 0 to 0.25, as calculated by the polynomial mutation in Figure 9. The change quantity δ in layer i is related to a random number and the non-negative distribution exponent η_m. The larger this exponent is, the more similar the offspring and the parent are, and it is set in advance together with the mutation probability. In the example, there are four input neurons and three output neurons in this layer, for a total of 12 weight parameters. During pruning, the weights are sorted and removed from the smallest magnitude upward, so the sparse structure obtained after the mutation operation is unique; in total, three positions in the matrix need to be changed.
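The standard polynomial-mutation update (Deb's formulation, which we assume matches the operator used here) perturbs a rate x ∈ [0, 1] as below; a larger distribution index eta keeps the child closer to the parent:

```python
import random

def poly_mutate(x, eta=20.0, low=0.0, high=1.0, rng=random):
    """Polynomial mutation of one pruning rate within [low, high]."""
    u = rng.random()
    if u < 0.5:
        delta = (2.0 * u) ** (1.0 / (eta + 1.0)) - 1.0
    else:
        delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta + 1.0))
    # scale the perturbation by the variable range and clip to the bounds
    return min(high, max(low, x + delta * (high - low)))
```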
The crossover and mutation operators adopted in this paper not only realize the self-evolution within tasks but also transfer the effective sparse structure so as to promote the search efficiency of two tasks.
3.2.5. Self-adaptive Knowledge Transfer Strategy
Although there is high similarity between the two tasks [58], negative transfer is still inevitable and affects the search efficiency and solution quality. Therefore, a self-adaptive knowledge transfer strategy based on historical information and a dormancy mechanism is designed. The intensity of transfer is adjusted adaptively according to individual contributions, while the dormancy mechanism suppresses irrelevant knowledge transfer, reducing the interference of useless knowledge in the task search and saving computing resources.
Algorithm 3 introduces the self-adaptive knowledge transfer strategy. New individuals generated by knowledge transfer between tasks are labeled as Q. After the fitness evaluation of the generated offspring, the Pareto ranks of the offspring in the non-dominated sorting are obtained. The knowledge transfer contribution can then be represented by the rank of the best non-dominated individual among these newly generated individuals, and this contribution in turn controls the value of the random mating probability rmp. Notice that when comparing the Pareto ranks of the offspring, the task to which each offspring belongs is not distinguished.
Algorithm 3 Self-adaptive knowledge transfer strategy
Input: N_p: the population size of the multiple tasks; r_min: minimum rank of the non-dominated sort; Q: new individuals generated by knowledge transfer; θ: preset threshold
Output: random mating probability rmp
1: Perform non-dominated sorting on the merged population
2: r_min ← the best Pareto rank among the individuals in Q
3: Transfer knowledge contribution c ← computed from r_min
4: if c < θ then
5:   rmp ← a small fixed value (dormancy)
6: else
7:   rmp ← c
8: end if
When the value of the knowledge transfer contribution is less than the preset threshold for population interaction, the dormancy condition is reached and rmp is set to a small fixed value. When the contribution is greater than the threshold, the transfer of useful knowledge has been detected, self-adaptive updating is resumed, and rmp takes the value of the contribution. By controlling the frequency of knowledge transfer during evolution through this self-adaptive strategy and the dormancy mechanism, the impact of negative transfer between tasks on task performance can be effectively avoided.
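The update rule can be summarized in a few lines (a sketch under our reading of Algorithm 3; the dormant value 0.1 is an assumed placeholder, not a value reported in the paper):

```python
def update_rmp(contribution, threshold, dormant_rmp=0.1):
    """Self-adaptive random mating probability with dormancy: if transferred
    individuals contributed little, fall back to a small fixed probability;
    otherwise let the contribution itself set the transfer intensity."""
    if contribution < threshold:
        return dormant_rmp     # dormancy: suppress cross-task transfer
    return contribution        # useful transfer detected: resume adaptation
```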
3.3. Fine-Tune Pruned Neural Networks
After pruning, a set of sparse networks is obtained. They are then retrained, as studied in [59]. In detail, these networks are trained with the Adam optimizer; the initial learning rate, weight decay, and number of training epochs are set per dataset. The learning rate is adjusted by cosine annealing with the default settings.
3.4. Computational Complexity of Proposed Method
The computational complexity of the proposed method comprises two parts: the cost of the evolutionary computation and the cost of fine-tuning. In the pruning part, the computational complexity is O(G × P × C), where G is the number of generations, P is the number of individuals, and C is the cost of one fitness evaluation. Assuming the computational cost of training for one epoch is C_e, the fine-tuning computational complexity is O(E × C_e), where E denotes the number of training epochs. Therefore, the computational complexity of the proposed approach is O(G × P × C + E × C_e). Because the proposed method performs multi-task optimization and handles two HSI pruning tasks simultaneously, its cost is twice that of a single evolution and fine-tuning process.
4. Experiments
In this part, the experiments carried out on HSIs to verify the effectiveness of the proposed method are described. First, it is verified that the pruned network achieves better classification accuracy with limited labeled samples on multiple HSIs; the proposed method is compared with other neural network pruning methods, and the relevant parameters of the pruned network are compared with those of other methods. After that, the sparse networks obtained on the Pareto-optimal front are compared to prove the effectiveness of the multi-objective optimization, and the effectiveness of the proposed self-adaptive knowledge transfer strategy is proven by quantifying the knowledge transfer between tasks. Finally, the proposed method is validated on more complex networks and a larger HSI.
4.1. Experimental Setting
A 3DCNN [36] trained on the HSI was used to validate the proposed method. The network is composed of convolutional layers with different strides: a convolutional layer with stride 1 is called Conv, and a convolutional layer with stride 2 is called ConvPool. Excluding the classification layer, the number in each entry of the network structure is the number of filters of that convolutional layer, and the structure can be expressed as: 3DConv(20)-1DConvPool(2)-3DConv(35)-1DConvPool(2)-3DConv(35)-1DConvPool(2)-3DConv(35)-1DConvPool(35)-1DConv(35)-1DConvPool(35).
The HSIs used are the Indian Pines, Salinas, and University of Pavia datasets. Real-world data not only suffer from limited labeled samples, but the labeled samples often fail to reflect the true distribution of the data. For example, only part of the HSI of a certain ground area is sampled during detection; these data are continuous but may not be comprehensive. To simulate limited-sample data, 10% labeled samples were used for each dataset, and the corresponding comparison methods also used 10%.
The Indian Pines (IP) dataset was collected by the AVIRIS sensor [60] over the Indian Pines test site in northwestern Indiana, USA. Its wavelength range is 400–2500 nm. After removing the water absorption bands, 200 spectral bands remain, and the spatial size of each band is 145 × 145 pixels, with 16 classes of labels in total. The spatial resolution of this dataset is only 20 m. Figure 10 shows the pseudo-color plot and labels of Indian Pines.
The Salinas (SA) dataset was collected over the Salinas Valley in California by the AVIRIS sensor. After removing the water absorption bands, there are a total of 200 spectral bands, and the spatial size of each band is 512 × 217 pixels, with 16 classes of labels in total. The spatial resolution of this dataset is 3.7 m. Figure 11 shows the pseudo-color plot and labels of Salinas.
The University of Pavia (PU) dataset was collected by the ROSIS sensor over the University of Pavia, Italy. After removing the water absorption bands, there are a total of 103 spectral bands, and the spatial size of each band is 610 × 340 pixels, with nine classes of labels in total. The spatial resolution of this dataset is 1.3 m. Figure 12 shows the pseudo-color plot and labels of the University of Pavia.
The proposed method was compared with seven deep learning methods: 1DCNN [61], 3DCNN [62], M3DCNN [63], DCCN [64], HybridSN [65], ResNet [66], and DPRN [67]. In the experiments, three evaluation metrics, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ), were used to evaluate the classification performance of the proposed method. The parameters of the proposed method are shown in Table 1.
The experimental server contained four Intel(R) Xeon(R) Silver 4214R CPUs @ 2.40 GHz and 192 GB of DDR4 RAM; two NVIDIA Tesla K40 12 GB GPUs and eight NVIDIA Tesla V100S GPUs were used. The software environment was the Ubuntu operating system with the PyTorch framework and Python 3.6 as the programming language. The optimizer of the convolutional neural network was Adam, with a weight decay of 0 and betas = (0.9, 0.999). The learning rate decay used cosine annealing, the number of training epochs was 200, and the batch size was 100.
4.2. Results on HSIs
4.2.1. Classification Results
In the experiment, two groups of experiments were constructed to analyze the influence on the performance of the proposed method. The first group uses the Indian Pines dataset and the Salinas dataset, and the second group uses the University of Pavia dataset and the Salinas dataset. The Indian Pines dataset and Salinas dataset are from the same sensor, and the University of Pavia dataset and Salinas dataset are from different sensors.
The classification results on the Indian Pines dataset are shown in Figure 13, and the detailed results are listed in Table 2. Although the pruned network does not obtain the best results on the three evaluation metrics, it obtains the highest classification accuracy, 100%, on seven categories. The network for the Indian Pines dataset is able to prune 91.2% of the parameters.
From the overall evaluation metrics, it can be seen that when the Indian Pines dataset from the same sensor is used as the other task, relatively better results are obtained, with 87.2% of the network weights pruned. By transferring existing knowledge, the method improves the classification accuracy of the network and greatly reduces the complexity of the model, and it is basically superior to the other deep learning methods in OA and κ. Although the number of samples in each category is unbalanced, knowledge transfer improves the overall performance of the sparse network, so the network still achieves a high AA, i.e., the classification accuracy is balanced across categories.
The classification results on the University of Pavia dataset are shown in Figure 13, and the detailed results are listed in Table 3. It can be seen that although 83.1% of the parameters are pruned, the pruned network still obtains high OA, AA, and κ values of 97.57%, 97.84%, and 96.79%, respectively. In addition, the best results are achieved in three categories. This proves that leveraging the knowledge transferred from other images can facilitate the training of the network on the current image.
Sub-optimal results were obtained on the University of Pavia dataset from a different sensor, which still has certain advantages compared with the other deep learning methods. Using the University of Pavia dataset as the other task, 84.3% of the network parameters were pruned. Compared with the results on the Indian Pines dataset, more parameters are retained, and the classification performance and consistency are lower.
These two groups of experiments show that the search efficiency of a task can be promoted by transferring the important sparse structure of the SOTA network from the other task. Regarding the differences between the two groups of experiments: under the same sensor device, the same physical imaging logic makes the similarity between the datasets higher and the spectral features more common, so better results can be achieved. Due to the lack of labeled training samples and the high complexity of the network model with its excessive parameters, the evaluation metrics of the unpruned neural network are low, which reflects how the lack of labeled samples limits network training.
4.2.2. Comparison with other Neural Network Pruning Methods
The proposed method was compared with three neural network pruning methods in Table 4. NCPM is the network collaborative pruning method proposed in this paper. Because NCPM is a multi-objective optimization method, a sparse network is selected from the Pareto-optimal front for comparison.
The first pruning method, L2Norm [68], is based on the L2 norm; it sets a pruning threshold for each layer by comparing the weight values of the network parameters in that layer. In addition, NCPM is compared with MOPSO [21], a method based on particle swarm optimization, and with LAMP [12], an iterative pruning method that utilizes a layer-adaptive global importance score for pruning.
The three comparison methods and the proposed method all use the 3D-DL network. The three original pruning methods were proposed for 2DCNNs and image classification datasets such as MNIST and CIFAR10; therefore, they were adapted to prune the 3DCNN. When training the network model, the same experimental settings (optimizer, learning rate, etc.) as in NCPM are used.
NCPM obtains the best pruning results on Salinas and the University of Pavia, and the OA of its pruned network is much better than those of L2Norm and MOPSO at the same pruning rate. The pruned network on Indian Pines is very close to that of the LAMP method, and both are better than L2Norm and MOPSO.
Across the three HSIs, it can be clearly seen that the sparse network found by L2Norm is sub-optimal due to its single redundancy evaluation criterion, whereas the evolutionary pruning methods can search for better sparse network structures. Due to a lack of diversity in the selection of solutions, the sparse networks found by MOPSO are inferior to those of NCPM. LAMP is an iterative pruning method that retrains during each iteration, which causes additional computational cost.
Compared with other pruning methods, NCPM can simultaneously prune two hyperspectral data classification networks, which improves the search efficiency. At the same time, the multi-objective optimization of the sparsity and accuracy of the network structure can obtain a set of sparse networks after one run.
4.2.3. Complexity Results of the Pruned Network
Table 5 shows the comparison between the pruned network and the original network, as well as other neural networks, where the training time refers to training for 200 epochs. Our method prunes the 3D-DL network; compared with the original 3D-DL, the pruned network removes most of the parameters and also accelerates the test time of the network to a certain extent. On the University of Pavia dataset, the training time was reduced by 18.23%; on the Salinas dataset, by 4.18%; and on Indian Pines, the time was almost unchanged. The pruned network achieves the best results among the compared methods on the Indian Pines and University of Pavia datasets. This comparison demonstrates the significance and necessity of neural network pruning.
4.2.4. The Result of the Sparse Networks Obtained by Multi-Objective Optimization
Figure 14 shows the Pareto-optimal front without fine-tuning in both experiments. The Pareto-optimal front obtained for the Indian Pines dataset is uniformly distributed, whereas that for the University of Pavia is sparsely distributed. Comparing the Salinas Pareto-optimal fronts across the experiments, the diversity of solutions is better in the multi-task optimization experiment with the Indian Pines dataset, which was collected by the same sensor.
The hypervolume curve in
Figure 15 is used to represent the convergence of the evolutionary search process. The hypervolume of each generation is determined by the sparse networks on the Pareto-optimal front, and the diversity and quality of these sparse networks affect the hypervolume. The initialization of the two experiments is random, so the initial hypervolume is different. Comparing the results on the Salinas dataset across experiments, the Salinas hypervolume curve under multi-task optimization with Indian Pines converges faster and improves more, which again verifies the influence of inter-task similarity on the results of multi-task optimization. In addition, the growth trend of the hypervolume is the same in the two sets of experiments, and the periods of faster hypervolume growth coincide, which can be understood as the promoting effect of knowledge transfer between the two tasks on their respective tasks.
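For a two-objective minimization problem, the hypervolume reduces to the area dominated by the front and bounded by a reference point; a minimal sketch:

```python
def hypervolume_2d(front, ref):
    """Hypervolume of a 2D front under minimization of both objectives:
    the area of the region dominated by the front and bounded above by
    the reference point `ref` = (ref_x, ref_y).
    """
    # keep only points that actually dominate some area below ref
    pts = sorted(p for p in front if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:  # ascending x; on a true front, y is descending
        hv += (ref[0] - x) * max(0.0, prev_y - y)
        prev_y = min(prev_y, y)
    return hv
```

A larger hypervolume means the front is both closer to the ideal point and more spread out, which is why it is used here to track convergence and diversity at once.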
Four networks on the Indian Pines dataset were selected for comparison with the original unpruned network in
Table 6. We can see that although about 80–90% of the parameters were pruned, after fine-tuning, the overall accuracy was within about 3% of the original network. For some categories, such as classes 1, 4, and 7, a classification accuracy of essentially 100% can be maintained. Through multi-objective optimization, a set of sparse network structures with different sparsity and accuracy can be obtained in one run, suitable for different application conditions and scenarios.
Four networks on the University of Pavia dataset were selected for comparison with the original unpruned network in
Table 7. Compared with the original network, the OA of the pruned network was improved, reaching 97.58% when the pruning rate was 92.93%. As the pruning rate increases, the obtained sparse network can still maintain the best classification accuracy on many categories.
Five of the sparse networks on the Salinas dataset obtained from each of the two experiments were selected for comparison with the original unpruned networks in
Table 6 and
Table 7. Implementing multi-task pruning with the Indian Pines dataset pruned 87.15% of the network parameters and obtained the best results. No class in the original network reached 100% accuracy, but the pruned network classifies several classes completely correctly, which indicates that network training is limited when labeled samples are scarce and that this limitation can be alleviated by knowledge transfer between tasks. Different sparse networks achieve the best classification accuracy on different categories, providing choices for different classification requirements.
The proposed method uses an evolutionary multi-objective optimization model to optimize network performance and network complexity simultaneously, and automatically obtains multiple sparse networks. Some points on the Pareto-optimal front are selected for comparison; the classification results of the pruned networks obtained on the Pareto-optimal front for different HSIs are shown in
Figure 16. With the increase in sparsity, the OA and AA of the network gradually decrease, but they remain better than those of a neural network trained directly on the limited labeled samples. In general, the proposed method can obtain a set of non-dominated sparse network solutions in a single run, and the quality of these sparse networks is high, which can provide a reference for practical datasets without labels; the method can also be applied to different datasets.
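Choosing a deployment network from the returned Pareto set can be as simple as filtering by a sparsity budget and taking the most accurate feasible solution; a hypothetical helper, where `front` holds (kept-fraction, error) pairs with both objectives minimized:

```python
def select_network(front, max_kept_fraction):
    """Pick one network from a Pareto set (sketch): the most accurate
    sparse network whose kept-weight fraction fits a deployment budget.
    Falls back to the sparsest network if none is feasible.
    """
    feasible = [p for p in front if p[0] <= max_kept_fraction]
    if not feasible:
        return min(front, key=lambda p: p[0])   # sparsest available
    return min(feasible, key=lambda p: p[1])    # lowest error in budget
```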
4.2.5. Effectiveness Analysis of Self-Adaptive Knowledge Transfer Strategy
To assess the quality of knowledge transfer between tasks, three metrics are given:
Proportion of transferred individuals: after the elite retention operation of NSGA-II, the proportion of the new population made up of individuals that survived through knowledge transfer. It evaluates the overall quality of the transfer: the higher the ratio, the better the quality of knowledge transfer and the more it promotes population optimization.
Transfer knowledge contribution degree: the minimum non-dominated rank of all transferred individuals after non-dominated sorting of the main task. The smaller the rank, the more excellent the transferred individual is within the population, indicating a greater contribution to population optimization.
Self-adaptive knowledge transfer probability: the variable used to control the degree of knowledge transfer in the self-adaptive transfer strategy. A larger value represents a stronger degree of interaction.
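The first two metrics can be sketched as follows, assuming each individual carries an `is_transfer` flag and a two-objective fitness; `nds_ranks` is a naive non-dominated sorting by front peeling (illustrative, not the NSGA-II implementation used in the paper):

```python
import numpy as np

def nds_ranks(points):
    """Naive non-dominated sorting by front peeling (rank 0 = best
    front), with both objectives minimized."""
    pts = np.asarray(points, dtype=float)
    ranks = np.full(len(pts), -1)
    rank, remaining = 0, set(range(len(pts)))
    while remaining:
        front = [i for i in remaining
                 if not any(np.all(pts[j] <= pts[i]) and np.any(pts[j] < pts[i])
                            for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining -= set(front)
        rank += 1
    return ranks

def transfer_metrics(points, is_transfer, survivors):
    """The first two transfer-quality metrics described above (sketch):
    the proportion of transferred individuals among the survivors of
    environmental selection, and the minimum non-dominated rank over
    all transferred individuals."""
    prop = sum(is_transfer[i] for i in survivors) / len(survivors)
    ranks = nds_ranks(points)
    contrib = min(ranks[i] for i in range(len(points)) if is_transfer[i])
    return prop, contrib
```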
As shown in
Figure 17, there are more individuals with transferred knowledge in the early stage of evolution, with the proportion ranging from 50% down to 10%. Although the transfer probability curve shows that the strength of knowledge transfer is almost constant, knowledge transfer in early evolution greatly helps the search; as the population continues to optimize and converge, the effect of knowledge transfer declines. The contribution degree of transferred knowledge shows that, although fewer individuals survive through knowledge transfer, part of the knowledge is still of high quality and remains very effective in promoting the optimization of tasks.
Because the search has not converged in the early stage, transferred knowledge can provide a general network structure to guide the search. However, as the task continues to be optimized, only very high-quality knowledge can further promote the search; at this stage, although much knowledge is transferred, only the individuals containing high-quality knowledge survive. Therefore, the self-adaptive knowledge transfer strategy based on historical information is necessary.
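One plausible realization of such a history-based update with a dormancy trigger is sketched below; the exact rule used by NCPM is not reproduced here, and all names and parameters are illustrative. The probability rises when recently transferred individuals survive environmental selection, decays otherwise, and a dormancy counter suspends transfer after persistent negative transfer:

```python
def update_transfer_prob(p, survival_rate, dormancy,
                         p_min=0.05, p_max=0.9, gain=0.1, dormancy_len=5):
    """Illustrative self-adaptive update of the transfer probability.

    p             -- current transfer probability
    survival_rate -- fraction of transferred individuals that survived
                     the last environmental selection (historical info)
    dormancy      -- remaining generations of suspended transfer
    Returns (new probability, new dormancy counter).
    """
    if dormancy > 0:                  # dormant: no transfer this generation
        return 0.0, dormancy - 1
    if survival_rate > 0.0:           # useful transfer: reinforce
        return min(p_max, p + gain * survival_rate), 0
    p = max(p_min, p - gain)          # useless transfer: decay
    if p <= p_min:                    # persistent negative transfer:
        return p, dormancy_len        # trigger the dormancy mechanism
    return p, 0
```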
During the evolution of the University of Pavia dataset as another task, as shown in
Figure 17, a long dormancy period is triggered, which indicates that during this period the self-adaptive transfer strategy regards the knowledge as invalid and intrusive. This may be because datasets collected by different detection devices differ and share few spectral features. Therefore, it is more useful to build multi-task optimization with datasets collected by the same sensor.
4.2.6. Discussion
In this part, the proposed method is validated on more complex networks and larger HSI datasets. The proposed method is used to prune the complex HSI classification network CMR-CNN [
69], which has 28,779,784 parameters. CMR-CNN is a cross-mixing residual network in which one three-dimensional (3D) residual structure extracts the spectral characteristics, one two-dimensional (2D) residual structure extracts the spatial characteristics, and one assisted feature extraction (AFE) structure links the first two.
Table 8 shows the pruning results of CMR-CNN on different HSIs. For this network, there is almost no decrease in OA after pruning nearly 75% of the parameters, and the OA on Indian Pines is improved by 0.46%, which shows that our method can be applied to complex networks and can alleviate the overfitting incurred when training them. Compared with the original network, the pruned network removes most of the parameters and also accelerates the test time to a certain extent. On the University of Pavia dataset, the training time is reduced by 9.58%, and on the Salinas dataset, by 14.8%. This comparison again demonstrates the significance and necessity of neural network pruning.
In addition, AlexNet [
6] and VGG-16 [
7] are pruned on the image classification dataset CIFAR10. The Naive-Cut [
70] method is a manual pruning method that uses weight magnitude as its redundancy criterion.
The comparison results after fine-tuning are shown in
Table 9. As the complexity of the network and the number of parameters increase, the gap between the proposed method and the other pruning methods widens. Compared with the traditional single-objective pruning methods Naive-Cut and L2-pruning, the proposed method obtains a set of networks with different sparsity and accuracy values in one run, and at comparable accuracy it obtains sparser results. This is because the proposed evolution-based method has strong local search capability and can find sparse network structures in the search space. Owing to its higher search efficiency and better diversity maintenance strategy, the proposed method preserves population diversity during evolution better than MOPSO.
The Pavia Center dataset was captured by the ROSIS-3 sensor, and the photographed scene is the city center of Pavia, Italy. This dataset has a spatial resolution of 1.3 m and an image size of
pixels. It contains 114 spectral bands with wavelengths ranging from 430 to 860 nm; after removing the noise bands, 104 bands are used for classification.
Figure 18 shows the pseudo-color image and the label map of Pavia Center.
Table 10 compares the classification results of the pruned network with the results of other neural network methods.
Figure 19 shows the classification maps of different methods on Pavia Center. In the collaborative pruning task on the University of Pavia and Pavia Center datasets, a sparser network structure is obtained on Pavia Center, whose OA is still maintained at 97.45%. In the same collaborative task, a 97.39% OA is obtained on the University of Pavia dataset, which is better than the original network 3D-DL, as well as 1DCNN and M3DCNN. This also proves that the proposed method can be applied to larger HSIs.
5. Conclusions
Classification and network pruning tasks were established for several HSIs. Within each task, the evolutionary pruning search acquires and learns important local structural information. Knowledge transfer between tasks carries structures that are important for representation in other tasks over to the current task, guiding the learning and optimization of the network on limited labeled samples. This effectively alleviates the overfitting and training difficulty caused by limited labeled samples in each task. The self-adaptive transfer strategy based on historical information and the dormancy mechanism achieves the original design goal: transferring as much good knowledge as possible while avoiding negative knowledge as much as possible.
Experiments on HSIs show that the proposed method can simultaneously realize classification and structure sparsification on multiple images. Comparison with other pruning methods on image classification data shows that the proposed method can search for sparser networks while maintaining accuracy. Structured pruning, which is currently more popular, avoids computation on sparse weight matrices, so our future work will consider applying the proposed framework to structured pruning; this requires reconsidering the knowledge representation and knowledge transfer strategy in the structured setting, and will further extend our work on neural network architecture optimization. Finally, the proposed method needs to be tested on hardware devices to verify its feasibility and practicality.