Research on a Convolution Kernel Initialization Method for Speeding Up the Convergence of CNN

Xu, Chunyu; Wang, Hong

doi:10.3390/app12020633


by Chunyu Xu ^1,2 and Hong Wang ^1,*

¹ School of Mechanical Engineering and Automation, Northeastern University, Shenyang 110819, China

² Department of Information Engineering, Liaoning Provincial College of Communications, Shenyang 100122, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(2), 633; https://doi.org/10.3390/app12020633

Submission received: 6 December 2021 / Revised: 1 January 2022 / Accepted: 4 January 2022 / Published: 10 January 2022


Abstract:

This paper presents a convolution kernel initialization method based on the local binary patterns (LBP) algorithm and sparse autoencoder. This method can be applied to the initialization of the convolution kernel in the convolutional neural network (CNN). The main function of the convolution kernel is to extract the local pattern of the image by template matching as the target feature of subsequent image recognition. In general, the Xavier initialization method and the He initialization method are used to initialize the convolution kernel. In this paper, firstly, some typical sample images were selected from the training set, and the LBP algorithm was applied to extract the texture information of the typical sample images. Then, the texture information was divided into several small blocks, and these blocks were input into the sparse autoencoder (SAE) for pre-training. After finishing the training, the weight values of the sparse autoencoder that met the statistical features of the data set were used as the initial value of the convolution kernel in the CNN. The experimental result indicates that the method proposed in this paper can speed up the convergence of the network in the network training process and improve the recognition rate of the network to an extent.

Keywords:

convolutional neural network; convolution kernel; local binary patterns; sparse autoencoder

1. Introduction

Deep learning is an important branch of artificial intelligence, and the convolutional neural network (CNN) has received much attention in recent years as one of its most important models. The CNN is a feed-forward neural network with a deep structure. Compared with other deep learning models, the weight-sharing strategy of the CNN greatly reduces the complexity of the network model and the number of parameters. The advantage of this strategy is more obvious when the input data is multi-dimensional: the data can be fed directly into the network, without the complex feature extraction and data reconstruction steps required by traditional recognition algorithms. At the same time, the CNN is, to a degree, invariant to translation, rotation, and scaling of the image target. Because of these characteristics, the CNN is widely used in image classification, target localization, semantic segmentation, and other visual tasks [1,2,3,4,5]. However, many problems of the CNN remain unsolved, and convolution kernel initialization is one of them. The CNN is a non-linear mapping, and the outcome of its training depends strongly on the initial conditions. Therefore, the initialization of the convolution kernel is very important for CNN training, and much research in recent years has focused on this field.

A popular method for initializing convolution kernels is random assignment [6]. This method draws the weight parameters from a Gaussian distribution with a mean of 0 and a standard deviation of 1. It is simple and straightforward, but it has obvious disadvantages: for example, it slows down the learning of the network and can trap the learning process in a local optimum. To solve this problem, Xavier Glorot et al. proposed a convolution kernel initialization method commonly known as “Xavier initialization” [7]. When the number of input neurons is n, the initial values of the convolution kernel obey a uniform distribution with a mean of 0 and a variance of 1/n. This method improves the convergence speed of the network. However, one of the assumptions in its derivation is that the activation function is linear, so the method is not well suited to non-linear activation functions such as ReLU. Later, Kaiming He et al. proposed an initialization method specifically for the ReLU activation function [8], which can be seen as an improved version of Xavier initialization. The initial values of the convolution kernel obey a Gaussian distribution with a mean of 0 and a variance of 2/n. Much research has demonstrated that both methods achieve good results on different datasets.
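The two distributions described above can be sketched in a few lines of NumPy. This is only an illustration of the variances stated in the text (1/n for the uniform Xavier form, 2/n for the Gaussian He form); the function names and the kernel layout (height, width, input channels, output channels) are assumptions made for the example, not part of the original work.

```python
import numpy as np

def xavier_uniform(shape, fan_in, rng):
    # A uniform distribution on [-a, a] has variance a^2 / 3,
    # so a = sqrt(3 / fan_in) gives the variance 1 / fan_in stated in the text.
    limit = np.sqrt(3.0 / fan_in)
    return rng.uniform(-limit, limit, size=shape)

def he_normal(shape, fan_in, rng):
    # Zero-mean Gaussian with variance 2 / fan_in.
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=shape)

rng = np.random.default_rng(0)
fan_in = 3 * 3 * 3            # a 3 x 3 kernel over 3 input channels
shape = (3, 3, 3, 16)         # assumed layout: (height, width, in, out)
w_xavier = xavier_uniform(shape, fan_in, rng)
w_he = he_normal(shape, fan_in, rng)
```

Both functions sample independently of the training data, which is the property the following paragraphs identify as the reason for slow early convergence.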

However, during network training, the convolution kernels and the training samples are independent of each other once the kernels have been initialized. It is a low-probability event for a randomly initialized kernel to match a local pattern of the training images. Only after many iterations do the convolution kernels match the local patterns of the training samples well. Therefore, network training takes longer, and the network converges more slowly.

Many methods for initializing the convolution kernel have been proposed to solve the above problems. OrthoNorm is an orthogonal matrix initialization method. It performs better than the approximately orthogonal Gaussian initialization, and it can also be used for non-linear networks [9]. The layer-sequential unit-variance (LSUV) method extends orthogonal initialization to an iterative process. The weights are first replaced with Gaussian noise of unit variance; they are then decomposed into an orthonormal basis by QR decomposition (into an orthogonal matrix and an upper triangular matrix) or by singular value decomposition (SVD), and one of the components replaces the weights. This method therefore exploits not only orthogonality but also unit variance of the output of each layer [10]. The principal component analysis network (PCANet) was proposed in 2014 by Tsung-Han Chan et al. In the cascaded principal component analysis stage, the model collects all overlapping image patches from a feature map and initializes the convolution kernel with the principal components of these patches. However, the number of feature maps increases exponentially with the number of layers, which limits the depth of the cascaded PCA [11]. Many other scholars use unsupervised pre-training to initialize the convolution kernel. The main idea is to use a sparse autoencoder (SAE) to initialize the convolution kernel of the first layer of the network and to obtain a filter set that accords with the statistical characteristics of the dataset, in order to solve the problem that the first layers of the network cannot be fully trained [12,13].

This paper proposes an improved convolution kernel initialization method. Representative local patterns were extracted from typical sample images of a specific dataset and used as the initial values of the convolution kernels. As a result, the convolution kernels could extract the local patterns of the images better by matched filtering from the start of training, which placed the network near the convergence domain with high probability and helped it approach the optimal solution. At the same time, the network achieved better discrimination accuracy. CIFAR-10 was used as the training set, and the typical sample images were selected from it. The local binary patterns (LBP) algorithm was applied to these typical sample images to obtain their texture information. Then, this texture information was divided into several small blocks, which were input into the sparse autoencoder for pre-training. After the pre-training, a set of convolution kernel filters that accorded with the statistical characteristics of the dataset was obtained and used as the initial values of the convolution kernel. The experimental results show that the method can clearly accelerate the convergence of CNNs and improve their recognition accuracy to an extent.

2. Materials and Methods

Since the optimization of a CNN depends on its initial conditions, the initialization of the convolution kernel is very important in CNN training. “Xavier initialization” and “He initialization” are commonly applied to initialize the convolution kernel. Because these initial values are independent of the training samples, it is a very low-probability event for the convolution kernel to match the local patterns of the training samples. The network therefore needs many training iterations to match those local patterns, which increases the training time of the CNN and slows down its convergence. In order to speed up convergence and improve recognition accuracy, the representative local patterns in sample images are extracted and used as the initial values of the convolution kernel. The convolution kernels are then able to match the local patterns of the images better at the start of training, so that the network starts near the convergence zone with high probability, approaches the optimal solution quickly, and achieves better discrimination accuracy.

2.1. The Dataset

At present, there are many public datasets for image classification, such as ImageNet, CIFAR-10, MNIST, and Caltech 101. ImageNet is one of the largest image datasets in the world, with about 15 million images in about 22,000 categories. The purpose of the experiment is only to verify whether the proposed initialization method can accelerate the convergence of the network, so a dataset of this scale is too large and is not suitable here. MNIST is an entry-level computer vision dataset of handwritten digit images. Each image is a single-channel grayscale image, and the dataset is generally used as a benchmark. Caltech 101 is a dataset composed of 101 categories of objects and is mainly used for target recognition and image classification. There are 40 to 800 pictures in every category, and the size of each image is about 300 × 200, whereas for the convenience of calculation the height and width of the images should be equal. Considering the characteristics of the above datasets, the CIFAR-10 dataset was selected for the experiment. Its images have the same height and width and are three-channel color images, and the categories do not overlap. No image contains objects of two categories, and the proportions and characteristics of the objects vary from image to image. The images are noisy and therefore difficult to identify. In addition, CIFAR-10 is a better choice when computational power is limited.

CIFAR-10 was provided by Krizhevsky et al. of the Hinton team [6]. It contains 60,000 color images in total, of which 50,000 are used for training and 10,000 for testing. The images are 32 × 32 pixels with three channels and fall into 10 categories. Every category contains 5000 training images and 1000 testing images. The testing batch contains 1000 randomly selected images from each category, and the training batches contain the remaining images in random order. Figure 1 shows the 10 categories in the dataset, with 10 random images from each category.

2.2. Overview of the Method

In order to make the convolution kernel match the training samples better and shorten the training time of the CNN, an improved convolution kernel initialization method is proposed in this paper. Firstly, some typical sample images were selected. Images of the same category in the training set are similar to each other; therefore, the typical local features of some images drawn from a category can represent the main features of that category to an extent. For each category subset of the training set, some sample images were selected, and together they formed the typical image subset. Then the LBP algorithm was applied to the typical image subset in order to extract the texture information. Finally, the texture information was divided into several small blocks, which were input into the sparse autoencoder for pre-training. After the training was completed, a set of filters that met the statistical characteristics of the dataset was obtained, and these filters were used to assign initial values to the convolution kernel of the CNN. The entire process of the method is shown in Figure 2.

2.3. Typical Sample Images

Firstly, typical sample images were selected. The labeled training set was denoted as

T = \{ W_{1}, W_{2}, \dots, W_{C} \}

, so there were

C

sample subsets in the training set, one per category. The sample subset of each category was

W_{d} = \{ I_{1}^{d}, I_{2}^{d}, \dots, I_{n}^{d} \}, d = 1, 2, \dots, C

, with n images in each sample subset. An image whose dynamic range and entropy were both bigger than the subset mean was selected as a typical sample image. An image with a larger dynamic range has relatively good contrast, while a larger image entropy indicates that the image carries more information on average. Therefore, the dynamic range and entropy of each image were required. The size of every image was N × N, and the dynamic range was estimated as:

\begin{array}{l} D (I_{k}^{d}) = \frac{I_{k}^{d} {(x, y)}_{\max}}{I_{k}^{d} {(x, y)}_{\min}} \\ d = 1, 2, \dots C; k = 1, 2, \dots n; x = 0, 1, \dots N - 1; y = 0, 1, \dots N - 1 \end{array}

(1)

where

I_{k}^{d} {(x, y)}_{\max}

and

I_{k}^{d} {(x, y)}_{\min}

, respectively, represent the largest and the smallest pixel value of the image

I_{k}^{d}

. The average dynamic range of a sample subset was estimated as:

D_{d} = \frac{1}{n} \sum_{k = 1}^{n} D (I_{k}^{d})

(2)

The one-dimensional entropy of the images is estimated as:

S (I_{k}^{d}) = - \sum_{o = 0}^{255} h_{o} \log h_{o}, \quad k = 1, 2, \dots, n

(3)

where

h_{o}

represents the probability that the pixel value o appears in the image. The average entropy of the sample subset was estimated as:

S_{d} = \frac{1}{n} \sum_{k = 1}^{n} S (I_{k}^{d})

(4)

The average dynamic range and the average entropy were calculated for each category subset, and the dynamic range and the entropy of each image in the subset were calculated. The images whose dynamic range and entropy were bigger than the corresponding mean were selected. The formula is shown in (5):

\begin{array}{l} G (Y^{d}) = \{ Y^{d} \mid Y^{d} = I_{k}^{d}, D (I_{k}^{d}) > D_{d}, \\ S (I_{k}^{d}) > S_{d}, k = 1, 2, \dots n, d = 1, 2, \dots C \} \end{array}

(5)

where

G (Y^{d})

is a sample set in which the dynamic range and image entropy are both bigger than the mean. The dynamic range of the images in the sample set was relatively larger, and the amount of information was relatively richer. Therefore, the typical images were obtained from the set of images in each category, and the features of these images represented the main features of the sample images in the category.

G (Y^{d})

is the set of typical sample images [14,15,16,17].
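The selection rule of Formulas (1) to (5) can be sketched as follows for a single category subset. This is a minimal illustration under stated assumptions: the images are taken to be 8-bit grayscale arrays, the logarithm is taken to base 2, and a minimum pixel value of 0 is clamped to 1 to avoid division by zero in Formula (1). None of these choices is specified in the text, and the function names are hypothetical.

```python
import numpy as np

def dynamic_range(img):
    # Formula (1): ratio of the largest to the smallest pixel value.
    # Assumption: a minimum of 0 is clamped to 1 to keep the ratio finite.
    lo = max(float(img.min()), 1.0)
    return float(img.max()) / lo

def entropy(img):
    # Formula (3): one-dimensional entropy over the 256 gray levels.
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    p = p[p > 0]                      # 0 * log 0 is treated as 0
    return float(-(p * np.log2(p)).sum())

def select_typical(images):
    # Formulas (2), (4), (5): keep the images whose dynamic range and
    # entropy both exceed the subset means; returns their indices.
    d = np.array([dynamic_range(im) for im in images])
    s = np.array([entropy(im) for im in images])
    keep = (d > d.mean()) & (s > s.mean())
    return np.flatnonzero(keep)

# Toy subset: a flat image and an image that uses all 256 gray levels.
flat = np.full((32, 32), 100, dtype=np.uint8)
rich = (np.arange(32 * 32).reshape(32, 32) % 256).astype(np.uint8)
print(select_typical([flat, rich]))   # [1]
```

For three-channel images such as those in CIFAR-10, the same functions could be applied after a grayscale conversion or per channel; the text does not say which was used.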

2.4. Local Binary Patterns

Local binary patterns (LBP) is an operator used to describe local texture features. It was proposed by T. Ojala, M. Pietikäinen, and D. Harwood in 1994 [18]. The algorithm compares gray values within the local area of the image point by point and counts how often each LBP value appears in order to describe the texture of the area. The algorithm is simple to compute, has low complexity, and offers strong gray invariance and, in its improved form, rotation invariance. It is used to extract local texture features of images, is widely applied in image retrieval, face recognition, and military fields, and is well regarded by researchers.

The original LBP operator is defined on a 3 × 3 window. The gray value of the central pixel serves as the threshold against which the gray values of the 8 adjacent pixels are compared. If a surrounding pixel value is not smaller than the central pixel value, that position is marked as 1; otherwise, it is marked as 0. The comparison yields an 8-bit binary number, which is the LBP value of the central pixel. Because it is an 8-bit binary number, there are 256 possible values in total, and they are used to represent the texture information of the area. The whole process is shown in Figure 3, and the formula is shown in (6).

LBP (x_{c}, y_{c}) = \sum_{n = 0}^{N - 1} 2^{n} s (i_{n} - i_{c})

(6)

where

N = 8

,

(x_{c}, y_{c})

is the central pixel,

i_{c}

is the gray value of the central pixel,

i_{n}

is the gray value which is adjacent to the center pixel, and

s

is a symbolic function:

s (x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}

(7)

According to Formula (7), a pixel in the eight-neighborhood is marked as 1 if its value is not smaller than the value of the central point; otherwise, it is marked as 0. The 8-bit binary number is then obtained by reading the marks clockwise. In the example of Figure 3, the 8-bit binary number is 11100001, which corresponds to the decimal number 225. The reading order of the bits is arbitrary, as long as the same order is kept throughout the processing.
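Formulas (6) and (7) can be sketched with NumPy as below. The reading order (clockwise from the top-left neighbor, first neighbor as the most significant bit) and the handling of the border (the outermost pixels are dropped, so an H × W image gives an (H − 2) × (W − 2) code map) are assumptions of this sketch. The 3 × 3 patch is a made-up example chosen so that it reproduces the code 11100001; it is not taken from Figure 3.

```python
import numpy as np

def lbp_3x3(gray):
    # Original LBP on a 3 x 3 window, Formulas (6) and (7).
    g = gray.astype(np.int32)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Neighbor offsets (dy, dx), clockwise starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # s(i_n - i_c) = 1 when the neighbor is not smaller than the center.
        code |= (neighbor >= center).astype(np.int32) << (7 - bit)
    return code.astype(np.uint8)

patch = np.array([[6, 7, 9],
                  [5, 5, 1],
                  [4, 3, 2]], dtype=np.uint8)
print(lbp_3x3(patch)[0, 0])   # 225, i.e. 0b11100001
```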

The biggest drawback of the original LBP operator is that it only covers a small area within a fixed radius. Therefore, it is not suitable for textures of different sizes and frequencies. Ojala et al. improved the LBP operator [19]: the 3 × 3 neighborhood is extended to a neighborhood of any size, and the square neighborhood is replaced by a circular one. The improved operator allows any number of sample points on a circular neighborhood of radius R, which gives the LBP operator with P sample points.

{LBP}_{R}^{P}

means to take P sample points in a circular area whose radius is R. It is shown in Figure 4.

Because the P sample points lie on a circle of radius R, their coordinates may not be integers, so a sample point may not fall exactly on a pixel. In order to solve this problem, bilinear interpolation is used to determine the value of the point: it is represented by a linear combination of the values of the four pixels around it [20,21].
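The interpolation step can be illustrated with a short sketch. It assumes that the P sample points are spaced evenly on the circle, starting to the right of the center and moving counter-clockwise, and that every sample point stays inside the image; the function names are hypothetical.

```python
import numpy as np

def bilinear(gray, y, x):
    # Value at a non-integer position (y, x) as a weighted sum of the
    # four surrounding pixels.
    g = np.asarray(gray, dtype=np.float64)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1 = min(y0 + 1, g.shape[0] - 1)
    x1 = min(x0 + 1, g.shape[1] - 1)
    fy, fx = y - y0, x - x0
    return ((1 - fy) * (1 - fx) * g[y0, x0] + (1 - fy) * fx * g[y0, x1]
            + fy * (1 - fx) * g[y1, x0] + fy * fx * g[y1, x1])

def circular_samples(gray, yc, xc, P=8, R=1.0):
    # Gray values of P points on a circle of radius R around (yc, xc).
    angles = 2.0 * np.pi * np.arange(P) / P
    return [bilinear(gray, yc - R * np.sin(a), xc + R * np.cos(a))
            for a in angles]

grid = np.arange(9, dtype=np.float64).reshape(3, 3)
center_of_four = bilinear(grid, 0.5, 0.5)      # mean of 0, 1, 3, 4
ring = circular_samples(grid, 1, 1, P=4, R=1.0)
```

With P = 8 and R = 1, the four diagonal sample points fall between pixels, which is where the interpolation matters.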

By definition, the LBP operator has gray invariance, which means that illumination changes have little impact on the description. As a result, it is strongly robust to illumination [22]. Take the 8-neighborhood as an example. When the illumination changes, the relationship between the center pixel and the surrounding 8 pixels hardly changes, because an illumination change affects a region rather than a single pixel. When strong light falls on the 9 pixels, the 9 pixel values increase at the same time, but their relative order remains unchanged. Although the original LBP has gray invariance, it does not have rotation invariance: different LBP values are obtained after the image is rotated. For this reason, Mäenpää et al. improved the LBP algorithm and proposed LBP operators with rotation invariance [11]. The algorithm continuously rotates the circular neighborhood to obtain a series of LBP values under the initial definition, and the minimum of these values is taken as the LBP value of the neighborhood. The process is expressed as:

{LBP}_{P, R}^{ri} = \min \{ ROR ({LBP}_{P, R}, i) \mid i = 0, 1, \dots, P - 1 \}

(8)

where

{LBP}_{P, R}^{ri}

represents the LBP operator with rotation invariance, and

ROR (x, i)

is the rotation function, which circularly shifts the bits of

x

to the right by

i

positions. Figure 5 is a schematic diagram of the process of obtaining the rotation-invariant LBP operator, using an 8-point LBP pattern as an example. The numbers below the LBP patterns indicate the corresponding LBP values. Rotating the pattern by different angles gives 8 LBP patterns, and the smallest LBP value, 15, is selected as the LBP value of the neighborhood. No matter how the neighborhood is rotated, the LBP value is (0000 1111) = 15.
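For P = 8, Formula (8) reduces to a minimum over the eight circular bit shifts of the code. The sketch below uses hypothetical helper names and works on a single 8-bit code.

```python
def ror8(x, i):
    # Circular right shift of an 8-bit value by i positions.
    i %= 8
    return ((x >> i) | (x << (8 - i))) & 0xFF

def lbp_rotation_invariant(code):
    # Formula (8) with P = 8: minimum over all circular shifts.
    return min(ror8(code, i) for i in range(8))

print(lbp_rotation_invariant(0b11100001))   # 15
print(lbp_rotation_invariant(0b00111100))   # 15
```

Every code with four contiguous ones on the circle maps to 0000 1111 = 15, which is the value quoted for Figure 5.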

The improved LBP algorithm has gray invariance and rotation invariance. It can extract the image texture at the same time. In this paper, there is no image that is rotated in the CIFAR-10 dataset; therefore, the original LBP algorithm that only has grey invariance was selected. In some cases, in order to increase the diversity of the data, operations such as rotation and mirroring can be performed on the images of the sample data. At the same time, the improved LBP algorithm can be used to extract the texture features of the images.

2.5. Sparse Autoencoder

The autoencoder was proposed by Rumelhart et al. in 1986 [23] and is used to process high-dimensional complex data. It is a neural network that tries to reproduce its input signal as closely as possible [24], and it is a common algorithm in deep learning. The main idea is to use the hidden layer of the network as both encoder and decoder: the input data is encoded and then decoded through the hidden layer so that the output layer data is almost equal to the input layer data. At the same time, the hidden layer neurons learn a compressed representation of the input data. For example, if the input is a 10 × 10 image with 100 pixels in total, the autoencoder has 100 neurons in the input layer, 50 neurons in the hidden layer, and as many neurons in the output layer as in the input layer. After training, the 50 hidden neurons have learned the features of the input, while the model keeps the output as close to the input as possible. Therefore, the autoencoder is generally used for dimension reduction or feature learning.

The structure of the autoencoder is illustrated in Figure 6. There are 6 neurons in each of the input and output layers and 3 neurons in the hidden layer. Through training, the model makes the hidden layer learn a compressed representation of the input data.

The autoencoder is an unsupervised learning method that sets the target value equal to the input value and is trained with the back propagation algorithm. The loss function of the autoencoder is defined as:

\begin{array}{l} J (W, b) = \frac{1}{m} \sum_{i = 1}^{m} J (W, b; x^{(i)}, y^{(i)}) + \frac{λ}{2} \sum_{l = 1}^{n_{l} - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} {(W_{j i}^{(l)})}^{2} \\ = \frac{1}{m} \sum_{i = 1}^{m} (\frac{1}{2} {‖ h_{W, b} (x^{(i)}) - y^{(i)} ‖}^{2}) + \frac{λ}{2} \sum_{l = 1}^{n_{l} - 1} \sum_{i = 1}^{s_{l}} \sum_{j = 1}^{s_{l + 1}} {(W_{j i}^{(l)})}^{2} \end{array}

(9)

where the first term is the mean squared error and the second term is the weight decay, which shrinks the weight values to prevent overfitting [25,26,27]. The sparse autoencoder (SAE) adds a sparsity constraint to the autoencoder. To achieve sparsity, it constrains the hidden layer so that most neurons are in an inhibited state and only a few neurons are active. The loss function of the sparse autoencoder is defined as:

J_{s p a r s e} (W, b) = J (W, b) + β \sum_{j = 1}^{s_{2}} K L (ρ ‖ {\hat{ρ}}_{j})

(10)

where KL is the Kullback-Leibler divergence, which is defined as:

K L (ρ ‖ {\hat{ρ}}_{j}) = ρ \log \frac{ρ}{{\hat{ρ}}_{j}} + (1 - ρ) \log \frac{1 - ρ}{1 - {\hat{ρ}}_{j}}

(11)

where

ρ

is the sparsity parameter, and its value is usually close to zero.

{\hat{ρ}}_{j}

is the average activation of hidden layer neuron j over the m training samples, and it is estimated as:

{\hat{ρ}}_{j} = \frac{1}{m} \sum_{i = 1}^{m} a_{j}^{(2)} (x^{(i)})

(12)

where

a_{j}^{(2)} (x^{(i)})

is the activation value that the input vector

x^{(i)}

produces at the hidden layer neuron

j

.
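Formulas (9) to (12) can be combined into one loss function, sketched below for a single hidden layer. Several details are assumptions of the sketch and are not given in the text: sigmoid activations in both layers, the row-vector convention (samples as rows of X), and the default values of the weight decay λ, the sparsity weight β, and the sparsity parameter ρ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def kl_sparsity(rho, rho_hat):
    # Formula (11), summed over the hidden neurons as in Formula (10).
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

def sae_loss(X, W1, b1, W2, b2, lam=1e-4, beta=3.0, rho=0.05):
    # X: (m, n_in); W1: (n_in, n_hidden); W2: (n_hidden, n_in).
    m = X.shape[0]
    A2 = sigmoid(X @ W1 + b1)          # hidden activations a^(2)
    X_hat = sigmoid(A2 @ W2 + b2)      # reconstruction h_{W,b}(x)
    recon = 0.5 * np.sum((X_hat - X) ** 2) / m          # first term of (9)
    decay = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    rho_hat = A2.mean(axis=0)          # Formula (12)
    return recon + decay + beta * kl_sparsity(rho, rho_hat)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 27))      # 50 toy inputs of length 27
W1 = rng.normal(0.0, 0.1, size=(27, 16))
W2 = rng.normal(0.0, 0.1, size=(16, 27))
loss = sae_loss(X, W1, np.zeros(16), W2, np.zeros(27))
```

The KL term is zero only when every hidden neuron has an average activation equal to ρ, so minimizing it pushes most hidden neurons toward the inactive state described above.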

When an autoencoder is used, it should be noted that if the input data of the network is completely random, for example, if each input variable is an identically distributed Gaussian variable that is independent of the others, then the compression will be very difficult to learn. However, if some structure is implied in the data, for example, if some input features are correlated with each other, the network can discover these correlations. In fact, a simple autoencoder usually learns a low-dimensional representation of the input data that is very similar to the result of principal component analysis (PCA) [28]. In this paper, the texture images produced by the LBP algorithm were cut into blocks, and the blocks were used as the input of the sparse autoencoder. After training, a reduced-dimension representation of the texture features of the sample images was obtained, and it served as the initial value of the convolution kernel of the CNN.

3. Results and Discussion

In order to accelerate the convergence of the CNN and shorten its training time, the initial value of the convolution kernel should have a certain orientation in the optimization space of the network. Therefore, a convolution kernel initialization method based on LBP and SAE was proposed in this paper. Firstly, some typical images were selected from the sample dataset, and the LBP algorithm was applied to them to obtain their texture images. Then, each texture image was divided into several blocks. Finally, a sparse autoencoder was constructed, and these blocks were input into it for training. After the training was completed, the weights of the autoencoder were used as the initial value of the convolution kernel of the CNN. The result of the experiment indicated that the training speed and the classification accuracy of the CNN were improved to an extent.

3.1. Result

A high-performance computer workstation with one CPU and two GPUs was used in the experiment. The CPU was an Intel Xeon W-2150B, and each GPU was an NVIDIA GeForce RTX 2080 Ti. The experiment was based on Python 3.8 and the TensorFlow framework. As introduced in Section 2.1, there are 10 categories in the CIFAR-10 dataset, and each category includes 6000 images, of which 5000 are training images and 1000 are testing images. Therefore, there are 50,000 training images and 10,000 testing images in the whole dataset. Following the method introduced in Section 2.3, the images whose dynamic range and image entropy were both greater than the mean value of their category were selected from the training images as the typical images. The number of images selected in each category is shown in Table 1; in total, 7991 images were selected from the 50,000 training images.

Ten images were selected randomly from each category in the typical image subset, giving a total of 100 images, and their texture features were extracted. The original images and the texture images are shown in Figure 7, with the original image on the left and its texture features on the right.

In this experiment, the first convolutional layer had 16 convolution kernels of size 3 × 3 × 3. Therefore, the sparse autoencoder had 16 hidden layer neurons, and its input layer and output layer each had 27 neurons. One hundred images were randomly selected from the typical image subset, and their texture features were extracted. Each texture image was divided into blocks of size 3 × 3, so that each image yielded 900 (30 × 30) blocks and the 100 images yielded 90,000 blocks. These blocks were input into the sparse autoencoder as the training set. After the training was completed, the weight matrix of the sparse autoencoder, of size 27 × 16, was obtained and reshaped to 16 × 3 × 3 × 3, which gave the set of convolution kernel filters of the CNN and completed the initialization of the convolution kernel. During the experiment, selecting 30 images from each category of the typical image subset was also tried. However, dividing the texture images into small blocks then took much longer than with 10 images per category, while the recognition rate of the CNN was basically the same. In other words, increasing the number of selected images did not help to improve the recognition rate of the network. Therefore, 10 images were randomly selected from each category of the typical image subset for the whole experiment in this paper.
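The block extraction and the conversion of the trained weights into kernels can be sketched as follows. The sketch assumes a 32 × 32 × 3 texture image cut with a sliding 3 × 3 window at stride 1 (which gives the 30 × 30 = 900 blocks mentioned above), a patch vector ordered as (channel, row, column), and a final kernel layout of (height, width, input channels, output channels) as commonly used by TensorFlow convolution layers. These orderings and the function names are assumptions; the text only gives the sizes.

```python
import numpy as np

def extract_patches(texture, k=3):
    # texture: (H, W, C). Sliding k x k windows at stride 1.
    win = np.lib.stride_tricks.sliding_window_view(
        texture, (k, k), axis=(0, 1))          # (H-k+1, W-k+1, C, k, k)
    return win.reshape(-1, texture.shape[2] * k * k)

def weights_to_kernels(W1, k=3, c=3):
    # W1: (c*k*k, n_filters) encoder weights of the sparse autoencoder.
    n_filters = W1.shape[1]
    per_filter = W1.T.reshape(n_filters, c, k, k)   # 16 x 3 x 3 x 3
    # Reorder to (height, width, in_channels, out_channels).
    return per_filter.transpose(2, 3, 1, 0)

texture = np.zeros((32, 32, 3), dtype=np.uint8)     # placeholder LBP image
patches = extract_patches(texture)                  # shape (900, 27)
kernels = weights_to_kernels(np.zeros((27, 16)))    # shape (3, 3, 3, 16)
```

Whatever ordering is chosen, it has to be the same when flattening the blocks and when reshaping the weights; otherwise the spatial structure learned by the autoencoder is scrambled in the kernel.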

In order to verify the effectiveness of the initialization method, three initialization methods were used to classify the CIFAR-10 dataset: the Xavier random initialization method, the He random initialization method, and the initialization method based on LBP and SAE proposed in this paper. Each of the three methods was used to initialize the convolution kernel of the first convolutional layer in the CNN model. The CNN model was implemented with the TensorFlow framework. In the model, the operation of “two convolutions and one pooling” was performed alternately, followed by two fully connected layers; finally, the output layer used the softmax classifier to divide the images into 10 categories. The activation function was ReLU, the weight parameters were updated with the Adam algorithm, and the loss function was the cross-entropy. The input of the network was a training image of size 32 × 32 × 3. The specific network structure is shown in Figure 8, and the basic parameter values of the network are shown in Table 2.

Batch_Size: The parameter was the number of images that were input into the network each time. Therefore, 128 images were sent into the input layer each time during the network training.

Epoch_Number: The parameter meant the total iteration times. In the experiment, these values were 20, 30, and 50, respectively.

Learning_Rate: The parameter was the learning rate. Its initial value was 0.01, and its attenuation factor was 0.1. During the network training, the value was 0.01 while the current iteration number was less than 40% of the total iteration number, 0.001 while it was between 40% and 80% of the total, and 0.0001 once it exceeded 80% of the total.
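The staged schedule can be written as a small function. This sketch assumes that the attenuation factor of 0.1 is applied once at 40% and once more at 80% of the total iterations, so that the last stage uses 0.0001, and that the boundary iterations belong to the later stage; the function name is hypothetical.

```python
def learning_rate(epoch, total_epochs, base=0.01, decay=0.1):
    # Piecewise-constant schedule with drops at 40% and 80% of training.
    progress = epoch / total_epochs
    if progress < 0.4:
        return base
    if progress < 0.8:
        return base * decay
    return base * decay * decay

for e in (0, 20, 45):
    print(e, learning_rate(e, total_epochs=50))
```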

Log_Frequency: The parameter was 391, which meant that the value of the loss function was printed to the terminal every 391 training steps. 391 steps represented one epoch, that is, the number of updates needed for all data in the training set to be used once. In the experiment, there were 50,000 images in the training set and the “Batch_Size” was 128, so the “Log_Frequency” was 50,000/128 ≈ 390.6, rounded up to 391.
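The step count per epoch follows from a ceiling division, as the short check below shows; the variable names are chosen for the example only.

```python
import math

train_images = 50_000
batch_size = 128

steps_per_epoch = math.ceil(train_images / batch_size)
last_batch = train_images - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 391
print(last_batch)        # 80 images in the final, partial batch
```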

With an appropriate hardware platform, model, and number of iterations, the recognition rate on the CIFAR-10 dataset can reach more than 95% [29]. However, the purpose of this experiment was not to obtain a better recognition rate on CIFAR-10 but to verify the effectiveness of the convolution kernel initialization method based on LBP and SAE. In the experiment, under the same conditions, the convolution kernel of the first convolutional layer of the CNN was initialized with the Xavier initialization method, the He initialization method, and the initialization method based on LBP and SAE, and the network was trained for 20, 30, and 50 iterations. The training results are shown in Figure 9, and the recognition rates are shown in Table 3.

Table 3 shows the result of one of several CNN training runs. Although the recognition rate of the CNN on the test data was not exactly the same after each run, the method proposed in this paper had a higher recognition rate than the other two methods in every run. In this result, the recognition rates of all three initialization methods improved as the iteration number increased. However, once the iteration number had increased to a certain extent, the recognition rate of the network was basically unchanged, because the network model gradually converged as the iteration number increased. When the network iterated 20 times, the recognition accuracy was 73.93% with the Xavier initialization method and 78.51% with the He initialization method, so the He method increased the recognition rate of the network by 4.58 percentage points compared with the Xavier method. With the initialization method based on LBP and SAE proposed in this paper, the recognition accuracy was 81.59%, an increase of 7.66 and 3.08 percentage points over the Xavier method and the He method, respectively. When the network iterated 30 times, the proposed method increased the recognition rate by 6.91 and 3.02 percentage points compared with the Xavier method and the He method, respectively. When the network iterated 50 times, the recognition rate was again higher, by 4.00 and 2.00 percentage points, respectively. In addition, Figure 9 shows the training loss for the three initialization methods. The loss of the network initialized with the method based on LBP and SAE decreased faster, and the network converged and reached the minimum sooner than with the other two initialization methods. The experimental result indicated the effectiveness and versatility of the convolution kernel initialization method based on LBP and SAE, which improved the recognition rate of the network to an extent and made the network converge faster.
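The differences between methods can be recomputed directly from the entries of Table 3. The short sketch below does this in percentage points; the dictionary layout and the function name `gain` are illustrative choices, and the numbers are copied from Table 3.

```python
# Recognition rates (%) copied from Table 3, keyed by method and epoch number.
RATES = {
    "Xavier": {20: 73.93, 30: 74.90, 50: 78.01},
    "He": {20: 78.51, 30: 78.79, 50: 80.01},
    "LBP and SAE": {20: 81.59, 30: 81.81, 50: 82.01},
}


def gain(method, baseline, epochs):
    """Difference in recognition rate, in percentage points."""
    return round(RATES[method][epochs] - RATES[baseline][epochs], 2)


if __name__ == "__main__":
    for epochs in (20, 30, 50):
        print(
            epochs,
            gain("LBP and SAE", "Xavier", epochs),
            gain("LBP and SAE", "He", epochs),
        )
```

Because Table 3 reports a single training run, these gaps describe that run only.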

3.2. Discussion

The images in the CIFAR-10 dataset are 32 × 32 pixels, and in most images the target to be recognized occupies a large portion of the image. In the initialization method proposed in this paper, each image was cut into several blocks after the LBP algorithm had been applied. Because the block size was 3 × 3, the probability that a block contains the entire target to be recognized is small. Therefore, the method yields only a small improvement in the recognition accuracy on CIFAR-10.

Ten images were randomly selected from each category of the typical images chosen from the CIFAR-10 dataset, and the initialization method proposed in this paper was applied to these 10 images as representatives of all images in that category. Therefore, the initial values of the convolution kernels matched parts of some images rather than all images in the training dataset. In addition, only the first-layer convolution kernels of the CNN were initialized with the proposed method. The convolution kernels of the other layers still used the default initialization method of the TensorFlow framework, and training them still took plenty of time before the network converged. For these two reasons, although the method can accelerate the convergence of a CNN, the effect is not very pronounced.
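To make the block-cutting step concrete, the following sketch computes a texture image with the basic 8-neighbour LBP operator and cuts it into 3 × 3 blocks. It is a simplification under stated assumptions: the paper uses an improved, rotation-invariant LBP operator, whereas this code uses the basic operator; the blocks are assumed to be non-overlapping; and border pixels are simply dropped. All function names are illustrative.

```python
def lbp_image(gray):
    """Basic 8-neighbour LBP code for every interior pixel of a 2-D list."""
    # Neighbour offsets in clockwise order, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    height, width = len(gray), len(gray[0])
    texture = []
    for r in range(1, height - 1):
        row = []
        for c in range(1, width - 1):
            center = gray[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre contributes a 1 bit.
                if gray[r + dr][c + dc] >= center:
                    code |= 1 << (7 - bit)
            row.append(code)
        texture.append(row)
    return texture


def split_into_blocks(image, size=3):
    """Cut an image into non-overlapping size x size blocks (remainder dropped)."""
    blocks = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            blocks.append([row[c:c + size] for row in image[r:r + size]])
    return blocks


if __name__ == "__main__":
    # Synthetic 32 x 32 grayscale image standing in for a CIFAR-10 sample.
    gray = [[(7 * r + 13 * c) % 256 for c in range(32)] for r in range(32)]
    texture = lbp_image(gray)
    blocks = split_into_blocks(texture)
    print(len(texture), len(texture[0]), len(blocks))
```

Under these assumptions a 32 × 32 image gives a 30 × 30 texture image and 100 blocks, each covering well under 1% of the image, which illustrates why a single block rarely contains a whole target.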

Compared with the Xavier initialization method and the He initialization method, the method proposed in this paper obtains the initial values of the convolution kernels in a relatively cumbersome way. First, the typical images are selected; then, the texture images are extracted from the typical images; finally, the sparse autoencoder is trained to obtain the initial values of the convolution kernels. In contrast, the Xavier initialization method and the He initialization method are relatively mature and can be applied directly by calling the corresponding functions in the TensorFlow framework [30,31,32,33,34,35,36].

As the above experimental results and analysis show, although the convolution kernel initialization method based on LBP and SAE proposed in this paper has several shortcomings, it still accelerates the convergence of the network and improves the recognition accuracy to an extent. The main reason is that the texture features of the images are extracted and then processed by the sparse autoencoder to produce the initial values of the convolution kernels, so that the kernels match the local patterns of some images in the training samples. This shows that representative local patterns in the typical sample images can be extracted in advance and used as the initial values of the convolution kernels, so that the kernels match the local patterns of the training samples; this accelerates the convergence of the network and improves its recognition accuracy.
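For comparison, the two baseline schemes need nothing more than the layer dimensions. The sketch below computes the standard deviation each scheme uses for a zero-mean normal distribution, following the usual definitions: variance 2/(fan_in + fan_out) for Xavier and 2/fan_in for He. The 3 × 3 kernel size and the 64 output filters are hypothetical values chosen for illustration, not the configuration reported in the paper.

```python
import math


def fan_in_out(kernel_h, kernel_w, in_channels, out_channels):
    """Fan-in and fan-out of a 2-D convolutional layer."""
    receptive_field = kernel_h * kernel_w
    return receptive_field * in_channels, receptive_field * out_channels


def xavier_std(fan_in, fan_out):
    """Standard deviation of the Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in):
    """Standard deviation of the He normal initialization."""
    return math.sqrt(2.0 / fan_in)


if __name__ == "__main__":
    # 3 x 3 kernels on an RGB input; 64 filters is a hypothetical choice.
    fan_in, fan_out = fan_in_out(3, 3, 3, 64)
    print(fan_in, fan_out)
    print(xavier_std(fan_in, fan_out), he_std(fan_in))
```

Both schemes depend only on the layer shape and never look at the training images, which is the contrast with the data-driven method of this paper.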
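The hand-over from the sparse autoencoder to the CNN amounts to reshaping weight vectors into kernels. The sketch below shows one possible form of that step. It assumes that the first-layer weights are stored as one flat list of nine values per hidden unit and that the 3 × 3 blocks were flattened in row-major order; how the kernels are extended to the three input channels is not covered here. The function name `weights_to_kernels` is hypothetical.

```python
def weights_to_kernels(weights, size=3):
    """Reshape each hidden unit's flat weight vector into a size x size kernel."""
    kernels = []
    for unit_weights in weights:
        if len(unit_weights) != size * size:
            raise ValueError("each hidden unit needs size * size input weights")
        # Row-major reshape: the first `size` weights form the first kernel row.
        kernels.append(
            [list(unit_weights[i * size:(i + 1) * size]) for i in range(size)]
        )
    return kernels


if __name__ == "__main__":
    # Two made-up hidden units standing in for trained autoencoder weights.
    sae_weights = [
        [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
        [-1.0, 0.0, 1.0, -2.0, 0.0, 2.0, -1.0, 0.0, 1.0],
    ]
    for kernel in weights_to_kernels(sae_weights):
        print(kernel)
```

With this layout, the number of hidden units in the autoencoder equals the number of kernels obtained.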

4. Conclusions

The initialization of the convolution kernels is very important for accelerating the convergence of a CNN and obtaining high classification accuracy. The random initialization method is simple and direct, but the network converges slowly and may fall into a local optimum. In the most severe cases, gradient dispersion may be caused. The Xavier initialization method is not applicable to non-linear activation functions such as ReLU. Although the He initialization method was proposed for the ReLU activation function, a match between the convolution kernel and the local patterns of the training sample images is a low-probability event, and the network needs many iterations to converge to the minimum. For these reasons, the convolution kernel initialization method based on LBP and SAE was proposed in this paper. The CIFAR-10 dataset was used as the experimental dataset. First, the images whose dynamic range and image entropy were both greater than the mean values over the dataset were selected as typical images, and ten images were randomly selected from each category of the typical images. Then, the LBP algorithm was used to extract the texture features of these images, and the texture feature images were divided into many small blocks, which are the local patterns of the typical images. Finally, these small blocks were input into the sparse autoencoder for pre-training. After pre-training, the first-layer weight values of the sparse autoencoder were extracted as the initial values of the convolution kernels of the first layer of the CNN. The images in the training set were 32 × 32 pixels. The experimental results showed that both the convergence speed and the recognition rate improved for small images. The method itself did not depend on the size of the image; square images were selected as the training set only for the convenience of calculation. Large images can therefore be expected to give the same results, although the running time of the program may be prolonged when the texture image is divided into small blocks.
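The selection of typical images summarized above can be sketched as a simple filter. The code below assumes that the dynamic range is the difference between the largest and smallest gray level of an image and that the image entropy is the Shannon entropy of its gray-level histogram; these definitions and all function names are assumptions made for illustration.

```python
import math


def dynamic_range(gray):
    """Difference between the largest and smallest gray level (assumed definition)."""
    pixels = [p for row in gray for p in row]
    return max(pixels) - min(pixels)


def image_entropy(gray):
    """Shannon entropy (in bits) of the gray-level histogram (assumed definition)."""
    pixels = [p for row in gray for p in row]
    counts = {}
    for p in pixels:
        counts[p] = counts.get(p, 0) + 1
    total = len(pixels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_typical(images):
    """Indices of images whose range and entropy both exceed the dataset means."""
    ranges = [dynamic_range(img) for img in images]
    entropies = [image_entropy(img) for img in images]
    mean_range = sum(ranges) / len(images)
    mean_entropy = sum(entropies) / len(images)
    return [
        i for i in range(len(images))
        if ranges[i] > mean_range and entropies[i] > mean_entropy
    ]


if __name__ == "__main__":
    flat = [[5, 5], [5, 5]]            # no contrast, zero entropy
    varied = [[0, 85], [170, 255]]     # full range, four distinct levels
    print(select_typical([flat, varied]))
```

In the method, ten images per category would then be drawn at random from the selected indices.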

In summary, under the same experimental conditions, this method sped up the convergence of the network to an extent during training and guided the optimization direction of the network. At the same time, the recognition rate of the network was improved to an extent. The experimental result also indicated that, for the problem of convolution kernel initialization, representative local patterns in the training images can be extracted in advance and used as the initial values of the convolution kernels.

The method proposed in this paper also has some problems and shortcomings. First, after the LBP algorithm is applied, the texture feature images are divided into many small blocks of size 3 × 3, so the probability of a complete match between a block and the target to be identified is small, resulting in only a small improvement in recognition accuracy. The block size could be increased appropriately to raise the degree of matching between the block and the target and thereby improve the recognition accuracy of the network. Second, the method only initializes the first-layer convolution kernels of the CNN. It could be extended to obtain the initial values of the convolution kernels of the other layers in order to further accelerate the convergence of the network. Finally, the method is relatively cumbersome in obtaining the initial values of the convolution kernels. A platform could be built for this purpose, so that users input the dataset and the corresponding parameters and obtain the initial values of the CNN convolution kernels directly from the platform. These three aspects are the main content of the next research work. In addition, how to effectively extract representative local patterns and better match them to the initial values of the convolution kernels, so that the network quickly reaches the global optimum, and how to improve the recognition rate of the network will also be the focus of future research.

Author Contributions

C.X. conceived the methodology, developed the algorithm, and designed and performed the experiment. C.X. and H.W. analyzed the data and wrote the manuscript. H.W. finished editing and proofreading. All authors have read and agreed to the published version of the manuscript.

Funding

This study was funded by the innovative team project of colleges and universities in Liaoning Province (LT2014006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The CIFAR-10 dataset is available at http://www.cs.toronto.edu/~kriz/cifar.html (accessed on 3 October 2021).

Conflicts of Interest

The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Wang, L.; Zhang, Y.; Xi, R. Study on image classification with convolution neural networks. In Proceedings of the 5th International Conference on Intelligence Science and Big Data Engineering, Suzhou, China, 14–16 June 2015; Volume 9242, pp. 310–319.
2. Tompson, J.; Goroshin, R.; Jain, A.; LeCun, Y.; Bregler, C. Efficient object localization using Convolutional Networks. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 648–656.
3. Noh, H.; Hong, S.; Han, B. Learning deconvolution network for semantic segmentation. In Proceedings of the 15th IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1520–1528.
4. Pfister, T.; Simonyan, K.; Charles, J.; Zisserman, A. Deep convolutional neural networks for efficient pose estimation in gesture videos. In Proceedings of the 12th Asian Conference on Computer Vision, Singapore, 1–5 November 2014; Volume 9003, pp. 538–552.
5. Razavian, A.S.; Azizpour, H.; Sullivan, J.; Carlsson, S. CNN features off-the-shelf: An astounding baseline for recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Columbus, OH, USA, 23–28 June 2014; pp. 512–519.
6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet Classification with Deep Convolution Neural Networks. Adv. Neural Inf. Process. Syst. 2012, 25, 1097–1105.
7. Glorot, X.; Bengio, Y. Understanding the difficulty of training deep feedforward neural networks. J. Mach. Learn. Res. 2010, 9, 249–256.
8. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1026–1034.
9. Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact Solutions to the nonlinear dynamics of learning in deep linear neural networks. Comput. Sci. 2014, 2, 1–22.
10. Mishkin, D.; Matas, J. All you need is a good init. In Proceedings of the 4th International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016; pp. 1–13.
11. Chan, T.-H.; Jia, K.; Gao, S.; Lu, J.; Zeng, Z.; Ma, Y. PCANet: A Simple Deep Learning Baseline for Image Classification? IEEE Trans. Image Process. 2015, 24, 5017–5032.
12. Wang, G.H.; Xu, J. Fast feature representation based on multilevel pyramid convolution neural network. Comput. Appl. Res. 2015, 32, 2492–2495.
13. Zhang, W.D.; Xu, Y.L.; Ni, J.C.; Ma, S.P.; Shi, H.H. Image target recognition algorithm based on multi-scale block convolutional neural network. Comput. Appl. 2016, 4, 1033–1038.
14. Xie, D.; Xiong, J.; Pu, S. All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation. In Proceedings of the 2017 Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5075–5084.
15. Zhang, M.; Li, W.; Du, Q. Diverse Region-Based CNN for Hyperspectral Image Classification. IEEE Trans. Image Process. 2018, 27, 2623–2634.
16. Song, G.H.; Jin, X.G.; Chen, G.L.; Nie, Y. Two-level hierarchical feature learning for image classification. Front. Inf. Technol. Electron. Eng. 2016, 17, 897–906.
17. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826.
18. Liu, H.; Cocea, M.; Ding, W. Decision tree learning based feature evaluation and selection for image classification. In Proceedings of the 16th International Conference on Machine Learning and Cybernetics, Ningbo, China, 9–12 July 2017; Volume 2, pp. 569–574.
19. Luan, S.; Chen, C.; Zhang, B.; Han, J.; Liu, J. Gabor Convolutional Networks. IEEE Trans. Image Process. 2018, 27, 4357–4366.
20. Huang, G.; Liu, Z.; Laurens, M.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
21. Zhu, D.; Du, B.; Zhang, L. Two-stream convolutional networks for hyperspectral target detection. IEEE Trans. Geosci. Remote. Sens. 2020, 59, 6907–6921.
22. Lee, C.; Sarwar, S.S.; Panda, P.; Srinivasan, G.; Roy, K. Enabling Spike-Based Backpropagation for Training Deep Neural Network Architectures. Front. Neurosci. 2020, 14, 119–131.
23. Luo, W.; Li, J.; Yang, J.; Xu, W.; Zhang, J. Convolutional Sparse Autoencoders for Image Classification. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 3289–3294.
24. Zhu, Z.; Wang, X.; Bai, S.; Yao, C.; Bai, X. Deep Learning Representation using Autoencoder for 3D Shape Retrieval. Neurocomputing 2016, 204, 41–50.
25. Seyfioglu, M.S.; Gurbuz, S.Z. Deep Neural Network Initialization Methods for Micro-Doppler Classification with Low Training Sample Support. IEEE Geosci. Remote Sens. Lett. 2017, 14, 2462–2466.
26. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. ImageNet Large Scale Visual Recognition Challenge. Int. J. Comput. Vis. 2015, 115, 211–252.
27. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deep with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9.
28. Waseem, R.; Wang, Z. Deep Convolutional Neural Networks for Image Classification: A Comprehensive Review. Neural Comput. 2017, 10, 200–215.
29. Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
31. Peng, Y.; Zheng, Z.; Li, J.; Pan, Z.; Li, X.; Zhai, Z. Manifold sparse coding based hyperspectral image classification. Int. J. Signal Process. Image Process. Pattern Recognit. 2016, 9, 281–288.
32. Hinton, G.; Deng, L.; Yu, D.; Dahl, G.E.; Mohamed, A.-R.; Jaitly, N.; Senior, A.; Vanhoucke, V.; Nguyen, P.; Sainath, T.N.; et al. Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups. IEEE Signal Process. Mag. 2012, 29, 82–97.
33. Zhang, H.; Wang, Y.; Luo, L.; Lu, X.; Zhang, M. SIFT flow for abrupt motion tracking via adaptive samples selection with sparse representation. Neurocomputing 2017, 249, 253–265.
34. Hinton, G.E.; Srivastava, N.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.S. Improving neural networks by preventing co-adaptation of feature detectors. Comput. Sci. 2012, 3, 212–223.
35. Srivastava, N.; Hinton, G.E.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R.S. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958.
36. Zhang, Y.; Chan, W.; Jaitly, N. Very deep convolutional networks for end-to-end speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4845–4849.

Figure 1. CIFAR-10 data set image sample.

Figure 2. The process of the method.

Figure 3. Local binary patterns (LBP) operator.

Figure 4. The improved LBP operator.

Figure 5. Rotation invariant LBP diagram.

Figure 6. Autoencoder with the input layer neuron x and the output layer neuron y.

Figure 7. The original image and the texture image.

Figure 8. The CNN structure.

Figure 9. The comparison of CNN loss of three initialization methods with different iteration times.

Table 1. Typical images subset.

Category	Number of Images
1	939
2	1030
3	810
4	795
5	711
6	777
7	661
8	823
9	691
10	754

Table 2. The basic network parameter settings.

Batch_Size	Epoch_Number	Learning_Rate	Log_Frequency
128	20, 30, 50	0.01	50,000/128 = 391

Table 3. The comparison of the recognition rate of three initialization methods.

Methods	Recognition Rate (Epoch_Num = 20)	Recognition Rate (Epoch_Num = 30)	Recognition Rate (Epoch_Num = 50)
Xavier	73.93%	74.90%	78.01%
He	78.51%	78.79%	80.01%
LBP and SAE	81.59%	81.81%	82.01%

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

