**Development of a Robust Multi-Scale Featured Local Binary Pattern for Improved Facial Expression Recognition**

**Suraiya Yasmin <sup>1</sup>, Refat Khan Pathan <sup>2</sup>, Munmun Biswas <sup>2</sup>, Mayeen Uddin Khandaker <sup>3,\*</sup> and Mohammad Rashed Iqbal Faruque <sup>4</sup>**


Received: 14 August 2020; Accepted: 14 September 2020; Published: 21 September 2020

**Abstract:** Compelling facial expression recognition (FER) processes have been utilized in highly successful fields like computer vision, robotics, artificial intelligence, and dynamic texture recognition. However, a critical problem that the traditional local binary pattern (LBP) poses for FER is the loss of information about neighboring pixels at different scales, which affects the texture of facial images. To overcome this limitation, this study describes a new extended LBP method that extracts feature vectors from images to detect facial expressions. The proposed method is based on the bitwise AND operation of two rotational kernels applied on LBP(8,1) and LBP(8,2) and is evaluated on two accessible datasets. Firstly, the facial region is detected and the essential components of a face, such as the eyes, nose, and lips, are located. This portion of the face is then cropped to reduce the dimensions, and an unsharp masking kernel is applied to sharpen the image. The filtered images then go through the feature extraction method before the classification process. Four machine learning classifiers were used to verify the proposed method. This study shows that the proposed multi-scale featured local binary pattern (MSFLBP), together with a Support Vector Machine (SVM), outperformed the recent LBP-based state-of-the-art approaches, achieving an accuracy of 99.12% for the Extended Cohn–Kanade (CK+) dataset and 89.08% for the Karolinska Directed Emotional Faces (KDEF) dataset.

**Keywords:** facial expression recognition system; computer vision; multi-scale featured local binary pattern; unsharp masking; machine learning

#### **1. Introduction**

Facial expression recognition (FER) is a natural and powerful way to decipher human feelings and intentions, since expressions convey emotion without words and faces carry considerably more than individual identity. In short, it is one of the most natural, immediate, and robust means by which people communicate their intentions and emotions to others. Because it concerns human emotion, which differs from person to person, researchers have developed many methods based on both machine learning and deep learning techniques to gain a deeper understanding of the problem. Nowadays, tasks are becoming increasingly mechanized through computer automation, in which computer vision plays a vital role by training computers to interpret and understand the visual world. Thus, FER is in high demand in computer vision, with applications in robotics, neuromarketing, academia, and, notably, security. Moreover, FER is one of the most challenging biometric recognition technologies because expressions are by nature spontaneous and intuitive.

FER has two essential stages: feature extraction (geometric and appearance-based) and classification. While geometric feature extraction covers facial components such as the eyes, mouth, nose, and eyebrows, the appearance-based approach works on the relevant regions of the face. Classification, in turn, categorizes the expression as, for example, happiness, sadness, anger, disgust, surprise, or fear. Researchers have worked with many neural network concepts like the Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN), and with machine learning classifiers like the Support Vector Machine (SVM) and K-Nearest Neighbour (KNN), to find the most accurate FER technique. In this connection, several researchers have built on popular neural-network-based methods such as CNN [1], CNN-RNN [2], 3DCNN-DAP [3,4], the Weighted Mixture Deep Neural Network [5], and the CNN with attention mechanism (ACNN), which enables the model to shift attention from occluded patches to unoccluded ones, with distinct facial regions handled by the patch-based ACNN (pACNN) and the global-local-based ACNN (gACNN) [6]. Although neural networks are easy to build with the latest programming languages like Python and R and tools like Matlab and Weka, when it comes to computational cost, especially in facial image processing with many classes, they require very high processing power, a large amount of random access memory (RAM), and a graphics processing unit (GPU). Without such resources, one needs hours simply to train a neural network model, which computes too many features, most of them not object-oriented, making the model prone to overfitting. However, since artificial intelligence (AI) currently focuses on replicating or simulating human intelligence in machines, the incorporation of a multimodal concept (such as combining machine learning and deep learning techniques) may produce better FER than the typical models and sub-processes.

Machine learning classifiers like SVM, KNN, and Tree cannot extract features automatically from raw images as a Neural Network (NN) does. Many other classifiers, such as Principal Component Analysis (PCA), the Extreme Learning Machine (ELM), and Conditional Random Fields (CRF), can also be used to classify facial emotion. However, these classifiers need a state-of-the-art descriptor to extract a feature set from natural images before classification into different classes. A wide range of methods and innovations have been tested by many researchers to find the best way to classify human expression. Features for FER are generally extracted with appearance-based methods like the local binary pattern (LBP), local derivative pattern (LDP), and geometric local binary pattern (GLBP), and with geometric methods like the histogram of oriented gradients (HOG), salient facial patches, a classifier for salient areas on faces [7], the local binary pattern from three orthogonal planes (LBP-TOP) [8], a local texture coding operator [9], and differential geometry. For instance, with the appearance-based LBP, Zhang et al. applied a new method named Multi-resolution Histograms of Local Variation Patterns (MHLVP) on Gabor wavelets [10] and obtained a very impressive outcome on the Facial Recognition Technology (FERET) dataset; however, the computational complexity and feature dimensionality were excessive. One of LBP's universal drawbacks stems from its small 3 × 3 neighborhood, which cannot capture dominant features with large-scale structures [11–16]. Zhao and Pietikäinen extended the LBP operator to the spatiotemporal space and named it the volume local binary patterns model [17], which has been widely adopted for capturing dynamic features by rotating and concatenating different methods; however, it was evaluated on a single dataset, so its accuracy may fall for blurred images. To move beyond regularly filtered images and handle feature extraction with noise and partial occlusions, a combined method of the histogram of oriented gradients (HOG) with the uniform local ternary pattern (U-LTP) [18] was described, which also provides a good filtering process. More discriminative features in higher-order derivative directions were captured by the LDP [19], which improved on LBP; however, it is mostly limited to the surrounding eight pixel values and disregards larger-scale relations.

Along with LBP, many geometric methods have also been used in FER. Images are partitioned into blocks and sub-blocks, and an active appearance model was used to reveal the essential facial portions, from which differential geometric features were extracted [20]; this approach achieves higher FER accuracy than static geometric features and also provides valuable geometric data across the time sequence of facial expression images. For non-frontal images, cases with out-of-plane head rotations were handled using the rotation-reversal-invariant histogram of oriented gradients [21], which has reduced time complexity and improved the cascade learning model to collaborate with the classification technique. Tsai and Chang applied the Gabor filter, the discrete cosine transform, and the angular radial transform [22] to build hybrid features, consolidating them with self-quotient image (SQI) filters to improve FER accuracy under different lighting conditions. Typically, some non-face images appear in the examination, so it is essential to include a non-face class among the expression categories, which is not clarified there. The purpose of facial representation is to derive a set of features from the original face images that describes faces effectively; it should minimize within-class variations of expressions while maximizing between-class differences. In general, geometric methods need very well-structured facial images, and in practice it is usually not possible to capture images textured well enough for geometric methods.

In addition to the many geometric and appearance-based methods, there are further methods like the response method [23], which extracts features from directional texture and number patterns and whose performance has been tested in constrained and unconstrained situations. Researchers have not been limited to static features only; other methods extract dynamic and multilevel features [24], which are coordinated into an end-to-end network so that they complement one another seamlessly. Moreover, to solve the small sample size (SSS) issue, a novel directional multilinear independent component analysis (ICA) technique was demonstrated in [25], which addresses the dimensionality problem by encoding the input image or high-dimensional data array as a general tensor. A different methodology for facial expression analysis is its use in the Human-Computer Interaction (HCI) context [26], where the task is decomposed into smaller micro-decisions that are made separately by dedicated binary classifiers, giving higher accuracy for the overall model. Besides the above-described methods, some methods are also used for the detection of real-time expressions, such as embedded systems [27], Radon Barcodes [28], and many more. Classifiers take the characteristic features produced by the above strategies as inputs; however, a classifier's performance depends on the quality of the feature vectors. A summary of a few recent works in the field of FER is shown in Table 1.


**Table 1.** Key information on some similar recently studied methods on facial expression recognition (FER).

In light of the information mentioned above, one can observe a non-negligible limitation, especially in typical appearance-based LBP methods. Therefore, this study proposes a feature extraction method based on a new extended LBP, the "Multi-Scale Featured Local Binary Pattern", which can be used not only in FER but also for various other image analysis purposes. Since automatic facial expression recognition involves two significant aspects, facial representation and classifier design, this study utilizes four machine learning classifiers: SVM, KNN, Tree, and Quadratic Discriminant Analysis. Many datasets are available in the literature, for example, the Japanese Female Facial Expression (JAFFE), Chinese Academy of Sciences Institute of Automation (CASIA), Static Facial Expressions in the Wild (SFEW), Chinese Academy of Sciences Micro-expression-II (CASME), Spontaneous Micro-expression (SMIC), and Acted Facial Expressions in the Wild (AFEW) datasets. However, we used two well-known facial image datasets, the Extended Cohn–Kanade Dataset (CK+) and the Karolinska Directed Emotional Faces (KDEF), to verify our proposed method. Note that the Extended Cohn–Kanade Dataset (CK+) [29] is an extended version of Cohn–Kanade (CK) [30] and finds greater use in developing and evaluating facial expression analysis algorithms. It samples the expression space better than the CK dataset and includes 304 labeled videos with 5521 frames of test subjects from various ethnicities in ages ranging from 18 to 50.

On the other hand, the KDEF dataset helps assess emotional content and appraise intensity and arousal scales. Moreover, it contains a validated set of emotionally expressive full-face images. More details about these datasets are shown in Table 2, and some sample faces are shown in Figure 1.


**Table 2.** Used datasets in the proposed method.

**Figure 1.** Sample face image from Extended Cohn–Kanade (CK+) and Karolinska Directed Emotional Faces (KDEF) datasets.

#### **2. Contribution**

Based on the available literature, we observed that the prediction accuracy falls when images are poorly textured or blurred. Thus, we propose a new feature extraction process that makes the texture of an image more machine-readable, converts each sub-region using the 58-bin uniform LBP mapping, and yields a classifier-friendly feature vector, tested on four machine learning classifiers. In this research, we used images taken from three different angles, in which every participant was asked to evoke the intended emotion and make the expression sharp and clear. The main contribution over the global LBP method is the computation of a bitwise AND between two neighborhood-derived pixel values, obtained after applying two proposed kernel matrices, to capture the relation between them. We validate this method on facial expression detection from images, a task that relies heavily on image texture.

This manuscript is organized as follows: the proposed method is presented in Section 3, including pre-processing (Section 3.1), feature extraction (Section 3.2), and normalization (Section 3.3); the result analysis is discussed in Section 4, and the conclusion is given in Section 5.

#### **3. Proposed Method**

#### *3.1. Pre-Processing*

Since color images are sensitive to lighting variations, all images were converted into grayscale, which contains only shades of gray. The conversion used Equation (1), where *r* is the pixel value of the red channel, *g* of the green channel, and *b* of the blue channel.

$$
\text{gray} = 0.3r + 0.59g + 0.11b \tag{1}
$$

The grayscale image may still contain a useless background, which increases the computational complexity and can mislead the accuracy. From the raw images of the CK+ and KDEF datasets, it was observed that the images measure 640 × 490 and 562 × 762 pixels on average, respectively. Therefore, for better results and lower complexity, the facial part was detected in the whole image and cropped with the Haar cascade frontal face detector based on the Viola-Jones detection algorithm, which precisely detects faces and then crops and resizes them to 100 × 100 pixels. Each image was then compared against a 5 × 5 grid of cells, and it was observed that the key facial portions, such as the eyes, nose, and lips, lie within the central 3 × 3 cells (60 × 60 pixels). Therefore, to avoid the unnecessary parts, we cropped the image to these 3 × 3 cells, as shown in Figure 2.

**Figure 2.** Pre-processing steps.
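For illustration, the pre-processing stage can be sketched with OpenCV as follows; the cascade file, detector parameters, and crop arithmetic are assumptions based on the cell sizes stated above, not the authors' exact code.

```python
import cv2

# Sketch of the pre-processing stage: grayscale conversion, Viola-Jones
# face detection, resizing to 100 x 100, and cropping the central
# 3 x 3 cells (60 x 60 pixels) of a 5 x 5 grid.
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess(image_bgr):
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)  # ~0.3r + 0.59g + 0.11b
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                                # no face detected
    x, y, w, h = faces[0]                          # first detected face
    face = cv2.resize(gray[y:y + h, x:x + w], (100, 100))
    return face[20:80, 20:80]                      # central 3 x 3 of a 5 x 5 grid
```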

After detecting and cropping the images, the unsharp masking kernel [31] (shown in Figure 3) was used to sharpen the edges via Equation (2), which reduces some noise and gives a brighter look. Sharpening the images is essential for better capturing and conveying local grayscale change data through the contrast between individual points; it utilizes the weighted differences in the eight directions as the local shade change data, which would otherwise be noise- and light-sensitive and lack robustness. The sharpening kernel was applied in a sliding-window manner, moving one pixel at a time.

$$S(x, y) = \sum_{i=-2}^{2} \sum_{j=-2}^{2} K(i, j) \times M(x - i, y - j) \tag{2}$$

where *K* is the kernel in Figure 3, *M* contains the pixel values of the given image, and *S*(*x*, *y*) is the central pixel value, which creates a sharpened image. The unsharp masking kernel was chosen in this study because, among many kernel variants, it provides a good texture output in the pixel values of different image datasets.


**Figure 3.** Unsharp masking kernel.
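Since Figure 3 itself is not reproduced here, the following minimal sketch applies Equation (2) with a commonly used 5 × 5 unsharp masking kernel; the kernel values are an assumption in place of the paper's exact ones.

```python
import numpy as np
from scipy.signal import convolve2d

# A widely used 5 x 5 unsharp masking kernel (assumed; the paper's exact
# kernel is given in its Figure 3).
UNSHARP = -(1 / 256.0) * np.array([
    [1,  4,    6,  4, 1],
    [4, 16,   24, 16, 4],
    [6, 24, -476, 24, 6],
    [4, 16,   24, 16, 4],
    [1,  4,    6,  4, 1],
])

def sharpen(image):
    # Equation (2): S(x, y) = sum_{i,j} K(i, j) * M(x - i, y - j)
    out = convolve2d(image.astype(float), UNSHARP, mode="same", boundary="symm")
    return np.clip(out, 0, 255).astype(np.uint8)
```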

#### *3.2. Feature Extraction*

In this study, a method was developed for extracting features from an image to identify emotions. We rely not only on the shadow effect of the grayscale images but also on a new kernel-based method that enhances the shadow effect to extract features that are flexible and classifier-friendly. We propose two kernels on the LBP of an image to be more precise about the shadow and light effects on the facial parts, which chiefly determine the face's emotional state. In this step, the pre-processed image is passed through the serial process shown in Figure 4, and the features are finally obtained using the algorithm indicated in Figure 5.

**Figure 4.** Feature extraction process.

Generally, LBP(P, R) is computed at one radius over eight directional coordinates of the matrix values, where P is the number of pixels to be considered and R is the radius from the central pixel. However, we used two LBPs (LBP(8, 1) and LBP(8, 2)) and applied two kernel matrices to calculate the central pixel of each cell. In the first stage, the image is divided into sub-cells: 3 × 3 for LBP(8, 1) and 5 × 5 for LBP(8, 2), with the two proposed kernels. A sample 3 × 3 image segment is shown in Figure 6a, and the model for the first kernel is shown in Figure 6b, where each matrix is a 45° rotation and the central matrix is the 3 × 3 cell of the pre-processed image. Considering that *S*<sub>1</sub> denotes the grey values of the pixel points in the 3 × 3 neighborhood of the pre-processed image, and the kernel values over the pixel points in the area are *K*<sub>1</sub>, the central pixel can be obtained by applying the first rotation kernel with Equation (3).

$$G(x, y) = \sum_{i=-1}^{1} \sum_{j=-1}^{1} K_1(i, j) \times S_1(x - i, y - j) \tag{3}$$


```
Input: pre-processed image segment S with the rotational kernel sets K1 (3 × 3) and K2 (5 × 5)
Q_D = 0, R_D = 0
For d = 1 to 8                        // eight 45° kernel rotations
    Q = 0, R = 0
    For i = 1 to 3
        For j = 1 to 3
            Q = Q + S(i,j) × K1(d,i,j)
        End For
    End For
    C = Q > 0 ? 1 : 0                 // encode the sign as a binary digit
    Q_D = Q_D + C × 2^(d-1)
    For i = 1 to 5
        For j = 1 to 5
            R = R + S(i,j) × K2(d,i,j)
        End For
    End For
    C = R > 0 ? 1 : 0
    R_D = R_D + C × 2^(d-1)
End For
Central_Pixel = Q_D AND R_D           // bitwise AND of the two codes
Output: Central pixel value of the given portion of the image
```

**Figure 5.** Algorithm for computing the central pixel value from the two rotational kernels.

**Figure 6.** (**a**) Sample image segment of 3 × 3; (**b**) description of the LBP(8, 1) kernel values.

Here, *K*<sub>1</sub> comprises eight rotational kernels, each rotated by 45°. Therefore, Equation (3) was applied eight times to obtain the values q<sub>0</sub> to q<sub>7</sub> in Figure 7; G(x, y) is the central pixel value, which forms the pixel matrix for the first kernel. After the calculation shown in Figure 7, converting each positive value to 1 and each negative value to 0, we obtain the central decimal pixel value. Using the sample image segment in Figure 6a, we applied Equation (3) to illustrate the calculation of the central pixel matrix values q<sub>0</sub> to q<sub>7</sub> (as shown in Figure 7). The same procedure was followed with the 5 × 5 image segment and the kernel shown in Figure 8 to find the central pixel matrix of Figure 9.


 

**Figure 7.** Calculation of LBP(8, 1).

**Figure 8.** Description of the LBP(8, 2) kernel values.


**Figure 9.** Calculation of LBP(8, 2).

The model for the second kernel is shown in Figure 8, where each matrix is a 45° rotation, and the central matrix is the 5 × 5 cell of the pre-processed image. Again, assuming that *S*<sub>2</sub> denotes the grey values of the pixel points in the 5 × 5 neighborhood of the pre-processed image, and the kernel values over the pixel points in the area are *K*<sub>2</sub>, the value of the central pixel can be obtained by applying the second kernel with Equation (4).

$$H(x, y) = \sum_{i=-2}^{2} \sum_{j=-2}^{2} K_2(i, j) \times S_2(x - i, y - j) \tag{4}$$

Similarly, kernel *K*<sub>2</sub> has eight rotations of 45° each, giving the values q<sub>0</sub> to q<sub>7</sub> in Figure 9. *H*(*x*, *y*) is the central pixel, which forms the pixel matrix for the second kernel. Once again, converting each positive value to 1 and each negative value to 0, we acquire the central decimal pixel value shown in Figure 9.

In the final stage, we applied the bitwise AND of *G*(*x*, *y*) and *H*(*x*, *y*), where the binary output value of the model is determined using Equation (5), which captures the local change data between the center point and the eight neighborhood pixels. It counts the number of spatial transitions from 0 to 1 or 1 to 0. In this stage, the equation is as follows:

$$BM(x, y) = \left(\sum_{i=-1}^{1} \sum_{j=-1}^{1} K_1(i, j) \times S_1(x - i, y - j)\right) \operatorname{AND} \left(\sum_{i=-2}^{2} \sum_{j=-2}^{2} K_2(i, j) \times S_2(x - i, y - j)\right) \tag{5}$$

Simplifying Equation (5) as:

$$BM(x, y) = G(x, y) \operatorname{AND} H(x, y)$$

where *BM*(*x*, *y*) is the binary matrix, whose entries are 1 if *G*(*x*, *y*) = *H*(*x*, *y*) = 1 and 0 if either *G*(*x*, *y*) or *H*(*x*, *y*) is 0.

We then apply a weighting condition to find the central pixel of the output cell in decimal form, as given in Equation (6).

$$MSLBP(x_c, y_c) = \sum_{n=0}^{7} BM(w_n)\, 2^n \tag{6}$$

where *w*<sub>*n*</sub> corresponds to the neighboring binary value of the eight surrounding pixels of the binary matrix *BM*, and *MSLBP*(*x*<sub>*c*</sub>, *y*<sub>*c*</sub>) is the final central decimal pixel value.
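To make the per-patch computation concrete, the following minimal sketch implements the Figure 5 algorithm for a single 5 × 5 patch; the eight rotated kernels are passed in as parameters because their exact values appear only in Figures 6b and 8.

```python
import numpy as np

def msflbp_center(patch5, K1_rots, K2_rots):
    """Compute one MSFLBP central pixel value (Equations (3)-(6)).

    patch5  : 5 x 5 grayscale patch; its inner 3 x 3 is used for LBP(8,1).
    K1_rots : eight 3 x 3 kernels, the 45-degree rotations of Figure 6b.
    K2_rots : eight 5 x 5 kernels, the 45-degree rotations of Figure 8.
    """
    inner3 = patch5[1:4, 1:4]
    q_code = r_code = 0
    for d in range(8):                       # eight 45-degree rotations
        q = np.sum(K1_rots[d] * inner3)      # Equation (3)
        r = np.sum(K2_rots[d] * patch5)      # Equation (4)
        q_code |= int(q > 0) << d            # positive -> 1, else 0
        r_code |= int(r > 0) << d
    return q_code & r_code                   # bitwise AND, Equations (5)-(6)
```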

After calculating the *MSLBP* matrix, we divided the whole image into 6 × 6 = 36 cells and mapped each cell's values to the uniform local binary pattern (*ULBP*) by Equation (7). For *ULBP*, each cell pattern maps to a 58-bin histogram. *ULBP* has 58 unique values, so we convert the *MSLBP* pixel matrix to a one-dimensional array by mapping pixel values to *ULBP* values; a cell's 255 possible values are thereby reduced to 58 through *ULBP*.


$$FV = \text{ULBP}\big(\text{MSLBP}(x, y)\big) \tag{7}$$


where *FV* is the feature vector and *ULBP* is the array of mapping values; *MSLBP*(*x*, *y*) is the pixel value of the image, which is used as an index.
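A minimal sketch of the 58-bin uniform mapping in Equation (7), assuming the standard definition of uniform patterns (at most two circular 0/1 transitions); how non-uniform codes are handled is not specified in the paper, so skipping them here is an assumption.

```python
import numpy as np

def transitions(code):
    # Number of 0/1 changes in the circular 8-bit pattern.
    bits = [(code >> i) & 1 for i in range(8)]
    return sum(bits[i] != bits[(i + 1) % 8] for i in range(8))

# The 58 uniform 8-bit patterns (at most two transitions).
UNIFORM = sorted(c for c in range(256) if transitions(c) <= 2)
ULBP_INDEX = {c: i for i, c in enumerate(UNIFORM)}  # code -> bin 0..57

def cell_histogram(cell):
    # 58-bin histogram of one MSLBP cell, per Equation (7).
    hist = np.zeros(58)
    for code in cell.ravel():
        if int(code) in ULBP_INDEX:
            hist[ULBP_INDEX[int(code)]] += 1
    return hist
```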

For one image, neighboring pixels are generally correlated; thus, the binary sequences of MSLBP(p, r) at various radii can be read as described above. After evaluating all values from left to right, we obtained a binary pattern for every cell of the image. Taking all weighted values into account, we found a decimal number in symmetric neighbor sets for the various coordinates (x, y). The grey values of neighbors that do not fall exactly at the focal positions of the matrices can be estimated by interpolation. After that, we computed one histogram for each cell and then concatenated all those histograms into one linear histogram, as shown in Figure 10. The result is a two-dimensional matrix for the images of the seven classes, where rows represent the image index and columns represent the features. This long concatenated histogram is the initial feature vector, containing considerable noise and mismatched values within a class. We normalized the histogram data to solve this problem, which yields better accuracy in validation test cases compared with the original feature vectors.

**Figure 10.** Converting process of selected features into a histogram.

#### *3.3. Normalization*

Due to the large number of images with different expressions and features, it is challenging to maintain continuity among the classes. Therefore, normalization becomes mandatory to keep the data within a range of values so that each class retains some consistency. We used Generalized Procrustes Analysis (GPA) [32] for normalization in our proposed method. It takes the data of each level individually and utilizes a measure of variance. The GPA generates a weighting factor by analyzing the differences between the scaling factors applied to respondent scale usages and individual scale usage. As a result, the distance between the values of different classes is increased. Initially, the happy class's data are scattered, as shown in Figure 11a (before normalization); afterwards, the samples move closer to each other, as shown in Figure 11b (after normalization). In brief, the GPA takes all the features and reduces their fluctuation, so that all values of related emotional states end up at a closer level, which makes classification more precise as the variance between different classes increases.

**Figure 11.** (**a**) Regular data, (**b**) Normalized data. Axis values are two feature values before (**a**) and after (**b**) normalization.
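As an illustration of this step, the following simplified sketch centers and unit-scales each feature vector; it is a reduced Procrustes-style normalization, not necessarily the exact GPA variant of [32].

```python
import numpy as np

def procrustes_normalize(features):
    # features: (n_samples, n_features) concatenated-histogram matrix.
    # Center each row and scale it to unit norm so that all samples
    # share a comparable scale before classification.
    Y = features - features.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(Y, axis=1, keepdims=True)
    return Y / np.where(norms == 0, 1, norms)
```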

#### **4. Results and Discussion**

#### *4.1. Performance Analysis of the Proposed Method*

We tested our proposed method on the CK+ and KDEF datasets. These datasets are among the most widely used for facial expression recognition, and each includes seven different facial expression labels or classes. We used several machine learning classifiers, namely K-Nearest Neighbors (KNN), Binary Tree, Quadratic Discriminant Analysis (QA), and the Support Vector Machine (SVM), as shown in Figure 12. Among them, SVM gives the highest testing accuracy, which is shown in the confusion matrices for both datasets' test sets, following the 80-20 train-test split rule, in Tables 3 and 4, respectively. From the CK+ dataset, almost 6000 images were used for training and 2000 for validation and testing, and from the KDEF dataset, almost 2900 images were used for training and 1000 for validation and testing. Ten-fold cross-validation was used with all four classifiers. All values are shown as percentages (%).

**Figure 12.** Test accuracy (%) of the four classifiers. KDEF (KNN: 38.09, QA: 64.52, Tree: 68.33, SVM: 89.05); CK+ (KNN: 83.05, QA: 88.70, Tree: 96.01, SVM: 99.12).
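A minimal scikit-learn sketch of the classifier comparison with 10-fold cross-validation; the hyperparameters are library defaults, not necessarily those used in the paper.

```python
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def compare_classifiers(features, labels):
    # features: normalized MSFLBP vectors; labels: seven expression classes.
    models = {
        "SVM": SVC(),
        "KNN": KNeighborsClassifier(),
        "Tree": DecisionTreeClassifier(),
        "QA": QuadraticDiscriminantAnalysis(),
    }
    for name, model in models.items():
        scores = cross_val_score(model, features, labels, cv=10)
        print(f"{name}: {100 * scores.mean():.2f}%")
```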

The precision, recall, and F1 score of the CK+ and KDEF datasets for SVM show the excellent structure of the outcome. To find these values, we first have to analyze the confusion matrix. When the actual class is positive and the predicted class is also positive, it is counted as a True Positive (TP). When the actual class is negative and the predicted class is also negative, it is counted as a True Negative (TN). Along with these, if the actual class is positive but predicted as negative, it is counted as a False Negative (FN), and if the actual class is negative but predicted as positive, it is counted as a False Positive (FP).


**Table 3.** Confusion matrix of the CK+ dataset (SVM).

**Table 4.** Confusion matrix of the KDEF dataset (SVM).


Precision: the ratio of *TP* to the total positive predictions. High precision means less classification error.

$$Precision = TP/(TP + FP)$$

Recall: the ratio of *TP* to the total actual positive classes.

$$Recall = TP/(TP + FN)$$

*F*1 Score: the *F*1 Score is sometimes more useful than accuracy. It is the weighted average (harmonic mean) of Precision and Recall. The *F*1 Score is important here because the classes are unevenly distributed.

$$F1\ \text{Score} = 2 \times (\text{Precision} \times \text{Recall}) / (\text{Precision} + \text{Recall})$$
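These metrics can be computed per class directly from a confusion matrix such as those in Tables 3 and 4; the sketch below assumes rows index the true class and columns the predicted class.

```python
import numpy as np

def per_class_metrics(cm):
    # cm[i, j]: samples of true class i predicted as class j.
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp          # predicted as the class but wrong
    fn = cm.sum(axis=1) - tp          # members of the class that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```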

Table 5 shows the precision, recall, and *F*1 score for both datasets. We present them comparatively in Figures 13 and 14 for the CK+ and KDEF datasets across all the cross-validation folds. Values are shown for the SVM classifier because it has the highest accuracy.

**Table 5.** Pre (Precision), Rec (Recall), F1 (F1 Score) shown for dataset CK+, and KDEF. Values are shown for the Support Vector Machine (SVM) classifier for seven classes.



**Figure 13.** Dataset: CK+, Precision, Recall and F1 score shown for SVM.

**Figure 14.** Dataset: KDEF, Precision, Recall, and F1 score is shown for SVM.

#### *4.2. Analyses and Discussion of Results*

Throughout this study, it is observed that the classical LBP works on every pixel by comparing it with its eight surrounding 3 × 3 neighbors, subtracting the center pixel value from each. The resulting negative values are encoded as 0, the others as 1. Finally, the encoded binary value is converted to decimal to obtain the center pixel value. Among the recent variants of LBP, the extended local binary patterns (ELBP) [15] operator, for example, not only performs the binary comparison between the center pixel and its neighbors but also encodes their exact grey-value differences (GDs) using some extra binary units. The completed modeling of the local binary pattern (CLBP) [16] includes both the signs and the GDs between a given center pixel and its neighbors to improve the discriminative power of the original LBP operator. Both strategies use LBP(8,1) and compare the absolute value of the GD with the given central pixel again to create an LBP-like code. In Ref. [8], the authors first used the optical flow technique to obtain the Necessary Morphological Patches (NMPs) of micro-expressions; then, they calculated LBP-TOP operators, cascading them with optical flow histograms to build fused features of dynamic patches. The local texture coding operator [9] enhances real-time system performance, utilizing four directional gradients on 5 × 5 grids to reduce sensitivity to noise. In Ref. [28], the authors present a monitoring framework for children using features such as LBP/LTP/red blood cell (RBC), which realizes an automatic pain detection system accessible through wearable or mobile devices. A weighted fusion strategy [5] was proposed to fully utilize the features extracted from various image channels with a partial Visual Geometry Group network called VGG16; moreover, that method can evolve to extract image features automatically, given the absence of effective pre-trained models based on LBP. The classical LBP and its variants utilize pixel values at different radii, but the relationships among them are missing. In this study, we have supplied this missing relational information among pixel values of varying radii. The image is divided into sub-cells, 3 × 3 for LBP(8, 1) and 5 × 5 for LBP(8, 2), with two proposed kernels with 45° rotations. After applying these kernels, a bitwise AND operation is performed on the resulting matrices to establish the relation between the different radii. Moreover, in pre-processing, we used the unsharp masking kernel to obtain a sharp image so that the intensity of the pixel values can be more accurate. Compared with neural network models, our method is a core algorithm for extracting features, whereas a neural network like a CNN is a stack of hidden layers that extracts features automatically. Even though the latest neural network models are useful in the FER process, they still show unavoidable limitations. Different features like AAM/Action Units (AUs) [33] and Active Appearance Model (AAM)/Gabor [34] have been used on the CK+ dataset, and other features like Gabor [35] and Facial Landmarks [36] on the KDEF dataset, all attaining accuracies much lower than ours. However, it can be expected that combining a neural network with our core algorithm to classify expressions might provide much higher efficiency on the other available standard FER datasets. A lot of ready-made software, such as the Noldus network with FaceReader 8 [37] and the Microsoft Emotion API [38], is available to obtain facial expressions easily from an image or live video. Noldus FaceReader 8, besides FER, also performs the detection of age, gender, ethnicity, facial hair, and glasses; in doing so, a 3D model is created using the Active Appearance Method (AAM), and an artificial neural network is used for training and classification. On the other hand, the Microsoft Emotion API is a C# client-side library suitable for use as a third-party API for detecting facial expressions in different projects under Microsoft Azure Cognitive Services. This API is licensed under the Massachusetts Institute of Technology (MIT) license, and the backend image processing model is developed and maintained by Microsoft. A direct comparison between Noldus FaceReader 8, the Microsoft Emotion API, and our work is not feasible, as they are in fact software products, whereas ours is a research method, MSFLBP. Moreover, only very little information is available on the methods, algorithms, and test results behind their FER models.

The outcome of SVM on the proposed MSFLBP method is shown in Table 6, compared with some of the most recent state-of-the-art methods. It demonstrates that the proposed feature extraction method outperforms the most recent state-of-the-art methods.


**Table 6.** Results of reviewed works for static image approaches (values are in %).

#### **5. Conclusions**

This study demonstrates an improvement in recognition rate relative to the calculation time of facial expression recognition methods. For the classification performance, we used two notable datasets, CK+ and KDEF, and analyzed, over a set of cell sizes and numbers of direction bins, the exact characterization of the seven fundamental universal expressions. We used an unsharp masking kernel to sharpen the raw images. Then, we applied the two kernels, performed a bitwise AND on both binary matrices, and converted the final binary matrix into a central decimal pixel value. After that, we divided the output image into cells and mapped each cell with the ULBP mapping to obtain the features as a histogram. By concatenating the values assigned to all cells, we finally obtained the feature vector, which was then trained and tested with four classifiers using 10-fold cross-validation. Among them, SVM provides the best outcome. In this study, the traditional LBP method's limitations are overcome by applying a bitwise AND on two rotational kernels, solving the pixel variance limitations. We analyzed the neighboring pixel relations of traditional LBP and designed two kernels, 3 × 3 and 5 × 5, for obtaining the central pixel values, after which a bitwise AND was applied to relate the output central pixels of the two kernels. The described method can improve the performance of other texture recognition tasks, can be utilized in real-world applications with non-intrusive low-resolution imaging, and still accomplishes considerable accuracy. Benefits of the described method include precise feature extraction capability with low complexity, better prediction efficiency, and lower data storage requirements. The addition of more datasets from different geographical regions could improve the real-time FER process. Combined methods like LBP-CNN could be used to identify augmented images.

**Author Contributions:** Conceptualization, M.B.; methodology, S.Y. and R.K.P.; software, S.Y. and R.K.P.; validation, M.B.; formal analysis, R.K.P.; S.Y.; and M.B.; investigation, S.Y. and R.K.P.; resources, M.B.; data curation, S.Y.; R.K.P.; and M.B.; writing—original draft preparation, S.Y.; R.K.P.; and M.B.; writing—review and editing, M.U.K.; visualization, M.U.K.; supervision, M.B.; project administration, M.R.I.F. and M.B.; funding acquisition, M.R.I.F. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the Research Universiti Grant, Universiti Kebangsaan Malaysia, Dana Impak Perdana (DIP), code: 2020-018.

**Acknowledgments:** The authors are grateful to the Department of Computer Science and Engineering, BGC Trust University Bangladesh, and the International Islamic University Chittagong, Bangladesh, for providing the facilities to conduct this research work.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **Face and Body-Based Human Recognition by GAN-Based Blur Restoration**

#### **Ja Hyung Koo, Se Woon Cho, Na Rae Baek and Kang Ryoung Park \***

Division of Electronics and Electrical Engineering, Dongguk University, 30 Pildong-ro 1-gil, Jung-gu, Seoul 04620, Korea; koo6190@dongguk.edu (J.H.K.); jsu319@dongguk.edu (S.W.C.); naris27@dongguk.edu (N.R.B.)

**\*** Correspondence: parkgr@dongguk.edu; Tel.: +82-2-2260-3329; Fax: +82-2-2277-8735

Received: 21 July 2020; Accepted: 11 September 2020; Published: 14 September 2020

**Abstract:** Long-distance recognition methods in indoor environments are commonly divided into two categories, namely face recognition and face and body recognition. Cameras are typically installed on ceilings for face recognition; hence, it is difficult to obtain a frontal image of an individual. Therefore, in many studies, the face and body information of an individual are combined. However, the distance between the camera and an individual is closer in indoor environments than in outdoor environments, so the face information is distorted by motion blur. Several studies have examined the deblurring of face images, but there is a paucity of studies on the deblurring of body images. To tackle the blur problem, a recognition method is proposed wherein the blur of body and face images is restored using a generative adversarial network (GAN), and the features of face and body obtained using a deep convolutional neural network (CNN) are used to fuse the matching scores. The database developed by us, Dongguk face and body dataset version 2 (DFB-DB2), and the ChokePoint dataset, which is an open dataset, were used in this study. The equal error rates (EER) of human recognition in DFB-DB2 and the ChokePoint dataset were 7.694% and 5.069%, respectively. The proposed method exhibited better results than the state-of-the-art methods.

**Keywords:** multimodal human recognition; blur image restoration; DeblurGAN; CNN

#### **1. Introduction**

Currently, there are several modalities for human recognition, including the face, iris, fingerprint, finger-vein, and body. However, long-distance face recognition in indoor and outdoor environments is still limited. Human recognition methods can be largely divided into face-, body-, and iris-based approaches. However, there are problems with the face and iris recognition methods: the original images can be damaged by motion blur or optical blur, which is generated when images of the human face or iris are captured from a long distance, and the recognition performance is significantly degraded by these types of damage. To solve this problem, the human body is typically used for long-distance recognition in indoor and outdoor environments.

The data can still contain blur when the human body is used for recognition; however, body recognition is less affected than face or iris recognition. There are two methods for human body recognition: gait recognition of an individual, and texture- and shape-based body recognition, which is based on a still image of the human body. Gait recognition does not exhibit a blur problem; however, the time required for forming the dataset is long because continuous image acquisition is required. Thus, our experiment was conducted indoors using still images of the human body.

There are disadvantages to human body recognition in an indoor environment. The color of clothes significantly affects the recognition performance. Thus, the human body is divided into two parts to evaluate the recognition performance. In several studies, the body and face have been separated. However, blur restoration of the obtained data has never been performed before.

The method proposed in this study involves restoring the images of human body and face with a blur via a generative adversarial network (GAN). Subsequently, the features of body and face are extracted using a convolutional neural network (CNN) model. The final recognition performance is determined based on the weighted sum and weighted product, which is a score-level fusion approach, using the extracted features.

#### **2. Related Work**

Previous studies on long-distance human recognition can be divided into human recognition with or without blur restoration, and they can be further divided into single modality-based or multimodal-based methods.

#### *2.1. Without Blur Restoration*

Single modality-based methods include face recognition, body recognition based on texture, and body recognition based on gait. Several extant studies have been conducted on face recognition. Grgic et al. [1] obtained face data from three designated locations using five cameras. The recognition performance was determined based on principal component analysis (PCA) of the obtained face data. Banerjee et al. [2] used three types of datasets, namely FR\_SURV, SCface, and ChokePoint, for the experiment. The recognition was performed through soft-margin learning for multiple feature-kernel combination (SML-MKFC) with domain adaptation (DA). The drawback of face recognition is that facial information is vulnerable to noise, such as blur. There are important features in a face, such as nasal bridge, eyebrow, and skin color, for recognizing an individual. The visibility of facial features is reduced when important features are combined with noise, such as a blur, thereby interfering with face recognition.

Most of the body recognition methods are gait-based, while others are texture- and shape-based. For gait-based recognition, Zhou et al. [3] obtained data using two methods, original side-face image (OSFI) and gait energy image (GEI) fusion, as well as enhanced side-face image (ESFI) and GEI fusion, and performed recognition based on PCA and multiple discriminant analysis (MDA). Gait-based recognition is less affected by noise, such as blur, because several images of an individual's gait are cropped based on the difference image of the background and object, and the difference images are compressed into a single image. However, an extensive amount of time and data is required to obtain sufficient gait information. For texture- and shape-based body recognition, Varior et al. [4] used the Siamese CNN (S-CNN) architecture. Nguyen et al. [5] obtained image features using AlexNet-CNN and then evaluated recognition using PCA and a support vector machine (SVM). Shi et al. [6] used the S-CNN architecture reported in an extant study [4]; however, they used five convolution blocks, and a discriminative deep metric learning (DDML) method was used in that study. This approach is not significantly affected by blur because the subject's body information is included. However, the color of the clothes worn by the subject comprises a large portion of the body information; hence, the recognition performance is drastically reduced if the color of the clothes is similar to that of another subject being recognized.

Multimodal-based methods fall into two categories, namely face and gait-based body recognition, and face and texture- and shape-based body recognition. For face and gait-based body recognition, Liu et al. [7] measured the performance on a dataset obtained by other researchers using a hidden Markov model (HMM) and Gabor features-based elastic bunch graph matching (EBGM). Hofmann et al. [8] used eigenface calculation for face recognition and α-GEI for gait recognition. This method exhibits the same advantages and disadvantages as gait-based body recognition: the common advantage is that it is less affected by blur because a gait feature is used, and the disadvantage is that it requires a sufficiently large amount of data with continuous image motion to obtain the gait image. In a previous study [9], the human body and face were examined separately in indoor environments for face and texture- and shape-based body recognition. Visual geometry group (VGG) face net-16 for the face and residual network (ResNet)-50 for the body were used to obtain the features, and the final recognition performance was evaluated based on a score-level fusion approach using the obtained features. However, the problem of blur persists when images are obtained in indoor environments. Therefore, in the study [9], only the images without blur were used, the presence of blur being determined according to a threshold based on the method in the study [10].

#### *2.2. With Blur Restoration*

A blur is generated for two main reasons: motion blur is generated when an object moves, and optical blur is generated by the camera optics when filming the object. Thus, researchers have improved the images using a deblurring method and then proceeded with the evaluation of the recognition performance. Alaoui et al. [11] performed image blurring by applying a point spread function (PSF) to the face recognition technology (FERET) database. The images were deblurred with fast total variation (TV)-l1 deconvolution, image features were obtained using PCA, and feature matching was performed with the Euclidean distance. Hadid et al. [12] generated a blur using a PSF, then proceeded with deblurring based on deblurred local phase quantization (DeblurLPQ) and measured the recognition performance. Nishiyama et al. [13] used two types of datasets and generated an arbitrary blur using a PSF with the FERET database and the face recognition grand challenge (FRGC) 1.0; for the blur restoration, Wiener filters or bilateral total variation (BTV) regularization were used. Mokhtari et al. [14] performed face restoration using two methods, namely centralized sparse representation (CSR) and adaptive sparse domain selection with adaptive regularization (ASDS-AR). Face recognition was performed using PCA, linear discriminant analysis (LDA), kernel principal component analysis (KPCA), and kernel Fisher analysis (KFA). Heflin et al. [15] used the FERET database, wherein the face area was detected in the blurred image, motion blur and atmospheric blur were measured using a blur point spread function (PSF), and, finally, face deblurring was performed using a deconvolution filter, such as the Wiener filter, to evaluate the recognition performance. Yasarla et al. [16] proposed the uncertainty guided multi-stream semantic network (UMSN) and performed facial image deblurring; this method involves dividing the facial image into four semantic regions and deblurring the blurred image and the four divided regions via a base network (BN). Considering the aforementioned issues of previous research, we propose a recognition method in which the blur on the body and face is restored using a GAN, and the features of the body and face obtained using a deep CNN are used to fuse the matching scores.

Although they are not studies on long-distance human recognition, Peng et al. studied two challenges in clustering analysis, namely how to cluster multi-view data and how to perform clustering without parameter selection on cluster size. For this purpose, they proposed a novel objective function to project raw data into one space where the projection embraces the cluster assignment consistency (CAC) and the geometric consistency (GC) [17]. In addition, Huang et al. proposed a novel multi-view clustering method called the multi-view spectral clustering network (MvSCN), which could be the first deep version of multi-view spectral clustering [18]. To deeply cluster multi-view data, MvSCN incorporates the local invariance within every single view and the consistency across different views into a novel objective function. They also enforced and reformulated an orthogonal constraint as a novel layer stacked on an embedding network.

Table 1 shows the summary of this study and previous studies on person recognition using surveillance camera environment.


**Table 1.** Summary of this study and previous studies on person recognition using surveillance camera environment.

## **3. Contribution of Our Research**

Our research is novel in the following four ways in comparison to previous works:


#### **4. Proposed Method**

#### *4.1. System Overview*

Figure 1 shows the overall configuration of the system proposed in this study. A face image is obtained from the original image acquired in an indoor environment (step (1) in Figure 1). A body image is obtained from the original image excluding the face image (step (2) in Figure 1). The focus score of the face image is calculated (step (3) in Figure 1). An image exhibiting a focus score value of less than the threshold (step (4) in Figure 1) undergoes restoration using DeblurGAN (step (5) in Figure 1) and is combined with the images exhibiting a focus score value greater than or equal to the threshold. The restoration of the body image via DeblurGAN is conducted in the same manner. Image features of the face and body are extracted by applying a CNN model to the image set combined from the restored images and the images with a focus score greater than or equal to the threshold (steps (6) and (7) in Figure 1). The authentic/imposter matching distance is calculated using the feature vectors obtained above (steps (8) and (9) in Figure 1). Score-level fusion is conducted using the matching distance (step (10) in Figure 1); the weighted sum and weighted product methods were used for the score-level fusion in this study. The final recognition rate was measured using the score-level fusion (step (11) in Figure 1).

**Figure 1.** Overall procedure of proposed method.
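A minimal sketch of the two score-level fusion rules named in step (10); the weight `w` and treating the matching scores as plain floats are illustrative assumptions.

```python
def weighted_sum(face_score, body_score, w=0.5):
    # Weighted-sum fusion of the face and body matching scores.
    return w * face_score + (1 - w) * body_score

def weighted_product(face_score, body_score, w=0.5):
    # Weighted-product fusion of the face and body matching scores.
    return (face_score ** w) * (body_score ** (1 - w))
```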

#### *4.2. Structure of GAN*

A general description of a GAN is provided in this section. GAN consists of two networks, namely generator and discriminator. Generator aims to generate a fake image similar to a real image by considering Gaussian random noise as an input, whereas discriminator aims to find the fake image by discriminating the real image from the fake image generated by the generator. Therefore, a discriminator is trained to easily discriminate real and fake images, while a generator is trained to ensure that a fake image is close to the real image to the maximum possible extent. However, it is difficult to control the desired output for vanilla GAN because the input corresponds to Gaussian random noise.

First, cycle-consistent adversarial networks (CycleGAN) [20] were considered. Unlike existing GAN models, a CycleGAN does not require paired input and target images; it uses a reference image as an input that is expected to correspond to the result for the input image. There are two types of generators in CycleGAN, based on the U-Net [21] architecture and on residual blocks; the generator considered here exhibits a residual block architecture [20]. One of the characteristics of a CycleGAN is the cycle-consistency loss: if an input image X generates an output Y through a generator, the output Y goes through the second generator to generate X′, and the cycle-consistency loss is calculated as the difference between X and X′.
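A minimal PyTorch sketch of the cycle-consistency idea described above; the generator names `G_xy` and `G_yx` and the use of the L1 distance are illustrative assumptions.

```python
import torch

def cycle_consistency_loss(x, G_xy, G_yx):
    # X -> Y' -> X'' should recover X; penalize the reconstruction error.
    x_reconstructed = G_yx(G_xy(x))
    return torch.mean(torch.abs(x - x_reconstructed))
```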

Second, Pix2pix [22] was used. Pix2pix is a GAN applying the concept of a conditional GAN (CGAN) model. The generator of Pix2pix is similar to U-Net [21]: skip-connections are applied between the encoder and decoder because a blur problem occurs through the loss of image details when the image is downsized and then enlarged again. Furthermore, DeblurGAN [23] uses the input image and target image of a CGAN as input, but it exhibits a very different architecture. The generator in DeblurGAN consists of two convolution blocks, nine residual blocks, and two transposed convolution blocks. Each convolution block contains an instance normalization layer [24] and a rectified linear unit (ReLU) layer, as shown in Table 2. Instance normalization [24] is also referred to as contrast normalization; the ReLU layer serves as an activation function in the residual blocks. The loss function of DeblurGAN combines an adversarial loss and a content loss; the total loss is calculated using Equation (1) as follows:

$$L_{\text{total}} = L_{\text{Adv}} + \lambda L_{\text{Cont}} \tag{1}$$

First, adversarial loss (*LAdv*) can be explained as follows. The adversarial loss discerns the blurred image restored via a generator by using a discriminator. In this case, the loss is considered as optimal when the difference between the loss discerned by the discriminator and the threshold value 1 is close to 0. Thus, *LAdv* used in DeblurGAN is represented in Equation (2) as follows:

$$L_{Adv} = \sum_{k=1}^{N} -D_{\theta}(G_{\theta}(I_B)) \tag{2}$$

In Equation (2), *N* denotes the number of images, *D*<sup>θ</sup> denotes the discriminator network, *G*<sup>θ</sup> denotes the generator network, and *I<sup>B</sup>* denotes a blurred image. As specified in DeblurGAN [23], Wasserstein GAN-gradient penalty (WGAN-GP) [25] was used for the adversarial loss. Next, *LCont* is explained in Equation (3).

$$L_{\text{Cont}} = \frac{1}{X_{n,m} Y_{n,m}} \sum_{i=1}^{X_{n,m}} \sum_{j=1}^{Y_{n,m}} \left( \phi_{n,m}(I_S)_{i,j} - \phi_{n,m}(G_{\theta}(I_B))_{i,j} \right)^2 \tag{3}$$

With respect to the content loss, either the L1 (mean absolute error, MAE) loss or the L2 (mean squared error, MSE) loss could be selected; however, the perceptual loss was chosen as the content loss of DeblurGAN. The perceptual loss of DeblurGAN is computed as the difference between the restored image and the target image in the conv3.3 feature maps of a VGG-19 network pretrained on ImageNet. In Equation (3), *X*<sub>*n*,*m*</sub> and *Y*<sub>*n*,*m*</sub> are the dimensions of the feature map, and φ<sub>*n*,*m*</sub> is the feature map obtained from the *m*th convolutional layer. Furthermore, *I*<sub>*S*</sub> is the target image for restoring the blurred image [23]. Tables 2 and 3 summarize the architecture of the generator and discriminator in DeblurGAN. Figure 2a,b show the architecture of the generator and discriminator in DeblurGAN, respectively.


**Table 2.** Generator of DeblurGAN. GAN = generative adversarial network.

**Table 3.** Discriminator of DeblurGAN (All convolution layers 1–5 \* indicate that they have two paddings.).


**Figure 2.** Architecture of DeblurGAN (**a**) Generator in DeblurGAN and (**b**) Discriminator in DeblurGAN.
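The losses of Equations (1)-(3) can be sketched in PyTorch as follows; the torchvision layer index for conv3.3 and the value of the weight `lam` are assumptions.

```python
import torch
import torchvision.models as models

# Truncate VGG-19 at conv3.3 (index 14 in torchvision's feature list,
# an assumption) to obtain the perceptual feature extractor.
vgg = models.vgg19(pretrained=True).features[:15].eval()
for p in vgg.parameters():
    p.requires_grad = False

def content_loss(restored, sharp):
    # Equation (3): MSE between feature maps of restored and sharp images.
    return torch.mean((vgg(sharp) - vgg(restored)) ** 2)

def total_loss(adv_loss, restored, sharp, lam=100.0):
    # Equation (1): L_total = L_Adv + lambda * L_Cont.
    return adv_loss + lam * content_loss(restored, sharp)
```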

#### *4.3. Structure of Deep Learning (VGG Face Net-16 and ResNet-50)*

The face and body images restored with DeblurGAN were fed to VGG face net-16 and ResNet-50. In our previous research [9], we compared the recognition accuracies of VGG face net-16 and ResNet-50 with those of other CNN architectures on the custom-made Dongguk face and body database (DFB-DB1), whose acquisition environment, including scenario and cameras, was the same as that of the DFB-DB2 used in this research. According to the experimental results, VGG face net-16 and ResNet-50 outperform other CNN architectures, and we therefore adopted these CNN models in our research. A pretrained model was used for both CNN types and fine-tuned based on the characteristics of the dataset used in this study.

The VGG face net-16, which was used for face images, consists of convolution filters and a neural network; specifically, 13 convolutional layers, five pooling layers, and three fully connected layers. The pretrained CNN model used in this study was trained with Labeled Faces in the Wild [26] and YouTube Faces [27]. The image restored with the GAN measured 256 × 256 pixels and was resized to 224 × 224 for fine-tuning with VGG face net-16. The resized image undergoes convolution through the convolutional layers, computed as output = (W − K + 2P)/S + 1, where W denotes the width and height of the input, K the size of the convolutional filter, P the padding, and S the stride. For example, a 224 × 224 image with a convolution filter of K = 3, P = 0, and S = 1 yields an output of (224 − 3 + 0)/1 + 1, i.e., 222.
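A one-line helper reproducing the output-size formula and the worked example above:

```python
def conv_output_size(w, k, p, s):
    # Output width/height of a convolution: (W - K + 2P)/S + 1.
    return (w - k + 2 * p) // s + 1

print(conv_output_size(224, 3, 0, 1))  # 222, matching the example
```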

There are many types of ResNet, differing in the number of convolutional layers. As the number of layers increases, the feature map of the body images becomes smaller, which can cause a vanishing or exploding gradient problem. Thus, a shortcut is used in the ResNet architecture to avoid this problem. In the shortcut, the input X goes through three convolutional layers, performing the convolution calculation three times. If the input X that has completed the convolution calculation is termed F(x), then the shortcut is the sum of the features, F(x) + X, which is then used as the input for the next convolutional layer. To reduce the convolution calculation time, 1 × 1, 3 × 3, and 1 × 1 convolutional layers were used instead of two 3 × 3 convolutional layers. This is termed the bottleneck architecture, wherein the leading 1 × 1 reduces the dimension of the input, while the trailing 1 × 1 enlarges it again.
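A minimal PyTorch sketch of the bottleneck block with the F(x) + X shortcut described above; channel sizes and activation placement are assumptions.

```python
import torch.nn as nn

class Bottleneck(nn.Module):
    # 1x1 reduce -> 3x3 -> 1x1 expand, plus the identity shortcut.
    def __init__(self, channels, reduced):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, reduced, kernel_size=1), nn.ReLU(),
            nn.Conv2d(reduced, reduced, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(reduced, channels, kernel_size=1),
        )
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.body(x) + x)   # shortcut: F(x) + X
```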

#### **5. Experimental Results and Analysis**

#### *5.1. Experiments for Database and Environment*

Two types of cameras were used in this study to acquire DFB-DB2: a Logitech BCC950 [28] and a Logitech C920 [29]. The same cameras were used for the Dongguk face and body dataset version 1 (DFB-DB1), and there was no difference between the acquisition scenarios of DFB-DB2 and DFB-DB1 [9]. However, whereas DFB-DB1 includes only images whose focus score is above the threshold defined in an extant study [10], the DFB-DB2 used here also includes images below that threshold, which were restored with DeblurGAN. Figure 3 shows the acquisition scenario of DFB-DB2: (a) shows images acquired with the Logitech BCC950 camera, and (b) shows those acquired with the Logitech C920 camera.

Table 4 summarizes the details of the face and body images of the two databases used in this study, DFB-DB2 and the ChokePoint dataset [30]. Two-fold cross validation was applied to both databases, and each dataset was divided into sub-datasets 1 and 2: if sub-dataset 1 is used for training, sub-dataset 2 is used for testing, and vice versa.


**Table 4.** Total images of DFB-DB2 and ChokePoint dataset.


**Figure 3.** Representative Dongguk face and body dataset version 2 (DFB-DB2) images captured by (**a**) Logitech BCC950 camera and (**b**) Logitech C920 camera.

The ChokePoint dataset is provided at no cost by National ICT Australia Ltd. (NICTA) and consists of Portal 1 and Portal 2. Portal 1 contains 25 individuals (19 males and 6 females), and Portal 2 contains 29 individuals (23 males and 6 females). Three cameras at six locations were used to constitute the dataset. The dataset configuration of the study [9] was maintained; in addition, images considered blurred according to the threshold value of an extant study [10] were restored with DeblurGAN and included when evaluating the recognition performance. Figure 4 shows examples from the ChokePoint dataset.

**Figure 4.** Example images for ChokePoint dataset.

#### *5.2. Training DeblurGAN and CNN Models*

#### 5.2.1. DeblurGAN Model Training Process and Results

Blurred and clear images were distinguished for training DeblurGAN based on the focus score threshold [9]. Images below the threshold were used as test images for DeblurGAN, and focused images with values greater than or equal to the threshold were used as reference images. The PyTorch implementation of DeblurGAN [31] was used. All images for training and testing DeblurGAN were resized to 256 × 256; the learning rate was 0.0001, and the batch size was 1.

#### 5.2.2. CNN Model Training Process and Results

After performing image deblurring with DeblurGAN, face images were trained with VGG face net-16 [32] and body images with ResNet-50 [33]. Because the number of samples for training each deep CNN model was insufficient, the training data were increased via data augmentation.

As shown in Table 4, data augmentation was performed only on the training data, whereas the original non-augmented data were used as test data. Because DFB-DB2 contains fewer images than the open ChokePoint dataset, a center image crop was first performed during augmentation; the cropped image was then translated and cropped by five pixels in the top, bottom, left, and right directions, and each result was horizontally flipped (mirrored). The training data processed in this way comprised 440,000 augmented images across sub-datasets 1 and 2. For the ChokePoint dataset, after the center image crop, translation and cropping were applied by two pixels in the top, bottom, left, and right directions, and horizontal flipping was applied, magnifying the data by 50 times; sub-datasets 1 and 2 in Table 4 then include a total of 1.03 million augmented images. Figure 5 shows the data augmentation method used in this study.

**Figure 5.** Data augmentation method involving (**a**) image translation and cropping and (**b**) horizontal flipping.
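A sketch of this augmentation in Python with PIL is shown below. The exact crop geometry is not specified in the text, so the box arithmetic here is an assumption (shift = 5 for DFB-DB2, shift = 2 for the ChokePoint dataset).

```python
from PIL import Image, ImageOps

def augment(img: Image.Image, shift: int = 5) -> list:
    """Center crop, four translated crops, and horizontal mirroring."""
    w, h = img.size
    crops = []
    # Centered crop plus translation crops in the four directions.
    for dx, dy in [(0, 0), (-shift, 0), (shift, 0), (0, -shift), (0, shift)]:
        crops.append(img.crop((shift + dx, shift + dy,
                               w - shift + dx, h - shift + dy)))
    # Mirroring doubles the augmented set.
    return crops + [ImageOps.mirror(c) for c in crops]
```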

Given that VGG face net-16 is pretrained on the Oxford face database, it was fine-tuned to the characteristics of the images in DFB-DB2. ResNet-50 likewise uses a pretrained model and was fine-tuned to the image databases used in this study. The learning rate was 0.0001, and the batch size was 20 for training VGG face net-16 and 15 for training ResNet-50.

Figure 6 illustrates the training loss and accuracy plots of the CNN models trained on face and body images. The specifications of the computer used for the experiments are as follows: an Intel(R) Core(TM) i7-6700 CPU @ 3.40 GHz, 16 GB of RAM, an NVIDIA GeForce GTX 1070 graphics card, and CUDA version 8.0.

**Figure 6.** Plots depicting training loss and accuracy on DFB-DB2 ((**a**)–(**d**)) and the ChokePoint dataset ((**e**)–(**h**)). Visual geometry group (VGG) face net-16: (**a**,**e**) 1st fold and (**b**,**f**) 2nd fold; ResNet-50: (**c**,**g**) 1st fold and (**d**,**h**) 2nd fold.

#### *5.3. Testing Results from DeblurGAN and CNN Model*

For comparing an original image with its deblurred counterpart, the signal-to-noise ratio (SNR) [34], peak signal-to-noise ratio (PSNR) [35], and structural similarity (SSIM) [36] can be used. However, these metrics cannot be applied to the proposed method, because the blur in the images used in this study was generated naturally during data acquisition, so no ground-truth sharp image exists, unlike the usual setting in which blur or noise is artificially added to an original image.

#### 5.3.1. Testing with CNN Model for DFB-DB2

Two-fold cross validation was performed to test the trained CNN models. For a face image, 4096 features were obtained from the 7th fully connected layer of VGG face net-16; for a body image, 2048 features were obtained from the average pooling layer of ResNet-50. From the features obtained by the CNN models, the geometric center of the image features was calculated using the Euclidean distance to determine the gallery image. The authentic and imposter distances were then calculated as the normalized Euclidean distances between the gallery image and the other probe images, and these distances were used to calculate the equal error rate (EER).
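A compact NumPy sketch of computing the EER from the two distance sets is given below; it is a threshold-sweeping approximation, not the authors' exact implementation.

```python
import numpy as np

def equal_error_rate(authentic: np.ndarray, imposter: np.ndarray) -> float:
    """EER from authentic (same-identity) and imposter distances."""
    thresholds = np.sort(np.concatenate([authentic, imposter]))
    frr = np.array([(authentic > t).mean() for t in thresholds])  # rejected genuines
    far = np.array([(imposter <= t).mean() for t in thresholds])  # accepted imposters
    i = np.argmin(np.abs(frr - far))          # threshold where FRR and FAR meet
    return float((frr[i] + far[i]) / 2)
```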

#### Ablation Study

The performance on DFB-DB2 was compared with or without DeblurGAN. Here, "without DeblurGAN" means that neither the focus score checking nor the DeblurGAN procedure was applied, whereas "with DeblurGAN" means that both procedures were adopted. The same DFB-DB2 and ChokePoint dataset were used for the experiment, with VGG face net-16 and ResNet-50 as the CNN models. The values in Tables 5 and 6 show that recognition performance improved with DeblurGAN, because the pixel differences between the original image and the image generated by DeblurGAN were reduced.


**Table 5.** Comparison of equal error rate (EER) for face recognition and body recognition on DFB-DB2 without or with DeblurGAN (unit: %).

**Table 6.** Comparison of EER for score-level fusion on DFB-DB2 without or with DeblurGAN (unit: %).


As shown in Figure 7, the performance of 'with DeblurGAN (Face)' and 'with DeblurGAN (Body)' improved, where face and body refer to face images and body images, respectively. Among the score-level fusion approaches, the weighted sum method exhibited better performance than the weighted product method.

**Figure 7.** Receiver operating characteristic (ROC) curves with or without DeblurGAN on DFB-DB2. (**a**) Face and body recognition results and (**b**) score-level fusion result.
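The two fusion rules compared here can be written in a few lines; the weight `w` would be tuned on training data, and its default value below is purely illustrative.

```python
def weighted_sum(face_score: float, body_score: float, w: float = 0.5) -> float:
    """Weighted-sum fusion of the face and body matching scores."""
    return w * face_score + (1.0 - w) * body_score

def weighted_product(face_score: float, body_score: float, w: float = 0.5) -> float:
    """Weighted-product fusion of the same two scores."""
    return (face_score ** w) * (body_score ** (1.0 - w))
```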

#### Comparison between Previous Methods and Proposed Method

First, for comparison, blur restoration was performed using GAN methods other than the DeblurGAN adopted in this study. Specifically, CycleGAN [20], Pix2pix [22], attention-guided GAN (AGGAN) [37,38], and DeblurGAN version 2 (DeblurGANv2) [39] were used. Table 7 and Figure 8 show the comparison results of the GANs on DFB-DB2, where our method outperforms the state-of-the-art methods. As shown in Table 7, CycleGAN performed well when restoring the body images in DFB-DB2. This is because DeblurGAN is a CGAN-type method in which the input image and target image must be paired; when the target image was composed in this study, only an image similar to the input image could be used for restoration, so the background, the texture of the clothes, and the individual's gait could differ, making paired restoration of body images more difficult.


**Table 7.** Comparisons of EER for recognition by proposed method with those by other GAN-based methods in DFB-DB2 (unit: %).


**Figure 8.** ROC curves of the proposed and other GAN methods on DFB-DB2. (**a**,**b**) Face and body image recognition results and (**c**) score-level fusion result.

Second, experiments were conducted to compare face recognition and combined face and body recognition. The face recognition comparison used VGG face net-16 [40] and ResNet-50 [41,42]. Multi-level local binary pattern (MLBP) + PCA [43,44], histogram of oriented gradients (HOG) [45], local maximal occurrence (LOMO) [46], and ensemble of localized features (ELF) [47] were used for the comparisons of face recognition and of face and body recognition. Table 8 summarizes the face recognition results, and Table 9 summarizes the face and body recognition results. Figure 9 shows the receiver operating characteristic (ROC) curves for the results in Tables 8 and 9.

**Table 8.** Comparison of EER for the results of the proposed method and previous face recognition methods (unit: %).



**Table 9.** Comparison of EER for the results of the proposed and previous face and body recognition methods (unit: %).

**Figure 9.** ROC curves comparing the proposed and state-of-the-art methods. (**a**) Face image recognition results and (**b**) face and body image recognition results.

Third, recognition accuracy was evaluated via the cumulative match characteristic (CMC) curve. Figure 10 shows the comparison between the proposed method and the methods in Tables 8 and 9. The horizontal axis corresponds to the rank, and the vertical axis corresponds to the genuine acceptance rate (GAR) at each rank. As shown in Table 4, DFB-DB2 consists of 11 individuals; accordingly, ranks up to 11 are shown in Figure 10.

**Figure 10.** Cumulative match characteristic (CMC) curves of the proposed and previous methods on DFB-DB2. (**a**) Face image recognition results by the proposed and previous methods and (**b**) face and body image recognition results via the proposed and previous methods.

Figure 11 shows the performance differences measured by Cohen's d-value and t-tests for face recognition and for face and body recognition, compared with the proposed method. For face recognition, the Cohen's d-value between the proposed method and ResNet-50 [41,42] was 2.95, which greatly exceeds the 0.8 threshold for a large effect size. The p-value of the t-test is approximately 0.098%, i.e., the proposed method differs at a confidence level of 99.902%. For face and body recognition, Cohen's d-value and the t-test were measured against ELF [47], which exhibited the second-best performance; the Cohen's d-value of 5.65 again indicates a large effect size, and the t-test showed a difference at the 99.97% confidence level.

**Figure 11.** T-test performance of our proposed method and the second-best model in terms of average accuracy. (**a**) Comparison of the proposed method and ResNet-50 and (**b**) comparison of the proposed method and ensemble of localized features (ELF).
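For reference, Cohen's d with a pooled standard deviation and the accompanying t-test can be computed as follows; the accuracy samples are illustrative, not the paper's data.

```python
import numpy as np
from scipy import stats

def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Cohen's d computed with a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1))
                     / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

ours = np.array([99.0, 98.7, 99.2])        # illustrative accuracies
baseline = np.array([96.1, 95.8, 96.4])
d = cohens_d(ours, baseline)
t, p = stats.ttest_ind(ours, baseline)     # independent two-sample t-test
```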

Cases of false acceptance (FA), false rejection (FR), and correct recognition from the previous experiments are analyzed next. Figure 12 illustrates the different cases: the image on the left of each pair is the enrolled image, and the image on the right is the probe image. The portion in the red box of the right image was restored via DeblurGAN.


**Figure 12.** Cases of false acceptance (FA), false rejection (FR), and correct recognition ((**a**)–(**c**)) in DFB-DB2. (**a**) FA cases, (**b**) FR cases, and (**c**) cases of correct recognition.

#### 5.3.2. Class Activation Map

Subsequently, we analyzed the class activation feature maps of the VGG face net-16 and ResNet-50 models used for DFB-DB2 to interpret the recognition performance for face and body images. Figure 13 shows the class activation feature maps from specific layers obtained with the Grad-CAM method [48]; the important features are revealed by the distribution of activations. Figure 13a,d,g,j correspond to the input face and body images of the CNN models, and Figure 13b,c,e,f,h,i,k,l show the class activation feature map results for the face and body images.

**Figure 13.** Results on class activation feature map on DFB-DB2. (**a**,**d**) Input face images, (**b**,**c**,**e**,**f**) results via VGG face net-16 in rectified linear units (ReLU) layer, (**b**,**e**) images from 7th ReLU layer, (**c**,**f**) images from 13th ReLU layer, (**g**,**j**) input body images, (**h**,**i**,**k**,**l**) results via ResNet-50 in batch normalized layer, (**h**,**i**) images from last batch normalized layer on conv5 2nd block, (**k**,**l**) images from last batch normalized layer on conv5 3rd block.

Specifically, when input (a) is processed through VGG face net-16, (b) is the class activation feature map of the 7th ReLU layer and (c) is that of the 13th ReLU layer. Image (c) shows the distribution focused around the face area, where red represents the main features, blue represents less important features, and black indicates that no features were detected. Going from (b) to (c), the features become more focused on the face region. The body feature maps were extracted from the batch normalized layers. In contrast to the face results, the main features appear around the body region, because the trained ResNet-50 model treats information about the individual's body and clothes as important features.
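A compact sketch of the Grad-CAM computation behind such maps is given below, assuming a PyTorch model; the choice of target layer (e.g., a ReLU or batch-norm layer, as above) and the input preprocessing are left to the caller.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Weight the target layer's activations by the spatially averaged
    gradients of the class score, then keep the positive part."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(g=go[0]))
    score = model(image.unsqueeze(0))[0, class_idx]  # scalar class score
    model.zero_grad()
    score.backward()
    h1.remove(); h2.remove()
    weights = grads["g"].mean(dim=(2, 3), keepdim=True)  # global-average gradients
    cam = F.relu((weights * acts["a"]).sum(dim=1))       # weighted channel sum
    return cam / (cam.max() + 1e-8)                      # normalize to [0, 1]
```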

#### 5.3.3. Testing with CNN Model for ChokePoint Dataset

#### Ablation Study

In this experiment, the images restored with DeblurGAN and the images whose focus score exceeded the threshold value were combined, as proposed in the study. Based on the results in Tables 10 and 11, the weighted sum method exhibited the better results among the score-level fusion methods. Figure 14 shows the results of Tables 10 and 11 as plots; as the plots show, the recognition performance improves when DeblurGAN is applied.

**Table 10.** Comparison of EER for face recognition and body recognition on ChokePoint dataset without or with DeblurGAN (unit: %).

**Table 11.** Comparison of EER for score-level fusion on ChokePoint dataset without or with DeblurGAN (unit: %).
**Figure 14.** ROC curves with and without DeblurGAN on the ChokePoint dataset. (**a**) Face and body image recognition results and (**b**) score-level fusion result.

#### Comparison between Previous Methods and Proposed Method

With respect to the GAN models for blurred image restoration, the performance of CycleGAN and DeblurGAN was compared; Table 12 and Figure 15 show the results and the corresponding plots, respectively. The results indicate that DeblurGAN exhibited better recognition performance than CycleGAN.

**Figure 15.** ROC curves for the proposed method and CycleGAN on the ChokePoint dataset. (**a**) Face and body recognition results and (**b**) score-level fusion result.


**Table 12.** Comparisons of EER for recognition by proposed method with that by CycleGAN (unit: %).

Second, existing face recognition methods and face and body recognition methods were compared with the proposed method. Tables 13 and 14 show the experimental results, and Figure 16 illustrates them as plots.

**Table 13.** Comparison of EER for recognition results via the proposed method and previous face recognition methods (unit: %).


**Table 14.** Comparison of EER for recognition results via the proposed and previous face and body recognition methods (unit: %).



**Figure 16.** ROC curves for the proposed and state-of-the-art methods on the ChokePoint dataset. (**a**) Face image results and (**b**) face and body image results.

Figure 17 compares the CMC curves of the proposed method and previous methods for face recognition and for face and body recognition. As shown in Figure 17a,b, the performance of the proposed method exceeded that of the other methods.


**Figure 17.** CMC curves for the proposed and previous methods on the ChokePoint dataset. (**a**) Face image recognition results via the proposed and previous methods; (**b**) face and body image recognition results for the proposed and previous methods.

Cases of false acceptance, false rejection, and correct recognition for the proposed method on the ChokePoint dataset are shown in Figure 18.



**Figure 18.** Cases of false acceptance (FA), false rejection (FR), and correct recognition ((**a**)–(**c**)) from the ChokePoint dataset. (**a**) FA cases, (**b**) FR cases, and (**c**) cases of correct recognition.

Figure 19 shows the performance differences measured by Cohen's d-value and t-tests for face recognition and for face and body recognition, compared with the proposed method. For face recognition, the Cohen's d-value between the proposed method and ResNet-50 [41,42] is 4.89, which greatly exceeds the 0.8 threshold for a large effect size. The *p*-value of the t-test is approximately 0.039%, i.e., the proposed method differs at a confidence level of 99.961%. For face and body recognition, Cohen's d-value and the t-test were measured against ELF [47], which exhibited the second-best performance; the Cohen's d-value is 5.06, indicating a large effect size, and the t-test showed a difference at the 99.963% confidence level.

**Figure 19.** T-test performance of the proposed method and second-best model in terms of average accuracy. (**a**) Comparison of the proposed method and ResNet-50 and (**b**) comparison of the proposed method and local maximal occurrence (LOMO).

#### 5.3.4. Class Activation Map

In the subsequent experiment, the class activation feature maps for the ChokePoint dataset were examined; Figure 20 shows the results. For the face images, Figure 20b,c,e,f show the class activation feature maps obtained from the ReLU layers of VGG face net-16, while for the body images, Figure 20h,i,k,l show the maps of images that passed through the batch normalized layers of ResNet-50. In these maps, red represents the main features and blue represents less important features. The results are similar to those of the experiment using DFB-DB2.


**Figure 20.** Result on class activation feature map on ChokePoint dataset. (**a**,**d**) Input face images, (**b**,**c**,**e**,**f**) results for the VGG face net-16 in ReLU layer, (**b**,**e**) images from 7th ReLU layer, (**c**,**f**) images from 13th ReLU layer, (**g**,**j**) input body images, (**h**,**i**,**k**,**l**) results for the ResNet-50 in batch normalized layer, (**h**,**i**) images from the last batch normalized layer on conv5 2nd block, and (**k**,**l**) images from the last batch normalized layer on conv5 3rd block.

#### 5.3.5. Comparisons of Processing Time on Jetson TX2 and Desktop Computer

In the next experiment, the computing speed of the proposed method was compared on a Jetson TX2 board [49], shown in Figure 21, and on a desktop computer with an NVIDIA GeForce GTX 1070 graphics processing unit (GPU). The Jetson TX2 board is an embedded system equipped with an NVIDIA Pascal™ GPU with 256 NVIDIA CUDA cores, 8 GB of 128-bit LPDDR4 memory, and a dual-core NVIDIA Denver 2 64-bit CPU; its power consumption is less than 7.5 W. The proposed method was ported with Keras [50] and TensorFlow [51] on Ubuntu 16.04. The installed frameworks and libraries were Python 3.5 and TensorFlow 1.12, with NVIDIA CUDA® toolkit [52] version 9.0 and NVIDIA CUDA® deep neural network library (cuDNN) [53] version 7.3.

**Figure 21.** Jetson TX2 embedded system.

As shown in Tables 15 and 16, our method requires a total of 75.72 ms on the desktop computer and 481.7 ms on the Jetson TX2 embedded system, meaning it can operate at 13.2 frames/s (1000/75.72) and 2.08 frames/s (1000/481.7), respectively. The Jetson TX2 has fewer computing resources and a slower GPU than the desktop computer, so processing on the Jetson TX2 is slower. However, faster and cheaper GPU cards and embedded GPU systems are being commercialized rapidly, on which our method could operate at higher speeds.


**Table 15.** Comparison of processing time on Jetson TX2 and desktop computer by DeblurGAN (unit: ms).

**Table 16.** Comparison of processing time on Jetson TX2 and desktop computer by VGG face net-16 and ResNet-50 (unit: ms).


#### **6. Conclusions**

Many previous works have used GANs for deblurring [38,39,54–56]. However, most of them aimed at enhancing the visibility of general scene images, whereas the main purpose of our research is to enhance the recognition accuracy of face and body images. In previous works, the GAN tried to generate an image of high visibility and distinctiveness, even though a limited amount of noise was additionally included in the generated image. In contrast, the GAN in our research tries to generate face and body images from which higher recognition accuracies can be obtained. This means that maximizing intra-class consistency (matching between the same people) and inter-class variation (matching between different people) in the generated image matters more than visibility enhancement in our GAN. Therefore, we compared the recognition accuracies of face and body images obtained by our GAN with those of other GANs, as shown in Tables 7 and 12 and Figures 8 and 15, instead of using image-visibility metrics such as the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) as in previous works [38,39,54–56]. Consequently, our method is not appropriate for handling general natural images.

This study proposed a deep CNN-based recognition method with score-level fusion for face and body images, in which a GAN is applied to restore the blur generated when body recognition data are acquired indoors from a long distance. Previous studies focused on removing blur in face images, whereas deblurring was typically omitted for body images because they were assumed to contain less detailed information than face images; however, blur in body images also affects recognition performance. To solve this problem, face images and body images were separated, and the blur in each was restored using a GAN model. Restoring face and body images independently with a GAN model increases processing time, but it restores the distinctive features of the face and body better. For impartial comparison experiments, the GAN model was used for restoration, VGG face net-16 and ResNet-50 were used for training, and the DFB-DB2 built by the researchers was disclosed.

In future work, we will investigate an advanced GAN model that can process face and body images simultaneously. For that purpose, we will also consider pre-classifying the input image into face and body images and adopting different loss functions according to the input type. In addition, we will study a combined structure of the GAN and recognition CNN models to reduce training time, and explore measures to increase the processing speed on embedded systems by examining a lighter GAN for deblurring. Furthermore, our deblurring-based recognition method will be applied to other biometric systems, including iris and finger-vein recognition, to evaluate its performance.

**Author Contributions:** J.H.K. and K.R.P. designed the face and body-based recognition system based on two CNNs and a GAN model. S.W.C. and N.R.B. helped with the experiments, analyzed the results, and collected the databases. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Ministry of Education and Ministry of Science and ICT, Korea.

**Acknowledgments:** This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2018R1D1A1B07041921), in part by the Ministry of Science and ICT (MSIT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2020-2020-0-01789) supervised by the IITP (Institute for Information & communications Technology Promotion), and in part by the Bio and Medical Technology Development Program of the NRF funded by the Korean government, MSIT (NRF-2016M3A9E1915855).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## *Article* **LdsConv: Learned Depthwise Separable Convolutions by Group Pruning**

#### **Wenxiang Lin <sup>1</sup> , Yan Ding 1,\* , Hua-Liang Wei <sup>2</sup> , Xinglin Pan <sup>3</sup> and Yutong Zhang <sup>1</sup>**


Received: 20 July 2020; Accepted: 30 July 2020; Published: 4 August 2020

**Abstract:** Standard convolutional filters usually capture unnecessary overlaps of features, resulting in wasted computational cost. In this paper, we aim to solve this problem by proposing a novel Learned Depthwise Separable Convolution (LdsConv) operation that is smart but has a strong capacity for learning. It integrates the pruning technique into the design of convolutional filters and is formulated as a generic convolutional unit that can be used as a direct replacement for convolutions without any adjustment of the architecture. To show the effectiveness of the proposed method, experiments are carried out using state-of-the-art convolutional neural networks (CNNs), including ResNet, DenseNet, SE-ResNet, and MobileNet. The results show that simply replacing the original convolutions with LdsConv in these CNNs achieves significantly improved accuracy while reducing computational cost. For ResNet50, the FLOPs can be reduced by 40.9% while the accuracy on ImageNet increases.

**Keywords:** convolutional neural network; convolutional filter; classification

#### **1. Introduction**

Convolutional neural networks (CNNs) have shown remarkable achievements in various vision tasks [1–8]. Most of these achievements benefit from the innovative design of network architectures [9–14], with applications in a variety of areas including phishing detection (see, e.g., [15]). Recent designs usually use the convolutional filter as the basic unit and achieve good training results through special network architectures. However, the manual design of network architectures has gradually been replaced by architecture search [16–22] as the computation ability of hardware has developed rapidly. Compared with architecture search, which often requires strong computing power and costly time, model compression methods and other new convolutional filter designs [23–25] provide an economical way to improve the efficiency of CNNs.

At present, the commonly used convolutions are Groupwise Convolution [2], Depthwise Convolution [26], and Pointwise Convolution [27]. Pointwise Convolution can adjust the dimension of the channels or feature maps and is widely used in the design of architectures. Groupwise Convolution can reduce the connection density and computational cost of convolutional filters, while Depthwise Convolution is the extreme version of Groupwise Convolution, which sets the number of groups equal to the number of input channels. However, if we simply replace the standard convolution with Depthwise or Groupwise Convolution without special adjustment of the architecture, the resulting model may not work well. Therefore, some new convolutional filters have been proposed recently: HetConv [23] proposes a heterogeneous kernel-based convolution, and OctConv [24] designs a convolutional filter that can extract multi-scale information from features. These convolutional filters can improve the performance of a model by simply replacing the standard convolutions without any adjustment of the baseline. The present study proposes a similar but different plug-and-play convolutional unit. Our proposed LdsConv pays more attention to the learning ability of the model and aims to transform a standard convolutional filter into a learned depthwise separable convolutional filter.

Model compression is considered another reliable and economical method to improve the efficiency of convolutional neural networks, and can be roughly divided into three categories: (a) connection pruning [28,29]; (b) filter pruning [30–36]; and (c) quantization [28,37–39]. These methods can effectively reduce the computation of a convolutional neural network, but this is usually achieved at the price of sacrificing accuracy, and special hardware support is sometimes also required.

Instead of directly pruning the whole model, we choose to integrate the pruning technique into the design of convolutional filters. In this way, the model can automatically learn which input features are most valuable for each output, enabling it to extract better features with fewer filters. To achieve this, we design a new type of convolutional filter, the Learned Depthwise Separable Convolution (LdsConv), which can be directly plugged into existing standard architectures to reduce floating point operations (FLOPs) while improving accuracy.

To integrate the pruning methods, we develop a two-stage training framework that divides the training task into picking and combining. In the first stage, LdsConv picks out the most valuable input features and applies more filters to them via the pruning technique. In the second stage, an additional pointwise convolution combines the outputs of the first stage and produces the output features. The idea of division of labour and progressive refinement is well established in computer vision: the two-stage detection framework [40] divides the task into a region proposal stage and a classification-and-localization stage, and Cascade RCNN [41] further refines the second stage into three parts, each built on the previous one. Similarly, we adopt this idea in the convolutional operation and divide the training task into picking useful filters and mixing the picking results. The relationship between the two stages is progressive and inseparable. The two-stage training process simplifies the training task for each stage and ultimately improves the efficiency of the model.

Our experiments show that replacing the standard/depthwise convolutions with LdsConv improves the accuracy and reduces the computational cost of the following models: ResNet [1], DenseNet [42], MobileNet [9], and SE-ResNet [43].

Our main contributions are three-fold:


#### **2. Related Work**

#### *2.1. High Efficiency Convolutional Filter*

Ever since the pioneering work on AlexNet [2] and VGG [3], researchers have studied how to improve the efficiency of CNNs from various perspectives, but much less work has been devoted to innovative convolutional filters. Among the proposed convolutional filters, the most popular are Groupwise Convolution [2], Depthwise Convolution [26], and Pointwise Convolution [27], which are widely used in the design of efficient CNNs. ResNet [1,44] uses Pointwise Convolution to build bottleneck layers that allow the network to go deeper without adding too many parameters. ResNeXt [45] and ShuffleNet [12] use Groupwise Convolution to reduce redundancy in internal connections, while Xception [10] and MobileNet [9] use Depthwise Convolution to reduce the connection density further. SENet [43] and CBAM [46] design modules that automatically weigh the outputs of convolutional filters at the cost of a small number of parameters. HetConv [23] replaces standard convolutional filters with filters using heterogeneous kernels, and OctConv [24] reduces the spatial redundancy in CNNs by designing special convolutional filters that operate on multi-scale input features. The Multi-Kernel Depthwise Convolution proposed in [47] extracts information with multiple kernel sizes while exploiting the computational efficiency of Depthwise Convolution. The fully learnable group convolution (FLGC) proposed in [48] can be integrated into a deep neural network and automatically learns the group structure during training in a fully end-to-end manner, achieving high computational efficiency. In [49], a dynamic grouping convolution (DGConv) was proposed that learns the number of groups in an end-to-end manner and has been shown to have several advantages. The training-free network decoupling (ND) method proposed in [50] achieves high computational efficiency and accuracy by transferring pretrained CNN models to a MobileNet-like depthwise separable convolution structure. Compared with these methods, the proposed LdsConv incorporates a weight pruning technique into the design of convolutional filters and further develops a two-stage training framework that simplifies the training task of each stage.

#### *2.2. Model Compression*

Model compression is another popular method to improve the efficiency of convolutional neural networks. Refs. [28,29] remove redundancy from the model by pruning connections; Refs. [28,37–39] compress the computation of the model via quantization; and Refs. [30–36] prune filters that contribute minimally to the model, after which the model is usually fine-tuned to maintain its performance. Among these methods, filter pruning generally does not require special hardware or software, but it needs a pretrained model that may be computationally expensive to obtain.

The proposed LdsConv inserts the weight pruning process into training; therefore, a model with LdsConv can be trained from scratch without a pretrained model. Unlike [51], which only integrates the pruning and fine-tuning process with training, LdsConv further develops the two-stage training framework that divides the training task into picking and combining. Moreover, LdsConv performs group pruning by replacing the original convolution with a groupwise convolution before training and uses an additional balance loss function to make the pruning procedure smoother. Additionally, LdsConv adds a pointwise convolution at the end of pruning to integrate the pruning results and build a regular depthwise separable convolution, allowing efficient computation in practice at test time.

#### **3. Method**

In this section, we first introduce Depthwise Separable Convolution and LdsConv. Then we describe the details about the utilization of LdsConv. We also discuss implementation details and show how to replace Depthwise Separable Convolution with LdsConv.

#### *3.1. Depthwise Separable Convolution*

Consider a standard convolution that takes an $R \times D_h \times D_w$ feature as input and produces an $O \times D_h \times D_w$ feature as output, where $R$, $O$, $D_h$, and $D_w$ denote the number of input channels, the number of output channels, and the height and width of the feature, respectively. A standard convolution applies a filter over all $R$ input channels for each output channel, so its weight matrix has size $R \times O \times H \times W$, where $H$ and $W$ denote the height and width of the filter. To reduce the computational cost, the depthwise separable convolution splits the standard convolution in two: a depthwise convolution for filtering, which applies a single filter to each input channel, and a pointwise convolution for combining the outputs of the depthwise convolution into the final output channels. The depthwise convolution is parameterized by a kernel of size $R \times 1 \times H \times W$, and the pointwise convolution by one of size $R \times O \times 1 \times 1$.
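In PyTorch, this factorization and its parameter savings can be sketched as follows; the channel counts are illustrative.

```python
import torch.nn as nn

def depthwise_separable(r: int, o: int, k: int = 3) -> nn.Sequential:
    """Depthwise (R x 1 x H x W) followed by pointwise (R x O x 1 x 1)."""
    return nn.Sequential(
        nn.Conv2d(r, r, kernel_size=k, padding=k // 2, groups=r, bias=False),
        nn.Conv2d(r, o, kernel_size=1, bias=False),
    )

# Parameter count vs. a standard convolution for R = 64, O = 128, 3 x 3 kernels:
# standard:  64 * 128 * 3 * 3 = 73,728
# separable: 64 * 3 * 3 + 64 * 128 = 8,768
```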

#### *3.2. Learned Depthwise Separable Convolution*

Given the strength of the depthwise separable convolution, it is highly desirable to design a more capable architecture so that the network itself can decide which input features each filter should be applied to. To this end, we propose a novel convolution architecture, the Learned Depthwise Separable Convolution (LdsConv). As shown in Figure 1, the training process is divided into picking stages and a combining stage, and the training task is divided accordingly into picking and combining. In the picking stages, we repeatedly remove filters of little influence to pick out the valuable input features. In the combining stage, similarly to the depthwise separable convolution, an additional 1 × 1 convolution is applied to combine the features.

**Figure 1.** Illustration of the LdsConv with an input channel count of $R = 3$, an output channel count of $O = 5$, a group cardinality of $N_O = 5$, a group number of $G = 1$, a pruning factor of $k = 2$, and a stage factor of $s = 2$. By the end of the picking stages, $(N_O - k)R$ filters have been removed. After the picking stages, an additional 1 × 1 standard convolution is added to the convolutional module to form a standard depthwise separable convolution.

#### 3.2.1. Group Pruning

Initially, we adopt a group convolution that divides a standard convolution of size $R \times O \times H \times W$ into $G$ groups of 4D tensors $F^g$, each of size $N_R \times N_O \times H \times W$, where for convenience we define $N_R = R/G$ and $N_O = O/G$. Given that the sizes of convolutional layers differ widely and thus need different values of $G$ for the division, in the experiments we set a unified hyper-parameter $N_O$, named the group cardinality, to characterize our model and analyze its influence on accuracy. Group pruning aims to relieve the effect of pruning on accuracy by making the pruning results more uniform.

#### 3.2.2. Pruning Criterion

During the training process, we gradually screen out the less important filters in each group. The importance of a filter is evaluated by the $L_1$-norm of its weight $F^g_{ij}$, which corresponds to the weight connecting the $i$-th input to the $j$-th output within group $g$. In other words, we remove the filters with the smallest $L_1$-norms.
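A minimal PyTorch sketch of this criterion for one group's weight tensor is given below; masking rather than physically removing the filters is our own implementation choice for the sketch.

```python
import torch

def prune_by_l1(weight: torch.Tensor, n_remove: int) -> torch.Tensor:
    """Zero out the n_remove filters with the smallest L1-norms.

    `weight` has shape (N_R, N_O, H, W) for one group; each (H, W) kernel
    is one filter F^g_ij, matching the per-(i, j) criterion in the text.
    """
    norms = weight.abs().sum(dim=(2, 3))               # L1-norm per filter
    idx = torch.topk(norms.flatten(), n_remove, largest=False).indices
    mask = torch.ones_like(norms.flatten())
    mask[idx] = 0.0
    return weight * mask.view_as(norms)[..., None, None]
```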

#### 3.2.3. Pruning Factor

It is important to determine how many filters should be removed before the combining stage. Formally, we set a hyper-parameter $k$, ranging from 1 to 4, such that the number of remaining filters is $k \times R$. Section 4 presents discussion and analysis on how to choose a $k$ that balances the number of parameters against accuracy and generalizes across datasets and network scales.

#### 3.2.4. Stage Factor

In contrast to methods that prune weights in pretrained models, our weight pruning process is plugged into the training procedure. Thus, we define the stage factor $s$ to determine the number of pruning steps. For a group filter weight $F^g$ of size $N_R \times N_O \times H \times W$, the number of filters to be pruned per group is $N_d = N_R N_O - k N_R$, so the total number of pruned filters is $G N_d = R N_O - kR$. At the end of each picking stage, we prune $G N_d / s$ filters.
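A worked numeric example of this schedule, with illustrative values:

```python
# Illustrative pruning schedule (values are not from the paper):
R, O, G = 64, 128, 8             # input channels, output channels, groups
N_R, N_O = R // G, O // G        # 8 inputs and 16 outputs per group
k, s = 2, 4                      # pruning factor and stage factor
N_d = N_R * N_O - k * N_R        # filters pruned per group: 112
total = G * N_d                  # R*N_O - k*R = 1024 - 128 = 896
per_stage = total // s           # 224 filters removed per picking stage
```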

#### 3.2.5. Balance Loss Function

To reduce the negative impact of pruning on accuracy, we deliberately make the number of remaining filters per input feature even across inputs, avoiding the case where most of the remaining filters extract information from only a small number of input features. Since the number of filters is non-differentiable and hard to optimize directly, we define a coefficient $M$ that penalizes more strongly the filters belonging to input features with a larger number of probably remaining filters.

In each training iteration of the picking stages, we first find the filters with the highest probability of remaining. Then, we check their input features to count the number of probably remaining filters per input feature. Finally, we restrain the filters belonging to input features that have a large number of probably remaining filters. To this end, we use the following regularizer for a group filter weight $F^g$ during training:

$$L_{bal} = \sum_{j=1}^{N_O} \sum_{i=1}^{N_R} M_i \left( \sum_{l=1}^{HW} \left| w_{l,i,j} \right| \right)^2 \tag{1}$$

where $M_i$ denotes the coefficient for the filters belonging to the $i$-th input feature and $w_{l,i,j}$ denotes each parameter in $F^g_{ij}$. By adjusting the coefficient $M_i$, the input feature with a higher number of probably remaining filters forces its filters to be penalized more strongly. The equation for $M_i$ is defined as:

$$M_i = \max\left( e^{\left(N_i^R - \lambda k\right)/\gamma} - 1,\ 0 \right) \tag{2}$$

where $N_i^R$ denotes the number of probably remaining filters belonging to the $i$-th input feature. Since the average value of $N_i^R$ is $k$, we introduce a parameter $\lambda$ to define the threshold above which the filters belonging to the $i$-th input feature receive the penalty, and $\gamma$ adjusts the penalty level. In this paper, we empirically set $\lambda = 1.5$ and $\gamma = 10$ in all experiments.
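A minimal PyTorch sketch of Equations (1) and (2) for a single group is given below; how the "probably remaining" counts are tracked during training is not detailed in the text, so they are passed in as an argument.

```python
import torch

def balance_loss(weight: torch.Tensor, n_remaining: torch.Tensor,
                 k: int, lam: float = 1.5, gamma: float = 10.0) -> torch.Tensor:
    """Balance regularizer of Equations (1) and (2) for one group.

    `weight` has shape (N_R, N_O, H, W); `n_remaining[i]` is the number of
    probably remaining filters of the i-th input feature (float tensor).
    """
    # Equation (2): penalize inputs with more than about lam * k survivors.
    m = torch.clamp(torch.exp((n_remaining - lam * k) / gamma) - 1.0, min=0.0)
    # Equation (1): M_i times the squared per-filter L1-norm, summed over i, j.
    l1 = weight.abs().sum(dim=(2, 3))
    return (m[:, None] * l1 ** 2).sum()
```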

#### 3.2.6. Additional Pointwise Convolution

At the end of the picking stages, we convert the sparsified model into a network of regular modules that can be deployed efficiently on devices without special hardware or software support. For this reason, we add an additional pointwise convolution to each LdsConv to build a depthwise separable convolution (see Figure 1). This operation also greatly broadens the expressive ability of LdsConv filters and steers the training task toward combining the outputs of the picking stages to produce the final output features. The weight of the additional pointwise convolution has size $kR \times O \times 1 \times 1$, determined by the numbers of input channels $R$ and output channels $O$ of the original convolution and the pruning factor $k$. The initial value of this weight is set using the index information of the remaining filters; Figure 2 shows the initial values for the example in Figure 1. We set a position in the weight matrix to 1 only when the middle feature extracted by the remaining filter matches the output feature; the colors in Figure 1 represent this matching relationship. This initialization narrows the negative effect of the newly added pointwise convolution during training.


**Figure 2.** Initial value assignment for the example shown in Figure 1. The left set of parallelograms represents the middle features, and the numbers in them are the indices of the input features; the upper set of parallelograms represents the output features. Matching colors between a left parallelogram and a top one mean that they were matched in the picking stages. The values in the matrix are the initial values of the additional convolution.

#### 3.2.7. Learning Rate

We adopt a cosine-shaped learning rate schedule during training, which smoothly changes the learning rate and usually improves accuracy [18,52,53]. Figure 3 shows the learning rate as a function of the training epoch and the corresponding training loss of a ResNet50 using LdsConv filters on the ImageNet dataset [54]. Before entering the combining stage, we add the additional pointwise convolution and reset the learning rate to reduce the negative effect of a large learning rate on the newly added weights. This causes the abrupt increase in the loss at epoch 45, from which the plot shows the loss gradually recovering.

#### *3.3. The Implementation of LdsConv*

Beyond the design of LdsConv itself, we briefly describe how to replace standard convolutional filters and depthwise separable convolutional filters with LdsConv filters.

**Figure 3.** The cosine-shaped learning rate and a typical training loss curve on ImageNet. The vertical gray bar marks the end of the picking stages and the beginning of the combining stage.

#### 3.3.1. Standard Convolution

When replacing a standard convolution with our proposed LdsConv, the most important hyper-parameter is the group cardinality $N_O$. In general, we suggest setting $N_O$ to a value from 8 to 32; however, if the number of channels of the original convolution is too small to divide, $N_O$ should be set to the number of output channels so that the number of groups is 1. For the other hyper-parameters, the recommended values from Section 4 can simply be used. Note that because we first replace the standard convolution with a group convolution, a 1 × 1 convolution must follow the group convolution to mix the information across all channels. Figure 4 demonstrates the replacement in ResNet.

**Figure 4.** The replacement of the original convolutional filters with LdsConv. Left: the replacement in ResNet; we directly replace the 3 × 3 convolution in the bottleneck block with our proposed LdsConv. Right: the replacement in MobileNet; we replace the original 3 × 3 convolution and reduce the number of output channels of the LdsConv and of input channels of the subsequent 1 × 1 convolution.

#### 3.3.2. Depthwise Separable Convolution


In general, a pointwise convolution already exists in every depthwise separable convolution, so the problem mentioned above does not arise, and we can simply replace the depthwise convolution with our proposed LdsConv. However, the parameters and FLOPs may increase if no further adjustment is made. Therefore, we suggest adding a convolution before or after the LdsConv to reduce its number of input or output channels. The right part of Figure 4 shows our implementation of LdsConv filters in MobileNet.

#### **4. Experiment**

In this section, we validate the effectiveness and efficiency of the proposed LdsConv. We first present ablation studies for image classification on CIFAR [55], and then perform a set of experiments on ImageNet [54] to check the performance of the proposed LdsConv.

#### *4.1. Ablation Study on CIFAR*

We conduct a series of ablation studies to find the best situation to implement LdsConv filters and then check its robustness in different models.

#### 4.1.1. Training Details

We use the stochastic gradient descent (SGD) algorithm to train all the models. Specifically, we adopt Nesterov momentum with a momentum weight of 0.9 without dampening and use a weight decay of 1e−4. Unless otherwise specified, the training batch size is 64 and the total number of training epochs is 300, of which the picking stages take 150 epochs and the combining stage takes 150 epochs. For the convenience of comparing network accuracies, we use the standard cosine learning rate schedule without reset, starting from 0.1 and gradually decreasing to 0, for all models. It is worth mentioning that the special modification of the learning rate does not affect the results much; therefore, we remove the reset described in Section 3.2.7 for convenience.

#### 4.1.2. Implement on DenseNet-BC-100

We experiment with the DenseNet-BC-100 architecture with a growth rate of 12 [42] on the CIFAR-100 dataset. To implement our proposed LdsConv, we simply replace the 3 × 3 convolutional filters in the dense blocks with LdsConv filters. Specifically, we set the group cardinality $N_O$ equal to the number of output channels, since that number is too small to divide into groups. We then run experiments on the effect of the pruning factor $k$ and the stage factor $s$ on the LdsConv.

#### 4.1.3. Effect of Stage Factor

The first part of Table 1 compares DenseNet-BC-100 models with LdsConv filters under different stage factors; the pruning factor $k$ is set to 2. The results show that $s = 4$ appears to be the best value: the accuracy peaks at 4 and drops for higher stage factors. We attribute this drop to the decrease in the number of gap epochs between prunings, computed as $E_G = E_P / s$, where $E_P$ denotes the number of training epochs in the picking stages. To rule out this effect, we conduct two more experiments with $s = 6$ and $s = 8$ in the second part of Table 1, setting $E_G$ to the same value as when $s = 4$; in other words, the picking stages of these two experiments take 225 and 300 epochs, respectively. The results show that accuracy can increase considerably when the gap epochs $E_G$ are not reduced. Taking the training time into account, we suggest setting the stage factor to 4 in ordinary circumstances.

#### 4.1.4. Effect of Pruning Factor

We experiment with several pruning factors $k$ varying from 1 to 4, with the stage factor $s$ set to 4 so that all models undergo the same number of prunings. The results in the third part of Table 1 show that the number of parameters rises with the pruning factor, while the accuracy fluctuates: the risk of overfitting and the decreasing pruning proportion compete with each other to produce this behavior. In particular, the results suggest that setting the pruning factor $k$ to 2 is a good choice that balances accuracy and the number of parameters. One can also reduce the pruning factor $k$ to 1, or even integrate the additional pointwise convolution with the subsequent convolution, to reduce the weights further.


**Table 1.** Ablation study results for different setups on CIFAR-100. '<sup>∗</sup>' refers to the LdsConv using the balance loss; '<sup>#</sup>' refers to models trained with gap epochs $E_G = E_P/4$.

#### 4.1.5. Effect of Balance Loss Function

To check the effectiveness of our balance loss function, we apply it to models with varied pruning factors. The fourth part of Table 1 shows that the accuracy improves when the balance loss regularization is added.

#### 4.1.6. Effect of Group Cardinality

To evaluate the effect of the group cardinality $N_O$, we experiment with ResNet50 [1], which is designed for ImageNet and thus has a large number of channels. Since CIFAR images have a smaller resolution, we remove the first three downsampling operations and retain only the last two. The fifth part of Table 1 compares ResNet50 models using LdsConv filters with varied group cardinality; specifically, we set $N_O$ to 4, 8, 16, and 32, with the stage factor $s = 4$ and the pruning factor $k = 2$ for all models. The results show that the accuracy first rises and then falls, peaking at $N_O = 8$; the drop for lower $N_O$ indicates that over-grouping can also have negative effects, which we attribute to the shrinkage of expressive ability when the convolution is grouped.

#### 4.1.7. Effect of Two-Stage Training Framework

To verify the function of each stage, we first examine the norm values of the picking results and then evaluate the effect of the additional convolution. The three panels of Figure 5a illustrate the weights of the last 3 × 3 convolution for the original DenseNet-BC-100, Dw-DenseNet-BC-100, and Lds-DenseNet-BC-100. In Dw-DenseNet-BC-100, we replace the 3 × 3 standard convolutions in the dense blocks with depthwise separable convolutions, which can be regarded as the typical one-stage training form of LdsConv. Each block in the figure represents the L1 norm (normalized by the maximum value among all filters) of a 3 × 3 filter. In the top two panels of Figure 5a, the vertical and horizontal axes represent the height and width of the weight matrix, respectively; for the third panel, we arrange the weight matrix of Lds-DenseNet-BC-100 the same way for alignment. Figure 5b plots, for the three models, the average weight norm across the 48 3 × 3 convolutional layers in the dense blocks. The results suggest that the picking stage indeed reduces the redundancy in the weight matrix and picks out the more valuable filters. We additionally experiment with Dw-DenseNet-BC-100 and Lds-DenseNet-BC-100 ($k = 2$) without additional convolutions (AC) in the final part of Table 1. Without the additional convolutions, the combining stage becomes ordinary optimization, and the accuracy drops dramatically, indicating that the combining stage is indispensable. Furthermore, the additional convolutions arrange the sparsified convolutions into standard depthwise separable convolutions, improving the computational cost at test time. Besides, Dw-DenseNet-BC-100 shows lower accuracy and a non-negligible gap in convergence speed compared with the baseline in Figure 5c, whereas Lds-DenseNet-BC-100 trained with the two-stage training framework exhibits a convergence curve close to the baseline.

#### 4.1.8. Results on Other Models

To evaluate the effectiveness of the proposed LdsConv under the settings discussed above in different networks, we choose currently popular models as baselines, including ResNet [1], DenseNet [42], MobileNet [9], and SE-ResNet [43]. For all experiments, we set the pruning factor $k$ to 2 and the stage factor $s$ to 4 and keep the balance loss function active. In DenseNet, we set the group cardinality $N_O$ equal to the number of output channels; in the other networks, we set $N_O$ to 8. The experimental results are shown in Table 2. After replacing the convolutions in the original models with our modules, these networks generally reduce the FLOPs and the number of parameters while maintaining or even improving the accuracy. This shows that our method can effectively reduce the redundancy in convolutional filters and that LdsConv performs well without many adjustments of the hyper-parameters.


**Table 2.** The table shows the results for different models on CIFAR-100. '<sup>∗</sup> ' refers to the LdsConv using the balance loss. With the setting obtained from the ablation study, we can simply improve the performance of the model by replacing the standard 3 × 3 convolution with our proposed LdsConv.

**Figure 5.** (**a**) Norms of the weights of the three models on CIFAR-100; blocks with darker colors have smaller norm values. The vertical and horizontal axes represent the height and width of the weight matrix, except for Lds-DenseNet-BC-100, which is arranged this way for alignment. (**b**) The average weight norm across the 48 3 × 3 convolutional layers in the dense blocks for the three models. (**c**) The convergence speed curves of the three models.

#### *4.2. Results on ImageNet*

In this set of experiments, we test the LdsConv filters on the ImageNet dataset.

#### 4.2.1. Training Details

We train all models with SGD, adopting Nesterov momentum with a momentum weight of 0.9, no dampening, and a weight decay of 1 × 10<sup>−4</sup>. We train for 135 epochs in total, of which the picking stage takes 45 epochs and the combining stage takes 90 epochs. The learning rate schedule is shown in Figure 3. For MobileNet, we simply increase the number of training epochs rather than tuning the hyper-parameters: we train for 300 epochs in total, with the picking and combining stages set to 100 and 200 epochs, respectively. The initial learning rate is 0.045 and the weight decay is 4 × 10<sup>−5</sup>.
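In PyTorch, the optimizer configuration described above looks roughly as follows. This is a sketch: `model` and `mobilenet` are placeholders, and the initial learning rate of 0.1 for the non-MobileNet runs is our assumption, since the text only points to the schedule in Figure 3:

```python
import torch

# ResNet/DenseNet/SE-ResNet runs: Nesterov SGD, no dampening, wd = 1e-4.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,  # assumed initial LR
                            momentum=0.9, dampening=0, nesterov=True,
                            weight_decay=1e-4)

# MobileNet runs: initial LR 0.045, wd = 4e-5, longer 300-epoch schedule.
optimizer_mnet = torch.optim.SGD(mobilenet.parameters(), lr=0.045,
                                 momentum=0.9, dampening=0, nesterov=True,
                                 weight_decay=4e-5)
```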

#### 4.2.2. Model Configurations

In the experiments on ImageNet, we enable the balance loss function and set the pruning factor *k* to 2 and the stage factor *s* to 4. Except for DenseNet, we set the group cardinality *N<sup>O</sup>* to 8; in DenseNet, we again set the group cardinality to the same value as the number of output channels, ensuring that the number of groups is 1.

#### 4.2.3. Comparison on ImageNet

We continue to use ResNet [1], DenseNet [42], MobileNet [9], and SE-ResNet [43] as baselines for comparison; the results are shown in Table 3. All baseline results are taken from the original papers. For MobileNet, we slightly reduce the parameters and FLOPs while increasing the accuracy by 2.3%. For the other networks, which originally use standard convolutions, we not only improve the accuracy but also markedly reduce the number of parameters and FLOPs. Moreover, our modules can coexist with SE modules to further improve the efficiency of the model.

**Table 3.** The table shows the results for different models on ImageNet. '<sup>∗</sup>' refers to the LdsConv using the balance loss. By simply replacing the standard 3 × 3 convolutional filters with our proposed LdsConv filters, we not only improve the accuracy but also substantially reduce the FLOPs and the number of parameters. For MobileNet, the accuracy increases by 2.3%, a considerable improvement.


#### 4.2.4. Comparison with Model Compression Methods

To investigate the compression ability of the proposed LdsConv, we push the bottleneck block with LdsConv in ResNet to an extreme state, as shown in Figure 6. To this end, we remove the BN and ReLU layers after the 3 × 3 group convolutional layer before training. When the combining stage begins, we integrate the additional pointwise convolution (AC) with the subsequent 1 × 1 convolution via matrix multiplication, since no non-linear operation exists between them; once the model formally enters the combining stage, we only train one 1 × 1 convolution after every LdsConv. In Table 4, we compare LdsConv with existing compression methods, including ThiNet [30], NISP [56], and FPGM [57]. We use ResNet50 as the baseline, replace the standard convolutions with LdsConv, and reduce the number of parameters further by setting the pruning factor to 1 and combining the additional pointwise convolution with the subsequent 1 × 1 convolution. We also set *s* = 6 and *E<sup>G</sup>* = *E<sup>P</sup>*/4, which lengthens the training schedule, to mitigate the negative effect of such extreme compression. Compared with these pruning methods, our method, denoted Lds-ResNet50-extreme, not only achieves the best accuracy among all compared methods but also reduces the FLOPs by 40.9%. Furthermore, in a practical evaluation on an Nvidia RTX 2080 GPU, the inference speed of Lds-ResNet50-extreme is 42 batches (16 images per batch) per second, compared with 28.9 batches per second for the ResNet50 baseline; we thus obtain a nearly 1.5× speed-up without special hardware support.
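The merging of the additional pointwise convolution with the subsequent 1 × 1 convolution relies on the fact that two linear maps with no non-linearity between them compose into a single linear map. A minimal PyTorch sketch of such a fusion (our own helper, not the authors' code) is:

```python
import torch
import torch.nn as nn

def fuse_pointwise(conv_a: nn.Conv2d, conv_b: nn.Conv2d) -> nn.Conv2d:
    """Fuse two consecutive 1x1 convolutions (conv_b applied after conv_a,
    with no non-linearity in between) into a single 1x1 convolution."""
    assert conv_a.kernel_size == (1, 1) and conv_b.kernel_size == (1, 1)
    fused = nn.Conv2d(conv_a.in_channels, conv_b.out_channels,
                      kernel_size=1, bias=True)
    w_a = conv_a.weight.squeeze(-1).squeeze(-1)   # (C_mid, C_in)
    w_b = conv_b.weight.squeeze(-1).squeeze(-1)   # (C_out, C_mid)
    # The composition of the two linear maps is a single matrix product.
    fused.weight.data = (w_b @ w_a).unsqueeze(-1).unsqueeze(-1)
    b_a = conv_a.bias if conv_a.bias is not None else torch.zeros(conv_a.out_channels)
    b_b = conv_b.bias if conv_b.bias is not None else torch.zeros(conv_b.out_channels)
    # conv_a's bias passes through conv_b's weights before conv_b's bias is added.
    fused.bias.data = w_b @ b_a + b_b
    return fused
```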

**Figure 6.** The extreme state of LdsConv in ResNet. We remove the BN and ReLU layers after the 3 × 3 convolution and combine the additional convolution with the subsequent 1 × 1 convolution via matrix multiplication; the standard convolution is thus ultimately replaced with a depthwise convolution alone. '3 × 3 LdsConv w/o AC' denotes the depthwise part of LdsConv; 'subsequent 1 × 1 Conv w AC' denotes the result of combining the additional convolution with the original subsequent 1 × 1 convolution.

**Table 4.** The table shows the comparison with existing compression methods for ResNet50 on ImageNet. Our Lds-ResNet50-extreme outperforms all other methods in terms of accuracy while still achieving a comparable reduction in FLOPs.


#### *4.3. Comparison with Similar Works*

To further verify the effectiveness of our approach, we conduct several experiments comparing three related methods, namely ND [50], FLGC [48], and DGConv [49], with the proposed model; a comparison of the four models is shown in Table 5. These methods are similar in that they transform a regular convolution into a depthwise/groupwise one. To evaluate each method fairly, we reimplement them all on ResNet50, since they use different baselines in their original papers. FLGC mainly transforms the 1 × 1 convolutions into groupwise ones and can therefore reduce the FLOPs a great deal, but it sacrifices considerable accuracy to reach such a reduction in computational cost. In contrast, our proposed LdsConv mainly transforms the 3 × 3 convolution into a depthwise separable one and strikes a good balance between FLOPs and accuracy. ND decomposes the regular convolution into an accumulation of several depthwise separable convolutions, whereas our approach replaces the standard convolution with a single depthwise separable convolution; moreover, our Lds-ResNet50-extreme uses only a depthwise convolution (without the separable form), yielding an extreme reduction in computational cost that ND cannot match. The goal of DGConv is to construct a groupwise convolution with dynamic groups, whereas our approach constructs a depthwise (Lds-ResNet50-extreme) or depthwise separable convolution from the most valuable filters. Our Lds-ResNet50-extreme thus serves as an upper bound on the FLOPs reduction attainable by DGConv-ResNet50, and our Lds-ResNet50<sup>∗</sup> surpasses its accuracy with fewer extra FLOPs. As shown in Table 5, Lds-ResNet50<sup>∗</sup> outperforms the other methods in terms of accuracy while still offering a considerable reduction in FLOPs and number of parameters, and Lds-ResNet50-extreme retains comparable accuracy under strong compression.
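To make the FLOPs trade-off concrete, the following back-of-the-envelope sketch (our illustration, with arbitrarily chosen layer sizes) compares the multiply-accumulate count of a standard k × k convolution with that of its depthwise separable counterpart:

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a standard k x k convolution."""
    return h * w * c_in * c_out * k * k

def dws_conv_macs(h, w, c_in, c_out, k):
    """Depthwise k x k (one filter per channel) plus 1 x 1 pointwise."""
    return h * w * c_in * k * k + h * w * c_in * c_out

# Example: a 3x3 layer mapping 256 -> 256 channels on a 14 x 14 feature map.
std = conv_macs(14, 14, 256, 256, 3)       # 115,605,504 (~115.6 M)
dws = dws_conv_macs(14, 14, 256, 256, 3)   # 13,296,640  (~13.3 M, ~8.7x fewer)
```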

**Table 5.** The table shows the comparison with similar methods for ResNet50 on ImageNet. '<sup>∗</sup> ' refers to the LdsConv using the balance loss.


#### *4.4. Network Visualization with Grad-CAM*

We further apply Grad-CAM [58] to the models using images from the ImageNet validation set. Grad-CAM uses gradients to calculate the importance of the spatial locations in convolutional layers. Because the gradients are calculated with respect to a specific class, the Grad-CAM results clearly show the attended regions. By visualizing the importance map of the network, we can understand which parts of the input the network attends to and how it uses the features to predict a class. We compare the visualization results of our proposed Lds-ResNet50 and the baseline (ResNet50) in Figure 7.
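A minimal Grad-CAM sketch in PyTorch is shown below; it assumes a stock torchvision ResNet50 (not our Lds-ResNet50, whose code is not reproduced here), hooks the last convolutional block, and follows the standard recipe of weighting activations by globally averaged gradients:

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(pretrained=True).eval()  # stand-in for Lds-ResNet50
store = {}

def fwd_hook(module, inputs, output):
    store["act"] = output                                  # feature maps
    output.register_hook(lambda g: store.update(grad=g))   # their gradients

model.layer4.register_forward_hook(fwd_hook)  # last conv block

def grad_cam(x, class_idx=None):
    logits = model(x)                      # x: (1, 3, H, W)
    idx = class_idx if class_idx is not None else int(logits.argmax(dim=1))
    model.zero_grad()
    logits[0, idx].backward()
    # Channel weights: global-average-pooled gradients; weighted sum + ReLU.
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    return F.interpolate(cam, size=x.shape[2:], mode="bilinear",
                         align_corners=False)
```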

Figure 7 clearly shows that the Grad-CAM results of Lds-ResNet50 cover the target regions better than those of the original ResNet50, suggesting that the LdsConv-integrated network learns to exploit the information in target regions and to aggregate features from them.

**Figure 7.** Grad-CAM [58] visualization results. We compare the visualization results of our Lds-ResNet50 and ResNet50. The Grad-CAM visualization is calculated for the last convolutional outputs. The ground-truth label is shown at the top of each input image.

#### **5. Conclusions**

In this work, we propose a new type of convolution called LdsConv. We have compared the proposed convolutional filters with the original convolutional filters on various existing architectures, and the experimental results show that LdsConv is more efficient than the existing convolutions in these models. We have also compared LdsConv with model compression methods and similarly motivated works; our experiments show that the proposed method achieves the best overall accuracy while maintaining competitive FLOPs.

**Author Contributions:** Conceptualization, W.L. and Y.D.; investigation, Y.D., H.-L.W. and X.P.; methodology, W.L., Y.D. and X.P.; project administration, Y.D.; software, W.L.; supervision, Y.D. and H.-L.W.; validation, Y.Z.; visualization, W.L.; writing–original draft, W.L.; writing–review and editing, Y.D., H.-L.W., X.P. and Y.Z. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

