*Article* **Convolutional Neural Networks: A Roundup and Benchmark of Their Pooling Layer Variants**

**Nikolaos-Ioannis Galanis, Panagiotis Vafiadis, Kostas-Gkouram Mirzaev and George A. Papakostas \***

MLV Research Group, Department of Computer Science, International Hellenic University, 65404 Kavala, Greece **\*** Correspondence: gpapak@cs.ihu.gr

**Abstract:** One of the essential layers in most Convolutional Neural Networks (CNNs) is the pooling layer, which is placed right after the convolution layer, effectively downsampling the input and reducing the computational power required. Different pooling methods have been proposed over the years, each with its own advantages and disadvantages, rendering them a better fit for different applications. We introduce a benchmark between many of these methods that highlights an optimal choice for different scenarios depending on each project's individual needs, whether it is detail retention, performance, or overall computational speed requirements.

**Keywords:** Convolutional Neural Network (CNN); pooling; deep learning; computer vision; image analysis; benchmark

#### **1. Introduction**

Computer vision can be described as the way machines interpret images and is a field of AI that trains computers to comprehend the visual world [1]. During the last 20 years, computer vision has evolved rapidly, with deep learning and especially Deep Convolutional Neural Networks (D-CNNs) standing out among other methodologies. The accuracy rates for object classification and identification have increased to the point of being comparable to that of humans, enabling quick automated image detection and reactions to optical inputs.

CNNs are unquestionably the most significant artificial neural network architecture for any computer vision and image analysis project at the moment. Making an appearance in the 1950s with simple and complex cell biological experiments [2,3] and officially introduced in the 1980s [4] as a neural network model for a mechanism of visual pattern recognition, they have progressed greatly over the years into today's complex pre-trained computer vision models. One of the main applications of deep learning and CNNs is image classification, where the system tries to identify a scene or an object within it. CNNs can also be taken a step further, using one or more bounding boxes to recognize and locate multiple objects inside an image.

Many traditional machine learning models such as Support Vector Machine (SVM) [5] or K-Nearest Neighbor (KNN) [6] were used for image classification before CNNs, where each individual pixel was considered a feature. With CNNs, the convolution layer was introduced, breaking down the image into multiple features, which are used for predicting the output values. However, since convolution itself is a demanding computation, pooling was introduced to make the overall process less resource intensive along the network. This method reduces the overall amount of computations required, essentially downsampling the input every time it is applied while trying to maintain the most important information.

In this review, we attempt to summarize many of the pooling variants along with the advantages and disadvantages of each individual method, while also comparing their performance in a classification scenario with three different datasets.

Initially, the pooling methods are presented one by one, providing an overview of each approach. In the end, we summarize the models and datasets that each method uses in a table, as a preamble to the testing methodology, which is explained right after. Finally, we present and analyze our benchmark results, focusing on the performance and the ability to retain the details of the original input.

**Citation:** Galanis, N.-I.; Vafiadis, P.; Mirzaev, K.-G.; Papakostas, G.A. Convolutional Neural Networks: A Roundup and Benchmark of Their Pooling Layer Variants. *Algorithms* **2022**, *15*, 391. https://doi.org/10.3390/a15110391

Academic Editor: Frank Werner

Received: 9 September 2022; Accepted: 18 October 2022; Published: 23 October 2022

#### **2. Materials and Methods**

#### *2.1. Related Work*

The following content is separated into two sections: a roundup of pooling methods summarizing each approach and a benchmark of their performance taking into account multiple factors, focusing on 2D image applications. There have been some review papers on this subject in the past, mostly summarizing the theory behind individual proposals.

Some of them are quite extensive [7,8] and may reference the test results from various external sources [8], though this type of compilation is not ideal for a direct comparison since each experiment is performed under different conditions (model, hardware, etc.). Others focus on deep architectures or neural networks in general, including only some of the pooling methods along with their main research subject [9,10]. In some cases, there are even small-scale tests, but they are targeted at very specific use cases, such as medical data [11].

To the best of our knowledge, although the subject overlaps with prior reviews, there has been no extended benchmark implemented in a single common environment that allows a direct comparison between the methods' performances.

#### *2.2. Pooling the Literature*

The publications that this review was based on were located by searching for a combination of the terms "Pooling" and "CNN" or "Convolution" (and their derivatives, such as "convolutional") in the title, keywords, and abstract. After shortlisting some of the results, further literature was added by extensively searching through references and related publications of the initially selected papers, focusing on the applications of CNNs rather than the generic subject. While there are references as early as 1990, when Yamaguchi introduced the concept of Max pooling [12], most pooling proposals and ideas appear chronologically in the last decade. Figure 1 shows a steady interest in the general subject of pooling over the last decade, with small increases or decreases per year.

**Figure 1.** Total publications about pooling for CNNs indexed in Scopus, per year.

#### *2.3. Let the Pooling Begin*

Three of the most common pooling methods are Max pooling, Min pooling, and Average pooling. As their names suggest, for every area of the input where the sliding window focuses, the maximum, minimum, or average value is calculated accordingly.

Average pooling (also referred to as Mean pooling) has the drawback that it takes into consideration all values regardless of their magnitude, and even worse, in some cases (depending on the activation function that is used), strong positive and negative activations can cancel each other out completely.

On the other hand, Max pooling captures the strongest activations while ignoring other weaker activations that might be equally important, thus erasing input data, while also tending to overfit frequently and not generalizing very well. While most of the other methods try to either improve, combine, or even completely replace these "basics", they still tend to be widely used due to their efficiency, ease of use, and low computational power required. Let us explore each of the available methods in detail.

#### 2.3.1. Max and Min Pooling

Max pooling is one of the most-common pooling methods, which selects the maximum element from the area of the feature map covered by the kernel applied, as seen in Figure 2. Depending on the filter and stride, the outcome is a feature map having the most distinguished features of the input [13]. On the other hand, Min pooling does the exact opposite, selecting the minimum element from the selected area. As expected, Max pooling tends to retain the lighter parts of the input when it comes to images, while Min pooling does the same with the darker parts.

**Figure 2.** An example of Max pooling's functionality [14].
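As a concrete illustration of these basic operations, the sliding-window computation can be sketched in a few lines of NumPy (a naive loop version for clarity, not the vectorized form a framework would use):

```python
import numpy as np

def pool2d(x, k=2, stride=2, mode="max"):
    """Max/Min/Average pooling of a 2D array with a k x k window."""
    reduce = {"max": np.max, "min": np.min, "avg": np.mean}[mode]
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = reduce(x[i*stride:i*stride+k, j*stride:j*stride+k])
    return out

x = np.array([[1., 3., 2., 4.],
              [5., 6., 1., 2.],
              [7., 2., 9., 0.],
              [1., 4., 3., 8.]])
pool2d(x, mode="max")  # [[6., 4.], [7., 9.]]
pool2d(x, mode="min")  # [[1., 1.], [1., 0.]]
pool2d(x, mode="avg")  # [[3.75, 2.25], [3.5, 5.]]
```

Each 2 × 2 window keeps its brightest, darkest, or mean value, matching the behavior described above.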

#### 2.3.2. Fractional Max Pooling

Fractional Max Pooling (FMP) [15] is, as its name suggests, a variant of Max pooling, but the reduction ratio can be a fraction as well, instead of an integer. The most important parameter is the scaling factor *a* by which the input image will be downscaled, with 1 < *a* < 2. Considering an input of size *Nin* × *Nin*, we select two sequences of integers *ai*, *bi* that start at 1, and they are incremented by 1 or 2 and end at *Nin*. These sequences can be either completely random or pseudorandom when they follow the equation *ai* = *ceil*(*a* ∗ (*i* + *u*)), where *a* is the scaling factor and *u* is a number in the range (0, 1). Then, the input is split into pooling regions, either disjoint or overlapping using the respective variant of Formula (1), and the Max value for each region is retained.

$$P\_{i,j} = [a\_{i-1}, a\_i - 1] \times [b\_{j-1}, b\_j - 1] \text{ or } P\_{i,j} = [a\_{i-1}, a\_i] \times [b\_{j-1}, b\_j] \tag{1}$$

where

*P* : the pooling region

*ai*, *bi* : integer sequences according to the FMP algorithm

According to the writers' experiments, overlapping FMP seems to have better results than the disjoint alternative, while a random choice of the sequences *ai*, *bi* distorts the image, in contrast with the pseudorandom ones. Overall, FMP's performance appears to be better than that of Max pooling.
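The pseudorandom boundary sequences and a 1D version of the pooling step can be sketched as follows; this is a simplified illustration of the scheme (the sequence is clipped so it ends exactly at *Nin*), not the paper's exact implementation:

```python
import math

def fmp_sequence(n_in, alpha, u):
    """Pseudorandom pooling boundaries a_i = ceil(alpha * (i + u)), clipped
    so the sequence ends at n_in; with 1 < alpha < 2 consecutive boundaries
    differ by 1 or 2."""
    seq = [1]
    i = 1
    while seq[-1] < n_in:
        seq.append(min(math.ceil(alpha * (i + u)), n_in))
        i += 1
    return seq

def fmp1d(x, alpha, u, overlap=True):
    """1D fractional max pooling over regions [a_{i-1}, a_i] (overlapping)
    or [a_{i-1}, a_i - 1] (disjoint); boundaries are 1-based."""
    a = fmp_sequence(len(x), alpha, u)
    out = []
    for lo, hi in zip(a, a[1:]):
        stop = hi if overlap else hi - 1
        out.append(max(x[lo - 1:stop]))  # 1-based inclusive -> 0-based slice
    return out

fmp1d(list(range(1, 10)), alpha=1.5, u=0.5, overlap=False)  # [2, 3, 5, 6, 8]
```

With 1 < *alpha* < 2, the output length lies between half of and the full input length, which is exactly the fractional reduction ratio the method is after.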

#### 2.3.3. Row-Wise Max-Pooling

Row-wise Max pooling is described alongside a deep panoramic representation for 3D shape classification and recognition called DeepPano [16]. A panoramic view is created from the projection of the 3D model onto a cylinder around its principal axis. The pooling layer is placed after the last convolution layer and uses the highest value of each row in the input map. The suggested methodology appears to be rotation-invariant according to the experiments, since its output is not affected by the rotation of the 3D shape input.

#### 2.3.4. Average Pooling

Average pooling has a similar function as Max pooling, but it calculates the average value of the pooled area [17], as seen in Figure 3.

**Figure 3.** An example of Average pooling's functionality [14].

In contrast to Max pooling, which keeps only the top features, Average pooling considers the whole patch of features and returns a smoother result. This may lead to lower accuracy; in general, the better choice depends on the density of the features (pixels) and on how the output is used.

#### 2.3.5. Rank-Based Pooling

The rank-based pooling methods [18] are an alternative to Max pooling, with three variants: rank-based Average pooling (RAP), rank-based weighted pooling (RWP), and rank-based stochastic pooling (RSP).

Common to all three is the ranking process that takes place first: an activation function is applied to the individual elements, and they are sorted in descending order according to that function's value.

RAP attempts to resolve the main issues of Max and Average pooling, which are the information loss of non-Max values in Max pooling and the information being downgraded due to near-zero negative activations in Average pooling. It does so by using an average of the top t important features, where t is a predefined downsizing threshold—if we want to downsize, for instance, by a factor of 2 and the kernel has a size of 2 × 2, *t* will have the value of 2 as well. Then, we set weights for all the elements within the kernel, with the top *t* having a weight of 1/*t*, whereas all other weights are set to 0, and the output is calculated from Equation (2).

$$s\_j = \frac{1}{t} \sum\_{i \in R\_j, r\_i \le t} a\_i \tag{2}$$

where *ai* is the activation value, *ri* is its rank, and *t* is the rank threshold that determines which activations affect the averaging.

RWP takes into consideration that each region in an image might not be equally important, thus setting rank-based weights for each activation. Thus, the pooling output now changes to Equation (3).

$$s\_j = \sum\_{i \in R\_j} p\_i a\_i \tag{3}$$

where *a* is the activation value and the probability *p* that is used for each weight is given by the ranking Equation (4) where *b* is a hyper-parameter, *r* is the rank of activations, and *n* is the size of the pooling area.

$$p\_r = b(1 - b)^{r - 1}, \quad r = 1, \dots, n \tag{4}$$

Lastly, Equation (5) is used for RSP in a very similar way to RWP.

$$s\_j = a\_i, \text{ where } i \sim \text{Multinomial}(p\_1, \ldots, p\_n) \tag{5}$$

where *a* is the activation value for each element in the pooled region; the index *i* is sampled from a multinomial distribution over the probabilities *p* given by Formula (4).
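Over a single pooled region, the RAP and RWP variants can be sketched as follows (the ranking here is a plain descending sort of the raw activations; in the method itself, the ranking follows an activation function's values):

```python
import numpy as np

def rap(region, t=2):
    """Rank-based average pooling: mean of the top-t ranked activations."""
    ranked = np.sort(region.ravel())[::-1]  # descending rank order
    return float(ranked[:t].mean())

def rwp(region, b=0.5):
    """Rank-based weighted pooling: weights p_r = b*(1-b)**(r-1) by rank."""
    ranked = np.sort(region.ravel())[::-1]
    r = np.arange(1, ranked.size + 1)
    p = b * (1 - b) ** (r - 1)
    return float(np.sum(p * ranked))

region = np.array([[0.9, 0.1],
                   [0.5, 0.3]])
rap(region, t=2)    # (0.9 + 0.5) / 2 = 0.7
rwp(region, b=0.5)  # 0.5*0.9 + 0.25*0.5 + 0.125*0.3 + 0.0625*0.1 = 0.61875
```

RSP would then sample one activation from the multinomial distribution over the same rank-based probabilities instead of summing them.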

#### 2.3.6. Mixed, Gated, and Tree Pooling

Mixed pooling [19] combines Max and Average pooling, outperforming both of them when used separately. Lee et al. proposed two variants, mixed Max–Average pooling and gated Max–Average pooling, along with an alternative method, tree pooling. An overview of the three methods can be seen in Figure 4.

**Figure 4.** A schematic comparison of the three proposed operations in [19]: (**a**) mixed Max–Average pooling, (**b**) gated Max–Average pooling and (**c**) tree pooling with 3-level binary tree.

In **mixed Max–Average pooling**, a parameter *a* is learned, either once for the whole network, per layer, or per pooling region. The output of the pooling layer is then computed by Equation (6):

$$f\_{\rm mix}(\mathbf{x}) = af\_{\rm Max}(\mathbf{x}) + (1 - a)f\_{\rm avg}(\mathbf{x})\tag{6}$$

where:

*x* : the input to be pooled;

*a* : a learned mixing parameter;

*σ*(*wTx*) : a sigmoid function, 1/(1 + *exp*(−*wTx*)), used in Equation (7).
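Over a single region, the two outputs can be sketched as below, with the learned quantities (the scalar *a* and the mask *w*) simply passed in rather than trained:

```python
import numpy as np

def f_mix(x, a):
    """Mixed Max-Average pooling: a*max + (1-a)*avg, as in Equation (6)."""
    return a * x.max() + (1 - a) * x.mean()

def f_gate(x, w):
    """Gated Max-Average pooling: the mix ratio sigma(w^T x) is computed
    from the region itself through a learned mask w, as in Equation (7)."""
    g = 1.0 / (1.0 + np.exp(-float(w.ravel() @ x.ravel())))
    return g * x.max() + (1 - g) * x.mean()

x = np.array([[1., 2.],
              [3., 4.]])
f_mix(x, 0.5)                # 0.5*4 + 0.5*2.5 = 3.25
f_gate(x, np.zeros((2, 2)))  # sigma(0) = 0.5, so also 3.25
```

The difference is that the gated ratio reacts to the content of each region, whereas the mixed ratio is fixed once training is done.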

In **gated Max–Average pooling**, a mask of weights is learned and the inner product of that mask with the pooled region passed through a sigmoid function is used to decide whether to use Max or Average pooling. This mask can differ per network, layer, or region. The output is then calculated as described in Equation (7). According to the method's paper [19], in a comparison between this method and mixed Max–Average pooling, it appears that the gated variant performs consistently better.

$$f\_{\text{gate}}(\mathbf{x}) = \sigma(\boldsymbol{w}^T \mathbf{x}) f\_{\text{Max}}(\mathbf{x}) + (1 - \sigma(\boldsymbol{w}^T \mathbf{x})) f\_{\text{avg}}(\mathbf{x}) \tag{7}$$

where *w* is the learned gating mask, and *σ*, *f*Max, *f*avg are as defined for Equation (6).

A third alternative was proposed in the same paper for **tree pooling**, where a binary tree is used and the pooling filters are learned. The tree level is a pre-defined parameter, and each node holds a learned pooling filter. Furthermore, gating masks are used in a similar way as described for gated pooling previously. Thus, the pooling result for each node is described by the function (8), and the output of the pooling method is the calculated output for the root node.

$$f\_m(\mathbf{x}) = \begin{cases} \nu\_m^T \mathbf{x} & \text{if leaf node} \\ \sigma(\boldsymbol{w}\_m^T \boldsymbol{x}) f\_{m,left}(\boldsymbol{x}) + (1 - \sigma(\boldsymbol{w}\_m^T \boldsymbol{x})) f\_{m,right}(\boldsymbol{x}) & \text{if internal node} \end{cases} \tag{8}$$

where *νm* is the learned pooling filter at node *m*, and *wm* is that node's learned gating mask.

#### 2.3.7. LP Pooling

Sermanet et al. [20] proposed LP pooling as part of an architecture to recognize house numbers. It is essentially another alternative to the Average and Max pooling methods, closer to one or the other depending on the value of *P*, a predefined parameter chosen during the setup of the layer. The method acts as a weighted function, giving higher weights to more important features and lower ones to the rest, and is applied using Formula (9).

$$O = (\sum\_i \sum\_j I(i,j)^P \times G(i,j))^{1/P} \tag{9}$$

where *O* is the output, *I* is the input, and *G* is a Gaussian kernel. We should also note that when *P* = 1, it is essentially Gaussian averaging, while when *P* = ∞, it is similar to Max pooling. Using this type of pooling, the authors managed to achieve an average of about 4% better accuracy than Average pooling for the Street View House Numbers (SVHN) dataset.

#### 2.3.8. Weighted Pooling

Weighted pooling [21] is a pooling strategy that pools each region as a weighted average of its activations, assigning different weights to different activations based on the information they carry. It has three main features: first, the amount of information in the pooling region is quantified through information theory for the first time; second, each activation's contribution to reducing the uncertainty of its pooling region is quantified for the first time; and last, when selecting a representative value for the pooling region, the weight of each activation takes precedence over its raw value.

#### 2.3.9. Stochastic Pooling

Stochastic pooling [22] attempts to improve on the commonly used Max and Average pooling and their previously mentioned drawbacks by selecting the pooled values of the input based on probabilities. A probability *pi* is calculated for each of the elements inside the pooling region using Formula (10), and then one of the elements with a probability greater than zero is chosen randomly. This method does appear to have a drawback similar to that of Max pooling, since important parts of the input might be ignored in favor of other parts with non-zero probabilities. The stochastic pooling strategy can be combined with other forms of regularization, such as dropout, data augmentation, and weight decay, to avoid overfitting in deep convolutional network training.

$$p\_i = \frac{a\_i}{\sum\_{k \in R\_j} a\_k} \tag{10}$$

where *ai* is the activation value at index *i* within the pooling region *Rj*.
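The training-time sampling step can be sketched as follows (assuming non-negative activations, e.g. after a ReLU; at test time, the method instead uses the probability-weighted average):

```python
import numpy as np

def stochastic_pool(region, rng=None):
    """Stochastic pooling: sample one activation with p_i = a_i / sum(a_k)."""
    rng = rng if rng is not None else np.random.default_rng()
    a = region.ravel()
    p = a / a.sum()
    return float(rng.choice(a, p=p))

stochastic_pool(np.array([[1., 2.], [3., 4.]]))  # one of 1.0, 2.0, 3.0, 4.0
```

Larger activations are proportionally more likely to be kept, but unlike Max pooling, weaker ones still have a chance.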


#### 2.3.10. Spatial Pyramid Pooling

Spatial Pyramid Pooling (SPP) was inspired by the bag-of-words model [23], one of the best-known representation algorithms for object categorization. The fully connected layers at the end of a CNN require a fixed-length input. Spatial pyramid pooling [24] converts an input of any size into a predefined fixed length, essentially removing that potentially problematic fixed-size constraint. With a conventional fixed-size window and constant stride, the output size is relative to the input size; in an SPP layer, the stride and the pooling window are instead proportional to the input image, so the output has a fixed size. The name comes from the ability of the layer to apply more than one pooling operation and combine the outcomes before moving on to the next layer, as described in Figure 5.

**Figure 5.** A network structure with a spatial pyramid pooling layer [25].
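The core idea can be sketched as follows: max-pool the feature map into fixed *n* × *n* grids per pyramid level and concatenate, so any sufficiently large input produces the same output length. The levels 4, 2, 1 are an illustrative choice:

```python
import numpy as np

def spp(fmap, levels=(4, 2, 1)):
    """Spatial pyramid pooling of a 2D feature map: max-pool into fixed
    n x n grids per level and concatenate; the output length is always
    16 + 4 + 1 = 21 per channel, regardless of input size (inputs are
    assumed at least as large as the finest grid)."""
    h, w = fmap.shape
    feats = []
    for n in levels:
        ys = np.linspace(0, h, n + 1, dtype=int)  # bin boundaries per axis
        xs = np.linspace(0, w, n + 1, dtype=int)
        for i in range(n):
            for j in range(n):
                feats.append(fmap[ys[i]:ys[i+1], xs[j]:xs[j+1]].max())
    return np.array(feats)

spp(np.random.rand(13, 9)).shape  # (21,), same as for a 32 x 32 input
```

Because the bin boundaries scale with the input, the fully connected layers that follow always see the same number of features.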

2.3.11. Per-Pixel Pyramid Pooling

Per-pixel pyramid pooling [26] differs from the original spatial pyramid pooling method in that the largest pooling window is chosen to obtain the desired receptive field size, which may result in the loss of some finer details. For that reason, more than one pooling layer with different window sizes is applied, and the outputs are combined to create new feature maps. This pooling operation is executed for every pixel, without strides. The output is calculated by Equation (11).

$$P^{4P}(\mathbf{F}, s) = [P(\mathbf{F}, s\_1), \ldots, P(\mathbf{F}, s\_M)] \tag{11}$$

where *s* is a vector with M elements, **F** is the input feature map, and *P*(**F**, *s*i) is the pooling operation with an *s*i-sized kernel and stride 1.

#### 2.3.12. Fuzzy Pooling

The Type-1 fuzzy pooling [27] combines the fuzzification, aggregation, and defuzzification of feature map neighborhoods: each patch is first fuzzified through membership functions, the memberships are then aggregated, and the pooled value is finally obtained by defuzzification, as in Equation (12).
$$p^{\prime n} = \frac{\sum\_{i=1}^{k} \sum\_{j=1}^{k} \left( \boldsymbol{\pi}\_{i,j}^{\prime n} \cdot \boldsymbol{p}\_{i,j}^{n} \right)}{\sum\_{i=1}^{k} \sum\_{j=1}^{k} \boldsymbol{\pi}\_{i,j}^{\prime n}} \tag{12}$$

#### 2.3.13. Overlapping Pooling

Overlapping pooling was proposed as part of a paper suggesting an architecture that classifies the ImageNet LSVRC-2010 dataset [28]. The idea behind it, applicable to most if not all pooling methods, is setting a stride smaller than the kernel size, so that neighboring pooled regions overlap. The experiments with the proposed architecture showed that the top 1 and top 5 error rates were reduced by 0.4% and 0.3%, respectively, for the case of Max pooling, while the model also seemed to overfit slightly less when using overlapping (though this was an observation, with no specific evidence presented).
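The change amounts to nothing more than a stride smaller than the kernel size; a minimal sketch:

```python
import numpy as np

def max_pool(x, k, stride):
    """Max pooling; stride < k makes neighboring windows overlap."""
    h = (x.shape[0] - k) // stride + 1
    w = (x.shape[1] - k) // stride + 1
    return np.array([[x[i*stride:i*stride+k, j*stride:j*stride+k].max()
                      for j in range(w)] for i in range(h)])

x = np.arange(25, dtype=float).reshape(5, 5)
max_pool(x, k=2, stride=2)  # non-overlapping windows
max_pool(x, k=3, stride=2)  # 3x3 windows with stride 2 share a row/column
```

With a 3 × 3 kernel and stride 2, adjacent windows share a row or column of the input, so each input element can contribute to several outputs.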

#### 2.3.14. Superpixel Pooling

Superpixel is a term for 2D image segments. Essentially, superpixel pooling [29], just like overlapping pooling, is not a pooling method itself, but a method of applying a pooling function such as the Max or Average. The difference is that, instead of using a standard square sliding kernel as in other methods, the 2D image is already segmented—usually based on edges. Then, the selected pooling function is applied in each segment. This process reduces the computational cost significantly, while preserving a high accuracy in the models used.

#### 2.3.15. Spectral Pooling

While most other methods process the input in the spatial domain, spectral pooling [30] takes it to the frequency domain, pools the input, and then, returns the output back to the spatial domain. One of the main advantages is that information is preserved better compared to other common methods such as Max pooling—since lower frequencies tend to contain that information and higher frequencies usually contain noise.

The application of this type of pooling is rather straightforward, applying a Discrete Fourier Transform (DFT) to the input, cropping a predefined size window from the center, and returning it back to the spatial domain by using the inverse DFT.

Obviously, a significant issue is the computational cost, since the DFT is required both forward and inverse. That overhead though can be minimized when the FFT is used for the calculation of the convolution in the previous layer, thus limiting its use only to such scenarios. Zhang et al. [31] suggested an alternative implementation based on the Hartley transform, which might require less computational power while retaining the same amount of information.
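A minimal sketch of the forward pass (DFT, centered low-frequency crop, inverse DFT), with a simple amplitude rescaling so that, for example, a constant image keeps its value; a full implementation would also enforce conjugate symmetry so the output stays exactly real:

```python
import numpy as np

def spectral_pool(x, out_h, out_w):
    """Spectral pooling: FFT -> crop a centered low-frequency window ->
    inverse FFT, keeping the information-rich low frequencies."""
    h, w = x.shape
    F = np.fft.fftshift(np.fft.fft2(x))            # move DC to the center
    top, left = (h - out_h) // 2, (w - out_w) // 2
    crop = F[top:top + out_h, left:left + out_w]
    out = np.fft.ifft2(np.fft.ifftshift(crop))
    return np.real(out) * (out_h * out_w) / (h * w)  # rescale amplitude

y = spectral_pool(np.full((4, 4), 5.0), 2, 2)
y.shape  # (2, 2); a constant input keeps its value
```

Unlike Max pooling, the cut-off is in frequency rather than space, so smooth structures survive the downsampling largely intact.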

#### 2.3.16. Wavelet Pooling

The wavelet pooling method [32] features a completely different approach compared to the previously mentioned ones that use neighboring inputs, attempting to minimize the artifacts produced during the process of pooling. It is based on the Fast Wavelet Transform (FWT), a transformation that is applied twice on the input, once on the rows, and once again on the columns. Then, the input features are reconstructed using only the second-order wavelet sub-bands by applying the Inverse FWT (IFWT), reducing by half the total image features.

Unfortunately, though wavelet pooling managed to outperform its competitors on the MNIST dataset, simpler methods such as Average or Max pooling performed better on the other datasets (CIFAR-10, SVHN, KDEF). Furthermore, as seen in Table 1, the computational power required is about 110 K mathematical operations for the simpler MNIST dataset and rises to a tremendous total of 6.2 M for the KDEF dataset, compared to the 3.5 K and 29 K operations (about 200-times fewer) required by the much simpler-to-apply Average pooling.


**Table 1.** A comparison of the total mathematical operations required per method [32].

#### 2.3.17. Intermap Pooling

To achieve an increase in robustness to spectral variations of audio signals and acoustic features, Intermap Pooling (IMP) was introduced [33]. This is accomplished by adding a convolutional maxout layer that groups the feature maps and then selects the Max activation at each position.

#### 2.3.18. Strided Convolution Pooling

Ayachi et al. [34] proposed strided convolution as a drop-in replacement for Max pooling layers with the same stride and kernel size, attempting to make the CNNs more memory efficient. The convolution function that is applied is:

$$c\_{\dot{i},\dot{j},n}(f) = \sigma(\sum\_{h=0}^{k} \sum\_{w=0}^{k} \sum\_{u=0}^{m} \theta\_{h,w,u,n} f\_{\mathcal{S}}(h,w,\dot{i},j,u)) \tag{13}$$

where *σ* is the activation function, *n* ∈ [0, *m*] indexes the output feature maps of the previous convolution layer (*m* in total), *k* is the kernel size, (*h*, *w*) index the kernel height and width, *u* indexes the input channels, and finally, *θ* is the kernel of convolution weights, with *θ* = 1 if *n* = *u* and *θ* = 0 otherwise.

In Table 2, one can easily see that the replacement of the pooling layer with the strided convolution does seem promising, since it actually reduces the total memory required by each model while also increasing the overall accuracy.

**Table 2.** Model size and top 5 error reduction before and after replacing the Max pooling layer with strided convolution for the ILSVRC2012 classification challenge [34].
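As a sketch of Equation (13), the special case with the identity kernel fixed (*θ* = 1 when *n* = *u*) rather than learned, and ReLU standing in for *σ*, reduces to a strided per-channel window sum:

```python
import numpy as np

def strided_identity_conv(x, k=2, stride=2):
    """Strided convolution with theta = 1 when n == u: each output channel
    sums its own input channel's k x k window; x has shape (C, H, W)."""
    c, h, w = x.shape
    ho, wo = (h - k) // stride + 1, (w - k) // stride + 1
    out = np.zeros((c, ho, wo))
    for i in range(ho):
        for j in range(wo):
            win = x[:, i*stride:i*stride+k, j*stride:j*stride+k]
            out[:, i, j] = win.sum(axis=(1, 2))
    return np.maximum(out, 0.0)  # ReLU as the activation sigma

strided_identity_conv(np.ones((1, 4, 4))).shape  # (1, 2, 2)
```

In the actual method, the kernel weights *θ* are learned, which is what allows the layer to outperform the fixed pooling operation it replaces.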


#### 2.3.19. Center Pooling

Center pooling [35] is a pooling method used for object detection and intends to identify distinct and more recognizable visual patterns. In an output feature map, we obtain the maximum values along a pixel's vertical and horizontal axes and add them; a high response indicates that the pixel is a center keypoint, i.e., the center of a detected object within the image.

#### 2.3.20. Corner Pooling

On the other hand, corners are usually located outside the objects and have no local object features. Corner pooling [36] was introduced to solve this problem: it finds the maximum values along the boundary directions and, in this way, identifies the corners, which however makes the corners sensitive to the edges. To address this and let corners also capture the visual patterns of the objects when needed, the cascade corner pooling method can be used. Detecting the corners of an object can help better define the edges of the object itself.

#### 2.3.21. Cascade Corner Pooling

Cascade corner pooling [37] looks like a combination of center and corner pooling, by taking the maximum values in both the boundary directions and internal directions of the objects. Initially, from each boundary, it finds a boundary maximum value, then proceeds to look inside the location of the boundary maximum value to obtain an internal maximum value, and finally, it adds them together. As a result, the corners obtain both the boundary information and the visual patterns of objects.

#### 2.3.22. Adaptive Feature Pooling

Adaptive feature pooling [38] gathers features from all feature levels for each object detection proposal and merges them for the upcoming prediction. Each proposal is mapped to every feature level, a grid of features is pooled from each level, and a fusion function (element-wise maximum or sum) then combines the grids of features from the different levels.

#### 2.3.23. Local-Importance-Based Pooling

Local-Importance-based Pooling (LIP) [39] is a pooling layer that can enhance discriminative features during the downsampling process by learning adaptive importance weights based on the inputs. With such a learnable scheme, the importance function is no longer limited to hand-crafted forms and is able to learn the criterion for the discriminativeness of features. Furthermore, the size of the LIP window is kept at or above the stride, making full use of the feature map and avoiding a fixed sampling interval. More specifically, the importance function in LIP is implemented by a tiny fully convolutional network, which learns to generate the importance map from the inputs end-to-end [40].

#### 2.3.24. Soft Pooling

Soft Pooling (SoftPool) [41] is a quick and effective kernel-based process that aggregates exponentially weighted activations, as described in Formula (14). In comparison with a number of other methods, SoftPool holds more information in the downsampled activation maps, so by having a more sophisticated downsampling process, the result returns better classification accuracy. It can be used to downsample 2D images and 3D video activation maps.

$$w\_i = \frac{e^{a\_i}}{\sum\_{j \in \mathcal{R}} e^{a\_j}} \tag{14}$$

where:

*a* : the activation value; *i*, *j* : the pooled region index.
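Formula (14) over a single region amounts to a softmax-weighted sum; a sketch (the max subtraction is only for numerical stability and does not change the weights):

```python
import numpy as np

def soft_pool(region):
    """SoftPool: output = sum_i w_i * a_i with w_i = exp(a_i) / sum_j exp(a_j)."""
    a = region.ravel()
    e = np.exp(a - a.max())  # stable softmax weights
    w = e / e.sum()
    return float(np.sum(w * a))

soft_pool(np.array([[10., 0.], [0., 0.]]))  # close to the max: ~9.9986
```

The exponential weighting behaves like a soft version of Max pooling: dominant activations dominate the output, yet every activation still contributes and receives gradient.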

#### **3. Putting the Methods to the Test**

#### *3.1. The Benchmark Setup*

In order to choose the optimal architecture and datasets for our benchmark, Table 3 was compiled, which summarizes what was used for each method in its corresponding paper.

**Table 3.** A cumulative table of models and datasets used in each method's publication.


It seems that less-potent architectures are preferred in most cases. This is probably because they usually achieve a lower overall performance, but that also means that the impact of changing the pooling layer will be better highlighted. Thus, a similar model was chosen, a LeNet5 architecture with 2 convolution layers, 2 respective interchangeable pooling layers, and 2 fully connected layers, as shown in Figure 6.

**Figure 6.** The CNN architecture used for the tests.

Regarding the datasets, the MNIST, CIFAR10, and CIFAR100 were used, since it seems from Table 3 that these are commonly used in the reviewed papers. They are also ideal since we had to make sure they were interchangeable for the exact same architecture without changes to the fully connected layer(s), just by modifying the total output class parameter.

Lastly, we focused on testing pooling methods that can be used as a direct drop-in replacement for the Max pooling layer, with a kernel size and stride of size 2, in order to reduce each dimension by half—applying parameters that would provide similar results wherever required (like a 0.5 scaling factor, for instance, for the spectral pooling layer). Stochastic gradient descent was used as an optimizer, with a learning rate of 0.01 and momentum of 0.9 over 300 epochs.
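A minimal PyTorch sketch of such a setup, with the pooling layer supplied as a factory so it can be swapped per experiment; the layer widths here are the classic LeNet5 values and are our illustrative assumption, not necessarily the exact benchmark configuration:

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """LeNet5-style model with two interchangeable pooling layers."""
    def __init__(self, pool_factory=lambda: nn.MaxPool2d(2, 2),
                 in_channels=1, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 6, kernel_size=5, padding=2), nn.ReLU(),
            pool_factory(),                    # interchangeable pooling no. 1
            nn.Conv2d(6, 16, kernel_size=5), nn.ReLU(),
            pool_factory(),                    # interchangeable pooling no. 2
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, num_classes),       # only this varies per dataset
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5(pool_factory=lambda: nn.AvgPool2d(2, 2))
out = model(torch.zeros(1, 1, 28, 28))
out.shape  # torch.Size([1, 10])
```

Switching the dataset then only requires changing the output class parameter, assuming the inputs are brought to a common size and channel count.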

#### *3.2. Performance Evaluation*

For the performance comparison, we used the standard top 1 and top 5 testing accuracy (higher is better); for the computational complexity, we used the time required per epoch (lower is better), while also including three indicators, which can provide better insight into how well the details of the original image are maintained—for all three (higher values are better):

**Root-Mean-Squared Contrast (RMSC)** [44], as defined in Formula (15) for a *M* × *N* image:

$$RMSC = \sqrt{\frac{1}{M \times N} \sum\_{i=0}^{M-1} \sum\_{j=0}^{N-1} (\mathbf{x}\_{ij} - \overline{\mathbf{x}})^2} \tag{15}$$

where:

*xij* : each pixel of the image;

*x* : the mean pixel value, $\overline{x} = \frac{1}{M \times N} \sum\_{i=0}^{M-1} \sum\_{j=0}^{N-1} x\_{ij}$.

**Peak Signal-to-Noise Ratio (PSNR)** [45], as defined in Formula (16) for an *M* × *N* image:

$$PSNR = 20\log\_{10}\left(\frac{MAX\_f}{\sqrt{MSE}}\right) \tag{16}$$

where:

*MSE* : (Mean-Squared Error) = (∑*M*−<sup>1</sup> *<sup>i</sup>*=<sup>0</sup> <sup>∑</sup>*N*−<sup>1</sup> *j*=0 *f*(*i*, *j*) − *g*(*i*, *j*)-<sup>2</sup>)/(*<sup>M</sup>* <sup>×</sup> *<sup>N</sup>*); *f* : the data of the original image; *g* : the data of the pooled image;

*MAXf* : the maximum signal value of the original image.
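Formula (16) can be sketched in NumPy as follows. Note that comparing an original with a pooled output requires the two arrays to share a shape, so the pooled image is assumed to have been resized back to *M* × *N* first:

```python
import numpy as np

def psnr(original: np.ndarray, pooled: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between the original and the pooled image."""
    f = original.astype(np.float64)
    g = pooled.astype(np.float64)
    mse = np.mean((f - g) ** 2)
    if mse == 0:
        return float("inf")  # identical images: no introduced noise
    return float(20 * np.log10(max_val / np.sqrt(mse)))

f = np.zeros((2, 2))
print(round(psnr(f, f + 1.0), 2))  # 48.13 (i.e., 20*log10(255/1))
```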

**Structural Similarity Index (SSIM)** [46], which is defined by three combined metrics for luminance, contrast, and structure and can be simplified for two signals *x*, *y* in the form seen in Formula (17):

$$SSIM(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} \tag{17}$$

where:

$\mu_x, \mu_y$ : the pixel means, $\mu_x = \frac{1}{N}\sum_{i=1}^{N} x_i$;

$\sigma_x, \sigma_y$ : the standard deviations, $\sigma_x = \sqrt{\sum_{i=1}^{N} (x_i - \mu_x)^2 / (N - 1)}$;

$\sigma_{xy}$ : the covariance, $\sigma_{xy} = \sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y) / (N - 1)$;

$C_1$ : $(k_1 L)^2$;

$C_2$ : $(k_2 L)^2$;

$L$ : the dynamic range of the pixels, 255 for 8-bit grayscale images;

$k_1$ : a small constant <1; 0.01 was used in the paper's experiments;

$k_2$ : a small constant <1; 0.03 was used in the paper's experiments.
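A single-window (global) version of Formula (17) can be sketched in NumPy as below; note that production implementations (e.g., scikit-image) instead average SSIM over local sliding windows:

```python
import numpy as np

def ssim(x: np.ndarray, y: np.ndarray,
         L: float = 255.0, k1: float = 0.01, k2: float = 0.03) -> float:
    """Global SSIM over the whole signal, following Formula (17)."""
    x = x.astype(np.float64).ravel()
    y = y.astype(np.float64).ravel()
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    # ddof=1 matches the (N - 1) denominators in the definitions above.
    var_x, var_y = x.var(ddof=1), y.var(ddof=1)
    cov_xy = np.sum((x - mu_x) * (y - mu_y)) / (x.size - 1)
    return float(((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2))
                 / ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

a = np.arange(16.0).reshape(4, 4)
print(round(ssim(a, a), 6))  # 1.0 for identical images
```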

All tests were performed using a PyTorch implementation of the methods, on an Nvidia GTX1080 GPU.

#### **4. Results**

#### *4.1. Details Retention*

As previously described, three metrics were used as a means of comparison for how well details are preserved after pooling the original input. The first is the Root-Mean-Squared Contrast (RMSC) [44], the standard deviation of the pixel intensities, which indicates how well contrast levels are maintained between input and output. The second, the Peak Signal-to-Noise Ratio (PSNR) [45], shows how strong the original image signal is compared to the noise introduced by pooling. Lastly, the Structural Similarity Index (SSIM) [46], which ranges from −1 to 1, measures the actual similarity between the input and output of the pooling layer.

In Table 4, Average pooling appears to be the best choice, since it shows the best SSIM values across all dataset tests. Furthermore, it achieved a top ranking PSNR as well for two out of the three datasets—which can be interpreted as a low level of introduced noise. When it comes to the RMSC, though other methods achieved better values, Average pooling kept up, and as we can see in the pooling layers' output examples, higher contrast is not always good, at least when it comes to comparing similarities with the original image.

**Table 4.** The detail-retention indicators of our benchmark. The best value for each metric in each dataset is highlighted.


In Figures 7–9, a sample input of each dataset is presented, as well as the respective output for each pooling layer. Each method might have a tendency to favor higher or lower values of the input pixels, while some increase the contrast significantly.

Combined with the results of Table 4, it seems that Average pooling indeed achieved a result that was very close to the original image. On the other hand, tree, l2, fuzzy, and spectral pooling introduced a much higher contrast to the image, generating an output that was very different from the original input.

**Figure 7.** The MNIST "5" original image (**a**) and the respective results of the first pass of pooling for the methods Max (**b**), adaptive Max (**c**), fractional (**d**), Average (**e**), mixed (**f**), gated (**g**), tree (**h**), l2 (**i**), stochastic (**j**), fuzzy (**k**), overlapping Max (**l**), spectral (**m**), wavelet (**n**), LIP (**o**), and SoftPool (**p**).

**Figure 8.** The CIFAR10 frog original image (**a**) and the respective results of the first pass of pooling for the methods Max (**b**), adaptive Max (**c**), fractional (**d**), Average (**e**), mixed (**f**), gated (**g**), tree (**h**), l2 (**i**), stochastic (**j**), fuzzy (**k**), overlapping Max (**l**), spectral (**m**), wavelet (**n**), LIP (**o**), and SoftPool (**p**).

**Figure 9.** The CIFAR100 horse original image (**a**) and the respective results of the first pass of pooling for the methods Max (**b**), adaptive Max (**c**), fractional (**d**), Average (**e**), mixed (**f**), gated (**g**), tree (**h**), l2 (**i**), stochastic (**j**), fuzzy (**k**), overlapping Max (**l**), spectral (**m**), wavelet (**n**), LIP (**o**) and SoftPool (**p**).

#### *4.2. Model Performance*

In Table 5, the accuracy of the individual pooling methods is presented, along with the time required per epoch. For the MNIST, perhaps due to the simplicity of the dataset, the results were almost identical. Although Average pooling appeared to "win the battle" of detail retention in the previous section, here it is obvious that Max pooling and its variants, especially overlapping Max pooling, performed much better.


**Table 5.** The top 1/top 5 validation accuracy and time required per epoch for each model.

Figures 10–12 show the top 1 accuracy of the model over the 300 training epochs of the benchmark. In Figure 12, it is clear that overlapping Max pooling is the overall better-performing method for CIFAR100, significantly outperforming the rest—though the difference is not that obvious for the other two datasets.

When it comes to complexity, most methods required about 8 s per epoch, with some requiring considerably more time; these might perform much better with a C++ implementation. Overlapping Max pooling had one of the lowest times per epoch, giving it yet another advantage. On the other hand, some methods converged much more quickly: tree, l2, spectral, and Average pooling, for instance, seemed to require far fewer than 100 epochs to reach their highest accuracy. Thus, l2 might be a better choice after all, since it achieved high accuracy in fewer epochs with one of the lowest processing times per epoch.

**Figure 10.** The top 1 accuracy of the models for the MNIST dataset over the epochs.

**Figure 11.** The top 1 accuracy of the models for the CIFAR10 dataset over the epochs.

**Figure 12.** The top 1 accuracy of the models for the CIFAR100 dataset over the epochs.

On a closing note, the selected total of 300 epochs might be higher than required, since most methods reached their peak accuracy within 100–150 epochs. The high epoch count did, however, ensure that every method had enough time to reach its best possible performance.

#### **5. Discussion**

As expected, there is no "absolute best" pooling layer: one that works great for one application might not even be viable for another. Though overlapping Max pooling seemed to be the "winner" of this benchmark, other commonly used methods may be more suitable in different scenarios; when detail retention is important, for instance, Average pooling is a better choice, easy to implement, and similar in performance. Therefore, the choice of the proper pooling layer is not always simple and straightforward.

One of the most important factors is probably the overall computational power required. Since the convolution layer itself is resource-heavy and the pooling layer's role is to "relieve" part of that load, it would be expected for the added overhead to be as minimal as possible.

Other factors that one should keep in mind are the level of invariance required usually when the input is a video or highly variable images of similar objects—and the overall detail retention that is required. Of course, a combination of two or even more pooling methods could be applied to further improve the overall accuracy of the output. Some might even prefer simpler methods due to their ease of implementation—in the case where a rapid prototype would be adequate as a proof of concept. Taking into consideration all the model's requirements and even the personal favorites of the development team is what usually drives the final selection of the pooling layer.

#### **6. Conclusions**

CNNs are an important part of computer vision, and pooling can significantly reduce their overall processing cost, allowing the implementation of models and architectures with far fewer resources than would normally be required. We created a roundup of many of the pooling methods proposed so far (though it might not be exhaustive), summarizing each approach, along with a benchmark for a practical comparison.

Overlapping Max pooling appeared to perform better than the rest, at least for the selected datasets. Even though it is next to impossible to pinpoint and test every variation of every existing pooling method, we hope this work can serve as a starting point for researchers and machine learning scientists, helping them choose the most appropriate method or even inspiring new approaches and improvements to current implementations.

**Author Contributions:** Conceptualization, G.A.P.; methodology, N.-I.G., P.V. and K.-G.M.; software, N.-I.G., P.V. and K.-G.M.; validation, N.-I.G.; formal analysis, N.-I.G., P.V. and K.-G.M.; investigation, N.-I.G., P.V. and K.-G.M.; resources, N.-I.G., P.V. and K.-G.M.; data curation, N.-I.G.; writing—original draft preparation, N.-I.G., P.V. and K.-G.M.; writing—review and editing, N.-I.G.; visualization, N.-I.G. and P.V.; supervision, G.A.P.; project administration, N.-I.G.; funding acquisition, G.A.P. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research received no external funding.

**Data Availability Statement:** The source code of this study is available via https://github.com/MLV-RG/cnn-pooling-layers-benchmark/ (accessed on 8 September 2022).

**Acknowledgments:** This work was supported by the MPhil program "Advanced Technologies in Informatics and Computers", hosted by the Department of Computer Science, International Hellenic University, Kavala, Greece.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


*Review* **Defect Detection Methods for Industrial Products Using Deep Learning Techniques: A Review**

**Alireza Saberironaghi <sup>1</sup>, Jing Ren <sup>1</sup> and Moustafa El-Gindy <sup>2,</sup>\***


**Abstract:** Over the last few decades, detecting surface defects has attracted significant attention as a challenging task. There are specific classes of problems that can be solved using traditional image processing techniques. However, these techniques struggle with complex textures in backgrounds, noise, and differences in lighting conditions. As a solution to this problem, deep learning has recently emerged, motivated by two main factors: accessibility to computing power and the rapid digitization of society, which enables the creation of large databases of labeled samples. This review paper aims to briefly summarize and analyze the current state of research on detecting defects using machine learning methods. First, deep learning-based detection of surface defects on industrial products is discussed from three perspectives: supervised, semi-supervised, and unsupervised. Secondly, the current research status of deep learning defect detection methods for X-ray images is discussed. Finally, we summarize the most common challenges and their potential solutions in surface defect detection, such as unbalanced sample identification, limited sample size, and real-time processing.

**Keywords:** defect detection; surface defect detection; defect detection for X-ray images; defect recognition; deep learning

**1. Terminology**


**Citation:** Saberironaghi, A.; Ren, J.; El-Gindy, M. Defect Detection Methods for Industrial Products Using Deep Learning Techniques: A Review. *Algorithms* **2023**, *16*, 95. https://doi.org/10.3390/a16020095

Academic Editors: Xiang Zhang and Xiaoxiao Li

Received: 24 December 2022; Revised: 25 January 2023; Accepted: 3 February 2023; Published: 8 February 2023

**Copyright:** © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https:// creativecommons.org/licenses/by/ 4.0/).


#### **2. Introduction**

Several factors affect the quality of manufactured products during the manufacturing process, including poor working conditions and inadequate technology. Among product defects, poor quality is most visible in surface defects. Therefore, detecting product surface defects [1] ensures a high qualification ratio and reliable quality.

A defect is generally defined as an absence or area that differs from a normal sample. Figure 1 compares normal samples with defective samples of industrial products.

**Figure 1.** Normal samples of industrial products are compared to defective samples. The first row contains good samples, and the second, third, and fourth rows contain defective samples. The first, second, third, fourth, and fifth columns display wood, grid, capsule, leather, and bill, respectively, and there are three types of defects listed below the image.

In the past, identifying defects was carried out by experts, but this process was not efficient. One major reason for this was that human subjectivity greatly affected the detection results. Additionally, human inspection alone cannot meet the need for real-time detection, and thus, it is not able to fulfill all the necessary requirements.

A significant amount of time has been dedicated to using traditional methods to detect surface defects. When differentiation exists between the defect color and the background, traditional image processing methods can perform well. Traditional methods in terms of the product's features can be categorized into three types: texture-based features, color-based features, and shape-based features.

Several studies have used specialized techniques for detecting surface defects. For color-based features, for instance, literature [2] proposed a technique that uses a color histogram feature together with a texture feature vector to classify image blocks and detect surface defects on wood; experiments have proven this method effective, especially with defects involving junctions. The method's results are shown in Figure 2.

**Figure 2.** An example of the result of wood defect detection using the presented technique in [2].

Research conducted in literature [3] employed cosine similarity to verify the validity of the periodic law in magneto-optical images by utilizing the color moment feature. This method successfully identified the appropriate magneto-optical image for detecting and locating welding defects. Literature [4] describes a two-step technological process for SVM-based and color-histogram-based defect detection in particle boards, followed by localization of the defects using smoothing and thresholding. In literature [5], color moment features and FSIFT features were merged based on their magnitude of influence, to resolve the problem that a single feature cannot adequately describe tile surface defects.

In terms of shape-based feature methods, literature [6] proposed a method of detecting cutting defects on magnetic surfaces. In this method, the image of the magnetic surface is reconstructed using the Fourier transform and Hough transform, and, in order to obtain defect information, the gray difference between the original image and the reconstructed image is compared. A method for identifying defects on bottle surfaces was presented in reference [7]. This method includes a step for extracting regions of interest, where the boundary line of the light source is determined using a fast Hough transform algorithm. In [8], global Fourier image reconstruction and template matching were proposed as a method for detecting and locating small defects in aperiodic images. Literature [9] described how to detect surface defects on small camera lenses using Hough transforms, polar coordinate transforms, weighted Sobel filters, and SVM algorithms. Different types of defects were detected in several test images. In Figure 3, red highlights are used to indicate defects such as stains, scratches, and dots.

**Figure 3.** A camera lens with several defects: (**a**) original image and (**b**) converted result based on inspection result and polar coordinate transformation [9].

In the texture-based feature methods, for example, literature [10] improved a multi-block local binary pattern (MB-LBP) algorithm. In addition to retaining the simplicity and efficiency of LBP, this algorithm ensures high recognition accuracy by varying the block size used to describe defect features. According to the experiments, the method is fast enough to meet online real-time detection requirements (63 milliseconds/image) and outperforms the widely used scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and gray-level co-occurrence matrix (GLCM) algorithms in recognition accuracy (94.30%), demonstrating that MB-LBP can be used to detect images online in real time. Literature [11] used a fuzzy model based on extracted GLCM features and processed it using MATLAB. The model took three variables as inputs: autocorrelation, the square root of variance, and the number of objects. Using fuzzy logic on ceramic defects, with a light intensity of 300 lx, a camera distance of 50 cm, and a 1.3 MP or 640 × 480 pixel image size, the accuracy of the ceramic inspection process was 96.87% on the training data, and the accuracy of the real-time system was 92.31%. According to literature [12], features such as the Reduced Coordinated Cluster Representation (RCCR) are used to form a one-class classifier. An algorithm based on texture periodicity estimates the primitive unit size of defect-free fabrics during the training phase. After splitting the fabrics into samples of one unit, RCCR features are used in a one-class classifier to learn their local structure. In [13], morphological filters are used to detect defects on billet surfaces in order to distinguish them from scales. With the help of repeated morphological erosion and dilation, the image is converted into a binary image using morphological top-hat filtering and thresholding.
The detection efficiency of the proposed algorithm is evaluated using real billet images, and the algorithm is found in experiments to be effective and suitable for analyzing billet images with scales. According to literature [14], the GLCM is defined as the fabric image's characteristic matrix. Euclidean distance is used to distinguish defect-free from defective images, and the autocorrelation function is used to determine the pattern period. In this paper, the authors discussed two GLCM parameters in relation to Euclidean distances. In addition to being concise, Euclidean distances have the advantage of being reliable and objective for defect detection. According to the algorithm's tests, it is not only accurate, but also well adapted to yarn-dyed fabrics with short organization cycles. Table 1 summarizes recent applications of machine learning algorithms for surface defect detection in industrial products, categorized by texture, color, and shape features. Table 2 compares the strengths and weaknesses of feature-based methods for detecting surface defects, including accuracy, computational efficiency, and robustness. These tables provide an overview of the diversity of approaches and key factors affecting performance in the field of surface defect detection.
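To make the texture-feature idea concrete, here is a minimal, self-contained sketch of a horizontal gray-level co-occurrence matrix and its contrast feature. The helper names and toy patches are illustrative only; libraries such as scikit-image provide full GLCM implementations:

```python
import numpy as np

def glcm_horizontal(img: np.ndarray, levels: int) -> np.ndarray:
    """Normalized GLCM for horizontal neighbors at distance 1."""
    glcm = np.zeros((levels, levels), dtype=np.float64)
    for a, b in zip(img[:, :-1].ravel(), img[:, 1:].ravel()):
        glcm[a, b] += 1  # count co-occurring gray-level pairs (left, right)
    return glcm / glcm.sum()

def glcm_contrast(glcm: np.ndarray) -> float:
    """Contrast feature: sum over P(i, j) * (i - j)^2."""
    i, j = np.indices(glcm.shape)
    return float(np.sum(glcm * (i - j) ** 2))

# A flat patch has zero contrast; a 0/1 checkerboard has maximal contrast 1.0.
flat = np.zeros((8, 8), dtype=int)
checker = np.indices((8, 8)).sum(0) % 2
print(glcm_contrast(glcm_horizontal(flat, 2)))     # 0.0
print(glcm_contrast(glcm_horizontal(checker, 2)))  # 1.0
```

A defect that disrupts a regular texture shifts mass in the GLCM away from the pattern's usual co-occurrence cells, which is what features like contrast pick up.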


**Table 1.** Recent applications using machine-learning-based vision algorithms for detecting surface defects in industrial products, categorized into three categories based on texture, color, and shape features.

**Table 2.** An overview of the strengths and weaknesses of various feature-based methods for detecting surface defects in industrial products.



Using only one feature or one class of features on industrial products is rarely sufficient because their surfaces typically contain a variety of information. Consequently, many features are used in combination in practical applications, making it difficult to detect defects. Additionally, feature-based approaches are highly effective when they detect defects in images with little or no variation, and when defects appear on surfaces in a consistent pattern. Considering the wide range of uncertainties in industrial settings, it is important to develop methods that are adaptable to such wide ranges of variations in defect intensity, shape, and size.

Deep learning models based on convolutional neural networks (CNN) have had a lot of success in various computer vision fields, such as recognizing faces, identifying pedestrians, detecting text in images, and tracking targets. Additionally, these models are used in a wide range of industrial settings for defect detection. This includes both commercial and industrial applications, such as in the automotive industry for detecting defects in cars. The deep-learning-based surface defect detection software is employed in these settings to improve the efficiency and accuracy of the defect detection process.

Recently, several papers covering the latest techniques, applications, and other aspects of deep learning in defect detection have been published [19]. Literature [12] describes the different types of defects and compares mainstream and deep learning methods for defect detection. Various defect detection techniques are discussed in literature [20], including ultrasonic inspection, machine vision, and deep learning. Literature [21] focuses on the use of AI-enhanced metrology, computer vision, and quality assessment in the Zero Defect Manufacturing (ZDM) process. The study also highlights the use of IoT/IIoT technology as a means of supporting these tools and implementing AI algorithms for data processing and sharing. Literature [22] discusses deep learning methods for detecting surface defects and then examines three critical issues related to small samples and real-time defect detection. In [23,24], the authors analyze and compare the benefits and drawbacks of the above methods. There are also defect detection surveys in several application domains, including fabric defects [25], corrosion detection [26], pavement defects [27], metal defect detection [28], and industrial applications [29]. The investigation shows that, in the field of surface defect detection of industrial products, the literature reviewing machine learning methods is currently limited, and while some papers summarize the challenges and problems, the mentioned solutions are not systematic. The first section of this paper addresses the above issues by summarizing the research status of deep-learning-based detection of surface defects on industrial products, and then discusses the issues in the process of industrial surface defect detection, such as unbalanced sample identification, small sample sizes, and real-time processing.

This paper is organized as follows. Section 3 provides an overview of deep learning methods for surface defect detection in industrial products from three perspectives, along with a common dataset for surface defect detection. In Section 4, we summarize the recent research status of deep learning methods for X-ray image defect detection. A discussion of the main problems and their solutions is provided in Section 5. In Section 6, a brief description of future research directions is provided and Section 7 concludes the paper with a conclusion.

#### **3. Deep Learning Surface Defect Detection Methods for Industrial Products**

Deep learning has become increasingly popular in the field of defect detection due to its rapid development. This section summarizes the state of research on inspection of industrial products for detecting surface defects. Learning-based approaches are classified as supervised, semi-supervised, and unsupervised. The performance of learning-based methods is best optimized when large datasets are provided. In particular, supervised techniques perform well when there are sufficient examples of each class in the dataset.

#### *3.1. Supervised*

Supervised detection requires large datasets of defect-free and defective samples labeled in a training set. Since all the training data are labeled, detection rates can be very high. It must be noted, however, that supervised detection may not always be the most effective approach, due to the imbalance of classes in the datasets. There are a number of datasets that supervised learning methods use, including the fabric dataset [30], the rail defect dataset [31], and the railroad dataset [32].
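One common mitigation for this class imbalance is to weight the loss inversely to class frequency, so that the rare defective class contributes comparably to the frequent defect-free class. A PyTorch sketch, with hypothetical sample counts:

```python
import torch
import torch.nn as nn

# Hypothetical counts: 950 defect-free vs. 50 defective training samples.
class_counts = torch.tensor([950.0, 50.0])

# Inverse-frequency weights: total / (num_classes * count_per_class).
weights = class_counts.sum() / (len(class_counts) * class_counts)
# weights ≈ [0.526, 10.0]: each defective sample weighs ~19x more in the loss.

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 2)          # stand-in model outputs for a batch
labels = torch.randint(0, 2, (8,))  # stand-in ground-truth classes
loss = criterion(logits, labels)
```

Alternatives with the same goal include oversampling the minority class or generating synthetic defective samples, as some of the GAN-based methods below do.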

Deep neural networks and the feature extraction and classification methods used in supervised approaches differ in their structures. For example, detecting cross-category defects without retraining was proposed using a two-layer neural network in literature [33]. Based on structural similarities between image pairs, the method learns differential features, which may result in some structural similarities among different classification objects. Experiments on real factory datasets have shown that this method can detect defects in different types of factories. Literature [34] suggests that the composition of kernels is more important than the number of layers when it comes to detection results. To detect small defects and textures in surface images, a sample image must be large enough for computational accuracy while keeping the cost of the network low. ShuffleNet uses pointwise group convolution and channel shuffle as two new techniques to achieve this goal. Literature [35] proposes a novel in-line inspection system for plastic containers based on ShuffleNet V2; the system can also be used to inspect images with complex backgrounds. In [36], the authors proposed ShuffleDefectNet, a deep-learning-based defect detection system that achieved 99.75% accuracy on the NEU dataset.

Reference [37] suggested that shallow CNN networks can be used to identify anomalies. To train the model, only negative images are used, and the research employs full-size images. The argument is that it is not necessary to have full-size examples of both defective and defect-free samples, as the negative samples already contain pixels that correspond to the defect-free regions. Based on the Fast R-CNN model, Faster R-CNN introduces a region proposal network (RPN), which enables an end-to-end learning algorithm. This leads to a near-costless regional recommendation algorithm that significantly improves the speed of target detection. In [38], Faster R-CNN was used to detect PCB surface defects, and a new network was proposed combining ResNet50, GRAPN residual units, and ShuffleNetV2. Using a cascaded RCNN structure, as described in literature [39], the defect detection problem of power line insulators can be changed into a two-level target detection problem; the results are shown in Figure 4.

**Figure 4.** The results of insulator defect detection. The green box represents the non-defective insulator, and the red box represents the defective insulator [39].

In limited hardware configurations, MobileNet-SSD [9] improves real-time object detection performance; this network reduces the number of parameters without sacrificing accuracy. An SSD network performs classification and bounding-box regression using various convolution layers. Translation invariance and variability are resolved in this model, resulting in good detection precision and speed. Object detection is effective when defects have regular or predictable shapes [40]; additional preprocessing steps can be applied to more complex defect types. Fully Convolutional Networks (FCNs) use convolutional layers for all network layers, so label maps can be derived directly using pixel-level prediction. To achieve accurate results, a deconvolution layer with larger data sizes is used. In literature [41], FCN and Faster R-CNN were combined to develop a deep learning model that could detect stains, leaks, and pipeline blockages in tunnels. A method for segmenting defects in solar cell electroluminescence pictures was presented in [42]; a defect segmentation map was obtained in one step by combining an FCN with a specific U-net architecture.

#### *3.2. Unsupervised*

Research has begun to explore unsupervised methods to overcome the disadvantages of supervised methods. By learning the inherent characteristics of and connections between the unlabeled input training data, the machine can automatically classify the data when no label information is available [43]. Among unsupervised learning methods, approaches based on reconstruction and on embedding similarity are the most commonly used to detect surface defects. Reconstruction-based methods such as autoencoders (AEs) and Generative Adversarial Networks (GANs) are the most common, with popular algorithms including PaDiM [44], SPADE [45], and PatchCore [46]. In [47], an algorithm based on a DBN was proposed for detecting defects in solar cells; both the training and reconstructed images were used as supervision data by the fine-tuning network of the BP algorithm. Literature [48] proposed a multi-scale convolutional denoising autoencoder with high accuracy and robustness that synthesizes the results of multiple pyramid levels.

A SOM-based detection method was proposed in [49] for determining the difference between normal and defective wood. The first stage involves detecting suspected defect areas, and the second stage involves separately inspecting each defect area. A detection method that uses GANs was proposed in reference [50]. The method is divided into two stages: first, a generative network and a learning mechanism based on statistical representation are used to detect new areas. In the second stage, defects and normal samples are directly distinguished using the Frechet distance. The solar panel dataset was used to test the method, and it achieved 93.75% accuracy.

A multiscale AE with fully convolutional neural networks has been proposed [51], in which each FCAE sub-network directly obtains the original feature image from the input image and performs feature clustering. Utilizing a fully convolutional neural network, the residual images were combined to create the defect image. PatchCore, introduced in literature [46], is a technique for identifying and isolating abnormal data in scenarios where only normal examples are available. It balances the need to retain normal context, through memory banks of patch-level features extracted from pre-trained ImageNet networks, with the need to minimize computational time, via coreset subsampling, creating a leading system for cold-start image anomaly detection and localization that is efficient on industrial benchmarks. On MVTec, the algorithm demonstrated an AUROC of over 99%, while also being highly efficient in small training set scenarios. Literature [52] presented a GAN-based surface vision detection framework that uses OTSU to segment fusion feature response maps and fuses the responses of the three layers of the GAN discriminator. The framework has been proven effective on datasets of wood cracks and road cracks. As shown in Figure 5, ref. [53] proposed a GAN-based method for detecting strip steel surface defects, in which the generator G uses encoding and the hidden-space features in the penultimate layer are fed into an SVM to detect defects. The test results on images provided by the Handan Iron and Steel Plant indicated good accuracy. The method is more effective at detecting texture images; however, its accuracy still needs to be improved.
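The reconstruction-based idea common to these AE methods can be sketched as follows: an autoencoder is trained on defect-free patches only, so at test time a high reconstruction error flags a potentially defective patch. The architecture, patch size, and threshold below are illustrative assumptions, not taken from any cited paper:

```python
import torch
import torch.nn as nn

class ConvAE(nn.Module):
    """Tiny convolutional autoencoder for 32x32 grayscale patches."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(16, 8, 2, stride=2), nn.ReLU(),    # 8 -> 16
            nn.ConvTranspose2d(8, 1, 2, stride=2), nn.Sigmoid(),  # 16 -> 32
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model: nn.Module, patch: torch.Tensor) -> float:
    """Per-patch mean squared reconstruction error (higher = more anomalous)."""
    with torch.no_grad():
        return float(((model(patch) - patch) ** 2).mean())

model = ConvAE()  # in practice: trained on defect-free patches only
patch = torch.rand(1, 1, 32, 32)
print(model(patch).shape)  # torch.Size([1, 1, 32, 32])
is_defective = anomaly_score(model, patch) > 0.02  # threshold tuned on validation data
```

Because the network only ever learns to reconstruct normal appearance, defective regions tend to be reconstructed poorly, which is what the score thresholds on.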

**Figure 5.** Presenting the results of experiments on six defect samples using four methods. The defect types are listed in the first column and include drops tar, shadow, floating, crush, pitted surface and scratch. The results from traditional manual feature extraction methods (CPICS-LBP, AEC-LBP, HWV and the proposed method in [53]) are shown in columns 2–5. The experiment compares the proposed method with current state-of-the-art methods in detecting strip steel surface defects.

#### *3.3. Semi-Supervised*

Semi-supervised methods combine the properties of supervised and unsupervised methods. In semi-supervised defect detection, only normal samples are used as training data: a defect-free boundary is learned, and any sample falling outside the boundary is considered anomalous. Since defective samples are difficult to obtain, the approach is extremely useful; nevertheless, it achieves lower accuracy than supervised defect detection. Semi-supervised methods can also exploit unlabeled sample data automatically, without manual intervention.
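The defect-free boundary idea can be illustrated with a deliberately minimal detector: fit a hypersphere (center and radius) on normal feature vectors only, and flag anything outside it. Real systems use one-class SVMs, autoencoders, or GANs; the class name and the 99th-percentile threshold here are illustrative choices.

```python
import numpy as np

class NormalBoundary:
    """Minimal one-class detector: a hypersphere around normal features."""

    def fit(self, normal: np.ndarray) -> "NormalBoundary":
        self.center = normal.mean(axis=0)
        # Radius at the 99th percentile of normal distances tolerates a few outliers.
        d = np.linalg.norm(normal - self.center, axis=1)
        self.radius = np.percentile(d, 99)
        return self

    def predict(self, x: np.ndarray) -> np.ndarray:
        """True = anomalous (outside the learned defect-free boundary)."""
        return np.linalg.norm(x - self.center, axis=1) > self.radius
```

Only normal data enters `fit`, mirroring the training regime described above; defective samples are never needed until test time.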

A framework for identifying defects in PCB solder joints was proposed in [54], combining active learning and self-training through a sample query suggestion algorithm for classification. The framework has been demonstrated to improve classification accuracy while reducing the need for manual annotations. A semi-supervised model combining a convolutional autoencoder (CAE) and a generative adversarial network is proposed in [55]. After training with unlabeled data, the stacked CAE's encoder network is retained and its output fed into a SoftMax layer serving as the GAN discriminator; the GAN generated false images of steel surface defects to train the discriminator. For the detection of steel surface defects, [56] developed a weakly supervised learning (WSL) framework combining localization networks (LNets) and decision networks (DNets): the LNet is trained with image-level labels and outputs a heat map of potential defects that serves as input to the DNet. By using the RSAM algorithm to weight the regions identified by the LNet, the proposed framework has been demonstrated to be effective on real industrial datasets. Weakly supervised methods have wide application prospects because they combine advantages of both supervised and unsupervised methods, yet few of them target surface defect detection in industrial products. The authors of [57] proposed a deep learning algorithm that learns a variety of defect types from an unbalanced training sample pool for PCBA manufacturing products. Using a novel batch sampling method and a sample-weighted cost function, the method achieves an overall defect recognition accuracy of 98% on PCBA images.

A semi-supervised learning system that generates samples to detect surface defects was proposed in [58]. In the semi-supervised learning part, two classifiers, CDCGAN and ResNet18, were used and compared on the NEU-CLS dataset; the method was shown to outperform both supervised learning and transfer learning. A convolutional neural network based on residual structures was proposed in [59]: stacking two-layer residual building modules yields a 43-layer convolutional network, and by appropriately increasing the network width, a better balance between network depth and width is obtained and accuracy is improved. The network shows good performance on the DAGM, NEU steel, and copper clad plate datasets. Table 3 provides an overview of recent research in surface defect detection, including classifications of targets, and Table 4 evaluates the strengths and weaknesses of deep learning techniques for detecting surface defects in industrial products, including accuracy, computational efficiency, and robustness. Together, these tables give a comprehensive picture of current research and of the considerations involved in applying deep learning to surface defect detection. Table 5 lists a selection of commonly used datasets for training and testing surface defect detection algorithms, classified by the type of industrial product they target. This information is useful for researchers and practitioners looking for suitable datasets for their work.


**Table 3.** An overview of recent research publications as well as classifications based on targets.





**Table 4.** Strengths and weaknesses of different techniques for detecting surface defects on industrial products using deep learning.


**Table 5.** A list of common surface defect datasets with classifications for industrial products.



#### **4. Deep Learning Defect Detection Methods for X-ray Images for Industrial Products**

Non-destructive testing (NDT) is a method that uses radiography or ultrasound technologies to discover faults without causing damage to the detected objects. It is widely used in engineering industries to detect and evaluate defects in materials of all types.

An important technique in non-destructive testing is radiographic testing, which uses X-rays to identify and evaluate flaws such as cracks or porosities. Defects can appear in X-ray images in many shapes and sizes, and the images are often low in contrast and noisy, which makes identifying defects difficult.

The traditional approach for identifying defects in industrial products is for human operators or experts to visually inspect radiographs. This method, however, is subjective and prone to error, and examining a large number of images is time-consuming and may lead to misinterpretations. In recent years, the field of defect detection has advanced significantly thanks to the emergence of deep learning techniques, and a number of detection methods have been proposed that are more efficient and reliable than the conventional approach. This section summarizes current research on industrial product defect detection methods using X-ray images. Specifically, it covers the use of deep learning techniques such as convolutional neural networks and generative adversarial networks to analyze radiographic images and identify defects with a high degree of accuracy. These methods have the potential to reduce the subjectivity, human error, and inspection time associated with the traditional approach. Additionally, they can be trained to improve over time with more data, making them more robust and reliable.

A system proposed in [124] aimed to automate the inspection and condition monitoring of machines in the hard metal industry by analyzing defects in real production samples. Three models were created to analyze different types of data; a stacked generalization ensemble was applied, and a random forest classifier combined the results of the microprofilometer and ultrasound models. The fusion model achieved improved performance and higher classification accuracy (88.24%) compared to the individual models. Additionally, the shop floor model was able to effectively identify breakdowns during the manufacturing process, and the ultrasound model obtained better classification scores than the VGG-19 model. In [125], a three-stage deep learning algorithm was proposed for detecting bubble patterns in engines: an autoencoder is trained on normal images, the encoder's coefficients are fixed, and a fully connected network is trained using both normal and defective images; the entire system is then fine-tuned to improve performance. In [126], a CNN model with ten layers organized into six grades was designed for detecting defects in X-ray welding images; it achieved 98.8% classification accuracy when the ReLU activation function was used. A real-time X-ray image analysis method using Support Vector Machines (SVMs) was presented in [127]. Using a background subtraction algorithm, all potential defects were segmented, and three features were extracted: the defect area, the average grayscale difference, and the grayscale standard deviation. The extracted features were then input into an SVM classifier to distinguish defects from non-defects.
The proposed method reduced both undetected defects and false alarms in real-time X-ray defect detection. Another SVM-based method for detecting weld defects was described in [128]. The SVM is trained on three feature vectors extracted from potential weld defects using grey-level profile analysis and, in the last step, distinguishes real defects from merely potential ones. A high percentage of correct detections was achieved using the proposed method. For detecting insert-molding defects in automotive electronics, ref. [129] proposed a Yolov5-based DR image defect detection algorithm. The window width and window level are adjusted in the preprocessing stage of the acquired data, and fast guided filtering is used for edge retention. Tiny anomalies are detected using overlap, and a multi-task dataset is constructed. Replacing the standard convolutions of the backbone network with Ghost modules further reduces the number of parameters, and CSP modules are embedded in the neck and backbone of the network to enhance feature extraction. Adding a transformer attention module after spatial pyramid pooling avoids over-fitting while reducing computational effort. Finally, consistent experiments are conducted with Yolo-series target detection algorithms on DR data. For detecting bead toe errors, ref. [130] proposed a lightweight semantic segmentation network: an encoder first extracts the texture features of the different tire regions, and a decoder is then introduced to fuse the encoder's output features. Reducing the dimension of the feature maps allows the positions of the bead toe to be recorded in the X-ray image, and a local mIoU (L-mIoU) index is proposed to evaluate the final segmentation. YOLOv3\_EfficientNet is used as the backbone of the methodology instead of YOLOv3\_darknet53.
This results in a substantial improvement in YOLOv3 mean average precision, as well as a substantial reduction in inference time and storage space. DR image features are then used to augment the data, increasing the diversity of the clarity and shape of defects. With depthwise separable convolution, models can be deployed on embedded devices within acceptable accuracy-loss ranges. A method was presented in [131] that uses deep learning with X-ray images to detect defects in aluminum casting parts for automobiles, with the goal of improving both the algorithm and the data augmentation. The study found that using Feature Pyramid Networks (FPNs) resulted in a 40.9% increase in the mean Average Precision (mAP) value, making it the most effective modification, and that using RoIAlign instead of RoI pooling in Faster R-CNN improved the accuracy of bounding box localization. The study also proposed various data augmentation methods to compensate for the limited availability of X-ray image datasets for defect detection. The results showed that the mAP value for each data augmentation method reached an optimum and did not continue to increase as the number of datasets increased. Overall, the proposed improvements to the Faster R-CNN algorithm yielded better performance for X-ray image defect detection of automobile aluminum casting parts. In [132], a Faster R-CNN detection model with X-ray preprocessing was applied to tire defect detection to improve curve-fitting performance; precision and recall were improved by adjusting the feature extractor, proposal generator, and box classifier. According to [133], triplet deep neural networks can be used to detect weld defects. X-ray images are first preprocessed into relief images to make defects easier to identify. A deep network is then constructed based on triplets, and a feature vector is obtained by mapping each triplet, so that feature vectors of similar defects lie closer together than those of different defect types. An SVM is then used for automatic detection and classification of weld defects. Based on the results of two experiments, the proposed method is capable of effectively detecting multiple defects. Tables 6 and 7 together provide a comprehensive overview of the current state of research and practice in deep learning for defect detection in X-ray images: Table 6 summarizes recent research publications, and Table 7 compares the strengths and weaknesses of different techniques. This information can be valuable for anyone interested in the advancement of this field.
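The three hand-crafted features reported for the SVM-based method of [127] — defect area, average grayscale difference, and grayscale standard deviation — are straightforward to compute once background subtraction has produced a candidate mask. A sketch with hypothetical names (the original feature definitions may differ in detail):

```python
import numpy as np

def region_features(image: np.ndarray, mask: np.ndarray,
                    background_level: float) -> np.ndarray:
    """The three features used in [127] for one candidate defect region:
    area, mean grey-level difference from background, grey-level std."""
    region = image[mask]                       # pixels inside the candidate mask
    area = float(mask.sum())                   # region area in pixels
    mean_diff = abs(float(region.mean()) - background_level)
    std = float(region.std())
    return np.array([area, mean_diff, std])
```

Each candidate region yields one three-dimensional vector; stacking these vectors gives the training matrix for the SVM classifier described above.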

**Table 6.** Recent publications on deep learning defects detection in X-ray images.



**Table 7.** Strengths and weaknesses of different deep-learning techniques for identifying defects in X-ray images.


#### **5. Problems and Solutions**

#### *5.1. Unbalanced Sample Identification Problem*

In industrial products, surface defects can also be detected with deep learning using unbalanced sample sets [137,138]. Training a deep learning model usually requires a sample set that is balanced across categories. This ideal situation, however, almost never occurs in the real world: more often than not, the majority of the data comes from "normal" samples, while "defective" or "abnormal" samples make up only a small portion. Supervised learning is the main task that suffers from unbalanced sample identification, since the algorithm pays more attention to categories with larger data volumes and underestimates categories with smaller ones, harming the model's generalization and prediction abilities. Data-level methods aim to maintain a consistent number of samples across all types within the training set; they can be broken down into five categories: collecting more data at the source, data resampling, data augmentation, class-equalization sampling, and synthetic sampling. At the data source, more samples can be collected for the under-represented categories; alternatively, the number of samples in each category can be purposefully increased by horizontal or vertical flipping, rotating, zooming, cropping, and other operations.
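The flip-and-rotate operations listed above are trivial to apply in NumPy; the sketch below turns one minority-class image into five extra samples (zooming and cropping are omitted for brevity, and the function name is illustrative).

```python
import numpy as np

def augment(image: np.ndarray) -> list[np.ndarray]:
    """Generate extra minority-class samples via flips and 90-degree rotations."""
    return [
        np.fliplr(image),        # horizontal flip
        np.flipud(image),        # vertical flip
        np.rot90(image, 1),      # rotations by 90, 180, 270 degrees
        np.rot90(image, 2),
        np.rot90(image, 3),
    ]
```

For square defect images, all five variants keep the original shape, so they can be appended directly to the training set of the rare class.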

Data resampling [139,140] changes the proportion of samples in each category by resampling the sample set, through either oversampling or undersampling. Class-equalization sampling groups samples by category and generates a sample list for each category; to ensure that every category has an equal chance of participating in training, a random category is selected during training and samples are drawn at random from the corresponding list. Synthetic samples [141] are generated by combining characteristics of existing samples: a new sample is created by randomly selecting feature values from existing samples of the same category.
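Random oversampling with class equalization can be sketched in plain Python: duplicate minority-class samples at random until every category matches the majority count. Names are illustrative.

```python
import random

def class_balanced_oversample(samples, labels, seed=0):
    """Random oversampling: duplicate minority-class samples until every
    class reaches the majority-class count. Returns (sample, label) pairs."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(group) for group in by_class.values())
    out = []
    for y, group in by_class.items():
        out += [(s, y) for s in group]                                # originals
        out += [(rng.choice(group), y) for _ in range(target - len(group))]  # duplicates
    return out
```

Undersampling would instead discard majority-class samples down to the minority count; the trade-off is lost information versus duplicated information.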

#### *5.2. Small Sample Problem*

As a result of the continuous optimization of industrial processes, the number of defective samples has decreased. This makes it difficult to use deep learning to detect surface defects in industrial products, since fewer and fewer defect images are available, and training on small samples can easily lead to overfitting [142]. Transfer learning applies knowledge gained from one task to a different but related task when there is insufficient data for the target task; consequently, it is a critical method for solving the small sample problem. For surface defect detection, [143,144] used VGG networks and transfer learning to inspect emulsion pump bodies, printed circuit boards, transmission line components, steel plates, and wood surfaces. Fabric surface defect detection using DenseNet and transfer learning was described in [145], and the combination of transfer learning and AlexNet was used to detect surface defects on solar panels and fabrics in [146,147]. The small sample problem can also be addressed by optimizing the network structure. GAN was first used for image anomaly detection with the AnoGAN model [148] in 2017: an iterative optimization process finds the image in the latent space that most closely matches the test image, and DCGAN is then used to detect anomalies in that image. The f-AnoGAN model introduced in [149] adds an encoder so that images can be quickly mapped to latent points, and then uses WGAN to detect anomalies; the encoder makes inference far faster than AnoGAN's iterative optimization. Additionally, the GANomaly model, proposed in [150] in 2018, detects abnormal samples by comparing latent variables obtained by encoding with latent variables obtained by reconstruction.
None of the above models requires training with negative samples. Many additional sample images can also be obtained by enlarging the data: using synthetic defects [151], a decorated plastic part dataset was expanded by adding synthetic defects to defect-free images, and [152] described a technique for generating defect representations that combines hand-crafted and unsupervised learning features.
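The transfer-learning recipe used in several of the works above (freeze a pre-trained backbone, retrain only a small head) can be shown in miniature with NumPy, where a fixed feature matrix stands in for the frozen VGG/DenseNet/AlexNet features and a logistic-regression head is trained from scratch. All names and hyperparameters are illustrative.

```python
import numpy as np

def train_linear_head(features: np.ndarray, labels: np.ndarray,
                      lr: float = 0.5, epochs: int = 300):
    """Train only a new linear head on frozen backbone features
    (logistic regression via full-batch gradient descent)."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid predictions
        g = (p - labels) / len(labels)                  # gradient of log-loss
        w -= lr * features.T @ g                        # only the head is updated;
        b -= lr * g.sum()                               # the backbone never changes
    return w, b
```

Because only the small head is optimized, very few defect images suffice, which is exactly why transfer learning mitigates the small sample problem.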

#### *5.3. Real-Time Problem*

It is essential to consider real-time constraints when performing surface defect detection in real industrial environments: detection time must be reduced and detection efficiency improved while maintaining roughly the same accuracy. Several studies have addressed this problem. To detect surface defects on printed circuit boards, [153] proposed combining SSIM and MobileNet; the resulting algorithm maintained high accuracy while being at least 12 times faster than Faster R-CNN. The authors of [154] developed a novel 11-layer CNN model for detecting welding defects in robotic welding manufacturing. The proposed method was capable of detecting defects in metal additive manufacturing in real time, meeting specific requirements for online detection.
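SSIM, used by [153] as a fast template-comparison signal, measures luminance, contrast, and structural agreement between two images. Below is a single-window variant computed once over the whole frame; production implementations (presumably including [153]) apply the index in sliding windows, so this is a simplified sketch.

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 1.0) -> float:
    """Single-window SSIM between a defect-free template x and a test image y.
    Identical images score 1.0; structural deviations lower the score."""
    c1 = (0.01 * data_range) ** 2   # stabilizing constants from the SSIM paper
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float(((2 * mx * my + c1) * (2 * cov + c2)) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```

Thresholding the score (or a per-window score map) flags frames that deviate from the golden template, which is cheap enough for real-time screening before a heavier classifier runs.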

#### **6. Discussion**

Deep learning technology has revolutionized the field of defect detection in industrial products. However, finding a suitable deep learning model for solving the defect detection problem is very difficult due to the particularities of industrial scenarios. In the coming years, deep learning will encounter challenges and trends as it becomes more widely used in industrial fields. A brief description of recent trends and future research directions is provided in this section.

• Integrating deep learning with other methods:

By incorporating other techniques such as traditional image processing, the robustness and performance of the defect detection system in challenging conditions can be enhanced. For instance, using traditional image processing techniques to preprocess the images before inputting them into a deep learning model can improve the quality of the data and make it easier for the model to effectively detect defects. Additionally, integrating deep learning with other techniques, such as physics-based simulations, can provide a better understanding of the underlying physical causes of defects and lead to the development of more efficient and effective defect detection methods.

• Adjustment to various lighting scenarios:

Examining industrial products frequently occurs under diverse lighting conditions, which can make it hard to identify defects. Research in this field could concentrate on developing techniques for adapting to various lighting conditions and using them to enhance the precision of defect detection. This could include methods such as image enhancement techniques, color constancy techniques, and multiple exposure fusion techniques, to improve the visibility of defects in different lighting conditions. Additionally, research could also focus on developing deep learning models that are robust to changes in lighting conditions, such as using adversarial training methodologies, to improve the robustness of the model. This may lead to a more accurate and reliable defect detection system that can function in a wide range of lighting scenarios.
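As one concrete instance of the image enhancement techniques mentioned above, global histogram equalization spreads a low-contrast intensity range across the full 8-bit scale; CLAHE or multi-exposure fusion would be natural next steps. A NumPy sketch (the function name is illustrative):

```python
import numpy as np

def equalize_histogram(image: np.ndarray) -> np.ndarray:
    """Global histogram equalization for 8-bit greyscale images: remaps
    intensities via the cumulative histogram so defects stay visible
    under dim or uneven lighting."""
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]                      # first non-empty bin
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)          # lookup table per grey level
    return lut[image]
```

A dim image occupying, say, levels 100–163 is stretched to 0–255, so a faint scratch that differed from its surroundings by a few grey levels becomes far easier for a downstream detector to see.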

• Transparent AI:

To be implemented in industrial environments, defect detection systems need to be transparent and explainable. Research in this field could focus on developing techniques to make deep-learning-based defect detection systems more understandable, so that users can comprehend why a defect was missed or incorrectly identified.

• All aspects need to be taken into account:

In order for a defect detection system to perform well, it must take various factors into account. Many factors can influence the accurate detection of defects, such as defect size and shape, the image acquisition technique, image alignment and distortion, image resolution, and algorithmic speed, among others. It is important to consider all of these factors when creating a mature and successful method.

• Limited number of defect samples:

In many industrial applications, deep learning methods require a large training dataset and have high computational costs, while the number of defect samples is often insufficient. Additionally, as the product line is frequently updated, new defect types are introduced and detection becomes more challenging. A simple defect detection method trained only on normal samples has no difficulty with a small defect dataset, but for defect localization and classification, the limited size of the defective dataset can be a real challenge.

• Utilizing transfer learning:

Defect patterns may be shared between two different application domains: cracks in two different materials may be similar in morphology yet differ in size and color. Current approaches nevertheless require training two separate networks. A well-trained, tested network can transfer its knowledge to a new network to speed up training, but transfer learning is not yet effectively utilized in most approaches.

• Multi-modal sensor integration:

Defect detection in industrial products often relies on visual inspections using cameras or other imaging devices. However, incorporating other types of sensors, such as thermal, acoustic, or vibration sensors, can provide additional information that can aid in the detection of defects. Research in this area could focus on developing methods for integrating data from multiple sensors and using it to improve the accuracy of defect detection. This could include techniques such as sensor fusion, where data from multiple sensors is combined to provide a more comprehensive view of the product, or methods for combining deep learning with other types of sensor data, such as sensor data from IoT devices.

• Continuous learning:

In industrial environments, the product line is frequently updated, and new defect types are introduced. Research in this area could focus on developing methods for continuous online learning, which can be used to adapt the defect detection system as new data is acquired and new defects are introduced. This could include online learning techniques, where the system can continuously update its knowledge as new data is acquired, or active learning methods, where the system can actively select the most informative images for annotation. This would allow the system to adapt to changes in the product line and improve its performance over time.

• Real-time detection:

Only a few existing defect detection methods run in real time. To apply these methods to real-time inspection scenarios in the future, computationally efficient variants must be developed that sustain high detection success rates in real time.

• Reducing the complexity:

Users of defect detection methods want to understand why a defect has been missed, or incorrectly flagged in an acceptable part, when such a method fails. Most deep learning methods have complex architectures, so humans have difficulty understanding the decision-making process and providing a rationale for failure, which is a challenge when deploying and improving a system. Moreover, in industrial applications, lightweight deep learning networks are easier to deploy: the processing resources that support artificial intelligence computations are valuable in production-line quality inspection and industrial maintenance monitoring, and lightweight networks effectively reduce the prediction system's workload, which greatly benefits simple terminal deployments and also reduces costs.

• A common reference database:

Testing can be conducted on different databases, but several studies have failed to provide satisfactory results because of inconsistencies between databases and a lack of test samples. Additionally, most of the studies presented in this review use their own databases of varying size and quality. A common reference database would make it possible to evaluate and compare performance in the future.

#### **7. Conclusions**

Deep learning is rapidly gaining momentum as a powerful tool in the field of defect detection on industrial products. In this paper, we conducted a comprehensive review of the current state-of-the-art in the use of machine learning methods for detecting defects in industrial products. We specifically focused on deep learning methods for detecting surface defects and defects from X-ray images, and provided a detailed overview of the different techniques and algorithms that have been proposed in these areas. We also discussed some of the key challenges and limitations of these methods, and highlighted potential solutions to these problems. The goal of this review was to provide researchers with a clear understanding of the current state-of-the-art in the field of surface defect detection for industrial products, and to serve as a reference for future research in this area.

**Author Contributions:** Authors contributed as follows: Conceptualization, A.S. and J.R.; methodology, J.R. and M.E.-G.; funding acquisition, J.R. and M.E.-G.; investigation, A.S., J.R. and M.E.-G.; writing original draft preparation, A.S. and J.R.; writing—review and editing, A.S., J.R. and M.E.-G.; supervision, J.R. and M.E.-G. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Natural Sciences and Engineering Research Council of Canada (NSERC), grant number 210471.

**Data Availability Statement:** The corresponding websites are listed in the manuscript.

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


**Disclaimer/Publisher's Note:** The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

MDPI St. Alban-Anlage 66 4052 Basel Switzerland www.mdpi.com

*Algorithms* Editorial Office E-mail: algorithms@mdpi.com www.mdpi.com/journal/algorithms



mdpi.com ISBN 978-3-0365-8831-5