*Article* **An Efficient YOLO Algorithm with an Attention Mechanism for Vision-Based Defect Inspection Deployed on FPGA**

**Longzhen Yu <sup>1</sup>, Jianhua Zhu <sup>1,\*</sup>, Qian Zhao <sup>2</sup> and Zhixian Wang <sup>1</sup>**


**Abstract:** Industry 4.0 features intelligent manufacturing, and among its technologies, the vision-based defect inspection algorithm is remarkable for quality control in parts manufacturing. With the help of AI and machine learning, auto-adaptive operation is replacing manual operation in this field, and much progress has been made in recent years. In this study, considering the features demanded of inspection in industrialization, we make a further improvement in smart defect inspection: an efficient algorithm using Field Programmable Gate Array (FPGA)-accelerated You Only Look Once (YOLO) v3 based on an attention mechanism is proposed. First, because the camera angle and defect features are relatively fixed, an attention mechanism based on the concept of directing the focus of defect inspection is proposed. The attention mechanism consists of three improvements: (a) image preprocessing, which tailors images to concentrate selectively on defect-relevant content and mainly includes cutting, zooming and splicing, named CZS operations; (b) tailoring the YOLOv3 backbone network, which ignores invalid inspection regions in the deep neural network and optimizes the network structure; and (c) data augmentation. The first two improvements efficiently reduce deep learning operations and accelerate inspection, but the preprocessed images are similar, and this lack of diversity would reduce network accuracy, so (c) is added to mitigate the lack of considerable amounts of training data. Second, the algorithm is deployed on a PYNQ-Z2 FPGA board to meet industrial production requirements for accuracy, efficiency and extensibility, as FPGA provides a low-latency, low-cost, high-power-efficiency and flexible architecture that enables deep learning acceleration in industrial scenarios. The Xilinx Deep Neural Network Development Kit (DNNDK) converts the improved YOLOv3 to Programmable Logic (PL) that can be deployed on FPGA; the conversion process mainly consists of pruning, quantization and compilation. Experimental results showed that the algorithm had high efficiency: inspection accuracy reached 99.2%, processing speed reached 1.54 Frames per Second (FPS), and power consumption was only 10 W.

**Keywords:** vision; defect inspection; YOLO; FPGA; attention

**Copyright:** © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

**1. Introduction**

Manufacturing involves a large number of parts. However, installation, welding, handling and many other sectors of manufacturing inevitably cause part defects, most of which can be identified by vision. The vision-based defect inspection algorithm is crucial to ensure the quality of parts and of the entire manufacturing process. Industry 4.0 features intelligent manufacturing, which means doing jobs as efficiently as possible and adapting quickly to new conditions. With the help of AI and machine learning, auto-adaptive operation is replacing manual operation in defect inspection. In particular, with the emergence of cutting-edge deep learning technologies [1], the scope, intelligence, accuracy, speed and efficiency of defect inspection algorithms have improved significantly [2].

**Citation:** Yu, L.; Zhu, J.; Zhao, Q.; Wang, Z. An Efficient YOLO Algorithm with an Attention Mechanism for Vision-Based Defect Inspection Deployed on FPGA. *Micromachines* **2022**, *13*, 1058. https://doi.org/10.3390/mi13071058

Academic Editors: Xiuqing Hao, Duanzhi Duan and Youqiang Xing

Received: 3 June 2022; Accepted: 28 June 2022; Published: 30 June 2022

**Publisher's Note:** MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Deep learning algorithms for vision-based defect inspection can be mainly divided into two types: classification-based algorithms and regression-based algorithms [3]. Algorithms based on classification are represented by the Region-based Convolutional Neural Network (R-CNN) series, including R-CNN, Spatial Pyramid Pooling Networks (SPP-Net), Fast R-CNN, Region-based Fully Convolutional Networks (R-FCN) and Mask R-CNN. Based on these algorithms, Fan et al. [4], Ji et al. [5], Zhang et al. [6], Guo et al. [7], Jin et al. [8] and Cai et al. [9] have inspected surface defects of wood, gear, metal, etc. Using two-stage processing (region extraction and object classification), R-CNN algorithms generally require high computing power to achieve high accuracy, but with a relatively low inspection speed. Regression-based algorithms are characterized by only one round of processing, so their speed is faster. Redmon et al. [10] proposed the well-known You Only Look Once (YOLO) algorithm, which is a representative regression-based and end-to-end model. To date, the YOLO series has evolved to include YOLOv1, YOLOv2, YOLOv3 [11], YOLOv4 [12] and YOLOv5 [13]. Representative regression-based algorithms also include the Single Shot MultiBox Detector (SSD) [14] and CornerNet [15]. YOLOv3 is among the most widely used YOLO algorithms; based on it, Jing et al. [16], Li et al. [17], Huang et al. [18] and Du et al. [19] performed surface defect inspection of fabric, PCB boards, pavements, etc.

Compared with classical deep learning object detection algorithms, vision-based defect inspection can be optimized due to two characteristics. First, the recognition region on an image is predictable. As shown in Figure 1, the camera angle for the two parts is fixed; in fact, we only care about the red box region of the two photos. By identifying only this region, we can determine whether the part is defective, and other regions of the original photo can be deleted accordingly. Second, the algorithm needs to meet the deployment requirements of industrial scenarios, so indicators of efficiency must be considered, such as stability, scalability, higher speed and lower power consumption. The target system requirements for this work are as follows: inspection accuracy should be higher than 97%, image processing speed should be higher than 1 FPS, and the power consumption of each device should be less than 100 W.

**Figure 1.** Normal welding and defect welding. (**a**) Normal welding. (**b**) Defect welding.

In this work, an efficient YOLO algorithm for vision-based defect inspection with an attention mechanism deployed on FPGA is proposed. There are two main contributions. First, an attention mechanism is proposed that is based on the concept of drawing global dependencies between the input and the output of a neural network [20], consequently directing the focus of defect inspection. The improvement in attention covers three aspects: (1) we use image preprocessing, named CZS operations, to concentrate on defect-relevant regions; (2) we tailor the YOLOv3 backbone network to the fixed shape and size of the inspected defects; and (3) we apply data augmentation to expand the training dataset. Second, the algorithm is deployed on a PYNQ-Z2 FPGA board to meet industrial requirements for accuracy, efficiency and extensibility.



### **2. YOLOv3 Based on an Attention Mechanism**

### *2.1. Image CZS Preprocessing*

An attention mechanism focuses on modeling the relationship between the input and the output of the algorithm, regardless of distance [22]. For defect inspection, the camera angles of industrial cameras are relatively fixed, so we can predefine the possible defect regions and extract features only from those specific regions. The self-attention mechanism calculates a sequence's semantic representation by associating different positions in the sequence; adding our image preprocessing is therefore equivalent to adding a self-attention mechanism to YOLOv3 preprocessing. CZS operations are shown in Figure 2. The blue box represents the cutting region, and the green box and the red box represent two kinds of defect markup regions. The color boxes numbered 1 to 8 in the original image on the left side correspond to the splicing regions of the image on the right side; all color boxes are the main areas of concern for defect detection. The entire process involves three steps: cut predefined regions from the original image, zoom the regions to the same size, and splice the regions together to form a new image (named CZS operations for short).


**Figure 2.** CZS operations.

Cutting operations take a small square box containing a defect region as the cutting region. The box is slightly larger than the smallest box containing the defect region, so as to ensure fault-tolerant positioning of the same type of images. Define the width and height of the original image as $w_s$ and $h_s$, respectively. The ratio of the defect region's width to the original image width is $r_w$, and thus the width of the defect region is $w_m = w_s \times r_w$. Similarly, define the height, $x$-coordinate and $y$-coordinate ratios as $r_h$, $r_x$ and $r_y$; the height, $x$-coordinate and $y$-coordinate of the defect region's center point can then be expressed as $h_m = h_s \times r_h$, $x_m = w_s \times r_x$ and $y_m = h_s \times r_y$. Define the width, height, $x$-coordinate and $y$-coordinate of the cutting box's top left corner as $w_c$, $h_c$, $x_c$ and $y_c$, respectively.

$$w_c = h_c = \max(w_m, h_m) \times \alpha \tag{1}$$

$$x_c = \begin{cases} x_m - \frac{w_c}{2}, & x_m + \frac{w_c}{2} \le w_s \\ w_s - w_c, & x_m + \frac{w_c}{2} > w_s \end{cases} \tag{2}$$

$$y_c = \begin{cases} y_m - \frac{h_c}{2}, & y_m + \frac{h_c}{2} \le h_s \\ h_s - h_c, & y_m + \frac{h_c}{2} > h_s \end{cases} \tag{3}$$

The $\alpha$ is the expansion coefficient, which takes a value between 1 and 2; that is, the cutting box is 1 to 2 times the size of the smallest box containing the defect region. Formulas (2) and (3) mean that when the defect region is close to the boundary of the original image, the top left corner of the cutting box is shifted so that the box remains within the original image.
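The cutting step can be written directly from Formulas (1)-(3). Below is a minimal Python sketch under our own naming (the function and its arguments are illustrative, not the paper's released code):

```python
def cutting_box(ws, hs, rw, rh, rx, ry, alpha=1.5):
    """Compute a square cutting box per Formulas (1)-(3).

    ws, hs: width/height of the original image in pixels.
    rw, rh, rx, ry: defect-region ratios relative to the original image.
    alpha: expansion coefficient, between 1 and 2.
    """
    wm, hm = ws * rw, hs * rh        # defect region width/height
    xm, ym = ws * rx, hs * ry        # defect region center coordinates
    wc = hc = max(wm, hm) * alpha    # Formula (1): square, expanded box
    # Formulas (2)-(3): shift the box back inside the image near the border
    xc = xm - wc / 2 if xm + wc / 2 <= ws else ws - wc
    yc = ym - hc / 2 if ym + hc / 2 <= hs else hs - hc
    return xc, yc, wc, hc
```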

The zooming operation scales all cutting boxes on an image to the same size, so that they can be fully held by a new 416 × 416 image (the standard image size processed by YOLOv3 is 416 × 416 pixels). According to YOLOv3, images will be scaled to 416 pixels in width ($W$) and height ($H$) before being processed. Suppose the number of cutting boxes on an image is $N_c$. The number of boxes that can be held in a row ($N_h$) or a column ($N_v$) of a new image is calculated by Formula (4). The target size to which a cutting box should be scaled is calculated by Formula (5). The scaling factor $\beta$ is calculated by Formula (6). The *sqrt* function returns the square root of the passed argument, the *ceiling* function returns the smallest integer not less than the passed argument, and the *floor* function returns the largest integer not greater than the passed argument.

$$N_h = N_v = \mathrm{ceiling}(\mathrm{sqrt}(N_c)) \tag{4}$$

$$w_z = h_z = \mathrm{floor}\left(\frac{W}{N_h}\right) = \mathrm{floor}\left(\frac{H}{N_v}\right) \tag{5}$$

$$\beta = \frac{w_z}{w_c} \tag{6}$$
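Formulas (4)-(6) amount to a few lines of arithmetic; a sketch with the YOLOv3 default $W = H = 416$ (function names are ours):

```python
import math

def zoom_params(n_c, W=416, H=416):
    """Grid layout and tile size for n_c cutting boxes (Formulas (4)-(5))."""
    n_h = n_v = math.ceil(math.sqrt(n_c))   # boxes per row and per column
    wz = hz = math.floor(W / n_h)           # side length of each scaled tile
    return n_h, n_v, wz, hz

def scaling_factor(wz, wc):
    """Formula (6): ratio of tile size to cutting-box size."""
    return wz / wc
```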

The splicing operation combines the cutting boxes from the original image into a new image after they are scaled. Splicing mainly consists of two tasks: one is to map a cutting box to its splicing region; the other is to map the actual defect markup region to the splicing region. For the first task, we sort the cutting boxes in ascending order of their $x_c$ value; if the $x_c$ values of two cutting boxes are the same, we sort them in ascending order of their $y_c$ value. Suppose that a cutting box is ranked as $N^i$, $i = 0, \ldots, N_h$; then the width, height, $x$-coordinate and $y$-coordinate of the box's top left corner in the new image are defined as $w_\chi^i$, $h_\chi^i$, $x_\chi^i$ and $y_\chi^i$. The operator // represents integer (floor) division and % represents the remainder.

$$n_h^i = \begin{cases} N^i // N_v + 1, & N^i \,\%\, N_v \neq 0 \\ N^i // N_v, & N^i \,\%\, N_v = 0 \end{cases} \tag{7}$$

$$n_v^i = \begin{cases} N^i \,\%\, N_v, & N^i \,\%\, N_v \neq 0 \\ N_v, & N^i \,\%\, N_v = 0 \end{cases} \tag{8}$$

$$w_\chi^i = h_\chi^i = w_z \tag{9}$$

$$x_\chi^i = (n_v^i - 1) \times w_z \tag{10}$$

$$y_\chi^i = (n_h^i - 1) \times h_z \tag{11}$$

For the second task, the ratios of the width, height, $x$-coordinate and $y$-coordinate of the center point of an actual defect markup region to the new image are $R_w^i$, $R_h^i$, $R_x^i$ and $R_y^i$.

$$R_w^i = \frac{w_m \times \beta}{W} \tag{12}$$

$$R_h^i = \frac{h_m \times \beta}{H} \tag{13}$$

$$R_x^i = \frac{(n_v^i - 1) \times w_z + (x_m - x_c) \times \beta}{W} \tag{14}$$

$$R_y^i = \frac{(n_h^i - 1) \times h_z + (y_m - y_c) \times \beta}{H} \tag{15}$$

In addition, the new image generated may have some blank regions. We fill them with 0 or 255 values, shown as the square labelled 9 on the right side of Figure 2. Then, CZS preprocessing of the image is finished.
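To make the whole CZS pipeline concrete, the sketch below splices the cutting boxes into a new image per Formulas (7)-(11); it assumes NumPy and OpenCV, fills blanks with 0 (255 is equally valid per the text), and the helper names are ours:

```python
import math

import cv2
import numpy as np

def splice(image, boxes, W=416, H=416):
    """Zoom sorted cutting boxes and splice them into one W x H image.

    boxes: list of (xc, yc, wc, hc) cutting boxes, pre-sorted by (xc, yc).
    """
    n_h = n_v = math.ceil(math.sqrt(len(boxes)))
    wz = hz = W // n_h                            # tile size, Formula (5)
    canvas = np.zeros((H, W, 3), dtype=np.uint8)  # blank regions filled with 0
    for rank, (xc, yc, wc, hc) in enumerate(boxes, start=1):
        # Formulas (7)-(8): 1-based row/column indices of this tile
        row = rank // n_v + 1 if rank % n_v else rank // n_v
        col = rank % n_v if rank % n_v else n_v
        tile = image[int(yc):int(yc + hc), int(xc):int(xc + wc)]
        tile = cv2.resize(tile, (wz, hz))         # zoom by beta = wz / wc
        x0, y0 = (col - 1) * wz, (row - 1) * hz   # Formulas (10)-(11)
        canvas[y0:y0 + hz, x0:x0 + wz] = tile
    return canvas
```

The markup ratios of Formulas (12)-(15) follow from the same `row`, `col` and scaling-factor values, so label remapping can reuse this loop.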

### *2.2. Tailoring the Backbone Network*

According to the size of the inspection target, the YOLOv3 backbone network can be tailored to detect the defect regions more efficiently.

As shown in Figure 3, the backbone network of classical YOLOv3 includes 53 layers, hence the name Darknet-53. Among its components, *Convolutional* is a convolution layer, *Residual* is the skip-connection layer of a residual network, *Avgpool* is an average pooling layer, and *Connected* is a fully connected layer. The labels ×1, ×2, ×8, ×8 and ×4 represent repeated execution 1, 2, 8, 8 and 4 times, respectively. Note that the five repeated steps correspond to five down-sampling operations, and the outputs of the last three steps (×8, ×8 and ×4) correspond to the classification prediction feature maps (YOLO layers) at three resolution scales: 52 × 52, 26 × 26 and 13 × 13. The final feature maps of classical YOLOv3 thus have three sizes; the 52 × 52 resolution has better support for detecting tiny objects, and the 13 × 13 resolution is more suitable for identifying larger objects.

Classical YOLOv3 is used for general object detection, including both large and small objects, and the distance between the camera and the photographed object also affects the size of the object to be recognized. Vision-based defect inspection is different: the camera angle of industrial cameras is relatively fixed, and the shape and size of the defects to be inspected are also relatively fixed. Therefore, only the corresponding resolution networks need to be retained, instead of all three scales (52 × 52, 26 × 26 and 13 × 13). For example, at a production site for automobile rubber and plastic parts, visible defects commonly have a moderate size and are clearly distinguishable, so the inspection network for such defects does not require a very high resolution. On the other hand, in silicon chip solder joint quality inspection, defect inspection of welding points needs high precision; the solder joint layout on the chip is very fine and tiny, so a very high-resolution network is needed for identification. In short, the network structure can be optimized according to the targeted inspection task. Tailoring the YOLOv3 backbone network can be based on the following formula.

$$\mathrm{if}\left(\mathrm{every}_{i \in N}\left(w_i > \frac{W}{26} \cap h_i > \frac{H}{26}\right)\right),\ \mathrm{tailor}(yolo_{52 \times 52}) \tag{16}$$



**Figure 3.** YOLOv3 backbone network and tailorable parts.

The *if (condition)* statement is the basic conditional control structure: the *tailor* function is executed only when the given condition is true. The *tailor* function deletes the part of the input network used for recognition at a certain scale. The $yolo_{52 \times 52}$ part of YOLOv3 inspects the smallest targets (its highest-resolution feature map), while $yolo_{13 \times 13}$ inspects the largest targets. The *every* function iterates over every inspected target, and $N$ represents the full set of inspected targets on an image. The $w_i$ and $h_i$ represent the width and height of a target, and $W$ and $H$ represent the image's width and height.

In the case study of Section 4, the 52 × 52 tiny resolution-scale network is shrunk. That is, the third down-sampling stage of the backbone network is partially deleted. For classical YOLOv3, the third down-sampling stage consists of eight rounds of repetition; in our algorithm, seven rounds of repetition are tailored off, deleting 14 convolutional layers in all. Thus, the backbone network is condensed from 53 layers to 39 layers, and our algorithm's backbone network turns into Darknet-39.
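The tailoring decision of Formula (16) reduces to a one-line predicate; a purely illustrative sketch with hypothetical target sizes:

```python
def should_tailor_52x52(targets, W=416, H=416):
    """True when the 52x52 (tiny-object) branch can be removed.

    targets: (w, h) sizes of every inspected target on an image.
    Formula (16): all targets must exceed W/26 x H/26.
    """
    return all(w > W / 26 and h > H / 26 for w, h in targets)

# Hypothetical moderate-size defects on a 416 x 416 preprocessed image
if should_tailor_52x52([(40, 32), (56, 48)]):
    print("tailor yolo_52x52: drop 7 of 8 residual repetitions")
```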

### *2.3. Data Augmentation*

In order to enhance attention and improve the recognition accuracy of the deep learning network, it is also necessary to implement data augmentation to expand the dataset. The whole process of data augmentation is shown in Figure 4. There are mainly two strategies.

**Figure 4.** The process of data augmentation.

Strategy 1: add random noise to the defect markup regions of the original image, which changes them from normal to noisy or faulty. As shown in Figure 5, a rectangular cover is used to simulate noise or missing faults on the surface of a markup region; its position, size and color can be set randomly. From left to right, Figure 5a is normal; Figure 5b uses a rectangle to cover 1/3 of a markup region, which is equivalent to adding some noise, so it should be ensured that the trained network can still recognize such a markup region; Figure 5c,d completely cover one or two markup regions with rectangles to simulate missing faults.

**Figure 5.** Simulation of noisy or missing faults. (**a**) The normal and original picture marked up with 5 inspection regions of 2 types; a green box represents a circular solder joint and a red box represents a strip solder joint. (**b**) One inspection region was covered by a blue rectangle for 1/3. (**c**) One region was completely covered by a blue rectangle. (**d**) Two neighboring regions were completely covered by a grey rectangle.
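Strategy 1 can be implemented with a random solid rectangle over a markup region; a minimal sketch assuming NumPy images (the random ranges and the per-side `fraction` parameter are our choices):

```python
import random
import numpy as np

def cover_with_rectangle(image, region, fraction=1.0):
    """Cover part of a markup region with a solid rectangle (Strategy 1).

    region: (x, y, w, h) markup box; fraction: covered share of each side,
    e.g. 1/3 for added noise or 1.0 for a simulated missing fault.
    """
    x, y, w, h = region
    cw, ch = max(1, int(w * fraction)), max(1, int(h * fraction))
    cx = random.randint(x, x + w - cw)  # random position inside the region
    cy = random.randint(y, y + h - ch)
    color = [random.randint(0, 255) for _ in range(3)]  # random cover color
    out = image.copy()
    out[cy:cy + ch, cx:cx + cw] = color
    return out
```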


Strategy 2: rotate the image by 90°, 180° and 270° and flip each result horizontally. Rotation and flipping together expand one image into 8 images, as shown in Figure 6. In the original picture, there are a total of 8 regions to be inspected, which are divided into two types, represented by red boxes and green boxes, respectively.

**Figure 6.** Rotating and horizontally flipping the original image.
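Strategy 2 maps directly onto NumPy primitives; a brief sketch of the 8-variant expansion (in practice the markup coordinates must be transformed with the same rotation and flip, which is omitted here):

```python
import numpy as np

def eight_variants(image):
    """Strategy 2: 4 rotations x optional horizontal flip = 8 images."""
    variants = []
    for k in range(4):                       # 0, 90, 180, 270 degrees
        rotated = np.rot90(image, k)
        variants.append(rotated)
        variants.append(np.fliplr(rotated))  # horizontally flipped copy
    return variants
```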

The use of a dataset enhancement strategy not only improves the quality of training but also effectively reduces manpower consumption. In this study, there are only 630 original photos, whose defect markup regions can only be added manually. A Python script can then automatically complete image preprocessing and dataset enhancement, so that the final dataset size for training and testing reaches 40,320 photos.

### **3. FPGA Deployment**

### *3.1. Overall Framework*

The vision-based defect inspection algorithm is deployed on an All Programmable System on Chip (APSoC), the Xilinx PYNQ-Z2. With its help, we can use low-power-consumption, cutting-edge, customized hardware to replace high-energy-consumption, large-footprint, non-specific-purpose and high-cost deep learning workstations. As shown in Figure 7, the whole algorithm deployed on FPGA includes two parts: image CZS preprocessing, and hardware acceleration for deep learning. Through the CZS operations, only the regions to be inspected in the pictures are retained; refer to Figure 2 for a specific explanation.

**Figure 7.** Overall framework of the proposed defect inspection algorithm.

At the beginning, an industrial camera captures photos and transfers them to the FPGA. Then, programs running on the operating system of the FPGA (PS, Processing System) automatically perform CZS preprocessing on those photos according to the metadata stored in the database. Preprocessed images are then transmitted to the DPU implemented in PL, which is specially customized hardware for mapping and running the Darknet-39 YOLOv3 model. With it, the process of defect inspection is accelerated and completed.

A Xilinx PYNQ-Z2 FPGA board is equipped with a ZYNQ-7020 APSoC chipset (Xilinx AMD Inc., San Jose, CA, USA), which has both a "hard core" and a "soft core". As shown in Figure 8, the hard core and its functions are the grey and green boxes, and the soft core and its functions are the orange and yellow boxes. The hard core is a 650 MHz ARM Cortex-A9 dual-core processor (Arm Inc., Cambridge, UK), running an embedded Ubuntu system (Canonical Ltd., London, UK). This processor supports Python programming for simple processing (preprocessing images, running the database, etc.) and C++ programming for calling the DPU. The soft core is the PL, which is employed by the B1152 DPU architecture in accordance with the Xilinx DNNDK 3.0 (Xilinx AMD Inc., San Jose, CA, USA) framework.

Deep learning algorithms can be transformed to a format that the DPU can read and execute without fully utilizing hardware resources. The efficiency of processing 416 × 416 images with YOLOv3 is approximately 3.5 FPS, and the rated power is approximately 10 W (the power information comes from the technical documents of the Xilinx PYNQ-Z2), a much more power-efficient performance than that of a common Central Processing Unit (CPU) or Graphics Processing Unit (GPU) chip. This fully meets our predesigned efficiency target.

**Figure 8.** Xilinx PYNQ-Z2 architecture.

### *3.2. Deployment*

Deployment involves two parts: host-side deployment and FPGA-side deployment. On the FPGA side, we adopt the system version of the intelligent car HydraMini, the essence of which is a customized DPU platform for PYNQ-Z2. The host side is the computer side; it needs the deep learning development platform installed, as well as DNNDK 3.0. DNNDK is necessary for converting a standard deep learning model into a model deployable on PYNQ-Z2, and its core functions include pruning, quantization and compiling. As shown in Figure 9, the blue flowchart represents the pruning operation, the orange flowchart represents the quantization operation, and the green flowchart represents the compiler.

**Figure 9.** Pruning, quantizing, and compiling processes.

Pruning obtains a new condensed network from the network pretrained by PyTorch (Meta Inc., Silicon Valley, CA, USA) or TensorFlow (Google Inc., Silicon Valley, CA, USA). It consists of automatically deleting some redundant branches that do not affect the network output, and replacing variable values on network nodes with constant values of the current session.

Quantization converts floating-point values into fixed-point values. The first benefit of quantization is improved processing performance, since short bytes of data are used instead of long bytes; in general, 32-bit floating-point values are replaced with 8-bit integers, so the entire network is compressed. The second benefit lies in the fact that the PYNQ-Z2's Digital Signal Processing (DSP) units mainly support processing fixed-point values, for which their operations are specially optimized.
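The effect of 8-bit quantization can be illustrated with a toy symmetric quantizer; DNNDK's actual calibration flow is more sophisticated, so this is only a conceptual sketch:

```python
import numpy as np

def quantize_int8(weights):
    """Toy symmetric quantization: float32 weights -> int8 values + scale."""
    scale = np.abs(weights).max() / 127.0  # map largest magnitude to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate float32 values to gauge quantization error."""
    return q.astype(np.float32) * scale

w = np.random.randn(64).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```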

Compiling transforms the deep learning algorithm into binary instruction files that can be recognized by the DPU. The compiler consists of three parts: an interpreter, an optimizer and a code generator. The interpreter parses the quantized model and converts it into an Intermediate Representation (IR). The optimizer is responsible for optimizing the IR. The code generator turns the optimized IR into DPU-recognized instructions.

After the above three steps, a deep learning model file that is recognized by the PYNQ-Z2 DPU is generated. The file is deployed to PYNQ-Z2 and loaded in the kernel process of the DPU. We can have the PYNQ-Z2 reload the DPU Intellectual Property (IP) and then use an executable program written with the DPU C++ library to invoke the deep learning model, including using the API to read problem initialization parameters and analyze the output results. Deployment is then finished. All the above steps are shown in Figure 10. CZS and after-processing operations are undertaken by the ARM processor (Arm Inc., Cambridge, UK); refer to Figure 2 for a specific explanation. Defect inspection operations are undertaken by the Xilinx ZYNQ-7020 (Xilinx AMD Inc., San Jose, CA, USA).

**Figure 10.** Process flow of PYNQ-Z2.

Access to the database and image preprocessing require a low amount of data processing resources, which are handed over to the ARM CPU. In contrast, deep learning needs a lot of computing power, and the FPGA is responsible for this processing work to achieve hardware acceleration. As an SoC board, PYNQ-Z2 fully realizes the seamless connection between the operating system and programmable hardware: Python and database software such as SQLite run on the ARM CPU, while the DPU C++ code directly calls ZYNQ-7020 FPGA hardware resources, which optimizes the load balance of the whole process.

### **4. Experimental Results**

### *4.1. Experiment Design*

The case study obtains experimental data from an automobile rubber and plastic parts manufacturer, which is a first-class parts supplier for several well-known automobile brands in China, such as SAIC GM Wuling and FAW Volkswagen. Therefore, success in implementing the application system will have significance for the whole auto parts industry. With the help of industrial cameras with 10 fixed camera angles, we collected 630 photos, compared with the original 63 sample photos. The photos have the same 1600 × 1200 pixel resolution and 8-bit depth. In addition, there are 16 types of defect markups. The ratio of normal to defective samples in the 630 original pictures is approximately 9:1. Several sample photos are shown in Figure 11; the red boxes on the pictures identify the areas to be detected. The hardware of the host side mainly consists of an NVIDIA RTX3090 24 GB GPU (Nvidia Co., Silicon Valley, CA, USA), an AMD R9 3900X CPU (AMD Inc., Silicon Valley, CA, USA), 64 GB of DDR4 memory (Hynix Inc., Seoul, Korea) and a 4 TB mechanical hard disk (Seagate Technology, Scotts Valley, CA, USA). The software installed on the host side includes Ubuntu, the Compute Unified Device Architecture (CUDA) (Nvidia Co., Silicon Valley, CA, USA), Python (The Python Software Foundation, Wilmington, DE, USA), PyTorch, TensorFlow, Docker (Docker Inc., San Francisco, CA, USA) and DNNDK 3.0 (Xilinx AMD Inc., San Jose, CA, USA).

**Figure 11.** Several sample photos.

### *4.2. Host Side*


On the host workstation computer side, our algorithm, the attention-based YOLOv3, is trained. The maximum number of training epochs was initially set to 300. The training set was composed of 36,139 photos after data augmentation, at a total of 11 GB. The training process took a long time, as one epoch took nearly 17 min; on the other hand, precision converged very quickly. When we finished the work after 300 epochs of training, accuracy exceeded 99%.

With the pretrained model, we tested the time efficiency on the host side. The time spent is mainly divided into two parts: the loading time of the Python libraries is approximately 1 s, and the inspection time is approximately 0.01 s, as shown in Table 1. Loading the Python libraries is very time consuming, so the process can be made into a daemon that always keeps the libraries loaded, scanning for changes in images and performing real-time inspection.
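One way to realize the daemon idea is a polling loop that keeps the libraries and model resident; a hypothetical sketch in which the inbox directory and the `model.inspect` call are placeholders:

```python
import time
from pathlib import Path

def watch_and_inspect(model, inbox="/data/inbox", interval=0.1):
    """Keep the model loaded; inspect each new image as it appears."""
    seen = set()
    while True:
        for path in Path(inbox).glob("*.jpg"):
            if path not in seen:
                seen.add(path)
                result = model.inspect(str(path))  # placeholder inference
                print(path.name, "->", result)
        time.sleep(interval)  # poll instead of reloading libraries per image
```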

**Table 1.** Comparison between our algorithm and YOLOv3.


As shown in Table 1, our algorithm's inspection time decreased by 0.004 s, and accuracy improved by 0.2%. In sum, on the majority of indicators, our algorithm improves on classical YOLOv3, indicating that it achieves better performance by tailoring the backbone network.


### *4.3. FPGA Side*

The efficiency and accuracy of the algorithm on the host side fully meet the industrial requirements, so the trained neural network is moved to the FPGA side. The DNNDK environment is deployed with Docker to prune, quantize and compile the trained network; the processed neural network files are then deployed to the appropriate location on PYNQ-Z2, the DPU kernel is reloaded and the executable program is compiled, which completes the migration and deployment to PYNQ-Z2. Because onboard Python must call the low-performance ARM CPU for database queries and image preprocessing, the overall timeliness of the system is lower than that of the host side, even though the performance of the programmable hardware circuit is very good. As shown in Table 2, the performance on PYNQ-Z2 is reduced to some extent, but it remains competent for missing-fault inspection; the experimental results on PYNQ-Z2 also showed satisfactory performance, similar to that of the host side. A comparison of the processing times of the host workstation computer and PYNQ-Z2 is shown in Table 2.

**Table 2.** A comparison of processing times of the workstation computer and PYNQ-Z2.


Although the processing speed on PYNQ-Z2 is slower than that on the host side, the cutting-edge SoC equipment shows reasonable efficiency with much lower power consumption than a workstation computer. Additionally, the comprehensive performance of our algorithm meets the predesigned targets of Section 1, as shown in Table 3.

**Table 3.** A comparison of processing times of different algorithms on PYNQ-Z2.


### *4.4. Inference*

In this study, the mean average precision (mAP) and the intersection over union (IoU) were used as the main accuracy evaluation indices. In this case, the average precision of each of the 16 detection categories is calculated, and the mean of these 16 average precisions is the mAP; the closer the mAP is to 1, the better. Before training, the manually marked area to be detected is called the ground-truth bounding box, and the area detected by the model is called the predicted bounding box. The IoU is the intersection of these two regions divided by their union. The closer the IoU is to 1, the better, indicating that the model's detection area is consistent with the manually marked area.
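For reference, the IoU of a ground-truth box and a predicted box in (x, y, w, h) form can be computed as follows; this is the standard definition rather than code from the paper:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 10, 10)))  # 25 / 175 ≈ 0.143
```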
