#### 2.2.1. CBAM-CSPDarknet53

Small objects occupying few pixels in an image are prone to information loss during construction waste sorting. To address this, the CBAM attention mechanism was added to the YOLOv5 backbone, and the resulting backbone is named CBAM-CSPDarknet53. The CBAM module consists of a channel attention module and a spatial attention module [37]. The channel attention module focuses on the core information in the dataset images: it squeezes the spatial dimensions while keeping the channel dimension unchanged. The spatial attention module focuses on the positional information of the object: it squeezes the channel dimension without modifying the spatial dimensions. The structure of the CBAM attention mechanism module is shown in Figure 3.

**Figure 3.** Structure of CBAM attention mechanism.

The structure of the channel attention module is described in Figure 4. The feature map $F$ ($F \in \mathbb{R}^{C \times H \times W}$) is processed by average pooling and maximum pooling, which reduce its size from $C \times H \times W$ to $C \times 1 \times 1$. The two pooled features are then sent to a Multi-Layer Perceptron (MLP): the first layer of the MLP has $C/r$ neurons, where $r$ is the reduction ratio, and the second layer has $C$ neurons. The summed MLP outputs are passed through the Sigmoid function to obtain the weight coefficient $M_c$, as shown in Equation (1).

$$M_c(F) = \sigma(W_1(W_0(F_{avg}^{C})) + W_1(W_0(F_{max}^{C}))) \tag{1}$$

where $\sigma$ is the Sigmoid function; $avg$ denotes global average pooling and $max$ denotes maximum pooling; $W_0 \in \mathbb{R}^{C/r \times C}$ and $W_1 \in \mathbb{R}^{C \times C/r}$ are the MLP weights; $F_{avg}^{C}$ is the globally average-pooled feature of size $1 \times 1 \times C$; and $F_{max}^{C}$ is the maximum-pooled feature of size $1 \times 1 \times C$.

Finally, the weight coefficient $M_c$ is multiplied by the feature map $F$ ($F \in \mathbb{R}^{C \times H \times W}$) to obtain the new feature map.
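The channel attention computation in Equation (1) can be sketched in PyTorch as follows (a minimal illustration, not the authors' implementation; the class name and the default reduction ratio `r=16` are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChannelAttention(nn.Module):
    """Channel attention per Eq. (1): avg- and max-pool the H x W plane,
    feed both through a shared MLP (C -> C/r -> C), sum, then Sigmoid."""

    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP implemented with 1x1 convolutions (W0 then W1 in Eq. (1)).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1, bias=False),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))  # from F_avg^C (C x 1 x 1)
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))   # from F_max^C (C x 1 x 1)
        mc = self.sigmoid(avg + mx)                  # weight coefficient Mc
        return x * mc                                # reweight the input feature map
```

Broadcasting handles the final multiplication: `mc` has shape `C x 1 x 1` and scales every spatial location of the corresponding channel.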

**Figure 4.** Structure of Channel Attention Module.

The structure of the spatial attention module is shown in Figure 5. The feature map obtained in the previous step is again processed by maximum pooling and average pooling along the channel dimension, producing two maps of size $1 \times H \times W$. The two maps are stacked by a concatenation operation, and the weight coefficient $M_s$ is obtained after a convolution and the Sigmoid function, as shown in Equation (2).

$$M_s(F) = \sigma(f^{7 \times 7}([F_{avg}^{S}; F_{max}^{S}])) \tag{2}$$

where $\sigma$ is the Sigmoid function; $avg$ denotes average pooling and $max$ denotes maximum pooling; $f^{7 \times 7}$ is a $7 \times 7$ convolution; $F_{avg}^{S}$ is the average-pooled feature of size $1 \times H \times W$; and $F_{max}^{S}$ is the maximum-pooled feature of size $1 \times H \times W$.

Finally, the calculated weight coefficient $M_s$ is multiplied by the feature map to obtain the refined feature map.
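Equation (2) can be sketched similarly (again an illustrative implementation; the class name is assumed, and the $7 \times 7$ kernel follows the text):

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention per Eq. (2): channel-wise avg and max give two
    1 x H x W maps, which are concatenated, convolved 7x7, and squashed."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Two stacked maps in, one attention map out; padding keeps H x W.
        self.conv = nn.Conv2d(2, 1, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)   # F_avg^S, shape 1 x H x W
        mx, _ = x.max(dim=1, keepdim=True)  # F_max^S, shape 1 x H x W
        ms = self.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))  # Ms(F)
        return x * ms                       # reweight every channel spatially
```

Applying `ChannelAttention` followed by `SpatialAttention` reproduces the sequential CBAM arrangement shown in Figure 3.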

**Figure 5.** Structure of Spatial Attention Module.

#### 2.2.2. SimSPPF (Simplified SPPF)

The SimSPPF module, which is based on cascaded maximum pooling layers of the same kernel size, is proposed to replace the SPP module in the original YOLOv5s model. The structure of the SimSPPF module is shown in Figure 6.

**Figure 6.** Structure of SimSPPF Module.

The SimSPPF module applies $5 \times 5$ maximum pooling to the input feature maps of the construction waste dataset. Because the three cascaded maximum pooling stages produce separate outputs, the pooling-layer outputs are connected along the channel dimension. The calculations are shown in Equations (3)–(7).

$$F_1 = \mathrm{CBR}(F) \tag{3}$$

$$F_2 = \mathrm{MaxPooling}(F_1) \tag{4}$$

$$F_3 = \mathrm{MaxPooling}(F_2) \tag{5}$$

$$F_4 = \mathrm{MaxPooling}(F_3) \tag{6}$$

$$F_5 = \mathrm{CBR}([F_1; F_2; F_3; F_4]) \tag{7}$$
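Equations (3)–(7) can be sketched in PyTorch as follows (an illustrative implementation under the common SPPF convention that the first CBR halves the channel count; the class name and channel choices are assumptions, not from the paper):

```python
import torch
import torch.nn as nn


def cbr(in_ch: int, out_ch: int) -> nn.Sequential:
    """CBR block: 1x1 Convolution + BatchNorm + ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )


class SimSPPF(nn.Module):
    """SimSPPF per Eqs. (3)-(7): one CBR, three cascaded 5x5 max-poolings,
    channel concatenation, and a fusing CBR."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 5):
        super().__init__()
        hid = in_ch // 2                      # assumed hidden width
        self.cbr1 = cbr(in_ch, hid)
        # stride 1 + padding k//2 keeps H x W constant across poolings
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.cbr2 = cbr(hid * 4, out_ch)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.cbr1(x)                     # Eq. (3)
        f2 = self.pool(f1)                    # Eq. (4)
        f3 = self.pool(f2)                    # Eq. (5)
        f4 = self.pool(f3)                    # Eq. (6)
        return self.cbr2(torch.cat([f1, f2, f3, f4], dim=1))  # Eq. (7)
```

Reusing one $5 \times 5$ pooling layer three times in series covers the same receptive fields as the parallel $5/9/13$ poolings of SPP, which is why SimSPPF's forward pass is cheaper.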

The SimSPPF module can avoid the loss of local features in the construction waste dataset images, effectively reduce redundant parameter information, and retain the core texture features of the images. The forward propagation of the SimSPPF module is faster than that of the SPP module. Meanwhile, after the SimSPPF module is embedded, the ability of the YOLOv5 model to extract image features is greatly improved.
