2.4.2. Trans-E Block

The Transformer was first used in natural language machine translation, and its most significant feature is the self-attention mechanism. The main working modules in the Transformer structure are the encoder and the decoder. During machine translation, the encoder models the input sequence and extracts the output value of the last time step as a representation of the input sequence. The decoder then takes this representation as its input and generates the translation with maximum probability. This paper draws on the encoder function in the Transformer structure, proposes a Transformer-Encoder (Trans-E) block, and applies it to image impurity detection. The structure of the Trans-E block is shown in Figure 5. The Trans-E block consists of two sub-layers: the multi-head attention layer and the fully connected layer. The multi-head attention layer performs multiple linear mappings into different sub-region representation spaces through multiple heads, which can be computed in parallel; it can therefore obtain more comprehensive information from different sub-spaces at different locations. The main function of the fully connected layer is to map the feature space computed by the previous layer to the sample label space. A residual structure connects the two sub-layers. This article replaces the bottleneck blocks and some Conv blocks of CSPDarknet53 in the original YOLOv5 with Trans-E blocks. Compared with CSP bottleneck blocks, Trans-E blocks have more advantages in capturing global information.
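A minimal PyTorch sketch of a Trans-E style block as described above: a multi-head self-attention sub-layer followed by a fully connected (MLP) sub-layer, with residual connections joining the two. The layer sizes, head count, and the flattening of the feature map into a token sequence are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class TransEBlock(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4, mlp_ratio: int = 4):
        super().__init__()
        # Multi-head attention sub-layer (assumed head count; channels must divide evenly)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Fully connected sub-layer mapping features to the output space
        self.fc = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map -> flatten spatial positions into a token sequence
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)           # (B, H*W, C)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = tokens + attn_out                       # residual around attention
        tokens = tokens + self.fc(tokens)                # residual around fully connected layer
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```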

**Figure 5.** The architecture of the Trans-E block.

2.4.3. CBAM Attention Mechanism

Since the walnut kernel image contains much information that is useless for impurity detection, such as the walnut kernel itself, we need to suppress this irrelevant information, increase the weight of effective image features, and reduce the weight of invalid ones so that the trained network model produces the best results. This paper therefore introduces the Convolutional Block Attention Module (CBAM) into YOLOv5. The working principle of this module is as follows: global max pooling and global average pooling over the width and height are applied to the input feature map *F* (*H* × *W* × *C*), producing two 1 × 1 × *C* feature maps. The two feature maps are then fed into a shared two-layer neural network (*MLP*). The number of neurons in the first layer is *C*/r (r is the reduction ratio) with a ReLU activation function, and the number of neurons in the second layer is *C*. An element-wise sum is then performed on the two output features, and the final channel attention feature *Mc* is generated after a sigmoid activation. Finally, element-wise multiplication of *Mc* and the input feature map *F* produces the input features required by the spatial attention module. The specific calculation is as follows:

$$M_C(F) = \sigma(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F)))\tag{1}$$

$$M_C(F) = \sigma\left(W_1\left(W_0\left(F_{\mathrm{avg}}^{c}\right)\right) + W_1\left(W_0\left(F_{\mathrm{max}}^{c}\right)\right)\right)\tag{2}$$
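A minimal sketch of the channel attention computation in Equations (1) and (2), assuming a PyTorch implementation: global max and average pooling over *H* × *W*, a shared two-layer MLP with reduction ratio r and ReLU, an element-wise sum, a sigmoid, and multiplication with the input feature map *F*.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels: int, r: int = 16):
        super().__init__()
        # Shared MLP: first layer has C/r neurons (ReLU), second layer has C neurons
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // r),
            nn.ReLU(inplace=True),
            nn.Linear(channels // r, channels),
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (B, C, H, W); global average / max pooling give two (B, C) descriptors
        avg = self.mlp(f.mean(dim=(2, 3)))
        mx = self.mlp(f.amax(dim=(2, 3)))
        mc = torch.sigmoid(avg + mx).view(f.size(0), -1, 1, 1)   # channel attention Mc
        return f * mc   # element-wise multiplication with the input feature map F
```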

The output of the channel attention module is taken as the input of the spatial attention module, where max pooling and average pooling are applied along the channel axis; this compresses only the channel dimension while keeping the spatial dimension, so the module focuses on the target's location information. The two pooled maps are then stacked through a Concat operation, and a convolution followed by a sigmoid activation generates the spatial attention map. The mechanism is shown in Figure 6.
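A minimal sketch of the spatial attention step, assuming the standard CBAM design: channel-wise max and average pooling, concatenation, a 7 × 7 convolution, and a sigmoid, producing a (B, 1, H, W) map that re-weights the channel-refined features. The kernel size is an assumption.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # 2 input channels (avg-pooled and max-pooled maps) -> 1 attention map
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: channel-refined features (B, C, H, W)
        avg = f.mean(dim=1, keepdim=True)              # (B, 1, H, W)
        mx, _ = f.max(dim=1, keepdim=True)             # (B, 1, H, W)
        ms = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return f * ms                                  # re-weight by location
```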

**Figure 6.** CBAM attention mechanism.

In this paper, the CBAM module is added after the C3 module and the Trans-E module in the neck so that the image features of walnut shells and other foreign objects are weighted and combined. At the cost of a small amount of extra computation, the network pays more attention to the key information of foreign objects such as walnut shells, which helps to train a better network.
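A hedged sketch of how a CBAM module (channel attention followed by spatial attention, as sketched above) could be appended after a neck block such as C3 or Trans-E; the wrapper name and wiring are illustrative, not the paper's exact YOLOv5 configuration.

```python
import torch.nn as nn

class BlockWithCBAM(nn.Module):
    def __init__(self, block: nn.Module, channels: int):
        super().__init__()
        self.block = block                        # e.g. a C3 or Trans-E block
        self.ca = ChannelAttention(channels)      # from the channel attention sketch above
        self.sa = SpatialAttention()              # from the spatial attention sketch above

    def forward(self, x):
        x = self.block(x)
        return self.sa(self.ca(x))                # re-weight features before the next stage
```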
