This section is divided into three parts: an introduction to the YOLOv7 algorithm, the fusion of deformable convolution into the backbone network together with the introduction of an attention mechanism, and the optimization of the loss function with WIoU, which is based on a dynamic non-monotonic focusing mechanism.
2.5.2. Model Improvements
- (1)
Introducing deformable convolution
YOLOv7 uses deep convolutional networks for target feature extraction. To learn richer features, the network depth must be increased, which raises both the computational cost and the risk of overfitting. Moreover, convolutional networks have an inherent drawback when modeling polymorphic targets, because they sample only fixed positions of the input feature map [24]. In long-staple cotton boll detection, all feature points in the same layer of the feature map share the same receptive field, while different locations may correspond to bolls of different scales, so the network cannot adaptively learn the scale or receptive-field size needed to localize the target.
Deformable convolution can improve a model's ability to model deformed targets. It uses a parallel convolution layer to learn offsets that shift the sampling points of the convolution kernel on the input feature map, so that the kernel focuses on regions or targets of interest [25].
Figure 6 shows the comparison between regular convolutional sample points and deformable convolutional sample points.
Figure 7 shows the computational flow of deformable convolution: the offsets are computed from the input feature map by a convolution layer and used as the locations of the sampling points. The number of channels of the output feature map is 3N (where N is the number of sampling points of the convolution kernel): 2N channels are the predicted offsets in the x- and y-directions, and the remaining N channels are the predicted weights of the sampling points, since different sampling points contribute differently to the features. In this way, deformable convolution can flexibly adapt to the morphological changes of targets at different locations and scales, thus improving the detection accuracy of the model.
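As a quick check of this channel arithmetic, the following sketch (hypothetical helper name, not from the paper's code) computes the channel layout of the offset-prediction branch for a given kernel size:

```python
def offset_branch_channels(kernel_h: int, kernel_w: int) -> dict:
    """Channel layout of the offset branch in (modulated) deformable convolution.

    For a kernel with N = kernel_h * kernel_w sampling points, the parallel
    convolution layer predicts 2N offset channels (an x- and a y-shift per
    point) plus N modulation-weight channels, i.e. 3N channels in total.
    """
    n = kernel_h * kernel_w
    return {"offsets": 2 * n, "weights": n, "total": 3 * n}

# A 3 x 3 kernel has N = 9 sampling points -> 18 offset + 9 weight channels.
print(offset_branch_channels(3, 3))  # {'offsets': 18, 'weights': 9, 'total': 27}
```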
The conventional convolution operation is divided into two main steps: (1) sampling on the input feature map using a regular grid R; (2) weighting the sampled values with a convolution kernel w. R defines the size and dilation of the receptive field; Equation (12) defines a convolution kernel of size 3 × 3 with a dilation rate of 1:

$R = \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$ (12)
For each position $p_0$ on the output feature map, the output value $y(p_0)$ is calculated as follows:

$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n)$
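As a toy illustration of this sampling-and-weighting rule, a 3 × 3 convolution evaluated at a single output position $p_0$ can be written directly as a sum over the grid R (NumPy sketch with a hypothetical helper name, no padding or stride):

```python
import numpy as np

def conv_at(x, w, p0):
    """Evaluate y(p0) = sum_{pn in R} w(pn) * x(p0 + pn) for a 3x3 grid R."""
    grid_r = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    y0, x0 = p0
    return sum(w[dy + 1, dx + 1] * x[y0 + dy, x0 + dx] for dy, dx in grid_r)

feat = np.arange(25, dtype=float).reshape(5, 5)   # toy input feature map
kernel = np.ones((3, 3)) / 9.0                    # averaging kernel
print(conv_at(feat, kernel, (2, 2)))              # mean of the 3x3 patch -> 12.0
```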
In the deformable convolution operation, the regular grid R is expanded by adding an offset $\Delta p_n$ to each sampling point while predicting a modulation weight $\Delta m_n$ for it. The value of $y(p_0)$ for the same position $p_0$ then becomes:

$y(p_0) = \sum_{p_n \in R} w(p_n) \cdot x(p_0 + p_n + \Delta p_n) \cdot \Delta m_n$
Since the offset $\Delta p_n$ is usually fractional, the value of $x$ at the shifted position $p = p_0 + p_n + \Delta p_n$ must be calculated by bilinear interpolation:

$x(p) = \sum_q G(q, p) \cdot x(q)$

where $q$ enumerates all integral spatial positions of the feature map and $G(\cdot, \cdot)$ is the bilinear interpolation kernel.
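A minimal NumPy sketch of this interpolation (hypothetical helper name; the bilinear kernel is nonzero only at the four integer neighbors of a fractional position) might look like:

```python
import numpy as np

def bilinear(x, p):
    """x(p) = sum_q G(q, p) * x(q): bilinear kernel over the 4 integer neighbors."""
    py, px = p
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    dy, dx = py - y0, px - x0
    return ((1 - dy) * (1 - dx) * x[y0, x0] + (1 - dy) * dx * x[y0, x0 + 1]
            + dy * (1 - dx) * x[y0 + 1, x0] + dy * dx * x[y0 + 1, x0 + 1])

feat = np.arange(16, dtype=float).reshape(4, 4)
# A fractional sampling position such as p0 + pn + delta_pn = (1.5, 1.5)
# falls between grid points and is interpolated from its 4 neighbors.
print(bilinear(feat, (1.5, 1.5)))  # average of 5, 6, 9, 10 -> 7.5
```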
Figure 8 shows the comparison between the feature-point receptive fields of regular and deformable convolution. The two feature points represent targets of different scales and shapes. After two layers of convolution operations, the feature points of the conventional convolution have a fixed-size receptive field, while the sampling points of the deformable convolution are learned adaptively, conforming better to the shape and size of the object itself and favoring feature extraction. Therefore, deformable convolution can better adapt to the morphological changes of long-staple cotton and improve the model's ability to represent long-staple cotton with variable morphology, thus improving the accuracy and stability of long-staple cotton boll detection.
In the YOLOv7 backbone network, the feature extraction of the target is mainly performed by the ELAN multi-branch stacking module. As shown in
Figure 9, the 3 × 3 convolutions in the ELAN structure perform feature extraction, and the 1 × 1 convolutions perform feature compression. For deformable convolution, using a 1 × 1 deformable convolution to compute the offsets of the sampling points may lead to sampling instability. Therefore, in this paper, part of the 3 × 3 convolutions in the ELAN module are replaced by deformable convolutions to improve the extraction of cotton boll features of different morphologies and scales at a small additional computational cost. Introducing deformable convolution makes the ELAN module more adaptable to morphological changes of the target and improves the model's ability to represent deformed targets. Meanwhile, since deformable convolution adaptively learns the positions of its sampling points, it better accommodates targets of different scales and shapes, thus improving the detection accuracy and robustness of the model.
- (2)
Introducing the SENet attention mechanism into YOLOv7
The attention mechanism is a common data processing method that is widely used in machine learning tasks across many fields [26]. The core idea of the attention mechanism in computer vision is to find the correlations within the original data and then highlight the important features; variants include channel attention, pixel attention, and multi-order attention. SENet is a typical implementation of the channel attention mechanism, and the module structure is shown in
Figure 10.
SENet can flexibly capture the connection between global and local information, allowing the model to identify the object regions that require attention and assign them greater weights, highlighting useful features and suppressing irrelevant ones, thus improving accuracy [27]. The convolutional block attention module (CBAM) adaptively adjusts features by inferring attention weights sequentially along the channel and spatial dimensions and multiplying them with the original feature map; its model complexity and computational cost are low [28]. Coordinate attention (CA) performs global average pooling along the height and width directions separately and then concatenates the feature maps of the two directions, which can effectively capture the relationships between channels [29]. In this paper, the SENet, CBAM, and CA modules are each added after the backbone; the network structure is shown in Figure 11, and the models with each module are compared using various evaluation metrics.
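To make the squeeze-and-excitation computation concrete, here is a minimal NumPy sketch of an SE block (random toy weights and a reduction ratio of 2; the actual module in the network is trained end-to-end, so these weights are illustrative only):

```python
import numpy as np

def se_block(x, w1, w2):
    """Squeeze-and-excitation: x is (C, H, W); w1 is (C/r, C), w2 is (C, C/r)."""
    s = x.mean(axis=(1, 2))                      # squeeze: global average pool -> (C,)
    z = np.maximum(w1 @ s, 0.0)                  # excitation FC1 + ReLU -> (C/r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ z)))          # excitation FC2 + sigmoid -> (C,)
    return x * a[:, None, None]                  # scale: reweight each channel

c, r = 4, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((c, 8, 8))
w1 = rng.standard_normal((c // r, c))            # toy weights, not trained
w2 = rng.standard_normal((c, c // r))
out = se_block(x, w1, w2)
print(out.shape)  # (4, 8, 8) -- same shape, channels rescaled by weights in (0, 1)
```

Because the sigmoid keeps each channel weight in (0, 1), the block can only attenuate channels relative to the input, which is the "suppress irrelevant features" behavior described above.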
- (3)
WIoU loss
Intersection over union (IoU) is a common evaluation metric in target detection [30]; it calculates the loss by evaluating the overlap between the model's prediction box B and the ground-truth box B^gt. Compared with the traditional cross-entropy loss function, it better handles the problems of category imbalance and localization accuracy in target detection, thereby improving detection performance, and the IoU loss is not affected by the bounding box scale. However, its value is zero whenever the two boxes B and B^gt do not overlap, so no gradient can be back-propagated, which leads to inaccurate detectors.
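The IoU computation and its zero-gradient failure mode can be illustrated in a few lines of Python (hypothetical helper name; boxes given as (x1, y1, x2, y2)):

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7 -> ~0.143
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes -> 0.0
```

For any pair of disjoint boxes the value is identically 0, regardless of how far apart they are, which is why the plain IoU loss provides no gradient signal in that regime.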
To solve the gradient-vanishing problem, Rezatofighi et al. introduced a penalty term based on the smallest box enclosing the ground-truth and prediction boxes, constructing the generalized intersection over union (GIoU) loss [31]. Most existing studies consider various geometric factors between the prediction box and the ground-truth box and construct a penalty term $\mathcal{R}_i$ to solve the gradient-vanishing problem; most bounding-box losses are additive and follow the paradigm:

$\mathcal{L}_i = \mathcal{L}_{IoU} + \mathcal{R}_i$
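As one instance of this additive paradigm, a GIoU-style loss adds an enclosing-box penalty to the plain IoU loss; the sketch below (hypothetical helper name) shows that the penalty stays nonzero for disjoint boxes, restoring a gradient:

```python
def giou_loss(a, b):
    """Additive paradigm L_i = L_IoU + R_i, with the GIoU enclosing-box penalty."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    # Smallest box C enclosing both a and b
    c_area = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    l_iou = 1.0 - inter / union                 # base IoU loss
    penalty = (c_area - union) / c_area         # R_i: fraction of C not covered
    return l_iou + penalty

# Disjoint boxes: IoU loss saturates at 1, but the penalty still varies
# with the distance between the boxes, so the gradient does not vanish.
print(giou_loss((0, 0, 1, 1), (2, 2, 3, 3)))  # 1 + 7/9 -> ~1.778
print(giou_loss((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes -> 0.0
```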
All of these methods assume that the examples in the training data are of high quality and work to strengthen the fitting ability of the bounding box loss. However, long-staple cotton data collected in real environments show large morphological and size differences as well as background occlusion, which interfere with the accuracy and consistency of data labeling; the resulting dataset therefore contains more low-quality examples, and persistently reinforcing the bounding-box regression of these low-quality examples may harm the improvement of model detection performance. Zhang et al. proposed an efficient intersection over union (EIoU) loss together with a regression version of focal loss [32] to focus the regression process on high-quality anchor boxes [33], but because its focusing mechanism is static, it does not fully exploit the potential of a non-monotonic focusing mechanism. To address this problem, this study adopts the WIoU [34] loss function based on a dynamic non-monotonic focusing mechanism, which defines an outlier degree to describe the quality of an anchor box and introduces a non-monotonic focusing coefficient to dynamically assign the optimal gradient gain to each anchor box. Its equations are as follows:

$\mathcal{L}_{WIoU} = r \cdot \mathcal{R}_{WIoU} \cdot \mathcal{L}_{IoU}$

$\mathcal{R}_{WIoU} = \exp\!\left( \frac{(x - x_{gt})^2 + (y - y_{gt})^2}{(W_g^2 + H_g^2)^*} \right)$

$\beta = \frac{\mathcal{L}_{IoU}^*}{\overline{\mathcal{L}_{IoU}}} \in [0, +\infty), \quad r = \frac{\beta}{\delta \alpha^{\beta - \delta}}$
where $\mathcal{R}_{WIoU} \in [1, e)$ mainly amplifies the $\mathcal{L}_{IoU}$ of ordinary-quality anchor boxes, and $\mathcal{L}_{IoU} \in [0, 1]$ mainly reduces the $\mathcal{R}_{WIoU}$ of high-quality anchor boxes, significantly reducing the focus on the center-point distance when the anchor box overlaps well with the target box. $\beta$ denotes the outlier degree, which describes the quality of the anchor box, and $r$ denotes the non-monotonic focusing coefficient, which is used to compute a gradient-gain allocation strategy that fits the current situation. To prevent $\mathcal{R}_{WIoU}$ from generating gradients that hinder convergence, $W_g$ and $H_g$ (the width and height of the smallest enclosing box) are detached from the computational graph, as indicated by the superscript $*$. This effectively eliminates the factors that hinder convergence, so no new metrics, such as aspect ratio, are introduced.
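A minimal NumPy sketch of this dynamic focusing step follows (hypothetical helper name; the hyperparameters alpha and delta and the running mean of the IoU loss, which during training is updated with momentum, are supplied as plain arguments, and detachment from the computational graph is mimicked by treating the enclosing-box size as a constant):

```python
import numpy as np

def wiou_v3_loss(pred, gt, iou_loss_mean, alpha=1.9, delta=3.0):
    """Sketch of the dynamic non-monotonic WIoU loss for boxes (x1, y1, x2, y2)."""
    ix1, iy1 = np.maximum(pred[0], gt[0]), np.maximum(pred[1], gt[1])
    ix2, iy2 = np.minimum(pred[2], gt[2]), np.minimum(pred[3], gt[3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    l_iou = 1.0 - inter / union

    # Enclosing-box size (W_g, H_g); "detached", so treated as a constant here.
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])
    cx_p, cy_p = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    cx_g, cy_g = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    r_wiou = np.exp(((cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2) / (w_g**2 + h_g**2))

    beta = l_iou / iou_loss_mean                  # outlier degree
    r = beta / (delta * alpha ** (beta - delta))  # non-monotonic focusing coefficient
    return r * r_wiou * l_iou

pred = np.array([0.0, 0.0, 2.0, 2.0])
gt = np.array([1.0, 1.0, 3.0, 3.0])
print(wiou_v3_loss(pred, gt, iou_loss_mean=0.5))
```

Because $r$ first rises and then falls with $\beta$, anchor boxes of ordinary quality receive the largest gradient gain, while both very high-quality and very low-quality (outlier) boxes are down-weighted.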