The original YOLOv5 model has the following limitations: (i) It can only generate horizontal detection boxes and cannot predict a rotation angle. (ii) The stack of bottleneck modules in BottleneckCSP is serial, which causes the middle-layer features to be lost. (iii) The feature aggregation network lacks end-to-end connections between the input and output feature maps.
To solve these problems, we optimized three aspects of YOLOv5: (i) We added an angular prediction network and loss function, as well as a dynamic angle smoothing algorithm for angular classification, to improve the angular prediction ability. (ii) We optimized the BottleneckCSP module of the backbone network to enhance the model’s ability to extract the features of oriented waste. (iii) We optimized the feature aggregation network to improve the effect of multi-scale feature fusion.
2.2.1. Improvement of the Detection Head Network
The original YOLOv5 detector lacks a network structure for angular prediction and cannot provide the grasping angle information for waste objects. Therefore, the robotic arm cannot set the optimal grasp mode according to the placement angle of the waste, which easily leads to the object falling or to grabbing failure. Thus, we optimized the structure of the detection head.
Angular prediction can be realized as regression or classification. The regression mode produces a continuous prediction value for the angle, but there is a periodic boundary problem, which leads to a sudden increase in the value of the loss function at the boundary of periodic changes, increasing the difficulty of learning [18]. For example, in the 180° long-side definition method, the defined label range is (−90°, 90°). When the true angle of waste is 89° and the prediction is −90°, the error learned by the model is 179°, but the actual error should be 1°, which affects the learning of the model.
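The boundary discontinuity can be illustrated with a short sketch (pure Python; the function names are ours): under the 180° long-side definition, the naive regression error between 89° and −90° is 179°, while the true circular distance is only 1°.

```python
# Illustration of the periodic boundary problem in angle regression.
# Labels follow the 180-degree long-side definition, range (-90, 90].

def naive_error(pred, true):
    # What a plain regression loss "sees".
    return abs(pred - true)

def circular_error(pred, true, period=180):
    # The actual angular deviation on the 180-degree circle.
    d = abs(pred - true) % period
    return min(d, period - d)

# naive_error(-90, 89) is 179, but circular_error(-90, 89) is 1.
```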
Therefore, we added convolution network branches in the detection head and defined the angle label with 180 categories obtained by rotating the long side of the target box clockwise around the center. The angle convolution network generates the angle prediction using information extracted from the multi-scale features obtained by the feature aggregation network.
In the detection head, the angle convolution network and the original network share the output of the feature aggregation network as the input feature graph. The output of the angle prediction network and the original network are merged as follows:
where cls is the predicted category of the waste, x and y are the predicted central coordinates of the object box, l and s are the predicted lengths of the longer side and shorter side of the object box and θ is the predicted angle of the oriented waste.
2.2.2. Angle Smoothing and Loss Function
The realization of angular prediction as classification can avoid the periodic boundary problem caused by regression, but there are still some limitations. The loss function of traditional category tasks is calculated as cross-entropy loss, and the form of the labels is “one-hot label encoding”, as shown in Equations (2) and (3):
where y is the “one-hot label encoding” for the angle of sample i, θ is the angle of the oriented waste and p is the prediction of the detection model.
Equations (2) and (3) show that, for different incorrect predictions of the angle, the same loss value is obtained and the distance of the mistake cannot be quantified, which makes it difficult for the model to learn the angle of the oriented waste during training.
To solve this problem, we propose a dynamic smoothing label algorithm based on the circular smooth label (CSL) algorithm [18] to optimize the “one-hot label encoding” of the angle.
The circular smooth label algorithm is shown in Equation (4):
where θ is the rotation angle value, r is the range of smoothness and g(x) is the smoothing function. The angle label vector manifests as a “dense” distribution because the label takes non-zero values when x is within the range of smoothness.
The value of the smoothing function is shown in Equation (5):
where, when x = θ, the function has a maximum value of 1, and when x lies outside the range of smoothness, it is 0.
The CSL algorithm partially densifies the “one-hot label encoding”. When the angular prediction of the model is in the range of smoothness, different loss values for different predicted degrees are obtained; thus, it can quantify the mistake in the angle category prediction. However, the performance of CSL is sensitive to the range of smoothness. If the range of smoothness is too small, the smoothing label will degenerate into “one-hot label encoding” and lose its effect, and it will be difficult to learn the information from the angle. If the range is too large, the deviation in the angle prediction will be large, which will lead to it missing the object, especially for waste with a large aspect ratio.
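As a concrete sketch (our own minimal implementation; the Gaussian window and its width are illustrative choices, not the paper's settings), a circular smooth label can be built as:

```python
import math

def circular_smooth_label(theta, num_classes=180, radius=6):
    # Dense angle label: a Gaussian window of half-width `radius` centred
    # on the true angle class theta, zero outside the window.
    label = [0.0] * num_classes
    for x in range(num_classes):
        d = abs(x - theta) % num_classes
        d = min(d, num_classes - d)  # circular distance on the angle ring
        if d <= radius:
            label[x] = math.exp(-d * d / (2 * (radius / 3) ** 2))
    return label
```

Because the distance is circular, the window wraps around the 0°/180° boundary, so a label at 0° also assigns weight to the 179° bin.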
Therefore, we propose a dynamic smoothing function for the angle label to adjust the smoothing amplitude and range.
The dynamic smoothing function uses the dynamic Gaussian function to smooth the angle labels. It can be seen from
Figure 3 that the smoothing amplitude and the range of the Gaussian function are controlled by the root mean square (RMS) value: the larger the RMS, the flatter the curve; the smaller the RMS, the steeper the curve and the smaller the smoothing range. Therefore, the RMS of the Gaussian function is gradually shrunk to achieve dynamic smoothing, as shown in Equation (6).
We provide two efficient functions—linear annealing and cosine annealing—to adjust the RMS, as follows:
where θ is the value of the rotation angle for the waste, which corresponds to the peak position of the function; x takes values over the encoding range of the waste angle; σ is the value of the RMS; and d(x, θ) is the circular distance between the encoding position and the angle value. For example, if θ is 179, d(x, θ) is 1 when x is 0. epoch and max_epoch represent the current number of training rounds and the maximum number of rounds of the model, respectively, and c and e are hyper-parameters.
It can be seen from Equation (6) that the DSM densifies the angle label dynamically according to the circular distance between the encoding position and the angle value. In the early stage of model training, σ takes large values because the epoch is small. At this time, the range of smoothing is large, and the model's learning of angles is spread over the window area. With this “looser” smoothing range, the model moves closer to the neighborhood of the optimal point, which reduces the difficulty of angle learning and improves the recall rate in image waste detection. The range of angle smoothing then decreases as σ shrinks with increasing epochs, and the objective of the model shifts from the optimal region to learning the optimal point itself, so the deviation in the angular prediction becomes smaller. The higher accuracy of the angle prediction improves the recall rate for oriented waste, especially in cases with a large aspect ratio.
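The dynamic behavior can be sketched as follows (a hypothetical implementation: the exact annealing formulas and the roles of the hyper-parameters c and e are our assumptions, chosen only to be consistent with the description of a shrinking RMS):

```python
import math

def annealed_rms(epoch, max_epoch, c=6.0, e=0.5, mode="cosine"):
    # Assumed schedules: c sets the initial RMS, e the final RMS.
    t = epoch / max_epoch
    if mode == "linear":
        return e + (c - e) * (1 - t)
    return e + (c - e) * 0.5 * (1 + math.cos(math.pi * t))

def dynamic_smooth_label(theta, epoch, max_epoch, num_classes=180):
    # Gaussian smoothing whose width shrinks as training progresses.
    sigma = annealed_rms(epoch, max_epoch)
    label = []
    for x in range(num_classes):
        d = abs(x - theta) % num_classes
        d = min(d, num_classes - d)  # circular distance
        label.append(math.exp(-d * d / (2 * sigma ** 2)))
    return label
```

Early in training the label is wide and “loose”; near the final epoch it approaches a one-hot vector concentrated at the true angle.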
The angular loss of waste was calculated using the cross-entropy loss function based on the dynamic smoothing algorithm:
where p is the prediction of the angle, S is the quantification of the subdomains of the picture (the model provides a prediction of the target for each subdomain) and I is 0 or 1, which indicates whether there is a target. When the prediction is close to the true value, the cross-entropy has a smaller value.
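With a smoothed target, predictions that fall nearer the true angle receive a smaller loss. A toy sketch (our own helper name; the label and prediction vectors are illustrative):

```python
import math

def smoothed_cross_entropy(pred_probs, smooth_label):
    # Cross-entropy of a predicted angle distribution against a smoothed
    # (unnormalized) label vector; eps guards against log(0).
    eps = 1e-12
    return -sum(t * math.log(p + eps)
                for t, p in zip(smooth_label, pred_probs))
```

For a smoothed target such as [0.1, 1.0, 0.1, 0.0], a prediction concentrated on the second bin yields a lower loss than one concentrated on the last bin, so the mistake distance is quantified.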
In addition, the GIoU loss function [19] was used to calculate the regression loss of the detection boundary box. In Figure 4, A and B are the real box and the prediction box of the detection target, respectively. C is the smallest rectangle surrounding A and B. The green area is C − (A ∪ B).
The specific calculation is shown in Equations (8)–(10). GIoU not only pays attention to the overlap of the real box and the prediction box but also to the non-overlapping area, which allows it to solve the problem of the gradient not being calculated caused by A and B not intersecting.
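For the horizontal case, the GIoU computation can be sketched as follows (a minimal implementation of the published formula; the (x1, y1, x2, y2) box format is our convention):

```python
def giou(a, b):
    # a, b: axis-aligned boxes (x1, y1, x2, y2).
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection of A and B (zero if they do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    iou = inter / union
    # C: smallest enclosing rectangle of A and B.
    area_c = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    # GIoU penalizes the part of C covered by neither box.
    return iou - (area_c - union) / area_c
```

For disjoint boxes the IoU term is zero, but the enclosing-box term still varies with their separation, so the loss remains informative (and differentiable) even when A and B do not intersect.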
In the equations, A and B are the real box and the prediction box of the detection target, respectively. C is the smallest rectangle surrounding A and B. The confidence loss function and category loss function are as shown by Equations (11) and (12):
where I_obj and I_noobj indicate whether the prediction box j of the grid i is the target box, and λ indicates the weight coefficients.
The overall loss function of the improved model is a weighted combination of the above loss functions, as shown in Equation (13):
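A plausible form of this combination, consistent with the components described above (the weights λ₁ to λ₄ are assumptions, since Equation (13) is not reproduced here), is:

```latex
L_{total} = \lambda_1 L_{GIoU} + \lambda_2 L_{conf} + \lambda_3 L_{cls} + \lambda_4 L_{angle}
```

where the four terms are the box regression, confidence, category and angular losses, respectively.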
2.2.3. Improvement of Feature Extraction Backbone Network
The feature extraction backbone network was used to extract the features of the waste in the image. Due to the addition of angular prediction in the detection of oriented waste, there is a higher demand on the feature extraction to realize effective recognition, especially for waste with a large aspect ratio, whose narrow area makes features harder to extract.
BottleneckCSP is the main module in the backbone of YOLOv5. The BottleneckCSP module is stacked using a bottleneck architecture. As shown in Figure 5a, the stacking of the bottleneck modules is serial. With the deepening of the network, the feature abstraction capability is gradually enhanced, but shallow features are generally lost [20]. Shallow features have lower semantics and are more detailed due to the fewer convolution operations. Utilizing multi-level features in CNNs through skip connections has been found to be effective for various vision tasks [21,22,23]. The bypassing paths are presumed to be the key factor for easing the training of deep networks. Concatenating feature maps learned by different layers can increase the variation in the input of subsequent layers and improve efficiency [24,25]. In addition, attention mechanisms, which assign different weights to different features according to their importance, have been found to be effective for image recognition [26,27]. The coordinate attention mechanism (CA) [28] is one such mechanism that shows good performance. Therefore, as shown in Equation (14), we concatenated and merged the middle features of BottleneckCSP and added the CA module to enhance the feature extraction capability. The attention mechanism is optional in the module at different levels.
where F is the output feature map, X is the input of the BottleneckCSP module, f is the function mapping of the bottleneck module and CA(·) represents the CA attention operation.
Due to the “residual block” connection in the bottleneck architecture, excessive feature merging between bottlenecks leads to feature redundancy, which is not suitable for model training, and the increased number of parameters means that more resources are consumed. Therefore, the characteristic layers were connected using “interlayer merging”, as shown in
Figure 5b. The optimized module was named HDBottleneckCSP.
The CA module structure in HDBottleneckCSP is shown in
Figure 6. The input feature maps are coded along the horizontal and vertical coordinates to obtain the global field and to encode position information, respectively, which helps the network to detect the locations of targets more accurately.
As shown in Equation (15), the CA module generates vertical and horizontal feature maps for the input feature map and then transforms them through a 1 × 1 convolution. The generated f is the intermediate feature map for the spatial information in the horizontal and vertical directions, r is the downsampling scale and F1 represents the convolution operation.
where f is divided into f_h and f_w in the spatial dimension. As shown in Equations (16) and (17), each is transformed into the same number of channels as the input feature map through the convolution operation, and g_h and g_w are used as the attention weights that participate in the feature map operation. The output result of the CA module is shown in Equation (18).
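The directional pooling at the heart of CA can be sketched as follows (a simplified stand-in: the 1 × 1 convolutions and non-linearity of the real module are replaced by identity transforms, so this only illustrates the information flow, not the learned weighting):

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def coord_attention_sketch(x):
    # x: feature map of shape (C, H, W).
    # CA pools along each spatial direction separately, so each pooled
    # vector retains the positional information of the other direction.
    z_h = x.mean(axis=2)              # (C, H): average over width
    z_w = x.mean(axis=1)              # (C, W): average over height
    # The real module concatenates [z_h; z_w], applies a 1x1 convolution
    # and a non-linearity, then splits back; we use identities here.
    g_h = sigmoid(z_h)[:, :, None]    # attention weights along height
    g_w = sigmoid(z_w)[:, None, :]    # attention weights along width
    return x * g_h * g_w              # reweighted feature map
```

Because the two pooled encodings are applied along orthogonal axes, the output is modulated position-wise in both directions, which is what helps the network localize targets.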
The optimized feature extraction backbone network structure is shown in
Figure 7. It extracts features through the convolution module and the HDBottleneckCSP module and generates feature maps with three sizes by downsampling (1/8, 1/16 and 1/32).
2.2.4. Improvement of Feature Aggregation Network
The YOLOv5 feature aggregation network consists of feature pyramid networks (FPNs) [29] and path aggregation networks (PANets) [30]. The structure of a PANet is shown in Figure 8a. The PANet aggregates features along two paths: top-down and bottom-up. However, the aggregated features are deep features with high semantics, and the shallow features with high resolution are not fused. In order to make use of the input features more effectively, we used P2P-PANet to replace the PANet based on BiFPN [31], as shown in Figure 8b.
Compared to the PANet, the P2P-PANet adds end-to-end connections for the input-feature and output-feature maps, which establishes “point-to-point” horizontal connection paths from the low level to the high level, and it can realize the fusion of high-resolution and complex semantic features in an image without adding much cost. Through the extraction and induction of semantic information for the high-resolution and low-resolution feature maps, the angular feature information of rotated waste is further enhanced, and the detection ability of the model is improved.
The method for oriented waste detection after all the optimizations was named YOLOv5m-DSM and is shown in
Figure 9. When a picture is input into the model, YOLOv5m-DSM extracts features using the backbone and generates downsampling feature maps with three different sizes for the detection of waste. The feature aggregation network undertakes feature aggregation and fusion to enhance the model’s ability to learn features. The detection head generates the prediction information for waste targets based on the multi-scale features. In the model’s training stage, the label of the training set is smoothed using the dynamic smoothing module, and the loss in the prediction, including class, angle and position, is calculated using the loss calculation module for iterative learning.