2.2.3. Multi-Scale Detection

In the process of sorting construction waste, small objects need to be detected. However, for a 640 × 640 input image, the finest detection layer of the original YOLOv5 model has a stride of 8, so it can only resolve objects down to a receptive field of about 8 × 8 pixels. If the height or width of an object in the dataset images is less than 8 pixels, its feature information is lost after convolutional down-sampling. To solve this problem, a dedicated detection layer for small objects is added. This layer outputs a 160 × 160 feature map and can identify objects with a receptive field larger than about 4 × 4 pixels, which basically meets the detection requirements of construction waste sorting. Accordingly, the three-scale feature fusion of the original YOLOv5 model was extended to four-scale feature fusion with the added 160 × 160 detection layer: the 80 × 80 feature map is up-sampled by a factor of two and fused with the newly added 160 × 160 detection layer, which is used to detect small objects. The overall network structure of the improved YOLOv5 model is shown in Figure 7, where the dotted line in the Head denotes the added fourth detection scale, and the dotted line in the Neck marks the corresponding addition to the four-scale feature fusion.
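The relation between input size, detection-layer stride, and feature-map size described above can be sketched as follows (a minimal illustration under the stated 640 × 640 input assumption, not the authors' code; the function name is hypothetical):

```python
# Sketch: grid sizes for a 640x640 input in a YOLOv5-style detector.
# The original model detects at strides 8, 16, 32 (feature maps 80, 40, 20);
# the added 160x160 layer corresponds to stride 4, so an object only
# ~4 px wide can still occupy at least one grid cell.
INPUT_SIZE = 640

def detection_grids(strides):
    """Map each detection-layer stride to its output feature-map size."""
    return {s: INPUT_SIZE // s for s in strides}

original = detection_grids([8, 16, 32])
improved = detection_grids([4, 8, 16, 32])  # adds the 160x160 layer

print(original)  # {8: 80, 16: 40, 32: 20}
print(improved)  # {4: 160, 8: 80, 16: 40, 32: 20}
```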

**Figure 7.** Structure of the Improved Multi-Scale Detection.

#### *2.3. Dataset Construction and Evaluation Index*

#### 2.3.1. Dataset

To evaluate the improved YOLOv5 model, it was tested on two datasets: the public PASCAL VOC dataset [38] and the self-built construction waste dataset. PASCAL VOC is a common object-detection dataset that contains 20 object classes grouped into 4 broad categories. Here, the train + val splits of VOC2007 and VOC2012 are used as the training set, which consists of 5011 training samples from VOC2007 and 11,540 training samples from VOC2012, and the test split of VOC2007, with 4952 test samples, is used as the test set.

Waste sorting requires a rich dataset containing construction waste images of many different types. Only two open-source datasets, TrashNet [39] and Taco [40], are available, but neither suits a robotic sorting system, because the objects transported on a conveyor belt are irregular, dirty, and piled on one another. Developing a new dataset was therefore the first important step. Sample images were collected on a construction site, as shown in Figure 8. The dataset consists of 3046 construction waste images divided into 4 classes: bricks, wood, stones, and plastics. To make the dataset more effective, data-augmentation operations such as image flipping, translation, rotation, cropping, scaling, noise addition, and random occlusion [41] were applied; these help to avoid overfitting during training and improve the robustness of the model. A graphical image-annotation tool, Labelimg, was used to label the images in the construction waste dataset. Finally, the dataset was divided into 3 subsets: the training set accounted for 80%, the validation set 10%, and the test set 10%.
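The 80/10/10 split described above can be sketched as follows (illustrative only; the filenames and the `split_dataset` helper are hypothetical, and the paper does not specify the shuffling procedure):

```python
# Sketch of an 80/10/10 train/val/test split over 3046 images.
import random

def split_dataset(filenames, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle filenames reproducibly and split into train/val/test lists."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    files = list(filenames)
    random.Random(seed).shuffle(files)
    n_train = int(len(files) * ratios[0])
    n_val = int(len(files) * ratios[1])
    return (files[:n_train],
            files[n_train:n_train + n_val],
            files[n_train + n_val:])

# Dataset size taken from the paper; names are placeholders.
images = [f"img_{i:04d}.jpg" for i in range(3046)]
train, val, test = split_dataset(images)
print(len(train), len(val), len(test))  # 2436 304 306
```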

**Figure 8.** Construction waste image on site.

#### 2.3.2. Model Performance Evaluation Index

Average precision (*AP*), *F*1-score and mean average precision (*mAP*) were used as evaluation indicators to test the performance of the model. Average precision is a measure that combines recall and precision for ranked retrieval results: the recall reflects the ability to find positive samples, the precision expresses the ability to classify samples correctly, and the *AP* summarizes the overall performance for object detection. The precision (*P*)–recall (*R*) curve can be plotted with the calculated *P* as the ordinate and *R* as the abscissa; the area under this curve is the *AP*, and the mean of the per-class *AP* values is the *mAP*. In addition, the *F*1-score, the harmonic mean of precision and recall, is commonly used for classification problems. The equations are shown in Equations (8)–(12).

$$P = \frac{TP}{TP + FP} \tag{8}$$

$$R = \frac{TP}{TP + FN} \tag{9}$$

$$AP = \int\_0^1 P(R) \, dR \tag{10}$$

$$mAP = \frac{1}{N} \sum\_{i=1}^{N} AP\_i \tag{11}$$

$$F1 = 2\frac{P \cdot R}{P + R} \tag{12}$$

where *TP* (true positive) is the number of positive samples correctly predicted as positive; *FP* (false positive) is the number of negative samples wrongly predicted as positive; *FN* (false negative) is the number of positive samples wrongly predicted as negative; and *N* is the number of sample classes in the dataset. Whether a prediction is a positive or a negative sample is judged by a threshold on the Intersection over Union (IoU), which is the area of overlap between the predicted region and the ground truth divided by the area of their union. If the IoU is greater than the threshold, the prediction is classified as a positive sample; otherwise it is a negative sample. When the IoU threshold is 0.5, the average precision of the YOLOv5 model is denoted *AP*0.5, and the mean average precision is denoted *mAP*0.5.
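Equations (8)–(12) and the IoU-based positive/negative assignment can be illustrated with a short self-contained sketch (the function names and the toy detections are illustrative, not from the paper; boxes are (x1, y1, x2, y2) in pixels):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

def precision_recall(tp_flags, num_gt):
    """Cumulative P (Eq. 8) and R (Eq. 9) over detections ranked by confidence;
    tp_flags[i] is True when detection i has IoU above the threshold."""
    precisions, recalls = [], []
    tp = fp = 0
    for flag in tp_flags:
        tp += flag
        fp += not flag
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_gt)
    return precisions, recalls

def average_precision(precisions, recalls):
    """Area under the P-R curve (Eq. 10), by rectangular summation over R."""
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_r)
        prev_r = r
    return ap

def mean_average_precision(aps):
    """Mean of the per-class AP values (Eq. 11)."""
    return sum(aps) / len(aps)

def f1_score(p, r):
    """Harmonic mean of precision and recall (Eq. 12)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Toy example: one class, 3 ground-truth boxes, 4 ranked detections.
flag = iou((0, 0, 10, 10), (1, 1, 11, 11)) > 0.5  # IoU ~0.68 -> positive
P, R = precision_recall([flag, True, False, True], num_gt=3)
print(round(average_precision(P, R), 3))   # 0.917
print(round(f1_score(P[-1], R[-1]), 3))    # 0.857
```

This simple rectangular summation is one way to evaluate the integral in Equation (10); practical toolkits often use an interpolated variant of the P–R curve instead.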
