1. Introduction
Aggregate classification is an important factor in determining the performance and quality of concrete. Concrete is composed of cement, sand, stones, and water, and aggregate generally accounts for 70% to 80% of it [1]. Many factors affect the strength of concrete, chiefly the cement strength and water–binder ratio, the aggregate gradation and particle shape, the curing temperature and humidity, and the curing age [2]. The aggregate processing system is one of the most important auxiliary production systems in the construction of large-scale water conservancy and hydropower projects [3]. Aggregate quality control is of great significance for promoting the sound development of the engineering construction industry [4]; it is also extremely important for improving project quality and optimizing project cost [5]. Different types of aggregate have different effects on the performance of concrete [6]. Regarding particle size and shape, the current specifications for needle-like particles in coarse aggregate are relatively broad [7], and good-quality aggregate needs a standardized particle size and shape [8]. Therefore, we must ensure that aggregate meets quality requirements and that raw materials are selected reasonably to guarantee the quality of concrete. It is particularly important to find a suitable method for aggregate classification and detection.
In recent years, the level of aggregate classification and detection has greatly improved [9], and there are now a variety of particle size measurement methods. These include, for example, mesoscale modeling of static and dynamic tensile fracture of concrete with real-shaped aggregates [10]; the development of a particle size and shape measurement system for manufactured sand [11]; extreme gradient boosting-based classification of pavement aggregate shape [12]; the use of the wire mesh method to sort aggregate [13]; a method for evaluating the strength of individual ballast aggregates by point load testing and establishing a classification from it [14]; the determination of particle size and of core and shell size in core–shell particle distributions by analytical ultracentrifugation [15]; the use of the projected area of a particle to calculate its surface area, equivalent diameter, and sphericity [16]; the use of imaging methods to obtain reliable particle size distributions [17]; and a particle size and shape measurement device built from a vibration dispersion system, a feed blanking system, and a backlight image acquisition system [18]. Isa et al. proposed an automatic intelligent aggregate classification system combined with robot vision [19]. Sun et al. proposed a coarse aggregate granularity classification method based on a deep residual network [20]. Moaveni et al. developed a new segmentation technology that can capture images quickly and reliably to analyze their size and shape [21]. Sinecen et al. established a laser-based aggregate shape representation system [22], which classifies aggregate using features extracted from the 3D images it creates.
However, these screening methods can only measure the size of sand particles offline. Although digital image processing methods rely on relatively mature techniques [23,24], research on them mainly focuses on evaluation indices for the shape characteristics of aggregate [25] and cannot achieve efficient real-time detection. In practice, the detection background of the aggregate, the size of the detection target, day and night illumination, and differences in detection distance can all introduce interference into the images transmitted to the processing side. In this case, it is necessary to first detect and localize the target, framing it to suppress interference as much as possible, while still detecting targets with different characteristics. Therefore, the real-time detection of aggregate features under complex backgrounds is of great significance.
In summary, this work makes the following main contributions:
(1) The design of a new aggregate detection and classification model, which can accurately identify aggregate types under complex backgrounds, such as different illumination, different distances, and different dry and wet states of the aggregate.
(2) The improvement of YOLOv5: replacing the C3 module in the backbone network, tailoring the Neck structure, and thereby compressing the model so that it can run quickly on a computer without GPU support. The loss function is also improved, making bounding box selection more accurate.
(3) The simplification of the original three detection heads of YOLOv5 into two, which is better suited to detecting a single target (only one target is recognized per image) and reduces both the number of parameters and the amount of calculation.
2. Related Work
There are many mature target detection algorithms, such as YOLOv4 [26], SSD [27], YOLOv4-tiny, and YOLOv5 [28]. Compared with these algorithms, YOLOv5 is lighter and more portable. YOLOv5 uses a backbone feature extraction network to acquire the depth features of the input image and uses feature fusion to further improve the effectiveness of those features, effectively framing the detection target and improving the precision of target detection [29]. YOLO is also widely used as a popular target detection algorithm. Yan et al. proposed a real-time apple target detection method for picking robots based on improved YOLOv5 [30]. Yao et al. proposed a ship detection method for optical remote sensing images based on deep convolutional neural networks [31]. Gu et al. proposed a YOLOv5-based method for identifying and analyzing the emergency behavior of caged laying ducks [32]. Zhu et al. proposed traffic sign recognition based on deep learning [33]. Fan et al. proposed a strawberry ripeness recognition algorithm combining dark channel enhancement and YOLOv5 [34]. A cost-performance evaluation of livestock activity recognition services using aerial imagery was proposed by Lema et al. [35]. Jhong et al. proposed a night object detection system based on a lightweight deep network for the internet of vehicles [36]. Wang et al. proposed a fast smoky vehicle detection method based on improved YOLOv5 [37]. Wu et al. studied the application of YOLOv5 to the detection of small targets in remote sensing images [38]. Song et al. proposed an improved YOLOv5-based object detection method for grasping robots [39].
Although YOLOv5 is much lighter than other object detection algorithms, its network structure is complex, with many layers and a large number of nodes, while experimental equipment is limited. Running on a CPU alone, actual training and inference take longer.
To solve these problems, this experiment establishes an aggregate classification and detection model for complex backgrounds, YOLOv5-ytiny, based on YOLOv5. It compresses the YOLOv5 model, extracts complex detection background features in different environments, improves detection speed, and judges the classification of aggregate in real time.
3. Materials and Methods
3.1. Data Collection and Processing
In this experiment, a high-definition camera is used to collect images, and the real-time images obtained by the camera are transmitted to the client. The model classifies and recognizes the acquired images, and then displays the results to the client.
Figure 1 is a schematic diagram of image acquisition.
When capturing images, the camera is fixed 2 m above the aggregate box. Considering that the distance between the cart and the camera is within 1–2 m during actual transportation, the expected effective detection distance is also within 1–2 m. Images are shot under natural light and night lighting, and each is saved as a 1920 × 1080 pixel RGB image. There are four types of aggregate: stones, small stones, machine-made sand, and surface sand. The particle size of the stones is in the range of 3–4 cm; that of the small stones, 2–3 cm; that of the machine-made sand, 1–2 cm; and that of the surface sand, 0.1–0.5 cm. A total of 525 images of the four types were taken, covering different light conditions, dry and wet aggregate, and different shooting distances.
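For illustration, a minimal acquisition sketch is given below; it assumes an OpenCV-compatible camera at device index 0, and the file name is a placeholder rather than the naming scheme actually used.

```python
# A minimal sketch of the acquisition step, assuming an OpenCV-compatible
# camera at device index 0; resolution follows the setup described above
# (1920 x 1080 RGB frames). Device index and file name are placeholders.
import cv2

cap = cv2.VideoCapture(0)                    # hypothetical camera index
cap.set(cv2.CAP_PROP_FRAME_WIDTH, 1920)
cap.set(cv2.CAP_PROP_FRAME_HEIGHT, 1080)

ok, frame = cap.read()                       # BGR image from the camera
if ok:
    cv2.imwrite("aggregate_0001.jpg", frame) # hypothetical file name
cap.release()
```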
Taking the stones in Figure 2a and the small stones in Figure 2b as examples, the unit grayscale counts of stones are distributed over the (130, 180) interval, and those of small stones over the (120, 180) interval. The grayscale distributions of these two types of aggregate therefore overlap heavily, making it difficult to segment an image based on a gray threshold. Moreover, it can be seen in Figure 3 that the stones and small stones are stacked. If an image is segmented using a grayscale threshold method, it may not be possible to segment a single aggregate target, because connected regions share the same grayscale values, which easily causes targets to stick together. In addition, images of aggregate collected at short distances are clear, but images become blurred at longer distances, making image processing difficult. Therefore, this experiment uses the target detection algorithm YOLOv5 to extract the characteristics of aggregate against different backgrounds and realize aggregate type recognition.
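As a sketch of this grayscale analysis, the following code computes the gray-level histograms of two aggregate images and the fraction of pixels inside the reported band; the file names are placeholders.

```python
# Sketch of the grayscale-overlap check described above: compute the
# gray-level histograms of two aggregate images so their distributions
# can be compared. File names are hypothetical.
import cv2

stones = cv2.imread("stones.jpg", cv2.IMREAD_GRAYSCALE)
small_stones = cv2.imread("small_stones.jpg", cv2.IMREAD_GRAYSCALE)

hist_s = cv2.calcHist([stones], [0], None, [256], [0, 256]).ravel()
hist_x = cv2.calcHist([small_stones], [0], None, [256], [0, 256]).ravel()

# Fraction of stone pixels inside the (130, 180) interval reported above
in_band = hist_s[130:181].sum() / hist_s.sum()
print(f"stones: {in_band:.1%} of pixels fall in the 130-180 gray band")
```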
In this experiment, a total of 525 images were obtained, covering stones, small stones, machine-made sand, and surface sand, labeled sz, xsz, jzs, and ms, respectively. We used LabelImg to annotate the images, taking the smallest bounding rectangle of the target as the ground truth box. From the final data set, 80% (420 images) were randomly selected as the training set and 20% (105 images) as the test set.
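A minimal sketch of this random split is shown below, assuming the labeled images sit in a flat directory (the path is hypothetical).

```python
# A minimal sketch of the random 80/20 train/test split described above,
# assuming the 525 labeled images sit in one directory (path hypothetical).
import random
from pathlib import Path

random.seed(0)                               # reproducible shuffle
images = sorted(Path("data/images").glob("*.jpg"))
random.shuffle(images)

n_train = int(0.8 * len(images))             # 420 of 525 images
train_set, test_set = images[:n_train], images[n_train:]
print(len(train_set), "train /", len(test_set), "test")
```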
The four types of aggregate show different shapes and colors under different dry and wet states and different light levels. Images were collected under cloudy, sunny, and night conditions; the aggregate states were dry, normal, and wet; and the collection distance was 1.5 m.
The computer used in this research has an Intel(R) Core(TM) i5-8250U processor (1.80 GHz), 8 GB of RAM, and 512 GB of storage; the development environment is Python 3.6.
3.2. Model Establishment and Experimental Process
Aggregate Classification Model Based on Improved YOLOv5
The technical route of the aggregate classification model is shown in Figure 4. The manually labeled aggregate data are input into the YOLOv5 model for training and fine-tuning to realize real-time recognition of the target. Building on the YOLOv5 model, the improved YOLOv5-ytiny model is used to classify aggregate under complex backgrounds. YOLOv5-ytiny replaces the C3 module of the backbone, cuts the Neck structure to achieve compression, reduces the number of prediction heads, reduces the image size, and adjusts the network width, simplifying the structure and parameters of the model while maintaining precision.
The YOLOv5 algorithm has four network structures: YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. These four structures differ in width and depth but share the same principle and can be selected flexibly according to need. The deeper the selected structure, the higher the precision, but the lower the training and inference speed. Aggregate is not a complex target, and we want to increase inference speed; therefore, the selected structure is YOLOv5s, and improvements are made on this basis.
The YOLOv5 model mainly consists of the following five modules: (1) The Focus module slices the input image, achieving the effect of down-sampling without losing information. (2) The Conv module, a basic convolution block that encapsulates three functions: a convolution (Conv2d) layer, a BN (Batch Normalization) layer, and the SiLU (Swish) activation function; input features pass through the convolution layer, the normalization layer, and the activation function to produce the output. (3) The Bottleneck module, which is mainly used to reduce the number of parameters and thereby the amount of calculation; after dimensionality reduction, data training and feature extraction can be performed more effectively and intuitively. (4) The C3 module; in the new version of YOLOv5, the author replaced the BottleneckCSP (bottleneck layer) module with the C3 module. Its structure and function are the same as the CSP architecture, but the correction unit is chosen differently; it contains three standard convolutional layers and multiple Bottleneck modules. (5) The SPP (spatial pyramid pooling) module, whose main purpose is to fuse features of different resolutions to obtain more information.
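To make modules (1) and (2) concrete, the following PyTorch sketch mirrors the published structure of the Conv and Focus modules (simplified: grouped convolution and automatic padding logic are omitted).

```python
# Sketch of the two most basic YOLOv5 building blocks described above:
# Conv = Conv2d + BN + SiLU, and Focus = slice-and-concatenate
# down-sampling followed by a Conv.
import torch
import torch.nn as nn

class Conv(nn.Module):
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Focus(nn.Module):
    # Slices the image into four pixel-interleaved quarters, stacks them on
    # the channel axis (4x channels, half resolution), then applies a Conv.
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = Conv(c_in * 4, c_out, k)

    def forward(self, x):
        return self.conv(torch.cat(
            [x[..., ::2, ::2], x[..., 1::2, ::2],
             x[..., ::2, 1::2], x[..., 1::2, 1::2]], dim=1))
```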
3.3. YOLOv5-Ytiny
Although the established YOLOv5 model can detect and classify aggregate, its structure and parameter count are still relatively large, and calculation takes a long time. Moreover, the detection and classification of aggregate generally take place during vehicle transportation, and the results need to be displayed in real time. Therefore, to improve detection speed and reduce the amount of calculation, the model is optimized and compressed to form the YOLOv5-ytiny model.
We replace the C3 module of YOLOv5 with a CI module. As shown in Figure 5, the C3 module of YOLOv5 has a shortcut structure that connects two adjacent network layers and contains n residual blocks. The aggregate data set poses a relatively simple recognition task, so the multiple residual modules in the C3 module may waste resources; we therefore replace it with a CI module, whose structure is also shown in Figure 5.
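The exact layout of CI is defined by Figure 5; purely as an illustrative assumption, the sketch below treats CI as a C3-style block whose n stacked residual Bottlenecks are removed, keeping only the split-and-merge convolutions.

```python
# Illustrative assumption only: CI modeled as a C3 block with the n
# residual Bottlenecks removed, keeping the two 1x1 branches and the
# fusion convolution. The real layout is the one shown in Figure 5.
import torch
import torch.nn as nn

def conv_bn_silu(c_in, c_out, k=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.SiLU())

class CI(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.cv1 = conv_bn_silu(c_in, c_half)   # main 1x1 branch
        self.cv2 = conv_bn_silu(c_in, c_half)   # parallel 1x1 branch
        self.cv3 = conv_bn_silu(2 * c_half, c_out)

    def forward(self, x):
        # no residual Bottlenecks: the two branches are fused directly
        return self.cv3(torch.cat([self.cv1(x), self.cv2(x)], dim=1))
```

Dropping the residual stack is what removes most of the block's depth; for a simple single-class-per-image task, the remaining split-and-merge convolutions can still capture the needed features.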
After the C3 module is replaced with the CI module, the corresponding network structure in the original Backbone and Neck modules also changes. At the same time, all C3 layers in the original model are reduced to one layer to decrease the overall depth of the network. To achieve model compression, after replacing the C3 module, we cut the Neck module to remove its relatively redundant parts and then delete part of the structure, reducing the network depth and the amount of model calculation. The modified Neck module is shown in Figure 6.
The original YOLOv5 model consists of three parts: the Backbone module, the Neck module, and the detection heads. After replacing the C3 module with the CI module, the number of detection heads was reduced from three to two to cut the parameter count. The original Neck module is composed of multiple convolutional layers (Conv), up-sampling, and tensor concatenation (Concat). For a single, simple target, only part of this combination may be needed; the repeated multi-layer Neck structure can cause data redundancy and increase the amount of calculation, so we tailor the Neck structure to compress the network. Using fine-tuning with iterative training, we cut the network structure a few layers at a time, adjusting by judging the convergence effect; the final network structure is shown in Figure 6, and a sketch of the reduced head follows below. It can be seen that YOLOv5-ytiny eliminates part of the repetitive hierarchy, retains the main network structure, and finally compresses to two detection heads.
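The sketch below illustrates the reduced prediction head: a YOLOv5-style Detect layer written for two feature maps instead of three. The channel sizes, anchor count, and class count are illustrative assumptions, not the trained model's exact values.

```python
# Hedged sketch of a two-scale detection head: one 1x1 prediction conv per
# Neck feature map, two maps instead of YOLOv5's three. Channel sizes and
# anchor count are assumptions for illustration.
import torch.nn as nn

class TwoHeadDetect(nn.Module):
    def __init__(self, num_classes=4, num_anchors=3, channels=(128, 256)):
        super().__init__()
        out = num_anchors * (num_classes + 5)  # 4 box + 1 objectness + classes
        self.heads = nn.ModuleList(nn.Conv2d(c, out, 1) for c in channels)

    def forward(self, feats):                  # feats: two Neck feature maps
        return [head(f) for head, f in zip(self.heads, feats)]
```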
3.4. Improvement of Loss Function
YOLOv5s uses GIoU Loss as the bounding box regression loss function to judge the distance between the predicted box and the ground truth box. The formula is as follows:

$$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad L_{GIoU} = 1 - \mathrm{IoU} + \frac{|A_c - u|}{|A_c|}$$

In the above formula, A is the predicted box, B is the ground truth box, IoU represents the intersection-over-union ratio of the predicted box and the ground truth box, $A_c$ represents the smallest circumscribed rectangle of the predicted box and the ground truth box, u represents their union, and $L_{GIoU}$ is the GIoU Loss.
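A minimal numeric sketch of this loss for axis-aligned boxes in (x1, y1, x2, y2) form is given below; it is written for single boxes to keep the correspondence with the formula readable.

```python
# GIoU Loss for two axis-aligned boxes given as 1-D tensors (x1, y1, x2, y2).
import torch

def giou_loss(a, b):
    inter_w = (torch.min(a[2], b[2]) - torch.max(a[0], b[0])).clamp(min=0)
    inter_h = (torch.min(a[3], b[3]) - torch.max(a[1], b[1])).clamp(min=0)
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter                   # u in the formula
    iou = inter / union
    # Ac: area of the smallest enclosing rectangle of the two boxes
    ac = (torch.max(a[2], b[2]) - torch.min(a[0], b[0])) * \
         (torch.max(a[3], b[3]) - torch.min(a[1], b[1]))
    return 1 - iou + (ac - union) / ac                # L_GIoU
```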
The original YOLOv5 model uses GIoU Loss as the position loss function to evaluate the distance between the predicted box and the ground truth box. However, when the prediction box lies entirely inside the ground truth box, GIoU Loss cannot distinguish between prediction boxes of the same size at different positions. In addition, its bounding box regression is not accurate enough, its convergence is slow, and only the overlap area is considered.
CIoU Loss takes into account the aspect ratio of the bounding box and measures the regression from three perspectives: overlapping area, center point distance, and aspect ratio, which makes prediction box regression more effective. The formula is as follows:

$$L_{CIoU} = 1 - \mathrm{IoU} + \frac{\rho^{2}(b, b^{gt})}{c^{2}} + \alpha v$$

$$v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2}, \qquad \alpha = \frac{v}{(1 - \mathrm{IoU}) + v}$$

In the above formula, w and h are the width and height of the prediction box, respectively, and $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth box; b and $b^{gt}$ are the center points of the two boxes, ρ is the Euclidean distance between them, c is the diagonal length of the smallest enclosing box covering both, v measures the consistency of the aspect ratios, and α is a trade-off parameter.

Compared with the GIoU Loss used in YOLOv5s, CIoU Loss measures the overlapping area, the center point distance, and the aspect ratio together, so the prediction box converges faster during training and attains higher regression positioning accuracy. This paper therefore uses CIoU Loss as the loss function of the aggregate classification detection model.
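A matching sketch of CIoU Loss, using the same box convention as the GIoU sketch above, follows the standard CIoU definition.

```python
# CIoU Loss for two axis-aligned boxes given as 1-D tensors (x1, y1, x2, y2).
import math
import torch

def ciou_loss(a, b):
    inter_w = (torch.min(a[2], b[2]) - torch.max(a[0], b[0])).clamp(min=0)
    inter_h = (torch.min(a[3], b[3]) - torch.max(a[1], b[1])).clamp(min=0)
    inter = inter_w * inter_h
    w, h = a[2] - a[0], a[3] - a[1]             # prediction box size
    wgt, hgt = b[2] - b[0], b[3] - b[1]         # ground truth box size
    union = w * h + wgt * hgt - inter
    iou = inter / union
    # squared center distance rho^2 and enclosing-box diagonal c^2
    rho2 = ((a[0] + a[2]) - (b[0] + b[2])) ** 2 / 4 + \
           ((a[1] + a[3]) - (b[1] + b[3])) ** 2 / 4
    cw = torch.max(a[2], b[2]) - torch.min(a[0], b[0])
    ch = torch.max(a[3], b[3]) - torch.min(a[1], b[1])
    c2 = cw ** 2 + ch ** 2
    v = (4 / math.pi ** 2) * (torch.atan(wgt / hgt) - torch.atan(w / h)) ** 2
    alpha = v / (1 - iou + v)
    return 1 - iou + rho2 / c2 + alpha * v      # L_CIoU
```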
4. Experimental Results and Analysis
4.1. Experimental Results
The detection of aggregate by the YOLOv5-ytiny model is shown in Figure 7. Taking small stones as an example, they are tested at different distances, under different illumination, and with aggregate in different dry and wet states. Figure 7a–c show the detection results on cloudy days, on sunny days, and at night at a distance of 1.5 m; Figure 7d–f show the results at distances of 1 m, 1.5 m, and 2 m, respectively; and Figure 7g–i show the identification of small stones under dry, normal, and wet conditions. Accurate classification of aggregate under different backgrounds is realized. After verification, the effective recognition range of the YOLOv5-ytiny model is between 1 m and 2 m.
Table 1 shows the confidence levels of the four types of aggregate under different conditions. Inspection was carried out under different light conditions (sunny, cloudy, and night) and at different distances (1 m, 1.5 m, and 2 m). The dry and wet state of the aggregate also varied: tests were carried out under dry, normal, and wet conditions.
The YOLOv5 model and YOLOv5-ytiny were each trained for 300 iterations. Figure 8 shows the trend of the classification loss function. The loss of the original YOLOv5 drops rapidly in the early iterations, indicating that the model fits quickly, whereas YOLOv5-ytiny converges more slowly at first. From about 140 iterations onward, the loss of YOLOv5-ytiny decreases slowly; by 200 iterations, the loss values of the two models are basically the same, indicating that YOLOv5-ytiny converges well and learns efficiently. When the iteration count reaches 220, the loss value fluctuates around 0.001 and the model reaches a stable state.
4.2. Evaluation and Analysis of Model Performance
This paper selects commonly used evaluation indicators for target detection models to evaluate the model: precision, recall, balanced F score (F1-score), mean average precision (mAP), and FPS (frames per second).
The formulas are as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

$$AP = \frac{\sum \mathrm{Precision}}{Num}, \qquad mAP = \frac{1}{N} \sum_{i=1}^{N} AP_i$$

In the above formulas, TP is the number of positive examples that are correctly classified, FP is the number of negative examples that are incorrectly classified as positive, FN is the number of positive examples that are incorrectly classified as negative, and TN is the number of negative examples that are correctly classified. AP is the average precision of one category, mAP is the mean of the per-category AP values, Num is the number of targets in each category, and N is the total number of categories.
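The count-based metrics can be sketched directly from these formulas; the counts in the usage example are hypothetical.

```python
# Count-based metrics matching the formulas above; tp, fp, fn are per-class
# counts from the test set, and ap_per_class would come from the PR curve.
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f1_score(p, r):
    return 2 * p * r / (p + r)

def mean_ap(ap_per_class):
    # mAP: mean of the per-class average precision values (N classes)
    return sum(ap_per_class) / len(ap_per_class)

# usage example with hypothetical counts for one class
p, r = precision(tp=97, fp=3), recall(tp=97, fn=2)
print(f"P={p:.3f} R={r:.3f} F1={f1_score(p, r):.3f}")
```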
4.3. Comparison with Original Model
The comparison of evaluation indicators between the original YOLOv5 model and the YOLOv5-ytiny model is shown in Figure 9, which plots the precision, recall, and F1-score of both models. The precision of YOLOv5-ytiny is 96.5%, 0.2% lower than the original model; its recall is 98.5%, 0.4% higher than the original YOLOv5; and its F1-score is 97.5%, 0.1% higher than the original model.
Next, the YOLOv5-ytiny model is compared with the original YOLOv5 model on several parameters. As shown in Table 2, the YOLOv5-ytiny model has 19,755,583 fewer total parameters than the original YOLOv5 model. Its mAP is 99.6%, consistent with the original YOLOv5 model, and its precision is 0.2% lower than the original model, which is not a significant drop. YOLOv5-ytiny's storage space is 3.04 MB, 10.66 MB smaller than the original YOLOv5, and its calculation time is 0.04 s, 60% faster than YOLOv5. The data in Table 2 show that the mAP of YOLOv5-ytiny is consistent with that of the original YOLOv5 and its precision decreases only slightly, while its calculation speed is greatly improved.
4.4. Comparison with Other Target Detection Models
In the field of target detection, SSD, YOLOv4, and YOLOv4-tiny also achieve high detection precision. To verify the effectiveness of the proposed method, the training set of this paper was used to train these three models and the YOLOv5 model, and the test set was used to evaluate the performance of the four algorithms, obtaining the precision, recall, and F1-score of each. The comparison results are shown in Figure 10.
We can see from Figure 10 that, among the five algorithms, the SSD algorithm has the highest precision, while the algorithm in this paper has the highest recall and F1-score. The F1-score is defined as the harmonic mean of precision and recall and is a standard measure for classification problems; in some machine learning competitions on multi-classification problems, it is used as the final evaluation metric. On this measure, YOLOv5-ytiny is superior.
Table 3 compares the comprehensive evaluation indicators of the five algorithms: precision, mAP, model storage space, and FPS.
Compared with the other four models, YOLOv5-ytiny detects faster. The mAP of the improved YOLOv5-ytiny model is the same as that of the original YOLOv5 model, and its model storage space is reduced by 78%. The precision of the SSD and YOLOv5 models is slightly higher than that of YOLOv5-ytiny, while the precision of YOLOv5-ytiny is 8.5% and 14.5% higher than that of YOLOv4 and YOLOv4-tiny, respectively. In terms of model storage space and detection speed, YOLOv5-ytiny has an absolute advantage; while improving detection speed, its mAP remains consistent with the original YOLOv5 model and its precision does not drop significantly.
In summary, compared with the other four models, YOLOv5-ytiny has a smaller model storage space and a faster detection speed: its detection speed exceeds that of YOLOv4, YOLOv4-tiny, SSD, and YOLOv5 by 22.33 f/s, 22.33 f/s, 22.08 f/s, and 13.56 f/s, respectively. With its high detection precision, fast detection speed, and small storage footprint, the aggregate detection and classification model YOLOv5-ytiny, based on the improved YOLOv5, proves to have good practicability.
4.5. Practical Application
The experiment was conducted in cooperation with Zhengzhou Sanhe Hydraulic Machinery Co., Ltd., and the method was applied to the preparation of concrete raw materials at a concrete batching plant. The model was the same as that used in this experiment, but the labels differed: the results were divided into four states, namely null (representing the unloaded state), the complete unloading state, melon stones, and stones12. The experimental results are shown in Figure 11a–d; the number on each image label is the recognition probability.
5. Discussion
Although tremendous progress has been made in the field of object detection recently, detecting and identifying objects accurately and quickly remains a difficult task. Yan et al. [30] named YOLOv5 the most powerful object detection algorithm of the present time. In the current study, the overall performance of YOLOv5 was better than that of YOLOv4 and YOLOv3. This finding is in line with previous research, as several studies have compared YOLOv5 to earlier versions of YOLO, such as YOLOv4 or YOLOv3. According to a study by Nepal [40], YOLOv5 is more accurate and faster than YOLOv4. When YOLOv5 was compared to YOLOv3 and YOLOv4 for robotic apple picking, the mAP increased by 14.95% and 4.74%, respectively [30]. Similar results and comparisons with other YOLO models were demonstrated in [32], which used YOLOv5 to detect the behavior of cage-reared laying ducks. The recall (73%) and precision (62%) of YOLOv5 were better than those of YOLOv3-tiny (57% and 45%, respectively) for ship detection in satellite remote sensing images [31]. In experiments on grape variety detection, YOLOv5 had higher F1-scores than YOLOv4-tiny [41]. In our experiment, YOLOv5 also showed better results than YOLOv4 and YOLOv4-tiny. On the other hand, various studies show that YOLO outperforms SSD among deep learning object detection methods. In traffic sign recognition, Zhu et al. [33] evaluated both on the same data set, and the results showed that the mAP of YOLOv5 was 7.56% higher than that of SSD, and YOLOv5 was also better than SSD in recognition speed. In addition, YOLOv5 was found to have better recognition accuracy than SSD when detecting strawberry ripeness [34]. In this experiment, YOLOv5 likewise shows better inference speed and mAP than SSD, and in many studies YOLOv5 outperforms SSD in terms of speed and accuracy [35]. Here, the YOLOv5-ytiny model based on improved YOLOv5 has advantages in both speed and mAP. Given the above, we believe that choosing to improve YOLOv5 for aggregate identification is a sound choice.
6. Conclusions
The aggregate detection and classification model YOLOv5-ytiny is based on an improved YOLOv5. To adapt to the complex environmental factors in the detection process, we trained on four types of aggregate under different light, different wet and dry states, and different detection distances to achieve real-time classification of aggregate. YOLOv5-ytiny uses CIoU as the bounding box regression loss function to improve regression precision. We modified the C3 structure of the YOLOv5 Backbone and, under the premise of maintaining the mean average precision and precision, reduced the number of detection heads to simplify the model, which decreases the amount of calculation and improves detection speed. The experiments show that the model storage space is reduced by 10.66 MB compared with YOLOv5, and the detection speed is 60% higher than that of the original YOLOv5 model.
Comparing the experimental results of the proposed YOLOv5-ytiny model with the SSD, YOLOv4, and YOLOv4-tiny object detection networks shows that the strategy proposed in this study can effectively improve detection precision. Meanwhile, the detection speed of 22.73 FPS enables the YOLOv5-ytiny model to be applied to real-time aggregate classification in industrial production.
In the experiment, if the original YOLOv5 model is used on a CPU-only computer, the model occupies a large amount of space and inference is slow. After compression, the proposed YOLOv5-ytiny occupies 3.04 MB and its inference speed reaches 22.73 FPS, which meets the practical requirements.