4.1. Experimental Setup
4.1.1. Experimental Platform
The experimental platform comprises an AMD R7-5800H CPU and an NVIDIA GeForce RTX 2060 GPU (6 GB). The development environment is Python, and the detection model is built with the PyTorch framework. The model is optimized with stochastic gradient descent. The number of epochs is set to 200 to ensure that the model can fully learn the features in the dataset. The input image size for YOLOv5 model training is set to 640 × 640. The batch size, i.e., the number of samples used in each training step, is set to 32. The learning rate is set to 0.01, the regularization factor to 0.0005, and the number of iterations to 500.
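The setup above can be summarized as a small configuration sketch. This is illustrative only, not the authors' actual training code, and the 5,000-image dataset size in the example is a hypothetical figure:

```python
import math

# Training configuration from the experimental setup above.
config = {
    "img_size": 640,         # YOLOv5 input resolution
    "batch_size": 32,        # samples per training step
    "lr": 0.01,              # initial learning rate for SGD
    "weight_decay": 0.0005,  # regularization factor
    "epochs": 200,           # full passes over the dataset
}

def steps_per_epoch(num_images: int, batch_size: int) -> int:
    """Number of SGD updates in one pass over the dataset."""
    return math.ceil(num_images / batch_size)

# e.g. with a hypothetical 5,000-image training split:
print(steps_per_epoch(5000, config["batch_size"]))  # 157
```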
4.1.2. Dataset
Choosing a suitable dataset is crucial for image segmentation experiments. First, a good dataset should be representative and diverse, able to cover a variety of situations such as different scenes, different lighting conditions, and different object morphologies, thus ensuring the generalization performance of the algorithm. Second, the quality of the annotation of the dataset is also very important; the annotation should be accurate, detailed, and consistent, and different types of annotations may be required for different tasks. Finally, the dataset size should also be large enough to train deep learning models and perform adequate validation and comparison. Therefore, choosing the right dataset can improve the reliability and generalization of experimental results.
In this paper, we choose COCO and PASCAL-VOC datasets to conduct ablation experiments and comparison experiments on the models. Choosing two datasets for experiments can help evaluate image segmentation algorithms’ performance and generalization ability on different datasets. Comparing the two datasets can reveal their differences.
COCO: The COCO dataset contains over 80 common object classes, such as people, animals, vehicles, and furniture. The main features of the COCO dataset are diversity and complexity, containing a large number of images and multiple object instances covering a wide range of scales, poses, and occlusions.
PASCAL-VOC: The PASCAL-VOC dataset mainly contains 20 common object classes, such as people, cars, airplanes, and animals. Each image is accurately annotated with object bounding boxes; for some images, there is also a pixel-level semantic segmentation annotation of the objects.
4.1.3. Mean Average Precision
In this paper, Average Precision (AP) and mean Average Precision (mAP) are used for evaluation and comparison [34]. The mAP measures the accuracy of target detection and represents the performance of the algorithm in terms of detection accuracy over the entire dataset, as follows:

$$\mathrm{mAP} = \frac{1}{m} \sum_{i=1}^{m} AP_i$$

where $m$ is the total number of object categories in the dataset and $AP_i$ is the average precision of category $i$.
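The mAP computation described above, averaging the per-class AP values over all m categories, can be sketched as follows; the AP values in the example are illustrative, not results from the paper:

```python
def mean_average_precision(ap_per_class):
    """mAP = (1/m) * sum of AP_i over the m object categories."""
    m = len(ap_per_class)
    return sum(ap_per_class) / m

# Hypothetical per-class AP values for a three-class detector:
aps = [0.80, 0.70, 0.90]
print(round(mean_average_precision(aps), 4))  # 0.8
```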
4.2. Ablation Experiments
4.2.1. Ablation Experiment of Feature Fusion
For the feature extraction part of YOLOv5s, a feature pyramid structure with multi-level feature fusion is established to enhance the interaction between shallow localization information and deep semantic information, making multi-scale feature fusion more adequate. To investigate the effect of this structure on multi-scale feature fusion, the effectiveness of the improved network structure is illustrated in terms of loss and mean average precision [35].
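The core idea of fusing shallow and deep levels, upsampling a low-resolution semantic map to the resolution of a shallow localization map and concatenating them, can be sketched with a toy example. The shapes below are illustrative and not the actual layer sizes of the paper's network:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling, (C, H, W) -> (C, 2H, 2W)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse(shallow, deep):
    """Concatenate shallow features with upsampled deep features
    along the channel axis, so both information sources interact."""
    return np.concatenate([shallow, upsample2x(deep)], axis=0)

shallow = np.zeros((64, 80, 80))   # high-resolution, low-level features
deep = np.zeros((128, 40, 40))     # low-resolution, high-level features
print(fuse(shallow, deep).shape)   # (192, 80, 80)
```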
- (1) Loss comparison
As shown in
Figure 5, YOLOv5s denotes the loss of the YOLOv5s algorithm with the original feature pyramid, and M-YOLOv5s denotes the loss of the YOLOv5s algorithm with the multi-level feature fusion pyramid structure.
At the same number of rounds, the loss of the M-YOLOv5s algorithm, which uses the multi-level feature fusion pyramid structure, is 0.19% lower than that of the YOLOv5s algorithm. The overall curves show that the loss of M-YOLOv5s is less perturbed than that of YOLOv5s at around round 50, decreases more rapidly in the early training period, and follows a flatter overall trend. A lower loss curve during training means that the algorithm predicts the segmentation results of the images more accurately, demonstrating the generalization performance and effectiveness of the M-YOLOv5s algorithm on the test set.
- (2) Accuracy comparison
As shown in
Table 2, the M-YOLOv5s algorithm fully combines shallow and deep features to improve the quality of multi-scale feature fusion. In terms of detection accuracy, M-YOLOv5s improves by 1.58%, which shows that fusion between multi-level features promotes information flow between the shallow and deep feature layers and demonstrates the effectiveness of the multi-level feature fusion pyramid structure in improving network performance. The M-YOLOv5s algorithm can accurately segment the different objects and regions in an image.
4.2.2. Ablation Experiment of Dilated Convolution
After introducing the feature pyramid structure with multi-level feature fusion, network accuracy improves to some extent, but this structure gains accuracy at the expense of feature map size.
In general, the loss of feature map size leads to an imbalance between the semantic information of deep features and the location information of shallow features; the interaction of their feature information is limited, so some objects are missed. Therefore, a dilated convolution module is added to the shallow feature layer of the multi-level feature fusion pyramid structure. This prevents the loss of image detail as the network deepens, allows the deep feature layer to expand its receptive field without losing image size, and makes it possible to verify the effect on loss and accuracy of balanced feature information interaction.
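The property exploited here, that dilation enlarges the receptive field while appropriate padding preserves the feature map size, follows from the standard convolution output-size formula. A minimal arithmetic sketch (the 80 × 80 input size is illustrative):

```python
def conv_out_size(n, k, stride=1, pad=0, dilation=1):
    """Standard convolution output-size formula for one spatial dim."""
    return (n + 2 * pad - dilation * (k - 1) - 1) // stride + 1

def receptive_field(k, dilation):
    """Effective kernel extent of a single dilated convolution."""
    return dilation * (k - 1) + 1

k = 3
for d in (1, 2, 4):
    pad = d * (k - 1) // 2  # "same" padding for an odd kernel
    # Output size stays 80 while the receptive field grows: 3, 5, 9.
    print(conv_out_size(80, k, pad=pad, dilation=d), receptive_field(k, d))
```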
- (1) Loss comparison
As shown in
Figure 6, M-YOLOv5s indicates the loss of the algorithm without the dilated convolution module, and DM-YOLOv5s indicates the loss of the algorithm with the dilated convolution module added to the shallow features.
The loss of the DM-YOLOv5s algorithm is reduced by 0.69% over the same number of rounds. The two curves largely coincide overall, but the perturbation of the DM-YOLOv5s loss around round 50 is flatter. A flatter loss curve means the loss function decreases more slowly during training, yet the model performs more stably on both the training and test data, with no overfitting or underfitting; such behavior usually indicates that the model is close to its optimal state and can perform well on new data. The perturbation at round 50 results from the freeze-training method chosen in this paper, which unfreezes the original weights after the first 50 rounds and trains them jointly, so some loss perturbation occurs. The experimental results show, however, that this perturbation can be mitigated by optimizing the structure, achieving a better training effect.
- (2) Accuracy comparison
As shown in
Table 3, the accuracy is improved by 2.21% when dilated convolution is used, indicating that the network performance of M-YOLOv5s reaches its optimum with this addition. After adding dilated convolution, the information interaction between shallow and deep features becomes balanced and stable, the feature flow between scales is sufficient, and overall network performance is close to saturation.
4.2.3. Ablation Experiment of Lightweight Model
To evaluate the improved part of this paper more comprehensively, ablation experiments were conducted on the MDM-YOLOv5s network, and the results are shown in
Table 4.
In this case, the comprehensive performance depends not only on the accuracy of the model but also on its size. Both metrics need to be weighed against the computational and storage resources available in practical application scenarios. The accuracies of the two models, MDM-YOLOv5s and DM-YOLOv5s, are relatively close. If the models are to be used in resource-constrained environments, model size becomes an important metric, in which case MDM-YOLOv5s is more advantageous.
Beyond model size, processing speed is a pivotal performance metric. We conducted a comparative analysis of the time required by the DM-YOLOv5s and MDM-YOLOv5s models to process a single image. The DM-YOLOv5s model typically requires 0.04 to 0.06 s per image, whereas the average processing time of the MDM-YOLOv5s model is reduced to 0.035 s. This shows that, while sustaining high performance, our enhancement approach further increases image processing speed, making the model more suitable for real-time application scenarios.
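Per-image processing time of this kind is typically measured by timing repeated forward passes and averaging. A hedged sketch of such a measurement; `model` here is a stand-in callable, not the actual MDM-YOLOv5s network, and on a GPU one would additionally synchronize the device before reading the clock:

```python
import time

def mean_latency(model, images, warmup=3):
    """Average seconds per image, after a few warm-up runs
    (warm-up excludes one-time costs such as lazy initialization)."""
    for img in images[:warmup]:
        model(img)
    start = time.perf_counter()
    for img in images:
        model(img)
    return (time.perf_counter() - start) / len(images)

# Usage with a trivial placeholder "model":
latency = mean_latency(lambda img: img, list(range(100)))
print(f"{latency:.6f} s/image")
```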
4.3. Comparison Experiment
This section compares the MDM-YOLOv5s network with other image segmentation algorithms, and from
Table 5, it can be seen that the improved algorithm in this paper achieves higher accuracy than the other algorithms.
From the experimental results in the above table, it can be seen that the MDM-YOLOv5s algorithm is more accurate for image segmentation than U-Net, SegNet, and Mask R-CNN, and the accuracy curves of each algorithm model are shown in
Figure 7.
The experimental results in the above table show that the MDM-YOLOv5s model is the smallest, being 116 MB, 52 MB, and 76 MB smaller than U-Net, SegNet, and Mask R-CNN, respectively. This means that the MDM-YOLOv5s model requires less storage space and fewer computational resources to accomplish the same task, which is valuable for practical applications because it allows the model to run more efficiently on embedded devices or in constrained environments. MDM-YOLOv5s employs techniques such as multi-scale fusion, dilated convolution, and model lightweighting to effectively compress the model structure and the number of parameters while maintaining accuracy. In contrast, U-Net, SegNet, and Mask R-CNN usually require more parameters and computational resources to achieve comparable accuracy and robustness. The small size of the MDM-YOLOv5s model is therefore a great advantage, especially in scenarios with limited computational resources and storage space.
The latency of a target segmentation model refers to the time required to go from the input image to the output target bounding box. Comparing the latencies of target segmentation models helps evaluate the usability and applicability of different models in real-world scenarios, especially in real-time applications; it also reveals which parts of a model are more time-consuming, so those parts can be optimized to improve the model's efficiency and speed. Latency comparisons further help estimate the hardware resources required, including CPU, GPU, and memory, to ensure that the model can run well on the selected hardware. Therefore, this paper compares the latency of MDM-YOLOv5s, U-Net, SegNet, and Mask R-CNN, as shown in
Figure 8.
The MDM-YOLOv5s model is compared with U-Net, SegNet, and Mask R-CNN, and it is found that the latency of MDM-YOLOv5s is lower than the other three algorithms in image segmentation. This means that in practical applications, the MDM-YOLOv5s model has higher real-time performance and responsiveness and can detect and segment the target of the input image faster.
As a result, the MDM-YOLOv5s model can complete the inference task in a shorter time, speeding up image segmentation and improving processing efficiency. It occupies less storage space, reducing storage cost and making operation and deployment more convenient. At the same time, the model can be applied to resource-constrained scenarios such as mobile devices and embedded systems, extending the application scope of image segmentation technology. In addition, the MDM-YOLOv5s model has a simpler structure and fewer parameters, so its hardware requirements are low, and model training may be more stable and reliable. The model segmentation effect is shown in
Figure 9.
Comparing the MDM-YOLOv5s model with U-Net, SegNet, and Mask R-CNN, it is found that the accuracy of MDM-YOLOv5s is higher than the other three algorithms in image segmentation. The high accuracy of the MDM-YOLOv5s model can be attributed to the following aspects:
Feature fusion technique: The MDM-YOLOv5s model uses the feature fusion technique to fuse features at different levels, improving the model's understanding of images and its segmentation accuracy;
Dilated convolution technique: The MDM-YOLOv5s model adopts the dilated convolution technique, which effectively expands the receptive field and improves the model's ability to capture image details, thus improving segmentation accuracy;
MobileViT technology: The MDM-YOLOv5s model also adopts MobileViT technology, which effectively reduces the model's parameters and computational cost, improving its operation speed and efficiency;
YOLOv5s structure: The MDM-YOLOv5s model is improved on the basis of the YOLOv5s structure; YOLOv5s itself is an efficient target detection algorithm with a simple structure, low computational cost, and fast speed, and these advantages also underpin the high accuracy of the MDM-YOLOv5s model.
We also compare MDM-YOLOv5s with TransUNet, an excellent image segmentation model introduced in recent years [36]; the comparison covers mean average precision (mAP), model size, and single-image processing time. Experimental results show that the MDM-YOLOv5s algorithm outperforms TransUNet in accuracy by 1.34 percentage points. The model size of MDM-YOLOv5s is also significantly smaller than that of TransUNet, and its single-image processing time is reduced by 0.135 s, indicating that MDM-YOLOv5s is more efficient in terms of processing speed.
In summary, the MDM-YOLOv5s model uses a variety of advanced techniques and combines the advantages of YOLOv5s, thus achieving high accuracy in image segmentation tasks. Compared with U-Net, SegNet, Mask R-CNN, and TransUNet, the MDM-YOLOv5s model is optimized in feature fusion, dilated convolution, the MobileViT technique, and overall structure, and thus achieves higher accuracy in image segmentation.