4.4.1. Precision–Recall Rate Experiment
We conducted a precision–recall experiment to verify the performance of the proposed method in two respects: first, the impact of data set size and number of object categories on the precision–recall rate and, second, the impact of the new algorithm, formed by combining the improved modules with YOLOv7, on the precision–recall rate. In this experiment, we first combined the three improved modules—namely, AF-FPN, T-encoder, and SIoU—with YOLOv7 to obtain the comparison models YOLOv7 + AF-FPN, YOLOv7 + T-encoder, and YOLOv7 + SIoU. Second, we trained and tested these three models, the overall improved model ATS-YOLOv7, and the baseline model YOLOv7 on six different types of aerial photography data sets, then assessed the experimental data to draw P–R curves, as shown in Figure 12.
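To make the construction of these curves concrete, the following is a minimal sketch (not the authors' actual evaluation code) of how P–R points can be computed from ranked detections; the detection scores, TP/FP flags, and ground-truth count are assumed inputs produced by matching predictions to ground truth at a fixed IoU threshold.

```python
import numpy as np

def precision_recall_curve(scores, is_tp, num_gt):
    """Compute P-R points from detections matched to ground truth.

    scores : confidence score of each detection
    is_tp  : 1 if the detection matched a ground-truth box (at a
             fixed IoU threshold), 0 otherwise
    num_gt : total number of ground-truth objects
    """
    order = np.argsort(-np.asarray(scores))   # rank by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)                    # TPs among top-k detections
    cum_fp = np.cumsum(1.0 - tp)              # FPs among top-k detections
    precision = cum_tp / (cum_tp + cum_fp)
    recall = cum_tp / num_gt
    return precision, recall

# Example: five detections, three ground-truth objects
p, r = precision_recall_curve(
    scores=[0.95, 0.90, 0.80, 0.60, 0.40],
    is_tp=[1, 1, 0, 1, 0],
    num_gt=3,
)
```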
From the P–R curves, it can be seen that the performance of the five detection models differed across the data sets. Overall, detection ability increased with data set size and the number of object types. The five detection models achieved the best precision–recall rates on the DIOR data set and the worst on the UCAS-AOD data set. Upon analysis, we found that the DIOR data set had the largest data size and number of object types of the six data sets involved in the experiment, with 23,463 images and 20 object types, while the UCAS-AOD data set had only 2420 images and 2 object types. These experimental results indicate that comprehensive aerial photography data are essential for the development of a detection model, as they enable the model to fully learn object features, capture key information, and adjust the training parameters, thereby improving the model's generalization and detection capabilities.
From the perspective of the improved modules combined with the baseline model, ATS-YOLOv7 performed best in terms of the precision–recall rate on all six data sets, followed by YOLOv7 + AF-FPN and YOLOv7 + T-encoder, while YOLOv7 + SIoU performed slightly worse and the original YOLOv7 performed the worst. It can be seen that, after integrating the three improved modules, YOLOv7 was improved in terms of feature information extraction, object localization, resistance to redundant interference, and handling of drastic scale changes, and the comprehensive performance of the model was greatly enhanced. Although YOLOv7 + AF-FPN, YOLOv7 + T-encoder, and YOLOv7 + SIoU improved over the baseline to different degrees, they cannot guarantee a comprehensive detection process and highly accurate detection results for the complex aerial images captured by UAVs. Therefore, in general, the ATS-YOLOv7 method proposed in this paper performed well on multi-category aerial image data sets, presenting good detection performance when processing multiple types of aerial objects.
4.4.2. Ablation Experiment
In order to further verify the effectiveness and rationality of each improved module when combined with YOLOv7, we conducted an ablation study on the DIOR data set. The baseline model was YOLOv7, and each progressively improved model was trained under the same experimental settings. The respective contributions of the modules are intuitively shown in Table 4, in which “√” indicates that the improved module was added, while a blank space indicates that the module was not selected.
During the experiment, we used a variety of evaluation metrics according to object scale and SIoU threshold, including APt, APs, APm, APl, mAP@0.5, and mAP@0.5:0.95. In particular, APt, APs, APm, and APl represent the average precision for tiny, small, medium, and large targets, respectively, while mAP@0.5 and mAP@0.5:0.95 represent the average value of AP over all object scales when the SIoU threshold was set at 0.5 and when the threshold was stepped from 0.5 to 0.95 with a step size of 0.05, respectively. These two types of mAP are increasingly difficult for each detection model, and they can be used as important indicators of the comprehensive detection capability of a model. For convenience of expression in the following, we denote YOLOv7 + AF-FPN by ①, YOLOv7 + T-encoder by ②, YOLOv7 + SIoU by ③, and the other combined models by ④, ⑤, ⑥, and ⑦ in turn.
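For reference, these metrics follow the usual COCO-style construction (here with the SIoU threshold in place of the plain IoU threshold); a compact formulation, in our own notation rather than that of the paper, is:

```latex
% Average precision for one class: area under the P–R curve
\mathrm{AP} = \int_{0}^{1} p(r)\,\mathrm{d}r
% Mean AP over C classes at threshold 0.5
\mathrm{mAP@0.5} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{AP}_{c}^{\,0.5}
% Averaged further over the ten thresholds 0.5, 0.55, ..., 0.95
\mathrm{mAP@0.5{:}0.95} = \frac{1}{10} \sum_{k=0}^{9} \mathrm{mAP@}(0.5 + 0.05k)
```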
From Table 4, the following observations can be made. First, compared to YOLOv7, the models with single improvement modules presented improvements in all six indicators, with ② having the highest increases in APt, APs, APm, and APl (13%, 11.8%, 9.7%, and 7.1% higher, respectively). This indicates that YOLOv7 becomes more sensitive to the features of tiny and small objects when facing drastic variations in object scale, and its detection ability is improved after adding the T-encoder prediction head. Although the parameter count increased by 3 M, this did not affect the real-time performance of the network. Second, among the models with two improved modules, the six indicators of ④ were the most improved over those of YOLOv7, being 7.2%, 14.9%, 9.2%, 10.2%, 7.4%, and 8.1% higher, respectively. Due to the new AF-FPN architecture in the neck of the model and the additional detection head, the parameter count increased by 4.3 M; however, the network reduced the information loss in the convolution process and strengthened the feature expression, which was conducive to the detection of complex object types. Finally, the baseline model YOLOv7 showed the largest difference when compared with the comprehensive improvement ⑦ (ATS-YOLOv7), being 13.7%, 15.7%, 11.2%, 10.3%, 8.2%, and 11.6% lower, respectively. From the above analysis, it can be seen that YOLOv7 could detect and recognize multi-scale objects and densely occluded objects well after adding the AF-FPN and T-encoder modules and optimizing the SIoU loss during training. Although ⑦ presented the largest increase in parameter quantity, it still met the real-time requirements according to the FPS standard. The various modules thus play coordinated roles in improving YOLOv7, verifying that the comprehensively improved YOLOv7 model proposed in this article is scientific and efficient.
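For context, the SIoU loss referenced here combines an IoU term with angle-aware distance and shape penalties. The following summary, as commonly formulated in the literature (Gevorgyan, 2022), uses our own notation and is a paraphrase rather than the exact formulation used in this paper:

```latex
% SIoU regression loss: IoU term plus distance (Delta) and shape (Omega) costs
L_{\mathrm{SIoU}} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2}
% Distance cost between box centers, modulated by the angle cost
% \Lambda via \gamma = 2 - \Lambda
\Delta = \sum_{t \in \{x,\, y\}} \left(1 - e^{-\gamma \rho_t}\right)
% Shape cost, comparing predicted and ground-truth width/height
\Omega = \sum_{t \in \{w,\, h\}} \left(1 - e^{-\omega_t}\right)^{\theta}
```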
Based on the above analysis, in order to highlight the advantages of ATS-YOLOv7 over YOLOv7 and, more specifically, to analyze the differences between the two models in terms of accuracy for various objects, we used confusion matrices to assess their detection results, as shown in Figure 13. The horizontal axis in the figure represents the ground-truth category labels, the vertical axis represents the predicted category labels, and each cell gives the corresponding classification accuracy. To aid visual inspection of the detection results, we abbreviated the name of each object in the data set, as detailed in Table 5. To save the detection time occupied by multiple similar objects, we selected 10 representative objects from the 20 object categories of DIOR for testing.
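As an illustration of how such a matrix can be produced, the following is a minimal sketch (not the authors' evaluation code) that builds a row-normalized confusion matrix from matched ground-truth/predicted class labels; the label arrays are assumed to come from pairing each detection with its ground-truth box:

```python
import numpy as np

def normalized_confusion_matrix(gt_labels, pred_labels, num_classes):
    """Row-normalized confusion matrix.

    gt_labels   : ground-truth class index of each matched detection
    pred_labels : predicted class index of each matched detection
    Each row (ground-truth class) sums to 1, so the diagonal gives
    a per-class classification accuracy analogous to Figure 13.
    """
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for g, p in zip(gt_labels, pred_labels):
        cm[g, p] += 1.0
    row_sums = cm.sum(axis=1, keepdims=True)
    return cm / np.maximum(row_sums, 1.0)   # avoid division by zero

# Example with 3 classes
cm = normalized_confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 2], 3)
```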
Overall, among the detected object categories, ATS-YOLOv7 had a higher classification accuracy than YOLOv7 for most objects, with a maximum accuracy difference of 12%, and the overall false detection rate of ATS-YOLOv7 was lower than that of YOLOv7. Specifically, the performance difference between the two models in terms of large object detection (e.g., ETS, GTF, and SM) was not obvious, with both showing high classification ability. For medium objects (BE and OS), ATS-YOLOv7 had 5% and 3% higher accuracy, respectively. For small objects (VE, CY, and AE), ATS-YOLOv7 had 5%, 3%, and 5% higher accuracy, respectively. The greatest differences occurred for tiny objects (ST and SP), where ATS-YOLOv7 had 12% and 6% higher accuracy than YOLOv7, respectively. These results indicate that YOLOv7 with three detection heads is not sensitive to tiny objects, while the ATS-YOLOv7 model with AF-FPN and the T-encoder benefits from improved attention in the feature layers and can fully mine the feature information of objects, allowing it to perform better on more difficult objects.
4.4.3. Comparison with the State-of-the-Art
In this section, we compare the performance of the proposed method with a variety of SOTA visual detection models, including one- and two-stage models, on the DIOR data set, in order to illustrate the advantages of the ATS-YOLOv7 algorithm for the current aerial image detection task. The chosen experimental models included RetinaNet [64], Scaled-YOLOv4 [65], YOLOv5 [22], TPH-YOLOv5 [66], HR-Cascade++ [67], O²DETR [68], DBAI-Net [69], YOLOv7 [23], and GLENet [70]. The specific test results for each model are given in Table 6, where each object is labeled according to its scale (large (l), medium (m), small (s), or tiny (t)) to display the differences in the detection results.
From the experimental results, it can be seen that the detection results of the various models for large- and medium-sized objects were generally higher than those for small and tiny objects, indicating that small and tiny objects remain extremely challenging in the field of object detection. Our proposed model, ATS-YOLOv7, presented the best overall performance among the compared models (mAP of 87%, with an F1 score generally higher than those of the other models), and it was especially superior for small and tiny objects. Specifically, all 10 models achieved high detection results for large- and medium-sized objects, such as stadiums, bridges, sports fields, and airports. Among them, GLENet obtained the best results (mAP of 90%), 26% higher than the worst-performing model, YOLOv5. The difference between ATS-YOLOv7 and the best-performing model was only 0.4%, indicating that the method proposed in this article also performs efficiently for large- and medium-sized objects. However, for small and tiny objects, such as cars, ships, and chimneys, ATS-YOLOv7 obtained the highest score, being 4.9% higher in accuracy than the second-best model, YOLOv7. These results demonstrate that the proposed AF-FPN module, T-encoder module, and SIoU loss efficiently enhance the detection of medium and large objects when built upon YOLOv7, and that the proposed model demonstrates superior performance on tiny and otherwise challenging objects in UAV aerial images.
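For clarity, the F1 score cited above is the standard harmonic mean of precision and recall:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R},
\qquad
P = \frac{TP}{TP + FP},
\quad
R = \frac{TP}{TP + FN}
```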
Based on the above experimental results, we visualized the detection results of four typical challenging objects when using the RetinaNet, Scaled-YOLOv4, YOLOv5, TPH-YOLOv5, HR-Cascade++, O²DETR, DBAI-Net, YOLOv7, GLENet, and ATS-YOLOv7 models, as shown in Figure 14.
Figure 14a shows the results of each model for multi-scale targets in a large scene. The objects in the picture are a ground track field, a basketball court, a tennis court, a golf course, and a car; their dimensions vary from large to small, and their positions are scattered. It can be seen from the figure that ATS-YOLOv7 had the fewest misdetected objects, while HR-Cascade++ had the most, and the remaining models showed varying degrees of misdetection. This can be explained by the adaptive feature enhancement network, AF-FPN, of ATS-YOLOv7, which mitigates the negative impact of drastic scale variation among multiple objects in large scenes, helping the detection performance of the network remain stable.
Figure 14b shows the results of each model for tiny-scale ship targets. As ATS-YOLOv7 adds a detection route based on the T-encoder module to form a four-head detection structure, the network has a strengthened ability to capture the contextual information of the object area, making it more sensitive to the characteristics of tiny objects. Therefore, its detection results were the best, while the other models missed a substantial number of objects.
Figure 14c shows the results of the various models for densely arranged aircraft targets. In terms of prediction, the detection rate of YOLOv7 was the lowest, while that of ATS-YOLOv7 was the highest; statistically, the detection result of ATS-YOLOv7 was about 10% better than that of YOLOv7. Therefore, after adding AF-FPN, the T-encoder, and the SIoU loss to YOLOv7, the regression positioning and feature extraction abilities of the model were greatly improved. In the face of densely arranged objects against a similar background, the model maintained high robustness and a high detection level.
Figure 14d shows the results of each model for storage tank targets against a fuzzy background. It can be seen that ATS-YOLOv7 detected and recognized storage tanks in fuzzy states better than the other models, as a result of its self-attention mechanism and adaptive feature strengthening mechanism. However, when a storage tank closely resembled the background, ATS-YOLOv7—like most of the other models—had trouble detecting it. In the future, we hope to further study TPH-YOLOv5 to improve our model architecture in this regard.
In general, for objects with a large scale, significant features, and a clear background, the networks could easily extract rich feature information and thus classify and locate the objects accurately. However, for complex objects with a tiny scale and a fuzzy background, the detection task becomes very difficult. Thus, a current aerial object detection model not only relies on training with a large volume of data but also requires comprehensive aerial data types and an advanced model architecture; on this basis, its generalization and detection abilities can be further improved.
Figure 15 shows some of the test results of our method on other aerial data sets. In general, ATS-YOLOv7 still presented high detection performance for challenging objects on these data sets. The few cases of missed detection were caused by similar backgrounds and small differences in the number of pixels, which we will need to address in future work.
Above, the detection accuracy of each model for objects in different scenes and of different types was analyzed in detail. Next, we carried out real-time experiments and compiled statistics on the FPS value obtained by each model, as the real-time performance of a model is an important criterion for evaluating its comprehensive performance. The experimental results are shown in Figure 16.
As the networks did not use the images in the DIOR validation set during training, we selected some images from the validation set for the experiments, in order to make the experimental results more consistent with a real scenario. It can be seen from the figure that YOLOv7, TPH-YOLOv5, and ATS-YOLOv7 were the fastest in terms of inference speed, reaching 110 fps, 105 fps, and 94 fps, respectively. This result indicates that these three models had the greatest speed advantage for the objects in the validation set. However, in terms of accuracy (see Table 6), ATS-YOLOv7 reached 87%, 5% higher than YOLOv7. As for the inference speed of ATS-YOLOv7, although it presented no advantage over YOLOv7 and TPH-YOLOv5, the difference between them was small. As the object detection task for UAV aerial images is an applied project, the inference speed and detection accuracy should be considered comprehensively when assessing the real-time nature of the model, seeking a balance between the two. From this comprehensive point of view, ATS-YOLOv7 offers both accuracy and speed advantages and is, therefore, the most applicable of these models.
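As an illustration of how such FPS figures are typically obtained, the following is a minimal sketch (under the assumption of a PyTorch-style model and a fixed input size; not the authors' benchmarking code) that times repeated forward passes after a warm-up phase:

```python
import time
import torch

def measure_fps(model, input_size=(1, 3, 640, 640), warmup=20, iters=200):
    """Estimate inference FPS by timing repeated forward passes."""
    device = next(model.parameters()).device
    dummy = torch.randn(*input_size, device=device)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):          # warm-up: stabilize caches/clocks
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()     # wait for queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(dummy)
        if device.type == "cuda":
            torch.cuda.synchronize()
        elapsed = time.perf_counter() - start
    return iters / elapsed               # frames per second

# Usage (hypothetical model object): fps = measure_fps(ats_yolov7_model)
```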