3.2. Model Evaluation
To evaluate the performance of the improved models, precision, recall, and F1-score were calculated. Precision assesses the prediction results: it indicates the proportion of detected instances that are real positive samples. Recall focuses on the real samples: it indicates the proportion of real positive samples that are detected correctly.
Precision can be represented by the number of positive samples correctly predicted divided by the total number of instances detected:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where True Positive ($TP$) is the number of detected samples that are real positive samples and False Positive ($FP$) is the number of detected samples that are not. Real positive samples can be represented as Ground Truth (GT).
Recall is calculated by dividing the number of correctly detected positive samples by the total number of real positive samples:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where False Negative ($FN$) is the number of real positive samples that are not detected.
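As a minimal illustration of these two metrics (not the evaluation code used in this study), the following Python sketch computes precision and recall from hypothetical TP/FP/FN counts:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from detection counts.

    tp: detections that match a ground-truth (GT) instance
    fp: detections with no matching GT instance
    fn: GT instances missed by the detector
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 80 correct detections, 20 false alarms, 10 missed objects
p, r = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.800, recall=0.889
```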
In general, the higher the precision, the lower the recall tends to be. To improve precision, so that as many detected instances as possible are true positive samples, the probability threshold at which the model predicts a sample as positive should be increased, that is, the confidence threshold should be raised. To improve recall, so that the model retrieves as many of the positive samples as possible, the confidence threshold should be lowered. Because of this trade-off, it is often necessary to choose between improving recall and improving precision when training a model, depending on the specific situation. Detections can be ranked from high to low by their predicted probability of being GT. As the confidence threshold is gradually lowered along this ranking, the precision and recall at the current threshold can be calculated at each step. The Precision-Recall curve (P-R curve) is obtained by taking recall as the x-axis and precision as the y-axis. The area beneath the P-R curve reflects the comprehensive performance of the model in both precision and recall.
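A simplified sketch of this threshold sweep is shown below; it assumes the IoU-based matching of detections to GT has already been done and is encoded in a hypothetical boolean array `is_tp`:

```python
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    """Trace a P-R curve by lowering the confidence threshold.

    scores: confidence of each detection
    is_tp:  whether each detection matches a GT instance (the IoU-based
            matching is assumed to be pre-computed here for brevity)
    num_gt: total number of GT instances
    """
    order = np.argsort(-scores)      # rank detections from high to low confidence
    tp = np.cumsum(is_tp[order])     # true positives accumulated so far
    fp = np.cumsum(~is_tp[order])    # false positives accumulated so far
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Area under the P-R curve, approximated as a sum over recall increments
    auc = float(np.sum(np.diff(recall, prepend=0.0) * precision))
    return precision, recall, auc
```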
During training, the model adjusts its hyperparameters by evaluating the validation set once every epoch. The COCO API was used for evaluation [39]. The COCO API uses Intersection over Union (IoU) to measure the overlap between the candidate bounding box predicted by the model and the GT in the object detection task. IoU is the ratio of their intersection to their union:

$$\text{IoU} = \frac{\text{area}(B_p \cap B_{gt})}{\text{area}(B_p \cup B_{gt})}$$

where $B_p$ is the predicted candidate bounding box and $B_{gt}$ is the GT bounding box.
The higher the IoU, the greater the overlap between the candidate bounding box and the GT. In the ideal case, they overlap completely and the ratio is 1.
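For illustration, a minimal IoU computation for two axis-aligned boxes, with hypothetical corner-coordinate inputs, might look as follows:

```python
def box_iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 100 x 100 boxes offset by half their width: IoU = 2500 / 17500 ≈ 0.143
print(box_iou((0, 0, 100, 100), (50, 50, 150, 150)))
```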
An object detection task can be divided into two subtasks, bounding box detection of the instance and pixel-level segmentation of the instance. The two tasks can be evaluated separately.
In the object detection task, the detection result can generally be considered good if the IoU ≥ 0.5. The P-R curves at IoU = 0.5 are shown in Figure 5. The BlendMask-VoV model has the best segmentation and localization performance on both the validation and test sets, followed by the CondInst-VoV model. Both models show improvements in detection performance compared with the baselines.
The P-R curve can qualitatively assess an object detection model. To quantitatively evaluate the comprehensive performance of the model, we considered the $F_\beta$-measure. The $F_\beta$-measure is a score indicator that evaluates binary classification models based on predictions about the positive class. It decides whether to focus on the precision or recall metric by using different weights $\beta$:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

A balanced F-score (F1-score) assesses the combined precision and recall performance in object detection to avoid omissions and over-predictions. The F1-score is the harmonic mean of precision and recall, a special case of $F_\beta$ when $\beta = 1$. A smaller $\beta$ gives a higher weight to precision when calculating the score, while a larger $\beta$ gives a higher weight to recall.
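A small sketch of the $F_\beta$ computation, with hypothetical precision and recall values, illustrates how $\beta$ shifts the weighting:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta score: beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.6                # hypothetical precision and recall
print(f_beta(p, r, beta=1.0))  # F1, the harmonic mean: ≈ 0.686
print(f_beta(p, r, beta=0.5))  # precision weighted more: 0.750
print(f_beta(p, r, beta=2.0))  # recall weighted more: ≈ 0.632
```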
In the experiment, the precision used to calculate the F1-score is the average precision over all recall values at each IoU threshold, and the recall is the average recall over all IoU thresholds.
When evaluating the validation set, the Average Precision (AP) is calculated at IoU thresholds in [0.5, 0.95] with a step size of 0.05. AP is obtained from the P-R curve: it is the mean of the precision values over all recall levels. AP is used to measure the detection ability of the model on the category of interest.
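Such an evaluation is typically run through pycocotools; the sketch below shows the standard COCOeval calls, with placeholder file names standing in for our annotation and result files:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: GT annotations and model detections in COCO JSON format
coco_gt = COCO("val_annotations.json")
coco_dt = coco_gt.loadRes("model_detections.json")

# iouType is "bbox" for the detection subtask, "segm" for the segmentation subtask
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP at IoU=0.50:0.95, 0.50, 0.75 and by object size
```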
Table 2 and Table 3 show the optimal results for each model. In the tables, mAP50 is the AP value at IoU = 0.50, mAP75 is the AP value at IoU = 0.75, and mAP represents the average AP over the 10 IoU thresholds. mAPS, mAPM, and mAPL are three variants of mAP calculated separately according to the size of the detected instances: the instance area for mAPS is not greater than 32 pixels × 32 pixels, the instance area for mAPM is between 32 pixels × 32 pixels and 96 pixels × 96 pixels, and the instance area for mAPL is greater than 96 pixels × 96 pixels.
For the evaluation results of the validation set in the bounding box detection task, the mAP and mAP50 of BlendMask-VoV are higher than those of BlendMask, and both are the best among all models. Its mAP75 is only slightly lower than that of BlendMask but still higher than those of the other models. Its F1-score of 0.665 is also the highest among all models. BlendMask-VoV performs best in the bounding box detection task on all sizes of objects, especially on small objects, on which it outperforms BlendMask significantly. Although the mAP75 of CondInst-VoV is not as high as that of CondInst, the mAP and mAP50 of CondInst-VoV in the bounding box detection task are higher, with an F1-score of 0.646, and its overall detection performance is also superior to CondInst. CondInst-VoV does not perform as well as CondInst on small and medium objects. However, it obtains better performance than the well-engineered Mask R-CNN in Detectron2, and it outperforms both CondInst and Mask R-CNN on large object detection.
In terms of the segmentation task on the validation set, BlendMask-VoV is also the best model for overall performance, with all evaluation indicators higher than those of the other models. The performance of BlendMask-VoV in small object segmentation is greatly improved compared to the baseline. BlendMask and CondInst have similar overall segmentation precision, and both are better than CondInst-VoV and Mask R-CNN, while the segmentation precision of CondInst-VoV is only slightly better than that of Mask R-CNN.
Table 4 and Table 5 show the evaluation indicators for the test set. Except that its positioning of small objects is not as good as that of CondInst, the overall positioning ability of BlendMask-VoV is still better than those of the other models, with an mAP of 63.066% and an F1-score of 0.666. CondInst-VoV has an mAP of 61.391% and an F1-score of 0.659, second only to BlendMask-VoV, and its mAP, mAP50, and mAPL for bounding box detection are all improved compared with CondInst. The overall positioning ability of CondInst is better than those of BlendMask and Mask R-CNN, and its positioning of small objects is especially accurate compared to the other models.
From high to low, the overall segmentation precision on the test set ranks as BlendMask-VoV, CondInst, BlendMask, CondInst-VoV, and Mask R-CNN. BlendMask-VoV has an mAP of 59.402% and an F1-score of 0.626. The segmentation evaluation indicators of BlendMask-VoV are better than those of the other models, except that its precision on small objects is inferior to that of CondInst and Mask R-CNN. Although CondInst-VoV has a higher mAP50 than CondInst, none of its other indicators are improved compared with CondInst.
It can be seen from the inference results of different models on the test set, shown in Figure 6, Figure 7, Figure 8 and Figure 9, that BlendMask-VoV and CondInst-VoV make more accurate predictions on various images than the baselines. Figure 6 and Figure 8 show the inference probabilities and boundaries of open-pit mines on fusion images from the Gaofen satellite. Both BlendMask-VoV and CondInst-VoV can accurately distinguish natural water bodies from mine pit water and are not easily confused by small objects whose texture is close to that of the mining area. The segmentation boundaries delineated by CondInst-VoV are close to those of BlendMask-VoV.
3.3. Case Study
Although the improved models performed well on the KOMMA dataset, the samples in the dataset were filtered. For open-pit mine detection, we pay more attention to the feasibility of application in actual remote sensing pre-survey scenarios.
In the case study, the remote sensing image of the entire Daye Town administrative region is clipped into regular tiles before being input into the models. Mine areas at the edges of clipped images will inevitably appear incomplete and may therefore be ignored by the models. To alleviate this problem, a sliding window with a size of 600 × 600 pixels and a step size of 300 pixels is used to cut the original remote sensing image, so that each open-pit mine area can be displayed entirely in at least one image as far as possible.
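A minimal sketch of this tiling scheme is given below; it is our own illustration, with NumPy array slicing standing in for the actual clipping tool:

```python
import numpy as np

def tile_offsets(length: int, size: int, step: int) -> list[int]:
    """Window start offsets along one axis, with a final window flush
    against the edge so no border pixels are skipped."""
    offsets = list(range(0, max(length - size, 0) + 1, step))
    if offsets[-1] + size < length:
        offsets.append(length - size)
    return offsets

def sliding_window_tiles(image: np.ndarray, size: int = 600, step: int = 300):
    """Yield (top, left, tile) views over an (H, W, ...) image array.

    With step = size // 2, a mine truncated at one tile edge is likely to
    appear complete in a neighbouring tile.
    """
    h, w = image.shape[:2]
    for top in tile_offsets(h, size, step):
        for left in tile_offsets(w, size, step):
            yield top, left, image[top:top + size, left:left + size]
```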
The clipped images were re-mosaicked after detection. For multiple overlapping instances within one image or across images, we take the union of the instance masks and the maximum prediction probability of the overlapping part. In the evaluation stage, the detection results were transformed into vector format and analyzed statistically and spatially in ArcMap 10.8.
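This merging rule can be sketched as follows; the instance representation (a full-image boolean mask plus a per-instance score) is our simplification, not the exact data structure used in the study:

```python
import numpy as np

def merge_instances(instances: list[tuple[np.ndarray, float]]):
    """Merge spatially overlapping instance predictions after re-mosaicking.

    instances: (mask, score) pairs in full-image coordinates, where mask is
    a boolean (H, W) array and score a per-instance prediction probability.
    Overlapping instances are merged by taking the union of their masks and
    the maximum of their scores. A single greedy pass is used here; a full
    implementation would repeat until no further merges occur.
    """
    merged: list[tuple[np.ndarray, float]] = []
    for mask, score in instances:
        for i, (m_mask, m_score) in enumerate(merged):
            if np.any(mask & m_mask):  # spatial overlap with an earlier instance
                merged[i] = (m_mask | mask, max(m_score, score))
                break
        else:
            merged.append((mask, score))
    return merged
```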
On remote sensing images, the texture of the open-pit mine area is often very diverse, making it difficult to define the spatial topological relationships between surrounding vegetation, mining buildings, roads, and open-pit mines. Unlike the scattered samples in the KOMMA dataset, practical open-pit mine scenes are often more complex. The detection result for Daye Town is therefore also evaluated, as shown in Table 6.
There are two cases in which the segmentation mask of an instance may be close to the GT while the positioning precision is low. One case is that different models judge the instance boundaries differently for a complex open-pit mine scene with multiple connected pits; even when such a scene is interpreted by humans, different interpreters disagree on the number of instances and their boundaries. The other case is that the texture or tone inside one object is inconsistent, so the model infers one object as two instances, both of which have a low IoU with the GT. Furthermore, an open-pit mine area segmented at the edge of an image may be detected with a low probability.
Given the above analysis, when evaluating the detection results of Daye Town, the confidence threshold was set at 0.1, and all instances that coincided spatially with the GT were counted. From Figure 10, Figure 11 and Figure 12, it can be seen that a high confidence threshold and IoU are not necessary for open-pit mine detection to obtain good results. Demanding a high IoU and confidence threshold increases the rate of missed detections, which is not conducive to practical investigation.
Table 6 shows that, for the positioning task, among the three types of images, the recall of all models in the true-color image is generally relatively high, and the precision in the Tianditu image is relatively high. The CondInst-VoV model has the best recall, with 90.789%, 86.184%, and 88.816% in the true-color, false-color, and Tianditu images, respectively. In an open-pit mine remote sensing pre-survey, the required accuracy of manually interpreted object verification, namely the bounding box recall, is generally greater than 80%. The CondInst-VoV model meets that requirement well. The recalls of the BlendMask-VoV and CondInst models are also high in all three types of images. The Mask R-CNN model has the highest precision, with 84.615% in the true-color image, 61.224% in the false-color image, and 97.917% in the Tianditu image, followed by the BlendMask and CondInst models. The precision of the BlendMask-VoV and CondInst-VoV models is relatively low.
For the segmentation task, the accuracy is evaluated based on the pixels of the whole study area, including both open-pit mines and non-open-pit mines. The proportion of non-open-pit-mine pixels is larger, accounting for 97.01% in the Gaofen image and 97.23% in the Tianditu image, which greatly influences the evaluation results. Among the three types of images, the Tianditu images often have the highest recall and precision.
The CondInst-VoV model has the highest recall in the Tianditu image and the true-color image, 92.351% and 74.628%, respectively. The BlendMask-VoV model has the highest recall of 75.796% in the false-color image. The BlendMask model has the highest precision of 57.718% in the true-color image. The CondInst-VoV model has the highest precision of 74.295% in the false-color image. The Mask R-CNN model has the highest precision of 75.922% in the Tianditu image.
Each detection model has its advantages and disadvantages depending on the task and image type. Nevertheless, the models generally perform best in the Tianditu image, in which the BlendMask and Mask R-CNN models have the highest F1-scores in the positioning and segmentation tasks, respectively, and both show good comprehensive performance. However, the high precision of the Mask R-CNN model comes from conservative prediction, while the CondInst-VoV model can find more objects.