3.2. Model Evaluation
To evaluate the performance of the improved models, precision, recall, and F1-score were calculated. Precision assesses the prediction results: it indicates the proportion of detected instances that are real positive samples. Recall focuses on the real samples: it indicates the proportion of real positive samples that are detected correctly.
Precision can be represented by the number of positive samples correctly predicted divided by the total number of instances detected:

$$\text{Precision} = \frac{TP}{TP + FP}$$

where True Positive ($TP$) is the number of detected samples that are real positive samples and False Positive ($FP$) is the number of detected samples that are not. Real positive samples can be represented as Ground Truth (GT).
Recall is calculated by dividing the number of correctly detected positive samples by the total number of real positive samples:

$$\text{Recall} = \frac{TP}{TP + FN}$$

where False Negative ($FN$) is the number of real positive samples that are not detected.
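As a minimal illustration of these two metrics (not the evaluation code used in this study), the following Python sketch computes precision and recall from hypothetical TP/FP/FN counts:

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Compute precision and recall from detection counts.

    tp: detections that match a ground-truth (GT) instance
    fp: detections with no matching GT instance
    fn: GT instances missed by the detector
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall

# Hypothetical counts: 80 correct detections, 20 false alarms, 10 missed objects
p, r = precision_recall(tp=80, fp=20, fn=10)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.800, recall=0.889
```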
In general, the higher the precision, the lower the recall tends to be. To improve precision, so that as many detected instances as possible are true positive samples, the probability threshold at which the model predicts a sample as positive should be increased, that is, the confidence threshold should be raised. To improve recall, so that the model retrieves as many of the positive samples as possible, the confidence threshold should be lowered. Because of this trade-off, it is often necessary to choose between improving recall and improving precision when training a model, depending on the specific situation. Detections can be ranked from high to low by their predicted probability of being GT. As the confidence threshold is gradually lowered along this ranking, the precision and recall at the current threshold can be calculated at each step. The Precision-Recall curve (P-R curve) is obtained by taking recall as the x-axis and precision as the y-axis. The area beneath the P-R curve reflects the comprehensive performance of the model in both precision and recall.
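A simplified sketch of this threshold sweep is shown below; it assumes the IoU-based matching of detections to GT has already been done and is encoded in a hypothetical boolean array `is_tp`:

```python
import numpy as np

def pr_curve(scores: np.ndarray, is_tp: np.ndarray, num_gt: int):
    """Trace a P-R curve by lowering the confidence threshold.

    scores: confidence of each detection
    is_tp:  whether each detection matches a GT instance (the IoU-based
            matching is assumed to be pre-computed here for brevity)
    num_gt: total number of GT instances
    """
    order = np.argsort(-scores)      # rank detections from high to low confidence
    tp = np.cumsum(is_tp[order])     # true positives accumulated so far
    fp = np.cumsum(~is_tp[order])    # false positives accumulated so far
    precision = tp / (tp + fp)
    recall = tp / num_gt
    # Area under the P-R curve, approximated as a sum over recall increments
    auc = float(np.sum(np.diff(recall, prepend=0.0) * precision))
    return precision, recall, auc
```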
During training, the model adjusts its hyperparameters by evaluating the validation set once every epoch. The COCO API was used for evaluation [39]. The COCO API uses Intersection over Union (IoU) to measure the overlap between the candidate bounding box predicted by the model and the GT in the object detection task. IoU is the ratio of their intersection to their union:

$$\text{IoU} = \frac{\text{area}(B_p \cap B_{gt})}{\text{area}(B_p \cup B_{gt})}$$

where $B_p$ is the predicted candidate bounding box and $B_{gt}$ is the GT bounding box.
The higher the IoU, the greater the overlap between the candidate bounding box and the GT. In the ideal case, they overlap completely and the ratio is 1.
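For illustration, a minimal IoU computation for two axis-aligned boxes, with hypothetical corner-coordinate inputs, might look as follows:

```python
def box_iou(box_a: tuple, box_b: tuple) -> float:
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 100 x 100 boxes offset by half their width: IoU = 2500 / 17500 ≈ 0.143
print(box_iou((0, 0, 100, 100), (50, 50, 150, 150)))
```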
An object detection task can be divided into two subtasks, bounding box detection of the instance and pixel-level segmentation of the instance. The two tasks can be evaluated separately.
In the object detection task, the detection result can generally be considered good if the IoU ≥ 0.5. The P-R curves at IoU = 0.5 are shown in Figure 5. The BlendMask-VoV model has the best segmentation and localization performance on both the validation and test sets, followed by the CondInst-VoV model. Both models show improvements in detection performance compared with the baselines.
The P-R curve can qualitatively assess an object detection model. To quantitatively evaluate the comprehensive performance of the model, we considered the $F_\beta$-measure. The $F_\beta$-measure is a score indicator that evaluates binary classification models based on predictions about the positive class. It decides whether to focus on the precision or recall metric by using different weights $\beta$:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$

A balanced F-score (F1-score) assesses the combined precision and recall performance in object detection to avoid omissions and over-predictions. The F1-score is the harmonic mean of precision and recall, a special case of $F_\beta$ when $\beta = 1$. A smaller $\beta$ gives a higher weight to precision when calculating the score, while a larger $\beta$ gives a higher weight to recall.
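A small sketch of the $F_\beta$ computation, with hypothetical precision and recall values, illustrates how $\beta$ shifts the weighting:

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F_beta score: beta < 1 favours precision, beta > 1 favours recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.8, 0.6                # hypothetical precision and recall
print(f_beta(p, r, beta=1.0))  # F1, the harmonic mean: ≈ 0.686
print(f_beta(p, r, beta=0.5))  # precision weighted more: 0.750
print(f_beta(p, r, beta=2.0))  # recall weighted more: ≈ 0.632
```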
In the experiment, the precision used to calculate the F1-score is the average precision over all recall values at each IoU threshold, and the recall is the average recall over all IoU thresholds.
When evaluating the validation set, the Average Precision (AP) is calculated at IoU thresholds in [0.5, 0.95] with a step size of 0.05. AP is obtained from the P-R curve: it is the mean of the precision values over all recall levels. AP is used to measure the detection ability of the model on the category of interest.
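Such an evaluation is typically run through pycocotools; the sketch below shows the standard COCOeval calls, with placeholder file names standing in for our annotation and result files:

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder paths: GT annotations and model detections in COCO JSON format
coco_gt = COCO("val_annotations.json")
coco_dt = coco_gt.loadRes("model_detections.json")

# iouType is "bbox" for the detection subtask, "segm" for the segmentation subtask
evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints AP at IoU=0.50:0.95, 0.50, 0.75 and by object size
```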
Table 2 and Table 3 show the optimal results for each model. In the tables, mAP50 is the AP value at IoU = 0.50, mAP75 is the AP value at IoU = 0.75, and mAP represents the average AP over the 10 IoU thresholds. mAPS, mAPM, and mAPL are three variants of mAP calculated separately according to the size of the detected instances: the instance area for mAPS is not greater than 32 pixels × 32 pixels, the instance area for mAPM is between 32 pixels × 32 pixels and 96 pixels × 96 pixels, and the instance area for mAPL is greater than 96 pixels × 96 pixels.
For the evaluation results of the validation set in the bounding box detection task, the mAP and mAP50 of BlendMask-VoV are higher than those of BlendMask, and both are the best among all models. Its mAP75 is only slightly lower than that of BlendMask but still higher than those of the other models. Its F1-score of 0.665 is also the highest among all models. BlendMask-VoV performs best in the bounding box detection task on all sizes of objects, especially on small objects, on which it outperforms BlendMask significantly. Although the mAP75 of CondInst-VoV is not as high as that of CondInst, the mAP and mAP50 of CondInst-VoV in the bounding box detection task are higher, with an F1-score of 0.646, and its overall detection performance is also superior to CondInst. CondInst-VoV does not perform as well as CondInst on small and medium objects. However, it obtains better performance than the well-engineered Mask R-CNN in Detectron2, and it outperforms both CondInst and Mask R-CNN on large object detection.
In terms of the segmentation task on the validation set, BlendMask-VoV is also the best model for overall performance, with all evaluation indicators higher than those of the other models. The performance of BlendMask-VoV in small object segmentation is greatly improved compared to the baseline. BlendMask and CondInst have similar overall segmentation precision, and both are better than CondInst-VoV and Mask R-CNN, while the segmentation precision of CondInst-VoV is only slightly better than that of Mask R-CNN.
Table 4 and Table 5 show the evaluation indicators for the test set. Except that its positioning of small objects is not as good as that of CondInst, the overall positioning ability of BlendMask-VoV is still better than those of the other models, with an mAP of 63.066% and an F1-score of 0.666. CondInst-VoV has an mAP of 61.391% and an F1-score of 0.659, second only to BlendMask-VoV, and its mAP, mAP50, and mAPL for bounding box detection are all improved compared with CondInst. The overall positioning ability of CondInst is better than those of BlendMask and Mask R-CNN, and its positioning of small objects is especially accurate compared to the other models.
From high to low, the overall segmentation precision on the test set ranks as BlendMask-VoV, CondInst, BlendMask, CondInst-VoV, and Mask R-CNN. BlendMask-VoV has an mAP of 59.402% and an F1-score of 0.626. The segmentation evaluation indicators of BlendMask-VoV are better than those of the other models, except that its precision on small objects is inferior to that of CondInst and Mask R-CNN. Although CondInst-VoV has a higher mAP50 than CondInst, none of its other indicators are improved compared with CondInst.
It can be seen from the inference results of different models on the test set, shown in Figure 6, Figure 7, Figure 8 and Figure 9, that BlendMask-VoV and CondInst-VoV make more accurate predictions on various images than the baselines. Figure 6 and Figure 8 show the inference probabilities and boundaries of open-pit mines on fusion images from the Gaofen satellite. Both BlendMask-VoV and CondInst-VoV can accurately distinguish natural water bodies from mine pit water and are not easily confused by small objects whose texture is close to that of the mining area. The segmentation boundaries delineated by CondInst-VoV are close to those of BlendMask-VoV.
3.3. Case Study
Although the improved models performed well on the KOMMA dataset, the samples in the dataset were filtered. For open-pit mine detection, we pay more attention to the feasibility of application in actual remote sensing pre-survey scenarios.
In the case study, the remote sensing image of the entire Daye Town administrative region is clipped into regular tiles before being input into the models. Mine areas at the edges of clipped images will inevitably appear incomplete and may therefore be ignored by the models. To alleviate this problem, a sliding window with a size of 600 × 600 pixels and a step size of 300 pixels is used to cut the original remote sensing image, so that each open-pit mine area can be displayed entirely in at least one image as far as possible.
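A minimal sketch of this tiling scheme is given below; it is our own illustration, with NumPy array slicing standing in for the actual clipping tool:

```python
import numpy as np

def tile_offsets(length: int, size: int, step: int) -> list[int]:
    """Window start offsets along one axis, with a final window flush
    against the edge so no border pixels are skipped."""
    offsets = list(range(0, max(length - size, 0) + 1, step))
    if offsets[-1] + size < length:
        offsets.append(length - size)
    return offsets

def sliding_window_tiles(image: np.ndarray, size: int = 600, step: int = 300):
    """Yield (top, left, tile) views over an (H, W, ...) image array.

    With step = size // 2, a mine truncated at one tile edge is likely to
    appear complete in a neighbouring tile.
    """
    h, w = image.shape[:2]
    for top in tile_offsets(h, size, step):
        for left in tile_offsets(w, size, step):
            yield top, left, image[top:top + size, left:left + size]
```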
The clipped images were re-mosaicked after detection. For multiple overlapping instances within one image or across images, we take the union of the instance masks and the maximum prediction probability of the overlapping part. In the evaluation stage, the detection results were transformed into vector format and analyzed statistically and spatially in ArcMap 10.8.
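This merging rule can be sketched as follows; the instance representation (a full-image boolean mask plus a per-instance score) is our simplification, not the exact data structure used in the study:

```python
import numpy as np

def merge_instances(instances: list[tuple[np.ndarray, float]]):
    """Merge spatially overlapping instance predictions after re-mosaicking.

    instances: (mask, score) pairs in full-image coordinates, where mask is
    a boolean (H, W) array and score a per-instance prediction probability.
    Overlapping instances are merged by taking the union of their masks and
    the maximum of their scores. A single greedy pass is used here; a full
    implementation would repeat until no further merges occur.
    """
    merged: list[tuple[np.ndarray, float]] = []
    for mask, score in instances:
        for i, (m_mask, m_score) in enumerate(merged):
            if np.any(mask & m_mask):  # spatial overlap with an earlier instance
                merged[i] = (m_mask | mask, max(m_score, score))
                break
        else:
            merged.append((mask, score))
    return merged
```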
On remote sensing images, the texture of the open-pit mine area is often very diverse, making it difficult to define the spatial topological relationships between surrounding vegetation, mining buildings, roads, and open-pit mines. Unlike the scattered samples in the KOMMA dataset, practical open-pit mine scenes are often more complex. The detection result for Daye Town is therefore also evaluated, as shown in Table 6.
There are two cases in which the segmentation mask of an instance may be close to the GT while the positioning precision is low. One case is that different models judge the instance boundaries differently for a complex open-pit mine scene with multiple connected pits; even when such a scene is interpreted by humans, different interpreters disagree on the number of instances and their boundaries. The other case is that the texture or tone inside one object is inconsistent, so the model infers one object as two instances, both of which have a low IoU with the GT. Furthermore, an open-pit mine area segmented at the edge of an image may be detected with a low probability.
Given the above analysis, when evaluating the detection results of Daye Town, the confidence threshold was set at 0.1, and all instances that coincided spatially with the GT were counted. From Figure 10, Figure 11 and Figure 12, it can be seen that a high confidence threshold and IoU are not necessary for open-pit mine detection to obtain good results. Demanding a high IoU and confidence threshold increases the rate of missed detections, which is not conducive to practical investigation.
Table 6 shows that, for the positioning task, among the three types of images, the recall of all models in the true-color image is generally relatively high, and the precision in the Tianditu image is relatively high. The CondInst-VoV model has the best recall, with 90.789%, 86.184%, and 88.816% in the true-color, false-color, and Tianditu images, respectively. In an open-pit mine remote sensing pre-survey, the required accuracy of manually interpreted object verification, namely the bounding box recall, is generally greater than 80%. The CondInst-VoV model meets that requirement well. The recalls of the BlendMask-VoV and CondInst models are also high in all three types of images. The Mask R-CNN model has the highest precision, with 84.615% in the true-color image, 61.224% in the false-color image, and 97.917% in the Tianditu image, followed by the BlendMask and CondInst models. The precision of the BlendMask-VoV and CondInst-VoV models is relatively low.
For the segmentation task, the accuracy is evaluated based on the pixels of the whole study area, including both open-pit mines and non-open-pit mines. The proportion of non-open-pit-mine pixels is larger, accounting for 97.01% in the Gaofen image and 97.23% in the Tianditu image, which greatly influences the evaluation results. Among the three types of images, the Tianditu images often have the highest recall and precision.
The CondInst-VoV model has the highest recall in the Tianditu image and the true-color image, 92.351% and 74.628%, respectively. The BlendMask-VoV model has the highest recall of 75.796% in the false-color image. The BlendMask model has the highest precision of 57.718% in the true-color image. The CondInst-VoV model has the highest precision of 74.295% in the false-color image. The Mask R-CNN model has the highest precision of 75.922% in the Tianditu image.
Each detection model has its advantages and disadvantages depending on the task and image type. Nevertheless, the models generally perform best in the Tianditu image, in which the BlendMask and Mask R-CNN models have the highest F1-scores in the positioning and segmentation tasks, respectively, and both show good comprehensive performance. However, the high precision of the Mask R-CNN model comes from conservative prediction, while the CondInst-VoV model can find more objects.