1. Introduction
The intertidal zones of tropical and subtropical regions are home to mangrove trees, evergreen woody plants that can withstand high salt levels [1,2]. Mangroves play an important role in wind prevention, coastal stabilization, carbon sequestration, and other ecosystem services [3]. Mangrove forests in China shrank from 420 km² in 1950 to 220 km² in 2000 due to land reclamation for agriculture, urbanization, industrialization, and aquaculture [4,5,6,7]. In view of the various laws and regulations issued by the Chinese government on wetland protection, it is crucial to track the ecological changes of mangroves. However, collecting mangrove distribution data through extensive field measurement and sampling is difficult because mangroves grow densely in intertidal zones and are periodically submerged by seawater [8]. The widespread adoption of remote sensing technology has led to the use of satellite imagery for many environmental protection purposes, including monitoring the ecological changes of mangroves [9,10,11,12].
In recent years, scholars have further explored deep learning, which has proven highly effective for the semantic segmentation of remote sensing images and meets the accuracy requirements of computer vision applications [13,14,15,16,17,18,19,20,21]. The fully convolutional network (FCN) was proposed in 2015 [22]; it modifies the convolutional neural network to classify images at the pixel level. Bittner et al. [23] proposed a Fused-FCN4s model consisting of three parallel FCN4s networks to improve the convolution method; three-band (R, G, B), panchromatic (PAN), and normalized digital surface model (NDSM) images are used as inputs to the parallel networks to extract features from high-resolution remote sensing images. Chen et al. [24] proposed symmetric FCN models, including the symmetric normal fast FCN (SNFCN) and the symmetric dense quick FCN (SDFCN) with shortcut connections. To improve the segmentation quality, the asymmetric convolution net (ACNet) proposed by Hu et al. [25] in 2019 uses ResNet for feature extraction and applies an attention mechanism according to the amount of information carried by features at different levels, balancing the characteristics against the segmentation of effective regions. In 2019, Jun Fu et al. [26] proposed a new scene segmentation network that creatively appends a dual attention module to a network composed of a fully convolutional network and dilated convolutions. To cope with these factors, the method proposed by Guillaume et al. [1] combines deep learning-based enhancement of individual tree crowns (ITCs) with a marker-controlled watershed segmentation algorithm.
Deep convolutional neural networks have achieved great success in many fields and have demonstrated superior performance in many applications in recent years [27]. McGlinchy et al. [28] explored the use of fully convolutional neural networks (FCNNs), specifically UNet, to map such complex features at the pixel level in high-resolution satellite imagery. This trend has also attracted many researchers to apply deep convolutional neural networks to the semantic segmentation of remote sensing images [29,30]. Ge et al. [31] modified and adjusted the popular UNet model and improved the prediction performance of forest parameters in the resulting model.
Although the various FCN-based methods mentioned above have achieved remarkable performance in remote sensing image segmentation, their recognition relies heavily on large-scale datasets because millions of network parameters need to be trained [32]. In addition, recent studies have shown that the deeper the network, the better the performance of a deep convolutional network [33]. Unfortunately, as the number of layers increases, vanishing and exploding gradient problems may occur. Vanishing gradients prevent the model from being updated by the training data, while exploding gradients can cause model instability, drastic changes in the loss between updates, or a loss that becomes undefined during training.
To solve the problem of insufficient labeled data, transfer learning, as a deep learning strategy [34], provides an effective way to train large networks with limited data without overfitting. Transfer learning can reduce the pre-training time and resource overhead of new models and can mitigate the problem of insufficient samples in new prediction tasks. CNNs have been fused with models such as conditional random fields (CRFs) and support vector machines (SVMs) to create new fusion models [35], and Dong et al. tested an approach based on the fusion of an RF classifier and a CNN for very-high-resolution remote sensing (VHRRS) imagery [36]. To solve the problem of vanishing and exploding gradients, ResNet [37] was proposed with its characteristic residual connections, which allow gradients to propagate flexibly through bypass paths.
Many challenges remain when using convolutional neural networks for image semantic segmentation. For instance, pooling operations and long convolutional strides cause detailed image features to be lost; the spatial position information in the image is not used efficiently; and the algorithms have high complexity and require large datasets [38,39].
Various scholars currently use heuristic artificial intelligence methods to solve uncertainty problems. Belief rule inference, a heuristic artificial intelligence method built on expert systems with evidential reasoning and a generated rule base, has been widely used and studied [40,41]. Ioannou et al. [42] introduced the confidence framework into the traditional IF-THEN rule expression, drawing on evidence theory, fuzzy set theory, and decision theory, and proposed a new inference method based on evidential reasoning (ER) over a belief rule base. Lin et al. [43] proposed an inference method for extended belief-rule-base expert systems that introduces an attenuation factor to correct incomplete rule weights on incomplete data sets. Liu et al. [44] put forward a classification method based on belief-rule-base inference by introducing linear combinations of belief rule bases. Niu et al. [45] proposed an algorithm fusing importance and visual attention confidence. Charles et al. [46] proposed a target criterion for model confidence.
The above research focuses on acquiring feature sets and dividing similar classes in the inference process. However, it does not strictly consider the validity and credibility of the feature sets.
The segmentation of remote sensing images of mangrove forests presents the following difficulties:
Mangroves have a limited distribution, and remote sensing images of them are challenging to obtain, resulting in insufficient data sets for classification;
Mangrove reserves belong to the natural environment; their distribution and growth areas are irregular, and some parts are mixed with other shrubs, making them difficult to distinguish;
In the remote sensing images of the areas of interest, the houses in residential areas are relatively small, the tidal flats and bare soil appear irregular, and the interclass characteristics are very similar, for example between oceans and rivers, or between mangroves and other shrubs.
Hence, it is difficult for neural networks and traditional image inference methods to deal with the above problems. Although neural networks can usually achieve good segmentation results, they require a data set large enough to support training and validation of the network model. Methods such as knowledge graphs likewise require a large amount of data to build a knowledge base. To solve the above problems, a semantic understanding of remote sensing images is proposed: the feature space of the different ground objects in remote sensing images is constructed, and the mapping relationship between these features and the different ground objects is built. This paper therefore puts forward an approach based on convolutional feature inference for semantic understanding, combining convolutional transformation with confidence inference. Feature inference here means adopting a rule base for inference and prediction: appropriate convolution kernels are constructed for feature extraction, the extracted features build the rule base, and the rule base is used for inference and prediction.
The main contributions of this study are:
This study proposes a new approach that solves the segmentation of mangroves in different regions using an improved convolutional feature extraction model;
A spatial confidence inference method is proposed for the region growing of convolutional features. This method introduces an improved similarity calculation to divide similar classes and obtain the first-round inference results;
To improve the semantic segmentation of various features, the proposed algorithm takes the first-round result as a new sample set for boundary exclusion and noise reduction, and builds the final feature space and rule base for inference by establishing a three-dimensional color-space distribution map of each category and introducing a domain similarity measure.
2. Proposed Method
Convolution can extract image features carrying specific texture information while compressing the image. Thus, convolution is introduced when processing the original sample images in this study. This paper proposes a semantic segmentation method for remote sensing images based on convolutional feature confidence inference, extracting convolutional features with certain texture information. The flow of the research method is shown in Figure 1.
Based on small segmented subgraphs of a high-resolution remote sensing image, this method extracts colors and textures in a certain way to construct an inference feature set. It then endows the feature set with semantic comprehension and generalization abilities over the different categories, so that the segmentation result for the whole image is obtained through the semantic reasoning model. The detailed image semantic understanding method described above is shown in Figure 2.
This section will introduce the process, construction method, and theoretical analysis of the semantic inference model based on convolution features.
2.1. Convolution Feature Extraction
The collection of inference samples has a crucial impact on the inference results; a reasonably rich and correct sample set determines the predictive ability of the inference model. Image inference requires a certain amount of data samples as a reference, and the corresponding feature sets and rules are set according to the sample characteristics, so that the reasoning model grasps the relevant feature rules of the samples and thereby acquires reasoning ability. This paper randomly extracts local remote sensing images as samples for feature extraction. A rule base is generated from the samples and then used to divide the entire remote sensing image inferentially. The process of the feature set training module is shown in Figure 3.
As is known, objects appear macroscopic in a remote sensing image. Many entities are often concentrated in a particular area and, under most conditions, present random and nonuniform distribution characteristics. Some large-scale, densely distributed objects occupy only a very small spot in remote sensing images. Therefore, a local image can contain abundant feature information. The color characteristics of the “ocean” class in the remote sensing image are taken as an example for analysis. The pixel colors in the image are divided into four ranges from high to low, marked red, yellow, blue, and green, respectively. The distribution characteristics of the color can be observed in Figure 4.
Notably, if the inference model masters the feature laws in the local feature image, it can master the critical part of the feature rules within a single category, and such a model has preliminary feature inference ability. However, it is not easy to generate accurate segmentation results by building an inference model solely on the color features of the image itself. In recent object detection and image segmentation research, many scholars have applied the idea of convolution to image processing, so the convolution idea is introduced here when processing the original sample image. When extracting convolutional features, let the input image be defined as $I$ and the convolution kernel as $K$. The convolution operation can be expressed according to the formula below:

$$F(i,j) = \sum_{m}\sum_{n} I(i+m,\, j+n)\, K(m,n) \qquad (1)$$

In Formula (1), $F$ is the result of the convolution transformation. It can be seen that in places with dense single-category pixels the features tend to be relatively stable, while in areas between different categories the features show obvious changes. Therefore, the convolution kernel should extract the image edge information well. The setting of the convolution kernel is variable and can be adjusted for different images. The segmentation results obtained with different convolution kernels are compared in Figure 5.
The 3 × 3 convolution kernel has fewer parameters and an insufficient receptive field, and its inference results are greatly affected by category boundary information. Convolution kernels larger than 5 × 5 take much time in actual operation, affecting the inference speed. The algorithm therefore uses the 5 × 5 convolution kernel shown in Figure 6, in which the parameter of the central 1 × 1 part is set as large as possible to expand the feature range, an edge convolution kernel is introduced in the middle 3 × 3 part, and eight parameters are added to the outermost layer to collect texture information around the pixel.
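As a concrete illustration, the following sketch builds a kernel with the structure just described and applies it to one color channel. The paper specifies the kernel layout only qualitatively (Figure 6), so the numeric weights here, a dominant center, an edge-sensitive middle ring, and eight small outer texture taps, are illustrative assumptions.

```python
import numpy as np
from scipy.signal import convolve2d

def build_kernel(center=8.0, edge=-1.0, outer=0.25):
    """Assemble the 5 x 5 kernel structure of Figure 6 with assumed weights:
    a large 1 x 1 centre, an edge-sensitive middle 3 x 3 ring, and eight
    outermost parameters for the surrounding texture."""
    k = np.zeros((5, 5))
    k[1:4, 1:4] = edge                 # middle 3 x 3: edge convolution kernel
    k[2, 2] = center                   # central 1 x 1: set as large as possible
    for r, c in [(0, 0), (0, 2), (0, 4), (2, 0),
                 (2, 4), (4, 0), (4, 2), (4, 4)]:
        k[r, c] = outer                # eight outer texture parameters
    return k

def channel_features(channel, kernel):
    """Convolve one color channel (2-D array) with the kernel."""
    return convolve2d(channel.astype(float), kernel, mode="same", boundary="symm")
```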
Taking the original sample map as the input layer, the original sample map is convolved to obtain a set of feature values for the corresponding category. If feature extraction is conducted on the image information of channel $c$ in the class $\alpha$, the category feature value $v_\alpha^c$ is obtained. All category feature values form a one-dimensional feature-value table, which records the size of each feature value and the weight $w$ of the feature value, i.e., the proportion of the feature value in the total pixel count. Taking the mangrove class as an example, the feature-value extraction method is shown in Figure 7.
The original sample is the manually collected category sample image. Its three channels (RGB) are each multiplied by the convolution kernel for feature extraction to obtain the respective channel feature information. Statistics on the feature-value distribution interval of each channel then yield the categorical feature set for that category, namely $F_\alpha^R$, $F_\alpha^G$, and $F_\alpha^B$.
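The following minimal sketch shows how such a one-dimensional feature-value table could be built from a channel's convolution output. The paper does not state the quantization of feature values, so the bin count here is an assumption; each entry maps a (binned) feature value to its weight, i.e., its share of all pixels.

```python
import numpy as np
from collections import Counter

def feature_table(conv_values, n_bins=64):
    """Bin a channel's convolution output and return {feature value: weight},
    where the weight is that value's fraction of all pixels. n_bins is an
    assumed quantization, not a value given in the paper."""
    flat = conv_values.ravel()
    lo, hi = flat.min(), flat.max()
    bins = np.floor((flat - lo) / (hi - lo + 1e-9) * n_bins).astype(int)
    total = flat.size
    return {b: cnt / total for b, cnt in Counter(bins.tolist()).items()}
```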
2.2. Semantic Rule Base Construction
The feature distribution intervals of each category and the distribution weight $w$ of each feature in each feature interval can be obtained through statistics. All feature intervals of a channel make up a category sub-feature set $F_n^c$. The feature set $F_n$ corresponding to a category contains the sub-feature sets of its three color channels (R, G, and B), namely $F_n = \{F_n^R, F_n^G, F_n^B\}$. The final output feature set is the collection of the feature sets of all categories. The relationships among the feature set, category feature set, category sub-feature set, and feature table are shown in Figure 8.
The manually collected samples have an important drawback: the lack of sample data results in missing features. When the sample is limited, a part of the feature distribution that is dense but discontinuous is still very likely to belong to the features of the actual model.
To solve this problem, given that the feature information provided by the sample is fixed, the initially obtained feature set must be adjusted; that is, the feature interval of each sub-feature set must be expanded reasonably. An expansion limit $M$ is introduced to restrict the expansion range of a feature interval, indicating the maximum multiple of the original interval length to which the expanded interval may grow. Let the original interval length be $l$ and the extended interval length be $l'$; then:

$$l \le l' \le M \cdot l \qquad (2)$$
The expansion limit is usually set to a small value; if the expansion limit is too large and changes the feature set too significantly, the segmentation result becomes unstable. Suppose a sub-feature set has $n$ feature intervals, the weight of the $i$-th feature interval is $w_i$, and the expansion limit is $M$. Then the extension length $L$ of the feature intervals is calculable by Equation (3):
If the minimum value of the feature table is 0 and the maximum value of the feature table is H, then the result of the original feature interval after interval expansion can be expressed as Formula (4):
After the initially extracted feature intervals are expanded in the above way, some feature intervals in the feature set will intersect. The intersection interval is prone to errors in the inference logic in the process of confidence inference, so it is necessary to process the feature interval that produces the coincident region.
After the feature intervals are expanded, the feature information is extended to a certain extent. A feature interval that still accounts for only a small proportion of the new sub-feature set is overwhelmingly likely to be an outlier interval and may not belong to the features of its class. If such feature values are retained for subsequent inference, they will distort part of the inference information. Therefore, the smaller feature intervals in each subset must be removed to attenuate this effect.
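A hedged sketch of this whole adjustment step, interval expansion bounded by the limit M, clipping to the feature-table range, pruning of low-weight outlier intervals, and merging of intervals that come to intersect, is given below. The exact weighting of Equation (3) is not reproduced in this excerpt, so the growth rule and the parameter values are assumptions.

```python
def expand_intervals(intervals, weights, M=1.5, H=255, min_weight=0.02):
    """Expand each (lo, hi) feature interval in proportion to its weight,
    never beyond M times its original length (Eq. (2)), clip to [0, H]
    (Eq. (4)), drop low-weight outlier intervals, and merge intersections.
    The per-side growth rule and M/min_weight values are assumptions."""
    grown = []
    for (a, b), w in zip(intervals, weights):
        if w < min_weight:                    # prune probable outlier intervals
            continue
        g = (M - 1.0) * (b - a) * w / 2.0     # assumed weight-scaled growth
        grown.append((max(0.0, a - g), min(float(H), b + g)))
    grown.sort()
    merged = []
    for a, b in grown:                        # fuse intervals that now intersect
        if merged and a <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], b))
        else:
            merged.append((a, b))
    return merged
```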
2.3. Semantic Feature Inference
Feature interval expansion enriches the feature information to a certain extent, which can alleviate the problem of insufficient sample data. Although the feature set obtained from the sample image grasps part of the critical feature information of each classification, it also misses many classification features. During semantic inference, these uncollected features can easily lead to misjudgment. To improve the segmentation ability of the inference model, this paper uses confidence inference to solve this problem.
Confidence rule inference is the process of obtaining confidence rules through the statistics, processing, and analysis of a given data set, with the conclusions of inference generated on the basis of probability. Representing a particular inference method requires three parts: the feature sets, the rules of reasoning, and the result of reasoning. The problem that confidence inference must solve is that, when an image is segmented inferentially, pixels outside the scope of the feature set's segmentation ability must still be inferred into a corresponding classification.
To complete the inference task on unknown classification information, the similarity of each image pixel's feature values to each category's features must be calculated. There are roughly three cases in this inference process. In the ideal case, the feature values can be matched to the feature information of exactly one class. Suppose a pixel $x$ has the feature values $x_R$, $x_G$, and $x_B$ for the three color channels R, G, and B; the pixel can then be perfectly matched to the feature information of the category $n$ to complete the inference, as shown in Equation (5):

$$x_R \in F_n^R, \quad x_G \in F_n^G, \quad x_B \in F_n^B \qquad (5)$$
If this happens, the pixel is directly assigned to the class $n$. However, in addition to this situation, the feature values may fail to match the feature ranges of any category, and other discrimination methods must be set for such pixels.
Suppose the feature value $x_c$ of a pixel does not fall in any feature interval of the feature set $F_n^c$ ($c \in \{R, G, B\}$) corresponding to channel $c$ of the category $n$. A similarity calculation is then introduced to measure the degree of match between the feature value and the category $n$. Assuming that the feature set has $m$ standard feature intervals and the feature value $x_c$ lies between the feature intervals $T_i$ and $T_{i+1}$, its class similarity $s$ is calculable by Formula (6), where $\gamma$ is an adjustable parameter. The value of $\gamma$ can be used to magnify the difference between feature values and feature sets: the larger $\gamma$ is, the lower the similarity for a fixed distance between the feature value and the feature interval. When processing different segmented images, $\gamma$ can be adjusted to the image scale according to the number of segmentation categories and the feature relationships between the different types. The similarity lies in the range $(0, 1)$; the closer the similarity is to 1, the closer the feature information is to the category.
In addition to the above two cases, if the segmentation categories are too numerous or the features of different categories are too similar, the feature values of a pixel may closely match the characteristics of multiple categories. Further reasoning is then required to divide the input information of this kind. Suppose the feature value $x_c$ falls in the feature interval $T_i$ of the feature set $F_n^c$ corresponding to the channel $c$ in the category $n$. According to the previous definition of similarity, the match degree between the feature value and this class's features is 1. To determine which of the categories with similarity 1 is most closely related to this information, the similarity must be calculated according to the method of Formula (7), as follows:
The similarity in this equation lies in the range $(0, 1]$: the closer the feature value is to the center of the feature interval, the greater the calculated similarity. $w$ is the weight of the feature interval in the feature set of its class. With the effect of $w$, the similarity calculation formula infers, from the input feature, a more accurate result for the pixel with respect to each similar category, thereby determining the classification of the pixel.
The similarity of a pixel's feature value to the category in each channel can be obtained by the above calculations, and the similarity of the input information to the category is obtained by summing these results, namely:

$$S_n(x) = \sum_{c=1}^{C} s_n^c(x_c) \qquad (8)$$

where $C$ is the total number of channels. According to the above method, the similarity of the input information to each category can be obtained. The corresponding pixel is classified into the category with the greatest similarity, completing the inference.
The semantic feature reasoning in the different cases above can be summarized in the following three points:
If all feature values $x_c$ of the pixel $x$ belong to one category $n$ only, then the reasoning result is category $n$;
If a feature value of the pixel $x$ belongs to multiple categories $n_1, n_2, n_3, \dots$, the similarity to each such category is calculated according to Formula (7);
If a feature value of the pixel $x$ belongs to no category, the similarity to each category is calculated according to Formula (6);
In the latter two cases, $x$ is inferred to be the category with the largest similarity.
The above reasoning process is articulated with “if… then” sentences in the pseudocode as follows:
Input any pixel x:
if every feature value x_c of x belongs to one category n only
then
assign x to category n (Equation (5))
else if a feature value of x belongs to multiple categories
then
calculate the similarity to each candidate category by Formula (7)
else if a feature value of x belongs to no category
then
calculate the similarity to each category by Formula (6)
assign x to the category with the largest similarity
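A runnable sketch of this three-case inference is given below. Since Formulas (6) and (7) are not reproduced in this excerpt, the two similarity expressions are assumptions that only preserve the stated monotonic behavior: similarity decreases with distance from the nearest interval (sharpened by the parameter gamma), and increases toward an interval's center, weighted by the interval weight.

```python
def infer_pixel(x, feature_sets, gamma=2.0):
    """x: {'R': value, 'G': value, 'B': value}. feature_sets[n][c] is a list
    of (lo, hi, weight) intervals for category n and channel c. Returns the
    category with the largest summed similarity (Eq. (8))."""
    def channel_similarity(v, intervals):
        hits = [(lo, hi, w) for lo, hi, w in intervals if lo <= v <= hi]
        if hits:                                  # inside an interval: Eq. (7)-style
            lo, hi, w = max(hits, key=lambda t: t[2])
            centre, half = (lo + hi) / 2.0, (hi - lo) / 2.0 + 1e-9
            return w * (1.0 - abs(v - centre) / half)
        # outside every interval: Eq. (6)-style distance penalty
        d = min(min(abs(v - lo), abs(v - hi)) for lo, hi, _ in intervals)
        return 1.0 / (1.0 + d) ** gamma

    scores = {n: sum(channel_similarity(x[c], per_ch[c]) for c in ("R", "G", "B"))
              for n, per_ch in feature_sets.items()}
    return max(scores, key=scores.get)
```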
2.4. Correction Module Based on the Results of First-Time Inference
The preliminary division of the segmented image is obtained through the above operations, but problems remain in the feature set: on the one hand, the feature set is not rich enough; on the other hand, its segmentation ability on the image is poor. Therefore, further processing of the feature set is required to improve the segmentation ability of the model. Since the image segmented by the first inference also contains a considerable number of incorrectly segmented pixels that would interfere with the construction of the feature set, a reasonable training method must be set up to adjust the feature set. Experiments show that roughly three types of misjudged areas affect the construction of feature sets. The first type is the boundary area between categories, where the feature performance is unstable. The second type is small-area noise information, which cannot be reasoned into the correct classification because of the limitations of the original feature set. The third type, in which wrongly segmented data is fused into the category it was wrongly assigned to, is complicated and challenging to eliminate. The main task of the correction module is rule-base optimization based on the results of the first inference, which must accomplish the feature-set expansion with minimal error information; its flow is shown in Figure 9.
2.4.1. Category Boundary Information Removal Based on Segmented Images
The first segmented image is input as a new sample set, and the first type of misjudged area, the boundary area between categories, is processed first. The edge of each segmented class intersects with other classes, affecting the feature performance. Hence, this part of the information needs to be removed when generating the feature set.
Therefore, this paper uses a mean convolution kernel, which compresses and then resamples the image while ensuring that the features are not lost, and removes the edge information according to the convolution result. The mean convolution is equivalent to a filter that smooths the image; if the convolution kernel size is $k \times k$, each element of the kernel is calculated as follows:

$$K_{ij} = \frac{1}{k^{2}}, \quad i, j = 1, \dots, k \qquad (9)$$
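One way to realize this step is sketched below: the label map produced by the first inference is mean-filtered per class, and any pixel whose window is not label-pure is treated as a boundary pixel and removed. The window size and purity threshold are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def remove_boundaries(label_map, k=5):
    """Mark category-boundary pixels of a 2-D integer label map as -1,
    keeping only pixels whose k x k mean-filtered neighborhood is pure."""
    out = label_map.astype(int).copy()
    for lab in np.unique(label_map):
        frac = uniform_filter((label_map == lab).astype(float), size=k)
        boundary = (label_map == lab) & (frac < 1.0 - 1e-6)   # mixed window
        out[boundary] = -1                                    # drop edge pixels
    return out
```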
2.4.2. Noise Information Removal
After the boundary information is removed, the sample image still contains noise interference and therefore cannot yet be used as an input sample for feature extraction. If the noise information were added to the feature construction process, it would distort the category features. Therefore, the image is first split according to spatial position: each spatially independent region becomes a separate sample image $P_i$ with pixel count $N_i$, and all the sample images of a category form that category's sample set $Q_n$. The noise information in the sample set is then processed with a dropout parameter $\delta$. When a sample image of the class $n$ being processed satisfies Formula (10), the sample is considered a noise sample and discarded.
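A minimal sketch of this step follows. The precise criterion of Formula (10) is not reproduced in this excerpt, so a minimum-size rule relative to the category's total pixel count, governed by the assumed dropout parameter delta, stands in for it.

```python
import numpy as np
from scipy.ndimage import label

def drop_noise_regions(category_mask, delta=0.001):
    """Split a boolean category mask into spatially independent regions and
    keep only those whose pixel count passes the assumed Eq. (10) proxy."""
    regions, n_regions = label(category_mask)     # connected components
    total = category_mask.sum()
    keep = np.zeros_like(category_mask, dtype=bool)
    for r in range(1, n_regions + 1):
        region = regions == r
        if region.sum() >= delta * total:         # large enough: not noise
            keep |= region
    return keep
```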
2.4.3. Sample Set Processing Based on Convolutional Feature Confidence Inference
Following the above method, the noise information in small areas and the boundary areas between categories has been processed to obtain a sample set. However, the samples still exhibit the third type of problem. Therefore, the sample set still cannot be used for feature extraction, and further processing is required.
Assume that the class $n$ contains $t$ sample images, and that the RGB channel values of the pixels in each sample image are concentrated in a particular region $D$ of the color space. If most of the pixels in a sample image are correctly divided, then the region $D$ will also be stably distributed. For the pixels of each class's samples, their distribution in the color feature space is drawn as shown in Figure 10.
As shown in the figure above, each category's pixels are distributed in a three-dimensional feature space, and the front view (RB), side view (GB), and top view (RG) of this three-dimensional space are displayed. In the RG diagram, the abscissa is R and the ordinate is G, and so forth. In the 3D diagram, the x-axis corresponds to feature channel R, the y-axis to channel G, and the z-axis to channel B. It can be seen that the pixel values of each category have a relatively concentrated distribution area and show stable characteristics. Furthermore, a sample image in a class will have a higher coincidence and similarity with the region composed of the other sample maps when it has fewer wrongly segmented pixels.
Conversely, if the region contains a portion of pixels that are not properly segmented, it will coincide less with the region formed by the other sample maps in the class, and this region will be discarded. Thus, a method is needed to determine the point distribution of each sample in the feature space.
Take a pixel $p$ of the $i$-th sample in the category $n$, with values $R_p$, $G_p$, and $B_p$ for the three channels. A set of characteristics $k_1$ to $k_{12}$ for this pixel can then be calculated according to Equation (11), where $k_1$ to $k_{12}$ describe the position of the input information in the RG, GB, and RB planes: they are the slopes of the lines joining the pixel to the four vertices of each plane (taking 256 instead of the maximum value of 255 prevents the denominator from being zero). To show the meaning of these features more clearly, the RG plane can be taken as an example; the relationships among the feature information are shown in Figure 11.
The domain $D_i$ of the $i$-th sample in the class $n$ eventually constitutes a space limited by the above conditions, and the points of the sample are distributed within this spatial range. Determining this space requires the supremum and infimum of each condition. The state of each threshold in $D_i$ is as Equation (12) outlines, where the threshold with the subscript $\inf$ is the lower limit of a channel and the threshold with the subscript $\sup$ is its upper limit; by analogy, for every condition the $\inf$ threshold is its infimum and the $\sup$ threshold is its supremum. After the value of each condition is calculated for a pixel, the upper and lower bounds of each condition of the domain $D_i$ are updated. After all of the pixels of the sample have been processed, the distribution domain $D_i$ of the sample in the feature space is finally obtained.
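The sketch below shows one way to accumulate such a domain, keeping the infimum and supremum of each channel and of the slope features. Since Equations (11) and (12) are not reproduced in this excerpt, only one representative slope per projection plane is computed, against an assumed vertex at (256, 0); the full twelve-slope construction follows the same pattern.

```python
import numpy as np

def sample_domain(pixels):
    """pixels: (N, 3) float array of R, G, B values in [0, 255]. Returns the
    domain D as {condition: (infimum, supremum)}."""
    R, G, B = pixels[:, 0], pixels[:, 1], pixels[:, 2]
    eps = 256.0    # 256 instead of the maximum 255 keeps denominators nonzero
    k_rg = G / (R - eps)             # assumed slope to RG-plane vertex (256, 0)
    k_gb = B / (G - eps)             # assumed slope to GB-plane vertex (256, 0)
    k_rb = B / (R - eps)             # assumed slope to RB-plane vertex (256, 0)
    return {name: (vals.min(), vals.max())
            for name, vals in [("R", R), ("G", G), ("B", B),
                               ("k_RG", k_rg), ("k_GB", k_gb), ("k_RB", k_rb)]}
```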
Take the sample images $P_i$ and $P_j$ of the category $n$ as an example. The features of the two sample images are extracted to obtain the feature values and the feature-value weights of each channel for each sample. Suppose that the characteristic intervals of the two samples intersect, and each intersection interval is expressed as $T^{\cap}$; then the similarity of the two sample images is calculated according to Formula (13), where $H$ is the length of the feature table. A decision threshold $v$ is set to determine the minimum similarity between two samples. When the similarity between a sample domain and all other sample domains falls below this threshold, the abnormal segmentation rate of the sample region is considered too high, and the sample is discarded. Conversely, if a sample is very similar to another sample, then the information of the two samples is alike, and the domains corresponding to the two samples need to be fused: taking the minimum of the two domains' infima and the maximum of their suprema for each feature, that is, taking the union of the two domains, yields a new domain containing the features of both originals. The result of processing according to the above description is the final sample set for training.
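The following sketch mirrors this discard-or-fuse decision. Formula (13) is not reproduced in this excerpt, so the similarity below, overlap length of intersecting intervals weighted by the smaller feature weight and normalized by the feature-table length, is an assumption with the intended behavior.

```python
def sample_similarity(table_a, table_b, table_length=256):
    """Each table is a list of (lo, hi, weight) feature intervals; returns a
    similarity in [0, 1] based on weighted interval overlap (Eq. (13) proxy)."""
    overlap = 0.0
    for a0, a1, wa in table_a:
        for b0, b1, wb in table_b:
            inter = min(a1, b1) - max(a0, b0)
            if inter > 0:
                overlap += inter * min(wa, wb)   # weight by the weaker interval
    return overlap / table_length

def fuse_domains(dom_a, dom_b):
    """Union of two sample domains: per-feature (min of infima, max of suprema)."""
    return {key: (min(dom_a[key][0], dom_b[key][0]),
                  max(dom_a[key][1], dom_b[key][1])) for key in dom_a}
```

Samples whose similarity to every other sample of their class falls below the decision threshold v are dropped; sufficiently similar pairs are merged with fuse_domains.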
The sample set processed in the above way has the first and second types of outliers removed, as well as part of the third type. The resulting sample data are larger, and the category rules they contain are richer. When the newly obtained sample set is input into the rule-extraction model, a new feature set usable for inferential segmentation is obtained. This feature set nevertheless contains a considerable amount of error information, which would greatly affect the reasoning results. The newly obtained feature set must therefore be fused with the initial feature set obtained from the first segmentation, optimizing and adjusting the new feature set on the basis of the feature set obtained from the first training.
The adjustment of the feature set must expand each feature interval moderately, with the extension length adjusted according to the situation of the interval: the interval weight, the interval width, and the factors of the segmented image all need to be considered comprehensively. Assuming that a characteristic interval of the channel for a category is $T$ and the weight of the interval is $w$, the interval can be expanded according to Equations (14) and (15). Equation (14) gives the length to extend.
The result of the expansion of the interval can be expressed by Equation (15), where $L$ is the length of the feature-value table. At this stage, the operation of the correction module is complete, and the feature-value table has been expanded.