1. Introduction
Over the past decade, remote sensing data generated by satellite-based sensors have grown exponentially. These data have been widely used in geological remote sensing interpretation (GRSI) [1]. In the context of large-scale regional lithological mapping, GRSI refers to the process of identifying geological features based on their characteristics in remote sensing images [2,3,4]. It is an integral component of regional lithological mapping. Compared with traditional manual geological surveys, GRSI enables large-scale geological investigations in a cost-effective manner. As such, it can serve related applications such as geological mapping, urban and regional planning, environmental protection, and disaster assessment. Hence, obtaining high-quality GRSI products is essential for effective research in these fields.
To improve the accuracy of GRSI products, researchers have proposed various geological interpretation models. Early national and international geological remote sensing practitioners developed interpretation models based on the spectral characteristics of rocks, minerals, and geological tectonic features [5,6]. In Ref. [7], a novel band ratio image was generated from selected ASTER bands and combined with geological field observations to map lithological units in the Wadi Kid region. Another approach that uses spectral information is the spectral angle mapper (SAM) [8]. However, these geological interpretation models rely heavily on manual labor and have a low degree of automation. Moreover, the spectral response of the same type of rock varies considerably, which results in spectral uncertainty. This phenomenon, referred to as spectral variability, is widely observed in scenarios where the spectral characteristics of pure materials vary in the acquired hyperspectral remote sensing images. Spectral variability can be caused by illumination and environmental factors, as well as by the inherent properties of the materials [9,10]. For example, variations in the orientation, strike, and dip of a single rock can change how the rock absorbs and reflects solar radiation, which leads to distinct spectral signatures being captured by remote sensing sensors. These issues constrain the development of GRSI products.
In the last few years, there has been a shift toward automated machine learning (ML) methods for extracting geological element features from remotely sensed imagery. These methods address the limitations of manual interpretation models. The first common type of ML method is dimensionality reduction, which includes independent component analysis (ICA) [11], principal component analysis (PCA) [12], and the minimum noise fraction (MNF) [13]. Another type is classification, such as the support vector machine (SVM) method [14]. For example, Ref. [15] used SVMs to classify the lithology in ASTER images from their spectral data. The results showed that the SVM algorithm performed well: the classification maps obtained were consistent with both the field survey and officially published geological maps. There are also a number of related platforms, such as the GoldenEye project. However, as the temporal, spatial, and spectral resolution of remote sensing data continues to rise and the information in remote sensing data becomes more abundant [16], ML-based feature-extraction methods face challenges in terms of feature representation.
Since 2012, a multitude of advanced deep learning (DL) methods have emerged in the field of image processing. DL techniques can automatically model large-scale and complex datasets [17], and they have shown exceptional performance in various computer vision tasks. Among these, convolutional neural networks (CNNs) have gained widespread attention in remote sensing due to their effectiveness in image processing [18]. For instance, Ref. [19] applied CNNs to map geological target features and to classify Landsat images. Similarly, Ref. [20] employed both CNNs and traditional ML methods, such as SVMs and multilayer perceptrons, to map lithological units in mineral-rich areas of southeastern Iran. Ref. [21] proposed a high-resolution mapping approach that uses unmanned aerial vehicles (UAVs) to acquire data and DL algorithms to extract target features. Ref. [22] proposed a multistage self-guided separation network (MGSNet) that enhances the discriminability of targets and backgrounds in remote sensing scenes through a target–background separation strategy, contrastive regularization, and self-guided networks. Ref. [23] proposed a representation-enhanced status replay network (RSRNet), which addresses representation bias, classifier bias, and insufficient information interaction through the combined augmentation of feature representation, a status replay strategy, and cross-modal interactive fusion. Ref. [24] proposed a structural optimization transmission network (SOT-Net), which uses the reflectance-specific information from hyperspectral images (HSIs), as well as the detailed edge representations from multiple sources, to enhance feature extraction and classification. Ref. [25] proposed a spatial–logical aggregation network (SLA-NET), which leverages morphological transformations and trainable structuring elements to extract fine-grained morphological structures from hyperspectral images. The method aims to enhance the classification of tree species and has shown superior performance compared with other state-of-the-art classifiers. Ref. [26] proposed the Large kernel Sparse ConvNet weighted by Multi-frequency Attention (LSCNet) to overcome the limitations of traditional CNNs in remote sensing scene understanding. Overall, DL-based methods have shown remarkable potential in extracting semantic features of geological elements, such as lithology [27], minerals [28], glaciers [29], soils [30], and geological formations [31]. However, the accuracy of these interpretations is often inadequate. The contributing factors can be summarized as follows:
1. Small inter-class differences: Certain geological elements have blurred boundaries and similar imaging features due to various physical and chemical effects, such as weathering, erosion, and biological activity. Distinguishing between these elements in satellite imagery is difficult because their color, shape, and structural features are hard to classify accurately.
2. Unevenly distributed geological elements: The variety of geological elements and the different sizes of the areas they cover place higher demands on the interpretation model. Notably, geological elements with a wider area of coverage, such as soil and water bodies, are relatively easier to identify.
3. Sensor limitations: Sensor aberrations, changes in the operating conditions, and the movement of the Earth can cause image distortions. Natural phenomena, such as clouds and fog, can also obscure the sensor’s view of the Earth’s surface, thus making it difficult to obtain clear remote sensing images.
These factors pose significant challenges in improving the accuracy of GRSI data. Limitations in the acquisition and processing of remote sensing data result in insufficient information content in GRSI data. Therefore, relying solely on remote sensing data for interpretation often fails to meet high accuracy requirements. In contrast, the point data that experts obtain through geological surveys have higher precision. It is therefore better to combine the complementary information from both point and surface data for geological analysis. The fusion of point and surface data yields more comprehensive and more accurate interpretation results. However, because data from different sources are heterogeneous, each with its own data domain, the fusion of point–surface data often faces the challenge of data domain misalignment.
To address the limited accuracy of GRSI products caused by small inter-class differences, the uneven distribution of geological elements, sensor limitations, and the misalignment of data from multiple sources, this paper proposes a point–surface data optimal fusion method (PSDOF) based on optimal transport (OT) theory. PSDOF combines geological survey point data (which carry spatial location and geological element information and are called point data) with a geological remote sensing DL interpretation product (which has limited accuracy and is called surface data) to improve the quality of the resulting output, and it introduces OT to align the heterogeneous data. The method includes two primary stages: (1) the extraction of high-precision location information from the point data and (2) point–surface fusion, in which the OT model incorporates the point data information into the accuracy-constrained GRSI products to achieve an information gain. To demonstrate the effectiveness of the method, experiments were conducted with PSDOF using GRSI products over the Pamir Plateau in the southern part of the Tianshan Mountains in China. Our work makes the following contributions:
1. We propose a new fusion method in the field of GRSI, named PSDOF, that is designed to fuse heterogeneous geological survey point data and GRSI data.
2. By fusing the high-precision location information extracted from geological survey point data, PSDOF achieves information gains and effectively enhances the accuracy of GRSI products.
3. PSDOF employs the concept of OT to address the challenge of data misalignment between geological survey point data and GRSI data, thus achieving a significant improvement in accuracy.
This paper is divided into six sections. Section 2 examines the related works. Section 3 defines the problem and objectives and presents the data used in this study, while Section 4 explains the PSDOF method, including its principles and features. Next, Section 5 covers the experimental setup and results. Finally, Section 6 summarizes the findings and future research implications.
  3. Problem Definition
Consult Table 1 for the primary symbols used in this research. PSDOF is described as follows: the input is a set of Landsat 8 images of size r × c × W (r and c denote the number of pixels along the rows and columns of the image, respectively, and W is the number of channels in the image), together with the surface data and the point data Q.
      
      where 
 represents the gray-scale correlation feature value of the Landsat 8 image pixel corresponding to the point data. We aimed to use 
 to reflect the information about the geological elements of each pixel of the image. After calculation and expert screening (
), it was determined that pixels with feature values in the range 
 have similar geological elements and that these pixels are called similar feature pixels.
      
      where the source domain (SD) 
 is the finite set of all similar feature pixels 
 of the point data in the remote sensing image. The target domain (TD) 
 is the set of pixels 
 within the geological element boundaries (i.e., the boundaries obtained by fitting similar pixel curves) in the vicinity of the observation point of the decoded product. The inter-domain transport scheme 
 can be written as follows:
In the OT model, the SD samples and TD samples are the input, and the OT cost of the sample domain is calculated to output the fused optimal bound. The OT plan can be written as follows:
Finally, the GRSI products are updated according to the optimal bounds to obtain a higher-quality interpretation product. In summary, our proposed method can effectively fuse two sample domains in the information space at a minimal cost, thus making it suitable for multimodal remote sensing data fusion. Our approach extracts complementary information from the point data through the calculation of gray-scale correlation features and leverages the lithological and positional information of the geological survey point data by fusing them into boundary-ambiguous GRSI products, thus resulting in higher accuracy.
  Materials
The study area selected for this experiment is situated on the Pamir Plateau in the southern region of the Tianshan Mountains in China. The area has a highland mountain climate characterized by cold weather and numerous rock glaciers. The following data were used:
1. Remote sensing data: These are a critical component of this study. Landsat 8 is a multispectral satellite that covers a total of 11 bands, thus making it an ideal data source for this study. To better distinguish the spectral characteristics of the geological elements, this work used the three RGB bands of Landsat 8 to synthesize color images, as shown in Figure 1.
2. GRSI data: These were the DL interpretation products derived from Landsat 8 satellite imagery of the study area using several semantic segmentation models, namely FCN [45], DeepLabV3 [46], DANet [47], OCNet [48], PSPNet [49], and AdvSemi-OCGNet [50]. The extracted geological elements fall into nine categories: glacier, granitic rock, lakes, carbonate rocks, slates, sandstone, volcanic debris, schist, and soil bodies. Table 2 shows the accuracy of the AdvSemi-OCGNet interpretation products: glaciers, lakes, and soil bodies had the three highest classification accuracies of the nine geological elements, whereas granitic rocks, sandstone, and volcanic debris had the three lowest. To assess the efficacy of our model, three localized areas (i.e., where geological survey data existed) were selected for experimentation.
3. Geological survey data: In this study, these mainly consisted of point data. The points were selected manually by geologists within the study area, which resulted in a sparse dataset with precise spatial location and lithological information. After careful screening and confirmation, only point data within the interpretation boundary were used. The distribution of the selected point data is illustrated in Figure 2. To improve the interpretation accuracy, these point data were fused with the GRSI products to provide complementary information.
4. Ground truth data: These are essential in assessing the precision of the GRSI model. In this study, the ground truth labels were derived from Landsat 8 satellite imagery and were manually annotated by experts, with geographic alignment and cropping. The ground truth data represent the distribution of the geological elements in the study area, and they were used as a reference to evaluate the performance of the GRSI model. The annotated map in Figure 3 shows the nine categories of geological elements included in the ground truth data, which are spatially consistent with the Landsat 8 satellite imagery.
  4. Methodology
This section introduces the PSDOF method for fusing point data and surface data. Figure 4 shows a flowchart of the method.
  4.1. Information Extraction
Geological survey point data are sparse, which raises an additional problem: a lack of samples and labels. To address this problem, PSDOF uses Landsat 8 images and a gray-level co-occurrence matrix (GLCM) to generate point samples and labels.
The GLCM is defined as follows [
51]. Suppose our image (e.g., Landsat 8 images and GRSI products) is square, with 
c representing the columns and 
r the rows. The gray values of each pixel are quantized into different levels 
, 
, thus creating a set of weighted gray values. The image can be described as a function that assigns a gray value 
G to each pair of pixel coordinates. 
 is the co-occurrence frequency matrix of the two gray values 
a and b, which are separated by the step distance d in the image. Moreover, it is a function of the distance and angle relationship between neighboring pixels and thus reveals information about the texture of the image [52].
For 
d and angle 
, the normalized frequency is defined by the following equation:
        where 
N represents the size of the image. Define 
 as the value of the 
-th entry in the normalized GLCM. The averages of the rows and columns—as denoted by 
 and 
, respectively—as well as their corresponding standard deviations can be described as follows:
Different texture features capture different properties of the image; this study used only the gray-scale correlation feature, which is calculated as follows:
        where 
 corresponds to the number of gray values in the image pixels and 
 corresponds to the number of ranks of the pixels. 
 represents the linear correlation of the gray-scale value of each pixel of the image; the higher the value, the stronger the correlation is [
53].
The setting of the step distance d and the angle is particularly important according to Equation (6). Fine textures require relatively small distance values, because they are difficult to represent if the distance is too large. The angle is relatively less important in co-occurrence matrices, and many authors have used the average over several angles [54]. After extensive experimental analysis, we chose the best parameters of 
d and 
 for the surface data, where the step distance was 
. Each texture feature was rotated by 
, 
, 
, and 
, and the average feature value of the four angles was taken.
Pixels with similar gray-scale correlation values around the geological survey points were selected as a set of point samples, and the experiment showed that the distribution of the point sets matched well with the distribution of the geological elements in the ground truth data.
The GLCM method utilizes the correlation of neighboring pixels’ gray-scale values in an image to represent its texture features. It is useful in addressing the issue of inadequate point samples in the application of the PSDOF method.
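The correlation feature described in this subsection can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the function names, the symmetric counting of pixel pairs, the default step distance of 1, and the four directions 0°, 45°, 90°, and 135° are all assumptions made for the example.

```python
import numpy as np

def glcm(image, d=1, angle_deg=0, levels=8):
    """Normalized, symmetric gray-level co-occurrence matrix for one offset.

    `image` must hold integer gray levels in [0, levels).
    The four supported directions are an assumption of this sketch.
    """
    offsets = {0: (0, d), 45: (-d, d), 90: (-d, 0), 135: (-d, -d)}
    dr, dc = offsets[angle_deg]
    rows, cols = image.shape
    P = np.zeros((levels, levels), dtype=float)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                P[image[r, c], image[r2, c2]] += 1
    P = P + P.T  # count every pixel pair in both directions
    return P / P.sum()

def glcm_correlation(P):
    """Correlation texture feature of a normalized GLCM."""
    levels = P.shape[0]
    i, j = np.meshgrid(np.arange(levels), np.arange(levels), indexing="ij")
    mu_i, mu_j = (i * P).sum(), (j * P).sum()
    sd_i = np.sqrt((((i - mu_i) ** 2) * P).sum())
    sd_j = np.sqrt((((j - mu_j) ** 2) * P).sum())
    if sd_i == 0 or sd_j == 0:
        return 1.0  # constant patch; returning 1 is a convention chosen here
    return (((i - mu_i) * (j - mu_j)) * P).sum() / (sd_i * sd_j)

def mean_correlation(image, d=1, levels=8):
    """Average the correlation feature over the four assumed directions."""
    values = [glcm_correlation(glcm(image, d, a, levels)) for a in (0, 45, 90, 135)]
    return float(np.mean(values))
```

Pixels whose `mean_correlation` value, computed on a small window around them, is close to that of a survey point would then be candidates for the point sample set.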
  4.2. Sample Selection
Mapping the complementary information of heterogeneous data into a common fusion space is challenging. Specifically, we need to find an effective set of characteristic data and a mapping method that represents these data. This section introduces our data-processing method:
Characteristic point: Pixels in the neighborhood of the geological elements indicated by the geological survey data were selected as point samples when their gray-scale correlation was consistent with that of the geological survey point data. The point sample set was determined by the texture features of the geological elements. Since the geological survey point data had high-precision location information, all of the point samples could be generated with labels (with both location and geological element information). Regardless of the heterogeneity and origin of the point data, after information extraction, the point samples can all be converted into labels.
Characteristic surface: In the experiment, the characteristic surface was obtained by combining the textural features of the geological elements. We denote the  pixel distribution surface (using curve fitting) of the geological elements as  and the classification boundary of the surface data as . The characteristic surface was  for .
In principle, PSDOF requires a mapping of 
n labels among 
L modalities. Suppose there exist 
L different modalities of heterogeneous data (e.g., geological survey point and GRSI data), which are denoted as the mapping 
 with their own domain 
 and with the ranges 
, respectively:
        where 
 is a measurable probability space of the 
l types of multimodal data (including point data and surface data), and 
 can have values taken in the field of real numbers 
, complex values 
, integers 
, etc. In our experiments, different geological survey point data represent different mappings 
, thus revealing the types and spatial location information of the geological elements.
The goal of PSDOF is to homogenize the domains of the data of different modalities into the common space 
 with 
 for 
:
        where the mapping 
 is a non-linear many-to-one relationship and is metrizable. In the study, the remote sensing data were associated with 15 × 15 m² grid cells, and those cells can be considered as elements in 
. Thus, 
 serves as a mapping from the positions of point samples to their nearest unit centers. This mapping relationship, when taken, allows for heterogeneous domains to be aligned and fused into a common domain:
The fusion of multi-source heterogeneous information requires solving the alignment problem of the data domain. The domain adaptation technique provides a feasible solution: the source samples (the characteristic point set) and the target samples (the characteristic surface) are correlated in a unified information space. Furthermore, the data deviations between the domains can be addressed by solving for the correlation mapping.
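The many-to-one mapping from point sample positions to grid cells can be illustrated as follows. This is a hedged sketch: the 15 m cell size comes from the text, while the function names, the grid origin, and the use of projected metric coordinates are assumptions made for the example.

```python
import numpy as np

CELL_SIZE_M = 15.0  # grid cell edge length stated in the text

def to_cell_index(xy, origin=(0.0, 0.0), cell=CELL_SIZE_M):
    """Map metric (x, y) positions to integer (col, row) grid-cell indices.

    Every position inside a cell maps to the same index, so the mapping is
    many-to-one. `origin` is a hypothetical lower-left corner of the grid.
    """
    xy = np.asarray(xy, dtype=float)
    return np.floor((xy - np.asarray(origin)) / cell).astype(int)

def cell_center(index, origin=(0.0, 0.0), cell=CELL_SIZE_M):
    """Return the metric coordinates of the center of a grid cell."""
    return np.asarray(origin) + (np.asarray(index) + 0.5) * cell
```

Because a position always lies closest to the center of the cell that contains it, flooring the scaled coordinates is equivalent to assigning each point sample to its nearest cell center.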
  4.3. Data Fusion
To obtain the optimal mapping between the data domains, the PSDOF method introduces OT theory.
In the domain adaptation research area, OT provides a scheme for transferring SD distributions to the TD. Specifically, OT aims to estimate transport plans that minimize transport costs [55]. This fusion model considers only a finite domain 
 with 
. The source and target distributions can be expressed as 
 (e.g., geological survey point data) and 
 (e.g., GRSI products), where 
 and 
 are the count of units in 
 and 
, respectively. Then, the expected distribution of the two domains 
 and 
 can be written as follows:
        where 
 and 
 are the Dirac function at 
 and 
, respectively, and 
 and 
 are the factors of the unit simplex, with 
 and 
.
In the case of discrete OT, the empirical distributions 
 and 
 are estimates of 
 and 
 for the discrete data points in 
 and 
, respectively. As such, we have
        
        where 
 and 
 are the probability masses of the unit simplex 
 and 
, respectively, with 
 and 
. The best-matching 
 between 
 and 
 is calculated by Kantorovich’s formula [56] as follows:
        where 
 is the matrix of the cost, 
 indicates the cost of transporting the probability mass from 
 to 
, and 
 is the Frobenius inner product. 
 is the set of associative discrete couplings between 
 and 
, such that
        
        where 
 represents a vector of ones with length n.
The probability density of each position in the target distribution should be determined based on the geological element distribution, which is unknown and more difficult to obtain. This work assumed that the probability density of each position of the target distribution follows a uniform distribution—in other words, the probability density of 
 and 
 is 
 and 
, respectively. The function 
 can then be stated as follows:
The corrected empirical distribution can be stated as
        
Brenier [57] proved that the Kantorovich formula can be solved by a system of linear partial differential equations when the cost function is quadratic and the domain of definition 
 is a Euclidean space. This further optimizes 
, and 
, which can be approximated as
        
        where 
 is the Wasserstein distance. The approach can be viewed as a local second-order approximation, which can be used to minimize the Wasserstein distance 
 at 
.
In a given geometric space, 
 and 
 can be regarded as the SD and TD, respectively. The optimal coupling can be calculated from Equation (14), where the cost matrix 
 is the Euclidean distance of the uniformly discrete feature pixels in the sample space 
 and 
:
The transport cost of the OT model is obtained by solving for the 2-Wasserstein distance:
To address the issue of data domain misalignment during the fusion of heterogeneous data from multiple sources, the data fusion module introduces the OT method. The experiment demonstrated the superiority of PSDOF in the fusion process of geological survey point data and GRSI products.
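The discrete Kantorovich problem with uniform probability masses can be sketched as a small linear program. This is an illustrative example under stated assumptions, not the authors' solver: it uses SciPy's general-purpose `linprog` routine, a squared Euclidean cost between sample coordinates, and hypothetical function names.

```python
import numpy as np
from scipy.optimize import linprog

def squared_euclidean_cost(xs, xt):
    """Cost matrix C[i, j] = squared distance between source i and target j."""
    diff = xs[:, None, :] - xt[None, :, :]
    return (diff ** 2).sum(axis=2)

def kantorovich_plan(xs, xt):
    """Solve min <gamma, C> subject to uniform marginals on both domains.

    xs: (ns, dim) source samples, e.g. characteristic points.
    xt: (nt, dim) target samples, e.g. characteristic surface pixels.
    Returns the coupling gamma of shape (ns, nt) and its transport cost.
    """
    ns, nt = len(xs), len(xt)
    a = np.full(ns, 1.0 / ns)  # uniform mass on source samples
    b = np.full(nt, 1.0 / nt)  # uniform mass on target samples
    C = squared_euclidean_cost(xs, xt)
    # gamma is flattened row by row, so entry (i, j) sits at index i * nt + j.
    row_sums = np.kron(np.eye(ns), np.ones((1, nt)))  # sum_j gamma_ij = a_i
    col_sums = np.kron(np.ones((1, ns)), np.eye(nt))  # sum_i gamma_ij = b_j
    res = linprog(
        C.ravel(),
        A_eq=np.vstack([row_sums, col_sums]),
        b_eq=np.concatenate([a, b]),
        bounds=(0, None),
        method="highs",
    )
    gamma = res.x.reshape(ns, nt)
    return gamma, float((gamma * C).sum())

# Two source points and two target points, each target 1 unit from a source.
xs = np.array([[0.0, 0.0], [10.0, 0.0]])
xt = np.array([[0.0, 1.0], [10.0, 1.0]])
gamma, cost = kantorovich_plan(xs, xt)
```

With a squared Euclidean cost, the square root of the returned cost is the 2-Wasserstein distance between the two empirical distributions. Running the solver once per candidate level and keeping the level with the smallest cost mirrors the selection of a minimum transport cost described in the text.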
  5. Experimentation
This section describes the experimental methodology and results of the study. This includes the experimental setting and parameters, the evaluation metrics employed, an analysis of the experimental results, and a comparison of the outcomes obtained with different GRSI models.
  5.1. Experimental Setting and Parameters
In this experiment, we aligned the Landsat 8 image with the ground truth data and cropped them into 1200 images of 224 × 224 px. The data were divided into training and testing sets at an 80%:20% ratio. We used six semantic segmentation models for geological remote sensing interpretation: FCN, DeepLabV3, OCNet, DANet, PSPNet, and AdvSemi-OCGNet. To ensure fairness in the experiment, all of these models used ResNet50 as the backbone network and were initialized from the official pre-trained models provided by PyTorch. In the comparative experiments, the geological remote sensing interpretation models were initialized with a learning rate of  and were trained for a total of 10,000 iterations with a batch size of 24. The optimizer was set to SGD. All of the experiments were conducted on a workstation equipped with an Intel i7 11700k CPU and an NVIDIA RTX 3090 GPU, and all code was written in Python.
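The cropping and splitting steps can be sketched as follows. The tile size and the 80%:20% ratio come from the text; the non-overlapping tiling, the random shuffling with a fixed seed, and the function names are assumptions, since the text does not state how the tiles were cut or assigned.

```python
import numpy as np

def crop_tiles(image, tile=224):
    """Split an (H, W, C) array into non-overlapping tile x tile patches.

    Incomplete tiles at the right and bottom edges are dropped (assumption).
    """
    h, w = image.shape[:2]
    return [
        image[r:r + tile, c:c + tile]
        for r in range(0, h - tile + 1, tile)
        for c in range(0, w - tile + 1, tile)
    ]

def split_indices(n, train_fraction=0.8, seed=0):
    """Shuffle n sample indices and split them into train and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n)
    n_train = int(round(n * train_fraction))
    return order[:n_train], order[n_train:]
```

Applying `crop_tiles` identically to the image and to its label map keeps each image tile aligned with its ground truth tile.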
Then, we analyzed the experimental parameters of the PSDOF method based on the interpretation results of the AdvSemi-OCGNet method:
1. Table 3 shows the grading of the characteristic points and the characteristic surface for the three study areas. After our analysis of the data from the different study areas, the levels were divided into 11 classes. The number of feature pixel points corresponding to each level of the feature surface showed a linear distribution, thereby reflecting the distribution of the geological elements.
2. The characteristic points and the characteristic surface are treated as discrete samples. The minimum transport cost of the OT model after domain fusion is shown in Table 4, which yields the following conclusion: the optimal transport costs correspond to level 2 in Area 1 and to level 7 in Areas 2 and 3. In addition, the GRSI products could be updated according to the optimal transport plans for each of the three areas.
  5.2. Evaluation Indicators
In this section, the evaluation metrics used to quantitatively assess the accuracy of our model are presented, followed by a discussion of the important parameter settings for the experimental area used in this research. Finally, a presentation and analysis of the experimental results is provided.
To evaluate the accuracy improvement of the GRSI product in our model, the evaluation metric used was the intersection over union [58], which can be derived from the confusion matrix.
Intersection over union (IOU): For each of the nine geological elements, the IOU is the ratio of the intersection to the union of the predicted and ground truth pixels of that class:
        where 
k indicates the category of the geological elements, 
 denotes the total number of true pixels of class 
i that are correctly predicted as 
i, and 
 denotes the total number of true pixels of class 
i that are incorrectly predicted as 
j. The accuracy of the model is measured by the IOU: the larger the IOU, the greater the overlap between the predicted pixels and the ground truth labels, which indicates a more accurate model.
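The per-class IOU and its mean (MIOU) can be computed from the confusion matrix as sketched below. This is a minimal example with hypothetical function names; it follows the standard definition in which rows of the confusion matrix are true classes and columns are predicted classes.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Entry [i, j] counts pixels of true class i that were predicted as j."""
    idx = num_classes * np.asarray(y_true).ravel() + np.asarray(y_pred).ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def iou_per_class(cm):
    """IOU_i = TP_i / (true_i + predicted_i - TP_i); NaN for absent classes."""
    tp = np.diag(cm).astype(float)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    with np.errstate(divide="ignore", invalid="ignore"):
        return np.where(union > 0, tp / union, np.nan)

def mean_iou(cm):
    """Mean IOU over the classes that appear in the labels or predictions."""
    return float(np.nanmean(iou_per_class(cm)))
```

For the nine geological elements, `num_classes` would be 9, and `y_true` and `y_pred` would be the flattened label maps of the ground truth and the interpretation product.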
  5.3. Effectiveness Assessment
We applied PSDOF to the AdvSemi-OCGNet interpretation products of the remote sensing data to evaluate the effect of PSDOF on the accuracy of the geological interpretation product. As shown in 
Figure 5, the geological element of the observation point was sandstone, and the classification result of the interpretation model in Area 1 was soil. The IOU accuracy of the sandstone was only 
, and the interpretation accuracy was poor. Our PSDOF combines the geological element information around the observation point (red rectangular boxed area) into the interpretation of the product. It utilizes this information to update the optimal interpretation boundary based on the minimum transport cost, thus resulting in a more-accurate GRSI product. Based on the accuracy assessment data in 
Table 5, the IOU of the updated interpreted product metamorphic sandstone in Area 1 improved by 
. Moreover, as can be seen in 
Table 6, there was an overall improvement of 
 in the MIOU of its nine geological elements.
In Area 2, the geological element of the observation site (the red rectangular boxed area) was volcanic debris, and the GRSI product identified only a small amount of volcanic debris within this area (with the remainder being classified as extensive glaciation). According to 
Table 5, the IOU accuracy for volcanic debris in the area was 
 (which improved to 
 after our PSDOF model), and the classification accuracy of the volcanic debris improved by 
. According to 
Table 6, the overall MIOU of the nine geological elements improved by 
.
The information gathered from the observation points (within the red rectangular boxed area) in Area 3 suggests a significant presence of granite. Although the geological interpretation products yielded a relatively clear granite–soil boundary, we estimated that it can still be optimized. 
Table 5 displays the outcomes, where the granite IOU within the observation site improved from 
 to 
 (with an improvement in the classification accuracy by 
). According to 
Table 6, the MIOU of the nine geological elements in Area 3 improved by 
. The results demonstrated the correctness of our optimization scheme in Area 3.
  5.4. Comparative Experiments
To verify the accuracy improvement that PSDOF brings to different DL-based GRSI products, GRSI was carried out with the following classical and representative semantic segmentation models: FCN, DeepLabV3, DANet, OCNet, and PSPNet. PSDOF was then applied to the products of each model in the three study areas, and the visual results are shown in Figure 6, Figure 7, and Figure 8:
1. FCN and PSDOF: As shown in Figure 6, the processed Landsat 8 satellite images were fed into an FCN interpretation model to classify the nine geological elements at the pixel level and obtain the interpretation products. The interpretation accuracy for all three geological elements was unsatisfactory. The sandstone and volcanic debris in the red boxes in the figure were located, but with inaccurate boundaries (the specific IOU accuracy can be seen in Table 7). The area that should be granite was classified as slate instead, which was completely inconsistent with the observations (the annotated diagram also supports this). As the visual results show, the PSDOF method correctly repositioned the three lithologies in the vicinity of the observation point. It made full use of the high-precision spatial information of the observation point data and achieved a higher positional accuracy.
2. DeepLabv3 and PSDOF: Figure 6 shows the interpretation products obtained from the DeepLabv3 semantic segmentation model for the remote sensing images of the three study areas. Unlike the classical FCN model, the DeepLabv3 model uses an improved atrous spatial pyramid pooling (ASPP) approach. It demonstrated better performance in pixel-level classification and thus outperformed the FCN model overall. However, as can be seen from Figure 7, its results were no better than those of the FCN model for the boundary localization of sandstone, granite, and volcanic debris, and the accuracy needs to be improved. Our experimental results demonstrated that PSDOF effectively improved the accuracy of the GRSI products generated by the DeepLabv3 model: within the observation point range, the boundaries of the three lithologies were correctly relocated.
3. DANet and PSDOF: The dual-attention network (DANet) has a position attention module to capture spatial information and a channel attention module to integrate the relevant features between all channel maps; overall, it outperformed the FCN network. However, DANet performed poorly in extracting the following three lithological features: sandstone, granite, and volcanic debris. As shown in Figure 7, the metamorphic sandstone and volcanic debris in the red boxes of the interpretation results were located, but granite was incorrectly located, and the segmentation accuracy of all three geological elements was not high. After applying PSDOF, the accuracy of the three geological elements in the product improved.
4. OCNet and PSDOF: The OCNet model is an object-context network for scene parsing based on semantic aggregation: instead of predicting pixel by pixel, it aggregates similar pixels and then segments them semantically. The experimental results in the three study areas showed that the classification of sandstone, granite, and volcanic debris, as shown in Figure 7, was poor. After classification, the three geological elements showed a fragmented and scattered distribution with blurred boundaries. In contrast, PSDOF incorporated the complementary observation point information into the interpretation product and relocated the boundaries of the geological elements at a local scale. The experimental results also showed that PSDOF can perform well for geological elements with poor initial accuracy.
5. PSPNet and PSDOF: PSPNet is a scene-parsing network built on the pyramid pooling module; it outperforms traditional FCNs, and its overall performance was relatively good in this experiment. The PSPNet model in the study area had a good feature extraction capability for sandstone, as shown in Figure 8, and it ranked high in overall performance for the classification of granite and metamorphic sandstone. However, there was a large error in the boundary positioning of the three geological elements within the observation point range, which PSDOF successfully corrected. The resulting interpretation products after processing also confirmed the good performance of PSDOF.
To represent more accurately the improvement that PSDOF brings to the different geological interpretation models, the IOU accuracy of the individual geological elements was ascertained in each of the three study areas, as presented in Table 7, Table 8, Table 9, Table 10 and Table 11. The five DL interpretation frameworks performed differently for different geological elements, and the features of granite, sandstone, and volcanic debris were found to be more difficult to learn. The main reason is that granite and volcanic clasts are both magmatic rocks with similar morphological structures in the image. Sandstone, metamorphosed granite, and fine-grained granite are likewise difficult to distinguish because they look similar. Across the three study areas, the PSPNet model was the most accurate in classifying sandstone, and the OCNet model was the least accurate. For volcanic debris, the DANet model had the best IOU accuracy, while the OCNet model was less effective. For granite, all five models had similar IOU accuracy, and none of them showed satisfactory performance.
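For readers who want to reproduce this kind of per-element comparison, the following is a minimal sketch of how a per-class IOU can be computed from a predicted label map and a reference label map. It is not the evaluation code used in this study: the function name `per_class_iou`, the integer class encoding, and the toy arrays are assumptions made for illustration only.

```python
import numpy as np

def per_class_iou(pred, truth, num_classes):
    """Return one IoU value per class (hypothetical helper, not from the paper).

    pred, truth: integer label maps of identical shape with values in
    [0, num_classes). Classes absent from both maps get NaN.
    """
    pred = np.asarray(pred).ravel()
    truth = np.asarray(truth).ravel()
    # Confusion matrix: rows are reference classes, columns are predicted classes.
    cm = np.bincount(truth * num_classes + pred,
                     minlength=num_classes ** 2).reshape(num_classes, num_classes)
    intersection = np.diag(cm).astype(float)
    # |A or B| = |A| + |B| - |A and B| for each class.
    union = cm.sum(axis=0) + cm.sum(axis=1) - intersection
    iou = np.full(num_classes, np.nan)
    np.divide(intersection, union, out=iou, where=union > 0)
    return iou

# Toy 2 x 3 label maps with three classes present (illustrative values only).
toy_truth = [[0, 0, 1], [1, 2, 2]]
toy_pred = [[0, 1, 1], [1, 2, 0]]
toy_iou = per_class_iou(toy_pred, toy_truth, num_classes=4)
```

In this toy case the three classes that are present obtain IOU values of 1/3, 2/3, and 1/2, and the fourth class, which appears in neither map, is reported as NaN rather than zero so that it does not distort an average.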
Our experimental analysis of the different geological interpretation models showed that different models had different feature-extraction performances for the same geological elements and that the same model had different feature-learning abilities for different geological elements. After PSDOF was applied to the interpretation products, the interpretation accuracy of all of the models improved for the different geological elements. The metamorphic sandstone showed the largest improvement in IOU accuracy in the local range, and the local IOU accuracy of the other two lithologies also improved. The experimental results showed that PSDOF was highly effective at improving the accuracy of the selected geological interpretation products.
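The distinction between local and overall IOU accuracy can be illustrated with a small sketch. It assumes that the "local range" is the set of pixels within a fixed radius of an observation point; the radius, the helper names `local_mask` and `class_iou`, and the toy label maps are all hypothetical, and the "after" map simply imitates a local boundary correction rather than running PSDOF.

```python
import numpy as np

def local_mask(shape, points, radius):
    """Boolean mask of pixels within `radius` pixels (Euclidean) of any point."""
    rows, cols = np.indices(shape)
    mask = np.zeros(shape, dtype=bool)
    for r, c in points:
        mask |= (rows - r) ** 2 + (cols - c) ** 2 <= radius ** 2
    return mask

def class_iou(pred, truth, cls, mask=None):
    """IoU of one class, optionally restricted to pixels where mask is True."""
    pred = np.asarray(pred)
    truth = np.asarray(truth)
    if mask is None:
        mask = np.ones(truth.shape, dtype=bool)
    p = (pred == cls) & mask
    t = (truth == cls) & mask
    union = np.logical_or(p, t).sum()
    return np.logical_and(p, t).sum() / union if union else float("nan")

# Toy 5 x 5 scene: class 1 occupies the two right-hand columns.
truth = np.zeros((5, 5), dtype=int)
truth[:, 3:] = 1
# "Before": the predicted boundary is shifted one column to the left.
before = np.zeros((5, 5), dtype=int)
before[:, 2:] = 1
# Assumed observation point at the centre with a one-pixel local range.
mask = local_mask(truth.shape, points=[(2, 2)], radius=1)
# "After": imitate a local correction by copying the reference inside the mask.
after = before.copy()
after[mask] = truth[mask]

local_before = class_iou(before, truth, 1, mask)
local_after = class_iou(after, truth, 1, mask)
global_before = class_iou(before, truth, 1)
global_after = class_iou(after, truth, 1)
```

In this toy setting the local IOU of class 1 rises from 0.25 to 1.0, while the whole-image IOU rises only from 10/15 to 10/12. This shows why a correction confined to the neighbourhood of the observation points is best assessed with a locally restricted metric.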
6. Conclusions and Outlook
In this paper, we proposed the use of additional point data to enhance GRSI products and thereby address the problem of their limited accuracy. A multimodal data-fusion framework based on an optimal transport model, which is capable of fusing heterogeneous data at minimal cost, was presented. Taking the GRSI product for a local area of the Pamir Plateau as an example, we addressed the problem of fusing geological footprint data with interpretation data. The model uses gray-scale correlation features in the fusion task. The experiments showed that our fusion framework successfully fused the geospatial location information and lithological information of the point data with the GRSI products, thereby yielding products of higher quality.
The PSDOF method still showed certain limitations in the experiment. Due to the sparse distribution of geological survey data and the limited amount of data, the experimental study area was confined to a local region. Moreover, the fusion effect was constrained by the sample size of the point data. While the PSDOF method had a low computational cost in local regions, increasing the number of parameters will lead to a higher computational cost.
In future research, we will consider using a greater amount of multimodal data alongside remote sensing data, such as social media big data and statistical data, to take full advantage of the information from different data sources and obtain a higher-quality GRSI product. In addition, our model was only examined experimentally for geological remote sensing mapping; in the future, we will therefore consider more application scenarios, such as agriculture, water bodies, wetlands, and cities, in order to combine the characteristics of different geographical elements and improve the generalization capability of the model.