#### 2.3.2. STDFA Model

The STDFA algorithm belongs to the class of spectral unmixing methods. It first classifies the high-resolution, low-temporal image of the known period using K-means (set to 5 classes in this paper) and uses Equation (7) to calculate the abundance of each class within each high-temporal, low-resolution pixel. A subregion is then defined with each high-temporal, low-resolution pixel as its center, the average reflectance of each class within the subregion is calculated by Formula (8), and this value is assigned to the high-resolution, low-temporal pixels of the corresponding class inside the central pixel [36,37,43].

$$f\left(X,c\right) = N(X,c)/m\tag{7}$$

In the formula, *f*(*X*, *c*) is the abundance of class *c* in the high-temporal, low-resolution pixel *X* in the known period; *N*(*X*, *c*) is the number of high-resolution, low-temporal pixels belonging to class *c* within pixel *X*; and *m* is the total number of high-resolution, low-temporal pixels contained in the high-temporal, low-resolution pixel *X*. We select the D high-temporal, low-resolution pixels with the highest abundance in each class, take the difference between these pixels in the known period and the predicted period, and then use the least squares method to fit, for each class, the change in reflectance of the high-resolution, low-temporal pixels.
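As an illustration, the abundance calculation of Equation (7) can be sketched in a few lines of NumPy. This is our own minimal example, not the authors' implementation, and it assumes the K-means class map aligns with the coarse grid at an integer scale factor; all names are illustrative.

```python
import numpy as np

def class_abundance(class_map, n_classes, scale):
    """Fraction f(X, c) of each class inside every coarse pixel (Equation (7)).

    class_map : 2-D array of K-means labels (0 .. n_classes - 1) at fine resolution.
    scale     : fine pixels per coarse pixel along one axis, so each coarse
                pixel contains m = scale * scale fine pixels.
    """
    rows, cols = class_map.shape[0] // scale, class_map.shape[1] // scale
    f = np.zeros((rows, cols, n_classes))
    m = scale * scale
    for i in range(rows):
        for j in range(cols):
            block = class_map[i * scale:(i + 1) * scale,
                              j * scale:(j + 1) * scale]
            counts = np.bincount(block.ravel(), minlength=n_classes)
            f[i, j] = counts / m          # N(X, c) / m for every class c
    return f
```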

$$X(t) = \sum\_{c=1}^{k} f(X, c) \times \overline{x}(c, t) \tag{8}$$

subject to the constraint:

$$\sum\_{c=1}^{k} f(X, c) = 1, \; f(X, c) \ge 0 \tag{9}$$

In the formula, *t* denotes either the prediction period *t*<sub>0</sub> or the known period *t<sub>k</sub>*; $\overline{x}(c, t)$ represents the average reflectance of class *c* in the high-temporal, low-resolution pixel *X*; and *k* is the total number of classes. We calculate the average reflectance of class *c* in the known period and the predicted period, respectively; then, through a surface reflectance calculation model (SRCM) based on Equation (10) [19], the high-resolution, low-temporal data of the final forecast period can be obtained.

$$x(c, t\_0) = \overline{x}(c, t\_0) - \overline{x}(c, t\_k) + x(c, t\_k) \tag{10}$$

In the formula, *x*(*c*, *t*<sub>0</sub>) and *x*(*c*, *t<sub>k</sub>*) represent the reflectance of the high-resolution, low-temporal pixels belonging to class *c* in the prediction period and the known period, respectively.
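To make the chain from Equation (8) to Equation (10) concrete, the sketch below (our own illustrative code, single band) recovers the class mean reflectances from the coarse pixels by ordinary least squares, in place of the constrained fit of Equations (8) and (9), and then applies the change in class means to the known-period fine image; it builds on the abundance array from the previous snippet.

```python
import numpy as np

def unmix_class_means(coarse_band, f):
    """Least-squares solution of Equation (8) for one band: recover the class
    mean reflectances x_bar(c, t) from the coarse pixels X(t).

    coarse_band : 2-D coarse-resolution reflectance at time t.
    f           : abundance array (rows, cols, n_classes) from class_abundance().
    """
    A = f.reshape(-1, f.shape[-1])     # one row of abundances per coarse pixel
    y = coarse_band.ravel()            # observed coarse reflectance X(t)
    x_bar, *_ = np.linalg.lstsq(A, y, rcond=None)
    return x_bar

def stdfa_predict(fine_tk, class_map, x_bar_t0, x_bar_tk):
    """Equation (10): add the per-class change in mean reflectance to every
    known-period fine pixel of that class."""
    delta = x_bar_t0 - x_bar_tk        # x_bar(c, t0) - x_bar(c, tk)
    return fine_tk + delta[class_map]  # x(c, t0) = x(c, tk) + delta for class c
```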

#### 2.3.3. Fit\_FC Model

The Fit\_FC algorithm performs spatiotemporal data fusion based on a linear model. It fits linear coefficients between the low-spatial-resolution, high-temporal-resolution data of the known period and of the predicted period, and then applies these coefficients to the high-spatial-resolution, low-temporal-resolution data of the known period [36,44,45]. Taking a high-temporal, low-resolution pixel *X* as the center, we define a neighborhood subregion with a size of 5 high-temporal, low-resolution pixels and fit the coefficients *a* and *b* of Formula (11).

$$X(t\_0) = a \times X(t\_k) + b \tag{11}$$

The initial prediction of the high-resolution, low-temporal data is obtained by applying the coefficients *a* and *b* to the high-resolution, low-temporal pixels corresponding to the central pixel *X* in the known period. The residual *R* is then obtained by Equation (12).

$$R = X(t\_0) - (a \times X(t\_k) + b) \tag{12}$$
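A minimal sketch of the moving-window fit of Equation (11) and the residual of Equation (12) might look as follows. It assumes single-band 2-D coarse-resolution arrays, simplifies the border handling, and the window parameter and all names are illustrative.

```python
import numpy as np

def fit_fc_coeffs(coarse_tk, coarse_t0, win=5):
    """Fit X(t0) = a * X(tk) + b in a win x win window around every coarse
    pixel (Equation (11)) and keep the residual R (Equation (12))."""
    rows, cols = coarse_tk.shape
    half = win // 2
    a = np.zeros((rows, cols))
    b = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            r0, r1 = max(0, i - half), min(rows, i + half + 1)
            c0, c1 = max(0, j - half), min(cols, j + half + 1)
            x = coarse_tk[r0:r1, c0:c1].ravel()
            y = coarse_t0[r0:r1, c0:c1].ravel()
            a[i, j], b[i, j] = np.polyfit(x, y, 1)   # slope a, intercept b
    R = coarse_t0 - (a * coarse_tk + b)              # Equation (12)
    return a, b, R
```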

In order to eliminate the "block effect" caused by fusing high- and low-resolution data, Formula (13) [13] is used to select similar neighborhood pixels around each high-resolution, low-temporal center pixel.

$$\sqrt{\sum\_{b=1}^{nb} \left(x\_b(t\_k) - x\_{\text{neigh},b}(t\_k)\right)^2 / nb} \tag{13}$$

In the formula, *nb* is the number of bands involved in the calculation, and *x<sub>b</sub>*(*t<sub>k</sub>*) and *x*<sub>neigh,*b*</sub>(*t<sub>k</sub>*) are the values of the high-resolution, low-temporal center pixel and of one of its neighbors in band *b* of the known period. The D pixels with the smallest distances are selected as similar pixels and are weighted according to their normalized distance from the central pixel. For the initially predicted high-resolution, low-temporal data, the central high-resolution, low-temporal pixel is first corrected by a weighted sum over the similar pixels. The residual *R* is then linearly interpolated to the resolution of the high-resolution, low-temporal data, and, using the same similar pixels and weights, the reflectance of the central high-resolution, low-temporal pixel is corrected again to obtain the final result.
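The similar-pixel screening of Equation (13) and the normalized distance weighting described above could be sketched as follows for a single center pixel; the window half-width `w`, the number of similar pixels `D`, and all names are illustrative, and the residual interpolation step is omitted.

```python
import numpy as np

def similar_pixel_weights(fine_tk, i, j, w=25, D=20):
    """Select the D most spectrally similar neighbors of fine pixel (i, j)
    using the RMS band distance of Equation (13), then weight them by
    normalized spatial distance to the center.

    fine_tk : fine-resolution known-period image of shape (rows, cols, nb).
    """
    nb = fine_tk.shape[-1]                        # number of bands
    r0, r1 = max(0, i - w), min(fine_tk.shape[0], i + w + 1)
    c0, c1 = max(0, j - w), min(fine_tk.shape[1], j + w + 1)
    patch = fine_tk[r0:r1, c0:c1].astype(float)
    # Equation (13): RMS spectral distance of every neighbor to the center
    spec = np.sqrt(((patch - fine_tk[i, j]) ** 2).sum(axis=-1) / nb)
    rc = np.dstack(np.mgrid[r0:r1, c0:c1]).reshape(-1, 2)  # neighbor coords
    order = np.argsort(spec.ravel())[:D]          # the D most similar pixels
    rc = rc[order]
    d = 1.0 + np.hypot(rc[:, 0] - i, rc[:, 1] - j)
    weights = (1.0 / d) / (1.0 / d).sum()         # normalized distance weights
    return rc, weights
```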

#### 2.3.4. Accuracy Evaluation

Using the real GF-2 image acquired on 13 July 2021 as the verification image, visual interpretation and correlation analysis were used to evaluate the accuracy of the fused images from both qualitative and quantitative perspectives. Visual interpretation directly assesses the similarity between the fused image and the real image and yields a preliminary judgment on the fusion accuracy of each model. The correlation analysis uses four evaluation metrics: average absolute deviation (*AAD*), root mean square error (*RMSE*), correlation coefficient (*CC*), and structural similarity (*SSIM*) [26,46,47]. These metrics quantitatively evaluate the similarity between the fused image and the real image.

*AAD* is used to measure deviation. The closer *AAD* is to 0, the smaller the deviation between the predicted value and the standard value.

$$AAD = \frac{1}{N} \sum\_{i=1}^{N} |P\_i - O\_i| \tag{14}$$

*RMSE* is used to measure the difference between images, and its value ranges from 0 to 1. The smaller the *RMSE*, the higher the accuracy.

$$RMSE = \sqrt{\frac{\sum\_{i=1}^{N} \left(P\_i - O\_i\right)^2}{N}} \tag{15}$$

*CC* can reflect the spectral similarity between images, and the closer *CC* is to 1, the higher the spectral similarity.

$$\text{CC} = \frac{\sum\_{i=1}^{N} \left( P\_i - \overline{P} \right) \left( O\_i - \overline{O} \right)}{\sqrt{\sum\_{i=1}^{N} \left( P\_i - \overline{P} \right)^2 \sum\_{i=1}^{N} \left( O\_i - \overline{O} \right)^2}} \tag{16}$$

*SSIM* can evaluate the structural similarity between images. The closer the *SSIM* is to 1, the greater the structural similarity between images.

$$SSIM = \frac{\left(2\overline{P}\,\overline{O} + C\_1\right)\left(2\sigma\_{PO} + C\_2\right)}{\left(\overline{P}^2 + \overline{O}^2 + C\_1\right)\left(\sigma\_P^2 + \sigma\_O^2 + C\_2\right)}\tag{17}$$

In Formulae (14)–(17), *N* is the total number of image pixels; *P<sub>i</sub>* and *O<sub>i</sub>* represent the *i*-th pixels of the predicted image and the observed image, respectively; $\overline{P}$ and $\overline{O}$ are the means of the fusion result and the observed image; $\sigma\_P^2$ and $\sigma\_O^2$ are their respective variances; and $\sigma\_{PO}$ is the covariance between the fusion result and the observed image. *C*<sub>1</sub> and *C*<sub>2</sub> are two small constants used to stabilize the result: generally *C*<sub>1</sub> = (*K*<sub>1</sub>*L*)<sup>2</sup> and *C*<sub>2</sub> = (*K*<sub>2</sub>*L*)<sup>2</sup>, with *K*<sub>1</sub> = 0.01, *K*<sub>2</sub> = 0.03, and *L* = 255 (the dynamic range of the pixel values).
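All four metrics are straightforward to compute; the sketch below evaluates Equations (14)–(17) globally over the whole image (rather than the windowed SSIM used by some image libraries), with *C*<sub>1</sub> and *C*<sub>2</sub> as given above. It is an illustrative example, not the evaluation code used in this study.

```python
import numpy as np

def fusion_metrics(P, O, K1=0.01, K2=0.03, L=255.0):
    """AAD, RMSE, CC, and a global SSIM between predicted P and observed O."""
    P = np.asarray(P, dtype=float).ravel()
    O = np.asarray(O, dtype=float).ravel()
    aad = np.mean(np.abs(P - O))                       # Equation (14)
    rmse = np.sqrt(np.mean((P - O) ** 2))              # Equation (15)
    cc = np.corrcoef(P, O)[0, 1]                       # Equation (16)
    mp, mo = P.mean(), O.mean()
    cov = np.mean((P - mp) * (O - mo))                 # covariance sigma_PO
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    ssim = ((2 * mp * mo + C1) * (2 * cov + C2)) / (
        (mp ** 2 + mo ** 2 + C1) * (P.var() + O.var() + C2))  # Equation (17)
    return aad, rmse, cc, ssim
```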

#### **3. Results**

In order to better evaluate the accuracy of the spatiotemporal fusion models under different landform types, this study selected three experimental areas for algorithm comparison. The first area lies on a land–water boundary, where the main landform types are land and water. The second area is mountainous, with roads, buildings, and farmland as the main landform types. The third area is urban, dominated by buildings and roads. Studying these three terrain types provides more accurate support for applying spatiotemporal fusion algorithms to different landform types.

#### *3.1. The Accuracy of Land–Water Boundary*

The original PS low-resolution images (Figure 2a,b) are blurry: water and land can only be roughly identified, and no finer detail is available. In contrast, the spatiotemporal fusion results of the FSDAF, STDFA, and Fit\_FC models show that FSDAF clearly identifies the water surface, land, shoals, and similar features, with clear ground object contours (Figure 2c). The fusion result of the STDFA model can also distinguish the water surface from the land, but it contains large color spots that hinder identification of the water–land boundary (Figure 2d). The fusion result of the Fit\_FC model is very poor: compared with the original image, it loses a large amount of detail and cannot effectively identify the land–water boundary (Figure 2e). Therefore, for the land–water boundary area, the FSDAF model has the best fusion effect, followed by the STDFA model, and the Fit\_FC model performs worst.

**Figure 2.** (**a**) PS image on 15 April; (**b**) PS image on 10 July; (**c**) fusion image by FSDAF on 10 July; (**d**) fusion image by STDFA on 10 July; (**e**) fusion image by Fit\_FC on 10 July; (**f**) GF-2 verification image on 13 July.

As can be seen from Table 2, for all four bands, the images fused by the FSDAF model correlate well with the validation image, with correlation coefficients all higher than 0.6. Compared with the STDFA and Fit\_FC models, the mean CC increased by 0.0889 and 0.3055, respectively, indicating that the FSDAF fusion image has higher spectral similarity with the validation image. At the same time, the SSIM values of the FSDAF and STDFA models are both greater than 0.7, indicating that the fusion images of both models have good structural similarity with the validation image. Except in the near-infrared band, the FSDAF model has the highest SSIM, and its average is 0.0077 and 0.0637 higher than those of the STDFA and Fit\_FC models, respectively, indicating the best structural similarity. For the Fit\_FC model, the RMSE values of all four bands and the AAD values of the blue, green, and red bands are higher than those of the FSDAF and STDFA models, with averages of 0.1347 and 0.1028, respectively. Compared with the other two models, its average RMSE is higher by 0.037 and 0.036, and its average AAD by 0.0155 and 0.0148, respectively, indicating that the Fit\_FC fusion image deviates considerably from the validation image. The statistical results show that the fusion images of the FSDAF and STDFA algorithms in study area 1 are much better than those of the Fit\_FC algorithm, which is consistent with the direct visual impression.


**Table 2.** Fusion accuracy evaluation of different bands for each model in the land–water boundary area (study area 1).

#### *3.2. The Accuracy of Mountains*

Study area 2 is mainly mountainous. The ground objects in the base image acquired on 15 April 2021 are mainly cultivated land, buildings, and roads; in the image acquired on 10 July 2021, the original cultivated land has undergone changes in crop coverage (Figure 3a,b). The original PS image can roughly identify the ground objects, but its resolution is still somewhat insufficient for more detailed information. All three spatiotemporal fusion models improve the depiction of ground objects, and the identification of ground object information is clearly more accurate. The FSDAF and STDFA algorithms produce fusion images of higher resolution, but these images respond poorly to the changes in crop coverage (Figure 3c,d). In general, the fusion image and the verification image should have similar spectra, yet the colors of these two models in the crop-covered area differ considerably from the verification image. Nevertheless, we were still able to distinguish vegetated areas by color comparison. In terms of resolution, both models clearly display the spatial structure of ground objects; in particular, the land structure in the vegetation-covered area can be observed well. For the Fit\_FC model, compared with the original PS image, the spatial resolution of the fusion image is also improved to a certain extent, the contours of different object types are clearer, and the changes in crop coverage are displayed better (Figure 3e). However, its resolution in specific spatial details is slightly lower than that of the other two algorithms.

**Figure 3.** (**a**) PS image on 15 April; (**b**) PS image on 10 July; (**c**) fusion image by FSDAF on 10 July; (**d**) fusion image by STDFA on 10 July; (**e**) fusion image by Fit\_FC on 10 July; (**f**) GF-2 verification image on 13 July.

From the statistical analysis results (Table 3), the fusion image of the Fit\_FC model correlates well with the verification image. With the exception of the red band, the CC value of the Fit\_FC model is higher than those of the FSDAF and STDFA models. Its average CC is 0.7138, which is 0.0605 and 0.0166 higher than the FSDAF and STDFA models, respectively, indicating higher spectral similarity between the fusion image and the verification image. At the same time, the SSIM of the Fit\_FC model in the blue, green, and red bands is also higher than those of the FSDAF and STDFA models, with an average of 0.6641, which is 0.0434 and 0.0287 higher than the other two models, respectively, indicating higher structural similarity as well. In all four bands, the RMSE of the FSDAF model is the highest and that of the Fit\_FC model is the lowest, with averages of 0.0791 (FSDAF), 0.0052 (STDFA), and 0.0038 (Fit\_FC), indicating that the FSDAF fusion image differs from the validation image more than the others. Regarding AAD, the FSDAF and STDFA models are markedly higher than the Fit\_FC model in the blue, green, and red bands, but slightly lower in the near-infrared band. The average AAD values of the FSDAF, STDFA, and Fit\_FC models are 0.0382, 0.0381, and 0.0291, respectively, indicating that the Fit\_FC fusion image is the least biased. The statistical results show that the Fit\_FC model has the best fusion effect in mountainous areas, and the FSDAF algorithm the worst.


**Table 3.** Fusion accuracy evaluation of different bands for each model in the mountainous area (study area 2).

#### *3.3. The Accuracy of Urban Areas*

Study area 3 is mainly urban, and the ground object types are mainly urban buildings, construction land, roads, and green vegetation. Between the April and July images of this area, with the exception of some areas where land use changed (marked in yellow), the features changed little (Figure 4a,b). The original PS image can identify different object types, but the outlines between buildings are relatively blurred. Direct observation of the fused images shows that, compared to the PS image, all three models improve the spatial resolution to a certain extent and can restore areas partially covered by shadows (red border) (Figure 4c–e). Among them, the fusion images of the STDFA and FSDAF models have higher resolution, whereas Fit\_FC blurs the building outlines and shows a more obvious block effect. For the two land use changes in the image, the fusion results of the STDFA algorithm cannot reflect these clearly; FSDAF is also extremely blurry, mainly following the original base image, making it difficult to identify the changes effectively. In comparison, the fusion image of the Fit\_FC algorithm better reflects the difference between the base image and the predicted image and is more similar to the verification image, although it does not achieve excellent results in recognizing building outlines.

**Figure 4.** (**a**) PS image on 15 April; (**b**) PS image on 10 July; (**c**) fusion image by FSDAF on 10 July; (**d**) fusion image by STDFA on 10 July; (**e**) fusion image by Fit\_FC on 10 July; (**f**) GF-2 verification image on 13 July.

According to the statistical analysis results (Table 4), the CC values of the three models are generally in the range of 0.5–0.6, with small differences between bands; the FSDAF model has the highest CC in the blue, green, and red bands, and the Fit\_FC model in the near-infrared band. The mean CC values of the FSDAF, STDFA, and Fit\_FC algorithms are 0.5434, 0.5067, and 0.5362, respectively, indicating that the spectral similarity between the fused image and the validation image is highest for FSDAF and lowest for STDFA. The SSIM gaps among the FSDAF, STDFA, and Fit\_FC models are small, with averages of 0.7257, 0.7072, and 0.7323, respectively, and the Fit\_FC model has the highest SSIM in the green, red, and near-infrared bands, indicating better structural similarity between its fusion image and the verification image. The RMSE of the STDFA model is higher than those of the FSDAF and Fit\_FC models in the green, red, and near-infrared bands, with model averages of 0.0040, 0.0036, and 0.0036, respectively, indicating that the STDFA fused image has the larger error. At the same time, the average AAD values of the FSDAF, STDFA, and Fit\_FC models are 0.0075, 0.0058, and 0.0038, respectively, indicating that the FSDAF fusion image has the largest deviation. In general, although the FSDAF model has the highest AAD, its fusion image still has a good prediction effect.


**Table 4.** Fusion accuracy evaluation of different bands for each model in the urban area (study area 3).

#### **4. Discussion**

Different spatiotemporal fusion models perform differently in karst areas. In order to select a more appropriate high-resolution data fusion model for different scenarios and needs, this paper directly observes and statistically analyzes the fusion results of the three models over different karst landforms and discusses their application in karst areas.

#### *4.1. FSDAF in Different Regions*

The fusion images of the FSDAF model improve the resolution of the original image and the classification accuracy of surface land use in all three geomorphic types (land–water boundary, mountainous area, and urban area). The FSDAF model performs particularly well at the land–water boundary: it not only clearly identifies the water boundary but also effectively identifies features such as shoals. The FSDAF model can therefore be used for the spatiotemporal fusion of lakes, oceans, rivers, and other water bodies, and can accurately extract water boundaries. This advantage is valuable in flood relief, remote monitoring of dammed lakes, and other practical applications. In mountainous areas and other areas with large seasonal changes in vegetation cover, however, there is a large color difference between the FSDAF fusion image and the verification image, and the vegetation coverage cannot be restored well. The FSDAF model effectively improves the resolution of urban areas: it identifies building outlines more effectively and better displays the parts of the original image blocked by shadows. However, as in mountainous areas, the model cannot adequately show land type changes in built-up areas. This may be because the FSDAF model mainly uses spatial prediction to retrieve pixel changes. Theoretically, spatial prediction can truly describe the surface information of the predicted date, and the signals of land cover type change and local variation are retained in the fusion results [26]. In practice, however, the error of FSDAF mainly depends on the residual distribution under the assumption of surface uniformity. The FSDAF model thus preserves more detailed information through this strategy, but this limits its ability to retrieve land cover changes.

#### *4.2. STDFA in Different Regions*

The STDFA model also fuses the images of different landforms in karst areas well and can effectively improve image resolution. It has good recognition ability at the land–water interface, but the accuracy of its fusion results is reduced by the "patches" that appear in the predicted image. In contrast, the resolution of the STDFA fusion image in mountainous areas is very high and better identifies the structural information of ground objects; however, its fusion accuracy is lower than that of the Fit\_FC model in areas with large changes in vegetation cover, such as cropland. Based on the statistical data, the CC and SSIM of each band of the STDFA model are lower than those of the FSDAF model, while its RMSE is higher, indicating that in urban areas the fusion accuracy of the STDFA model is lower than that of the FSDAF model. The STDFA model is a spatiotemporal fusion model based on the unmixing method, and its fusion accuracy depends on two aspects. On the one hand, the STDFA model needs to classify the high-resolution data of the base period, and the limited classification accuracy of unsupervised methods (such as K-means) reduces the fusion accuracy. On the other hand, when the resolution difference between the high-resolution, low-temporal data and the high-temporal, low-resolution data is large, the area represented by each high-resolution pixel becomes more refined; when the abundance of a certain class within a high-temporal, low-resolution pixel is very low, the fitting error increases [48,49].

#### *4.3. Fit\_FC in Different Regions*

The fusion accuracy of the Fit\_FC model differs considerably among the geomorphic types of the karst area. Its fusion effect is poor at the land–water interface: compared with the original low-spatial-resolution data of 10 July 2021, the spatial resolution is not significantly improved, and the boundary between water and land is difficult to identify. The statistics confirm this extremely low fusion accuracy, with the correlation coefficient of the green band as low as 0.12466. The Fit\_FC model is therefore not suitable for image fusion at the land–water interface. In mountainous areas, the Fit\_FC model produces good fusion results: the spatial resolution of the image is improved, and changes in vegetation cover, such as crops, are presented better. It is well suited to spatiotemporal fusion in areas with large vegetation changes and improves the accuracy of land cover classification, so the Fit\_FC model can be given priority for vegetation dynamic monitoring and land use change studies. However, the fusion results of the Fit\_FC model have lower resolution than the FSDAF and STDFA models for the contours of ground objects such as roads and buildings, and in densely built areas the resolution gap is larger. Moreover, the Fit\_FC model fits the high-temporal, low-resolution data of the known and predicted periods at the pixel scale and directly applies the fitting coefficients to the high-resolution, low-temporal data; when the difference between the high-resolution and low-resolution data is large, the results show an obvious "block effect" [28,50].

#### *4.4. Statistical Precision Analysis*

Regarding the fusion effects of the three models in the different regions, direct observation shows generally good resolution, but the statistical accuracy is clearly lower than that achieved when fusing medium- and low-resolution data. This is because both datasets used here have high resolution and can capture small changes in ground objects. In particular, since the prediction data and the verification data are three days apart, ground objects such as vehicles differ slightly, and some land types also change. At the same time, in the spatiotemporal fusion of high-resolution images, illumination strongly affects accuracy: differences in acquisition time, satellite viewing angle, and solar incidence angle cause the shadow areas of the base image and the verification image to differ, which affects the statistical analysis. Under the combined effect of these factors, the fusion accuracy of the three algorithms for high-resolution data is lower than that of the same models for medium- and low-resolution data. In general, however, the fused images have higher resolution and much greater recognizability, and thus higher practical application value.
