1. Introduction
The widespread application of remote sensing (RS) technology has enabled the public to easily access RS image data from open sources and apply it across various fields, such as urban expansion [1,2,3], scene classification [4,5], and land use detection [6,7,8]. However, with the continuous development of image editing software and deep learning, it has become easy for individuals to modify natural images, including satellite images, leading to significant trust issues and security risks for the public [9].
Current mainstream image manipulation localization (IML) methods focus primarily on natural images, performing pixel-level detection for common tampering techniques such as copy-move, splicing, and inpainting [10]. These methods work by locating the edge artifacts of tampered regions and training models to learn the distribution of these edges. However, these methods are not entirely applicable to RS imagery. As shown in Figure 1, examples of three tampering methods in RS imagery illustrate several challenges. Firstly, the acquisition methods for RS images differ significantly from those of natural images, including sensor types, spectral channels, and post-processing techniques. For example, forests may appear lush green in summer but reveal more bare ground in winter. Similarly, the same area may exhibit different spectral characteristics due to variations in spectral reflectance. Secondly, RS images encompass diverse types of terrain and objects. Tampering artifacts at edges in mountainous or forested areas, for instance, can be easily masked by the natural textures of the terrain itself, increasing the difficulty of accurately locating edges.
Research on remote sensing image manipulation localization (RSIML) remains limited: the primary detection principle involves training networks to learn the distribution [11] and mapping relationships of objects [12,13] in remote sensing imagery, identifying regions with inconsistent distributions within a single image. However, these methods still face unresolved challenges and opportunities within RSIML.
(1) Existing studies focus predominantly on detecting conspicuous tampering techniques such as the splicing shown in Figure 1, which has clear boundaries and significant conflicts in object distributions and is thus relatively easy to detect. In contrast, the other two tampering methods exhibit greater concealment, seamlessly blending altered content with the scene and lacking obvious conflicts in object distributions, making them challenging to pinpoint using current detection strategies.
(2) Current research has focused on detecting tampered regions directly from single-source images, neglecting the distributional consistency that exists across multiple sources. Integrating heterogeneous images can thus enable mutual verification. However, it is worth noting that heterogeneous images cannot be directly aligned. For example, the Google Map, Tianditu, and Yandex Map services come from different countries, and their reference coordinate systems differ from each other (WGS84, CGCS2000, and Pseudo-Mercator, respectively) according to the corresponding national policies. The offsets and division standards of the maps are also inconsistent. This leads to the challenges of point offset and resolution difference when aligning heterogeneous images based on geographic coordinates, which limits the effective application of heterogeneous images. Furthermore, it is essential to distinguish genuine tampering from inconsistencies caused by real changes and by variations in lighting or seasonal effects, as shown in Figure 2.
(3) Existing RSIML methods rely on proprietary datasets for training and testing, and the lack of open-source data and code poses significant challenges to the reproducibility and transferability of these methods.
Based on the discussion above, relying solely on neural networks trained to learn object distributions or artifact edges is insufficient for addressing the various manipulation methods in RSIML tasks. Therefore, we consider using comparative verification with multi-source remote sensing imagery to identify areas in the images that have changed due to tampering. However, it should be noted that directly applying change detection methods to IML still poses challenges. Existing change detection algorithms focus on improving the model’s ability to capture subtle changes and the recognition accuracy of change areas in complex scenes. They remain limited in processing heterogeneous images, in excluding unrelated factors such as illumination and phenology, and in object-level semantic understanding. These limitations result in the detection of pseudo-changes and a decrease in accuracy, so current change detection methods have difficulty distinguishing the three types of inconsistencies at a finer level. In fact, in RS imagery, tampered areas usually exhibit significant semantic differences, such as buildings being replaced with farmland. This characteristic can be leveraged to reduce algorithm complexity and improve accuracy without requiring overly detailed and refined learning of local features.
Here, we construct a high-precision heterogeneous satellite image manipulation localization (HSIML) framework to realize this proposal. It aims to detect potential image manipulation in optical RS images. It conducts detection at the patch level and can be divided into three parts:
(1) Heterogeneous Image Preprocessing Module: The projection coordinate systems and resolutions of heterogeneous images often differ, so these images need to be geographically aligned and subjected to certain preprocessing operations; for example, water areas such as oceans and rivers are treated as noise data and removed.
(2) Feature Point Constraint Module: Tampered images exhibit significant semantic differences, allowing the efficient determination of changes based on the quantity of feature points. Moreover, feature points offer high stability, enabling them to effectively mitigate variations in lighting, seasonal effects, and color channel disparities. This module extracts and matches feature points from the images, applying filtering rules to conduct an initial screening to identify candidate tampered patches.
(3) Semantic Similarity Measurement Module: The previous module discriminates based on the number of feature points and matched feature point pairs. However, when both the number of extracted feature points and the number of matched pairs are moderate, the confidence of a threshold-based judgment is difficult to determine. In this case, we believe that determining whether the content of the two images is consistent must return to the semantic information of the ground objects themselves. At this stage, we do not need to identify what type of ground object is present; we only need to judge consistency from the similarity of the semantic features of the ground objects. Therefore, this module uses DINOv2 as the feature extraction backbone network and designs a classification network based on the saliency of features in remote sensing images. The module first performs feature classification and similarity measurement on the input images. Subsequently, based on predefined classification rules, it integrates the determined patch categories and image-pair similarities to identify tampered regions. In addition, we create a tampering detection dataset to alleviate the current scarcity of RS image datasets.
The remainder of this manuscript is organized as follows: Section 2 introduces tampering detection methods in remote sensing imagery. Section 3 describes the workflow and principles of the proposed method. Section 4 presents the study area and the tampering detection dataset proposed in this study, accompanied by a comprehensive experimental analysis and discussion of the proposed method. Section 5 gives our conclusion.
3. Methods
The proposed HSIML framework can be broadly divided into three parts: the heterogeneous image preprocessing module, the feature point constraint module, and the semantic similarity measurement module. The details of the model are illustrated in Figure 3 and will be discussed further in the following subsections.
3.1. Heterogeneous Image Preprocessor
The proposed method requires two sources of images for the IML process: a tampered image to be examined and a reference comparison image. However, these two types may differ regarding reference coordinate systems and image resolutions, necessitating rigorous geographic registration before proceeding with detection. Additionally, in practical applications, images covering natural water bodies such as oceans and lakes exhibit significant pseudo-changes, as the features between calm and turbulent water surfaces can vary markedly, leading to potential interference for the model.
To address these challenges, we designed a heterogeneous image preprocessor, which provides the detector and discriminator with high-quality data, enhancing feature-matching accuracy and ensuring the model fully utilizes its capabilities. It comprises two steps: geographic registration and water shielding.
For geographic registration, we first apply projection coordinate transformation to unify the coordinate systems and establish accurate spatial correspondences. Subsequently, we perform bilinear interpolation resampling on the harmonized images to obtain strictly registered images with consistent resolution.
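The resampling step above can be sketched as follows; this is a minimal single-band NumPy illustration (the function name and interface are ours, not part of the released code), assuming the projection coordinate transformation has already been applied:

```python
import numpy as np

def bilinear_resample(img, out_h, out_w):
    """Resample a single-band image to (out_h, out_w) with bilinear interpolation."""
    in_h, in_w = img.shape
    # Map each output pixel back to fractional input coordinates.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.clip(y0 + 1, 0, in_h - 1)
    x1 = np.clip(x0 + 1, 0, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Weighted sum of the four neighbouring input pixels.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy
```

In practice a geospatial library would perform this together with the coordinate transform; the sketch only shows how consistent resolution is obtained.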
For water shielding, we collect wave-like texture features of oceans, beaches, rivers, and lakes from RS images. A K-nearest neighbor classifier [38] is trained on these features and used to shield the highly distinctive wave-like regions during the image change detection process, reducing interference.
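As a sketch of this step, the classifier can be as simple as a majority-vote KNN over texture feature vectors; the feature design, function names, and label convention below are illustrative assumptions, not the exact implementation:

```python
import numpy as np

def knn_predict(train_feats, train_labels, query_feats, k=3):
    """Classify texture feature vectors by majority vote among the k nearest
    training samples (Euclidean distance). Label 1 = water-like texture."""
    preds = []
    for q in query_feats:
        d = np.linalg.norm(train_feats - q, axis=1)
        nearest = train_labels[np.argsort(d)[:k]]
        preds.append(np.bincount(nearest).argmax())
    return np.array(preds)

def shield_water(patch_feats, train_feats, train_labels, k=3):
    """Return indices of patches to keep (those not classified as water)."""
    labels = knn_predict(train_feats, train_labels, patch_feats, k)
    return np.where(labels == 0)[0]
```

Shielded patches are simply excluded from the downstream detector rather than being marked as changed.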
3.2. Feature Point Constraint Module
Due to the complexity of features in RS images, traditional feature point detection algorithms, such as SIFT, often yield a significant number of redundant or insignificant features. To address this issue, this study integrates the SuperPoint [33] algorithm into a feature point constraint module. First, feature points are extracted from the heterogeneous RS images. Based on the extraction results, feature point matching is conducted, and erroneous match pairs are filtered using threshold constraints. The quantity of matched feature point pairs is then used to assess the presence of tampering in the patches. Cases that cannot be distinguished are handled by the next module. The overall workflow of the module is illustrated in Figure 4.
The model takes heterogeneous images as input. The image encoder consists of three sets of convolutional layers, followed by max-pooling layers and ReLU activation functions. The descriptor for interest points draws from the UCN [39] model, initially outputting a semi-dense grid of descriptors, which are then processed using bicubic interpolation and normalized to unit length through L2 normalization to obtain the interest point descriptors. Subsequently, we employ brute-force matching to pair feature points from the heterogeneous images. Due to the inherent complexity of the images, there remains a substantial number of irrelevant points that require further elimination.
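The brute-force matching step can be sketched as a nearest-neighbour search over L2-normalized descriptors (a minimal NumPy illustration; the SuperPoint network itself is not reproduced here):

```python
import numpy as np

def l2_normalize(desc):
    """Scale each descriptor row to unit L2 length."""
    return desc / np.linalg.norm(desc, axis=1, keepdims=True)

def brute_force_match(desc_a, desc_b):
    """For each descriptor in A, return the index of its nearest descriptor
    in B and the corresponding L2 distance."""
    # Full pairwise distance matrix between descriptors of A and B.
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(desc_a)), idx]
```

For unit-length descriptors, minimizing L2 distance is equivalent to maximizing the dot product, so either formulation may be used.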
Consider two RS images A and B; a matching point pair is composed of the x and y image coordinates from images A and B, together with the matching distance:

$p_i = \left\{ (x_i^A, y_i^A), (x_i^B, y_i^B), d_i \right\}, \qquad d_i = \sqrt{(x_i^A - x_i^B)^2 + (y_i^A - y_i^B)^2}$

where $p_i$ is successfully matched point pair $i$; $(x_i^A, y_i^A)$ are the coordinates of the matched point in image A; $(x_i^B, y_i^B)$ are the coordinates of the matched point in image B; $y_i^A$ and $y_i^B$ are the heights (row coordinates) of the corresponding points; $x_i^A$ and $x_i^B$ are the widths (column coordinates) of the corresponding points; and $d_i$ is the distance of the matched point pair. The distance threshold r is introduced to filter the matching results further. Ideally, after registration, the pixel coordinates of matching point pairs in the two registered images should be identical or fall within an acceptable buffer zone. Therefore, this study constructs a buffer zone with radius r. When the pixel coordinates of the matching points fall within this buffer zone, the match is regarded as correct.
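The buffer-zone filtering can be sketched as follows (a minimal illustration assuming the matched pairs are given as coordinate arrays; the function name is ours):

```python
import numpy as np

def filter_matches(pts_a, pts_b, r=3.0):
    """Keep matched pairs whose pixel offset between the two registered
    images falls within a buffer zone of radius r.

    pts_a, pts_b: (N, 2) arrays of (x, y) pixel coordinates of matched pairs.
    Returns the indices of pairs accepted as correct matches."""
    d = np.linalg.norm(pts_a - pts_b, axis=1)
    return np.where(d <= r)[0]
```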
The matching results of feature point clusters can reflect changes in the prominent features of the image. As shown in example diagram A in Figure 4, when the features of the objects in the image are prominent, more feature points are extracted and, correspondingly, more point pairs are successfully matched, so it can be determined that no tampering has occurred. Conversely, if a change occurs, there may be a significant difference in the corresponding feature points, resulting in fewer successfully matched point pairs, as shown in example diagram B in Figure 4. Therefore, this study considers that the number of matched feature point pairs can be used to roughly determine whether changes have occurred in the image, thereby preliminarily screening areas where changes are more likely to have occurred.

For a given threshold m on the number of feature point pairs, when the number of successfully matched point pairs exceeds this threshold, the two images are considered to have a high degree of matching, indicating no change in the area. However, when the number of successfully matched feature point pairs is less than the given threshold, this study considers the influence of the number of feature points extracted from each image on the matching results. Specifically, when the number of feature points extracted from both images is less than the threshold m, it indicates that the features of the objects in the patch are weak, resulting in few feature points, as shown in example diagram C in Figure 4. Such cases are passed to the next module for judgment.
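The threshold rules above can be summarized in a small decision function (a sketch of our reading of the rules; the names and return values are illustrative):

```python
from enum import Enum

class PatchStatus(Enum):
    UNCHANGED = "unchanged"
    CANDIDATE_TAMPERED = "candidate_tampered"
    DEFER = "defer_to_semantic_module"

def constrain_by_feature_points(n_matched, n_points_a, n_points_b, m=5):
    """Apply the feature point constraint rules to one patch pair."""
    if n_matched > m:
        # Enough correct matches: the patch is considered unchanged.
        return PatchStatus.UNCHANGED
    if n_points_a < m and n_points_b < m:
        # Both images have weak features (example C): defer to the
        # semantic similarity measurement module.
        return PatchStatus.DEFER
    # Few matches despite sufficient feature points (example B):
    # the patch is a candidate tampered region.
    return PatchStatus.CANDIDATE_TAMPERED
```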
3.3. Semantic Similarity Measurement
During large-scale image manipulation detection, there are often situations where the image features are not prominent, the feature points are sparse, or the matched feature point pairs do not satisfy the threshold conditions of the previous module. As shown in Figure 5, when there are few matched feature point pairs, we propose assessing the spatial consistency of the images based on the semantic features of the ground objects. We first use the fine-tuned DINOv2 model to extract the feature type label of each image and then calculate the cosine similarity between the image pair. When the similarity meets the preset threshold, the image pair is considered to be matched successfully. If not, we turn to the extracted feature type labels and perform rule-based judgment of their consistency. The architecture of the semantic similarity measurement module is shown in Figure 5.
Specifically, this study initially gathers three typical land features—cities, forest and water bodies, and seasonal bare land—to construct a dataset of image feature categories. These categories correspond to three feature levels: prominent, weak, and moderate, as shown in Figure 6. Subsequently, we employ the DINOv2 [40] model as the backbone network for feature extraction, integrating a classic MLP for linear decoding. Training is conducted on the generated dataset to obtain a model suitable for feature classification.
The module utilizes heterogeneous RS images as input. It employs the trained DINO classification model to generate a feature class label for each image. For the two images A and B, their features $F_A$ and $F_B$ and classification results $C_A$ and $C_B$ are obtained from the module. The cosine similarity $S$ of the features is used to determine the similarity between the two patches:

$S = \dfrac{F_A \cdot F_B}{\lVert F_A \rVert \, \lVert F_B \rVert}$
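On the extracted feature vectors, this measure is computed directly (a minimal NumPy sketch):

```python
import numpy as np

def cosine_similarity(f_a, f_b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(f_a, f_b) / (np.linalg.norm(f_a) * np.linalg.norm(f_b)))
```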
A similarity threshold $\theta_1$ is set for judgment. When the similarity is larger than $\theta_1$, the images are considered similar, and it is inferred that the patch has not changed. When the similarity is less than $\theta_1$, the following rules are applied for further detection:

(1) When $C_A$ is prominent and $C_B$ is weak, or vice versa, indicating a substantial difference in feature prominence between A and B, it is inferred that the image has changed.

(2) When $C_A$ and $C_B$ are both prominent or both weak, suggesting that A and B are located in areas characterized by prominent features such as urban landscapes or by weak features such as forest or water bodies, it is inferred that the image has not changed.

(3) In the remaining cases, where at least one of the features is at a moderate level, the current threshold is too strict, so a lower similarity threshold $\theta_2$ is set. If the similarity is larger than $\theta_2$, it is considered that there is no change in the image; otherwise, the extent of the patch is expanded by half and the entire model pipeline is re-executed for detection. If the similarity is still lower than $\theta_2$, it is determined that there is a change in the image.
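These decision rules can be sketched as a single function (default thresholds taken from the Discussion; the function name and the retry signal are illustrative, and the expand-and-retry loop is left to the caller):

```python
def semantic_judgment(similarity, c_a, c_b, theta1=0.85, theta2=0.8):
    """Apply the similarity threshold and the three classification rules.

    c_a, c_b: feature class labels, one of "prominent", "moderate", "weak".
    Returns "changed", "unchanged", or "expand_and_retry"."""
    if similarity > theta1:
        return "unchanged"
    if "moderate" not in (c_a, c_b):
        # Rule (1): opposite prominence levels imply a change.
        if {c_a, c_b} == {"prominent", "weak"}:
            return "changed"
        # Rule (2): both prominent or both weak implies no change.
        return "unchanged"
    # Rule (3): at least one moderate label; relax the threshold.
    if similarity > theta2:
        return "unchanged"
    # Caller expands the patch by half and re-runs the whole pipeline;
    # a second failure is judged as a change.
    return "expand_and_retry"
```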
5. Discussion
The proposed HSIML has four hyperparameters that need to be determined, which can lead to complicated steps and additional workload in actual applications or migration. This complexity not only increases the difficulty of debugging and optimization but may also affect the performance and efficiency of the model. To simplify the computational and experimental workload arising from the numerous hyperparameters in practical applications, we standardize the parameters by module and provide a method for quickly narrowing down their ranges. Specifically, the first two parameters pertain to the feature point constraint module. The buffer radius r ideally should correspond perfectly to matched point pairs, meaning r = 0. However, considering slight differences in feature point centers on the image, a default value of 3 is set, with a range of [1, 4] to be tested for the best precision. For the feature point pair threshold m, the accuracy remains consistent when the number of matched pairs exceeds 5; thus, a default value of 5 is set, and adjustments are not recommended. The similarity thresholds $\theta_1$ and $\theta_2$ pertain to the semantic similarity measurement module. Based on parameter experimentation, it is advisable to maintain the default values of $\theta_1$ at 0.85 and $\theta_2$ at 0.8. To achieve optimal accuracy, the range of $\theta_1$ for testing can be set to [0.8, 0.9].
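Collected in one place, these defaults and search ranges read as follows (a configuration sketch; the key names are ours, not part of the released code):

```python
# Default hyperparameters and recommended search ranges for HSIML,
# as summarized in the Discussion.
HSIML_DEFAULTS = {
    "buffer_radius_r": 3,        # feature point constraint module
    "match_threshold_m": 5,      # fixed; adjustment not recommended
    "similarity_theta1": 0.85,   # semantic similarity measurement module
    "similarity_theta2": 0.80,   # relaxed threshold for moderate patches
}

HSIML_SEARCH_RANGES = {
    "buffer_radius_r": (1, 4),
    "similarity_theta1": (0.8, 0.9),
}
```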
In the RSIML task, the model’s ability to resist interference is an important indicator of its robustness. Therefore, it is also crucial to assess whether the model’s accuracy can be maintained at a satisfactory level when the resolution of the tampered images is low or additional image compression is applied. To this end, we applied JPEG compression to the test samples at varying degrees to reduce their resolution and evaluated the model’s accuracy.
In Table 6, as the compression rate increases, the model’s precision decreases, but the overall F1 score remains relatively stable until the compression rate reaches 50%, after which there is a noticeable drop in accuracy. When the compression rate reaches 70%, the model’s accuracy has declined by 13%; at this point, however, the differences between image pairs are largely eliminated and lose their reference significance. From this, it can be concluded that the model maintains a certain robustness under low-resolution conditions.