Article

Vegetation Land Segmentation with Multi-Modal and Multi-Temporal Remote Sensing Images: A Temporal Learning Approach and a New Dataset

1 Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230031, China
2 Science Island Branch, Graduate School of USTC, Hefei 230026, China
3 S-Lab, Nanyang Technological University, Singapore 639798, Singapore
4 Department of Computer Science, Hefei University of Technology, Hefei 230601, China
5 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Remote Sens. 2024, 16(1), 3; https://doi.org/10.3390/rs16010003
Submission received: 29 November 2023 / Revised: 13 December 2023 / Accepted: 14 December 2023 / Published: 19 December 2023
(This article belongs to the Section AI Remote Sensing)

Abstract

In recent years, remote sensing analysis has gained significant attention in visual analysis applications, particularly in segmenting and recognizing remote sensing images. However, the existing research has predominantly focused on single-period RGB image analysis, thus overlooking the complexities of remote sensing image capture, especially in highly vegetated land parcels. In this paper, we provide a large-scale vegetation remote sensing (VRS) dataset and introduce the VRS-Seg task for multi-modal and multi-temporal vegetation segmentation. The VRS dataset incorporates diverse modalities and temporal variations, and its annotations are organized using the Vegetation Knowledge Graph (VKG), thereby providing detailed object attribute information. To address the VRS-Seg task, we introduce VRSFormer, a critical pipeline that integrates multi-temporal and multi-modal data fusion, geometric contour refinement, and category-level classification inference. The experimental results demonstrate the effectiveness and generalization capability of our approach. The availability of VRS and the VRS-Seg task paves the way for further research in multi-modal and multi-temporal vegetation segmentation in remote sensing imagery.

1. Introduction

Recent years have witnessed numerous studies on visual analysis applications for remote sensing images, such as [1,2,3,4]. Research on remote sensing analysis usually formulates the perception task as segmenting or recognizing a single RGB image captured in a single period [5,6]. Though much progress has been made on the algorithmic side of remote sensing segmentation, the community still lacks a detailed, in-depth treatment of how remote sensing images are captured, especially for land parcels with highly vegetated plots. Under these circumstances, the construction of large-scale remote sensing benchmarks for highly vegetated land has not gained widespread attention in the computer vision community. The major challenges are the following:
  • Multi-modality: The collection of multi-modal data involves high costs, complex processing and storage requirements, difficulties in acquisition, and challenges in annotation.
  • Multi-temporality: There is complexity in data collection, storage, and management; issues related to data consistency and calibration; and challenges in multi-temporal data analysis and algorithm development.
  • High-precision data collection: There are limitations in sensors and devices, complex data collection and processing programs, considerations for data privacy and security, and challenges in data annotation and verification.
Due to security, privacy, and commercial considerations, as well as satellite storage limitations, the resolution of public satellite images is usually low [7,8,9,10,11]. As drone technology develops, it is becoming easier to obtain high-resolution images. However, existing ultra-high-resolution remote sensing datasets [12] hardly consider multi-temporal and multi-modal aspects. This paper aims to enhance the performance of semantic segmentation by considering multi-temporal and multi-modal information on top of ultra-high-resolution imagery. To address this challenge, we propose a novel computer vision task, VRS-Seg, which segments target vegetation areas from multi-modal and multi-temporal remote sensing images. In conjunction with this task, we propose a new graph-style annotation, namely the Vegetation Knowledge Graph (VKG), to organize the detailed annotations for vegetation land covering multi-modal images, multi-temporal images, and segmentation images (Section 4.1).
Existing datasets usually only contain RGB images of a specific region at a specific time. Their quality is strongly affected by the vegetation growth cycle, the equipment, and changes in light and shadow, which significantly limits a model's performance and generalization ability. Some datasets [8,10,13] provide raw multi-modal data but leave it to the model to learn the associations between modalities, which increases the difficulty of model learning. Other datasets [8,9,11] provide temporal data, but their resolution is too low for detailed vegetation classification; the low resolution makes it feasible to feed all of the data into the model for inference, yet this becomes a challenge for high-resolution models. To address the above issues, we designed a data collection pipeline, VRS-Sys, to collect datasets from multi-modal and multi-temporal sources, as shown in Figure 1. We captured high-resolution remote sensing vegetation data using the DJI Phantom 4 multi-modal drone system, which was equipped with one visible light camera and five multi-modal cameras. The collected data were arranged based on latitude and longitude information, and annotations were added that account for changes in image resolution. VRS-Sys helps us save costs and allocate the budget toward collecting more multi-temporal and multi-modal remote sensing images. The Normalized Difference Vegetation Index (NDVI) evaluates vegetation coverage and health status in remote sensing images by comparing the reflectance of different spectral bands; we can therefore use the multi-modal data to compute an NDVI prior that improves model performance.
Conventional segmentation models do not support multi-modal and multi-temporal data [9,14,15,16,17]. Therefore, we provide a baseline, VRSFormer, a novel temporal pipeline for the VRS-Seg task. VRSFormer is an essential pipeline that addresses the fusion of multi-temporal and multi-modal data, geometric contour refinement, and classification inference. Based on the transformer architecture, VRSFormer comprises a multi-modal self-attention [18] module that maps RGB images to NDVI data, a temporal cross-attention module that utilizes prior knowledge from auxiliary temporal data for vegetation classification, and a multi-scale decoder module that aggregates multi-scale features. Additionally, VRSFormer incorporates a boundary module [19] to refine object boundaries. During the inference phase, VRSFormer avoids inputting data from all periods; instead, it enhances the segmentation of the current period by utilizing data from a randomly chosen period, thus achieving optimal segmentation results with minimal resource utilization.
The main contributions of this paper are as follows:
  • We introduce VRS, which contains 108 plots of land across two categories. Moreover, we adopt a multi-modal knowledge graph, the VKG, to organize the rich annotations. It can help analyze the difficulties of vegetation detection in current remote sensing tasks. To our knowledge, it is the first vegetation dataset with ultra-high resolution and detailed annotations of multi-temporal and multi-modal data.
  • We design the VRSFormer model and innovatively introduce multi-modal self-attention and multi-temporal cross-attention modules to support the fusion of multi-temporal and multi-modal data. VRSFormer enhances current-period segmentation by leveraging data from other periods and utilizes multi-temporal image data to achieve optimal mapping results. Additionally, we introduce a boundary module to enrich boundary information. The experimental results demonstrate that, compared to single-temporal and single-modal models, this model exhibits superior generalization performance.
  • We conducted comprehensive benchmark experiments on VRSFormer using the VRS dataset. The experimental conclusions indicate that the encoded information from the fusion of multi-modal and multi-temporal data contains rich boundary details. Additionally, the impact of fusing different periods and different amounts of data on model results was examined, and the rationality of the VRSFormer method was confirmed.

2. Related Work

2.1. Land Cover Semantic Segmentation Datasets

Researchers have devoted considerable effort to constructing more accurate, comprehensive, and high-quality remote sensing image datasets. Datasets such as JSsampleP [20], LandCover.ai [21], EuroSAT [7], and DeepGlobe [22] have become essential foundations for related research. They provide rich annotations of land cover categories, including classes such as agriculture and vegetation. However, these datasets rely solely on image information for vegetation classification, which significantly limits the expressive power of models. Datasets such as LandCoverNet [8], GID [10], and Satlas [13] recognize the importance of multi-modal data for vegetation analysis and incorporate multi-modal data as important information for parcel classification.
Classifying parcels using single-time-point images in real-life scenarios is influenced by lighting conditions and the specific geographical states at capture time. For vegetation, the geographic landscape can vary significantly depending on the growth conditions at different times. Datasets such as LandCoverNet [8], PASTIS [9], Satlas [13], and DynamicEarthNet [11] introduce the concept of temporal information into parcel classification, which is achieved by using the fusion of multi-temporal data to address the limitations of single-time-point data.
However, most semantic segmentation datasets, including those mentioned above, are based on satellite remote sensing imagery. Satellite-derived datasets generally have low resolution, resulting in relatively blurred object boundaries. To obtain higher-resolution maps, unmanned aerial vehicles (UAVs) are necessary. For example, the FloodNet dataset [12] and the DroneDeploy dataset leverage UAV remote sensing imagery to achieve higher-resolution images. However, the annotations in these datasets are relatively coarse and cannot distinguish between vegetation types. Agriculture-Vision [23] provides more detailed vegetation images, but its tasks mainly focus on capturing abnormal areas within vegetation, making the dataset more suitable for classification tasks. Moreover, these high-resolution datasets do not simultaneously consider multi-temporal and multi-modal aspects. As surveyed in Table 1, there are currently no vegetation datasets available that combine high resolution, multi-modal and multi-temporal data, and detailed annotations.

2.2. Remote Sensing Segmentation

FCN [25] is a seminal work in semantic segmentation, having introduced the encoder–decoder architecture. Since then, FCN-based methods have dominated the remote sensing semantic segmentation field. To capture richer contextual features and reduce information loss, some approaches, such as the studies in [5,6,26,27,28,29], have combined high-resolution remote sensing imagery with techniques such as residual connections, atrous convolutions, pyramid scene understanding, and multi-task inference. These methods establish conditional relationships between different tasks, thus improving segmentation accuracy. Despite the satisfactory results achieved by CNN-based methods, the limitations of convolutional operations still restrict the upper bound of the entire task.
The introduction of ViT [30] has provided a new direction for expanding the receptive field. A series of optimization methods [31,32] have been proposed to improve the original ViT for better adaptation to visual tasks. HBRNet [33] performs best in field segmentation tasks by introducing the Swin transformer to obtain a larger receptive field. To balance the local and global information of remote sensing images, several models [34,35,36,37] that combine U-Net and transformer architectures have been proposed. Although the results of these algorithms approach human-level performance, single-time-point image segmentation remains susceptible to the quality of the captured images, leading to errors that such models cannot address.
Based on the PASTIS dataset, U-TAE [9] was proposed to fuse multi-temporal data to improve segmentation results. TSViT [17] introduced the transformer and achieved better results. These methods attempt to merge all the data when segmenting the current parcel. However, the data quality is not always high, and relying solely on the network to weigh the data leads to an inevitable loss in accuracy. Collecting remote sensing data is very challenging, so leveraging known high-quality data to obtain the best results is a question of great concern. Our model utilizes temporal data to obtain segmentation results for the current time point and, inspired by [19], introduces additional boundary information to improve the segmentation results.

3. Problem Formulation

We propose a challenging problem: how can segmentation be performed in high-resolution agricultural rice fields using multi-temporal and multi-modal data? Similar to other types of vegetation, the morphology of rice varies across different periods. However, certain temporal instances present considerable segmentation challenges, such as puddle formation due to water accumulation during rice growth stages, which results in noticeable landscape variations within the same area. To tackle this problem, we leveraged data from different periods of the same land parcel. Nevertheless, the image quality was influenced by various factors, including shadows. To overcome these challenges, we employed multi-modal data to segment the images from multiple dimensions. By combining the information from multiple modalities, we can better capture the characteristics of a land parcel and improve segmentation accuracy. This approach effectively addresses the segmentation problem in high-resolution agricultural rice fields by leveraging multiple temporal and modal data.
First and foremost, our task involves utilizing RGB and multi-spectral NDVI images of a land parcel at distinct periods. For each period, we create an RGB-N image by incorporating the corresponding NDVI image as an additional channel of the RGB image. Our input can be represented as $I_t = \{I_t^i\}_{i=1}^{M}$, where $I_t^i$ denotes the RGB-N image at the $i$-th period, and $M$ is the total number of temporal views. Subsequently, we employ the multi-modal self-attention (MSA) mechanism to establish connections between the different modalities of data, thereby facilitating the generation of multi-scale bird's eye view (BEV) features $B_t = \{B_t^i\}_{i=1}^{L}$, where $L$ is the total number of encoder layers. Next, we employ the temporal cross-attention (TCA) mechanism to fuse $B_t$ and $B_{t-1}$, thus enhancing the features of the current period. Finally, we output pixel-level predictions for the current land parcel, classifying each pixel into the background or vegetation category.
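As a concrete illustration of the input construction described above, the following minimal sketch stacks each period's RGB image with its NDVI map into an RGB-N tensor; the array names and shapes are illustrative assumptions, not part of the released dataset tooling.

```python
import numpy as np

def build_rgbn_stack(rgb_images, ndvi_maps):
    """Stack per-period RGB images with their NDVI maps into RGB-N images.

    rgb_images: list of M arrays of shape (H, W, 3) with values in [0, 1]
    ndvi_maps:  list of M arrays of shape (H, W) with values in [-1, 1]
    returns:    array of shape (M, H, W, 4), i.e., one RGB-N image per period
    """
    rgbn = []
    for rgb, ndvi in zip(rgb_images, ndvi_maps):
        # Append the NDVI map as a fourth channel of the RGB image.
        rgbn.append(np.concatenate([rgb, ndvi[..., None]], axis=-1))
    return np.stack(rgbn, axis=0)

# Example with M = 5 periods of 512 x 512 imagery (random data for illustration).
rgbs = [np.random.rand(512, 512, 3) for _ in range(5)]
ndvis = [np.random.uniform(-1, 1, (512, 512)) for _ in range(5)]
print(build_rgbn_stack(rgbs, ndvis).shape)  # (5, 512, 512, 4)
```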

4. Dataset Description

4.1. Vegetation Knowledge Graph (VKG)

Due to the diversity of the data types in this study, a multi-modal knowledge graph was essential for handling various forms of object knowledge. By standardizing the annotation representation within such a graph, it becomes possible to accommodate and integrate various types of information. Therefore, to standardize the representation and annotation of diverse multi-modal and multi-temporal data for the same region, we introduce the Vegetation Knowledge Graph (VKG). The VKG comprises three main components: multi-modal, multi-temporal, and semantic. Each component is described in detail below and visualized in Figure 2.
  • Multi-modal data: We utilized multi-modal data to capture field plot information, specifically by calculating the NDVI. The NDVI was computed from the red and near-infrared (NIR) band data with the formula NDVI = (NIR − Red)/(NIR + Red), where NIR is the near-infrared band reflectance and Red is the red band reflectance (a minimal code sketch of this computation follows this list).
  • Temporal data: Based on the morphological characteristics of rice growth stages, we primarily collected data during the four distinct growth periods of rice: the tillering stage, jointing stage, heading stage, and grain filling stage. The intervals between each of these periods were approximately one month. Notably, the jointing stage represents the peak water demand period in the entire life cycle of rice. During this stage, there are significant differences in the distribution and extent of water in the paddy fields between high-irrigation and low-irrigation periods. Consequently, during the jointing stage, we conducted separate data collection for both high-irrigation and low-irrigation periods. Ultimately, we dedicated four months to collecting the data for the five stages, as illustrated in Figure 1.
  • Semantic data: Detailed semantic annotations were provided for the field plots to enhance paddy field segmentation. The data were categorized into vegetation and background classes, and they precisely delineated the paddy boundaries within the vegetation category. These fine-grained annotations ensured the accurate and precise identification and segmentation of vegetation field plots by the model.
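As referenced in the multi-modal data item above, the NDVI computation can be sketched as follows, assuming the red and NIR bands are available as co-registered reflectance arrays (the small epsilon term is an assumption added to avoid division by zero):

```python
import numpy as np

def compute_ndvi(nir, red, eps=1e-6):
    """NDVI = (NIR - Red) / (NIR + Red), computed per pixel.

    nir, red: arrays of identical shape holding band reflectance values
    eps:      small constant guarding against division by zero on dark pixels
    returns:  NDVI array with values in [-1, 1]
    """
    nir = nir.astype(np.float32)
    red = red.astype(np.float32)
    return (nir - red) / (nir + red + eps)

# Dense vegetation reflects strongly in the NIR band, so its NDVI approaches 1.
nir = np.array([[0.60, 0.10]], dtype=np.float32)
red = np.array([[0.08, 0.09]], dtype=np.float32)
print(compute_ndvi(nir, red))  # approximately [[0.76, 0.05]]
```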
We established a comprehensive multi-modal and multi-temporal Vegetation Knowledge Graph (VKG) by integrating the above data. The VKG represents a strategic step toward unifying and harnessing diverse data types, thereby facilitating an in-depth analysis of the characteristics of vegetation paddy fields. Furthermore, the modular organization of annotation information within the VKG allows for the convenient inclusion of new attributes. Although the VKG was originally tailored to paddy field objects, it can be expanded to encompass other crop types.

4.2. VRS System

4.2.1. Model Acquisition Equipment

We employed the DJI Phantom 4 multi-modal drone system to efficiently collect high-resolution remote sensing vegetation data, as shown in Figure 3. This system has an integrated multi-modal imaging device, including a visible light camera and five multi-modal cameras (blue, green, red, red edge, and near-infrared bands) for visual and multi-modal imaging. All of the cameras had a resolution of 2 megapixels and were equipped with a global shutter to ensure precise and stable imaging. To ensure good generalization performance of the model across different resolutions, we randomly set the flight altitude between 50 and 100 m. This setup yielded high-quality, high-resolution vegetation image data, forming a dependable foundation for further analysis and applications.

4.2.2. Dataset Production

We imported the multi-temporal data into ArcGIS 10.8 and aligned it using latitude and longitude information. A segmentation grid was established based on coordinates to gather the data from various periods for the same land plots. Due to variations in the resolution of the remote sensing images, the sizes of images for the same land plots differed across periods. We annotated the land plots of the same region in ArcGIS and exported the annotation results. However, discrepancies in the annotations stemmed from drone measurement errors and image resolution differences. Consequently, we fine-tuned the annotations across all periods to improve accuracy. We constructed a temporal dataset based on the land plots, where each land plot included three components: the images corresponding to the five periods, the NDVI data, and the annotation information.

4.3. Dataset Analysis

To create high-precision vegetation maps, we considered the following requirements: (1) A high resolution was necessary to delineate the vegetation boundaries accurately. (2) Temporal information was crucial as the vegetation exhibited distinct morphological variations during different growth stages, thus requiring data from multiple temporal instances to enhance the perceptual range. (3) Multi-modal information was considered as we believe it can positively assist vegetation segmentation.
As shown in Table 1, unlike other datasets, VRS focuses on higher-resolution scales, ranging between 0.06 m and 0.12 m, to ensure the model's generalizability. DeepGlobe, LoveDA, JSsampleP, and Agriculture-Vision also have centimeter-level resolution, but they only consider single-period image data of the plots, thus neglecting the differences in crop shape characteristics across growth stages. PASTIS and LandCoverNet include data from different crop growth stages and use multi-modal data to assist model segmentation, but their image resolution is relatively low, resulting in blurred boundary information. Only our VRS dataset is built on a high-resolution foundation while simultaneously considering multi-modal and multi-temporal data.

5. Method

Past algorithms [9] typically employed a fusion of data from all temporal phases and modalities for parcel segmentation. However, not all data from different periods contribute effectively to the classification of parcels. Additionally, this approach increases the computational load on the model. More importantly, when the land use of parcels changes over different periods, focusing on vegetation segmentation at the current moment becomes crucial, and images from other periods serve only as references.
To address the challenges above, in this section, we introduce VRSFormer. We designed VRSFormer based on Segformer. This model takes the multi-modal data from the current moment as the base and introduces another instance of multi-modal data from a different moment as an auxiliary to complete the segmentation of the current plot. In comparison to Segformer, VRSFormer integrates multi-temporal and multi-modal data without significantly increasing the computational overhead. Figure 4 illustrates VRSFormer, which consists of 8 encoder layers and is divided into multi-modal self-attention and temporal cross-attention. The multi-modal self-attention mechanism utilizes the features extracted from the backbone to establish mappings between multi-modal features, while temporal cross-attention retrieves and aggregates features from multiple temporal data at the same positions.
During inference, we feed the multi-temporal data to the patch embedding network, thereby generating the features $F_t = \{F_t^i\}_{i=1}^{M}$ of the different periods, where $F_t^i$ is the feature of the $i$-th time and $M$ is the total number of temporal views. Multi-modal self-attention establishes the connection between multi-modal data using the query $Q_m$ in each encoder layer. Temporal cross-attention uses the query $Q_t$ to query the temporal information from the multi-temporal features $F_t^i$ and outputs refined BEV (bird's eye view features of vegetation) features. After four encoder layers, a unified BEV feature $B_t = \{B_t^i\}_{i=1}^{L}$ is generated for the current timestamp $t$ at multiple scales, whose sizes are $\{1/4, 1/8, 1/16, 1/32\}$ of the original image size, where $L$ is the total number of encoder layers. The decoder with a boundary module takes the BEV feature $B_t$ as input, producing clear boundary information and fusing it with the multi-scale feature maps to generate the final BEV perception result.
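To make the data flow above concrete, the following structural sketch wires the stages together in simplified form. It is a hypothetical illustration rather than the authors' implementation: the attention blocks and boundary module are replaced by plain convolutions, and only the shared-encoder reuse, the detached auxiliary-period branch, and the multi-scale 1/4–1/32 feature pyramid are mirrored from the description above.

```python
import torch
from torch import nn
import torch.nn.functional as F

class VRSFormerSketch(nn.Module):
    """Hypothetical structural sketch of the VRSFormer data flow. Four encoder
    stages produce BEV features at 1/4, 1/8, 1/16, and 1/32 scale; the real
    MSA/TCA/boundary blocks are replaced by convolutions to stay short."""

    def __init__(self, in_ch=4, dims=(32, 64, 160, 256), num_classes=2):
        super().__init__()
        strides = (4, 2, 2, 2)                               # 1/4, 1/8, 1/16, 1/32
        chans = (in_ch,) + dims
        self.stages = nn.ModuleList(
            nn.Conv2d(chans[i], chans[i + 1], 3, stride=strides[i], padding=1)
            for i in range(4))
        self.fuse = nn.ModuleList(nn.Conv2d(d, 64, 1) for d in dims)
        self.head = nn.Conv2d(64 * 4, num_classes, 1)

    def forward(self, x_cur, x_aux):
        """x_cur, x_aux: RGB-N tensors of shape (B, 4, H, W) for the current
        period and one randomly chosen auxiliary period."""
        bev = []
        f_cur, f_aux = x_cur, x_aux
        for stage in self.stages:
            f_cur, f_aux = stage(f_cur), stage(f_aux)        # shared (reused) encoder
            # stand-in for TCA fusion; no gradients flow to the auxiliary branch
            bev.append(f_cur + f_aux.detach())
        size = bev[0].shape[-2:]                             # decode at 1/4 resolution
        up = [F.interpolate(p(b), size=size, mode='bilinear', align_corners=False)
              for p, b in zip(self.fuse, bev)]
        return self.head(torch.cat(up, dim=1))

logits = VRSFormerSketch()(torch.rand(1, 4, 256, 256), torch.rand(1, 4, 256, 256))
print(logits.shape)  # torch.Size([1, 2, 64, 64])
```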

5.1. Multi-Modal Self-Attention

Due to the high computational cost of the regular multi-head attention mechanism, we designed a multi-modal self-attention (MSA) block based on [31]. The architecture of the MSA module is shown in Figure 5. This block replaces positional embeddings with positional information learned via a 3 × 3 convolutional layer. However, as the original design of [31] is tailored to RGB images, adaptations are necessary for multi-modal inputs. We concatenated the RGB and NDVI data as the input to the multi-modal self-attention block. The RGB-N input is first passed through linear layers to generate $Q_m$, $K_m$, and $V_m$, which are then used to establish the mapping relationship between the different modalities via multi-modal self-attention. We therefore model this modal connection between features through the multi-modal self-attention layer, which can be expressed as follows:
$$Q_m = \mathrm{Linear}\big((C_I + C_N) \cdot H \cdot W,\; C_{out}\big)$$
$$\mathrm{MSA}(Q_m, K_m, V_m) = \mathrm{Softmax}\!\left(\frac{Q_m \cdot K_m^{T}}{\sqrt{d}}\right) \cdot V_m$$
where $C_I$ is the dimension of the RGB data, $C_N$ is the dimension of the NDVI data, and $d$ is the number of attention heads. $K_m$ and $V_m$ are generated through linear layers in the same way as $Q_m$. $Q_m$, $K_m$, and $V_m$ establish the feature maps of the modalities through the MSA module.
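A self-contained sketch of a multi-modal self-attention block in this spirit is given below. It assumes full (non-reduced) attention, a 3 × 3 depthwise convolution standing in for positional embeddings, and illustrative channel sizes; it is not the authors' implementation.

```python
import torch
from torch import nn

class MultiModalSelfAttention(nn.Module):
    """Sketch of an MSA block over concatenated RGB and NDVI feature maps.
    Positional information comes from a 3x3 depthwise convolution instead of
    explicit positional embeddings (assumption following the text above)."""

    def __init__(self, c_rgb=3, c_ndvi=1, c_out=64, heads=4):
        super().__init__()
        c_in = c_rgb + c_ndvi
        self.pos = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in)  # learned positions
        self.q = nn.Linear(c_in, c_out)
        self.k = nn.Linear(c_in, c_out)
        self.v = nn.Linear(c_in, c_out)
        self.heads = heads
        self.scale = (c_out // heads) ** -0.5

    def forward(self, rgb, ndvi):
        x = torch.cat([rgb, ndvi], dim=1)             # (B, C_I + C_N, H, W)
        x = x + self.pos(x)                           # inject positional information
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)         # (B, H*W, C)
        q, k, v = self.q(tokens), self.k(tokens), self.v(tokens)

        def split(t):                                 # (B, heads, H*W, C_out / heads)
            return t.view(b, -1, self.heads, t.shape[-1] // self.heads).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, h * w, -1)
        return out.transpose(1, 2).reshape(b, -1, h, w)  # back to (B, C_out, H, W)

msa = MultiModalSelfAttention()
out = msa(torch.rand(1, 3, 32, 32), torch.rand(1, 1, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```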

5.2. Temporal Cross-Attention

In addition to multi-modal information, temporal information is crucial for the visual system to understand the surrounding environment. Detecting objects from blurry images without temporal cues poses a challenge. To address this issue, we designed the temporal cross-attention (TCA) module, shown in Figure 6, which incorporates random temporal BEV features to represent the current environment. Given the BEV query $Q$ at the current timestamp $t$ and the preserved random temporal features $F_{B_{t'}}$, establishing precise associations between BEV features at different instances is challenging. Therefore, we employ a TCA layer to model the temporal relationships among the features, which can be expressed as follows:
$$\mathrm{TCA}\big(Q, (F_{B_t}, F_{B_{t'}})\big) = \sum_{i=0}^{M} \mathrm{Softmax}\!\left(\frac{Q \cdot F_{B_{t'}}^{i}}{\sqrt{d}}\right) \cdot F_{B_{t'}}^{i}$$
where $Q$ represents the global BEV query, $M$ is the total number of temporal views, and $F_{B_{t'}}$ is the set of BEV features extracted from historical parcels. For the first sample of each sequence in particular, the temporal cross-attention degrades to cross-attention without temporal information; thus, we replace the BEV features $\{Q, (F_{B_t}, F_{B_{t'}})\}$ with repeated BEV queries $\{Q, (Q, Q)\}$.
Compared to simply stacking features as in [9,17], our temporal cross-attention mechanism can more effectively capture long-term dependencies. VRSFormer extracts temporal information from randomly sampled BEV features instead of merely stacking multiple BEV features, thereby reducing computational cost and suffering less interference from irrelevant information.
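A minimal sketch of such a temporal cross-attention step is shown below: the current-period BEV feature supplies the queries, and the BEV feature of one randomly chosen auxiliary period supplies the keys and values, falling back to the current feature when no history exists. The dimensions and the use of `torch.nn.MultiheadAttention` are illustrative assumptions.

```python
import torch
from torch import nn

class TemporalCrossAttention(nn.Module):
    """Sketch of TCA: the current BEV feature queries one auxiliary-period
    BEV feature; for the first sample of a sequence the auxiliary feature
    can simply be the current feature itself (degenerating to self-attention)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, bev_cur, bev_aux=None):
        # bev_cur, bev_aux: (B, C, H, W) BEV features of the current and a
        # random auxiliary period; fall back to bev_cur when no history exists.
        if bev_aux is None:
            bev_aux = bev_cur
        b, c, h, w = bev_cur.shape
        q = bev_cur.flatten(2).transpose(1, 2)          # (B, H*W, C) queries
        kv = bev_aux.flatten(2).transpose(1, 2)         # (B, H*W, C) keys/values
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(q + fused)                    # residual connection + norm
        return fused.transpose(1, 2).reshape(b, c, h, w)

tca = TemporalCrossAttention()
out = tca(torch.rand(1, 64, 16, 16), torch.rand(1, 64, 16, 16))
print(out.shape)  # torch.Size([1, 64, 16, 16])
```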

5.3. Boundary Module

By integrating multi-modal and multi-temporal data, we enhanced the richness of the BEV features. However, the significant variations in data across different periods often lead to blurred boundary information. Moreover, as semantic information intensifies, the main content of objects becomes similar, while the blurriness of edges varies. Following [19], we explicitly introduce a boundary module to learn body and edge feature representations. We observed that low-frequency features are prone to excessive noise, while high-frequency features, despite their blurred edges, still contain semantic information about the boundaries. Hence, we leverage the edge module to obtain the boundary feature (Edge $E$). Figure 4 shows that we place the boundary module after $B_4$.
The boundary module uses a flow field $\delta \in \mathbb{R}^{H \times W \times 2}$ to generate the flow probabilities that point toward the object's center. Through bilinear interpolation, we perform feature mapping on $B_4$. The specific process is illustrated in Figure 7, where downsampling and upsampling operations are applied to $B_4$ to eliminate some noise and to obtain a more precise feature map $B_5$ of the main object. Subsequently, we concatenate $B_4$ and $B_5$ and apply a 3 × 3 convolutional layer to predict the flow map $B_6$. Using the learned flow field, new feature values $B_7$ can be obtained at the corresponding positions of the input feature map through bilinear interpolation. Each point $p_x$ in $B_7$ can be mapped to $BI(p_x)$ in $B_6$ through bilinear interpolation, where $BI$ denotes bilinear interpolation.
$$B_7(p_x) = \sum_{p \in BI(p_x)} B_6(p)\, B_4(p)$$
Finally, we subtracted B 7 from B 4 to obtain more precise boundary information. This design of the boundary module effectively extracted and enhanced boundary information, thus contributing to the model’s precise segmentation of plot boundaries.
$$\mathrm{Edge} = B_4 - B_7$$
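The warping step can be sketched with `torch.nn.functional.grid_sample`, which performs exactly this kind of bilinear resampling. The channel sizes, the way the flow field is predicted, and the normalization of the offsets are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class BoundarySketch(nn.Module):
    """Sketch of the boundary module: predict a 2-channel flow field from the
    BEV feature, warp the feature toward object bodies with bilinear sampling,
    and take the residual as the edge feature (Edge = B4 - B7)."""

    def __init__(self, ch=256):
        super().__init__()
        self.down_up = nn.Sequential(                   # coarse "body" feature B5
            nn.AvgPool2d(2), nn.Conv2d(ch, ch, 3, padding=1),
            nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False))
        self.flow = nn.Conv2d(2 * ch, 2, 3, padding=1)  # flow field B6 (2 channels)

    def forward(self, b4):
        b, c, h, w = b4.shape
        b5 = self.down_up(b4)
        delta = self.flow(torch.cat([b4, b5], dim=1))   # per-pixel offsets
        # Base sampling grid in the normalized [-1, 1] coordinates used by grid_sample.
        ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w),
                                indexing='ij')
        grid = torch.stack([xs, ys], dim=-1).to(b4).expand(b, -1, -1, -1)
        # Scale pixel offsets to grid coordinates, then resample B4 -> warped body feature B7.
        offset = delta.permute(0, 2, 3, 1) * torch.tensor([2.0 / w, 2.0 / h])
        b7 = F.grid_sample(b4, grid + offset, mode='bilinear', align_corners=False)
        return b4 - b7                                  # edge feature E

edge = BoundarySketch(ch=8)(torch.rand(1, 8, 32, 32))
print(edge.shape)  # torch.Size([1, 8, 32, 32])
```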

5.4. Multi-Scale Decoder

Fusing features at different levels can improve the geometric and semantic information output by the model [31]. VRSFormer introduces additional boundary information in the decoder to achieve smoother image boundary segmentation. The encoder outputs multiple levels of features $B_t$, and VRSFormer uses the $B_t^4$ feature for boundary information extraction. We obtain clear boundary information of the land parcel through the boundary module, represented as $E$. Then, the multiple levels of features $B_t$ and $E$ are fused through convolutions at different scales to enhance the contour information of $B_t$.
Finally, the multi-level and boundary features are fused to make pixel classification more accurate and the segmentation of details such as edges more precise. We use a multi-scale decoder (MSD) to fuse $B_t$, which is expressed as follows:
$$\mathrm{MSD}(B_t) = \frac{1}{L} \sum_{i=0}^{L} \mathrm{MLP}_i(B_t^i)$$
where $L$ is the total number of feature levels in $B_t$. The multi-scale features are uniformly upsampled to 1/4 of the original image resolution via an MLP.
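A minimal sketch of this multi-scale decoding step is given below: each BEV level is projected by its own per-level MLP, upsampled to the 1/4-scale map, averaged, and then classified. The channel dimensions are illustrative assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn

class MultiScaleDecoderSketch(nn.Module):
    """Sketch of MSD(B_t) = (1/L) * sum_i MLP_i(B_t^i): per-level linear
    projections, upsampling to the 1/4-scale map, then averaging."""

    def __init__(self, in_dims=(32, 64, 160, 256), dim=128, num_classes=2):
        super().__init__()
        self.mlps = nn.ModuleList(nn.Linear(d, dim) for d in in_dims)
        self.cls = nn.Conv2d(dim, num_classes, 1)

    def forward(self, bev_feats):
        # bev_feats: list of L tensors at 1/4, 1/8, 1/16, 1/32 of the input size.
        target = bev_feats[0].shape[-2:]
        fused = 0
        for mlp, feat in zip(self.mlps, bev_feats):
            b, c, h, w = feat.shape
            x = mlp(feat.flatten(2).transpose(1, 2))           # (B, H*W, dim)
            x = x.transpose(1, 2).reshape(b, -1, h, w)
            fused = fused + F.interpolate(x, size=target, mode='bilinear',
                                          align_corners=False)
        return self.cls(fused / len(bev_feats))                # average, then classify

feats = [torch.rand(1, c, s, s) for c, s in zip((32, 64, 160, 256), (64, 32, 16, 8))]
print(MultiScaleDecoderSketch()(feats).shape)  # torch.Size([1, 2, 64, 64])
```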

6. Experiments

We primarily addressed the following research questions (RQs).
RQ1: How does the performance of VRSFormer compare to existing previous methods?
RQ2: Can the Boundary component in VRSFormer effectively extract the boundary information from the encoder’s output?
RQ3: Is adding multi-modal and multi-temporal (corresponding to the MST and TCA modules) more effective?
RQ4: What advantages does our selective fusion approach have over others in fusing all the data?

6.1. Experimental Settings

We utilized the mmsegmentation codebase and trained on two Tesla V100 servers. During training, we applied various data augmentations, including random resizing with a scale factor between 0.5 and 2.0, random horizontal flipping, and random cropping to sizes of 512 × 512, 1024 × 1024, and 512 × 512. The initial learning rate was set to 0.00006, and we used the default “poly” LR schedule with a coefficient of 1.0. We set class weights of 1.0 for the vegetation and 1.58 for the background to address the class imbalance. During the evaluation, we resized the short edge of the image to the training crop size. We reported the semantic segmentation performance using the mean intersection-over-union (mIoU) and mean pixel-wise accuracy (mAcc) metrics.
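Since the experiments use the mmsegmentation codebase, the settings above roughly correspond to config fields like the following sketch. The numerical values mirror the text; the optimizer choice, normalization constants, and exact pipeline ordering are assumptions, since the authors' configs are not reproduced here.

```python
# Sketch of mmsegmentation-style (v0.x) settings matching the text above; approximate, not the authors' config.
crop_size = (512, 512)
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(type='LoadAnnotations'),
    dict(type='Resize', img_scale=(1024, 1024), ratio_range=(0.5, 2.0)),  # random resize, scale 0.5-2.0
    dict(type='RandomCrop', crop_size=crop_size, cat_max_ratio=0.75),     # random cropping
    dict(type='RandomFlip', prob=0.5),                                    # random horizontal flip
    dict(type='Normalize', mean=[123.675, 116.28, 103.53],
         std=[58.395, 57.12, 57.375], to_rgb=True),
    dict(type='Pad', size=crop_size, pad_val=0, seg_pad_val=255),
    dict(type='DefaultFormatBundle'),
    dict(type='Collect', keys=['img', 'gt_semantic_seg']),
]
optimizer = dict(type='AdamW', lr=6e-5, weight_decay=0.01)                # optimizer type assumed
lr_config = dict(policy='poly', power=1.0, min_lr=0.0, by_epoch=False)    # "poly" schedule, coefficient 1.0
model = dict(
    decode_head=dict(
        num_classes=2,
        loss_decode=dict(type='CrossEntropyLoss',
                         class_weight=[1.0, 1.58])))                      # vegetation = 1.0, background = 1.58
```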

6.2. Results and Analysis for RQ1

RQ1: How does the performance of VRSFormer compare to existing previous methods?
To address the scarcity of algorithms utilizing multi-temporal and multi-modal data for image segmentation, we introduced VRSFormer as a benchmark model. As shown in Table 2, we conducted comparative experiments to evaluate the effectiveness of our multi-temporal image fusion approach. Our algorithm was compared with the baseline Segformer and other mainstream semantic segmentation models, all evaluated under the same settings. It is worth noting that the Segmenter, SwinTransformer, and Segformer algorithms were all implemented using their minimal configurations.
As shown in Figure 8, VRSFormer achieved smoother segmentation boundaries. We also compared the efficiency of VRSFormer with mainstream models. As shown in Figure 9, points closer to the left indicate higher speed or fewer parameters. VRSFormer is positioned in the upper-left corner of the coordinate system, demonstrating its advantage over the baseline and mainstream algorithms. This is primarily because VRSFormer reuses the encoder module, which reduces the model's parameter count and complexity. During training, the images from other time points are feature-extracted through the encoder module but do not participate in gradient backpropagation; the model only backpropagates gradients for the images at the current time point. In the inference phase, the encoder sequentially extracts image features as the input to the decoder module to obtain the final result.

6.3. Results and Analysis for RQ2

RQ2: Can the Boundary component in VRSFormer effectively extract the boundary information from the encoder’s output?
The encoder of VRSFormer provides a unified and comprehensive BEV feature map for downstream tasks, and how to effectively utilize these feature maps becomes a crucial step. Although we attempted to process them in the same way as Segformer, the performance improvement was minimal. Through experiments, we found that the feature maps provided by the encoder contained strong semantic information but relatively blurry boundary information. Therefore, we designed a boundary module to enhance the boundary information of the features.
In Table 3, we observed minimal improvement when adding the Boundary module to Segformer, whereas the performance of the VRSFormer model increased by approximately three percentage points. The results suggest that VRSFormer obtains more robust semantic information by combining data from multiple temporal periods and modalities but does not fully exploit boundary information.
To validate this hypothesis, we explored the effects of placing the Boundary module at different positions, as shown in the results in Table 4. Placing the Boundary module after B 4 resulted in higher performance, thus confirming that richer semantic information contains more boundary details.
Figure 10 illustrates the feature map flow of the multi-scale decoder component. Feature map E represents the output of the boundary module. The fusion of BEV features in feature map E contributes to the model’s more effective optimization of classification boundaries, thus distinguishing the content across different categories.
When visualizing the results, as shown in Figure 11, it is evident that the Boundary module enhances the clarity and smoothness of the segmentation boundaries.

6.4. Results and Analysis for RQ3

RQ3: Is adding multi-modal and multi-temporal (corresponding to the MST and TCA modules) more effective?
To evaluate the multi-modal ablation experiments, we removed the NDVI data from the input data and the multi-modal components from the model architecture while keeping the other components unchanged. As shown in Table 5, we observed that the multi-modal data play a crucial role in improving performance. VRSFormer with multi-modal data exhibits a more stable performance and higher metrics across all periods. This suggests, on the one hand, that the NDVI can complement information beyond the image data; on the other hand, it indicates that the MSA module we designed can handle multi-modal data.
For the study of the multi-temporal phases, since VRSFormer is primarily based on improvements to Segformer, the experiments were conducted using Segformer as the baseline for validation. As shown in Table 6, three main experiments were designed, corresponding to the three columns labeled “Segformer+”, “Segformer”, and “VRSFormer (without NDVI)”.
In the first experiment, “Segformer+”, Segformer was trained on single-period data (Period-1) and then tested on the data from the other periods. It was observed that Segformer attained higher accuracy in Period-1 but experienced significantly lower accuracy in the other periods. This indicates poor generalization when the model is trained on a single period, especially across growth cycles with significant differences in crop shape features.
In the second experiment, “Segformer”, Segformer was trained on the data from all periods (Periods 1–5) and then tested on each period. Segformer’s performance was relatively consistent across the periods. However, its test result for Period-1 of 87.86% was significantly lower than the 90.69% of “Segformer+”, suggesting that the potential of the data was not fully utilized.
In the third experiment, “VRSFormer (without NDVI)”, VRSFormer was trained on the data from all periods and tested on each period individually. To ensure a fair comparison with Segformer, VRSFormer excluded the NDVI data and the multi-modal modules. The main difference from Segformer lay in VRSFormer’s use of the TCA module for the cross-temporal fusion of multi-temporal data. According to Table 6, by utilizing temporal information, VRSFormer achieved higher results than Segformer across all periods.

6.5. Results and Analysis for RQ4

RQ4: What advantages does our selective fusion approach have over fusing all the data?
This section validates the impact of the quantity and quality of the temporal data on the model's accuracy. We used different periods and varying amounts of data to assist in segmenting the same field at a given period. As shown in Table 7, using two frames yields the best results, while fusing more temporal data leads to a decline in model performance. Figure 12 and Figure 13 show the visualization results of the single-period images and multi-period auxiliary segmentation, respectively.
The main reasons for these advantages are as follows. The selective fusion approach improves segmentation performance with minimal resources at the current stage, and it avoids the potential waste of resources and data contamination that may occur when merging all the data. In addition, the choice of fusing two frames is related to the model design. Unlike other models that concatenate all temporal data and input it into the model after compression, VRSFormer updates features frame by frame by reusing the encoder module. This approach reduces model and computational complexity, but it is limited to capturing only the contextual information of the current moment, thus lacking the ability to capture long-term information. Therefore, using only two frames of data not only saves resources but also aligns better with the characteristics of the model.

7. Conclusions

This paper presents a novel segmentation task for multi-modal, multi-temporal, and high-resolution contexts. To address this challenge, we collected vegetation remote sensing data using an aerial surveying drone equipped with a multi-modal imaging system, resulting in a multi-temporal dataset with multi-modal and ultra-high-resolution vegetation data. Moreover, we propose a baseline method named VRSFormer, which innovatively introduces multi-modal self-attention and multi-temporal cross-attention modules to support the fusion of multi-temporal and multi-modal data. The experimental results indicate that the fusion of multi-temporal and multi-modal methods exhibits superior performance in vegetation segmentation tasks compared to single-temporal and single-modal approaches. VRSFormer outperforms other models in accuracy and generalization.
In the future, our goal is to establish a professional vegetation database that meets the research community’s needs. This database will be characterized by its high-resolution imagery, multi-temporal coverage, diverse vegetation types, and multi-modal data sources. Such a comprehensive vegetation database will facilitate in-depth analysis, algorithm development, and comparative studies in remote sensing vegetation segmentation. By providing researchers with access to rich and diverse data, we aim to foster advancements in vegetation segmentation techniques, thus ultimately contributing to a better understanding and management of vegetation resources at a global scale.

Author Contributions

Conceptualization, F.Q., Y.S. and H.H.; methodology, F.Q. and Y.S.; software, F.Q.; validation, F.Q.; investigation, H.Y. and F.Q.; data curation, F.Q. and Y.S.; writing—original draft preparation, F.Q. and H.Y.; writing—review and editing, M.Z., L.L., Y.S., J.Z., D.H. and H.H.; visualization, F.Q.; supervision, H.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (grant number 2021YFD200060102), the Strategic Priority Research Program of the Chinese Academy of Sciences (grant number XDA28120402), and the HFIPS Director’s Fund (grant number 2023YZGH04).

Data Availability Statement

In this study, the dataset comprises the map information of a specific region. Due to privacy constraints, the entire dataset cannot be fully disclosed. Nevertheless, to ensure transparency and reproducibility, we have shared a portion of our dataset, including images and the NDVI data from five growth cycles of the plots. Experts can use this subset data to replicate our methods. This selection is made to serve the research objectives while minimizing potential impacts on privacy. We release this data at https://github.com/qfwysw/VRSData (accessed on 8 December 2023).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study.

References

  1. Nevavuori, P.; Narra, N.; Linna, P.; Lipping, T. Crop Yield Prediction Using Multitemporal UAV Data and Spatio-Temporal Deep Learning Models. Remote Sens. 2020, 12, 4000. [Google Scholar] [CrossRef]
  2. Lv, X.; Ming, D.; Chen, Y.; Wang, M. Very high resolution remote sensing image classification with SEEDS-CNN and scale effect analysis for superpixel CNN classification. Int. J. Remote Sens. 2019, 40, 506–531. [Google Scholar] [CrossRef]
  3. Das, S.; Biswas, A.; Vimalkumar, C.; Sinha, P. Deep Learning Analysis of Rice Blast Disease Using Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2023, 20, 2500905. [Google Scholar] [CrossRef]
  4. Matikainen, L.; Karila, K. Segment-Based Land Cover Mapping of a Suburban Area-Comparison of High-Resolution Remotely Sensed Datasets Using Classification Trees and Test Field Points. Remote Sens. 2011, 3, 1777–1804. [Google Scholar] [CrossRef]
  5. Liu, R.; Tao, F.; Liu, X.; Na, J.; Leng, H.; Wu, J.; Zhou, T. RAANet: A Residual ASPP with Attention Framework for Semantic Segmentation of High-Resolution Remote Sensing Images. Remote Sens. 2022, 14, 3109. [Google Scholar] [CrossRef]
  6. Wang, Z.; Wang, J.; Yang, K.; Wang, L.; Su, F.; Chen, X. Semantic segmentation of high-resolution remote sensing images based on a class feature attention mechanism fused with Deeplabv3+. Comput. Geosci. 2022, 158, 104969. [Google Scholar] [CrossRef]
  7. Helber, P.; Bischke, B.; Dengel, A.; Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2019, 12, 2217–2226. [Google Scholar] [CrossRef]
  8. Alemohammad, H.; Booth, K. LandCoverNet: A global benchmark land cover classification training dataset. arXiv 2020, arXiv:2012.03111. [Google Scholar]
  9. Garnot, V.S.F.; Landrieu, L. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 4872–4881. [Google Scholar]
  10. Tong, X.Y.; Xia, G.S.; Lu, Q.; Shen, H.; Li, S.; You, S.; Zhang, L. Land-cover classification with high-resolution remote sensing images using transferable deep models. Remote Sens. Environ. 2020, 237, 111322. [Google Scholar] [CrossRef]
  11. Toker, A.; Kondmann, L.; Weber, M.; Eisenberger, M.; Camero, A.; Hu, J.; Hoderlein, A.P.; Şenaras, Ç.; Davis, T.; Cremers, D.; et al. Dynamicearthnet: Daily multi-spectral satellite dataset for semantic change segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 21158–21167. [Google Scholar]
  12. Rahnemoonfar, M.; Chowdhury, T.; Sarkar, A.; Varshney, D.; Yari, M.; Murphy, R.R. FloodNet: A High Resolution Aerial Imagery Dataset for Post Flood Scene Understanding. IEEE Access 2021, 9, 89644–89654. [Google Scholar] [CrossRef]
  13. Bastani, F.; Wolters, P.; Gupta, R.; Ferdinando, J.; Kembhavi, A. Satlas: A large-scale, multi-task dataset for remote sensing image understanding. arXiv 2022, arXiv:2211.15660. [Google Scholar]
  14. Panboonyuen, T.; Jitkajornwanich, K.; Lawawirojwong, S.; Srestasathiern, P.; Vateekul, P. Transformer-based decoder designs for semantic segmentation on remotely sensed images. Remote Sens. 2021, 13, 5100. [Google Scholar] [CrossRef]
  15. Zhang, C.; Jiang, W.; Zhang, Y.; Wang, W.; Zhao, Q.; Wang, C. Transformer and CNN hybrid deep neural network for semantic segmentation of very-high-resolution remote sensing imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4408820. [Google Scholar] [CrossRef]
  16. Xu, Z.; Zhang, W.; Zhang, T.; Yang, Z.; Li, J. Efficient transformer for remote sensing image segmentation. Remote Sens. 2021, 13, 3585. [Google Scholar] [CrossRef]
  17. Tarasiou, M.; Chavez, E.; Zafeiriou, S. ViTs for SITS: Vision Transformers for Satellite Image Time Series. arXiv 2023, arXiv:2301.04944. [Google Scholar]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., von Luxburg, U., Bengio, S., Wallach, H.M., Fergus, R., Vishwanathan, S.V.N., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA; pp. 5998–6008. [Google Scholar]
  19. Li, X.; Li, X.; Zhang, L.; Cheng, G.; Shi, J.; Lin, Z.; Tan, S.; Tong, Y. Improving Semantic Segmentation via Decoupled Body and Edge Supervision. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020. [Google Scholar]
  20. Xu, L.; Shi, S.; Liu, Y.; Zhang, H.; Wang, D.; Zhang, L.; Liang, W.; Chen, H. A large-scale remote sensing scene dataset construction for semantic segmentation. Int. J. Image Data Fusion 2023, 14, 1–25. [Google Scholar] [CrossRef]
  21. Boguszewski, A.; Batorski, D.; Ziemba-Jankowska, N.; Dziedzic, T.; Zambrzycka, A. LandCover. ai: Dataset for automatic mapping of buildings, woodlands, water and roads from aerial imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 1102–1110. [Google Scholar]
  22. Demir, I.; Koperski, K.; Lindenbaum, D.; Pang, G.; Huang, J.; Basu, S.; Hughes, F.; Tuia, D.; Raskar, R. Deepglobe 2018: A challenge to parse the earth through satellite images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 172–181. [Google Scholar]
  23. Chiu, M.T.; Xu, X.; Wei, Y.; Huang, Z.; Schwing, A.G.; Brunner, R.; Khachatrian, H.; Karapetyan, H.; Dozier, I.; Rose, G.; et al. Agriculture-vision: A large aerial image database for agricultural pattern analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2828–2838. [Google Scholar]
  24. Wang, J.; Zheng, Z.; Ma, A.; Lu, X.; Zhong, Y. LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation. arXiv 2021, arXiv:2110.08733. [Google Scholar]
  25. Chen, L.C.; Papandreou, G.; Kokkinos, I.; Murphy, K.; Yuille, A.L. Semantic image segmentation with deep convolutional nets and fully connected crfs. arXiv 2014, arXiv:1412.7062. [Google Scholar]
  26. Diakogiannis, F.I.; Waldner, F.; Caccetta, P.; Wu, C. ResUNet-a: A deep learning framework for semantic segmentation of remotely sensed data. ISPRS J. Photogramm. Remote Sens. 2020, 162, 94–114. [Google Scholar] [CrossRef]
  27. Li, R.; Zheng, S.; Duan, C.; Su, J.; Zhang, C. Multistage Attention ResU-Net for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  28. Li, R.; Duan, C.; Zheng, S.; Zhang, C.; Atkinson, P.M. MACU-Net for Semantic Segmentation of Fine-Resolution Remotely Sensed Images. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  29. Li, R.; Zheng, S.; Zhang, C.; Duan, C.; Su, J.; Wang, L.; Atkinson, P.M. Multiattention Network for Semantic Segmentation of Fine-Resolution Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–13. [Google Scholar] [CrossRef]
  30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  31. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  32. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
  33. Sheng, J.; Sun, Y.; Huang, H.; Xu, W.; Pei, H.; Zhang, W.; Wu, X. HBRNet: Boundary Enhancement Segmentation Network for Cropland Extraction in High-Resolution Remote Sensing Images. Agriculture 2022, 12, 1284. [Google Scholar] [CrossRef]
  34. Wang, L.; Li, R.; Zhang, C.; Fang, S.; Duan, C.; Meng, X.; Atkinson, P.M. UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J. Photogramm. Remote Sens. 2022, 190, 196–214. [Google Scholar] [CrossRef]
  35. Song, P.; Li, J.; An, Z.; Fan, H.; Fan, L. CTMFNet: CNN and Transformer Multi-scale Fusion network of Remote Sensing Urban Scene Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 61, 1–14. [Google Scholar] [CrossRef]
  36. Li, Q.; Zhong, R.; Du, X.; Du, Y. TransUNetCD: A hybrid transformer network for change detection in optical remote-sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–19. [Google Scholar] [CrossRef]
  37. He, X.; Zhou, Y.; Zhao, J.; Zhang, D.; Yao, R.; Xue, Y. Swin Transformer Embedding UNet for Remote Sensing Image Semantic Segmentation. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–15. [Google Scholar] [CrossRef]
  38. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; Proceedings, Part III 18. Springer: Cham, Switzerland, 2015; pp. 234–241. [Google Scholar]
  39. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818. [Google Scholar]
  40. Strudel, R.; Garcia, R.; Laptev, I.; Schmid, C. Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 7262–7272. [Google Scholar]
  41. Cheng, B.; Schwing, A.; Kirillov, A. Per-pixel classification is not all you need for semantic segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 17864–17875. [Google Scholar]
  42. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 1290–1299. [Google Scholar]
  43. Zhang, W.; Pang, J.; Chen, K.; Loy, C.C. K-net: Towards unified image segmentation. Adv. Neural Inf. Process. Syst. 2021, 34, 10326–10338. [Google Scholar]
  44. Fan, M.; Lai, S.; Huang, J.; Wei, X.; Chai, Z.; Luo, J.; Wei, X. Rethinking bisenet for real-time semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9716–9725. [Google Scholar]
  45. Ranftl, R.; Bochkovskiy, A.; Koltun, V. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 12179–12188. [Google Scholar]
Figure 1. Our dataset contains images and multi-modal data information from five periods of the same land parcel. The first row is image data, the second row is multi-modal data, and the third row is annotated images. Black represents others and white represents vegetation. (Period-1 represents the tillering stage, Period-2 and Period-3 indicate the jointing stage, Period-4 corresponds with the heading stage, and Period-5 represents the grain filling stage).
Figure 2. The Vegetation Knowledge Graph (VKG) defined in the VRS dataset. The VKG encompasses three types of knowledge: multi-modal, multi-temporal, and segmentation. Additionally, we have visualized how we utilized longitude and latitude information to obtain data from the same region across different periods.
Figure 3. The vegetation remote sensing data acquisition equipment consisted of six cameras capturing spectral bands: blue (450 ± 16 nm), green (560 ± 16 nm), red (650 ± 16 nm), red edge (730 ± 16 nm), and near infrared (840 ± 26 nm).
Figure 4. The overall pipeline of VRSFormer. The inputs of VRSFormer are multi-temporal RGB-N images, and it has the following four components: (1) the multi-modal self-attention block (MSA), (2) the multi-temporal cross-attention module (TCA), (3) the multi-scale decoder, and (4) the boundary module (Boundary).
Figure 5. Architecture of the multi-modal self-attention module (MSA).
Figure 6. Architecture of the multi-temporal cross-attention module (TCA).
Figure 7. Architecture of the boundary module.
Figure 8. Qualitative comparisons between our temporal method and other models for validation on the VRS dataset.
Figure 9. Performance versus speed and trainable parameter numbers. Points closer to the left suggest higher speed or fewer parameters, while points closer to the right suggest better performance.
Figure 10. Quantitative analysis of different output feature maps of the decoder layer.
Figure 11. Qualitative comparisons of the baseline Segformer, VRSFormer, and VRSFormer without the Boundary module on the validation split of the VRS dataset.
Figure 12. Qualitative comparison of the impact of data from different individual periods on the segmentation of the same land parcel at a given period.
Figure 13. Qualitative comparison of the impact of using multiple periods of data on the segmentation of the same land parcel at a given period.
Table 1. Comparison between VRS and other remote sensing datasets.

| Image Level | Ground Sample Distance (m) | Dataset | Year | Vegetation | Multi-Temporal | Multi-Modal |
| --- | --- | --- | --- | --- | --- | --- |
| Meter level | 10+ | EuroSAT [7] | 2019 | | | |
| Meter level | 10 | LandCoverNet [8] | 2020 | | | |
| Meter level | 10 | PASTIS [9] | 2021 | | | |
| Meter level | 4 | GID [10] | 2020 | | | |
| Meter level | 3 | DynamicEarthNet [11] | 2022 | | | |
| Sub-meter level | 0.25–0.5 | LandCover.ai [21] | 2020 | | | |
| Sub-meter level | 0.5 | DeepGlobe [22] | 2018 | | | |
| Sub-meter level | 0.3 | LoveDA [24] | 2021 | | | |
| Sub-meter level | 0.3 or 1 | JSsampleP [20] | 2023 | | | |
| Sub-meter level | 0.1 | Agriculture-Vision [23] | 2020 | | | |
| Sub-meter level | 0.06–0.12 | VRS (Ours) | 2023 | | | |
Table 2. Comparison with the mainstream methods on the ATS dataset.

| Method | IoU Vegetation (%) | IoU Background (%) | Acc Vegetation (%) | Acc Background (%) | mIoU | mAcc |
| --- | --- | --- | --- | --- | --- | --- |
| Unet [38] | 87.55 | 80.51 | 94.46 | 87.55 | 84.03 | 91.00 |
| DeeplabV3+ [39] | 89.29 | 82.8 | 96.28 | 87.67 | 86.05 | 91.97 |
| Segformer [31] | 89.89 | 84.43 | 94.88 | 91.25 | 87.16 | 93.06 |
| Segmenter [40] | 90.1 | 84.46 | 95.72 | 90.16 | 87.28 | 92.94 |
| SwinTransformer [32] | 90.86 | 85.57 | 96.31 | 90.55 | 88.22 | 93.43 |
| Maskformer [41] | 89.09 | 82.36 | 96.48 | 86.93 | 85.73 | 91.70 |
| Mask2former [42] | 90.48 | 84.69 | 96.84 | 88.91 | 87.59 | 92.88 |
| Knet [43] | 90.27 | 84.17 | **97.23** | 87.85 | 87.22 | 92.54 |
| Stdc [44] | 90.25 | 84.67 | 95.86 | 90.20 | 87.01 | 92.73 |
| Dpt [45] | 89.92 | 84.1 | 95.84 | 89.62 | 87.46 | 93.66 |
| VRSFormer (ours) | **92.6** | **88.49** | 96.43 | **93.47** | **90.54** | **94.95** |

We bold the best result in each column.
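The per-category IoU and Acc values and their means (mIoU, mAcc) reported in these tables follow the usual confusion-matrix definitions, with per-category Acc commonly taken as the recall of that category (the convention used in the sketch below). A minimal NumPy implementation for the two-class vegetation/background setting:

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, num_classes: int = 2) -> dict:
    """Per-class IoU and accuracy (recall) plus their means, from integer label maps."""
    conf = np.zeros((num_classes, num_classes), dtype=np.int64)
    np.add.at(conf, (gt.ravel(), pred.ravel()), 1)   # rows: ground truth, columns: prediction
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp                        # predicted as class c but labelled otherwise
    fn = conf.sum(axis=1) - tp                        # labelled class c but predicted otherwise
    iou = tp / np.maximum(tp + fp + fn, 1)
    acc = tp / np.maximum(tp + fn, 1)                 # per-class accuracy = recall of that class
    return {"IoU": iou, "Acc": acc, "mIoU": iou.mean(), "mAcc": acc.mean()}

# Toy check: class 1 = vegetation, class 0 = background.
gt = np.array([[0, 0, 1, 1], [0, 1, 1, 1]])
pred = np.array([[0, 1, 1, 1], [0, 1, 1, 0]])
print(segmentation_metrics(pred, gt))
```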
Table 3. Ablation results for the Boundary module of Segformer and VRSFormer on the VRS dataset.

| Method | Boundary | IoU Vegetation (%) | IoU Background (%) | mIoU |
| --- | --- | --- | --- | --- |
| VRSFormer | No | 89.97 | 84.44 | 87.47 |
| VRSFormer | Yes | 92.6 | 88.49 | 90.54 |
| Segformer | No | 89.89 | 84.43 | 87.16 |
| Segformer | Yes | 90.6 | 85.24 | 87.92 |
Table 4. Quantitative results of placing the boundary module at different BEV features.

| Boundary Place | VRSFormer mIoU (%) |
| --- | --- |
| b1 | 88.72 |
| b2 | 87.03 |
| b3 | 88.27 |
| b4 | 90.54 |
Table 5. Quantitative comparison between VRSFormer and VRSFormer without the NDVI on the different periods of the VRS validation data.

| Help Data | VRSFormer (Without NDVI) mIoU (%) | VRSFormer mIoU (%) |
| --- | --- | --- |
| Period-1 | 89.03 | 90.54 |
| Period-2 | 87.86 | 89.80 |
| Period-3 | 88.08 | 89.91 |
| Period-4 | 87.59 | 89.03 |
| Period-5 | 87.88 | 89.38 |
Table 6. Segformer+ was trained on the Period-1 data of the VRS dataset and tested on the data of the other periods. Segformer and VRSFormer without NDVI were trained on the VRS dataset and tested on the different periods of the VRS dataset.

| Test Data | Segformer+ | Segformer | VRSFormer (Without NDVI) |
| --- | --- | --- | --- |
| Period-1 | 90.69 | 87.86 | 89.02 |
| Period-2 | 82.35 | 86.78 | 89.32 |
| Period-3 | 84.27 | 87.70 | 89.40 |
| Period-4 | 59.28 | 85.46 | 88.16 |
| Period-5 | 36.05 | 85.73 | 87.60 |

All values are mIoU (%).
Table 7. Quantitative analysis of the segmentation of parcels at specific periods, exploring the influence of multi-temporal data fusion on the segmentation results, as well as the impact of the different help periods.

| Testing Data | Help Data (Period-1 to Period-5) | IoU Agriculture (%) | IoU Background (%) | Acc Agriculture (%) | Acc Background (%) | mIoU | mAcc |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Period-1 | | 92.98 | 89.3 | 96.59 | 94.01 | 91.14 | 95.3 |
| Period-1 | | 92.86 | 89.06 | 96.68 | 93.63 | 90.96 | 95.16 |
| Period-1 | | 92.66 | 88.82 | 96.41 | 93.75 | 90.74 | 95.08 |
| Period-1 | | 92.39 | 88.35 | 96.48 | 93.15 | 90.37 | 94.82 |
| Period-1 | | 92.84 | 89.08 | 96.53 | 93.86 | 90.96 | 95.19 |
| Period-1 | | 91.33 | 87.19 | 94.61 | 94.45 | 89.26 | 94.53 |
| Period-1 | | 88.71 | 84.03 | 91.65 | 94.88 | 86.37 | 93.26 |
| Period-1 | | 89.71 | 85.31 | 92.47 | 95.24 | 87.51 | 93.86 |
| Period-1 | | 92.39 | 88.46 | 96.12 | 93.76 | 90.42 | 94.94 |
| Period-1 | | 91.48 | 87.37 | 94.79 | 94.4 | 89.42 | 94.4 |
| Period-1 | | 86.49 | 81.57 | 89.15 | 95.26 | 84.03 | 92.2 |
| Period-2 | | 88.23 | 82.67 | 91.93 | 93.31 | 85.45 | 92.62 |
| Period-2 | | 90.68 | 85.82 | 94.34 | 93.56 | 88.25 | 93.95 |
| Period-2 | | 86.21 | 80.68 | 89.02 | 94.8 | 83.45 | 91.91 |
| Period-3 | | 91.92 | 87.41 | 95.94 | 93.05 | 89.66 | 94.49 |
| Period-3 | | 91.08 | 86.25 | 95.14 | 92.92 | 88.66 | 94.03 |
| Period-3 | | 86.63 | 80.88 | 90.15 | 93.54 | 83.76 | 91.85 |
| Period-4 | | 90.71 | 85.65 | 95.11 | 92.3 | 88.18 | 93.7 |
| Period-4 | | 90.9 | 85.73 | 95.79 | 91.46 | 88.38 | 93.65 |
| Period-4 | | 86.29 | 80.94 | 90.4 | 92.02 | 83 | 91.89 |
| Period-5 | | 90.47 | 85.03 | 95.36 | 90.31 | 87.75 | 93.34 |
| Period-5 | | 90.43 | 84.81 | 95.97 | 90.24 | 87.63 | 93.27 |
| Period-5 | | 86.08 | 80.63 | 90.2 | 90.82 | 83.29 | 91.74 |

✓ indicates that the corresponding help data was utilized during testing.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
