Technical Note

A Novel Multimodal Species Distribution Model Fusing Remote Sensing Images and Environmental Features

Xiaojuan Zhang, Yongxiu Zhou, Peihao Peng and Guoyan Wang
1 College of Earth Sciences, Chengdu University of Technology, Chengdu 610059, China
2 Institute of Ecological Resources and Landscape Architecture, Chengdu University of Technology, Chengdu 610059, China
3 College of Geophysics, Chengdu University of Technology, Chengdu 610059, China
4 College of Tourism and Urban-Rural Planning, Chengdu University of Technology, Chengdu 610059, China
* Author to whom correspondence should be addressed.
Sustainability 2022, 14(21), 14034; https://doi.org/10.3390/su142114034
Submission received: 20 September 2022 / Revised: 19 October 2022 / Accepted: 25 October 2022 / Published: 28 October 2022

Abstract

Species distribution models (SDMs) are critical in conservation decision-making and ecological or biogeographical inference. Accurately predicting species distributions can facilitate resource monitoring and management for sustainable regional development. Current species distribution models usually use a single source of information as model input. To address the limited accuracy of species distribution models built on a single information source, we propose a multimodal species distribution model that can take multiple information sources as input simultaneously. We used ResNet50 and Transformer network structures as the backbone for multimodal data modeling. The model was evaluated on the GeoLifeCLEF 2020 dataset, where it achieves state-of-the-art (SOTA) accuracy. We found that the prediction accuracy of the multimodal species distribution model taking remote sensing images, environmental variables, and latitude and longitude information as inputs (29.56%) was higher than that of models with only remote sensing images or only environmental variables as inputs (25.72% and 21.68%, respectively). We also found that using a Transformer network structure to fuse data from multiple sources can significantly improve the accuracy of multimodal models. We present a novel multimodal model that fuses multiple sources of information for species distribution prediction, advancing the application of multimodal models in ecology.

1. Introduction

Species distribution models (SDMs) have become a fundamental tool in ecology, biodiversity conservation, biogeography, and natural resource management [1,2,3,4,5]. Traditional SDMs typically correlate the presence (or presence/absence) of species at multiple sites with relevant environmental covariates (temperature, precipitation, altitude, land cover, soil type, etc.) to estimate habitat preferences or predict distributions; these outputs are commonly used to inform ecological and biogeographical theory as well as conservation decisions [6,7,8,9,10]. Researchers in ecology and geology also use remote sensing images for automatic classification, which provides a convenient means of mapping forests and land use types. With the continuous development of remote sensing technology and the improving accuracy of remote sensing imagery, many researchers have started to incorporate remote sensing images as variables in species distribution models [11,12,13,14,15]. Cerrejón et al. (2020) combined remote sensing data with species distribution models to predict and map bryophyte communities and diversity patterns across a 28,436 km2 boreal forest in Quebec (Canada), demonstrating the potential of remote sensing data for assessing and predicting bryophyte diversity across landscapes [12]. Combining field and remote sensing data, Scholl et al. trained a random forest model to classify four major alpine conifer species at the Niwot Ridge Mountain Research Station in Colorado, achieving 69% and 60% accuracy on the validation set [16]. Bhattarai et al. (2021) used Sentinel-1 synthetic aperture radar (SAR) and Sentinel-2 multispectral imagery together with several site variables (191 covariates in total) from northern New Brunswick, Canada, and mapped the distribution and abundance of spruce budworm (Choristoneura fumiferana; SBW) host species using a random forest algorithm [17].
Just as the ImageNet dataset [18] greatly advanced the development of deep learning techniques, many researchers working on remote sensing image-based SDMs have started to produce and open-source their own datasets to compare the performance and accuracy of different models and to advance the field. Marconi et al. (2019) held a competition for tree species classification based on remote sensing images, which offered three tracks on canopy segmentation, tree alignment, and species classification, contributing to the development of remote sensing methods for ecology and biology [19]. Lorieul et al. (2021) organized a competition called GeoLifeCLEF to study the relationship between the environment and the possible occurrence of species; the associated dataset collected 1.9 million observations with their corresponding remote sensing data and is the largest open-source dataset for studying species distribution [20].
As deep learning algorithms make breakthroughs in various unimodal tasks, researchers are beginning to focus on multimodal tasks that are closer to the real world. Multimodal models are now widely used in areas such as image captioning, text-to-image generation, visual question answering, and visual reasoning [21]. Yu et al. (2020) proposed a multimodal deep self-attention network (multimodal transformer) for image captioning that takes an image as input and outputs the corresponding textual description; it won first place in the Microsoft Common Objects in Context (MSCOCO) Image Captioning Challenge at the time [22]. Xu et al. (2018) proposed a deep attentional multimodal similarity model to train a text-to-image generator, which can supplement missing details in images based on the input text descriptions; it improved the best score by 14.14% on the Caltech-UCSD Birds (CUB) dataset and by 170.25% on the Common Objects in Context (COCO) dataset [23]. Ben-Younes et al. (2017) proposed a multimodal model based on Tucker decomposition for visual question answering; such models typically take a textual question and a corresponding image as input and output the answer to that question [24]. Nam et al. (2017) proposed Dual Attention Networks (DAN), which use visual and textual attention mechanisms to capture the interactions between vision and language and can be used for multimodal reasoning and matching [25].
Before the emergence of the Transformer model [26], the backbone of multimodal models was primarily the convolutional neural network. The Transformer was first proposed for natural language processing (NLP) with good results; it and its variants were then applied to the vision field, also with good results, and some researchers have now begun to apply Transformer models to multimodal domains. Lu et al. (2019) proposed the Vision-and-Language BERT (ViLBERT) model for visual question-answering tasks, which used the BERT model as the main architecture and achieved the best results on the four vision-and-language tasks tested [27]. Chen et al. (2019) proposed a joint image-text representation network called Universal Image-Text Representation (UNITER), which used a multilayer Transformer mechanism and achieved the best results on six vision-and-language (V + L) datasets [28]. Li et al. (2021) proposed a new pre-training method, Vision-Language Pre-training by Aligning Semantics (SemVLP), using the Transformer network as the pre-training model, which is able to align cross-modal representations at different semantic granularities [29]. We note that most current Transformer-based multimodal models address combinations of vision and language, while multimodal models for species distribution prediction have not yet been proposed.
Although multimodal information, including environmental variables and remote sensing images, can be extracted for sample points in species distribution prediction, we have found few researchers using multimodal models to study species distribution. We therefore propose a Transformer-based multimodal model for species distribution prediction and use it to explore two questions: whether a model that uses both remote sensing images and environmental variables as inputs is more accurate than a model using only remote sensing images or only environmental variables, and how much the different fusion structures of the Transformer-based backbone affect the model results.

2. Materials and Methods

2.1. Dataset

We used GeoLifeCLEF 2020 [30] as the training and testing dataset. It contains 1.9 million sample points in the United States and France, covering 31,435 plant and animal species. Each sample point has a corresponding high-resolution remote sensing image, infrared image, digital elevation map, and land cover map; the high-resolution remote sensing images have a spatial resolution of 1 m, and each is a 256 × 256-pixel image patch centered on the sampling point (x, y). The dataset also provides 19 climate variables and 8 soil variables for each sample point (Table 1), which we extracted at each occurrence point (x, y). To test the effectiveness of the various algorithms, we randomly chose 2.5% of the GeoLifeCLEF 2020 data as test data and the remaining 97.5% as training data.
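As a minimal sketch of this setup (the file name and column handling below are hypothetical, not the authors' preprocessing code), the random 97.5%/2.5% split and the per-feature normalization of the 27 environmental variables could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical export of the GeoLifeCLEF 2020 occurrence table: one row per
# sample point with a species id, coordinates, and the 19 bio-climatic plus
# 8 soil variables listed in Table 1.
obs = pd.read_csv("geolifeclef2020_observations.csv")

# Random 97.5% / 2.5% train/test split of the sample points.
rng = np.random.default_rng(seed=42)
test_mask = rng.random(len(obs)) < 0.025
train_df, test_df = obs[~test_mask], obs[test_mask]

# Normalize each environmental variable with training-set statistics so that
# all 27 features enter the network on a comparable scale.
env_cols = [c for c in obs.columns if c.startswith(("bio_", "orcdrc", "phihox",
            "cecsol", "bdticm", "clyppt", "sltppt", "sndppt"))]
mean, std = train_df[env_cols].mean(), train_df[env_cols].std()
train_env = (train_df[env_cols] - mean) / std
test_env = (test_df[env_cols] - mean) / std
```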

2.2. Multimodal Models

Species distribution models (SDMs) can be classified into three types based on the type of input data: models based on environmental variables, models based on remote sensing images, and models based on multimodal data (Figure 1). Models based on environmental variables use temperature, precipitation, longitude, latitude, elevation, slope, aspect, soil type, and similar covariates as input and achieve the final classification via machine learning. Remote sensing image-based classification models take remote sensing images (optical, radar, vegetation indices, texture indices, etc.) as input and classify the data using machine learning; they are currently frequently employed in vegetation classification and geology. Multimodal data, on the other hand, have received little attention: most of the work reviewed so far either extracted environmental information from sample points or used remote sensing information for classification. The former is mainly implemented with ecological niche models (MaxEnt, GARP, Bioclim, Domain, random forests, etc.), while the latter uses a wide range of machine learning models (random forests, support vector machines, decision trees, maximum likelihood methods, artificial neural networks, etc.). Because models of the first type learn only environmental variables, their accuracy is the lowest of the three types, and the results of the GeoLifeCLEF 2021 competition [20] show that deep learning models with remote sensing images as input outperformed machine learning models with environmental variable data as input. By analogy with the human information processing system, the more comprehensive the information about a sample point, the more accurately we can judge its classification. As a result, we propose a multimodal model that uses both remote sensing data and environmental variable data as input for SDM modeling.
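To make the environmental-variable family of models concrete, a random forest SDM of the kind used as a baseline later in this paper can be sketched with scikit-learn as follows (a sketch only; the arrays are toy stand-ins, not the GeoLifeCLEF data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-ins for the per-point environmental covariates and species labels
# (in the paper these come from the 27 variables in Table 1).
rng = np.random.default_rng(0)
train_env, test_env = rng.random((5_000, 27)), rng.random((500, 27))
train_species = rng.integers(0, 100, size=5_000)

rf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
rf.fit(train_env, train_species)

# predict_proba gives one score per candidate species; the 30 most probable
# species per test point form the model's top-30 prediction.
probs = rf.predict_proba(test_env)
top30 = rf.classes_[np.argsort(probs, axis=1)[:, -30:]]
```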

2.2.1. Multimodal Model Based on ResNet50

Although deep learning species distribution prediction models based on remote sensing images outperformed machine learning approaches that use only environmental variables [20], completely discarding the environmental variables can lose a significant amount of useful information. To allow the model to exploit both sources, we propose a multimodal model that takes image information and environmental variable information as input simultaneously and extracts and fuses the remote sensing image features and environmental variable features of each sample point.
Our multimodal model based on ResNet50 uses the Deep Residual Network (ResNet) as the main structure for extracting remote sensing image features. By introducing residual connections, ResNet alleviates the vanishing-gradient problem caused by very deep convolutional networks. Since its inception, ResNet has been widely used for classification, detection, segmentation, and other tasks, and it is one of the most common backbone networks. The ResNet50 structure consists of five stages: stage 0 serves as the pre-processing layer of the input image, containing a 7 × 7 convolution and a 3 × 3 max pooling layer, while the last four stages act as the feature extraction layers and are all built from Bottleneck blocks. Their structures are similar, and each stage begins with a Bottleneck that downsamples the features. Stages 1, 2, 3, and 4 consist of 3, 4, 6, and 3 Bottlenecks, respectively. Finally, an average pooling layer produces a 2048-dimensional feature vector.
The proposed Species Distribution Multimodal Model Network (SDMM-Net) structure has two input branches (Figure 2). One branch takes an RGB remote sensing image as input to provide image information for the sample point; this input is passed through the ResNet network to produce a 2048-dimensional vector. The other branch feeds the 27 environmental variables of the sample site into the model. To facilitate network processing, each environmental variable was normalized before being entered into the network. We start with a 27-dimensional vector, pass it through a fully connected (FC) layer to obtain a 2048-dimensional vector, and then fuse the two 2048-dimensional vectors from the two branches into a 4096-dimensional feature vector. Finally, the 4096-dimensional vector is followed by an FC layer and a softmax layer to produce a 31,435-dimensional output (there are 31,435 species classes in the training dataset).
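The following PyTorch sketch reflects our reading of this two-branch design (it is not the authors' released code; the lack of pretrained weights, the input size, and the use of concatenation for the 2048 + 2048 → 4096 fusion are assumptions consistent with the description above):

```python
import torch
import torch.nn as nn
import torchvision

class SDMMNet(nn.Module):
    """Two-branch multimodal SDM: ResNet50 image branch + FC environment branch."""
    def __init__(self, n_env: int = 27, n_classes: int = 31435):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)  # torchvision >= 0.13
        backbone.fc = nn.Identity()                 # keep the 2048-d pooled features
        self.image_branch = backbone
        self.env_branch = nn.Linear(n_env, 2048)    # 27 -> 2048
        self.classifier = nn.Linear(2048 + 2048, n_classes)  # 4096 -> 31,435

    def forward(self, image: torch.Tensor, env: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_branch(image)              # (B, 2048)
        env_feat = self.env_branch(env)                  # (B, 2048)
        fused = torch.cat([img_feat, env_feat], dim=1)   # (B, 4096)
        return self.classifier(fused)                    # softmax applied in the loss

model = SDMMNet()
logits = model(torch.randn(2, 3, 256, 256), torch.randn(2, 27))
print(logits.shape)  # torch.Size([2, 31435])
```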

2.2.2. Multimodal Model Based on Swin Transformer

The Swin Transformer [33] is an attention-based model structure proposed by Microsoft Research Asia. Upon its release, the network achieved SOTA results on several benchmark datasets, and it remains one of the most effective attention-based models. Four versions of the Swin Transformer were officially proposed: Swin-Tiny (Swin-T), Swin-Small (Swin-S), Swin-Base (Swin-B), and Swin-Large (Swin-L). Because the computational cost of Swin-T is comparable to that of ResNet50, we designed a Transformer-based multimodal attention model using Swin-T as the backbone network, which facilitates comparison between the Transformer-based and ResNet50-based multimodal models.
Similar to a CNN, the Swin-T model has a hierarchical design with four stages, each of which reduces the resolution of the input feature map and expands the receptive field layer by layer. During image pre-processing, a Patch Embedding step cuts the image into individual patches and maps them to embedding vectors. Each stage consists of a Patch Merging module and several Block modules: the Patch Merging module reduces the image resolution at the beginning of each stage, while each Block consists of LayerNorm, a Multi-Layer Perceptron (MLP), Window Attention, and Shifted Window Attention, which extract the image features.
Current multimodal data fusion methods can usually be divided into data fusion, feature fusion and model fusion. We designed three types of Transformer multimodal models based on the Swin Transformer model structure according to different fusion methods (Figure 3, Figure 4 and Figure 5).
(1) Pre-fuze-multimodal model
The Pre-fuze-multimodal model uses the Swin-T backbone to extract features from data that have already been fused. After reading the data, the model first encodes the image into a feature vector using the Patch Partition module, then normalizes the input environmental variables using the feature norm module, then fuses the image feature vector and the environmental variable features in the feature fusion module; the fused features are fed into the Swin-T network structure, which finally outputs the classification results (Figure 3).
(2) Post-fuze-multimodal model
After reading the original data, the Post-fuze-multimodal model extracts the features of the remote sensing image using the Swin-T backbone network; at the same time, it normalizes the environmental variable features with the feature norm layer and up-dimensions the feature vector with an FC layer. It then fuses the image features and environmental variable features and, finally, classifies the fused feature values (Figure 4).
(3) Mid-fuze-multimodal model
The Mid-fuze-multimodal model fuses image features and environment variable features at each stage of Swin-T (Figure 5). Because the Swin-T network performs a downsampling-like operation on the image at each stage, the feature dimensions of the four stages are 3136, 784, 196, and 49, respectively. To enable fusion, the environment variable features must be up-dimensioned to the matching dimension with an FC layer before the fusion at each stage (see the sketch after this list).
(4) Feature fusion methods
Feature addition and feature concatenation are two commonly used feature fusion methods. Feature addition unifies the two feature vectors to the same dimension and then adds them element-wise; the lower-dimensional feature vector is typically up-dimensioned through an FC layer to the length of the higher-dimensional one. Feature concatenation directly concatenates an N-dimensional vector with an M-dimensional vector, producing an (N + M)-dimensional vector. We experimented with both feature fusion methods in the Transformer-based networks.
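The sketch below illustrates the two fusion operations and the stage-wise up-projection used by the Mid-fuze variant (a simplified reading of the text, not the authors' released code; the stage feature dimensions 3136/784/196/49 follow the description above, and the image features here are random placeholders for the Swin-T stage outputs):

```python
import torch
import torch.nn as nn

def fuse_add(img_feat: torch.Tensor, env_feat: torch.Tensor, proj: nn.Linear) -> torch.Tensor:
    """Feature addition: project the environment features to the image-feature length, then add."""
    return img_feat + proj(env_feat)

def fuse_concat(img_feat: torch.Tensor, env_feat: torch.Tensor) -> torch.Tensor:
    """Feature concatenation: an N-d and an M-d vector become an (N + M)-d vector."""
    return torch.cat([img_feat, env_feat], dim=-1)

# Mid-fuze idea: after each Swin-T stage the flattened image features have
# 3136, 784, 196, and 49 dimensions, so the 27 environmental variables are
# up-projected to the matching length before feature addition at every stage.
stage_dims = [3136, 784, 196, 49]
env_projs = nn.ModuleList([nn.Linear(27, d) for d in stage_dims])

env = torch.randn(2, 27)
for d, proj in zip(stage_dims, env_projs):
    img_feat = torch.randn(2, d)           # placeholder for the stage output
    fused = fuse_add(img_feat, env, proj)  # (2, d), passed on to the next stage
    print(d, fused.shape)

# Concatenation, by contrast, grows the feature length (as in SDMM-Net's 2048 + 2048 -> 4096):
print(fuse_concat(torch.randn(2, 2048), torch.randn(2, 2048)).shape)  # (2, 4096)
```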

3. Results

We used the top-30 accuracy as the model evaluation metric to facilitate comparisons with other methods [30]. Top-30 accuracy means that the model predicts the 30 species with the highest probability of occurring at a sample point, and the prediction is considered correct if the true species is among these 30 predictions. We assume that each sample point $i$ has a unique label $C_i$ and that the $K$ categories with the highest confidence in the model prediction are $c_{i,1}, \ldots, c_{i,K}$. If there exists a prediction $c_{i,j} = C_i$, the model prediction for that sample is considered correct. We define
$$d_{i,j} = d(c_{i,j}, C_i) = \begin{cases} 1, & \text{if } c_{i,j} \neq C_i \\ 0, & \text{otherwise} \end{cases}$$
The top-K accuracy is then calculated as
$$\mathrm{Precision}@K = 1 - \frac{1}{N} \sum_{i=1}^{N} \min_{1 \le j \le K} d_{i,j}$$
where N is the total number of test samples and K is set to 30.
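As a concrete illustration, this metric can be computed from an array of predicted class scores as follows (a short sketch; `scores` and `labels` are hypothetical toy arrays, not the GeoLifeCLEF evaluation code):

```python
import numpy as np

def top_k_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 30) -> float:
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: (N, n_classes) predicted class scores; labels: (N,) true class ids.
    """
    top_k = np.argsort(scores, axis=1)[:, -k:]      # indices of the k best classes per sample
    hits = (top_k == labels[:, None]).any(axis=1)   # True where the true label is covered
    return float(hits.mean())

# Toy example: 3 samples, 5 classes, k = 2.
scores = np.array([[0.1, 0.7, 0.2, 0.0, 0.0],
                   [0.4, 0.1, 0.1, 0.3, 0.1],
                   [0.0, 0.0, 0.1, 0.2, 0.7]])
labels = np.array([1, 3, 0])
print(top_k_accuracy(scores, labels, k=2))  # 2 of 3 hits -> 0.666...
```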

3.1. Test Results

To test how the accuracy of SDMs varies with different input data types and with different fusion structures, we propose the ResNet50-based models (SDMM-Net v1 and SDMM-Net v2) and the Swin-T-based models (PreFuzeMM-Swin, PostFuzeMM-Swin, and MidFuzeMM-Swin). For each type of model, we list the test results of the configuration with the highest accuracy (Table 2).
The test code for the Top-30 most present species and random forest models was taken directly from the code repository provided by Cole et al. [30]. The Top-30 most present species method serves as a baseline: it directly uses the 30 most frequent species in the training data as the prediction for every test sample point, so it is a purely statistical predictor and its results are only statistically meaningful. The random forest model is a popular machine learning model trained on environmental variables; it predicts that sample points with similar environmental variables are likely to host the same species. The test results for the ss model (which uses remote sensing images as input) are taken from the results provided by S. Seneviratne [33]; it is the most accurate of the published methods and is trained on RGB remote sensing images using a deep contrastive representation learning approach. The other models' test results come from the public test set of the GeoLifeCLEF dataset.
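For reference, the most-present-species baseline described above amounts to the following sketch (an illustration of the idea, not the exact code from the repository; `train_labels` is a toy stand-in for the training species ids):

```python
import numpy as np

rng = np.random.default_rng(0)
train_labels = rng.integers(0, 1000, size=50_000)   # toy stand-in for species ids

# Count how often each species occurs in the training data and keep the 30 most
# frequent; every test point then receives this same fixed prediction.
species, counts = np.unique(train_labels, return_counts=True)
top30_most_present = species[np.argsort(counts)[-30:]]

n_test_points = 2_500
predictions = np.tile(top30_most_present, (n_test_points, 1))  # shape (n_test_points, 30)
```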
From the test results we can observe the following: (1) The highest accuracy of our proposed PreFuzeMM-Swin model is 29.56%, which exceeds the accuracy of the most accurate existing model (the ss model) by 3.24 percentage points, an improvement of 10.96%. This indicates that a multimodal model with multiple data types as input can significantly improve the accuracy of existing SDMs. (2) The accuracy of the random forest model using only environmental variables as training data was 21.68%, and the accuracy of the ResNet50 model using only remote sensing images was 25.72%, indicating that both environmental variables and remote sensing data are valid training data for SDMs. (3) The accuracy of the SDMM-Net v2 model, which uses remote sensing images, environmental features, and latitude and longitude as input, is higher than that of the SDMM-Net v1 model, which uses only remote sensing images and environmental variable features, indicating that adding latitude and longitude as input can effectively improve model accuracy. (4) The PreFuzeMM-Swin model with Swin-T as the backbone outperforms the SDMM-Net v2 model with ResNet50 as the backbone (29.56% > 26.4%), indicating that the multimodal model with a Transformer backbone is more accurate than the one with the traditional ResNet structure. (5) The PreFuzeMM-Swin model has higher accuracy than the MidFuzeMM-Swin and PostFuzeMM-Swin models.

3.2. Comparison of Different Feature Fusion Structures

To compare the effects of different feature fusion structures on the accuracy of the multimodal models, we compared the final test accuracies of the three models with Swin-T as the backbone network, PreFuzeMM-Swin, PostFuzeMM-Swin, and MidFuzeMM-Swin, all of which use the feature addition method for fusion (Table 2, Figure 6). After training each model for 20 epochs with feature addition, we drew the following conclusions: regardless of the fusion structure, the accuracy of the multimodal model was higher than that of the base model using only remote sensing images as input (Swin-T); PreFuzeMM-Swin has the highest accuracy, slightly higher than MidFuzeMM-Swin, while PostFuzeMM-Swin has the lowest.

3.3. Comparison of Different Feature Fusion Methods

We selected the two most common feature fusion methods for testing: feature concatenation and feature addition. According to the test results (Table 3), the accuracy of feature concatenation is much lower than that of feature addition when the Transformer is the backbone network, and even lower than that of the network without feature fusion, which may be due to the unique attention mechanism of the Transformer network.

4. Conclusions

This research proposes a multimodal network model for species distribution prediction that can extract both image and environmental features from sample points. We call this network the Species Distribution Multimodal Model Network (SDMM-Net); tested on the GeoLifeCLEF 2020 dataset, it achieves state-of-the-art (SOTA) accuracy. We found that multimodal models that incorporate both image information and environmental variables predict species distributions better than models using only image information or only environmental variables. Furthermore, using Transformer-based multimodal models for SDM can significantly improve the accuracy of existing models. The fusion structure also has a considerable impact on model accuracy: the PreFuzeMM-Swin network is more accurate than the PostFuzeMM-Swin and MidFuzeMM-Swin networks, and feature addition is more accurate than feature concatenation.
At the same time, we found that the overall accuracy of current species distribution prediction models is still very low. The main reason lies in the experimental data: the distribution of categories in the dataset we used is uneven, and the dataset provides only a single category label per point, whereas in reality a sample point can correspond to multiple species. In a subsequent study, we may consider new methods to address the issues of unbalanced categories and single labels. Furthermore, because our training and test data come from the GeoLifeCLEF 2020 dataset, we were unable to create a prediction map. Future research can therefore focus on selecting a study area and developing better datasets to test the performance of the algorithms, as well as producing prediction maps to better monitor and manage the sustainable development of regional resources.

Author Contributions

Conceptualization, X.Z. and P.P.; Methodology, X.Z. and Y.Z.; Software, Y.Z.; Validation, X.Z., Y.Z. and P.P.; Formal Analysis, G.W.; Writing—Original Draft Preparation, X.Z.; Writing—Review & Editing, X.Z., Y.Z., P.P. and G.W.; Visualization, X.Z. and Y.Z.; Funding Acquisition, P.P. and G.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Second Tibetan Plateau Scientific Expedition and Research Program of P. R. China (2019QZKK0301), the National Natural Science Foundation of P.R. China (41671432, 31860123 and 31560153), and the Biodiversity Survey and Evaluation of the Ministry of Ecology and Environment of P. R. China (2019HJ2096001006).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental data in this study were accessed from a public dataset, which can be found here: https://www.imageclef.org/GeoLifeCLEF2020 (accessed on 24 October 2022).

Acknowledgments

We thank the editors and anonymous reviewers for their valuable comments and feedback on this manuscript. We also thank GeoLifeCLEF for the dataset provided for training and testing.

Conflicts of Interest

All authors declare no competing interests.

References

1. Franklin, J. Species distribution models in conservation biogeography: Developments and challenges. Divers. Distrib. 2013, 19, 1217–1223.
2. Guillera-Arroita, G.; Lahoz-Monfort, J.J.; Elith, J.; Gordon, A.; Kujala, H.; Lentini, P.E.; McCarthy, M.A.; Tingley, R.; Wintle, B.A. Is my species distribution model fit for purpose? Matching data and models to applications. Glob. Ecol. Biogeogr. 2015, 24, 276–292.
3. Guisan, A.; Thuiller, W. Predicting species distribution: Offering more than simple habitat models. Ecol. Lett. 2005, 8, 993–1009.
4. Guisan, A.; Tingley, R.; Baumgartner, J.B.; Naujokaitis-Lewis, I.; Sutcliffe, P.R.; Tulloch, A.I.; Regan, T.J.; Brotons, L.; McDonald-Madden, E.; Mantyka-Pringle, C.; et al. Predicting species distributions for conservation decisions. Ecol. Lett. 2013, 16, 1424–1435.
5. Newbold, T. Applications and limitations of museum data for conservation and ecology, with particular attention to species distribution models. Prog. Phys. Geog. 2010, 34, 3–22.
6. Bekessy, S.A.; Wintle, B.A.; Gordon, A.; Fox, J.C.; Chisholm, R.; Brown, B.; Regan, T.; Mooney, N.; Read, S.M.; Burgman, M.A. Modelling human impacts on the Tasmanian wedge-tailed eagle (Aquila audax fleayi). Biol. Conserv. 2009, 142, 2438–2448.
7. Keith, D.A.; Elith, J.; Simpson, C.C.; Franklin, J. Predicting distribution changes of a mire ecosystem under future climates. Divers. Distrib. 2014, 20, 440–454.
8. Pearce, J.; David, L. Bioclimatic analysis to enhance reintroduction biology of the endangered helmeted honeyeater (Lichenostomus melanops cassidix) in southeastern Australia. Restor. Ecol. 1998, 6, 238–243.
9. Reşit Akçakaya, H.; McCarthy, M.A.; Pearce, J.L. Linking landscape data with population viability analysis: Management options for the helmeted honeyeater Lichenostomus melanops cassidix. Biol. Conserv. 1995, 73, 169–176.
10. Ferrier, S.; Drielsma, M.; Manion, G.; Watson, G. Extended statistical approaches to modelling spatial pattern in biodiversity in northeast New South Wales. II. Community-level modelling. Biodivers. Conserv. 2002, 11, 2309–2338.
11. Brown, A.M.; Warton, D.I.; Andrew, N.R.; Binns, M.; Cassis, G.; Gibb, H.; Yoccoz, N. The fourth-corner solution—Using predictive models to understand how species traits interact with the environment. Methods Ecol. Evol. 2014, 5, 344–352.
12. Cerrejón, C.; Valeria, O.; Mansuy, N.; Barbé, M.; Fenton, N.J. Predictive mapping of bryophyte richness patterns in boreal forests using species distribution models and remote sensing data. Ecol. Indic. 2020, 119, 106826.
13. He, K.S.; Bradley, B.A.; Cord, A.F.; Rocchini, D.; Tuanmu, M.N.; Schmidtlein, S.; Turner, W.; Wegmann, M.; Pettorelli, N.; Nagendra, H.; et al. Will remote sensing shape the next generation of species distribution models? Remote Sens. Ecol. Conserv. 2015, 1, 4–18.
14. Sumsion, G.R.; Bradshaw, M.S.; Hill, K.T.; Pinto, L.D.G.; Piccolo, S.R. Remote sensing tree classification with a multilayer perceptron. PeerJ 2019, 7, e6101.
15. Zhang, B.; Zhao, L.; Zhang, X. Three-dimensional convolutional neural network model for tree species classification using airborne hyperspectral images. Remote Sens. Environ. 2020, 247, 111938.
16. Scholl, V.; Cattau, M.; Joseph, M.; Balch, J. Integrating National Ecological Observatory Network (NEON) Airborne Remote Sensing and In-Situ Data for Optimal Tree Species Classification. Remote Sens. 2020, 12, 1414.
17. Bhattarai, R.; Rahimzadeh-Bajgiran, P.; Weiskittel, A.; Meneghini, A.; MacLean, D.A. Spruce budworm tree host species distribution and abundance mapping using multi-temporal Sentinel-1 and Sentinel-2 satellite imagery. ISPRS J. Photogramm. Remote Sens. 2021, 172, 28–40.
18. Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
19. Marconi, S.; Graves, S.J.; Gong, D.; Nia, M.S.; Le Bras, M.; Dorr, B.J.; Fontana, P.; Gearhart, J.; Greenberg, C.; Harris, D.J.; et al. A data science challenge for converting airborne remote sensing data into ecological information. PeerJ 2019, 6, e5843.
20. Lorieul, T.; Cole, E.; Deneu, B.; Servajean, M.; Bonnet, P.; Joly, A. Overview of GeoLifeCLEF 2021: Predicting species distribution from 2 million remote sensing images. In Proceedings of the Working Notes of CLEF 2021-Conference and Labs of the Evaluation Forum, Bucharest, Romania, 21–24 September 2021; pp. 1451–1462.
21. Zhang, C.; Yang, Z.; He, X.; Deng, L. Multimodal Intelligence: Representation Learning, Information Fusion, and Applications. IEEE J. Sel. Top. Signal Process. 2020, 14, 478–493.
22. Yu, J.; Li, J.; Yu, Z.; Huang, Q. Multimodal Transformer with Multi-View Visual Representation for Image Captioning. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4467–4480.
23. Xu, T.; Zhang, P.; Huang, Q.; Zhang, H.; Gan, Z.; Huang, X.; He, X. AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 1316–1324.
24. Ben-Younes, H.; Cadene, R.; Cord, M.; Thome, N. MUTAN: Multimodal Tucker fusion for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2612–2620.
25. Nam, H.; Ha, J.-W.; Kim, J. Dual attention networks for multimodal reasoning and matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 299–307.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 7–9 December 2017; pp. 5998–6008.
27. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 8–15 December 2019; pp. 13–23.
28. Chen, Y.-C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 104–120.
29. Li, C.; Yan, M.; Xu, H.; Luo, F.; Wang, W.; Bi, B.; Huang, S. SemVLP: Vision-language pre-training by aligning semantics at multiple levels. arXiv 2021, arXiv:2103.07829.
30. Cole, E.; Deneu, B.; Lorieul, T.; Servajean, M.; Botella, C.; Morris, D.; Jojic, N.; Bonnet, P.; Joly, A. The GeoLifeCLEF 2020 dataset. arXiv 2020, arXiv:2004.04192.
31. Hijmans, R.J.; Cameron, S.E.; Parra, J.L.; Jones, P.G.; Jarvis, A. Very high resolution interpolated climate surfaces for global land areas. Int. J. Climatol. 2005, 25, 1965–1978.
32. Hengl, T.; Mendes de Jesus, J.; Heuvelink, G.B.; Ruiperez Gonzalez, M.; Kilibarda, M.; Blagotic, A.; Shangguan, W.; Wright, M.N.; Geng, X.; Bauer-Marschallinger, B.; et al. SoilGrids250m: Global gridded soil information based on machine learning. PLoS ONE 2017, 12, e0169748.
33. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 10012–10022.
Figure 1. Classification of species distribution models.
Figure 2. SDMM-Net network structure.
Figure 3. PreFuzeMM-Swin network structure.
Figure 4. PostFuzeMM-Swin network structure.
Figure 5. MidFuzeMM-Swin network structure.
Figure 6. Comparison of different feature fusion structures on model accuracy.
Table 1. Summary of the low-resolution environmental variable rasters provided. The first 19 rows correspond to the bio-climatic variables from WorldClim [31]. The last 8 rows correspond to the soil variables from SoilGrids [32].

Name | Description | Resolution
bio_1 | Annual Mean Temperature | 30 arcsec
bio_2 | Mean Diurnal Range (mean of monthly (max temp − min temp)) | 30 arcsec
bio_3 | Isothermality (bio_2/bio_7) (×100) | 30 arcsec
bio_4 | Temperature Seasonality (standard deviation × 100) | 30 arcsec
bio_5 | Max Temperature of Warmest Month | 30 arcsec
bio_6 | Min Temperature of Coldest Month | 30 arcsec
bio_7 | Temperature Annual Range (bio_5 − bio_6) | 30 arcsec
bio_8 | Mean Temperature of Wettest Quarter | 30 arcsec
bio_9 | Mean Temperature of Driest Quarter | 30 arcsec
bio_10 | Mean Temperature of Warmest Quarter | 30 arcsec
bio_11 | Mean Temperature of Coldest Quarter | 30 arcsec
bio_12 | Annual Precipitation | 30 arcsec
bio_13 | Precipitation of Wettest Month | 30 arcsec
bio_14 | Precipitation of Driest Month | 30 arcsec
bio_15 | Precipitation Seasonality (Coefficient of Variation) | 30 arcsec
bio_16 | Precipitation of Wettest Quarter | 30 arcsec
bio_17 | Precipitation of Driest Quarter | 30 arcsec
bio_18 | Precipitation of Warmest Quarter | 30 arcsec
bio_19 | Precipitation of Coldest Quarter | 30 arcsec
orcdrc | Soil organic carbon content (g/kg at 15 cm depth) | 250 m
phihox | pH × 10 in H2O (at 15 cm depth) | 250 m
cecsol | Cation exchange capacity of soil (cmolc/kg at 15 cm depth) | 250 m
bdticm | Absolute depth to bedrock (cm) | 250 m
clyppt | Clay (0–2 micrometer) mass fraction at 15 cm depth | 250 m
sltppt | Silt mass fraction at 15 cm depth | 250 m
sndppt | Sand mass fraction at 15 cm depth | 250 m
bldfie | Bulk density (kg/m³) at 15 cm depth | 250 m
Table 2. Comparison of test results.

Model | RGB Images | Environmental Variables | Longitude and Latitude | Top-30 Accuracy (%)
Top-30 most present species | | | | 4.36
Random forest | | ✓ | | 21.68
ss model | ✓ | | | 26.32
ResNet50 | ✓ | | | 25.72
SDMM-Net v1 | ✓ | ✓ | | 25.8
SDMM-Net v2 | ✓ | ✓ | ✓ | 26.4
Swin-T | ✓ | | | 25.13
PreFuzeMM-Swin | ✓ | ✓ | ✓ | 29.56
PostFuzeMM-Swin | ✓ | ✓ | ✓ | 26.07
MidFuzeMM-Swin | ✓ | ✓ | ✓ | 29.37
Table 3. Comparison of different feature fusion methods (Top-30 accuracy, %).

Fuze Pattern | Pre | Post | Mid | Base (no fusion)
addition | 29.558 | 26.066 | 29.369 | 25.133
concatenation | 7.367 | 20.977 | 23.505 | –