1. Introduction
Tree species identification is necessary for many applications related to forest monitoring. For example, species information is used in combination with allometric equations for carbon estimation, since carbon growth and storage are species-dependent [1,2,3]. The proportion of tree species and combinations of species is also used as an indicator of biodiversity and forest resilience [4,5]. In addition, climate change is increasingly affecting forests, with large-scale abiotic (fire, drought, and windthrow) and biotic (insects and pathogens) disturbances impacting species composition. Hence, there is an urgent need for updated tree species maps, e.g., for accurate analysis of tree health [6] or to support reforestation decisions [7,8,9]. Unfortunately, such information is not always available, accurate, or updated at a fine scale and for large areas. Standard ground-based inventories are time-consuming and cannot be performed on a yearly basis. For instance, in France, the last open dataset provided by IGN (Institut national de l’information géographique et forestière) is outdated, since it was produced between 2007 and 2018 (BD Forêt® V2 [10]). In addition, National Forest Inventory (NFI) plots are not freely available and do not cover forests continuously.
The use of remote sensing data has been identified as a very efficient way to map tree species in a timely manner, over large areas, and with fine spatial resolution [8,11]. Sentinel-2 (S2) satellites are increasingly used for such applications [12,13,14]. Indeed, they provide multispectral images worldwide with fine spatial resolution (up to 10 m) and a low revisit time (∼5 days in Europe) [15]. This high spatio-temporal resolution can be used to monitor changes in vegetation cover in a timely manner, which is important for capturing the specific phenology of each tree species (or, more generally, of other types of vegetation) [16]. Some studies have shown the potential interest of using additional data, such as Sentinel-1 [12], but this is not the focus of our analysis, mainly because Sentinel-1 data entail much higher computational and storage costs from an operational perspective.
Standard frameworks for mapping tree species with in situ data often rely on the random forest (RF) algorithm [17]. For example, ref. [11] observed in their review that RF is one of the most widely used algorithms for such tasks, along with other standard machine learning methods such as the support vector machine (SVM). A similar observation was made in [18], and in more recent studies the RF algorithm is still widely used, e.g., [12,19,20,21]. The popularity of RF with remote sensing data can be explained by several advantages: it is less prone to overfitting, requires less training data, and is faster to train than deep learning (DL) methods. Moreover, the RF algorithm is more easily interpretable than other algorithms, which can be valuable depending on the task at hand (see, for example, [6] in the case of forest health detection).
Despite its advantages, the RF algorithm is known to be affected by imbalanced data [22], a common problem in tree species mapping since the different species are not naturally equally distributed. Moreover, RF can be outperformed by deep learning approaches when classifying remote sensing data at a large scale [23]. In line with these points, a recent review has highlighted the shift towards deep learning models for tree species classification [24]. However, the methods reviewed focus on patch-level approaches, as in [25], where the patches are 400 × 400 S2 pixels in size. In many cases, obtaining ground data that correspond to very large patches can be difficult (in the review in [24], patch sizes range from 64 × 64 to 500 × 500 pixels). For instance, in our use case, we had field plots of various sizes, many of them only 3 × 3 S2 pixels. In addition, working with patches is more computationally intensive than working with single-pixel time series. The results obtained in [26,27] have shown that pixel-level approaches can outperform patch-level approaches for tree species classification with S2 images. This is particularly important in mixed-species forests, such as those in our study area, where large patches can contain different tree species.
Therefore, in this paper, we propose to explore pixel-level deep learning strategies adapted to work with time series [28] and compare their results with those obtained with the RF algorithm, which is still used as a standard classifier in many recent studies. Our use case focuses on a relatively small dataset, ∼4400 reference plots, which is nevertheless larger than the datasets used in many studies, such as [26]. Moreover, the study area is the Centre-Val de Loire region of France and its surroundings, which corresponds to a large area (11 S2 tiles, 110,000 km²) with a wide variety of tree species and silvicultural practices. We also validated our results on an independent validation dataset for four key species (oak, pine, beech, and chestnut), mainly to analyze the generalization of the mappings to unseen areas and minority species. Complementing previous studies such as [26], which have shown that DL methods at the pixel level can yield interesting results, some key steps are analyzed in more detail in our study, in particular the imbalanced data problem and the hyperparameter choices of the different algorithms. Finally, the DL models implemented with the open-source iota2 (version 20241121) Python library [29], a processing chain for the operational production of land cover maps from remotely sensed image time series, have been made available at https://framagit.org/fl.mouret/tree_species_classification_iota2 (accessed on 20 March 2025). The resulting maps from our analysis are also available in [30]. In this respect, the proposed framework can easily be applied to other case studies.
5. Discussion and Perspectives
The fact that the RF algorithm can be affected by imbalanced data has already been identified in the literature [12,22,42]. In [43], in a land cover classification context, the authors highlighted that RF tends to predict the majority class, which is consistent with our observations. In our case, we found that using RF with SMOTE or class weights failed to handle this issue. As observed in Section 4.3, undersampling the oak plots improves the results of the RF algorithm, but the F1 score we obtained is still much lower than those obtained with DL methods. For example, this strategy was used in [12] to map tree species in Germany, where the authors found that mapping less common species was still challenging, confirming the results we obtained with the RF algorithm. We also tested a variant of the RF algorithm, namely the balanced RF [42] implemented in the Python toolbox imbalanced-learn [44]. However, this variant led to very poor results. As a key takeaway, we recommend that practitioners consider these potential issues in future work, especially since our results highlight the potential benefit of using DL methods, which we found to generalize better, especially for less frequent classes. Other classification methods, such as sparse Gaussian processes [23], the support vector machine (SVM) algorithm [45], or tree-boosting algorithms (XGBoost) [46], were also tested without providing competitive results. In particular, we also tested other DL architectures (e.g., ResNet, 2D convolutional neural network, recurrent neural network, etc.) without improving our classification results (see Appendix A). This suggests that most DL architectures can achieve similar accuracy on our task.
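The two RF-based mitigation strategies discussed above (cost-sensitive class weights and undersampling of the dominant class) can be sketched as follows. This is a minimal illustration on synthetic data; the class proportions, undersampling ratio, and all parameter values are hypothetical and not those of our experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for pixel time-series features
# (class 0 plays the role of the dominant oak class).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)

# Strategy 1: cost-sensitive RF via inverse-frequency class weights.
rf_weighted = RandomForestClassifier(class_weight="balanced",
                                     random_state=0).fit(X, y)

# Strategy 2: undersample the majority class before fitting.
rng = np.random.default_rng(0)
majority = np.flatnonzero(y == 0)
kept = rng.choice(majority, size=len(majority) // 4, replace=False)  # keep 25%
idx = np.concatenate([kept, np.flatnonzero(y != 0)])
rf_under = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
```

Neither strategy was sufficient in our experiments, but both are cheap baselines worth checking before moving to DL models.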
Our results highlight the potential benefits of using DL models, and some simple guidelines can be followed to achieve good accuracy. Specifically, choosing a sufficiently large model (e.g., in our case, at least 200k parameters) and applying batch normalization and class weights is a good start. Our empirical analysis shows that MLPs with adequate capacity can serve as a robust baseline, requiring less extensive hyperparameter exploration than more complex architectures such as LTAE. In light of these recommendations, a review of the literature revealed a tendency in some studies to use DL models with too few parameters. For example, ref. [47] implemented an MLP with only one hidden layer of 100 neurons for tree species classification, which led to poor results. Similarly, ref. [48] used a single layer with 18 hidden units. In both cases, the small number of neurons could explain the poor performance obtained with DL methods.
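The model-size guideline can be made concrete by counting the parameters of a fully connected network directly. The layer widths and the class count below are illustrative, not the exact configuration used in this study.

```python
# Parameter count of a fully connected network: each layer contributes
# n_in * n_out weights plus n_out biases.
def mlp_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_features, n_classes = 740, 10  # hypothetical: 740 S2 features, 10 species

# Single hidden layer of 100 neurons, as in the under-sized example above:
small = mlp_params([n_features, 100, n_classes])   # 75,110 parameters

# A wider/deeper MLP clearing the ~200k-parameter guideline:
large = mlp_params([n_features, 256, 256, 256, n_classes])  # 323,850 parameters
```

Such a quick check helps rule out an under-capacity model before attributing poor results to the DL approach itself.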
Regarding the method used to address data imbalance, slight improvements and better generalization can be achieved by oversampling with the SMOTE algorithm, but in this case, we found that choosing an appropriate batch size was important. In fact, we observed that if the SMOTE algorithm is used with too small a batch size, poor results can be obtained. This is especially true for the MLP, probably because MLPs can easily overfit and are therefore sensitive to small batches that may be dominated by noisy synthetic samples. Finally, the LTAE architecture was found to provide the best performance when using the SMOTE algorithm, although it required more hyperparameter tuning. Additional tests using different losses (e.g., margin or focal loss, label smoothing) or the ADASYN algorithm did not improve our results. Overall, it appears that DL methods can also learn patterns related to the minority classes if the model is sufficiently large; hence, they can be more robust to the imbalanced data problem. This is consistent with previous studies (conducted on other types of data) showing that careful tuning of models can be sufficient to achieve optimal accuracy in an imbalanced data context [49].
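As an alternative (or complement) to SMOTE, the class weights mentioned above can be derived from inverse class frequencies, e.g., to weight a cross-entropy loss toward minority species. The label distribution below is illustrative.

```python
import numpy as np

# Inverse-frequency class weights (scikit-learn's "balanced" heuristic):
# weight_c = n_samples / (n_classes * n_samples_in_c).
y = np.array([0] * 750 + [1] * 150 + [2] * 100)  # hypothetical oak-dominated labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
# -> the majority class gets a weight below 1, minority classes above 1
```

Unlike oversampling, this approach leaves the batch composition untouched, which sidesteps the small-batch sensitivity to synthetic samples noted above.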
Regarding the input data used for classification, we observed that the direct use of the raw S2 bands is sufficient. Similarly to [6], we observed that using 2 years of data instead of 1 can be useful to better characterize the forests and thus improve the classification results. It therefore seems that using many input features (740) is not necessarily problematic, as the classification algorithm can extract the relevant information from them. Using additional sensors and information, such as synthetic aperture radar (SAR) [12,50] or hyperspectral data [51], is an interesting perspective that could improve our results. Using other auxiliary information, such as temperature or precipitation, could also help improve classification results and potentially reduce the amount of satellite data needed.
We also tested other strategies to deal with missing data (related to clouds). Our results confirm that standard gap-filling (i.e., linear interpolation) is a strong baseline adapted to work at a large scale (11 S2 tiles) and in an operational context [6,23,52]. However, we observed that poor interpolation could impact the generalization of the results, especially for minority classes, since few training examples are available. In this respect, working on this point is an interesting perspective, as recent studies have shown that improvement is possible by working directly with raw (irregular and unaligned) SITS [53]. Our initial tests were not very conclusive: we tested the method developed in [53], but this led to inaccurate classification results. This approach, tested for land cover classification, proposes an end-to-end learning framework that uses attention-based interpolation to embed the raw SITS into a regular temporal grid. More tests should be carried out to investigate why such an approach failed in our case (imbalanced data, small dataset, classes with very similar behavior, etc.). We also performed experiments using the LTAE with raw S2 data and a cloud mask, as tested in [53]. While the theoretical results obtained with standard CV appear close to those obtained with our proposed framework, our large-scale test results, obtained by producing a map of the study area, showed important discontinuities at the border between two S2 tiles. This highlights the fact that such a framework tends to largely overfit the specific acquisitions of a given S2 tile, which confirms the results observed in [53].
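For a single pixel and band, the linear gap-filling baseline amounts to interpolating the cloud-masked reflectance series onto the common temporal grid. A minimal sketch with illustrative dates and reflectance values:

```python
import numpy as np

# Dates (days) on the common temporal grid; NaN marks cloudy acquisitions.
dates = np.array([0., 10., 20., 30., 40., 50.])
refl = np.array([0.10, np.nan, np.nan, 0.25, 0.30, np.nan])

# Linear interpolation over valid samples; np.interp clamps to the edge
# values outside the valid range (here, the trailing cloudy date).
valid = ~np.isnan(refl)
filled = np.interp(dates, dates[valid], refl[valid])
# filled = [0.10, 0.15, 0.20, 0.25, 0.30, 0.30]
```

When a long run of acquisitions is cloudy, this interpolation can deviate strongly from the true phenology, which is the failure mode affecting minority classes noted above.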
Other interesting perspectives could be explored, three of which are developed below. We observed that combining the different maps produced with the different models, e.g., by taking the mode of the predictions, could filter some noisy areas in some cases. However, our first tests show that the gain in classification accuracy is not always obvious; hence, further work is required to develop an optimal ensemble learning approach. Nevertheless, we observed that the three different DL architectures produced varying predictions in certain areas, which could serve as an indicator of mapping uncertainty. Interestingly, using different initializations within the same DL architecture yielded significantly more consistent results, providing little to no basis for uncertainty measurement. Finally, for operational mapping, it may be interesting to add more reference data to produce maps that are as accurate as possible. Using accessible tree species information from BD Forêt® V2 [10] could be an obvious way to increase the size of the training dataset. However, since this database is not up-to-date, adding such samples might be non-trivial. Working on mixed-species plots could be another way to add more training samples, the main problem being to automatically separate these plots into pure-class pixels.
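The map-combination idea can be sketched as a per-pixel majority vote over the model predictions, with inter-model disagreement serving as a rough uncertainty flag. The labels below are illustrative.

```python
import numpy as np

# One row of class predictions per model, one column per pixel.
preds = np.array([[0, 1, 2, 2],    # e.g., MLP
                  [0, 1, 1, 2],    # e.g., TempCNN
                  [0, 2, 2, 2]])   # e.g., LTAE

# Per-pixel mode of the predictions (majority vote).
vote = np.array([np.bincount(col).argmax() for col in preds.T])

# Fraction of models agreeing with the vote: a crude uncertainty indicator.
agreement = (preds == vote).mean(axis=0)
# vote = [0, 1, 2, 2]; agreement = [1.0, 0.667, 0.667, 1.0]
```

Pixels with low agreement (here, the middle two) are the areas where we observed that inter-architecture disagreement could flag mapping uncertainty.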
Finally, we were able to share our maps with local foresters. The main interest identified was the maps’ ability to locate minority classes, e.g., chestnuts, which are difficult to find in practice. This interaction was considered useful for producing maps adapted to the needs of local actors.
6. Conclusions
This paper investigates tree species classification in temperate forests based on multi-spectral remote sensing data in an imbalanced context and over a large area (110,000 km²). Our study area is located around the Centre-Val de Loire region of France, which is dominated by oak forests (75% of our training plots), and our training dataset is relatively small (less than 5000 plots). The proposed framework uses 2 years of Sentinel-2 (S2) data as input features (preprocessing is carried out to remove clouds and interpolate the different input time series on the same temporal grid).
Our results highlight that deep learning (DL) models can be used in that context with good accuracy, largely outperforming classical machine learning models such as the random forest algorithm, which is highly biased toward the majority class in our case. We tested three architectures based on different mechanisms, i.e., a fully connected network (MLP), a convolution-based network (1D convolutional network, TempCNN), and an attention-based network (LTAE). These implementations and configuration files have been made freely available and can be used as a baseline for other use cases (and potentially other applications).
Our results show that the three deep learning architectures tested provide similar results and largely outperform the RF algorithm. The MLP architecture was found to be a strong baseline (F1 score equal to 0.81) that can be used without extensive hyperparameter tuning and with low computation time (see Appendix A). While more complex to implement, the LTAE also provided good generalization capabilities and appears to be more efficient at retrieving minority classes when used with the SMOTE algorithm, an oversampling technique. Overall, we found that using the SMOTE algorithm with an appropriate batch size could slightly improve the generalization of the classification models.