1. Introduction
Tree species identification is necessary for many applications related to forest monitoring. For example, species information is used in combination with allometric equations for carbon estimation, since carbon growth and storage are species-dependent [1,2,3]. The proportion of tree species and combinations of species is also used as an indicator of biodiversity and forest resilience [4,5]. In addition, climate change is increasingly affecting forests, with large-scale abiotic (fire, drought, and windthrow) and biotic (insects and pathogens) disturbances impacting species composition. Hence, there is an urgent need for updated tree species maps, e.g., for accurate analysis of tree health [6] or to support reforestation decisions [7,8,9]. Unfortunately, such information is not always available, accurate, or updated at a fine scale and for large areas. Standard ground-based inventories are time-consuming and cannot be performed on a yearly basis. For instance, in France, the last open dataset provided by IGN (Institut national de l’information géographique et forestière) is outdated, since it was produced between 2007 and 2018 (BD Forêt® V2 [10]). In addition, National Forest Inventory (NFI) plots are not freely available and do not cover forests continuously.
The use of remote sensing data has been identified as a very efficient way to map tree species in a timely manner, over large areas, and with fine spatial resolution [8,11]. Sentinel-2 (S2) satellites are increasingly used for such applications [12,13,14]. Indeed, they provide multispectral images worldwide with fine spatial resolution (up to 10 m) and a low revisit time (∼5 days in Europe) [15]. This high spatio-temporal resolution can be used to monitor changes in vegetation cover in a timely manner, which is important for capturing the specific phenology of each tree species (or, more generally, of other types of vegetation) [16]. Some studies have shown the potential interest of using additional data, such as Sentinel-1 [12], but this is not the focus of our analysis, mainly because Sentinel-1 data entail much higher computational and storage costs from an operational perspective.
Standard frameworks for mapping tree species with in situ data often rely on the random forest (RF) algorithm [17]. For example, ref. [11] observed in their review that RF is one of the most widely used algorithms for such tasks, along with other standard machine learning methods such as the support vector machine (SVM). A similar observation was made in [18], and in more recent studies the RF algorithm is still widely used, e.g., [12,19,20,21]. The popularity of RF with remote sensing data can be explained by several advantages: it is less prone to overfitting, requires less training data, and is faster to train than deep learning (DL) methods. Moreover, the RF algorithm is more easily interpretable than other algorithms, which can be valuable depending on the task at hand (see, for example, [6] in the case of forest health detection).
Despite its advantages, the RF algorithm is known to be affected by imbalanced data [22], a common problem in tree species mapping since the different species are not naturally equally distributed. Moreover, RF can be outperformed by deep learning approaches when classifying remote sensing data at a large scale [23]. In line with these points, a recent review has highlighted the shift towards deep learning models for tree species classification [24]. However, the methods reviewed focus on patch-level approaches, as in [25], where the patches are 400 × 400 S2 pixels in size. In many cases, obtaining ground data that correspond to very large patches can be difficult (in the review in [24], patch sizes range from 64 × 64 to 500 × 500 pixels). For instance, in our use case, we had field plots of various sizes, many of them only 3 × 3 S2 pixels. In addition, working with patches is more computationally intensive than working with single-pixel time series. The results obtained in [26,27] have shown that pixel-level approaches can outperform patch-level approaches for tree species classification with S2 images. This is particularly important in mixed-species forests, such as those in our study area, where large patches can contain different tree species.
Therefore, in this paper, we propose to explore pixel-level deep learning strategies adapted to work with time series [28] and compare their results with those obtained with the RF algorithm, which is still used as a standard classifier in many recent studies. Our use case focuses on a relatively small dataset, ∼4400 reference plots, which is nevertheless larger than the datasets used in many studies, such as [26]. Moreover, the study area is the Centre-Val de Loire region of France and its surroundings, which corresponds to a large area (11 S2 tiles, 110,000 km²) with a wide variety of tree species and silvicultural practices. We also validated our results on an independent validation dataset for four key species (oak, pine, beech, and chestnut), mainly to analyze the generalization of the mappings to unseen areas and minority species. Complementing previous studies such as [26], which have shown that DL methods at the pixel level can yield interesting results, some key steps are analyzed in more detail in our study, in particular the imbalanced data problem and the hyperparameter choices of the different algorithms. Finally, the DL models implemented with the open-source iota2 (version 20241121) Python library [29], a processing chain for the operational production of land cover maps from remotely sensed image time series, have been made available at https://framagit.org/fl.mouret/tree_species_classification_iota2 (accessed on 20 March 2025). The resulting maps from our analysis are also available in [30]. In this respect, the proposed framework can easily be applied to other case studies.
5. Discussion and Perspectives
The fact that the RF algorithm can be affected by imbalanced data has already been identified in the literature [12,22,42]. In [43], in a land cover classification context, the authors highlighted that RF tends to predict the majority class, which is consistent with our observations. In our case, we found that using RF with SMOTE or class weights failed to handle this issue. As observed in Section 4.3, undersampling the oak plots improves the results of the RF algorithm, but the F1 score we obtained is still much lower than those obtained with DL methods. For example, this strategy was used in [12] to map tree species in Germany, where the authors found that mapping less common species was still challenging, confirming the results we obtained with the RF algorithm. We also tested a variant of the RF algorithm, namely the balanced RF [42] implemented in the Python toolbox imbalanced-learn [44]. However, this variant led to very poor results. As a key takeaway, we recommend that practitioners consider these potential issues in future work, especially since our results highlight the potential benefit of using DL methods, which we found to generalize better, especially for less frequent classes. Other classification methods, such as sparse Gaussian processes [23], the support vector machine (SVM) algorithm [45], or tree-boosting algorithms (XGBoost) [46], were also tested without providing competitive results. In particular, we also tested other DL architectures (e.g., ResNet, 2D convolutional neural network, recurrent neural network, etc.) without improving our classification results (see Appendix A). This suggests that most DL architectures can achieve similar accuracy on our task.
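The two RF-based mitigation strategies discussed above (cost-sensitive class weights and undersampling of the dominant class) can be sketched as follows. This is a minimal illustration on synthetic data; the class proportions, undersampling ratio, and all parameter values are hypothetical and not those of our experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data standing in for pixel time-series features
# (class 0 plays the role of the dominant oak class).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10,
                           n_classes=3, weights=[0.8, 0.15, 0.05],
                           random_state=0)

# Strategy 1: cost-sensitive RF via inverse-frequency class weights.
rf_weighted = RandomForestClassifier(class_weight="balanced",
                                     random_state=0).fit(X, y)

# Strategy 2: undersample the majority class before fitting.
rng = np.random.default_rng(0)
majority = np.flatnonzero(y == 0)
kept = rng.choice(majority, size=len(majority) // 4, replace=False)  # keep 25%
idx = np.concatenate([kept, np.flatnonzero(y != 0)])
rf_under = RandomForestClassifier(random_state=0).fit(X[idx], y[idx])
```

Neither strategy was sufficient in our experiments, but both are cheap baselines worth checking before moving to DL models.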
Our results highlight the potential benefits of using DL models, and some simple guidelines can be followed to achieve good accuracy. Specifically, choosing a sufficiently large model (e.g., in our case, at least 200k parameters) and applying batch normalization and class weights is a good start. Our empirical analysis shows that MLPs with adequate capacity can serve as a robust baseline, requiring less extensive hyperparameter exploration than more complex architectures such as LTAE. In light of these recommendations, a review of the literature revealed a tendency in some studies to use DL models with too few parameters. For example, ref. [47] implemented an MLP with only one hidden layer of 100 neurons for tree species classification, which led to poor results. Similarly, ref. [48] used a single layer with 18 hidden units. In both cases, the small number of neurons could explain the poor performance obtained with DL methods.
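The model-size guideline can be made concrete by counting the parameters of a fully connected network directly. The layer widths and the class count below are illustrative, not the exact configuration used in this study.

```python
# Parameter count of a fully connected network: each layer contributes
# n_in * n_out weights plus n_out biases.
def mlp_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_features, n_classes = 740, 10  # hypothetical: 740 S2 features, 10 species

# Single hidden layer of 100 neurons, as in the under-sized example above:
small = mlp_params([n_features, 100, n_classes])   # 75,110 parameters

# A wider/deeper MLP clearing the ~200k-parameter guideline:
large = mlp_params([n_features, 256, 256, 256, n_classes])  # 323,850 parameters
```

Such a quick check helps rule out an under-capacity model before attributing poor results to the DL approach itself.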
Regarding the method used to address data imbalance, slight improvements and better generalization can be achieved by oversampling with the SMOTE algorithm, but in this case, we found that choosing an appropriate batch size was important. In fact, we observed that if the SMOTE algorithm is used with too small a batch size, poor results can be obtained. This is especially true for the MLP, probably because MLPs can easily overfit and are therefore sensitive to small batches that may be dominated by noisy synthetic samples. Finally, the LTAE architecture was found to provide the best performance when using the SMOTE algorithm, although it required more hyperparameter tuning. Additional tests using different losses (e.g., margin or focal loss, label smoothing) or the ADASYN algorithm did not improve our results. Overall, it appears that DL methods can also learn patterns related to the minority classes if the model is sufficiently large; hence, they can be more robust to the imbalanced data problem. This is consistent with previous studies (conducted on other types of data) showing that careful tuning of models can be sufficient to achieve optimal accuracy in an imbalanced data context [49].
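As an alternative (or complement) to SMOTE, the class weights mentioned above can be derived from inverse class frequencies, e.g., to weight a cross-entropy loss toward minority species. The label distribution below is illustrative.

```python
import numpy as np

# Inverse-frequency class weights (scikit-learn's "balanced" heuristic):
# weight_c = n_samples / (n_classes * n_samples_in_c).
y = np.array([0] * 750 + [1] * 150 + [2] * 100)  # hypothetical oak-dominated labels
classes, counts = np.unique(y, return_counts=True)
weights = len(y) / (len(classes) * counts)
# -> the majority class gets a weight below 1, minority classes above 1
```

Unlike oversampling, this approach leaves the batch composition untouched, which sidesteps the small-batch sensitivity to synthetic samples noted above.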
Regarding the input data used for classification, we observed that the direct use of the raw S2 bands is sufficient. Similarly to [6], we observed that using 2 years of data instead of 1 can be useful to better characterize the forests and thus improve the classification results. It therefore seems that using many input features (740) is not necessarily problematic, as the classification algorithm can extract the relevant information from them. Using additional sensors and information, such as synthetic aperture radar (SAR) [12,50] or hyperspectral data [51], is an interesting perspective that could improve our results. Using other auxiliary information, such as temperature or precipitation, could also help improve classification results and potentially reduce the amount of satellite data needed.
We also tested other strategies to deal with missing data (related to clouds). Our results confirm that standard gap-filling (i.e., linear interpolation) is a strong baseline adapted to work at a large scale (11 S2 tiles) and in an operational context [6,23,52]. However, we observed that poor interpolation could impact the generalization of the results, especially for minority classes, since few training examples are available. In this respect, working on this point is an interesting perspective, as recent studies have shown that improvement is possible by working directly with raw (irregular and unaligned) SITS [53]. Our initial tests were not very conclusive: we tested the method developed in [53], but this led to inaccurate classification results. This approach, tested for land cover classification, proposes an end-to-end learning framework that uses attention-based interpolation to embed the raw SITS into a regular temporal grid. More tests should be carried out to investigate why such an approach failed in our case (imbalanced data, small dataset, classes with very similar behavior, etc.). We also performed experiments using the LTAE with raw S2 data and a cloud mask, as tested in [53]. While the theoretical results obtained with standard CV appear close to those obtained with our proposed framework, our large-scale test results, obtained by producing a map of the study area, showed important discontinuities at the border between two S2 tiles. This highlights the fact that such a framework tends to largely overfit the specific acquisitions of a given S2 tile, which confirms the results observed in [53].
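For a single pixel and band, the linear gap-filling baseline amounts to interpolating the cloud-masked reflectance series onto the common temporal grid. A minimal sketch with illustrative dates and reflectance values:

```python
import numpy as np

# Dates (days) on the common temporal grid; NaN marks cloudy acquisitions.
dates = np.array([0., 10., 20., 30., 40., 50.])
refl = np.array([0.10, np.nan, np.nan, 0.25, 0.30, np.nan])

# Linear interpolation over valid samples; np.interp clamps to the edge
# values outside the valid range (here, the trailing cloudy date).
valid = ~np.isnan(refl)
filled = np.interp(dates, dates[valid], refl[valid])
# filled = [0.10, 0.15, 0.20, 0.25, 0.30, 0.30]
```

When a long run of acquisitions is cloudy, this interpolation can deviate strongly from the true phenology, which is the failure mode affecting minority classes noted above.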
Other interesting perspectives could be explored, three of which are developed below. We observed that combining the different maps produced with the different models, e.g., by taking the mode of the predictions, could filter some noisy areas in some cases. However, our first tests show that the gain in classification accuracy is not always obvious; hence, further work is required to develop an optimal ensemble learning approach. Nevertheless, we observed that the three different DL architectures produced varying predictions in certain areas, which could serve as an indicator of mapping uncertainty. Interestingly, using different initializations within the same DL architecture yielded significantly more consistent results, providing little to no basis for uncertainty measurement. Finally, for operational mapping, it may be interesting to add more reference data to produce maps that are as accurate as possible. Using accessible tree species information from BD Forêt® V2 [10] could be an obvious way to increase the size of the training dataset. However, since this database is not up-to-date, adding such samples might be non-trivial. Working on mixed-species plots could be another way to add more training samples, the main problem being to automatically separate these plots into pure-class pixels.
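The map-combination idea can be sketched as a per-pixel majority vote over the model predictions, with inter-model disagreement serving as a rough uncertainty flag. The labels below are illustrative.

```python
import numpy as np

# One row of class predictions per model, one column per pixel.
preds = np.array([[0, 1, 2, 2],    # e.g., MLP
                  [0, 1, 1, 2],    # e.g., TempCNN
                  [0, 2, 2, 2]])   # e.g., LTAE

# Per-pixel mode of the predictions (majority vote).
vote = np.array([np.bincount(col).argmax() for col in preds.T])

# Fraction of models agreeing with the vote: a crude uncertainty indicator.
agreement = (preds == vote).mean(axis=0)
# vote = [0, 1, 2, 2]; agreement = [1.0, 0.667, 0.667, 1.0]
```

Pixels with low agreement (here, the middle two) are the areas where we observed that inter-architecture disagreement could flag mapping uncertainty.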
Finally, we were able to share our maps with local foresters. The main interest identified was the maps’ ability to locate minority classes, e.g., chestnuts, which are difficult to find in practice. This interaction was considered useful for producing maps adapted to the needs of local actors.
6. Conclusions
This paper investigates tree species classification in temperate forests based on multi-spectral remote sensing data in an imbalanced context and over a large area (110,000 km²). Our study area is located around the Centre-Val de Loire region of France, which is dominated by oak forests (75% of our training plots), and our training dataset is relatively small (less than 5000 plots). The proposed framework uses 2 years of Sentinel-2 (S2) data as input features (preprocessing is carried out to remove clouds and interpolate the different input time series on the same temporal grid).
Our results highlight that deep learning (DL) models can be used in that context with good accuracy, largely outperforming classical machine learning models such as the random forest algorithm, which is highly biased toward the majority class in our case. We tested three architectures based on different mechanisms, i.e., a fully connected network (MLP), a convolution-based network (1D convolutional network, TempCNN), and an attention-based network (LTAE). These implementations and configuration files have been made freely available and can be used as a baseline for other use cases (and potentially other applications).
Our results show that the three deep learning architectures tested provide similar results and largely outperform the RF algorithm. The MLP architecture was found to be a strong baseline (F1 score equal to 0.81) that can be used without extensive hyperparameter tuning and with low computation time (see Appendix A). While more complex to implement, the LTAE also provided good generalization capabilities and appears to be more efficient at retrieving minority classes when used with the SMOTE algorithm, an oversampling technique. Overall, we found that using the SMOTE algorithm with an appropriate batch size could slightly improve the generalization of the classification models.