1. Introduction
Detailed and accurate information on crop-type cultivation is essential for developing economically and ecologically sustainable agricultural strategies in a changing climate, and for satisfying human food demands [1]. Multi-temporal remote sensing (RS) images acquired throughout the growing season provide an effective method for acquiring crop cover information over large areas [1,2]. Multi-temporal images can be used to distinguish crop growth states and the phenological characteristics of crops. In addition, they provide enriched features that allow more complex and stable crop classification tasks. They have thus seen wide use in the field of agricultural RS [3,4].
Two main strategies are available for multi-temporal crop classification. The first is to stack multi-temporal images by time sequence and classify them with classifiers such as the support vector machine (SVM), random forest, and maximum likelihood [5,6]. However, this approach does not model temporal correlations and uses features independently, ignoring possible temporal dependencies [6,7]. Most classifiers such as SVM rely heavily on features that are not designed for time-series data, making it difficult to exploit any inherent time-series variability. In addition, the stacked images increase redundancy, and the dimensionality grows with the length of the time series (the curse of dimensionality), which negatively affects classification performance [6,8]. The second strategy is to derive new images from reflectance images by using spectral indices, such as the normalized difference vegetation index (NDVI) and the enhanced vegetation index (EVI), and then construct time-series data that reveal the temporal patterns of the different land covers. With this method, crops and other vegetation are classified with high accuracy. However, the classification results are strictly limited by the number of images in the time series: if the number is too small, the temporal pattern has little effect on classification performance [8]. In addition, manual feature engineering based on human experience and prior knowledge is essential with this approach, which increases the complexity of processing and computation [7,9]. Moreover, vegetation indices (VIs) built from specific spectral bands ignore the remaining bands, which in turn limits classification performance.
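The two indices named above are computed per pixel from standard band combinations; a minimal numpy sketch with the usual NDVI and EVI formulas (the reflectance values below are toy numbers, not data from this study):

```python
import numpy as np

def ndvi(nir, red):
    """Normalized difference vegetation index: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + 1e-10)  # epsilon avoids division by zero

def evi(nir, red, blue):
    """Enhanced vegetation index with the standard coefficients G=2.5, C1=6, C2=7.5, L=1."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)

# Toy surface-reflectance values for a single vegetated pixel.
nir, red, blue = 0.45, 0.08, 0.04
print(ndvi(nir, red), evi(nir, red, blue))
```

Stacking such per-date index images along the time axis yields the single-feature time series used by the first classification strategy.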
Current RS images are simultaneously multi-spectral, multi-temporal, and multi-spatial. In multi-temporal images, crops are represented via variations in temporal, spectral, and spatial features. These features can be comprehensively included in four-dimensional (4D: time, height, width, and band) data, which require classification models to learn and represent temporal, spectral, and spatial features. Multi-temporal images thus pose new challenges to the models used for data processing, so integrating multi-temporal images and continuously improving crop classification accuracy requires continued attention.
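The 4D layout described above determines how different model families consume the data; a short numpy sketch (all dimensions are illustrative, not the dimensions of this study's dataset):

```python
import numpy as np

# Hypothetical multi-temporal stack: T dates, H x W pixels, B spectral bands.
T, H, W, B = 10, 64, 64, 4
stack = np.zeros((T, H, W, B), dtype=np.float32)

# Pixel-wise models (1D-CNN, LSTM) see each pixel as a (T, B) sequence.
pixel_sequences = stack.transpose(1, 2, 0, 3).reshape(H * W, T, B)

# Patch-based spatio-temporal models (3D-CNN, ConvLSTM) keep the full
# (T, patch, patch, B) structure; here a single 5 x 5 patch is cut out.
patch = stack[:, 0:5, 0:5, :]

print(pixel_sequences.shape, patch.shape)
```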
Deep learning is a breakthrough technique in machine learning that outperforms traditional algorithms in feature extraction and representation [5,6,7], which has led to its application in numerous RS classification tasks [8,9,10]. Convolutional neural networks (CNNs) produce more accurate results than other models in most RS image classification problems [8,9,11]. The one-dimensional CNN (1D-CNN) model is commonly used to extract spectral features from hyperspectral images or temporal features from time-series images, providing an effective and efficient method for crop identification in time-series RS images [12]. The CNN learning process is computationally efficient and insensitive to data shifts such as image translation, allowing CNN models to recognize image patterns in two dimensions (2D) [13]. Three-dimensional (3D) CNN models exploit the spatial, temporal, and spectral information in multi-temporal images and are therefore widely used in multi-temporal crop classification [11,14]. Long short-term memory (LSTM), a variant of the recurrent neural network (RNN), is a natural candidate for representing temporal dependencies over various temporal ranges with gated recurrent connections [9,15]. LSTM models have been widely used for multi-temporal crop classification because they can analyze sequential data [9,16,17]. For multi-temporal crop classification, both CNNs and RNNs provide more accurate results than traditional machine learning classifiers [5,9,11]. However, different deep learning architectures differ in how they learn and represent the spectral, spatial, and temporal information of crops, and thus produce different multi-temporal crop classification results.
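The two pixel-wise families discussed above can be sketched in a few lines of Keras; a minimal illustration, where all layer widths, kernel sizes, and the class count are assumptions for demonstration only, not the configurations used in this paper:

```python
from tensorflow import keras
from tensorflow.keras import layers

T, B, N_CLASSES = 10, 4, 6  # time steps, bands, crop classes (illustrative)

# Pixel-wise 1D-CNN: convolutions slide along the temporal axis only.
cnn1d = keras.Sequential([
    layers.Input(shape=(T, B)),
    layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])

# Pixel-wise LSTM: gated recurrence models temporal dependencies explicitly.
lstm = keras.Sequential([
    layers.Input(shape=(T, B)),
    layers.LSTM(64),
    layers.Dense(N_CLASSES, activation="softmax"),
])

print(cnn1d.output_shape, lstm.output_shape)
```

Both models map a per-pixel (T, B) sequence to a class distribution; neither sees the pixel's spatial neighborhood.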
Convolutional LSTM (ConvLSTM) is a type of RNN in which the internal matrix multiplications are replaced by convolution operations [18]. By integrating the LSTM and CNN structures, ConvLSTM adapts well to multi-temporal images [19,20,21]. However, because CNNs and RNNs dominate the field and ConvLSTM requires higher-dimensional input data, the ConvLSTM model is less commonly used in multi-temporal crop classification. Its potential therefore deserves further exploration.
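The ConvLSTM idea, recurrence over time with convolutions inside each gate, can be sketched with Keras' `ConvLSTM2D` layer; a minimal illustration (filter count, kernel size, and class count are assumptions, not this paper's configuration):

```python
from tensorflow import keras
from tensorflow.keras import layers

T, P, B, N_CLASSES = 10, 5, 4, 6  # time steps, patch size, bands, classes (illustrative)

# ConvLSTM2D replaces the LSTM's matrix multiplications with 2D convolutions,
# so each time step is processed as an image patch rather than a flat vector.
convlstm = keras.Sequential([
    layers.Input(shape=(T, P, P, B)),
    layers.ConvLSTM2D(32, kernel_size=(3, 3), padding="same"),  # returns last hidden state
    layers.Flatten(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
print(convlstm.output_shape)
```

Because the hidden state keeps its spatial layout, the model learns temporal dynamics and spatial texture jointly, which is exactly why it needs the higher-dimensional (T, P, P, B) input noted above.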
To summarize, multi-temporal images pose a new challenge to classification models in terms of data processing and feature extraction, but also open new opportunities for using data-driven deep learning to classify RS images. In this work, we use multi-temporal Sentinel-2 RS images as input data, and analyze the advantages of using such data and the structural advantages of various deep learning models. This research investigates (1) the possibility of using multi-temporal images for more accurately classifying crops; (2) the contribution of spectral, temporal, and spatial information to multi-temporal crop classification; and (3) the potential and requirements of using deep learning for multi-temporal crop classification. We also (4) search for a feasible and suitable deep learning model that provides optimum classification accuracy from multi-temporal images. Although such deep learning models have long been used for RS applications, this work compares and analyzes multi-temporal crop classification based on the deep learning architectures of CNN, LSTM, and ConvLSTM.
4. Results
The accuracy of crop classification via multi-temporal images mainly depends on three factors: time-series data construction, feature extraction, and classification method. Our experiments verify the contributions of the time-series data and the deep learning models. Various time-series datasets are constructed following the strategy presented in Section 3.2 and fed into the deep learning architectures (Figure 4) of Section 3.3 for the different experiments. The classification results and accuracies are given in the subsequent sections.
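The overall accuracy (OA) and kappa coefficient reported throughout these sections are both derived from the confusion matrix; a minimal numpy sketch (the labels below are toy values, not results from the experiments):

```python
import numpy as np

def overall_accuracy_and_kappa(y_true, y_pred, n_classes):
    """Compute OA and Cohen's kappa from reference and predicted labels."""
    cm = np.zeros((n_classes, n_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    n = cm.sum()
    po = np.trace(cm) / n                      # observed agreement = OA
    pe = (cm.sum(0) * cm.sum(1)).sum() / n**2  # agreement expected by chance
    return po, (po - pe) / (1 - pe)            # kappa corrects OA for chance

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 1])
oa, kappa = overall_accuracy_and_kappa(y_true, y_pred, 3)
print(oa, kappa)
```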
4.1. Classification Based on VI Time Series
E1 and E2 in Figure 5 and Table 5 show the results of time-series crop classification based on NDVI and EVI. The classification accuracies produced by the 1D-CNN (Figure 4b) and LSTM (Figure 4a) models for E1 and E2 exceed 92%, and the kappa coefficients are greater than 0.9. The highest overall accuracy (OA) for E2 (LSTM) is close to 94%. Compared with random forest (RF), the 1D-CNN- and LSTM-based deep learning models achieve higher accuracy (Table 5) and better performance in local regions (Figure 5). These results show that the 1D-CNN and LSTM models constructed herein are suitable for VI-based multi-temporal crop classification. Compared with E1, the OA for E2 increases by 0.26% and 0.69% for the 1D-CNN and LSTM models, respectively, reflecting both the differences between VIs and the overall similarity of VI time series for crop classification. Compared with the 1D-CNN model, the LSTM model is more accurate: the OA improves by 0.75% and 1.18% for E1 and E2, respectively. These results show that both models can capture temporal features, although the LSTM model does so more accurately.
Differences in architecture also affect classification accuracy. Compared with the other results in Figure 5, the RF-based results (Figure 5a,e) are locally worse, while almost no salt-and-pepper noise appears in Figure 5c,h. Compared with E1 and E2, the accuracy of E7 and E8 improves by 0.82% to 2.24%, and their accuracy exceeds that of RF by 3.5%. E7 and E8, classified by the 2D-CNN model (Figure 4c), produce a favorable overall classification accuracy above 94.7% and a kappa coefficient of 0.934, which is attributed to the effective learning and representation of the temporal and spatial information in the patch-based time-series VI data by the 2D-CNN.
Figure 5 and Table 5 also show that the deep-learning-based classification results outperform those of random forest. However, the misclassification of crop types in Figure 5 indicates that further optimization is still needed. For a given model, there is no significant accuracy difference between E1 and E2, indicating that it is difficult to improve accuracy solely with time-series data (temporal features) constructed from a single VI. However, adding spatial information not only improves crop classification accuracy but also suppresses salt-and-pepper noise. In addition, the 1D-CNN and LSTM architectures limit the possibility of exploiting spatial information in multi-temporal crop classification, whereas the 2D-CNN model produces more accurate crop classifications from single-VI time-series data.
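One simple way to feed single-VI time series to a patch-based 2D-CNN, as in E7 and E8, is to treat the T acquisition dates as the channel axis of each spatial patch; a hedged Keras sketch (all sizes are illustrative assumptions, not the architecture of Figure 4c):

```python
from tensorflow import keras
from tensorflow.keras import layers

T, P, N_CLASSES = 10, 5, 6  # dates, patch size, crop classes (illustrative)

# Each sample is a P x P VI patch whose T dates act as input channels, so the
# 2D convolutions mix spatial context and temporal evolution in one pass.
cnn2d = keras.Sequential([
    layers.Input(shape=(P, P, T)),
    layers.Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    layers.Conv2D(64, kernel_size=(3, 3), padding="same", activation="relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
print(cnn2d.output_shape)
```

The spatial receptive field is what suppresses salt-and-pepper noise; the trade-off is that the temporal ordering is only encoded implicitly in the channel dimension.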
4.2. Classification Based on Multi-Spectral Time Series
Figure 6 and Table 6 show the classification results of E3–E6 based on the time-series data constructed from multi-spectral, multi-temporal images. For E3–E6, the crop classification accuracy of the 1D-CNN model is slightly lower than that of the LSTM model, consistent with the results in Section 4.1. Therefore, hereinafter, we consider only the LSTM-based crop classification results.
The input data in E3–E6 have both multi-spectral and multi-temporal features, differing only in the number of spectral bands, as explained in Section 3.4. Table 6 shows that the accuracy of the RF-based classification is lower than that of deep learning, and Figure 6 also shows that the deep learning results are better in local areas. The OAs of E3–E6 are 95.31%, 96.72%, 96.37%, and 96.94%, respectively. Compared with E3, the addition of spectral bands, especially the red-edge bands (E5) or SWIR bands (E4), improves crop classification accuracy, with the SWIR bands contributing slightly more than the red-edge bands. The LSTM model applied to E6 remains the most accurate configuration, with the crop classification accuracy improving by 1.63% with respect to E3. This indicates that the advantage of the larger number of spectral bands in multi-spectral images cannot be neglected. With the addition of spectral bands, the salt-and-pepper noise is reduced to varying degrees, with the least noise coinciding with the most accurate crop classification (Figure 6f); this indicates that multi-spectral bands weaken the salt-and-pepper phenomenon but hardly eliminate it. Combined with the results presented in Section 4.1, these findings further demonstrate how spatial information affects multi-temporal crop classification.
Furthermore, the addition of different spectral bands in E3–E6 increases the diversity of the input classification data. Within the same experimental group, the accuracy difference between the 1D-CNN and LSTM models varies from 0.1% to 0.44%, with the minimum difference of 0.1% occurring in E6. However, across the different experimental groups, the accuracy difference for the same model varies from 1.06% to 1.95%, with E6 showing an accuracy improvement of nearly 2% over E3. Between E9 and E3, the spatial information causes the differences in the input data. The accuracy difference between different deep learning models with the same input data is small, ranging from 0.21% to 0.42%; in contrast, the accuracy difference for the same model with different input data is larger, ranging from 1.25% to 1.88%. This indicates that increasing the diversity of the input data matters more for improving crop classification accuracy than the choice of deep learning model.
Figure 7 and Table 6 present the classification results of E9 and E10 using the 3D-CNN (Figure 4d) and ConvLSTM2D (Figure 4e) models. The OAs of the 3D-CNN model for E9 and E10 are 96.56% and 96.77%, respectively, with kappa coefficients of 0.957 and 0.960. The OAs of the ConvLSTM2D model for E9 and E10 are 97.25% and 97.43%, respectively, with kappa coefficients of 0.966 and 0.968. The ConvLSTM2D model is thus slightly more accurate than the 3D-CNN model. The ConvLSTM2D model applied to E10 produces the greatest crop classification accuracy of 97.43%, which translates into an OA improved by 3.69%, 2.67%, 0.49%, and 4.93% with respect to E2 (LSTM), E8 (2D-CNN), E6 (LSTM), and E1 (1D-CNN), respectively. Compared with E6 (LSTM), the salt-and-pepper noise is eliminated in E9 and E10 (Figure 7b,d), although the improvement in accuracy is modest. E10 produces more accurate results than E9 because its input data contain more spectral bands.
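The 3D-CNN consumes the same 4D patches as ConvLSTM2D but convolves over time and space simultaneously; a hedged Keras sketch (layer sizes and class count are illustrative assumptions, not the hyperparameters of Figure 4d):

```python
from tensorflow import keras
from tensorflow.keras import layers

T, P, B, N_CLASSES = 10, 5, 10, 6  # time, patch size, bands, classes (illustrative)

# 3D convolutions slide over the temporal axis and both spatial axes at once,
# so the model learns joint spatio-temporal filters directly from 4D patches.
cnn3d = keras.Sequential([
    layers.Input(shape=(T, P, P, B)),
    layers.Conv3D(32, kernel_size=(3, 3, 3), padding="same", activation="relu"),
    layers.GlobalAveragePooling3D(),
    layers.Dense(N_CLASSES, activation="softmax"),
])
print(cnn3d.output_shape)
```

The design difference from ConvLSTM2D is the temporal mechanism: fixed-size 3D kernels versus gated recurrence, which is one plausible reason the two models behave slightly differently on the same input.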
The classification results of the different experiments verify the feasibility of the models constructed herein (Figure 4) for multi-temporal crop classification. Comparing the results of the different experiments shows that both the construction of the time-series data and the choice of classification model influence crop classification accuracy. The LSTM model produces more accurate crop classification results than the 1D-CNN model. However, when using time-series data constructed from VIs, the 2D-CNN model produces more accurate results than the 1D-CNN and LSTM models and eliminates the salt-and-pepper noise. When using time-series data constructed by stacking spectral bands, increasing the number of bands in the input data improves crop classification accuracy while somewhat reducing the salt-and-pepper noise. Additionally, the LSTM model again produces slightly more accurate crop classifications than the 1D-CNN model, indicating that the LSTM model is better able to capture temporal features.
E10 processed by the 3D-CNN and ConvLSTM2D models (Figure 4) produces the most accurate crop classification of all the experiments. In addition, the architectures of the 3D-CNN and ConvLSTM2D models enable better learning and representation of multi-temporal crop features, making these models more suitable for crop classification from multi-temporal images.
Combined with the previous analysis of classification accuracy, VI time-series data that use only temporal information improve the crop classification accuracy only slightly. Adding multi-spectral data on top of the temporal information improves crop classification accuracy, and the salt-and-pepper noise is more easily alleviated as the number of spectral bands increases. As the number of input features increases, the contribution of spatial information to improving classification accuracy decreases. However, the elimination of salt-and-pepper noise through the use of spatial information remains a clear advantage in crop mapping. Therefore, making full use of the temporal, spectral, and spatial information is the more feasible strategy for multi-temporal crop classification, and a deep learning architecture fed with the 4D data of multi-temporal images is thus the best choice for accurate crop classification.
6. Conclusions
This paper constructs various time-series datasets based on Sentinel-2 multi-temporal images by VI or spectral stacking, and develops deep learning models with different structures for classifying crops from multi-temporal images. The results lead to the following conclusions:
- (1)
Greater data diversity (temporal, spectral, and spatial information) effectively improves crop classification accuracy. Temporal features alone provide only a limited improvement in the accuracy of crop classification from multi-temporal images. As more spectral information is added, the accuracy improves further and the impact of salt-and-pepper noise is alleviated. The inclusion of spatial information can eliminate salt-and-pepper noise, although its contribution to accuracy decreases as the number of input features increases.
- (2)
The various deep learning models have limitations in crop classification from multi-temporal images. The 1D-CNN and LSTM models integrate temporal and spectral features but cannot extract spatial features. The 2D-CNN is suitable for classifying time-series data built from a single feature such as a VI or band, because it is difficult to exploit the multi-spectral advantages while combining temporal and spatial information. The 3D-CNN and ConvLSTM2D models classify crops most accurately and are more suitable for multi-temporal crop classification than the other deep learning models.
- (3)
The deep learning models based on 3D-CNN and ConvLSTM2D, which integrate temporal, spectral, and spatial information, are the most accurate for multi-temporal crop classification. In addition, the advantages of incorporating both RNN and CNN structures, together with its more flexible architecture, mean that ConvLSTM deserves further investigation.
In this paper, relatively small areas and simple crop types are used to study the application of deep learning to multi-temporal crop classification. Future research should address deep-learning-based crop classification for large-scale study areas and complex planting systems, such as crop rotation and more crop types. In addition, the impact of clouds on optical image acquisition is difficult to avoid, whereas synthetic aperture radar (SAR) acquisition is unaffected by clouds and can further increase the diversity of the classification data. Therefore, research into crop classification by synergistic use of SAR and optical images with different acquisition frequencies will be carried out. Additionally, the ConvLSTM model will be used as the classification model to explore its potential for multi-source image crop classification.