1. Introduction
Considering the prospect of human population growth, which is expected to reach 8.7 billion by 2030, the food supply system is subject to escalating pressure [1,2]. Additionally, climate change effects and catastrophic natural disasters (e.g., droughts and floods) are already hampering agricultural production and threatening food security from local to global scales [3,4]. Accordingly, it is vital to obtain reliable information about the location, extent, type, health, and yield of crops to ensure food security, poverty reduction, and water resource management [5]. It is also appealing to incorporate efficient approaches that support sustainability and climate change adaptation [6,7]. Thus, it is crucial to employ efficient approaches, such as advanced machine learning along with remote sensing (RS) data, to derive high-quality information about crops in order to achieve these goals [8].
RS has long been recognized as a trustworthy approach for extracting specialized information about agricultural products [9,10,11,12], owing to the frequent, broad-scale, and spatially consistent data acquisition of RS systems. In particular, RS allows timely monitoring of croplands to extract information concerning crop phenological status [13,14], health [15], type [16,17,18], and yield [19,20] over small- to large-scale areas, based on different characteristics of satellite images (e.g., spatial, temporal, and spectral resolutions). These practices have been performed using different sources of RS data, including multi-spectral [21,22], synthetic aperture radar (SAR) [23,24], light detection and ranging (LiDAR) [25,26], hyperspectral [27,28], thermal [29,30], and digital elevation model (DEM) [31,32] data.
Along with the advancement of RS, image processing techniques and machine learning algorithms have also been significantly promoted [33,34,35]. Accordingly, machine learning algorithms offer the potential to exploit the information content of RS data through automated frameworks [36,37]. In this regard, many scholars have incorporated RS data and machine learning algorithms for crop mapping and monitoring. For instance, Zhang et al. [38] implemented a random forest (RF) algorithm to classify croplands in China and Canada. To this end, textural features and vegetation indices were extracted from RapidEye images and added to the spectral bands. The results revealed that integrating spectral, textural, and vegetation index features could considerably enhance the classification results. Additionally, Mandal et al. [39] employed a support vector machine (SVM) classifier along with time-series RADARSAT-2 C-band quad-pol data to discriminate different crops in Vijayawada, India. Temporal signatures of backscattering intensities were initially derived, and then informative features were selected by adopting kernel principal component analysis (PCA). It was reported that selecting discriminative temporal features improved the classification results by 7%. Moreover, Maponya et al. [40] evaluated the potential of five machine learning algorithms, including SVM, RF, decision tree, K-nearest neighbour, and maximum likelihood, for crop mapping using multi-temporal Sentinel-2 images. Four different scenarios were considered for the classification tasks, and the results indicated the superiority of RF and SVM over the other classifiers, especially when a subset of hand-selected (i.e., knowledge-based) images was utilized. Furthermore, Saini and Ghosh [41] identified the major crop types in Roorkee, India, using four different machine learning algorithms and Sentinel-2 images. It was observed that extreme gradient boosting (XGBOOST) outperformed the other classifiers with an overall accuracy of 87%.
Currently, deep learning algorithms are recognized as a breakthrough approach for processing RS data [35,42]. In particular, classification studies using RS data greatly benefit from deep learning approaches because of their flexibility in feature representation, automation through end-to-end procedures, and automatic feature extraction [43,44,45]. In this regard, different deep learning models (i.e., different structures and networks) have been employed for crop type mapping and monitoring [46,47,48,49,50,51]. For example, Zhao, Duan, Liu, Sun, and Reymondin [51] compared five deep learning models for crop mapping based on dense time-series Sentinel-2 images. Their results suggested the high capabilities of one-dimensional convolutional neural network (1D-CNN), long short-term memory CNN (LSTM-CNN), and gated recurrent unit CNN (GRU-CNN) models for crop mapping. Furthermore, Ji, Zhang, Xu, Shi, and Duan [47] developed a three-dimensional CNN (3D-CNN) model to automatically classify crops using spatio-temporal RS images. The proposed network was enhanced using an active learning strategy to increase the labelling accuracy. The results were compared to a two-dimensional CNN (2D-CNN) classifier, suggesting higher efficiency and accuracy of their proposed approach.
Similar to other countries, RS data have widely been utilized for crop type mapping in Iran. For instance, Akbari et al. [52] implemented the particle swarm optimization algorithm to select informative features from time-series Sentinel-2 images for crop mapping in Ghale-Nou, Tehran, Iran. The selected features were ingested into an RF classifier, and the results showed the high potential of the proposed method for classifying heterogeneous crop fields. Asgarian et al. [53] also investigated the potential of time-series Landsat-8 images for crop mapping in the fragmented and heterogeneous landscapes of Najaf-Abad, Iran. To this end, long-term in-situ phenological information was combined with satellite images to map annual crop types using decision tree and SVM classifiers. Furthermore, Saadat et al. [54] employed time-series Sentinel-1 data to map rice in the northern part of Iran. To this end, Gamma Nought, Sigma Nought, and Beta Nought features of Sentinel-1 images were used in an RF classifier under three scenarios. Their results indicated the superiority of Sigma Nought and Gamma Nought Sentinel-1 data in vertical transmit and horizontal receive (VH) polarization.
Although many crop mapping frameworks have been proposed by various researchers, they generally have one of the following disadvantages:
- (I) Most crop mapping studies have focused on conventional machine learning methods (e.g., RF and SVM). These algorithms do not usually provide the highest possible accuracies due to several factors, such as climatic conditions and fluctuations in planting times.
- (II) Many studies have only used spectral-temporal information for crop mapping. However, spatial information should also be included in the classification algorithm to produce highly accurate maps.
- (III) Many state-of-the-art deep learning methods for crop mapping have only used 2D/3D convolution blocks for extracting deep features. Not all of these extracted deep features are informative for crop mapping, and some provide redundant information. In this regard, attention blocks should be implemented to select the most informative features.
The Iranian crop system is under escalating pressure, mainly due to the severe water crisis and population growth [55]. Additionally, climate change and the current dramatic drought conditions in Iran exacerbate the existing pressure [56]. Furthermore, the current economic and political sanctions have become a notable issue that amplifies this pressure in Iran [57,58]. Consequently, the incorporation of advanced technologies, such as remote sensing and machine/deep learning algorithms, is required to support efficient agricultural practices in Iran. Considering the importance of crop mapping in Iran, a novel deep learning algorithm was developed in this study for accurate crop classification. The classification model has three main steps: (1) data preparation, (2) deep feature extraction based on multi-scale residual kernel convolution and optimization of the CNN’s parameters, and (3) crop type mapping based on the optimized model. The key contributions of this research are as follows:
- (I) Proposing a novel framework for mapping crop types based on a two-stream CNN with a DAM.
- (II) Introducing novel spatial and spectral attention mechanisms (AMs) to extract informative deep features for crop mapping.
- (III) Utilizing multi-scale and residual blocks to increase the accuracy of the proposed network.
- (IV) Evaluating the sensitivity of the proposed method during the growing season of crops based on a time-series normalized difference vegetation index (NDVI).
- (V) Evaluating the performance of commonly used machine learning and deep learning methods for crop type mapping.
3. Method
The general framework of crop type classification based on the proposed method is illustrated in Figure 3. The proposed classification framework was implemented in three main steps: (1) data preparation and normalized difference vegetation index (NDVI) calculation, (2) model training and parameter tuning, and (3) prediction and accuracy assessment. The details of each step are discussed in the following subsections.
3.1. Data Preparation and Time-Series NDVI Calculation
Sentinel-2 datasets require several preprocessing steps, such as cloud masking and atmospheric correction. In this regard, we selected only non-cloudy images for the analysis. Moreover, the atmospheric correction was implemented using the Sen2cor module [68], which is available in the SNAP software.
Spectral feature extraction is the most common step in RS classification tasks [61]. Feature extraction can be conducted in two main categories: (1) combining spectral bands using simple mathematical operations, such as spectral indices like NDVI [69,70]; and (2) deriving higher-order statistical features (i.e., covariance and correlation), such as PCA [71] and factor analysis (FA). Among the different spectral indices, NDVI was selected due to its simplicity and its high applicability for crop mapping [72,73,74,75]. NDVI was computed based on the red (0.665 µm) and near-infrared (NIR) (0.842 µm) bands (see Equation (1)).
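Equation (1) is the standard NDVI ratio of the NIR and red bands. A minimal sketch of the computation follows; the small `eps` guard against division by zero is an added safeguard, not part of the equation itself.

```python
import numpy as np

def ndvi(red, nir, eps=1e-8):
    """NDVI = (NIR - Red) / (NIR + Red), clipped to the valid [-1, 1] range.

    `red` and `nir` are reflectance arrays (Sentinel-2 bands B4 and B8);
    `eps` avoids division by zero over no-data pixels.
    """
    red = np.asarray(red, dtype=np.float64)
    nir = np.asarray(nir, dtype=np.float64)
    return np.clip((nir - red) / (nir + red + eps), -1.0, 1.0)
```

Applying this per acquisition date and stacking the results yields the time-series NDVI cube used in the following steps.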
Crops are dynamic in nature because of their growth over time. Thus, employing time-series datasets is an effective and pertinent solution for mapping crops [76,77]. Consequently, time-series NDVI features were utilized in this study for crop type classification.
3.2. Proposed Deep Learning Architecture
This study proposed a new dual-stream CNN architecture with both spectral and spatial attention blocks. According to the architecture presented in Figure 4, the proposed method received input patches of 11 × 11 × 13, and the patches were then fed into two separate streams for deep feature extraction.
The first stream explored deep features based on multi-scale residual convolution blocks and spectral attention blocks. This stream focused on deep spectral feature extraction based on the spectral AM. In this regard, a shallow multi-layer feature extractor, max-pooling layers, spectral attention blocks, and multi-scale residual blocks were employed. First, shallow features were extracted via a multi-scale convolution block. Then, the spectral attention block was employed to investigate the inter-channel relationship of the feature maps. Subsequently, a max-pooling layer was applied to reduce the size of the generated feature maps. A multi-scale residual block was then employed to extract more meaningful features, again followed by a spectral attention block and max-pooling. Finally, the extracted deep features were transferred to the last multi-scale residual and spectral attention blocks to generate high-level deep features.
The second stream investigated deep features while concentrating on deep spatial features using spatial attention blocks. Similarly, this stream had one multi-scale convolution block and three multi-scale residual blocks. Moreover, after the convolution block layers, the spatial attention block and max-pooling layers were employed.
After deep feature extraction based on multi-scale residual blocks and attention blocks, the deep features were flattened using a flattening layer. Then, they were fed to a dense layer, and the decision was made via a soft-max layer.
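The dual-stream layout described above can be sketched as follows in PyTorch. This is an illustration only: the channel widths, the number of classes (`classes=6`), and the simplified blocks (plain multi-scale convolutions in place of the paper's residual and attention blocks) are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MultiScaleBlock(nn.Module):
    """Simplified multi-scale convolution: parallel 3x3 and 5x5 branches.
    The paper's residual connections and attention blocks are omitted here."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, out_ch // 2, kernel_size=5, padding=2)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(torch.cat([self.branch3(x), self.branch5(x)], dim=1))

class DualStreamCNN(nn.Module):
    """Two parallel streams over 11x11x13 NDVI patches, fused and flattened
    before a dense layer; softmax is applied implicitly by the CE loss."""
    def __init__(self, bands=13, classes=6):  # classes=6 is a placeholder
        super().__init__()
        def stream():
            return nn.Sequential(MultiScaleBlock(bands, 32),
                                 nn.MaxPool2d(2),
                                 MultiScaleBlock(32, 64))
        self.spectral_stream = stream()
        self.spatial_stream = stream()
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(classes))

    def forward(self, x):  # x: (N, 13, 11, 11)
        fused = torch.cat([self.spectral_stream(x), self.spatial_stream(x)], dim=1)
        return self.head(fused)  # logits per crop class
```

A patch batch of shape (N, 13, 11, 11) yields (N, classes) logits.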
The main differences between the proposed architecture and other CNN frameworks are:
- (1) Utilizing a dual-stream framework for spatial/spectral deep feature extraction.
- (2) Proposing a novel AM framework for the extraction of informative deep features, with higher efficiency than the convolutional block attention module (CBAM).
- (3) Taking advantage of residual, depth-wise, and separable convolution blocks, as well as combining them, for deep feature extraction.
- (4) Employing separable convolution (point-wise/depth-wise convolution layers), which provides better performance.
3.2.1. Attention Mechanism (AM)
The AM in deep learning was inspired by the psychological attention mechanisms within the human brain [78,79,80,81,82]. The main idea behind the AM is to direct the focus of the network toward extracting meaningful features instead of non-essential features [81]. The efficiency of the AM in deep learning models has been proven in previous literature [78,82,83,84,85]. In this regard, this study proposed a novel AM to increase the efficiency of the developed architecture by considering both spectral and spatial AMs. The main idea of incorporating the AM was to explore the relationship between the spectral-temporal and spatial-temporal information of input patches for the crop type classification task.
The developed spectral AM concentrated on ‘what’ is meaningful in the given input feature map [83,84,86]. To this end, we introduced a spectral attention block in accordance with the architecture illustrated in Figure 5. Based on this, the input feature map was fed into a convolution block with a kernel size (a,b) equal to the length and width of the input feature data. The size of the output feature map was 1 × 1 × c. Moreover, the number of filters was c, which was equal to the number of feature maps of the input data. After reshaping the output of the previous layer, the features were transferred into a multi-layer perceptron (MLP) with two dense layers of different neuron sizes. The first layer reduced the number of neurons based on the reduction rate, and the second layer reconstructed the features. Simultaneously, a separable convolution layer was applied to the input data before the multiplication of features with the input feature map. Finally, the output of the first stream and the separable convolution layer were fused using multiplication. The separable convolution layer was implemented in two steps: point-wise convolution, followed by depth-wise convolution on the output of the point-wise convolution.
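A hedged PyTorch sketch of this spectral attention block follows. Only the layer arrangement described above is reproduced; the reduction rate `r=4`, the 3x3 depth-wise kernel, and the sigmoid gating are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpectralAttention(nn.Module):
    """Sketch of the described spectral attention block: an (a,b)-kernel
    convolution squeezes each channel to a 1x1xc descriptor, an MLP with a
    reduction rate re-weights the channels, and the result multiplies a
    separable-convolution branch of the input."""
    def __init__(self, c, spatial, r=4):
        super().__init__()
        # (a, b) kernel equal to the input's spatial size -> 1 x 1 x c output
        self.squeeze = nn.Conv2d(c, c, kernel_size=spatial)
        # MLP: first layer reduces neurons by rate r, second reconstructs
        self.mlp = nn.Sequential(nn.Linear(c, c // r), nn.ReLU(),
                                 nn.Linear(c // r, c), nn.Sigmoid())
        # separable convolution: point-wise, then depth-wise on its output
        self.pointwise = nn.Conv2d(c, c, kernel_size=1)
        self.depthwise = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, x):  # x: (N, c, a, b)
        n, c, h, w = x.shape
        a = self.squeeze(x).reshape(n, c)        # 1x1xc channel descriptor
        a = self.mlp(a).reshape(n, c, 1, 1)      # per-channel weights
        s = self.depthwise(self.pointwise(x))    # separable-convolution branch
        return s * a                             # fuse by multiplication
```

The output keeps the input's shape, so the block can be dropped in after any convolution stage of the stream.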
The developed spatial AM considered the inter-spatial relationship of feature maps [84,87,88]. The spatial AM concentrated on ‘where’ a useful region within the input feature map is [86,89]. This AM was implemented similarly to the spectral AM, but with different output sizes of the convolution layers (see Figure 6). Based on this, the input feature map was transferred into a convolution block with a kernel size (a,b), with only one convolution kernel and padding. This means that the output size of the feature map was a × b × 1. After reshaping the output of the previous layer, the features were fed into an MLP with two fully connected layers of different neuron sizes. The first layer reduced the number of neurons based on the reduction rate. Then, the second fully connected layer reconstructed the features. Simultaneously, a separable convolution layer was applied to the input data before the multiplication of features with the input feature map. Finally, the output of the first stream and the separable convolution layer were fused via multiplication.
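The spatial variant can be sketched the same way; again the reduction rate `r=4` and the 3x3 depth-wise kernel are assumptions, and the padding is chosen so the single-kernel (a,b) convolution preserves the a × b spatial size (valid for odd patch sizes such as 11).

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Sketch of the described spatial attention block: a single-kernel
    (a,b) convolution with padding produces an a x b x 1 map, an MLP with a
    reduction rate re-weights the a*b positions, and the result multiplies
    a separable-convolution branch of the input."""
    def __init__(self, c, spatial, r=4):
        super().__init__()
        # one kernel + padding -> a x b x 1 spatial map
        self.squeeze = nn.Conv2d(c, 1, kernel_size=spatial, padding=spatial // 2)
        n = spatial * spatial
        self.mlp = nn.Sequential(nn.Linear(n, n // r), nn.ReLU(),
                                 nn.Linear(n // r, n), nn.Sigmoid())
        # separable convolution: point-wise, then depth-wise on its output
        self.pointwise = nn.Conv2d(c, c, kernel_size=1)
        self.depthwise = nn.Conv2d(c, c, kernel_size=3, padding=1, groups=c)

    def forward(self, x):  # x: (N, c, a, b)
        b_, c, h, w = x.shape
        a = self.squeeze(x).reshape(b_, h * w)   # flatten the a x b x 1 map
        a = self.mlp(a).reshape(b_, 1, h, w)     # per-position weights
        s = self.depthwise(self.pointwise(x))    # separable-convolution branch
        return s * a                             # fuse by multiplication
```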
3.2.2. Convolution Layer
Convolution layers are the central core of CNN frameworks, and the main task of these layers is extracting high-level deep features from input imagery [90]. The convolution layers automatically explore spatial and spectral features at the same time. The basic computation of the convolutional layer can be defined as follows (Equation (2)) [91]:
where x is the input data from layer l − 1, f(·) is an activation function, and w and b are the weight template and bias vector, respectively.
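The body of Equation (2) did not survive extraction; a standard convolution-layer form consistent with the definitions above (a reconstruction, not necessarily the authors' exact notation) is:

```latex
x^{l} = f\left( w^{l} \ast x^{l-1} + b^{l} \right)
```

where the superscript denotes the layer index and ∗ denotes the convolution operation.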
The output of the jth feature map for a 2D convolution at the spatial location (x,y) can be computed using Equation (3) [35], where m indexes the feature cubes connected to the current feature cube in the (l − 1)th layer, and R and S are the length and width of the filter, respectively.
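The body of Equation (3) was likewise lost in extraction; the standard 2D convolution output consistent with these definitions (a reconstruction under the same notation as Equation (2)) is:

```latex
v_{x,y}^{l,j} = f\left( b^{l,j} + \sum_{m} \sum_{r=0}^{R-1} \sum_{s=0}^{S-1} w_{r,s}^{l,j,m} \, v_{x+r,\,y+s}^{l-1,m} \right)
```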
This research took advantage of both residual and multi-scale blocks. The multi-scale blocks increase the robustness of the network to differences in the scale of objects [35]. Moreover, the residual blocks improve the efficiency of the network and help to prevent gradient vanishing.
3.3. Model Training
Since the unknown parameters of the deep learning architecture cannot be calculated through an analytical solution, an iterative framework was employed to optimize the model parameters [90]. The adaptive moment estimation (Adam) optimizer [80] was used in this study to optimize the model parameters. Furthermore, the cross-entropy (CE) loss function was utilized to calculate the error of the network during the training phase. The training phase was conducted based on the training samples, and then the loss value of the trained model was computed using validation samples. The CE loss function can be calculated using Equation (4):
where y and ŷ are the true and predicted labels, respectively, and N refers to the number of classes.
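The body of Equation (4) was lost in extraction; the standard categorical cross-entropy consistent with these definitions (a reconstruction) is:

```latex
\mathrm{CE} = -\sum_{i=1}^{N} y_{i} \log\left( \hat{y}_{i} \right)
```

where y_i is the one-hot true label and ŷ_i the predicted class probability.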
3.4. Accuracy Assessment
The statistical accuracy assessment was performed using independent test samples. The six most common statistical criteria, all extracted from the confusion matrix of the classification, were utilized to evaluate classification results. These criteria were overall accuracy (OA), user accuracy (UA), producer accuracy (PA), Kappa coefficient (KC), omission error (OE), and commission error (CE).
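All six criteria can be derived from the confusion matrix as sketched below; the rows-as-reference, columns-as-prediction orientation is a common convention assumed here, since the paper does not state its matrix layout.

```python
import numpy as np

def confusion_metrics(cm):
    """OA, Kappa, and per-class PA/UA/OE/CE from a confusion matrix.

    Assumes rows = reference (ground truth) and columns = prediction.
    """
    cm = np.asarray(cm, dtype=np.float64)
    total = cm.sum()
    oa = np.trace(cm) / total                 # overall accuracy
    # chance agreement for Kappa: product of marginal totals
    pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / total ** 2
    kappa = (oa - pe) / (1 - pe)
    pa = np.diag(cm) / cm.sum(axis=1)         # producer accuracy per class
    ua = np.diag(cm) / cm.sum(axis=0)         # user accuracy per class
    return {"OA": oa, "KC": kappa, "PA": pa, "UA": ua,
            "OE": 1 - pa, "CE": 1 - ua}       # omission/commission errors
```

For example, `confusion_metrics([[50, 10], [5, 35]])` gives OA = 0.85 with the remaining criteria computed per class.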
3.5. Comparison with Other Classification Methods
Crop mapping has widely been performed using machine learning and deep learning-based methods [92,93]. RF and XGBOOST are the most common machine learning methods and have widely been used in many crop mapping applications based on time-series datasets [93,94]. This research implemented these two machine learning-based methods to evaluate their efficiency in comparison with deep learning-based methods. Thus, six different classifiers, including two commonly used machine learning algorithms (i.e., RF [94] and XGBOOST [95]) and four deep learning models (i.e., recurrent-convolutional neural network (R-CNN) [49], 2D-CNN [47], 3D-CNN [47], and the convolutional block attention module (CBAM)), were implemented to produce a more comprehensive evaluation of the performance of the proposed model. R-CNN, developed by Mazzia, Khaliq and Chiaberge [49], combines LSTM cells and 2D convolution layers for crop mapping based on time-series datasets. Moreover, CBAM [82] applies channel attention and spatial attention after each convolution layer, wherein the channel attention block is employed before the spatial attention block. The inputs of the RF and XGBOOST algorithms were spectral-temporal features with a size of 1 × 13, where 1 and 13 refer to the spectral (i.e., NDVI) and temporal dimensions, respectively. Moreover, the input datasets of the deep learning-based methods were spatial-spectral-temporal information with a size of 11 × 11 × 1 × 13, where 11 × 11 was the width and length of the spatial window, 1 was the spectral dimension (i.e., NDVI), and 13 was the temporal dimension. It is worth noting that the size of the spatial information for the deep learning-based methods was determined by trial and error. The patch data were generated by moving a window of size 11 × 11, and the label of each patch corresponded to its central pixel.
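The sliding-window patch generation described above can be sketched as follows; the (H, W, T) cube layout and the use of label 0 for unlabelled pixels are assumptions for illustration.

```python
import numpy as np

def extract_patches(cube, labels, size=11):
    """Slide a size x size window over an (H, W, T) NDVI time-series cube.

    Each patch is labelled by its central pixel, as described above;
    pixels labelled 0 are assumed to be unlabelled and skipped.
    """
    h, w, t = cube.shape
    half = size // 2
    patches, targets = [], []
    for i in range(half, h - half):
        for j in range(half, w - half):
            if labels[i, j] > 0:
                patches.append(cube[i - half:i + half + 1,
                                    j - half:j + half + 1, :])
                targets.append(labels[i, j])
    return np.stack(patches), np.array(targets)
```

For the deep learning models, each returned patch can then be reshaped to the 11 × 11 × 1 × 13 layout stated in the text.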