Article

Classification Model of Grassland Desertification Based on Deep Learning

Huilin Jiang, Rigeng Wu, Yongan Zhang, Meian Li, Hao Lian, Yikun Fan, Wenqian Yang and Peng Zhou
1 Computer and Information Engineering College, Inner Mongolia Agricultural University, Hohhot 010000, China
2 Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot 010018, China
* Author to whom correspondence should be addressed.
Sustainability 2024, 16(19), 8307; https://doi.org/10.3390/su16198307
Submission received: 22 July 2024 / Revised: 11 September 2024 / Accepted: 13 September 2024 / Published: 24 September 2024

Abstract

Grasslands are among the most important ecosystems on Earth, and the impact of grassland desertification on the Earth’s environment and ecosystems cannot be ignored. Accurately distinguishing grassland desertification types has important application value: appropriate grazing strategies and tailored conservation measures can be implemented on the basis of these distinctions, contributing to the further protection and restoration of grassland vegetation. This project takes color images labeled with grassland desertification types as the research object and uses currently popular deep learning models as the classification tools. By comparing various deep learning image classification models, a color-image-based grassland desertification classification model is established with the Vision Transformer as the feature extraction network. The experimental results show that, despite the complex structure and large number of parameters of the resulting classification model, the test accuracy reaches 88.72% and the training loss is only 0.0319. Compared with popular classification models such as VGG16, ResNet50, ResNet101, DenseNet121, DenseNet169, and DenseNet201, the Vision Transformer demonstrates clear advantages in classification accuracy, fitting ability, and generalization capacity. Integrated with deep learning technology, the model can be applied to grassland management and ecological restoration: mobile devices can conveniently capture image data, and the information can be processed quickly. This provides efficient tools for grazing managers, environmental scientists, and conservation organizations to quickly assess the extent of grassland desertification and to optimize grassland management and conservation decisions, and it offers strong technical support for the ecological restoration and sustainable management of desertified grasslands.

1. Introduction

Desertification is one of the most severe environmental challenges facing the world today. It has a significant impact on biodiversity, ecological security, poverty eradication, socio-economic stability, and sustainable development. According to statistics, in 2009 the global area affected by desertification was approximately 36 million square kilometers, accounting for 25% of the Earth’s total land area. Currently, two-thirds of the world’s countries and nearly 20% of the global population are affected by land desertification [1]. On 24 October 2023, the UN Convention to Combat Desertification (UNCCD) announced the launch of its first-ever Data Dashboard, compiling national reporting figures from 126 countries, which showed that land degradation was advancing at an astonishing rate across all regions. Between 2015 and 2019, the world lost at least 100 million hectares of healthy and productive land each year, adding up to twice the size of Greenland [2]. Grassland desertification is a global ecological concern [3]. China is one of the countries with the richest grassland resources: grasslands cover 40% of China’s total land area and account for a significant proportion of the global grassland area. The Inner Mongolia grassland is the largest grassland region in China, representing 22% of the national grassland area, and approximately 98.5% of the desertified land in Inner Mongolia is the result of regressive succession in the grassland ecosystem [4]. In addition, the main grassland type in the central and western regions of Inner Mongolia is desert grassland, which experiences more severe grazing pressure than other grassland types. Grassland desertification manifests in various forms and types, and different control measures can be implemented for specific types of desertification. These measures aid in developing more appropriate grazing strategies and prevent further ecological damage caused by overgrazing. Classifying grassland desertification also helps determine its severity, providing a scientific basis for grassland conservation and restoration plans. Research on desertification classification is therefore crucial to preventing further grassland desertification and to slowing the progression of mild desertification, and it offers essential guidance for the ecological restoration and management of desertified grasslands.
Traditional methods of classifying grassland desertification rely on the researchers’ experience and subjective judgment, which lack consistency and standardization [5]. These methods are often inefficient and time-consuming [6], and they face significant challenges in terms of accuracy [7]. To address these issues, methods for monitoring and identifying grassland desertification using remote sensing imagery and hyperspectral image data have been developed. These approaches enhance the ability to detect and analyze desertification across large areas, allow for more accurate assessments of grassland degradation, and help inform effective management strategies. In 2013, Li [8] and colleagues applied Spectral Mixture Analysis (SMA) and decision tree methods to interpret Landsat TM/ETM+ images from 1993, 2000, 2006, and 2011 in the study area. The results indicated that grassland vegetation showed signs of recovery during this period. In 2014, Han [9] and colleagues conducted a study on the Hexi Corridor region using Landsat TM and HJ data from June to August of 2000, 2005, and 2010. They analyzed the spatial and temporal patterns of desertification by applying five different indicators. In 2015, Li [10] and colleagues applied spectral mixture analysis (SMA) and decision tree methods. Their findings revealed a significant expansion trend in desertified grasslands and areas with varying degrees of desertification between 1985 and 1992. In 2021, Weiqiang Pi [11] and colleagues collected high-resolution remote sensing images of degraded grasslands. They used the GDIF-3D-CNN classification model to classify the dataset, achieving an overall accuracy of 94.815%. In 2020, Kuang [12] and colleagues proposed a comprehensive remote sensing-based indicator, the Alpine Grassland Desertification Index (AGDI), for monitoring desertification areas and severity. The final results showed that the index achieved an overall validation accuracy of 82.05%. In 2020, Weiqiang Pi [13] and colleagues collected data from the Gegentala grassland in Inner Mongolia and developed a DGC-3D-CNN model for desertified grassland vegetation coverage. The overall recognition accuracy of the model reached 96.16%. In 2023, Li [3] and colleagues developed a Desertification Difference Index (DDI) model based on albedo and the Enhanced Vegetation Index (EVI) to clarify the transition intensity between different levels of desertified grasslands. In 2023, Zhao [14] and colleagues explored the potential of transformer networks in the field of multi-temporal hyperspectral data. Subsequently, they proposed a multi-temporal hyperspectral classification method for grassland samples based on the transformer network. In 2014, Möckel [15] and colleagues developed a Partial Least Squares Discriminant Analysis (PLS-DA) model to identify grazing vegetation across different stages of grassland succession. The overall classification accuracy of the model reached 85%. In 2022, Zhang [16] and colleagues utilized the classic deep learning models VGG and ResNet, as well as their corresponding 3D convolutional variants, 3D-VGG and 3D-ResNet, to classify features from the collected data. In 2023, Wang [17] and colleagues developed a streamlined 2D-CNN model with different feature enhancement modules. This model was compared to existing hyperspectral classification models such as ResNet34 and DenseNet121. The results showed that the model achieved an overall classification accuracy of 99.216%. In 2022, Xu [6] and colleagues used the Support Vector Machine (SVM) classification method to assess different levels of grassland desertification. Their findings indicated that using RGB images for grassland desertification classification provided high accuracy.
Previous research has primarily focused on using hyperspectral and remote sensing images to classify and monitor grassland desertification. Researchers have employed traditional algorithms such as spectral mixture analysis, decision trees, and support vector machines (SVM), or used deep learning models such as 3D-CNNs and Transformers, to classify and evaluate grassland desertification. However, these studies mostly relied on hyperspectral or multispectral data and have not fully explored the use of color images for classification. To address this gap, this work classifies grassland desertification using color images. Compared to hyperspectral data, color images are more widely applicable and easier to acquire. By utilizing color images, this study not only significantly simplifies the classification process but also offers a more practical method for classifying grassland desertification in resource-constrained regions. Classifying grassland desertification using remote sensing data faces several challenges, including limitations in spatial and spectral resolution, the impact of weather conditions, and the complexity of data processing. These factors can lead to reduced classification accuracy and increased processing cost [16]. At the same time, remote sensing imagery is more difficult and costly to acquire than field measurements, and overcoming these limitations requires the integration of multi-source data, support from high-performance computing, and ground validation. Hyperspectral imaging technology demands specialized equipment and software; compared to the more easily accessible color images, obtaining hyperspectral images is more complex, and their large data volume and high dimensionality increase the processing burden. In summary, although color image data have certain limitations in grassland desertification classification, such as limited classification accuracy, susceptibility to lighting conditions, and smaller coverage, they are much easier to acquire than hyperspectral data and present lower processing complexity. This convenience compensates for the high cost and processing difficulties associated with hyperspectral data. Color images are intuitive, widely understood, and compatible with common equipment. Using mobile and portable devices to capture color images allows for flexible data collection at various times and locations, addressing the constraints of traditional remote sensing, which is limited by satellite or fixed sensors. Compared to the high-dimensional nature of hyperspectral data, color image data are lower in dimensionality and can be processed more quickly, reducing the complexity and time required for data handling. When combined with end-to-end deep learning classification models, color images can fully enable online classification of grassland desertification, significantly enhancing the convenience and real-time capability of data acquisition. This makes the approach highly valuable for grassland desertification classification.
In many practical applications, especially on mobile devices, embedded systems, or cloud computing environments, computing resources are limited. Therefore, the memory footprint and computational efficiency of deep learning models are critical for deployment in these environments. This study compares the memory consumption and inference performance of various deep learning models to guide model selection for resource-limited scenarios. Mobile hardware is used for low-altitude imaging of desertified grasslands, providing data for classification. The data are pre-processed and classified into four categories based on desertification levels, forming a complete dataset. The following models are trained on this dataset: VGG16, ResNet50, ResNet101, DenseNet121, DenseNet169, DenseNet201, MobileNetV2, MobileViT, ConvNeXt, Vision Transformer, and Swin Transformer. This study compares their inference performance in terms of model size, parameter count, and computational complexity. These image classification networks are then trained and tested on the color image dataset to determine the classification accuracy of each of the 11 models and identify the model with the highest accuracy. Finally, the most effective model from this study will be applied to classify grassland desertification, allowing management measures tailored to different types of grassland desertification and more reasonable grazing strategies to be developed to prevent further ecological degradation caused by overgrazing.

2. Materials and Methods

2.1. Materials

2.1.1. Experimental Plot

The research area is located in the experimental base of the first team of the Academy of Agriculture and Animal Husbandry Science, Sizi Wangqi Wangfu, Wulanqab City, Inner Mongolia Autonomous Region, as shown in Figure 1. It is one of the key areas for grassland environment research by the College of Grassland and Resources and Environment of Inner Mongolia Agricultural University. This area belongs to the desert grassland type, with geographical coordinates of 41.78° N and 111.88° E, at an elevation of 1456 m. The experimental site is situated in a typical mid-temperate continental climate zone, characterized by perennial aridity and low rainfall, with scarce water resources being the main climatic feature. According to the 2021 statistics, the annual average precipitation is only 280 mm, and the ≥10 °C accumulated temperature is between 2200 and 2500 °C. The main plant community in the experimental area is Stipa breviflora, and the dominant species are Cryptophyllum and Artemisia frigida. The average community coverage ranges from 18% to 25%. The soil type is primarily Calcic Chernozem, with a soil layer thickness of approximately 1 m and a stratified caliche layer below 40 cm. The soil nutrient status is characterized by high potassium, low phosphorus, and low nitrogen levels, with a relatively low organic matter content [18].

2.1.2. Data Acquisition Mode

Based on the studies by Quan [19] and Milazzo [20], the criteria for categorizing grassland desertification types under different grazing intensities include vegetation coverage, soil quality, and ecosystem health. Intensive grazing leads to sparse vegetation, damaged soil structure, and water loss, which in turn accelerates the desertification process. Therefore, different grazing gradients can be used to classify grassland desertification types.
As shown in Figure 2, in order to study the classification of grassland desertification, the research area was divided into three sections, with the blue slash representing Section 1, the orange slash representing Section 2, and the green slash representing Section 3. Within each block, four stocking rate levels were set, resulting in different grazing gradients based on the variation in stocking rates [18]. According to different grazing gradients, different types of grassland desertification were obtained. The four stocking rate levels corresponded to the four grassland desertification categories, respectively. The treatments were as follows:
  • Control (CK): 0 sheep units·hm⁻²·a⁻¹, corresponding to non-desertified grassland;
  • Light Grazing (LG): 0.93 sheep units·hm⁻²·a⁻¹, corresponding to mildly desertified grassland;
  • Moderate Grazing (MG): 1.82 sheep units·hm⁻²·a⁻¹, corresponding to moderately desertified grassland;
  • Heavy Grazing (HG): 2.71 sheep units·hm⁻²·a⁻¹, corresponding to severely desertified grassland.
Continuous grazing began in early June 2020 and lasted for six months, ending in late November. Each month, key plant communities such as Stipa breviflora and Cleistogenes squarrosa were randomly selected from each grazing plot for photographic documentation. The data were finally recorded in the form of color images.

2.1.3. Data Annotation

As shown in Figure 3, the collected color images were classified and labeled with the assistance of experts from the College of Grassland and Resources and Environment at Inner Mongolia Agricultural University. The images were then categorized into four classes: CK, LG, MG, and HG, where CK represents non-desertified grassland, LG mildly desertified grassland, MG moderately desertified grassland, and HG severely desertified grassland. The folder names were used as label names.

2.1.4. Data Preprocessing

Due to variations in the collection tools and shooting angles, the color image data contain noise such as shadows and color deviations. Therefore, before the experiment, the color image data need to be converted into structured data and preprocessed to reduce noise. The specific steps are as follows:
Image resizing is performed using the Resize function, which scales the image toward a size of 512 × 512 pixels while preserving its proportions. The original image has a width of $W_0$ and a height of $H_0$, the target size is 512 × 512, and the scaling ratio $r$ is calculated using Formula (1) below.
$r = \min\left( \dfrac{512}{W_0}, \dfrac{512}{H_0} \right)$ (1)
The new image dimensions will be calculated based on the scaling ratio, as shown in Formula (2) below.
$W_n = W_0 \times r, \quad H_n = H_0 \times r$ (2)
$W_n$ and $H_n$ represent the adjusted width and height, which maintain the original image’s aspect ratio.
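As an illustration, a minimal sketch of this aspect-ratio-preserving resize is given below; Pillow is assumed for image I/O, since the text does not name the imaging library, and the file name is hypothetical.

```python
from PIL import Image

def resize_keep_aspect(img: Image.Image, target: int = 512) -> Image.Image:
    """Scale an image so that it fits within a target x target box while
    preserving its aspect ratio, following Formulas (1) and (2)."""
    w0, h0 = img.size
    r = min(target / w0, target / h0)      # scaling ratio r from Formula (1)
    wn, hn = int(w0 * r), int(h0 * r)      # new width and height from Formula (2)
    return img.resize((wn, hn))

# Hypothetical usage on a field photograph:
# img_512 = resize_keep_aspect(Image.open("grassland_sample.jpg"))
```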
  • Gaussian filtering is applied for denoising, as shown in the comparison of the images before and after filtering in Figure 4. The edges of the image become smoother, and noise is effectively reduced. The coefficient calculation formula for the Gaussian filter is given in Equation (3).
$G(x, y) = \dfrac{1}{2\pi\sigma^2} \exp\left( -\dfrac{x^2 + y^2}{2\sigma^2} \right)$ (3)
where $G(x, y)$ is the filter coefficient at position $(x, y)$; $\sigma$ is the standard deviation, which determines the degree of smoothing of the filter (in this case, $\sigma = 0.5$); and $(x, y)$ are the coordinates relative to the center of the filter.
A 3 × 3 filter is used, and the discrete filter coefficient matrix can be calculated using the formula above. Assuming the center of the filter is at (0,0), the filter coefficient matrix is shown in Formula (4) below.
$\begin{bmatrix} G(-1,-1) & G(0,-1) & G(1,-1) \\ G(-1,0) & G(0,0) & G(1,0) \\ G(-1,1) & G(0,1) & G(1,1) \end{bmatrix}$ (4)
These coefficients can be directly applied to each pixel and its surrounding neighborhood in the image to calculate the smoothed pixel values. The effect of the filter on the image can be achieved through convolution, and the calculation formula is shown in Equation (5).
$I'(x, y) = \sum_{i=-k}^{k} \sum_{j=-k}^{k} I(x+i,\, y+j)\, G(i, j)$ (5)
where I(x,y) represents the value of the original image at pixel (x,y), and I′(x,y) represents the value of the filtered image at pixel (x,y). G(i,j) denotes the coefficients of the Gaussian filter, and k is the radius of the filter. For k = 1, the filter corresponds to a 3 × 3 filter.
  • Color normalization based on the channel mean and standard deviation is applied: the mean and standard deviation of the three channels are first calculated, and each channel’s pixel values are then standardized so that all channels share a uniform mean and standard deviation. This process mitigates the effects of lighting conditions and variations in color distribution, thereby enhancing the comparability and consistency of the image. Since the standardized values no longer fall within the usual range of 0 to 255, the pixel values are scaled back to the 0 to 255 range and finally converted into an 8-bit unsigned integer format.
  • For normalization, the ToTensor function is used to convert the image into a tensor of shape (C, H, W). The Normalize function then rescales the pixel values so that each value falls within [0, 1]. The normalization is given by Formula (6):
$x_{\mathrm{norm}} = \dfrac{x_i - x_{\min}}{x_{\max} - x_{\min}}$ (6)
In the formula, $x_i$ is the value of a single data point, $x_{\min}$ is the minimum value of the column in which the data point resides, and $x_{\max}$ is the corresponding maximum value. After processing, the data are transformed into 224 × 224 matrices, with each pixel value in the range [0, 1], facilitating subsequent modeling and classification.
  • As shown in Table 1, given the imbalance in the number of images across the categories, differentiated augmentation ratios were employed. Using rotation and mirroring, the dataset was expanded so that each category ultimately contained 1330 images. This ensured a balanced distribution of samples and avoided the impact of disproportionate sample sizes on the model training results. Figure 5 shows the composition of the original dataset, and Table 1 presents the number of images in each category after augmentation. A minimal code sketch of these preprocessing steps is given after this list.
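The sketch below ties the preprocessing steps above together in one pass; OpenCV and torchvision are assumed as the tooling (the text does not name the libraries), with σ = 0.5 and a 3 × 3 kernel as specified for Equation (3).

```python
import cv2
import numpy as np
import torch
from torchvision import transforms

def gaussian_kernel(k: int = 1, sigma: float = 0.5) -> np.ndarray:
    """Discrete coefficients of Equation (3) on a (2k+1) x (2k+1) grid,
    normalized so that they sum to 1."""
    ax = np.arange(-k, k + 1)
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)
    return g / g.sum()

def preprocess(img_rgb: np.ndarray) -> torch.Tensor:
    """Gaussian smoothing, per-channel color normalization, and tensor
    conversion with min-max scaling (Equations (3)-(6))."""
    # 1. Gaussian smoothing: convolve the image with the 3 x 3 kernel (Equation (5)).
    smoothed = cv2.filter2D(img_rgb, -1, gaussian_kernel())

    # 2. Color normalization: standardize each channel to zero mean and unit
    #    variance, then rescale to 0-255 and convert back to 8-bit integers.
    x = smoothed.astype(np.float32)
    mean, std = x.mean(axis=(0, 1)), x.std(axis=(0, 1)) + 1e-8
    x = (x - mean) / std
    x = cv2.normalize(x, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # 3. ToTensor produces a (C, H, W) tensor; the min-max step of Equation (6)
    #    then maps every value into [0, 1].
    t = transforms.ToTensor()(x)
    return (t - t.min()) / (t.max() - t.min() + 1e-8)

# Offline augmentation used to balance the classes (rotation and mirroring):
# rotated  = cv2.rotate(img_rgb, cv2.ROTATE_90_CLOCKWISE)
# mirrored = cv2.flip(img_rgb, 1)
```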

2.1.5. Dataset Division

In this paper, the pre-processed image data of the four categories are divided according to the ratio of 7:3, in which 70% is the training set and 30% is the test set. The specific division is shown in Table 2.
The training set serves as the data sample for model fitting, using deep learning algorithms during the training process to derive the model’s weight parameters. The test set is used to evaluate the model’s performance and obtain classification accuracy.
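A minimal sketch of this 7:3 split is shown below, assuming the annotated class folders (CK, LG, MG, HG) are loaded with torchvision's ImageFolder; the folder path and random seed are illustrative, as the paper does not state how the split was implemented.

```python
import torch
from torchvision import datasets, transforms

# Folder names double as class labels, as described in the data annotation step.
dataset = datasets.ImageFolder("grassland_dataset", transform=transforms.ToTensor())

n_train = int(0.7 * len(dataset))          # 70% training set
n_test = len(dataset) - n_train            # 30% test set
train_set, test_set = torch.utils.data.random_split(
    dataset, [n_train, n_test],
    generator=torch.Generator().manual_seed(0))  # fixed seed for a reproducible split

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=32, shuffle=False)
```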

2.2. Method

2.2.1. Experiment Environment

As shown in Table 3, this study uses CUDA version 12.1, with an NVIDIA GeForce RTX 3080 Ti graphics card, which has 12 GB of memory. The operating system is Windows 10. The program is written in Python 3.10 and trained on the PyTorch-2.1.0-gpu deep learning framework.

2.2.2. Training Parameter Settings

Before training the model, various training parameters need to be analyzed and set, such as Batch_size, Epoch, and imgsize. The specific training parameters are shown in Table 4. In the table, Batch_size refers to the total number of samples selected for training in each batch. By appropriately setting Batch_size, GPU memory can be effectively utilized, accelerating training speed and enhancing matrix parallelization efficiency. Based on the hardware environment of the experimental platform and repeated testing, this study sets the Batch_size to 32. Epoch refers to the number of times all the data in the training set is used for training. Generally, when the model’s performance on the validation set reaches a certain level or stops improving, it is considered that the model has converged, indicating that the current number of epochs is optimal. However, if the number of epochs is too large, overfitting may occur, where the trained model performs well on the training set but poorly in real-world applications. Imgsize refers to the image size input into the model. Images with lower resolution contain less information, which may prevent the neural network from extracting enough features, leading to poor accuracy. However, images with higher resolution can provide too much information, overloading the network and negatively impacting detection speed.
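A training-loop sketch using the parameters in Table 4 is given below. The schedule between the initial learning rate (lr) and the final learning rate (lrf) is not specified in the paper, so a cosine decay is assumed, and ResNet50 stands in for any of the compared backbones.

```python
import torch
import torch.nn as nn
from torchvision import models

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")  # Device 0 in Table 4
model = models.resnet50(num_classes=4).to(device)   # stand-in backbone with a 4-class head

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)           # lr = 0.001 (Table 4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=200, eta_min=1e-4)                             # decay toward lrf = 0.0001

def train_one_epoch(loader):
    """One pass over the training loader with Batch_size = 32 mini-batches."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()

# for epoch in range(200):           # Epoch in Table 4
#     train_one_epoch(train_loader)
```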

2.2.3. Evaluation Index

To evaluate the model’s performance on the test set, this study uses training accuracy, training loss, test accuracy, test loss, recall, precision, F1 score, and confusion matrix to measure the training effectiveness of the convolutional neural network and Transformer network. Accuracy is defined as the ratio of correctly classified samples to the total number of samples in the dataset. Loss value is a metric that measures the difference between the model’s predictions and the actual results. Recall is the ratio of correctly retrieved positive samples to the total number of positive samples that should have been retrieved. Precision is the ratio of correctly retrieved positive samples to the total number of positive samples retrieved. The F1 score is the harmonic mean of precision and recall. The confusion matrix clearly presents the classification results of the model in a matrix format, helping us understand the model’s performance across different categories. Table 5 shows the basic structure of the confusion matrix.
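These indices can be computed from the model's test-set predictions as sketched below; scikit-learn is assumed for the metric functions and macro averaging is assumed for precision, recall, and F1, neither of which is stated in the paper.

```python
import torch
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

@torch.no_grad()
def evaluate(model, loader, device):
    """Collect test-set predictions and report the evaluation indices."""
    model.eval()
    y_true, y_pred = [], []
    for images, labels in loader:
        logits = model(images.to(device))
        y_pred.extend(logits.argmax(dim=1).cpu().tolist())
        y_true.extend(labels.tolist())
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
        "confusion_matrix": confusion_matrix(y_true, y_pred),  # rows: actual, columns: predicted
    }
```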

2.2.4. Data Processing Workflow and Model Selection

The data processing workflow in this study is divided into three parts, namely input, feature extraction, and output, as shown in Figure 5. The input is a color image with a type annotation obtained by a color camera; the features are then extracted through the feature extraction network, and the desertification type of the grassland represented by the image is determined from the extracted features.
As can be seen from Figure 5, the performance of the feature extraction network determines the performance of the entire model, so the selection of the feature extraction network has a key impact on the research results of this study. Below is a basic overview of the 11 models used in the experiment. VGG16 is a convolutional neural network developed by the Visual Geometry Group at the University of Oxford. It has a 16-layer structure and is mainly used for image classification tasks. Its key feature is the use of small 3 × 3 convolutional kernels, resulting in a simple yet deep network. ResNet50, a deep residual network with 50 layers, addresses the vanishing gradient problem in deep networks by introducing residual blocks and skip connections, allowing for a deeper network without compromising performance. ResNet101 is an extended version of the ResNet network with a deeper structure of 101 layers, designed for more complex tasks requiring higher accuracy. DenseNet121 is a version of the densely connected network with 121 layers. The key feature of DenseNet is that each layer is connected to all previous layers, promoting feature reuse and reducing the number of parameters. DenseNet169 is a variant in the DenseNet series, with 169 layers, used to handle even more complex tasks. DenseNet201 is another variant of the DenseNet architecture, featuring 201 layers, making it suitable for applications that require high-level feature extraction. MobileNetV2 is a lightweight convolutional neural network designed for mobile devices and embedded systems. It reduces computational cost and model size by incorporating depthwise separable convolutions and an inverted residual structure. The Vision Transformer is a visual model based on self-attention mechanisms, applying the Transformer architecture to image classification tasks for the first time. It extracts features by processing images in patches. The Swin Transformer is an improved version of the Transformer model, using a sliding window mechanism for feature extraction, making it suitable for handling high-resolution images and complex visual tasks. ConvNeXt is a novel convolutional neural network architecture that draws inspiration from the strengths of the Vision Transformer, offering enhanced performance and flexibility. MobileViT combines the lightweight design of MobileNet with the feature representation capabilities of the Vision Transformer, making it suitable for image classification and other visual tasks.
The Vision Transformer is an image classification model based on the Transformer architecture, with unique structural and performance advantages that make it excel in visual tasks. The Vision Transformer first splits the input image into fixed-size patches. Each patch is transformed into a fixed-length vector through linear embedding. Then, a corresponding positional encoding is added to each patch, allowing the model to understand the positional relationships between the patches. The embedded patches are processed by multiple Transformer encoder layers, each containing a Multi-head Self-Attention mechanism and a Feed-Forward Neural Network, with residual connections and Layer Normalization used to connect the layers. A special classification token (the [CLS] token) is prepended to the patch sequence for the final classification output.
The Vision Transformer can effectively capture global features in images through the self-attention mechanism, making it better at handling long-range dependencies than traditional Convolutional Neural Networks (CNNs). Additionally, due to the normalization mechanisms inherent in the Transformer structure, the Vision Transformer has less need for data normalization and its training is more stable. Its capacity can be expanded by adjusting parameters such as the number of Transformer layers and the number of attention heads, making it adaptable to datasets and tasks of different scales. The Vision Transformer takes advantage of the Transformer architecture to perform well in image classification tasks; its structural characteristics give it efficient feature extraction, good scalability, and strong transfer learning ability when processing visual tasks. These features make the Vision Transformer the model of choice for many computer vision tasks. Therefore, this paper mainly selects the Vision Transformer as the feature extraction network and analyzes its advantages by comparing it with the other ten feature extraction networks.
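As an illustration of the Vision Transformer serving as the feature extraction network, the sketch below builds a ViT with a four-class head (CK, LG, MG, HG) using torchvision. The exact variant is not stated in the paper; ViT-Base/16 is assumed here because its roughly 86 M parameters are close to the 85.6 M reported in Table 6.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16

# ViT-Base/16: 16 x 16 patches, learnable [CLS] token, 12 Transformer encoder layers.
model = vit_b_16(weights=None)
model.heads.head = nn.Linear(model.heads.head.in_features, 4)  # CK / LG / MG / HG

x = torch.randn(1, 3, 224, 224)   # one RGB image at the model's expected resolution
logits = model(x)                 # shape (1, 4): one score per desertification class
print(logits.shape)
```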

3. Results and Discussion

3.1. Comparison of Model Inference Performance Results

Table 6 presents the training and inference performance of the 11 models on the dataset used in this study. Based on these results, the following conclusions can be drawn:
  • Convergence speed: ResNet101 reached the highest test accuracy in the 40th epoch, demonstrating its fast convergence. In contrast, the Vision Transformer did not reach its maximum accuracy until the 97th epoch. This aligns with the findings of Dosovitskiy [21] and colleagues, who observed that ResNet, due to its fewer layers and parameters, typically converges faster on smaller datasets. The slower convergence of the Vision Transformer may be attributed to its complex architecture, which requires more training epochs to fully capture advanced data features.
  • Training time: MobileNetV2 completed training in the shortest time, taking 1135 s, while the Swin Transformer took the longest, with a training time of 5003 s. The Vision Transformer required 4500 s. This aligns with the findings of Gao et al. [22], who noted that MobileNetV2, due to its smaller parameter count and simple structure, can complete training in a shorter time. However, this may also result in less comprehensive feature extraction. In contrast, the Swin Transformer and Vision Transformer, with larger parameter counts of 88 M and 85.7 M, respectively, offer more thorough feature extraction but require more time to complete training.
  • Memory requirements: MobileVIT has the smallest memory requirement at 11.0 M, while VGG16 requires the most, with a memory demand of 1537.3 M. The Swin Transformer and Vision Transformer fall in the mid-range, needing 993.3 M and 982.5 M, respectively. These figures are directly related to the number of parameters in each model. For example, VGG16 has 138 M parameters, while MobileVIT has only 1.3 M, which explains its lower memory requirement. This finding is consistent with Gao et al. [23], who observed that the memory demand of neural networks is directly correlated with the number of model parameters.
  • Sample processing speed: In terms of the average number of samples processed per second, MobileNetV2 leads with 469 samples per second, making it the fastest. The Swin Transformer and Vision Transformer process 106 and 118 samples per second, respectively, at a slower pace. This outcome is influenced not only by the number of parameters in each model but also by the amount of data that must be processed for each image. While VGG16 has a larger number of parameters, its strong data processing capability allows it to handle 253 samples per second. However, the Swin Transformer and Vision Transformer, despite being thorough in feature extraction, are slower due to the large data volume each image requires and their relatively lower data processing capacity per second. Existing research indicates that MobileNetV2, thanks to its lightweight design and efficient depthwise separable convolutions, excels in processing speed and lower computational complexity, making it highly suitable for mobile devices or resource-constrained environments [24]. In contrast, the Swin Transformer and Vision Transformer, with their complex hierarchical structures and self-attention mechanisms, though more comprehensive in feature extraction, process samples more slowly due to the higher data volume per image and increased computational complexity. While VGG16 has a higher parameter count and strong data processing ability, its overall sample processing speed is still slower than MobileNetV2 due to its higher computational demands. These conclusions are consistent with previous studies, further supporting the correlation between the number of parameters and sample processing speed [25]. A sketch of how parameter counts, memory estimates, and throughput figures of this kind can be measured is given after this list.
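The sketch below shows one way to measure the statistics reported in Table 6; the torchinfo package is assumed for the parameter count and estimated total size, the input resolution is illustrative, and throughput numbers depend on the hardware used.

```python
import time
import torch
from torchinfo import summary
from torchvision.models import mobilenet_v2

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
model = mobilenet_v2(num_classes=4).to(device).eval()

# Parameter count and estimated total size (cf. the Params and ETS columns).
summary(model, input_size=(32, 3, 224, 224), device=device)

# Rough inference throughput (cf. the Samples/s column).
batch = torch.randn(32, 3, 224, 224, device=device)
with torch.no_grad():
    for _ in range(10):                      # warm-up iterations
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(50):
        model(batch)
    if device.type == "cuda":
        torch.cuda.synchronize()
print("samples/s:", 50 * 32 / (time.time() - start))
```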

3.2. Comparison of Model Performance Index

Table 7 shows the comparison of the performance indicators of the 11 models at their optimal epochs. In terms of classification accuracy, the Vision Transformer has the highest test accuracy, reaching 88.72%, while ResNet101 has the lowest, at only 85.96%, a difference of 2.76%. This shows that the Vision Transformer has the best classification effect on the dataset of this study, although an accuracy of 88.72% still leaves room for improvement. In terms of train loss, the Vision Transformer has the smallest value, only 0.0319, and its training accuracy is 98.79%, indicating that it was sufficiently trained, fits the training data well, and performs well during the training process. Moreover, the Recall and F1 score of the Vision Transformer are 88.72% and 88.70%, respectively, which indicates that the Vision Transformer has high accuracy and stability in this classification task. However, the Vision Transformer also has the highest test loss, reaching 0.6754. One reason for this result is that the labels of the sample data come from the grazing-gradient division of the plots at acquisition time, whereas each picture reflects only part of a plot. Images categorized as showing higher degrees of desertification do not necessarily correspond to plots with lower vegetation coverage, and the plots reflected in images classified under lower degrees of desertification may not always show higher vegetation coverage, so there may be misclassification in the sample labels. On the other hand, given the large loss value during testing, the Vision Transformer has certain shortcomings in its generalization ability, and there is still room for improvement.
Figure 6 shows how the training accuracy and training loss of the 11 models change with the number of iterations over 100 training epochs. As can be seen from Figure 6a, most models achieve high training accuracy within 60 epochs, showing rapid convergence. The convergence of MobileViT, ConvNeXt, and VGG16 was slower, and although their training accuracy gradually improved, it never matched that of the other models. The Vision Transformer performs best, with almost the highest accuracy in every round of training. From Figure 6b, it can be seen that the training losses of most models decline rapidly during the first 60 epochs and then level off. The losses of MobileViT, ConvNeXt, and VGG16 decreased slowly and remained relatively high, indicating that the error was large during the training process and the model fit was poor. The Vision Transformer converges fastest, and its training loss curve is the lowest, demonstrating exceptional fitting ability. This is due to its global attention mechanism, flexible image patch processing, and multi-scale information fusion through the multi-head attention mechanism. These features enable the Vision Transformer to better adapt to complex image recognition tasks, offering superior accuracy compared to the other models.

3.3. Vision Transformer Confusion Matrix

Figure 7 shows the confusion matrix of the Vision Transformer model at its highest test accuracy. There are 1596 images in the test set, with 399 images in each category. In the CK category, 343 images are predicted as CK, giving a prediction accuracy of 85.96% for this class. In the HG category, 381 images are predicted as HG, a prediction accuracy of 95.49%. In the MG category, 360 images are predicted as MG, a prediction accuracy of 90.23%. In the LG category, 332 images are predicted as LG, a prediction accuracy of 83.21%. The model thus has a high prediction accuracy for the HG and MG classes, exceeding 90%, although 6.49% of MG images are predicted as LG. For the CK and LG classes, the prediction accuracy is relatively lower, with 6.27% of CK images predicted as LG and 8.52% of LG images predicted as CK. This indicates that when the desertification levels of two classes are close, the Vision Transformer tends to make inaccurate predictions.
Based on the comparison of the experimental results above, the Vision Transformer model has the highest test accuracy, excellent performance metrics, lower training loss, higher training accuracy, and reasonable computational efficiency, making it currently the ideal classification model for grassland desertification. This study uses color image data. During image acquisition or transmission, different devices may introduce image noise, which causes random variations in pixel values, reduces the visual quality of the image, and compromises the integrity of the signal. Applying Gaussian filtering to the color image data effectively reduces the random noise caused by different devices, enhancing visual quality and signal integrity; it smooths the image and suppresses high-frequency noise such as random brightness fluctuations or small noise spots. Additionally, color images are easily affected by external factors such as lighting, camera equipment, and environmental conditions, and the color of the same object may vary under different lighting conditions or with different devices, which can introduce errors during model training. Color normalization adjusts the color distribution of the images, effectively reducing the impact of lighting, equipment, and environmental differences. This enhances data consistency, reduces errors during training, and improves the stability and generalization capability of the training process. With standardization, the model focuses more on essential features of the image, such as shape and texture, which reduces the risk of overfitting and enhances performance and reliability across different scenarios. The Vision Transformer trained on this basis effectively handles complex color image data, making it an ideal model for classifying grassland desertification from color images. Additionally, its self-attention mechanism enables the model to capture global and multi-scale features, making it suitable for applications such as environmental dynamic monitoring, precise ecological restoration, and land resource management.

4. Conclusions

This study utilized mobile devices to capture color image data of grasslands with varying degrees of desertification. The images were processed and used to train and test image classification algorithms. The experimental results demonstrate that the Vision Transformer performed best in classifying grassland desertification. Although it has a complex architecture, a large number of parameters, and a longer training time, its classification accuracy reached 88.72%, surpassing the other models. The Vision Transformer can therefore classify grassland desertification quickly and accurately. However, since the sample labels are based on the grazing-gradient division of the plots at acquisition time, labels may be misaligned with the condition actually shown in an image, which places an inherent limit on the accuracy of grassland desertification classification based on color images. Future research should therefore focus on the following key points to improve classification accuracy: (1) Optimize data labeling by enhancing the precision of plot gradient division, thus reducing the errors caused by potential misalignment in classification. (2) Improve sample preprocessing by introducing more advanced noise reduction and color correction algorithms to enhance image quality and consistency. (3) Refine the Vision Transformer model through model simplification, parameter tuning, and data augmentation techniques to improve generalization performance and computational efficiency, making it better suited for grassland desertification classification.

Author Contributions

Conceptualization, H.J., M.L. and Y.Z.; methodology, H.L. and Y.F.; software, W.Y.; validation, P.Z.; data curation, R.W.; writing—original draft preparation, H.J.; writing—review and editing, M.L.; project administration, H.J. and R.W.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Inner Mongolia Natural Science Foundation Project, “Research on Feature Point Space Invariance Method for Multi-target Individual Identity Recognition of Livestock” (grant number: 2023LHMS06012), and the Basic Research Business Fee Project of Inner Mongolia Autonomous Region Directly Affiliated Universities, “Research and Demonstration of Key Technologies for Multi-scale Yellow River Ice Situation Automatic Detection Based on AIoT” (grant number: BR231407).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Feng, Q.; Tian, Y.; Yu, T.; Yin, Z.; Cao, S. Combating desertification through economic development in northwestern China. Land Degrad. Dev. 2019, 30, 910–917. [Google Scholar] [CrossRef]
  2. United Nations Convention to Combat Desertification (UNCCD). Global Desertification and Land Degradation Data. 2024. Available online: https://www.unccd.int (accessed on 6 September 2024).
  3. Li, J.; Cao, C.; Xu, M.; Yang, X.; Gao, X.; Wang, K.; Guo, H.; Yang, Y. A 20-Year Analysis of the Dynamics and Driving Factors of Grassland Desertification in Xilingol, China. Remote Sens. 2023, 15, 5716. [Google Scholar] [CrossRef]
  4. Kang, S.; Niu, J.; Zhang, Q.; Han, J.; Bao, H. State Transition and Sustainability in the Degradation Succession of Stipa Grassland. J. Ecol. 2020, 39, 3147–3154. [Google Scholar] [CrossRef]
  5. Guo, Q.; Fu, B.; Shi, P.; Cudahy, T.; Zhang, J.; Xu, H. Satellite Monitoring the Spatial-Temporal Dynamics of Desertification in Response to Climate Change and Human Activities across the Ordos Plateau, China. Remote Sens. 2017, 9, 525. [Google Scholar] [CrossRef]
  6. Xu, X.; Liu, L.; Han, P.; Gong, X.; Zhang, Q. Accuracy of Vegetation Indices in Assessing Different Grades of Grassland Desertification from UAV. Int. J. Environ. Res. Public Health 2022, 19, 16793. [Google Scholar] [CrossRef]
  7. Wang, N.; Peng, S.; Meng, N.; Lv, H. Analysis of Spatiotemporal Dynamics of Land Desertification in Qilian Mountain National Park Based on Google Earth Engine. ISPRS Int. J. Geo-Inf. 2024, 13, 117. [Google Scholar] [CrossRef]
  8. Li, J.; Yang, X.; Jin, Y.; Yang, Z.; Huang, W.; Zhao, L.; Xu, B. Monitoring and analysis of grassland desertification dynamics using Landsat images in Ningxia, China. Remote Sens. Environ. 2013, 138, 19–26. [Google Scholar] [CrossRef]
  9. Han, L.; Zhang, Z.; Zhang, Q.; Wan, X. Desertification assessments in the Hexi corridor of northern China’s Gansu Province by remote sensing. Nat. Hazards 2015, 75, 2715–2731. [Google Scholar] [CrossRef]
  10. Li, J.; Xu, B.; Yang, X.; Jin, Y.; Zhao, L.; Zhao, F.; Ma, H. Characterizing changes in grassland desertification based on Landsat images of the Ongniud and Naiman Banners, Inner Mongolia. Int. J. Remote Sens. 2015, 36, 5137–5149. [Google Scholar] [CrossRef]
  11. Yu, W.; Yao, X.; Shao, L.; Liu, J.; Shen, Y.; Zhang, H. Classification of Desertification on the North Bank of Qinghai Lake. Comput. Mater. Contin. 2022, 72, 695–711. [Google Scholar] [CrossRef]
  12. Kuang, Q.; Yuan, Q.; Han, J.; Leng, R.; Wang, Y.; Zhu, K.; Lin, S.; Ren, P. A remote sensing monitoring method for alpine grasslands desertification in the eastern Qinghai-Tibetan Plateau. J. Mt. Sci. 2020, 17, 1423–1437. [Google Scholar] [CrossRef]
  13. Pi, W.; Du, J.; Liu, H.; Zhu, X. Desertification Grassland Classification and Three-Dimensional Convolution Neural Network Model for Identifying Desert Grassland Landforms with Unmanned Aerial Vehicle Hyperspectral Remote Sensing Images. J. Appl. Spectrosc. 2020, 87, 309–318. [Google Scholar] [CrossRef]
  14. Zhao, X.; Zhang, S.; Shi, R.; Yan, W.; Pan, X. Multi-Temporal Hyperspectral Classification of Grassland Using Transformer Network. Sensors 2023, 23, 6642. [Google Scholar] [CrossRef] [PubMed]
  15. Möckel, T.; Dalmayne, J.; Prentice, H.; Eklundh, L.; Purschke, O.; Schmidtlein, S.; Hall, K. Classification of grassland successional stages using airborne hyperspectral imagery. Remote Sens. 2014, 6, 7732–7761. [Google Scholar] [CrossRef]
  16. Zhang, Y.; Du, J. Deep Learning Classification of Grassland Desertification in China via Low-Altitude UAV Hyperspectral Remote Sensing. Spectroscopy 2022, 37, 28–35. [Google Scholar] [CrossRef]
  17. Wang, S.; Bi, Y.; Du, J.; Zhang, T.; Gao, X.; Jin, E. The Unmanned Aerial Vehicle (UAV)-Based Hyperspectral Classification of Desert Grassland Plants in Inner Mongolia, China. Appl. Sci. 2023, 13, 12245. [Google Scholar] [CrossRef]
  18. Qu, Z.; Sun, X.; Yang, Z.; Wang, Y.; Bai, L.; Li, Z.; Na, Y.; Han, G.; Zhang, Z.; Wang, J. Effects of Different Grazing Intensities on the Characteristics of Micro-patch of Desert Grassland Meadows. Acta Agrestia Sinica 2024, 32, 1558–1571. [Google Scholar]
  19. Quan, Z.; Cheng, Y.; Tsubo, M.; Shinoda, M. Sensitivity and regulation factors of soil organic carbon content in steppe and desert—Steppe grasslands of the Mongolian Plateau. Plant Soil 2024. [Google Scholar] [CrossRef]
  20. Milazzo, F.; Francksen, R.M.; Abdalla, M.; Ravetto Enri, S.; Zavattaro, L.; Pittarello, M.; Hejduk, S.; Newell-Price, P.; Schils, R.L.M.; Smith, P.; et al. An Overview of Permanent Grassland Grazing Management Practices and the Impacts on Principal Soil Quality Indicators. Agronomy 2023, 13, 1366. [Google Scholar] [CrossRef]
  21. Chen, X.; Hsieh, C.J.; Gong, B. When Vision Transformers Outperform ResNets without Pre-training or Strong Data Augmentations. arXiv 2021, arXiv:2106.01548. [Google Scholar]
  22. Nie, Y.; He, W.; Han, K.; Tang, Y.; Guo, T.; Du, F.; Wang, Y. LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models. arXiv 2023, arXiv:2312.00674. [Google Scholar]
  23. Gao, Y.; Liu, Y.; Zhang, H.; Li, Z.; Zhu, Y.; Lin, H.; Yang, M. Estimating GPU memory consumption of deep learning models. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020), New York, NY, USA, 8–13 November 2020; pp. 1342–1352. [Google Scholar] [CrossRef]
  24. Li, X.; Du, J.; Yang, J.; Li, S. When Mobilenetv2 Meets Transformer: A Balanced Sheep Face Recognition Model. Agriculture 2022, 12, 1126. [Google Scholar] [CrossRef]
  25. Cheng, J.; Song, Q.; Peng, H.; Huang, J.; Wu, H.; Jia, B. Optimization of VGG16 Algorithm Pattern Recognition for Signals of Michelson–Sagnac Interference Vibration Sensing System. Photonics 2022, 9, 535. [Google Scholar] [CrossRef]
Figure 1. The geographical position of the study area.
Figure 2. Experimental area division diagram.
Figure 3. The data annotation classification diagram.
Figure 4. Comparison diagram before and after Gaussian filter processing.
Figure 5. Data processing flow chart. * Extra learnable [class] embedding.
Figure 6. (a) Comparison of training accuracy of 11 models; (b) comparison of training loss of 11 models.
Figure 7. Confusion matrix of the Vision Transformer.
Table 1. Expanded category quantity.

Classification | Raw Image Quantity | Expansion Rate (%) | Expanded Image Quantity
CK 1 | 1088 | 122.2 | 1330
LG 2 | 1328 | 100.2 | 1330
MG 3 | 1272 | 104.6 | 1330
HG 4 | 705 | 188.7 | 1330
1 Non-desertified grassland. 2 Mildly desertified grassland. 3 Moderately desertified grassland. 4 Severely desertified grassland.
Table 2. Dataset partitioning.

Classification | Number of Pictures | Training Set Number | Test Set Number
CK 1 | 1330 | 931 | 399
LG 2 | 1330 | 931 | 399
MG 3 | 1330 | 931 | 399
HG 4 | 1330 | 931 | 399
Total | 5320 | 3724 | 1596
1 Non-desertified grassland. 2 Mildly desertified grassland. 3 Moderately desertified grassland. 4 Severely desertified grassland.
Table 3. Experiment environment.

Experiment Environment | Configuration
Operating system | Windows 10
CUDA version | CUDA 12.1
Graphics card | NVIDIA GeForce RTX 3080 Ti
Video memory | 12 GB
Hard drive capacity | 80 GB
Programming language | Python 3.10
Deep learning framework | PyTorch 2.1.0
Table 4. Model training parameters.

Parameter | Value | Description
Epoch | 200 | Total number of training rounds
Batch_size | 32 | Number of samples selected for one round of training
Imgsize | 512 × 512 | Image resolution input to the network
lr | 0.001 | Initial learning rate
lrf | 0.0001 | Final learning rate
Optimizer | Adam | Optimizer type
Device | 0 | Training is conducted using a GPU
Table 5. The basic structure of the confusion matrix.

 | Predicted Positive | Predicted Negative
Actually positive | TP 1 | FN 3
Actually negative | FP 2 | TN 4
1 TP: True Positive, i.e., the model correctly predicts a sample that is actually positive as positive. 2 FP: False Positive, i.e., the model incorrectly predicts a sample that is actually negative as positive. 3 FN: False Negative, i.e., the model incorrectly predicts a sample that is actually positive as negative. 4 TN: True Negative, i.e., the model correctly predicts a sample that is actually negative as negative.
Table 6. Comparison of model parameter size and inference performance.

Model | Best Epoch 3 | TTT 4 (s) | ETS 5 (MB) | Params 6 | FLOPs 7 | FLOPS 8 | Samples/s 9
VGG16 | 77 | 2105 | 1537.2 | 138 M | 15.5 G | 3940.0 G | 253
ResNet50 | 50 | 1800 | 376.8 | 25.6 M | 4.1 G | 1226.1 G | 296
ResNet101 | 40 | 2456 | 487.0 | 44.5 M | 7.9 G | 1711.1 G | 217
DenseNet121 | 61 | 1910 | 80.2 | 8.0 M | 2.9 G | 797.4 G | 279
DenseNet169 | 59 | 2400 | 143.5 | 14.3 M | 3.5 G | 779.3 G | 225
DenseNet201 | 65 | 2700 | 207.7 | 18.1 M | 4.4 G | 868.9 G | 197
MobileNetV2 | 59 | 1135 | 161.9 | 3.4 M | 0.3 G | 153.6 G | 469
MobileViT | 86 | 1300 | 11.0 | 1.3 M | 0.4 G | 164.4 G | 409
ConvNeXt | 90 | 2300 | 330.6 | 27.8 M | 4.4 G | 1043.2 G | 231
ViT 1 | 97 | 4500 | 982.5 | 85.6 M | 17.6 G | 2090.1 G | 118
Swin-T 2 | 76 | 5003 | 993.3 | 88.0 M | 15.4 G | 1645.9 G | 106
1 ViT: Vision Transformer. 2 Swin-T: Swin Transformer. 3 Best Epoch: the epoch (out of 100) at which the highest test accuracy was reached. 4 TTT: Total Training Time, the time spent training for 100 epochs, measured in seconds (s). 5 ETS: Estimated Total Size, the overall memory footprint of the model, measured in megabytes (MB); it includes four parts: the size of the model input, the amount of data processed in a forward pass, the amount of data processed in a backward pass, and the total size of all weights and biases in the model. 6 Params: the number of parameters used by the model to process a single image, measured in units of M (mega, 10⁶) or G (giga, 10⁹). 7 FLOPs: floating-point operations, the number of floating-point calculations required to process a single image, measured in units of M (10⁶) or G (10⁹). 8 FLOPS: floating-point operations per second, the number of floating-point calculations performed per second, measured in units of M (10⁶) or G (10⁹). 9 Samples/s: the number of samples (images) processed by the model per second.
Table 7. Comparison of model performance index.

Model | Train_loss | Train_acc | Test_acc | Test_loss | Precision | Recall | F1_score
VGG16 | 0.1663 | 93.90% | 87.34% | 0.4284 | 87.34% | 87.34% | 87.28%
ResNet50 | 0.1487 | 94.41% | 86.97% | 0.4912 | 86.91% | 86.90% | 86.90%
ResNet101 | 0.1910 | 93.02% | 85.96% | 0.4705 | 85.92% | 85.96% | 85.87%
DenseNet121 | 0.1049 | 96.08% | 87.16% | 0.4903 | 87.24% | 87.16% | 87.16%
DenseNet169 | 0.1189 | 95.62% | 87.09% | 0.4834 | 87.06% | 87.09% | 87.04%
DenseNet201 | 0.1285 | 95.14% | 87.18% | 0.5016 | 87.38% | 87.28% | 87.28%
MobileNetV2 | 0.1780 | 93.18% | 87.91% | 0.4647 | 87.87% | 87.91% | 87.85%
MobileViT | 0.3323 | 87.27% | 86.84% | 0.4070 | 86.86% | 86.84% | 86.85%
ConvNeXt | 0.2010 | 92.13% | 87.66% | 0.3666 | 87.64% | 87.66% | 87.59%
ViT 1 | 0.0319 | 98.79% | 88.72% | 0.6754 | 88.69% | 88.72% | 88.70%
Swin-T 2 | 0.1195 | 95.57% | 88.22% | 0.4180 | 88.13% | 88.22% | 88.16%
1 ViT: Vision Transformer. 2 Swin-T: Swin Transformer.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
