1. Introduction
In recent years, convolutional neural networks (CNNs), as a core method of deep learning, have achieved remarkable results in image classification, object detection, and other fields. CNNs extract multi-scale image features through stacked convolution and pooling operations and perform well in tasks such as plant classification [1,2,3]. In plant classification, researchers have widely explored strategies such as multi-scale feature extraction, transfer learning, and ensemble classifiers, which have significantly improved model performance. For example, Hu et al. [4] proposed the multi-scale fusion convolutional neural network (MSF-CNN), which achieved excellent performance on the MalayaKew Leaf [5] and LeafSnap [6] datasets, verifying the effectiveness of multi-scale features. Pereira et al. [7] used AlexNet transfer learning and image distortion techniques to improve recognition accuracy on the Flavia leaf dataset. Tan et al. [8] proposed the D-Leaf model, which combines classifiers such as support vector machines (SVMs), artificial neural networks (ANNs), and k-nearest neighbors (k-NNs) to achieve more accurate plant species identification. Ukwoma et al. [9] combined VGG16, ResNet50, and YOLOv7 to achieve an accuracy of over 98% in classifying fruit images, further demonstrating the potential of deep learning in plant image analysis. These studies have laid a solid foundation for applying CNNs to plant image recognition.
On this basis, researchers have conducted in-depth explorations of the structural optimization and performance improvement of CNNs in recent years, with notable progress in 2024 and 2025. In 2024, Hasan et al. [10] proposed a CNN-based image classification method that achieved an accuracy of 98% on the MNIST dataset, further demonstrating the ability of CNNs to improve image classification performance. In 2025, Sun et al. [11] proposed the scalable quantum convolutional neural network (SQCNN), which uses quantum circuits to extract features in parallel and achieved a classification accuracy of 99.79% on the MNIST and Fashion-MNIST datasets, significantly outperforming existing quantum neural network models and showing strong generalization. Meanwhile, the CA-MFE model proposed by Liu et al. [12] combines a deformable CNN with an attention mechanism to construct a multi-scale graph neural network (GNN), which effectively extracts multi-scale local and global features, significantly improves image classification performance, and performs well on both the mini-ImageNet and tiered-ImageNet datasets. In addition, the CBAM-SqueezeNet model proposed by Zhao et al. [13] optimizes the feature extraction module of the CNN by introducing channel and spatial attention mechanisms, achieving more accurate and efficient grasp detection for robotic arms. The model achieved grasp detection accuracies of 94.8% and 96.4% on the Cornell Grasping Dataset and the Jacquard Dataset, respectively, a success rate of 93% in physical robotic grasping experiments, and an inference time of 15 ms, balancing model accuracy and speed. These results show that CNN models combining quantum characteristics, attention mechanisms, and multi-scale feature extraction have made significant breakthroughs in the past two years, further promoting the development of image classification and related applications.
Despite the many advances made by CNNs in the field of image analysis, they still face significant challenges in the task of succulent plant classification. Succulents come in a wide variety of species, with highly similar morphological characteristics, and are significantly affected by their growth environment [14,15]. More than 12,000 species of succulents are known worldwide, belonging to about 80 families [16], but only about 100 species are sold as ornamental potted plants in the Chinese market. Because these species are difficult to identify with the naked eye, traditional manual classification methods are inefficient and rely on expert knowledge, making it difficult to meet large-scale automation needs. Therefore, the core challenge in the task of classifying succulents is how to improve the generalization ability of the model with a limited dataset.
To solve the above problems, researchers have proposed various effective strategies in recent years, including data augmentation, dropout methods, and label regularization. Data augmentation helps models learn more robust features by expanding sample diversity and is a common means of alleviating overfitting; methods such as CutMix [17] and RandAugment [18] significantly improve the adaptability of models in complex image tasks through image cropping, splicing, and transformation. Dropout-like methods such as MaxDropout [19] and DropBlock [20] reduce the model’s dependence on specific weights and alleviate the overfitting of deep networks. Label regularization methods such as Label Smoothing [21] and JoCoR [22] have also achieved remarkable results in dealing with noisy labels and improving model stability. These studies provide important technical references for the image classification of succulents.
However, in the task of classifying succulents, the scarcity of data, small inter-class differences, and large intra-class differences significantly exacerbate overfitting. To alleviate this problem, researchers have proposed various effective strategies, among which data augmentation has achieved remarkable results as an important means of improving the generalization ability of models. Feng et al. [23] introduced advanced augmentation methods such as Cutout and Mixup, combined with a variety of image transformation strategies, improving model accuracy from 67.35% to 78.68% and significantly improving robustness in complex scenarios. Similarly, Wen et al. [24] used data augmentation methods such as random flipping, rotation, and color jittering, combined with transfer learning strategies, to improve the model’s test set accuracy from 59.48% to 96.90%, effectively alleviating overfitting. In addition, researchers have in recent years gradually introduced attention mechanisms to improve the model’s ability to focus on key features, achieving excellent performance in fine-grained classification tasks [25,26]. While emphasizing salient regions of an image, the attention mechanism can effectively suppress redundant features, further improving classification accuracy on complex images. To further address data scarcity, researchers have begun to explore few-shot learning methods that enable effective training on small datasets. Few-shot learning models combined with an attention mechanism aggregate multi-level fine-grained features through a multi-branch structure, achieving more accurate feature representation and classification performance, especially in succulent plant classification with limited data resources [27,28]. These studies show that combining data augmentation and attention mechanisms provides an effective way to address overfitting in succulent plant classification and improve generalization, opening up new research directions for applying deep learning to the analysis of complex images.
In summary, this paper proposes a succulent plant classification framework that combines deep learning, a lightweight design, and an attention mechanism to address the shortcomings of current classification methods. The research object is an image dataset of succulent plants covering species from various families. The core of the research is a model based on a multi-branch structure, which aims to more accurately capture the key details of succulent plants, enhance category distinction, and improve the generalization ability of the model. The innovations of this model mainly include the following two aspects:
(1) Spatial-channel attention mechanism (CBAM): The Convolutional Block Attention Module (CBAM) enhances feature expression through a combination of channel and spatial attention. The channel attention module applies global average pooling and global max pooling to generate weights for the different channels and multiplies them with the original features to enhance the expression of key channels. Spatial attention combines the channel-wise average pooling and max pooling results of the input features, generates a spatial attention map through a 7 × 7 convolution, and multiplies it with the original features so that the network focuses on important spatial locations. The optimization and integration of the CBAM module improves the model’s feature extraction ability and further alleviates the classification difficulties caused by the similar morphological characteristics of succulents (a minimal implementation sketch is given after these two points).
(2) Lightweight Inception module: The lightweight Inception module replaces the original 5 × 5 convolution with two stacked 3 × 3 convolutions, a design that significantly reduces computational complexity and parameter count while maintaining the same receptive field. In addition, the number of convolution channels in the module is reduced to further cut the amount of computation. Compared with the original Inception module, the lightweight module also integrates CBAM, applying attention to the output feature map after concatenation to make the feature representation more precise. These modifications give the lightweight Inception module a lower computational cost while retaining good feature extraction capability, making it suitable for application scenarios with limited computing resources (a corresponding sketch also follows below).
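As a reference for point (1), the following is a minimal PyTorch sketch of a CBAM module matching the description above; the reduction ratio of 16 is a common default rather than a value reported in this paper.

```python
# Minimal CBAM sketch, assuming a reduction ratio of 16 (a common default).
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both pooled channel descriptors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))            # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))             # global max pooling
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)  # per-channel weights
        return x * w                                  # reweight key channels

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)             # channel-wise average
        mx = x.amax(dim=1, keepdim=True)              # channel-wise max
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn                               # reweight spatial locations

class CBAM(nn.Module):
    def __init__(self, channels, reduction=16, kernel_size=7):
        super().__init__()
        self.ca = ChannelAttention(channels, reduction)
        self.sa = SpatialAttention(kernel_size)

    def forward(self, x):
        return self.sa(self.ca(x))                    # channel first, then spatial
```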
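For point (2), a hedged sketch of the lightweight Inception branch layout is shown below: two stacked 3 × 3 convolutions replace the 5 × 5 convolution, and CBAM (as defined in the previous sketch) is applied to the concatenated output. The channel arguments are illustrative only, not the exact configuration used in this paper.

```python
# Lightweight Inception sketch; reuses the CBAM class from the previous sketch.
import torch
import torch.nn as nn

def conv_bn(in_c, out_c, k, p=0):
    # Convolution followed by batch normalization and ReLU
    return nn.Sequential(
        nn.Conv2d(in_c, out_c, k, padding=p, bias=False),
        nn.BatchNorm2d(out_c),
        nn.ReLU(inplace=True),
    )

class LightInception(nn.Module):
    def __init__(self, in_c, c1, c3r, c3, c5r, c5, pool_c):
        super().__init__()
        self.b1 = conv_bn(in_c, c1, 1)                     # 1x1 branch
        self.b2 = nn.Sequential(conv_bn(in_c, c3r, 1),
                                conv_bn(c3r, c3, 3, p=1))  # 1x1 -> 3x3 branch
        self.b3 = nn.Sequential(conv_bn(in_c, c5r, 1),
                                conv_bn(c5r, c5, 3, p=1),
                                conv_bn(c5, c5, 3, p=1))   # two 3x3s replace the 5x5
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                conv_bn(in_c, pool_c, 1))  # pooling branch
        self.cbam = CBAM(c1 + c3 + c5 + pool_c)            # attention on the concat

    def forward(self, x):
        out = torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)
        return self.cbam(out)                              # attend after concatenation
```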
In this study, GoogLeNetCBAM is chosen as the model architecture, mainly because of its advantages in feature extraction, parameter optimization, and overfitting prevention. GoogLeNet adopts the Inception module, whose strong multi-scale feature extraction ability allows rich image features to be extracted under different receptive fields; given the large number of succulent species and the high similarity between some categories, this helps the model capture more discriminative features and improve classification accuracy. Meanwhile, the CBAM attention mechanism introduced into the model combines channel and spatial attention, effectively guiding the model to focus on key feature regions and suppress interference from irrelevant information, thus further improving generalization. To further alleviate overfitting in small-sample scenarios, we introduce several optimization strategies during training: data augmentation to enrich sample diversity and enhance the model’s adaptability to image transformations; dropout (p = 0.3) in the fully connected layer to reduce neuron co-adaptation and improve robustness; weight decay in the Adam optimizer to limit parameter magnitudes and prevent overfitting caused by excessively large weights; and an early stopping strategy (patience = 15) that halts training when the validation accuracy fails to improve for 15 consecutive epochs, preventing overfitting due to overtraining. In summary, the advantages of the GoogLeNetCBAM architecture in feature extraction and attention, combined with these optimization strategies, enable the model to improve classification accuracy while significantly enhancing stability and generalization, ensuring excellent performance in small-sample scenarios.
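The following sketch illustrates how these regularization strategies fit together in a PyTorch training loop. Dropout p = 0.3 (inside the model) and patience = 15 follow the text; the learning rate, weight decay value, and epoch limit are placeholders (the text does not state them), and train_one_epoch and evaluate are assumed helpers.

```python
# Training-loop sketch with weight decay and early stopping.
import torch

# model: the GoogLeNetCBAM network described above, with nn.Dropout(p=0.3)
# before its final fully connected layer (construction omitted here).
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-3,           # placeholder learning rate
                             weight_decay=1e-4)  # value not stated in the paper; placeholder

best_acc, patience, wait = 0.0, 15, 0
for epoch in range(200):                         # epoch limit: placeholder
    train_one_epoch(model, optimizer)            # assumed helper
    acc = evaluate(model)                        # validation accuracy (assumed helper)
    if acc > best_acc:
        best_acc, wait = acc, 0
        torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
    else:
        wait += 1
        if wait >= patience:                     # early stopping (patience = 15)
            break
```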
3. Results
3.1. Contrastive Learning Succulent Image Classification Model
To further validate the advantages of the lightweight GoogLeNet model, this paper trains the AlexNet, VGG16, GoogLeNet, and lightweight GoogLeNet network models on the succulent dataset. The validation accuracy curves and loss curves of these models during training are shown in Figure 6.
From Figure 6, we can see clear differences among the four models in training loss and validation accuracy. AlexNet’s training loss declines quickly and its validation accuracy stabilizes at about 90%, but the model’s low complexity leaves limited room for further accuracy gains. VGG16’s training loss declines smoothly and its validation accuracy improves gradually, but it fluctuates considerably in the later stages, indicating a tendency to overfit and weaker generalization. GoogLeNet’s validation accuracy improves faster in the early stage but also fluctuates considerably later, showing overfitting and unstable performance on the validation set. In contrast, the lightweight GoogLeNet performs well: its training loss decreases rapidly, its validation accuracy increases steadily and finally exceeds 90%, and the fluctuation in validation accuracy is small, indicating good generalization. Introducing the lightweight module reduces the complexity of the model, solves the overfitting problem of the original GoogLeNet, and maintains high accuracy under limited resources. Overall, the lightweight GoogLeNet achieves the best overall performance among the four models, solving the overfitting problem and performing excellently on the validation set.
3.2. Comparative Analysis of Data Augmentation
To assess the robustness of the improved model with respect to dataset size, we first extended the original image dataset with several data augmentation strategies: random horizontal flipping to simulate changes in left-right viewpoint; random rotation to generate views at different angles; color jittering, i.e., random perturbation of brightness, contrast, saturation, and hue to simulate complex lighting conditions; and random cropping and scaling to vary the field of view and strengthen the learning of local features. Through these operations, five diverse augmented images were generated for each original image. The augmented data significantly increased diversity in spatial transformations, color distributions, and image content, successfully simulating complex real-world application scenarios. These strategies not only improved the generalization ability of the deep learning model but also enhanced its robustness to image transformations. In addition, by expanding data diversity under limited-sample conditions, augmentation mitigated the risk of overfitting and provided the model with more comprehensive and varied training samples, enabling more stable and accurate performance under complex conditions. In the end, the dataset was expanded from 691 to 3455 succulent images, providing rich data support for the subsequent experiments.
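A plausible torchvision implementation of this augmentation pipeline is sketched below; the specific parameter ranges (rotation angle, jitter strengths, crop scale) are assumptions, since they are not listed in the text.

```python
# Augmentation pipeline sketch; parameter ranges are illustrative assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),               # left/right viewpoint changes
    transforms.RandomRotation(degrees=30),                # views at different angles
    transforms.ColorJitter(brightness=0.4, contrast=0.4,
                           saturation=0.4, hue=0.1),      # simulate lighting variation
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # vary the field of view
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),           # match the paper's normalization
])
```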
We then conducted experiments on the original dataset of 691 images and on the augmented dataset of 3455 images (as shown in Figure 7). On the original dataset, the training set contained 626 images and the test set 65; the model’s final validation accuracy reached 98.5% with a training loss as low as 0.001, demonstrating the efficient learning and excellent performance of the improved model under small-dataset conditions. On the augmented dataset, the training set contained 3180 images and the test set 342; the validation accuracy improved slightly to 99.4% while the training loss remained at 0.001 (as shown in Table 3). This comparison verifies the efficient learning and stability of the improved model in a small-dataset environment; the gain from data augmentation, although real, was small, suggesting that accuracy was already close to saturation. The robustness, effectiveness, and strong adaptability of the model under small-dataset conditions are thus fully demonstrated, which is especially valuable in application scenarios where data are scarce.
Comparing the experimental results on datasets of different sizes shows that the improved model adapts effectively to both small and large datasets, with validation accuracy consistently maintained at a high level. Although data augmentation further improves the generalization ability of the model by extending data diversity, the results on the small dataset show that the model can still achieve high performance stably under data-constrained conditions. This demonstrates that the model is insensitive to dataset size and robust under a limited sample size. In addition, the loss values on both the small and the extended dataset remain consistently as low as 0.001, further supporting that the optimization process does not depend on dataset size. This indicates that the improved model is highly efficient in feature extraction and decision learning, and can fully exploit the available information to achieve high-precision prediction even with few training samples.
In summary, the experimental results demonstrate that the improved model has a low dependence on dataset size and can achieve high validation accuracy and stable training loss using both small and large-scale datasets. This provides important support for the application of the model in data-constrained scenarios, and further illustrates the robustness and reliability of the model under dataset size variation.
3.3. Model Prediction Analysis
The purpose of this experiment was to evaluate the performance of the improved GoogLeNetCBAM model on the succulent plant image classification task and to explore the effect of mixed-precision training on model performance. The dataset consists of 691 images of succulent plants covering 10 categories, divided into a training set and a test set at a ratio of 9:1, with 626 training images and 65 test images. The experiment evaluated the classification accuracy and confidence of the model by visualizing and analyzing the test set.
In the experimental setup, the hardware environment was first checked for CUDA support to ensure that the model could run efficiently on the GPU. Images were preprocessed with a uniform pipeline: resizing to 224 × 224 pixels and normalization (mean 0.5 and standard deviation 0.5 for each of the three RGB channels) to ensure consistent input format. For the model, we loaded the pre-trained GoogLeNetCBAM, which is based on the classic GoogLeNet architecture with the addition of the Convolutional Block Attention Module (CBAM); the module improves feature representation by strengthening attention to important feature regions. The auxiliary classifiers were turned off, and the experiments focused on the predictions of the main branch loaded with pre-trained weights. The model output was passed through Softmax to generate a probability distribution over the categories, and the category with the highest probability was taken as the prediction.
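The following is a hedged sketch of this inference pipeline. GoogLeNetCBAM stands in for the paper’s model class; its constructor signature, the checkpoint path, and the sample image path are assumptions made for illustration.

```python
# Inference pipeline sketch; model class signature and file paths are assumed.
import torch
import torch.nn.functional as F
from torchvision import transforms
from PIL import Image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # CUDA check

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),               # resize to the network input size
    transforms.ToTensor(),
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # 0.5 mean/std per RGB channel
])

model = GoogLeNetCBAM(num_classes=10, aux_logits=False)        # assumed signature
model.load_state_dict(torch.load("googlenet_cbam.pt",          # assumed checkpoint path
                                 map_location=device))
model.to(device).eval()

img = preprocess(Image.open("sample.jpg").convert("RGB")).unsqueeze(0).to(device)
with torch.no_grad():
    probs = F.softmax(model(img), dim=1)         # per-class probability distribution
pred = probs.argmax(dim=1).item()                # category with the highest probability
```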
The experiment visualized the prediction results of each model and printed the per-category probability distributions at the terminal to analyze classification performance (as shown in Table 4). The values in Table 4 are the Softmax output probabilities of each model when classifying succulents, measuring each model’s confidence in each category. During the experiment, a random succulent image was selected from the validation set and fed into the AlexNet, VGG16, GoogLeNet, and lightweight GoogLeNet models for prediction. Each model outputs a vector of 10 values, each corresponding to the predicted probability of one category. A value of 1 indicates that the model identified the correct category with the highest confidence; non-zero values for incorrect categories indicate that the model was confused by similar categories during prediction. In this way, the recognition accuracy of each model in the succulent plant classification task and its ability to distinguish similar categories can be analyzed. The results further show that the lightweight GoogLeNet model exhibits stronger discriminative ability in many cases, with higher probability for the correct category and near-zero probabilities for the others, verifying its advantages in accuracy and resistance to confusion. This experimental design effectively evaluates the performance of the lightweight GoogLeNet model in multi-class image classification and provides a reference for further optimization and improvement.
By comparing the performance of AlexNet, VGG16, GoogLeNet, and lightweight GoogLeNet on the succulent classification task, this experiment highlights the advantages of lightweight GoogLeNet, especially in classification accuracy and the exclusion of incorrect categories. Lightweight GoogLeNet demonstrated excellent classification performance while maintaining a low computational overhead (as shown in Figure 8).
First, in the Graptoveria ‘Opalina’ category, the prediction probability of every model was 1.0, indicating that all models successfully recognized images in this category. More informative is the performance on the other categories. In most categories, the lightweight GoogLeNet maintained prediction accuracy comparable to the original GoogLeNet, and performed particularly well at excluding incorrect categories. For example, in both the Crassula obliqua ‘Gollum’ and Sedum burrito categories, the lightweight GoogLeNet assigned a prediction probability of 0.0, accurately excluding these categories and showing excellent rejection of non-target classes. This result is close to that of GoogLeNet, but the lightweight GoogLeNet requires significantly fewer computational resources, demonstrating its efficiency advantage. In contrast, AlexNet and VGG16 did not perform as well as lightweight GoogLeNet on certain categories. For example, in the Haworthia truncata category, the lightweight GoogLeNet produced a prediction close to 0.0, accurately excluding the category, whereas VGG16’s prediction probability of 8.83 × 10⁻⁵ shows a degree of residual classification uncertainty. This indicates that the lightweight GoogLeNet reduces the possibility of misclassification while maintaining high accuracy.
Overall, the lightweight GoogLeNet not only maintains prediction performance similar to the original GoogLeNet in several categories, especially Senecio haworthii and Echeveria agavoides ‘Ebony’, but also excludes incorrect categories far more reliably than AlexNet and VGG16. Meanwhile, it significantly reduces computational costs, making the model more practical in real applications. This comparison shows that the lightweight GoogLeNet can optimize computational resource usage while maintaining model accuracy, making it suitable for deployment in resource-constrained environments.
Therefore, lightweight GoogLeNet not only retains the high-precision performance of GoogLeNet but also highlights its potential value in real-world applications through higher efficiency and accurate category exclusion. This advantage provides strong support for deployment in large-scale image classification tasks.
4. Discussion
Although the lightweight, CBAM-based GoogLeNet classification and recognition method proposed in this paper has achieved significant classification results, there is still room for further optimization and exploration. The following summarizes possible directions for future improvement.
4.1. Parameter Tuning
In this study, hyperparameter selection relied mainly on experience and limited manual tuning, owing to constraints on training resources and time. Although good accuracy has been achieved in the experiments, there is still potential for further improvement in model performance. Future work can focus on systematic, automated tuning of hyperparameters, for example using Bayesian optimization or evolutionary algorithms to explore the parameter space efficiently and thus improve the model’s generalization ability on different datasets. Meanwhile, targeted tuning on task-specific datasets can help improve the model’s performance in fine-grained classification tasks.
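As an illustration of such automated tuning, the sketch below uses Optuna, a common hyperparameter optimization library whose default sampler is Bayesian-style (TPE); the search ranges and the train_and_validate helper are assumptions.

```python
# Hyperparameter search sketch with Optuna; ranges and helper are assumed.
import optuna

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)                 # learning rate
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-3, log=True)
    dropout = trial.suggest_float("dropout", 0.1, 0.5)                   # dropout rate
    # train_and_validate: assumed helper that trains the model with these
    # hyperparameters and returns the validation accuracy.
    return train_and_validate(lr, weight_decay, dropout)

study = optuna.create_study(direction="maximize")  # maximize validation accuracy
study.optimize(objective, n_trials=50)
print(study.best_params)
```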
4.2. Data Augmentation
Data augmentation techniques have been widely used to enhance the performance of deep learning models, especially in fine-grained classification tasks where they play an important role. Data augmentation techniques were used in this study, but in the future, we can focus on exploring unsupervised data augmentation methods to enhance the generalization ability of the models. For example, Generative Adversarial Networks (GANs) can be employed to generate training samples with higher diversity, or data transformation strategies in self-supervised learning can be used to enhance data richness. In addition, reasonable data augmentation can help to solve data imbalance problems and further enhance the robustness of the model and its ability to adapt to complex scenarios.
4.3. Improvements in Attention Mechanisms
Currently, attention mechanisms are widely used in computer vision to highlight the model’s attention to key features. In this paper, the CBAM attention module is used, but other more efficient attention mechanisms, such as Self-Attention or Vision Transformer (ViT), can be explored in the future to further enhance the quality of feature representation. Especially on resource-limited devices, finding an efficient attention module with low computational overhead is an important direction for future research. In addition, combining multiple attention mechanisms to achieve collaborative modeling of different feature layers may also improve the overall performance of the model.
4.4. Model Lightweighting
Some degree of model lightweighting was achieved in this study by reducing the number of Inception modules (from nine to seven), introducing depthwise separable convolutions, and adding batch normalization (BN) after each convolutional layer. However, how to maintain model performance while compressing the model remains an important research question. In the future, knowledge distillation could be introduced to efficiently transfer knowledge from large models to lightweight ones, improving classification accuracy while significantly reducing the number of parameters. In addition, techniques such as pruning and quantization could be explored to further reduce computational resource demands and make the model suitable for real-time applications on low-power devices.
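For reference, the following is a standard depthwise separable convolution block with BN after each convolution, as mentioned above; it illustrates the general technique rather than the paper’s exact layer configuration.

```python
# Standard depthwise separable convolution block (illustrative, not the
# paper's exact configuration).
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_c, out_c, kernel_size=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_c, in_c, kernel_size,
                                   padding=kernel_size // 2,
                                   groups=in_c, bias=False)     # one filter per channel
        self.bn1 = nn.BatchNorm2d(in_c)                         # BN after each conv
        self.pointwise = nn.Conv2d(in_c, out_c, 1, bias=False)  # 1x1 channel mixing
        self.bn2 = nn.BatchNorm2d(out_c)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))
```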
In summary, this paper has enhanced GoogLeNet with an attention mechanism and a lightweight design for classification and recognition tasks, achieving promising results. Future research can further improve the accuracy and adaptability of the model through more refined parameter tuning, well-designed data augmentation strategies, innovative attention mechanism designs, and more efficient model lightweighting methods. We hope this study provides a useful reference for such future research.