1. Introduction
Winter wheat is one of the most important grain crops in China, with an annual output of over 100 million tons, making China the world's largest wheat producer. About 95% of China's wheat is winter wheat sown in autumn; the rest is spring wheat sown in spring. Its yield and quality directly affect China's food security [1]. The greening period is one of the most important stages of winter wheat growth and a crucial indicator of wheat seedling quality in agricultural production. It begins when temperatures rise in spring and the plants resume growth: the greening period occurs when 50% of the plants grow new leaves, the leaf sheaths extend 1–2 cm, and the leaf color changes from dark green to bright green, usually from late February to early March [2]. Different types of winter wheat seedlings correspond to different predicted yields and require different field management techniques. Therefore, accurately classifying and identifying winter wheat seedling status is important for scientific agricultural production management [3,4].
Greening-period seedlings are traditionally assessed through manual observation and empirical judgment, which suffers from strong subjectivity, a large workload, and significant time consumption; this approach cannot meet the requirements of efficiency, accuracy, and real-time processing. Beyond manual detection, image-processing technology has been applied in research, with machine learning researchers manually designing feature extraction algorithms [5,6]. Peng Fang et al. [7] used 10-m-resolution images from the Sentinel-2 satellite combined with machine learning algorithms for large-scale winter wheat area recognition and mapping. Their study compared the classification performance and mechanisms of three machine learning algorithms: support vector machine (SVM), random forest (RF), and classification and regression tree (CART). The results showed that SVM performed best in winter wheat recognition, with an overall accuracy of 0.94. Yuan Fang et al. [8] used LiDAR to estimate the tiller number of wheat in the field: they first applied the adaptive hierarchical algorithm (AL) for clustering segmentation and then the hierarchical clustering algorithm (HC) for tiller detection between clusters, obtaining coefficients of determination (R²) of 0.61, 0.56, and 0.65. Lukas Roth et al. [5] used unmanned aerial vehicles (UAVs) flying repeatedly over a breeding nursery with early single-row wheat plots to obtain multi-view images quantifying changes in wheat leaf area. They used support vector machines to predict stem elongation, and watershed algorithms and growth models to predict the emergence rate and tiller number; the emergence rate R² was 0.52, the tiller number R² was 0.86, and the stem elongation R² was 0.77. These studies combined image-processing technology with manually designed feature extraction algorithms, but their effectiveness depends heavily on the researcher's experience, and they transfer poorly and achieve limited accuracy. Moreover, some require specialized spectral instruments or other dedicated equipment in the field.
With the rapid development of artificial intelligence, applying deep learning to image recognition has provided new ideas and methods for identifying and grading winter wheat seedlings during the greening period. Deep learning has a powerful advantage in automatic feature learning and has achieved significant results in fields such as image recognition, object detection, and image segmentation [9,10,11,12,13].
With the vigorous development of deep learning research, models are moving toward deeper architectures and larger parameter counts, and the requirements for datasets and training costs are increasing; this trend is most prominent in large models. In the agricultural field, high-quality public datasets are relatively scarce, but transfer learning has greatly improved this situation. The concept of transfer learning originates in machine learning: in a broad sense, it is a machine learning method that applies existing knowledge to solve different but related domain problems. Transfer learning is widely used in deep learning research owing to its powerful learning ability and limited training time [14,15,16]. The pre-training–fine-tuning paradigm of transfer learning is widely used by researchers [17]. ShengJie Teng et al. [18] used ConvNeXt to detect the substitution rate of steel sand by adding SE (Squeeze-and-Excitation) attention to the ConvNeXt Block. After the weights were transferred from one mixed sand group to three steel sand groups, the accuracy reached 92.64%, 4.65% higher than without transfer learning. Xiaoqi Wang et al. [19] used ConvNeXt to classify and recognize seven rice diseases, adding ECA (Efficient Channel Attention) between the convolution layers and the global average pooling layer and freezing the convolution layers so that only the added layers were trained. To avoid overfitting, the model parameters were divided into weights and biases, and weight decay was applied only to the weights; the accuracy was 94.82%. Tongkai Li et al. [20] used transfer learning to grade the quality of Oudemansiella raphanipes: they froze the pre-trained model, trained the head model and FC layer, and used various optimization methods to improve performance and avoid overfitting. The accuracy reached 98.75%, and the detection speed was 22.5 ms per image. Although these studies effectively improved model performance, they did not use the pre-training–fine-tuning method to systematically address overfitting; as a result, their models performed well only when the dataset quality was good and the categories were relatively balanced.
ConvNeXtV2 is an optimized model based on ConvNeXt, created by co-designing a fully convolutional masked autoencoder (FCMAE) and using global response normalization (GRN). Compared with ConvNeXt, the model is more complicated. ConvNeXt's design is relatively simple and clear, which makes it easier to understand, debug, and optimize; it provides a strong foundation for researchers and developers without requiring investment in complex self-supervised training strategies. Although ConvNeXtV2 improves performance by introducing new techniques, ConvNeXt retains advantages in computing efficiency, architectural simplicity, and mature stability.
There is little research applying deep learning to winter wheat seedling status during the greening period. Although existing studies have improved crop recognition to a certain extent, they often perform poorly when dataset samples are insufficient and categories are imbalanced, and improving recognition accuracy typically demands considerable computing resources. This study aims to reduce labor costs, improve detection accuracy and efficiency, save model training costs, and enable complete identification and grading of winter wheat seedlings during the greening period.
Thus, to solve the problem of uneven recognition accuracy caused by imbalanced wheat seedling sample classes, we improved the ConvNeXt network by changing the cross-entropy loss to focal loss and adding an improved SE attention module (SET) to reduce model overfitting and help the model better utilize image feature information. To address low recognition accuracy while saving computational resources, we compared the impact of data augmentation and transfer learning optimization methods on model performance. When using transfer learning to fine-tune a model, overfitting must be controlled. Using pre-training–fine-tuning techniques in transfer learning, we validated the impact of five fine-tuning methods on the network model, as well as the effect of different attention-embedding methods on model performance.
2. Materials and Methods
2.1. Datasets
Image data for winter wheat seedlings during the greening period were acquired in March 2023 and March 2024. Over 2000 winter wheat plants were surveyed and sampled in Shengli Village and Xizhai Village, Pingdu City, Shandong Province. Considering the model's practicality under real-world conditions, no specialized imaging system was set up; only a smartphone was used for shooting. Because both overly bright and overly dark images can affect the model, images of winter wheat during the greening period were captured at different orientations and distances (20 cm–50 cm) under ordinary daylight conditions. The final dataset contained 2831 images: 689 Strong-class, 605 First-class, 1048 Second-class, and 489 Third-class images, all annotated. Unbalanced datasets negatively impact models, which focus too much on the feature information of majority classes and ignore that of classes with fewer samples. Some scholars have addressed this problem by oversampling or undersampling their datasets [21], but these optimization strategies are relatively complex and ultimately perform unstably. Therefore, this study addresses the issue by replacing the cross-entropy loss with the focal loss function.
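As an illustration, the following is a minimal PyTorch sketch of a standard multi-class focal loss that can serve as a drop-in replacement for cross-entropy; the focusing parameter gamma and any per-class weighting alpha are assumptions, since this section does not specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    """Multi-class focal loss: down-weights easy, well-classified samples so
    that minority and hard samples contribute more to the gradient."""
    def __init__(self, gamma=2.0, alpha=None):
        super().__init__()
        self.gamma = gamma   # focusing parameter; gamma = 0 recovers cross-entropy
        self.alpha = alpha   # optional per-class weight tensor of shape (num_classes,)

    def forward(self, logits, targets):
        # Per-sample cross-entropy, i.e. -log p_t for the true class.
        ce = F.cross_entropy(logits, targets, weight=self.alpha, reduction="none")
        pt = torch.exp(-ce)                         # probability of the true class
        return ((1.0 - pt) ** self.gamma * ce).mean()

# Drop-in replacement for nn.CrossEntropyLoss():
# criterion = FocalLoss(gamma=2.0)
```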
Detecting winter wheat seedlings focuses on several growth parameters, including the number of tillers per plant, the total number of tillers, the number of secondary roots, and the leaf age of the main stem. We referred to the local Shandong Province standard (DB37/T 4366-2021) to define the greening period for winter wheat: the date when the average spring temperature stabilizes above 3 °C, the wheat seedlings resume growth, the leaves turn bright green, and the new leaves elongate by 1.5 cm–2 cm. We classify winter wheat seedlings into four types: Strong class, First class, Second class, and Third class.
While we drew on the local standard, we also made adjustments based on actual conditions, mainly focusing on the individual growth parameters of winter wheat. The five-point sampling method was used to select field plots with uniform growth, with 10 samples taken from each point to represent the growth of winter wheat seedlings on one mu of land. The specific classification criteria are as follows: the Strong class has more than 7 tillers per plant, slender leaves, many albeit weak tillers, and more than 7 secondary roots per plant; the First class has 5–6 tillers per plant, green leaves, many large tillers, and 5–7 secondary roots per plant; the Second class has 3–4 tillers per plant, a green appearance, and 3–4 secondary roots per plant; and the Third class has 1–2 tillers per plant, light green leaves, small and weak tillers, and fewer than 3 secondary roots per plant.
We preprocessed the collected images, resized them to 224 × 224, and performed data augmentation using 9 image transformation methods, including horizontal flipping, left and right rotation, narrowing, enlargement, random color changes, random occlusion, and Gaussian blur, as shown in Figure 1.
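For illustration, a minimal torchvision sketch of such an augmentation pipeline follows; the specific parameters (rotation angle, scale range, jitter strengths, blur kernel) are assumptions, since the text does not list them.

```python
from torchvision import transforms

# A plausible pipeline covering the transformations named above.
train_transforms = transforms.Compose([
    transforms.Resize((224, 224)),                           # match the 224 x 224 input size
    transforms.RandomHorizontalFlip(p=0.5),                  # horizontal flipping
    transforms.RandomRotation(degrees=15),                   # left and right rotation
    transforms.RandomAffine(degrees=0, scale=(0.8, 1.2)),    # narrowing and enlargement
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2),                  # random color changes
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),  # Gaussian blur
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.25),                        # random occlusion (tensor op)
])
```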
2.2. Transfer Learning and Pre-Training–Fine-Tuning
Pre-training–fine-tuning is an important technique in transfer learning. Pre-training refers to obtaining a pre-trained model by learning the basic features of images from a large dataset. Fine-tuning makes the pre-trained model more suitable for the new task: because pre-trained models are usually not fully applicable to a new related task, further learning and adjusting the parameters of the pre-trained model on the new task's dataset yields a target model suited to that task. This process is called pre-training–fine-tuning [22,23]. The types of fine-tuning include the following:
Fine-tuning all layers: All parameters of the entire pre-trained model are retrained.
Fine-tuning some layers: The parameters of the lower layers of the model are frozen and remain unchanged, and only the parameters of the last FC layer or the last few layers of the model are retrained, as shown in Figure 2.
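For illustration, a minimal PyTorch sketch of these two modes follows, using torchvision's ConvNeXt implementation as a stand-in for the pre-trained model.

```python
from torchvision.models import convnext_tiny, ConvNeXt_Tiny_Weights

# Load ImageNet-1K pre-trained weights.
model = convnext_tiny(weights=ConvNeXt_Tiny_Weights.IMAGENET1K_V1)

# Fine-tuning all layers: every parameter remains trainable (the default).

# Fine-tuning some layers: freeze the convolutional backbone so its
# pre-trained weights stay unchanged, and retrain only the classifier head.
for param in model.features.parameters():
    param.requires_grad = False
```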
The ConvNeXt pre-trained model used in this study is based on the original model under the PyTorch framework [24], trained on the publicly available ImageNet-1K dataset [25]. To better suit this task, the original 1000 classes in the FC layer were changed to 4 classes, an improved SET attention mechanism module was added, and the cross-entropy loss was changed to focal loss, yielding the new SETFL-ConvNeXt network. In pre-training–fine-tuning, the trained convolution weights are transferred to SETFL-ConvNeXt, and the model is then trained on the self-built winter wheat dataset. Finally, the task of winter wheat seedling identification and grading during the greening period can be completed, as shown in Figure 3.
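A minimal sketch of the head replacement under torchvision's ConvNeXt layout follows; the SET module and focal loss of SETFL-ConvNeXt are additions of this study and are not reproduced here.

```python
import torch.nn as nn

# Continuing the sketch above: adapt the ImageNet-1K head (1000 classes) to
# the 4 seedling classes. In torchvision's ConvNeXt, the final Linear layer
# sits at classifier[2].
num_features = model.classifier[2].in_features      # 768 for the tiny variant
model.classifier[2] = nn.Linear(num_features, 4)    # Strong, First, Second, Third
```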
4. Experimental Validation and Analysis of the Results
4.1. Configuration Environment
The experimental environment was the Windows 10 operating system with an AMD EPYC 7402 24-core processor (2.79 GHz base frequency), the Python 3.8.18 development environment, PyTorch 2.0.0 with CUDA 12.2 as the deep learning framework, and an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory for acceleration. The input image size was 224 × 224 pixels; training ran for 50 epochs with a batch size of 128; and the chosen optimizer was AdamW, using an exponential learning-rate decay strategy with a factor of 0.99.
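Assuming this setup, the optimizer and scheduler configuration might look as follows in PyTorch; the base learning rate and weight decay are assumptions, and `model`, `criterion` (the focal loss), and `train_loader` follow the earlier sketches.

```python
import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(50):                       # 50 training rounds
    for images, labels in train_loader:       # batch size 128
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()    # multiply the learning rate by 0.99 after each epoch
```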
4.2. Evaluation Indicators
Here, TP is the number of samples correctly predicted as positive, that is, winter wheat seedlings of a given class correctly classified as that class; TN is the number of samples correctly predicted as negative, that is, seedlings of other classes correctly classified as other classes; FP is the number of samples incorrectly predicted as positive, that is, seedlings of other classes incorrectly classified as the given class; and FN is the number of samples incorrectly predicted as negative, that is, seedlings of the given class incorrectly classified as other classes. The F1 value is the harmonic mean of precision and recall, and accuracy represents the overall correctness of the model.
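For reference, the standard formulas corresponding to these definitions are as follows (reconstructed here, as the original equations are not reproduced in this excerpt):

```latex
\begin{aligned}
\mathrm{Precision} &= \frac{TP}{TP + FP}, &
\mathrm{Recall} &= \frac{TP}{TP + FN}, \\
F1 &= \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, &
\mathrm{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}.
\end{aligned}
```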
4.3. The Impact of Different Learning Methods and Data Augmentation on Models
We compared models trained on the dataset before and after data augmentation, with and without transfer learning. The pre-training–fine-tuning mode in transfer learning used full fine-tuning. The four configurations were as follows:
① Using transfer learning without data augmentation;
② Using data augmentation without transfer learning;
③ Not using transfer learning or data augmentation;
④ Using transfer learning and data augmentation.
For the model that used only data augmentation without transfer learning, overfitting occurred after 20 epochs (Figure 5): the loss value decreased and then increased, while the accuracy remained essentially unchanged. The augmented data did not improve the generalization ability of the model but introduced noise, causing the model to learn too many noisy features. Both models using transfer learning reached stability after five epochs, indicating faster convergence with transfer learning, and their accuracy increased by 10.82% over their counterparts without transfer learning, both with and without data augmentation. When neither optimization strategy was used, the model's loss curve converged at 25 epochs, and the accuracy reached only 75.65%. Transfer learning thus not only accelerates convergence but also exploits the rich underlying features learned from large-scale data to improve generalization and overall performance.
The two models that used only one of the two optimization strategies achieved an impressive training set accuracy of around 99% (Table 2), but their test set accuracy was only around 85%, indicating that they learned many useless features and ignored general, transferable ones, leading to overfitting.
4.4. The Impact of Pre-Training–Fine-Tuning Methods on the Model
In many studies, researchers default to partial fine-tuning when using transfer learning, freezing the earlier network layers and training only the last few layers or the final FC layer. Some scholars choose different fine-tuning methods based on how much the large dataset learned by the pre-trained model differs from the task dataset [30,31]. There is currently no definitive conclusion on how different fine-tuning methods affect a given model, but one thing is certain: training on a larger dataset achieves better results, albeit at the cost of increased training time and expense.
This experimental design includes the following fine-tuning strategies (a code sketch of strategy C appears after the list):
Global fine-tuning (A);
Partial fine-tuning (B): training only the final FC layer and the SET attention mechanism;
Partial fine-tuning (C): training the last three ConvNeXt Block layers, the final FC layer, and the SET attention mechanism;
Partial fine-tuning (D): freezing the first six ConvNeXt Block layers and training the remaining layers;
Partial fine-tuning (E): freezing the first three ConvNeXt Block layers and training the remaining layers.
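Below is a minimal sketch of strategy C under torchvision's ConvNeXt-Tiny layout, whose blocks are implemented by the CNBlock class; mapping these modules to the paper's "last three ConvNeXt Block layers" is our assumption, and the SET module belongs to the paper's model rather than torchvision's.

```python
# Freeze everything first, then selectively unfreeze.
for param in model.parameters():
    param.requires_grad = False

# Collect the ConvNeXt blocks and unfreeze the last three.
blocks = [m for m in model.modules() if type(m).__name__ == "CNBlock"]
for block in blocks[-3:]:
    for param in block.parameters():
        param.requires_grad = True

# Unfreeze the FC head (and, in the paper's model, the SET module).
for param in model.classifier.parameters():
    param.requires_grad = True
```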
Figure 6 shows that the accuracy of model C, trained on the last three ConvNeXt Block layers, the final FC layer, and the SET attention mechanism, reached 0.9668. Its accuracy exceeded that of the globally fine-tuned model A by 0.19%, the partially fine-tuned model B (final FC layer and SET attention only, the lowest accuracy) by 6.19%, the partially fine-tuned model D (first six ConvNeXt Blocks frozen) by 0.44%, and the partially fine-tuned model E (first three ConvNeXt Blocks frozen) by 0.75%.
The convergence speeds in Figure 6 reveal that model B exhibited a convergence trend only after approximately 45 epochs, indicating that the number of fine-tuned layers and the fine-tuning strategy significantly impact convergence speed. Because it fine-tunes only the final FC layer and the SET attention mechanism, model B lacks adjustments to deeper features, resulting in slower convergence and lower accuracy. Model C performs best: by fine-tuning multiple deep feature layers and the attention mechanism, it captures features better and converges quickly.
The loss curves show that model C performs best, with no significant oscillations or upward trends, indicating a stable learning process during training. By contrast, the oscillation of model A's curve is the most significant, possibly because global fine-tuning updates many parameters during training, making smooth optimization difficult. Model B's curve is relatively smooth, but its learning is limited by the few fine-tuned layers. Models D and E also exhibit some oscillation but remain relatively stable overall.
4.5. The Impact of Insertion Position on the Model
Attention mechanisms have attracted much attention, and the flexible use of plug-and-play attention modules can greatly improve a model's performance indicators. In pre-training–fine-tuning transfer learning, besides changing the final FC layer of the pre-trained model to match the target task, the way an attention module is embedded has a significant impact on model performance. In this section, we design an experiment to explore the impact of the embedding position of the attention module on model performance.
Model V is the original ConvNeXt model using global fine-tuning;
Model IV replaces ConvNeXt Block IV with a SET-ConvNeXt Block and freezes the rest of the layers, only training the SET-ConvNeXt Block and the FC layer;
Model III replaces ConvNeXt Block III and ConvNeXt Block IV with SET-ConvNeXt Blocks and freezes the remaining layers, only training the SET-ConvNeXt Blocks and the FC layer;
Model II replaces ConvNeXt Block II, ConvNeXt Block III, and ConvNeXt Block IV with SET-ConvNeXt Blocks and freezes the remaining layers, only training the SET-ConvNeXt Blocks and the FC layer;
Model I replaces ConvNeXt Block I, ConvNeXt Block II, ConvNeXt Block III, and ConvNeXt Block IV with SET-ConvNeXt Blocks and freezes the remaining layers, only training the SET-ConvNeXt Blocks and the FC layer;
Model C places the attention mechanism outside the Block, as in Section 4.4.
In this section, we take the common approach of placing attention mechanisms inside ConvNeXt Blocks, as shown in Figure 7, and compare these models with model C of Section 4.4, which places attention outside the ConvNeXt Blocks (a sketch of the SE-style module being embedded follows).
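For concreteness, the following is a minimal sketch of a standard SE attention module of the kind being embedded; the paper's improved SET variant is not specified in this excerpt, so this shows only the baseline mechanism.

```python
import torch.nn as nn

class SEModule(nn.Module):
    """Standard Squeeze-and-Excitation attention; the paper's SET module is an
    improved variant whose exact design is not reproduced here."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                        # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                   # squeeze: global average pooling
        w = self.fc(w)[:, :, None, None]         # excitation: per-channel weights
        return x * w                             # recalibrate the feature maps

# Models I-IV place such a module inside the ConvNeXt Blocks (SET-ConvNeXt
# Block); model C instead applies attention outside the blocks, as in Section 4.4.
```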
As more ConvNeXt Blocks are replaced with SET-ConvNeXt Blocks, the degree of change to the pre-trained model increases from model IV to model I. The loss curves in Figure 8 show that three models (I, II, and III) exhibit overfitting, with loss values that decrease and then increase. Directly training the original ConvNeXt network (model V) with full fine-tuning produced significant fluctuations in the loss curve, and the model learned too many redundant features. The loss of model IV, with the fewest structural changes, first decreases and then stabilizes. As the original pre-trained structure is changed more, accuracy also decreases: model C achieves the highest accuracy at 0.9668; the original network V reaches 0.9577; and the other four models reach 0.9358 (IV), 0.9224 (III), 0.8959 (II), and 0.8812 (I). This indicates that how the attention mechanism is inserted affects the feature extraction ability of the pre-trained model differently: the greater the damage to the pre-trained features, the worse the performance. Model C, which changes the pre-trained structure the least, improved accuracy by 3.1–8.56% compared with the other insertion methods and by 0.91% compared with leaving the pre-trained structure unchanged (model V), and its overfitting also improved to some extent. Through fine-tuning and improvement, the model adapts better to the specific task on the target dataset; it learns more useful features rather than simply memorizing noise or irrelevant patterns in the training data, thereby reducing overfitting. The weights of a pre-trained model are initially trained on large-scale datasets and may not directly suit the target task; during fine-tuning, further training on the target dataset yields a feature representation better suited to the current task and improves classification ability. The effectiveness of the optimization strategy is thus verified. The optimization strategies include adjusting the network structure, freezing some layer weights, and data augmentation; these operations help the model generalize better, avoiding overfitting while improving accuracy.
4.6. Model Comparison
To verify the improved SETFL-ConvNeXt model's performance, ablation experiments were carried out on the classic ConvNeXt network, ConvNeXt-FocalLoss, and SET-ConvNeXt-cross-entropy. The proposed SETFL-ConvNeXt network was also compared with classical networks from other image classification tasks [32,33,34,35]. Using data augmentation and transfer learning with the same hyperparameters, the results on the test set are shown in Table 3.
When compared with EfficientNet_v2s, MobileNet_V3s, ResNet18, and Vgg11, the accuracy of the classic ConvNeXt network improved by 0.86%, 3.41%, 1.17%, and 1.52%, respectively (Table 3). The accuracy of the ConvNeXt_base model was 0.05% higher than that of ConvNeXt (ConvNeXt_tiny), but that model was not adopted because it is too large and consumes too many computing resources.
Compared with the MobileNet_V3s model with the lowest accuracy, the accuracy of the SETFL-ConvNeXt model improved by 4.27%. Compared with ConvNeXt_base, EfficientNet_v2s, ResNet18, and Vgg11, it increased by 0.86%, 1.72%, 2.03%, and 2.38%, respectively. The performance of SETFL-ConvNeXt is better than that of traditional networks.
Compared with the classical ConvNeXt network, the SET-ConvNeXt-cross-entropy network, and the ConvNeXt-FL network, the accuracy of the SETFL-ConvNeXt model improved by 0.91%, 0.14%, and 0.21%, respectively. These gains stem from two key improvements. First, introducing the SET module enables the model to automatically learn more appropriate classification weights and focus on high-value feature information, significantly improving discriminative ability in complex scenes. Second, the focal loss function makes the model pay more attention to minority and difficult samples, effectively reducing their adverse effect on overall performance. Together, these two measures further improve the model's accuracy.
The confusion matrix analysis shows that, of the four types, the characteristics of First-class seedlings are most similar to those of Strong-class seedlings (Figure 9). First-class seedlings also have relatively few samples, and their classification accuracy is the lowest: in the classic ConvNeXt model, the accuracy for First-class seedlings is only 90.736%. To resolve this, we optimized for imbalanced data. Using the focal loss function, the ConvNeXt-FL model improved its accuracy for First-class seedlings to 94.070%, 3.334% higher than the classical ConvNeXt model. In addition, the F1 values of the ConvNeXt-FL model increased by 0.8%, 1.0%, 0.6%, and 0.7% for the four seedling classes, and the overall F1 value increased by 0.8%. These results indicate that focal loss handles imbalanced datasets well and helps improve classification performance for minority classes. Meanwhile, after incorporating the SET attention mechanism, the SET-ConvNeXt model improved classification accuracy by 1.188%, 2.199%, and 0.132% over the classical ConvNeXt model for Strong-class, First-class, and Third-class seedlings, while the accuracy for Second-class seedlings decreased only slightly, by 0.034%. These results further demonstrate that the SET module effectively enhances the model's feature extraction capability and generalization; even where features are similar or data are scarce, the model maintains good performance. The final SETFL-ConvNeXt network combines these advantages, and its accuracy for the four seedling types increased by 0.828%, 2.521%, 0.263%, and 0.401%.
5. Discussion
Classifying winter wheat seedlings during the greening period still relies on manual labor in many places, with inconsistent standards and subjective judgment. Using spectroscopic equipment or other fixed facilities to capture canopy images can be costly. This study takes a practical perspective and considers the model's usability under real conditions: there is no need to set up a specialized imaging system, only a smartphone to collect winter wheat images. This approach aims to help government officials, researchers, and agricultural practitioners classify winter wheat seedlings during the greening period more conveniently and quickly, understand the growth status of winter wheat, and provide a reference for subsequent field management. It also offers researchers a solution and approach for addressing overfitting.
We found that when pre-trained models are used in transfer learning, targeted improvements are needed based on the characteristics of and differences in downstream tasks. In classifying winter wheat seedlings during the greening period, the small First-class sample size in the dataset and its similarity to Strong-class characteristics prevented the expected model performance; it is therefore necessary to improve the model's attention to important features. It is also necessary to address the overfitting commonly seen in transfer learning. This study yielded the following key results:
Combining transfer learning and data augmentation can effectively improve model performance on the training and testing sets and is thus an important means of enhancing performance. When only data augmentation was applied, the model overfitted; using both optimization strategies simultaneously improved accuracy by 20.84%. The model's behavior under different strategies indicates that reasonably selecting and combining training strategies can significantly enhance generalization and avoid overfitting.
In transfer learning, it is important to choose an appropriate fine-tuning strategy based on the characteristics of the downstream task. Compared with the other four fine-tuning strategies, partial fine-tuning that trains the last three Block layers, the FC layer, and the SET attention mechanism improved the model's accuracy by 0.19–6.19%.
When using attention mechanisms to improve model performance, appropriate embedding methods should be selected based on the degree of change in the pre-trained model’s structure. As the degree of change decreases, the model’s accuracy increases, and overfitting is improved. For classifying winter wheat greening seedlings, selecting the method with the least degree of damage can achieve 0.9668 accuracy. Compared with other methods, the accuracy increased by 3.1–8.56%, and compared with not changing the structure of the pre-trained model, the accuracy increased by 0.91%.
The SETFL-ConvNeXt model not only improved the overall classification accuracy to 0.9668 by introducing the focal loss and SET modules but also significantly improved performance on imbalanced data and similar features. For classifying First-class seedlings, the accuracy improved significantly, by 2.521%, showing the effectiveness and practicality of these improvements.
Choosing appropriate optimization strategies and utilizing both transfer learning and image augmentation can enhance a model's practicality. When selecting a plug-and-play attention module based on the characteristics of the downstream task, appropriate fine-tuning methods should be combined with the model to complete the downstream target task, and any overfitting must be appropriately addressed to ensure the model's generalization ability.
Compared with previous studies [11,18,19,20], our method provides a more robust framework for agricultural tasks that require precise classification. Although those models were successfully applied to their target tasks, our SETFL-ConvNeXt model achieves the highest accuracy in winter wheat classification. When fine-tuning, changing only the FC layer or fine-tuning everything [18,20] may yield insufficient performance or overfitting: training the FC layer alone may lose complex features in the transmission from the convolutional layers to the FC layer, impairing the model's classification ability, while global fine-tuning may disrupt the common features learned by the early layers of the pre-trained model, decreasing accuracy or causing overfitting. This indicates that the model's architecture and fine-tuning strategy play a crucial role in its success.
Future work can focus on expanding the dataset to include more diverse varieties and regions, which may improve the model's generalization across different agricultural contexts. Attention should also be paid to maintaining the model's efficiency and accuracy in complex environments to meet the needs of practical applications. With appropriate adjustments, the SETFL-ConvNeXt model has the potential not only to improve the classification of winter wheat seedlings but also to extend to other crops and agricultural tasks, providing smarter and more refined technical support for agricultural production.