1. Introduction
Maize is one of the most important staple crops worldwide, and its yield and quality are directly related to the economic benefits of agricultural production and food security [1, 2]. In modern agricultural production, maize kernel recognition technology is widely used in seed quality inspection, genetic breeding research, and precision agriculture management. Accurately and rapidly identifying the morphological characteristics of maize kernels not only assists seed companies in variety screening and quality control but also provides agricultural researchers with efficient trait analysis tools. This, in turn, optimizes breeding strategies and enhances the efficiency of elite variety selection [3, 4, 5]. Therefore, developing high-precision and efficient maize kernel recognition methods is crucial for advancing intelligent agriculture.
Traditional maize kernel recognition methods mainly rely on manual measurements and statistical analysis, such as using vernier calipers to measure kernel length, width, and thickness or employing optical microscopes for grain morphology observation [6]. Although these methods provide high measurement accuracy, they suffer from complex operations, long processing times, and low efficiency, making them unsuitable for large-scale seed testing and breeding experiments. Additionally, some studies have applied computer vision methods based on color, shape, and texture features, extracting maize kernel morphological parameters through image-processing techniques such as edge detection, morphological operations, and grayscale histogram analysis [7, 8, 9]. While these methods reduce human intervention, they are highly sensitive to lighting conditions and background complexity. Moreover, in cases of kernel adhesion, occlusion, or high varietal diversity, the detection accuracy significantly decreases.
The related research has undergone a technological evolution from traditional thresholding methods to deep learning and, more recently, to Transformer-based architectures. Initially, maize trait measurement relied on low-level features such as color, shape, and texture, using fixed or adaptive thresholds to segment target regions. These methods were simple but lacked adaptability in complex backgrounds [10, 11, 12, 13]. Subsequently, Convolutional Neural Networks (CNNs) gradually replaced traditional approaches by enabling automatic feature extraction, significantly improving recognition accuracy and robustness. CNNs have shown strong performance in tasks such as object detection, segmentation, and variety classification, albeit at the cost of high computational demands [14]. In recent years, Transformer-based models, particularly the DEtection TRansformer (DETR) architecture, have emerged as a promising alternative. By introducing the multi-head self-attention mechanism, these models effectively capture long-range dependencies and overcome the limitations of CNNs’ local receptive fields, demonstrating superior performance in handling occlusion and adhesion challenges [15].
The Faster Region-based Convolutional Neural Network (Faster R-CNN), as a two-stage object detection method, first generates candidate regions and then performs precise classification and regression, achieving high detection accuracy [16]. Zhang et al. [17] described a rice panicle detection model based on an improved Faster R-CNN. Experimental results showed that the model achieved a mean average precision (mAP) of 92.47%, significantly improving upon the original Faster R-CNN model (mAP of 40.96%). However, the method incurs high computational costs, making real-time applications challenging. The YOLO series, as a typical single-stage detector, achieves efficient detection through an end-to-end framework, balancing detection speed and accuracy [18]. Yang et al. [19] proposed a new high-accuracy and real-time maize pest and disease detection method called Maize-YOLO. Experimental results demonstrated that this method outperformed existing state-of-the-art YOLO-based object detection algorithms, achieving 76.3% mAP and 77.3% recall. Xia et al. [20] developed a maize seed surface mold detection model using the YOLOv5s deep learning algorithm based on machine vision technology. They further enhanced its portability for deployment on mobile devices, ultimately developing the improved YOLOv5s-ShuffleNet-CBAM model, which achieved an mAP50 value of 0.955. Yang et al. [21] designed a maize kernel detection and recognition model named “YOLOv7-MEF”. Experimental results showed that the improved algorithm achieved an accuracy of 98.94%, a recall of 96.42%, and a frame rate of 76.92 FPS.
However, due to the small size and similar morphology of maize kernels, existing object detection models still face challenges in high-density small-object detection, particularly under kernel adhesion and partial occlusion conditions, where misdetections or missed detections frequently occur.
To address these issues, this study proposes a maize kernel detection method based on a neighborhood attention mechanism and neighborhood loss. First, during the feature extraction phase, the proposed method introduces a neighborhood attention mechanism, ensuring stable feature representation across different scales of kernel detection tasks. Additionally, to further optimize the spatial consistency of detection results, this study introduces neighborhood loss, which constructs a local region constraint loss, ensuring that the detection results of adjacent kernels are more consistent, effectively reducing misdetections and missed detections. The main contributions of this study are as follows:
Neighborhood attention mechanism: The neighborhood attention mechanism proposed in this study attends only to features within a local neighborhood and computes local self-attention through sliding windows, improving the model’s ability to capture local features in maize kernel detection tasks.
Neighborhood loss to optimize spatial consistency among objects: This study designs a neighborhood loss that introduces local region constraints, ensuring that adjacent objects have more consistent feature distributions, thereby reducing misdetections and missed detections.
Combining the advantages of Transformer and YOLO architectures: The proposed method integrates the global feature extraction capability of Transformer models with the rapid object detection characteristics of YOLO structures.
This study ensures high detection performance while optimizing computational complexity, allowing the method to operate efficiently in resource-constrained environments. Future research will further enhance the model’s lightweight characteristics and deploy and test it in real agricultural environments at the Jiangsu Academy of Agricultural Sciences, Germplasm Resources and Biotechnology Research Institute. This will validate its effectiveness in large-scale agricultural applications, improving the efficiency and reliability of automated seed quality inspection.
2. Materials and Methods
2.1. Data Collection
The data collection for this study was primarily conducted at the experimental field of the Jiangsu Academy of Agricultural Sciences, located in Nanjing, China. Additionally, publicly available online datasets were incorporated to ensure diversity and broad applicability. A high-resolution RGB camera, Canon EOS 5D Mark IV (Canon Inc., Tokyo, Japan), equipped with a 50 mm prime lens, was used for image acquisition to ensure clarity and detailed feature representation. The camera was mounted on an adjustable-height tripod to maintain stability across different shooting angles, as shown in Figure 1.
The shutter speed was set to 1/1000 s to prevent motion blur, and the aperture was configured at to achieve an optimal depth of field. It should be noted that image-processing results can be significantly influenced by several image acquisition parameters, including focal length, color depth, local magnification, object rotation, aperture, and threshold values. In this study, to ensure consistency and reproducibility, we employed a fixed imaging configuration consisting of a 50 mm lens, 1/1000 s shutter speed, and aperture. While this setup provides a controlled environment for evaluating the model, we acknowledge that such consistency may restrict the evaluation of robustness across more diverse imaging scenarios. Therefore, as part of our future work, we intend to explore additional experiments involving varied acquisition settings to assess the generalization and adaptability of the proposed method under complex real-world conditions. To enhance consistency across images, automatic white balance mode was enabled, and multiple captures were taken at different times of the day to accommodate varying natural light conditions. Additionally, a multi-angle imaging strategy was employed to obtain comprehensive maize ear images, including top-view, side-view, and 45-degree oblique-view perspectives, facilitating subsequent object detection and morphological analysis. The data collection process followed a standardized protocol. Representative maize varieties were selected, including both conventional and specialty maize, such as high-anthocyanin and high-carotenoid varieties, resulting in a total of 1672 images. Furthermore, images of maize kernels with impurities and defects, including damaged and moldy kernels, were collected to enhance the model’s adaptability to complex environments. During image acquisition, special attention was given to key maize kernel traits, including ear length, ear diameter, kernel row number, kernel size, and color characteristics.
2.2. Dataset Annotation and Augmentation
The primary objective of dataset annotation is to provide accurate bounding boxes and class labels, ensuring that the supervised learning model correctly identifies and classifies maize kernels during training. In object detection tasks, each sample must be annotated with its target position and class information, including the bounding box coordinates $(x, y, w, h)$, where $(x, y)$ represents the center coordinates of the target, and $w$ and $h$ denote the width and height of the object. The annotation process typically involves image loading, bounding box drawing, and data format conversion. Given an input image $I$, the annotator manually selects the target region $R_i$ and records its bounding box coordinates:
$$B_i = (x_i, y_i, w_i, h_i),$$
where $i$ represents the $i$-th target. In object detection models, normalized coordinates are usually used for training to mitigate the impact of image resolution on the model’s performance. The normalization process is as follows:
$$x' = \frac{x}{W}, \quad y' = \frac{y}{H}, \quad w' = \frac{w}{W}, \quad h' = \frac{h}{H},$$
where $W$ and $H$ represent the width and height of the image, and $(x', y', w', h')$ are the normalized bounding box coordinates.
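The normalization step can be illustrated with a minimal Python sketch. The function names `normalize_bbox` and `denormalize_bbox` and the example image size are illustrative assumptions, not part of the original pipeline.

```python
def normalize_bbox(x, y, w, h, img_w, img_h):
    """Convert a center-format box in pixels to coordinates relative to the image size."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    return (x / img_w, y / img_h, w / img_w, h / img_h)


def denormalize_bbox(xn, yn, wn, hn, img_w, img_h):
    """Inverse mapping from normalized coordinates back to pixel units."""
    return (xn * img_w, yn * img_h, wn * img_w, hn * img_h)


# Hypothetical 640 x 480 image with one kernel centered in the frame.
box = normalize_bbox(320, 240, 64, 48, img_w=640, img_h=480)
print(box)
```

Because the normalized values do not depend on resolution, the same annotation remains valid after the image is resized.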
During the annotation process, target class labels $y$ must also be defined, typically represented using one-hot encoding:
$$y = [y_1, y_2, \dots, y_K], \quad y_k \in \{0, 1\},$$
where $K$ is the total number of object classes, and $y_k = 1$ indicates that the target belongs to class $k$, otherwise $y_k = 0$. During model training, the class labels are input into the network along with the bounding boxes for optimization. To enhance the model’s robustness and improve its adaptability to different lighting conditions, shapes, and backgrounds, various data augmentation methods are applied in this study, such as Mosaic, CutMix, and GridMask. The Mosaic augmentation method stitches together four different images, increasing background diversity and improving the model’s ability to adapt to complex environments. Given four original images $I_1, I_2, I_3, I_4$, a random stitching center point $(x_c, y_c)$ is first selected in the image space, and then the four sub-regions are partitioned as follows:
$$I_{\text{mosaic}}(x, y) = \begin{cases} I_1(x, y), & x < x_c,\ y < y_c, \\ I_2(x, y), & x \ge x_c,\ y < y_c, \\ I_3(x, y), & x < x_c,\ y \ge y_c, \\ I_4(x, y), & x \ge x_c,\ y \ge y_c. \end{cases}$$
Simultaneously, the bounding box coordinates must be adjusted accordingly to ensure alignment with the transformed image positions. The CutMix augmentation method enhances model robustness by randomly cropping a rectangular region from one image and replacing it with the corresponding region from another image, thereby simulating partial occlusion of the target. Given two original images $I_A$ and $I_B$, a random rectangular region $R$ is selected from $I_A$ and replaced with the corresponding region from $I_B$:
$$\tilde{I}(x, y) = \begin{cases} I_B(x, y), & (x, y) \in R, \\ I_A(x, y), & \text{otherwise.} \end{cases}$$
The size of the rectangular region $R$ is controlled by the hyperparameter $\lambda$, which is typically sampled from a Beta distribution $\mathrm{Beta}(\alpha, \alpha)$:
$$\frac{|R|}{W H} = 1 - \lambda,$$
where $|R|$ represents the area of the cropped region, and $W, H$ are the image width and height. To ensure the consistency of class labels, the new sample’s class label is computed as a weighted average:
$$\tilde{y} = \lambda\, y_A + (1 - \lambda)\, y_B.$$
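The CutMix procedure described above can be sketched in plain Python on nested lists. This is a simplified illustration under stated assumptions: the function name `cutmix`, the default `alpha=1.0`, and the convention that the mixing weight is the fraction of the first image that survives are all choices made here, and the sketch mixes class labels only (bounding boxes would need separate handling in a detection pipeline).

```python
import random


def cutmix(img_a, img_b, label_a, label_b, alpha=1.0, rng=random):
    """Paste a random rectangle of img_b into a copy of img_a and mix the labels.

    Images are H x W nested lists of the same size; labels are one-hot lists.
    """
    h, w = len(img_a), len(img_a[0])
    lam = rng.betavariate(alpha, alpha)      # fraction intended to be kept from img_a
    cut_ratio = (1.0 - lam) ** 0.5           # side ratio so the area is about (1 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    top = rng.randint(0, h - cut_h)          # randint is inclusive on both ends
    left = rng.randint(0, w - cut_w)

    mixed = [row[:] for row in img_a]        # leave the inputs untouched
    for r in range(top, top + cut_h):
        mixed[r][left:left + cut_w] = img_b[r][left:left + cut_w]

    # Recompute the weight from the actual pasted area, since int() rounds it.
    lam = 1.0 - (cut_h * cut_w) / (h * w)
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed, label
```

Recomputing the weight from the pasted area keeps the mixed label consistent with the pixels that were actually replaced.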
This strategy ensures that the model retains good recognition capability even when encountering partially occluded targets. The GridMask augmentation method overlays regularly or randomly distributed grid occlusions on the image, encouraging the model to focus on more discriminative global features during training. Given an input image $I$, GridMask defines a grid structure $M$ and applies occlusion based on a predefined ratio $r$:
$$\tilde{I} = I \odot M,$$
where $M(x, y)$ is defined as
$$M(x, y) = \begin{cases} 0, & (x \bmod d) < r d \ \text{and}\ (y \bmod d) < r d, \\ 1, & \text{otherwise.} \end{cases}$$
Here, $d$ represents the grid cycle, and $r$ controls the size of the occlusion region. When $r$ is small, the occlusion region is minimal, having little impact on the image; when $r$ is large, more areas are occluded, requiring the model to rely on stronger global feature extraction capabilities. By increasing local information loss, GridMask improves model generalization, ensuring that it maintains good recognition performance even when some target regions are missing or occluded. In the maize kernel recognition task, these data augmentation methods effectively enhance the model’s generalization ability, allowing it to stably detect targets even under complex backgrounds and varying lighting conditions.
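A GridMask-style mask can be generated with a few lines of Python. This is a minimal sketch assuming a regular grid with no random offset or rotation, where each cell of period `d` has a square of side `d * r` zeroed out; the function names `gridmask` and `apply_mask` are illustrative.

```python
def gridmask(height, width, d, r):
    """Binary mask: 0 inside the occluded square of each d x d cell, 1 elsewhere."""
    if d <= 0 or not 0.0 <= r <= 1.0:
        raise ValueError("d must be positive and r must lie in [0, 1]")
    side = int(d * r)  # side length of the occluded square within each cell
    return [[0 if (y % d) < side and (x % d) < side else 1
             for x in range(width)]
            for y in range(height)]


def apply_mask(image, mask):
    """Element-wise product of a single-channel image and the mask."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]


mask = gridmask(8, 8, d=4, r=0.5)
for row in mask:
    print("".join(str(v) for v in row))
```

With `d=4` and `r=0.5`, a quarter of every cell is hidden; raising `r` toward 1 hides progressively more of the image, which matches the behavior described for large `r`.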
2.3. Proposed Method
2.3.1. Maize Kernel Shape Recognition Network Based on Object Detection Algorithm
As shown in Figure 2, the maize kernel shape recognition network, based on object detection algorithms, adopts a hierarchical feature extraction architecture consisting of three core subnetworks: Base-Net, Overview-Net, and Focus-Net. These subnetworks are responsible for initial feature extraction, global feature modeling, and fine-grained feature refinement, respectively. Base-Net, serving as the backbone network, receives the input RGB image
and first maps it into a lower-dimensional feature space through an embedding layer. The extracted hierarchical features are then processed through multiple basic blocks, with progressively increasing channel dimensions
while the spatial resolution is gradually reduced to
,
, and
. The structure of the basic block in this stage employs an improved ResNet module, mathematically represented as
$$Y = F(X) + X,$$
where $X$ and $Y$ are the block input and output, and $F(\cdot)$ represents the transformation mapping of convolutional layers combined with normalization and activation functions. The residual connection ensures stable gradient propagation, thereby improving network training efficiency. The output of Base-Net is used as high-level features, which are then forwarded to both Overview-Net and Focus-Net. Overview-Net is designed to construct global perceptual features, enhancing the expression of semantic information through additional basic blocks. The output feature map has a size of
with an expanded channel dimension of
. Furthermore, Overview-Net generates global contextual prior information, denoted as Context Prior, which is transmitted to Focus-Net via the Context Guidance Flow (represented by red arrows). This process enhances the semantic consistency of local details and is mathematically formulated as
$$P = \phi(F_{\mathrm{O}}),$$
where $P$ represents the global prior information, $F_{\mathrm{O}}$ is the Overview-Net output feature map, and $\phi(\cdot)$ denotes a nonlinear mapping function based on global average pooling and a multi-layer perceptron (MLP). Focus-Net is primarily responsible for fine-grained feature enhancement, addressing the challenges associated with detecting small objects. To accommodate variations in maize kernel shapes across different scales, this module incorporates dynamic blocks that utilize an adaptive receptive field strategy. Initially, Focus-Net receives feature maps of size
from Base-Net, which are further processed through an embedding layer for dimensionality reduction. Dynamic blocks are then introduced for feature enhancement, with their core computation defined as
$$Z = A V,$$
where $Z$ is the block output, $A$ represents the neighborhood attention weight matrix, and $V$ denotes the value projection of the input features. The weight matrix
A is computed adaptively based on regional information, leveraging global features to guide local attention aggregation. This mechanism enables the network to effectively capture maize kernel shapes under occlusion conditions. Finally, the output features of Focus-Net are transmitted back to Overview-Net to reinforce global background perception, ensuring robust object detection predictions. This architectural design provides multiple advantages. Base-Net establishes stable low-level features to retain fundamental morphological information, Overview-Net enhances global contextual modeling to improve shape perception, and Focus-Net specializes in precise small-object detection. This structure improves both the accuracy and robustness of maize kernel detection. Additionally, the introduction of Context Prior and Context Guidance Flow effectively establishes feature associations between global and local regions, further optimizing the recognition performance of maize kernel shapes. Compared to conventional single-object detection models, the proposed method achieves a balance between computational efficiency and detection accuracy, particularly enhancing the capability to detect small objects. This makes it well suited for precision agricultural tasks such as maize kernel shape recognition.
2.3.2. Neighborhood Attention Mechanism
As shown in Figure 3, the neighborhood attention mechanism optimizes the computational efficiency and feature extraction capabilities of traditional self-attention by integrating local perception with global information interaction, demonstrating superior performance in object detection tasks. The computational complexity of self-attention is $O(N^2)$, where $N$ represents the spatial resolution (number of positions) of the input feature map, leading to excessive computational costs when processing high-resolution inputs. In contrast, the neighborhood attention mechanism reduces computational complexity to $O(N k^2)$, with $k \times k$ the local window size, through a local window partitioning strategy, improving efficiency while preserving global contextual information. In this study, the core idea of the mechanism involves segmenting the input features into localized regions, computing feature similarity within each window, and dynamically adjusting representations using global perceptual information to enhance the robustness and accuracy of object detection. Subsequently, adaptive pooling is applied to the key matrix $K$ to reduce its dimensionality, lowering the computational complexity of attention while retaining the effectiveness of global features. During the computation of neighborhood attention, a global perceptual kernel is introduced to refine the locally computed results, thereby preserving comprehensive information. The output of the neighborhood attention mechanism is fused with the input features via a residual connection to obtain the final output $Y$:
$$Y = X + \mathrm{NA}(X),$$
where $X$ denotes the input features and $\mathrm{NA}(\cdot)$ the output of the neighborhood attention mechanism.
In the maize kernel shape recognition network, the neighborhood attention mechanism is integrated into the feature extraction module of the object detection backbone network. The specific structure consists of three main stages: the initial feature extraction layer, the local attention computation layer, and the global fusion layer. The initial feature extraction layer employs a modified ResNet as the backbone network, with channel dimensions set to , and spatial resolution progressively reduced from to . The local attention computation layer utilizes a window partitioning strategy with a window size of , where neighborhood attention is computed within each window, while inter-window information is refined using a global perceptual kernel. Finally, the global fusion layer applies a normalization convolution to merge all neighborhood attention-enhanced features, improving the robustness of shape detection. This architectural design offers significant advantages, as it enables precise capturing of maize kernel morphological details while mitigating the high computational complexity associated with self-attention on large-scale feature maps. Consequently, the model effectively detects kernel shapes while maintaining real-time performance and computational efficiency.
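The local attention step with its residual connection can be illustrated with a one-dimensional toy version in plain Python. This is a deliberately reduced sketch, not the network's implementation: queries, keys, and values are the raw features (no learned projections), the neighborhood is a 1D window of radius `radius`, and the adaptive pooling and global perceptual kernel described above are omitted.

```python
import math


def neighborhood_attention_1d(x, radius=1):
    """Return x + NA(x) for a sequence x of equal-length feature vectors.

    Each position attends only to positions within `radius` of itself,
    so the number of score computations is about N * (2 * radius + 1)
    rather than N * N for full self-attention.
    """
    n, dim = len(x), len(x[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        neighbors = range(lo, hi)
        scores = [scale * sum(q * k for q, k in zip(x[i], x[j])) for j in neighbors]
        m = max(scores)                                   # subtract max for stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        attended = [sum(wt * x[j][c] for wt, j in zip(weights, neighbors)) / total
                    for c in range(dim)]
        out.append([xi + ai for xi, ai in zip(x[i], attended)])  # residual connection
    return out


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(neighborhood_attention_1d(features, radius=1))
```

Extending the same idea to two dimensions replaces the 1D window with a square window around each pixel.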
2.3.3. Lightweight Design
The lightweight design plays a crucial role in the maize kernel shape recognition network, aiming to reduce computational complexity while maintaining detection accuracy, thereby enhancing the model’s applicability in real-world agricultural scenarios. This optimization focuses on network depth, channel configuration, and computational efficiency by reducing redundant parameters and improving inference speed. The proposed lightweight strategy is implemented in two main aspects: first, grouped convolution, depthwise separable convolution, and channel attention mechanisms are employed in Base-Net, Overview-Net, and Focus-Net to minimize computational overhead; second, pruning and low-rank decomposition are introduced to optimize weight storage and accelerate inference. In Base-Net, the input image
is mapped to a feature representation of dimension
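through an embedding layer and subsequently downsampled across multiple stages. The hierarchical feature extraction is structured as follows:
$$F_i = f_i(F_{i-1}), \quad i = 1, \dots, n,$$
where $F_0$ is the embedded input, $n$ is the number of stages, and $f_i$ represents different convolution transformations. To optimize efficiency, these transformations adopt depthwise separable convolution, reducing the computational complexity from that of a standard convolution:
$$O(k^2 \cdot C_{in} \cdot C_{out} \cdot H \cdot W)$$
to
$$O(k^2 \cdot C_{in} \cdot H \cdot W + C_{in} \cdot C_{out} \cdot H \cdot W),$$
for a $k \times k$ kernel with $C_{in}$ input channels and $C_{out}$ output channels on an $H \times W$ feature map. This significantly decreases the computational load while preserving feature extraction capability. In Overview-Net, features are further compressed to dimension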
. The primary optimization in this module is the incorporation of a channel attention mechanism, which enhances computational efficiency by dynamically re-weighting feature channels. This mechanism utilizes global average pooling (GAP) to compute channel-wise importance, followed by a fully connected network to generate attention scores:
$$s = \sigma\big(W_2\, \delta(W_1\, \mathrm{GAP}(X))\big),$$
where $\mathrm{GAP}(\cdot)$ represents the global average pooling operation, $W_1$ and $W_2$ are the weights of fully connected layers, $\delta$ denotes the ReLU activation function, and $\sigma$ represents the Sigmoid activation function. The computed channel weights are subsequently applied to adjust the input features:
$$X' = s \odot X.$$
This approach effectively suppresses redundant channels, improves computational efficiency, and ensures that critical features receive higher attention. In Focus-Net, a dynamic receptive field adjustment strategy is adopted, allowing each dynamic block to select convolution kernels of varying sizes based on input features instead of using fixed-size kernel structures. This method is implemented using deformable convolution, mathematically expressed as
$$y(p) = \sum_{p_n \in R} w(p_n)\, x(p + p_n + \Delta p_n),$$
where $p$ denotes the current pixel location, $R$ represents the convolution window, and $\Delta p_n$ is the learned offset. This mechanism allows the convolution kernel to dynamically adjust sampling points based on local variations in the input features, enhancing detection accuracy while reducing redundant computations. The lightweight nature of Focus-Net is further reflected in channel control, where unnecessary intermediate feature channels are reduced to lower overall computational complexity while retaining sufficient feature representation capacity. The combination of depthwise separable convolution, channel attention, and deformable convolution enables the network to efficiently capture maize kernel shape details while maintaining real-time inference performance and computational efficiency.
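The saving from depthwise separable convolution can be checked with a short calculation. The sketch below counts multiply-accumulate operations for a standard convolution and for its depthwise separable counterpart; the layer sizes in the example (3 x 3 kernel, 64 to 128 channels, 80 x 80 feature map) are hypothetical and are not taken from the network described here.

```python
def standard_conv_macs(k, c_in, c_out, h, w):
    """Multiply-accumulates of a k x k standard convolution (stride 1, same padding)."""
    return k * k * c_in * c_out * h * w


def depthwise_separable_macs(k, c_in, c_out, h, w):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise


std = standard_conv_macs(3, 64, 128, 80, 80)
sep = depthwise_separable_macs(3, 64, 128, 80, 80)
print(std, sep, sep / std)
```

The ratio of the two counts simplifies to $1/C_{out} + 1/k^2$, so for a 3 x 3 kernel the separable form needs roughly one ninth of the operations once the channel count is large.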
2.3.4. Neighborhood Loss
As shown in Figure 4, neighborhood loss is introduced in this study to address the balance between local and global features in object detection, particularly in maize kernel recognition tasks. Due to the small scale of the targets and the clustering of kernels with similar morphology within local regions, traditional loss functions often fail to effectively capture spatial relationships. Conventional loss functions such as mean squared error (MSE) and cross-entropy (CE) primarily focus on the prediction accuracy of individual pixels or objects while disregarding the similarity between adjacent regions. This leads to instability in predictions, particularly at object boundaries. To mitigate this issue, neighborhood loss incorporates local structural constraints, enforcing correlation between neighboring pixels or objects to enhance spatial consistency in detection results. Specifically, given the predicted output $\hat{Y}$ and the ground truth $Y$, the traditional cross-entropy loss is formulated as
$$L_{CE} = -\sum_{i} Y_i \log \hat{Y}_i.$$
This function considers only individual point-wise predictions. In contrast, neighborhood loss introduces local region constraints, ensuring stronger semantic consistency among adjacent pixels within the target area. The neighborhood loss is defined as
$$L_{N} = \sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}\, D(\hat{Y}_i, \hat{Y}_j),$$
where $\mathcal{N}(i)$ denotes the neighborhood set of pixel $i$, $D(\cdot, \cdot)$ represents a distance metric (e.g., Euclidean distance) between two predicted values, and $w_{ij}$ is a weighting coefficient that adjusts the contribution of different neighboring regions. To further improve boundary prediction in detection tasks, a boundary-aware loss is introduced, ensuring smoother transitions in boundary areas. The boundary-aware loss is formulated as
$$L_{B} = \sum_{i \in \mathcal{B}} \left( \left| \nabla_x \hat{Y}_i - \nabla_x Y_i \right| + \left| \nabla_y \hat{Y}_i - \nabla_y Y_i \right| \right),$$
where $\mathcal{B}$ represents the set of boundary points in the target region, and $\nabla_x$ and $\nabla_y$ denote gradient computations in the $x$ and $y$ directions, respectively. The final form of neighborhood loss is derived by integrating the cross-entropy loss, neighborhood consistency loss, and boundary-aware loss as follows:
$$L = L_{CE} + \lambda_1 L_{N} + \lambda_2 L_{B},$$
where $\lambda_1$ and $\lambda_2$ are balancing coefficients that control the relative contributions of the different loss components.
This loss function design provides significant advantages in maize kernel detection. First, it ensures consistency in detecting adjacent kernels, reducing noise interference caused by illumination variations and occlusions. Second, the boundary-aware loss enhances the accuracy of object localization, enabling the model to distinguish adjacent kernels more effectively, thereby reducing misclassification and missed detections. Lastly, the improved robustness of the proposed loss function ensures better generalization across different maize varieties and kernel morphologies. Compared to conventional loss functions, neighborhood loss integrates local similarity constraints and boundary optimization, achieving high detection accuracy while improving stability, making maize kernel shape recognition more precise and reliable.
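How the three terms combine can be illustrated on a small grid of per-pixel foreground probabilities. The sketch below is one possible instantiation under several assumptions that are not taken from the paper: binary cross-entropy stands in for the classification term, the neighborhood is the right and bottom neighbor of each pixel, the neighbor weight is 1 when two neighbors share a ground-truth label and 0 otherwise, the distance is the squared difference, gradients are forward differences, and the default coefficients `lam1` and `lam2` are arbitrary.

```python
import math


def neighborhood_loss(pred, target, lam1=0.5, lam2=0.5, eps=1e-7):
    """Cross-entropy + lam1 * neighborhood consistency + lam2 * boundary term.

    pred:   H x W grid of foreground probabilities in [0, 1]
    target: H x W grid of binary labels (0 or 1)
    """
    h, w = len(pred), len(pred[0])
    ce = nb = bd = 0.0
    for i in range(h):
        for j in range(w):
            p = min(max(pred[i][j], eps), 1.0 - eps)   # clamp to avoid log(0)
            t = target[i][j]
            ce -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
            for di, dj in ((0, 1), (1, 0)):            # right and bottom neighbors
                ni, nj = i + di, j + dj
                if ni >= h or nj >= w:
                    continue
                diff_pred = pred[ni][nj] - pred[i][j]
                diff_true = target[ni][nj] - t
                if diff_true == 0:
                    nb += diff_pred ** 2               # same label: predictions should agree
                else:
                    bd += abs(diff_pred - diff_true)   # boundary: match the label gradient
    return (ce + lam1 * nb + lam2 * bd) / (h * w)


labels = [[0, 0, 1, 1], [0, 0, 1, 1]]
noisy = [[0.4, 0.1, 0.6, 0.9], [0.2, 0.3, 0.8, 0.7]]
print(neighborhood_loss(noisy, labels))
```

A prediction that is both accurate and smooth within each region drives all three terms toward zero, while a prediction that is correct on average but noisy between neighbors is penalized by the consistency term.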
2.4. Experimental Setup
2.4.1. Hardware and Software Configuration and Hyperparameters
The experiments in this study were conducted on a high-performance computing server. The computing platform utilizes an NVIDIA Tesla A100 GPU with a memory capacity of 40 GB to ensure efficient deep learning model training and inference. Additionally, the server is equipped with two AMD EPYC 7742 processors, each with a clock speed of 2.25 GHz, totaling 128 cores, and is configured with 1TB of memory to support large-scale data loading and parallel computing. The storage system employs NVMe SSDs, providing high-speed data read and write capabilities to enhance training data access efficiency. The overall hardware environment is designed to ensure the efficient execution of the maize kernel recognition task, particularly in handling high-resolution images and complex network models by providing ample computational resources.
In terms of software environment, this experiment uses Ubuntu 20.04 as the operating system, along with CUDA 11.8 and cuDNN 8.6 to fully leverage GPU acceleration for the training process. The deep learning framework utilized is PyTorch 1.13.1, in combination with Torchvision 0.14.1 for data preprocessing and object detection tasks. To enhance data processing efficiency, the NVIDIA DALI library is employed to optimize the data preprocessing pipeline. The experimental code is implemented in Python 3.9, with environment management handled by Anaconda to ensure reproducibility across different experimental phases. Additionally, the model training process is managed using Weights and Biases (WandB) for logging and experiment monitoring, allowing for tracking of training progress and hyperparameter tuning effects.
Regarding hyperparameter settings, the optimizer used is AdamW with an initial learning rate of , dynamically adjusted using the cosine annealing strategy to improve model convergence speed and avoid local optima. The batch size is set to 32, and each training epoch iterates 100 times to ensure that the model fully learns the data distribution. The weight decay coefficient is set to to suppress overfitting and enhance model generalization ability. The momentum parameters $\beta_1$ and $\beta_2$ are set to 0.9 and 0.999, respectively, to optimize the gradient update process. Furthermore, mixed precision training is adopted to reduce GPU memory usage and improve computational efficiency. During training, five-fold cross-validation is used to evaluate model stability and generalization performance under different data splits.
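The cosine annealing strategy mentioned above follows a half-cosine curve from the initial learning rate down to a minimum. The sketch below implements that closed form in plain Python; the function name `cosine_annealing_lr`, the minimum of zero, and the value `1e-3` used in the example are placeholders chosen for illustration, not the settings used in the experiments.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max, lr_min=0.0):
    """Learning rate at `step` on a half-cosine from lr_max down to lr_min."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)          # clamp to the schedule range
    cos = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos)


# Placeholder initial rate of 1e-3 over 100 steps.
for s in (0, 25, 50, 75, 100):
    print(s, cosine_annealing_lr(s, 100, 1e-3))
```

The rate changes slowly at the start and end of training and fastest in the middle, which is the usual motivation for preferring it over a step schedule.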
2.4.2. Dataset Partitioning and k-Fold Cross-Validation
The dataset in this study is divided into training, validation, and test sets in a ratio of 7:2:1. The training set is used for learning model parameters, the validation set is used for hyperparameter tuning, and the test set is employed to evaluate the final generalization performance of the model. To ensure model stability across different data splits and to minimize the impact of data distribution bias on experimental results, k-fold cross-validation is employed, where $k$ is set to either 5 or 10, depending on the dataset size and computational resources. In the cross-validation process, the dataset is equally divided into $k$ subsets, and during each round, $k-1$ subsets are used as the training set, while the remaining subset serves as the validation set. This process is repeated for $k$ rounds to ensure that the model fully utilizes the available data and achieves better generalization ability. This partitioning strategy effectively improves model robustness, ensuring high recognition accuracy even when facing different maize kernel varieties or varying lighting and background conditions, while also preventing overfitting or underfitting due to uneven data splits.
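The k-fold procedure can be sketched with the standard library alone. The helper name `k_fold_indices` and the fixed seed are illustrative assumptions; the example uses the 1672 images reported for the dataset and $k = 5$.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)               # shuffle once, reproducibly
    folds = [idx[i::k] for i in range(k)]          # k nearly equal-sized folds
    for i in range(k):
        val = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


for fold, (train_idx, val_idx) in enumerate(k_fold_indices(1672, 5)):
    print(fold, len(train_idx), len(val_idx))
```

Since 1672 is not divisible by 5, the folds differ in size by at most one image.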
2.4.3. Evaluation Metrics
In maize kernel recognition tasks, model performance evaluation is crucial. The primary metrics used include accuracy, recall, precision, and mAP to assess detection effectiveness. Accuracy measures the overall correctness of classification tasks and is calculated as the proportion of correctly predicted samples over the total samples. Recall reflects the model’s ability to detect all true targets, where a higher value indicates a lower miss rate, making it suitable for scenarios that require high detection completeness. Precision assesses the false detection rate, representing the proportion of correctly predicted positive samples among all predicted positives, where a higher value indicates a lower false detection rate. In object detection tasks, the commonly used mAP metric evaluates the accuracy of object localization and classification comprehensively. Specifically, mAP@50 refers to the mean average precision when the Intersection over Union (IoU) threshold is set to 0.5, while mAP@50:95 averages results over IoU thresholds from 0.5 to 0.95 in increments of 0.05, providing a more comprehensive evaluation of model performance under varying detection stringencies. The mathematical formulations of these metrics are as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
$$\mathrm{IoU} = \frac{|B_p \cap B_{gt}|}{|B_p \cup B_{gt}|}$$
$$AP = \int_0^1 P(R)\, dR$$
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
Here, True Positive (TP) represents correctly detected targets, True Negative (TN) denotes correctly identified non-target samples, False Positive (FP) represents the number of false detections, and False Negative (FN) represents missed targets. $B_p$ represents the predicted bounding box, while $B_{gt}$ is the ground-truth bounding box, with ∩ indicating the intersection area and ∪ representing the union area. $P(R)$ denotes the precision–recall curve, $N$ is the total number of classes, and $AP_i$ represents the average precision of class $i$. Overall, these evaluation metrics allow for a multi-dimensional assessment of maize kernel detection model performance and provide direction for model optimization.
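The IoU, precision, and recall definitions translate directly into code. The sketch below assumes boxes in corner format `(x_min, y_min, x_max, y_max)`, which differs from the center format used for annotation, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)   # zero when boxes do not overlap
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
print(precision_recall(tp=8, fp=2, fn=2))
```

A detection is typically counted as a true positive when its IoU with a ground-truth box meets the chosen threshold, such as 0.5 for mAP at that threshold.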
2.5. Baseline
In this study, SSD [22], RetinaNet [23], YOLOv10 [24], YOLOv11 [25], Faster R-CNN [26], and DETR [27] were selected as baseline models due to their widespread application in object detection and their strong performance in terms of accuracy, speed, and robustness. SSD is a single-stage detector that predicts at multiple feature map scales, enabling it to detect objects of different sizes efficiently, making it suitable for real-time detection tasks. RetinaNet incorporates a Feature Pyramid Network (FPN) to enhance small-object detection capabilities and introduces focal loss to mitigate class imbalance issues, improving recognition accuracy for hard-to-detect objects. YOLOv10 and YOLOv11, as representatives of the YOLO series, employ an anchor-free structure and optimized loss functions to significantly improve detection speed while maintaining high accuracy, making them particularly suitable for resource-constrained scenarios. Faster R-CNN, a classic two-stage detector, employs a region proposal network (RPN) to generate candidate boxes and utilizes RoI Pooling for refined predictions. Although computationally intensive, it achieves high detection accuracy, making it well suited to tasks requiring precise object localization. DETR employs a Transformer architecture with self-attention mechanisms to model long-range dependencies, effectively handling occlusions and complex backgrounds. However, it has a high computational complexity and requires longer training times. The selection of these baseline models facilitates a comprehensive evaluation of different detection methods in maize kernel recognition, analyzing their advantages and disadvantages across various scenarios. This provides a solid benchmark for comparing the proposed method, ensuring the validity and practicality of model improvements.
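As a brief illustration of how the focal loss used by RetinaNet mitigates class imbalance, the sketch below implements its binary form, $FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$. The default values `gamma=2.0` and `alpha=0.25` are the commonly cited settings and are assumptions here; the configuration of the baseline in this study is not stated in the text.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p: predicted probability of the positive class; y: label (1 or 0).
    The factor (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)  # confident, correct prediction
hard = focal_loss(0.1, 1)  # confident, wrong prediction
print(easy, hard)
```

With `gamma=0` and `alpha=0.5`, the expression reduces to half the ordinary cross-entropy, which shows that the modulating factor is the only difference between the two losses.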
3. Results and Analysis
3.1. Maize Trait Recognition
This experiment aims to evaluate the performance of different object detection models in maize trait recognition, assessing their precision, recall, and overall detection capability in complex agricultural environments. Maize trait recognition involves detecting various kernel and ear characteristics, posing challenges such as small-object detection, occlusion, and variations in lighting conditions. Therefore, multiple state-of-the-art object detection models were selected for comparative analysis, as shown in Table 1.
The experimental results demonstrate that the Transformer-based DETR, the YOLO series models (YOLOv10 and YOLOv11), and the proposed method outperform the earlier CNN-based detectors Faster R-CNN, SSD, and RetinaNet across all metrics. In terms of precision and recall, Faster R-CNN and SSD exhibit relatively lower performance. For Faster R-CNN, this can be attributed to its region proposal-based detection approach, which may produce more false positives and false negatives against complex backgrounds, while SSD, being a single-stage detector, shows lower accuracy in small-object detection. RetinaNet, which uses focal loss to reweight training examples, achieves better performance than Faster R-CNN and SSD. However, its effectiveness in high-density object detection remains limited. DETR improves recall and mAP@50-95 through its global relationship modeling capabilities using the Transformer framework, making it particularly effective in detecting occluded objects. However, its reliance on global attention computation results in relatively lower detection efficiency in practical applications. Among YOLO-based models, YOLOv10 and YOLOv11 enhance detection accuracy, recall, and mAP through an optimized backbone structure and an efficient anchor-free mechanism. YOLOv11 further refines its attention mechanism and multi-scale feature fusion, leading to a higher mAP@50 of 0.90 and mAP@50-95 of 0.64 compared to YOLOv10. This suggests that YOLO-based models achieve a balance between computational efficiency and detection performance. The proposed method achieves the highest performance across all evaluation metrics, with a precision of 0.95, recall of 0.92, mAP@50 of 0.92, and mAP@50-95 of 0.65. This advantage is attributed to the incorporation of the neighborhood attention mechanism, which effectively enhances local feature representation.
Additionally, the proposed method integrates a neighborhood loss function at the optimization level, differing from conventional cross-entropy loss by improving spatial consistency among adjacent detected objects, leading to more stable detection at object boundaries. For farmers and breeders, this advancement translates to faster and more accurate identification of maize traits like kernel size, row alignment, and defects (e.g., mold or insect damage). Traditional methods rely on manual measurements, which are time-consuming and prone to human error. By automating these tasks, the proposed method enables rapid screening of thousands of seeds or ears, directly supporting precision breeding programs and reducing post-harvest losses. For instance, uniform kernel morphology is critical for hybrid seed production, and early defect detection can prevent contamination in storage facilities.
Mathematically, the proposed approach leverages the global modeling capability of the Transformer framework while maintaining the efficiency of YOLO-based structures. This enables simultaneous capture of both global and local features, ultimately achieving superior performance in maize trait recognition tasks.
3.2. Detection Results of Different Maize Traits Using the Proposed Method
This experiment aims to evaluate the performance of the proposed neighborhood attention mechanism in different maize trait recognition tasks, including ear length, ear diameter, kernel row number, kernel size, and color characteristics. Since maize trait recognition involves complex morphological structures, and different traits exhibit significant variations in image representation, the key challenge is to optimize the detection accuracy of multiple traits within a unified framework, as shown in Table 2.
The experimental results indicate that the proposed method demonstrates high precision, recall, and mAP in all trait detection tasks, with the best performance observed in kernel size and color characteristic detection, achieving mAP@50 values of 0.95 and 0.94 and mAP@50-95 values of 0.68 and 0.67, respectively. These results confirm that the neighborhood attention mechanism effectively enhances feature extraction, improving the model’s adaptability to different traits. In contrast, the detection accuracy of ear length and ear diameter is slightly lower, with mAP@50-95 values of 0.62 and 0.64, respectively. This may be due to the greater variability of these morphological traits across different maize varieties, increasing the complexity of the detection task. Additionally, kernel row number and kernel size exhibit superior performance compared to ear-level detection tasks, indicating that the neighborhood attention mechanism is particularly advantageous in detecting small localized traits. This can be attributed to its ability to fully exploit local features while avoiding the computational burden associated with traditional global attention mechanisms. For agricultural stakeholders, the ability to automatically quantify traits like kernel row number and color characteristics addresses critical bottlenecks in seed certification and breeding. For instance, kernel row number is a key yield predictor, and manual counting is laborious and error-prone. Automating this process allows breeders to screen thousands of ears efficiently, accelerating the selection of high-yield hybrids. Similarly, precise color detection (e.g., identifying anthocyanin-rich kernels) supports the development of specialty maize varieties with enhanced nutritional or market value.
From a mathematical perspective, the core principle of the neighborhood attention mechanism is to construct efficient attention mapping within local regions, ensuring more effective information propagation among neighboring features. Traditional self-attention mechanisms rely on global relationship modeling with a computational complexity of $O(N^2)$, where N represents the spatial resolution of the feature map (the number of spatial positions). In contrast, the neighborhood attention mechanism introduces local attention windows, reducing the computational complexity to $O(N k^2)$ for a $k \times k$ window with $k^2 \ll N$. This optimization not only decreases computational costs but also enhances the model’s ability to capture fine-grained local features. In maize trait recognition, different traits exhibit distinct spatial distribution patterns. For example, color characteristics rely on local spectral information, while kernel row number detection depends on repetitive structural patterns. The proposed method dynamically adjusts attention distribution for different tasks through the neighborhood attention mechanism, ensuring high detection accuracy across diverse traits.
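The difference in cost between global and local attention can be illustrated by counting query-key pairs. The sketch below assumes a square $k \times k$ window around every position and ignores clipping at the image border, so the local count is an upper bound; the 80 × 80 feature map and the 7 × 7 window are hypothetical values chosen for the example, not settings reported in this study.

```python
def attention_pairs(height, width, window=None):
    """Number of query-key pairs scored by one attention layer.

    window=None models global self-attention (every position attends to all
    N positions); otherwise each position attends to a window x window patch.
    """
    n = height * width
    if window is None:
        return n * n
    return n * window * window

global_pairs = attention_pairs(80, 80)           # 40960000
local_pairs = attention_pairs(80, 80, window=7)  # 313600
print(global_pairs, local_pairs, global_pairs / local_pairs)
```

For a fixed window size, the local cost grows linearly with the number of positions, whereas the global cost grows quadratically, so the gap widens as the feature map resolution increases.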
Furthermore, the incorporation of neighborhood constraints in the loss function enhances spatial smoothness in detection results, which is particularly beneficial for morphological traits such as ear diameter and kernel size. The experimental results validate the superiority of the proposed method over conventional object detection approaches, particularly in tasks requiring fine-grained trait recognition, further demonstrating the effectiveness of the neighborhood attention mechanism in agricultural object detection.
3.3. Ablation Study on Different Attention Mechanisms
This experiment aims to investigate the effectiveness of different types of attention mechanisms through an ablation study, validating the proposed neighborhood attention mechanism in maize kernel trait recognition. The experiment evaluates the performance of channel attention, spatial attention, and the proposed neighborhood attention mechanism in terms of precision, recall, accuracy, and mAP metrics, as shown in Table 3.
The experimental results indicate that channel attention performs the weakest, with precision, recall, and mAP@50-95 values of 0.63, 0.60, and 0.39, respectively. This suggests that relying solely on channel information is insufficient to effectively distinguish complex maize kernel shapes. In contrast, spatial attention achieves significant improvements in detection precision and recall, reaching 0.84 and 0.80, respectively, with mAP@50-95 increasing from 0.39 to 0.57. This improvement demonstrates that spatial attention effectively enhances feature extraction across different spatial regions, improving object localization accuracy. However, the proposed neighborhood attention mechanism outperforms all other approaches, achieving an mAP@50 of 0.92 and an mAP@50-95 of 0.65. This suggests that neighborhood attention further refines feature representation, providing enhanced robustness in shape recognition tasks.
The experimental results demonstrate that, compared to conventional attention mechanisms, neighborhood attention not only effectively captures local structural information but also enhances the global feature representation of the model, leading to superior overall detection performance. From a mathematical perspective, channel attention primarily relies on global average pooling (GAP) to compute the importance of different feature channels, adjusting channel weights across the entire feature map. Since this method models only the relationships between feature channels without considering spatial distributions, its performance is limited in shape-based object detection. In contrast, spatial attention mechanisms leverage two-dimensional convolution to highlight important spatial regions. While spatial attention enhances information representation in local regions, it does not model channel-wise dependencies, resulting in performance bottlenecks.
The proposed neighborhood attention mechanism constructs a regional correlation matrix to aggregate features within a local area. The key advantage of the neighborhood attention mechanism lies in its ability to compute attention relationships within spatial neighborhoods dynamically, thereby adapting feature representations to local contexts while preserving global semantic consistency. This enables the model to balance fine-grained local feature extraction with holistic object understanding, which is crucial for detecting complex maize kernel shapes. Consequently, compared to channel attention and spatial attention, the neighborhood attention mechanism achieves superior performance in maize kernel trait recognition, significantly improving both detection accuracy and robustness. The experimental results validate the effectiveness of this approach, further demonstrating its applicability in fine-grained object detection tasks.
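The idea of computing attention only within a spatial neighborhood can be sketched in a few lines of plain Python. This is a simplified illustration and not the module proposed in this study: it omits the learned query, key, and value projections and any positional bias, uses the input features directly for all three roles, and clips the window at the borders. The function name and the `radius` parameter are hypothetical.

```python
import math

def neighborhood_attention(feat, radius=1):
    """Local self-attention over an H x W grid of feature vectors.

    feat[y][x] is a list of floats. Each position attends only to positions
    within `radius` rows and columns of itself (window clipped at borders).
    """
    h, w = len(feat), len(feat[0])
    d = len(feat[0][0])
    out = [[None] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            query = feat[y][x]
            neigh = [feat[j][i]
                     for j in range(max(0, y - radius), min(h, y + radius + 1))
                     for i in range(max(0, x - radius), min(w, x + radius + 1))]
            # Scaled dot-product scores between the query and its neighbors.
            logits = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                      for key in neigh]
            top = max(logits)
            weights = [math.exp(v - top) for v in logits]  # stable softmax
            norm = sum(weights)
            # Output is the attention-weighted average of neighbor features.
            out[y][x] = [sum(wt * v[c] for wt, v in zip(weights, neigh)) / norm
                         for c in range(d)]
    return out

grid = [[[float(x + y), 1.0] for x in range(4)] for y in range(4)]
result = neighborhood_attention(grid, radius=1)
print(result[0][0])
```

Because each output depends only on features inside its window, a change in a distant region of the grid leaves the output at a given position unchanged, which is the locality property the text describes.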
3.4. Ablation Study on Different Loss Functions
This experiment aims to evaluate the effectiveness of different loss functions through an ablation study, validating the proposed neighborhood loss in maize kernel trait recognition. Loss functions are critical in object detection tasks, as they determine how the model optimizes predictions to achieve higher detection accuracy and stability. In this study, the performance of smooth L1 loss, IoU loss, and the proposed neighborhood loss was compared, as shown in Table 4.
The experimental results indicate that smooth L1 loss performs the weakest, with precision and recall values of 0.64 and 0.61, and an mAP@50-95 of only 0.34. This is primarily due to the fact that smooth L1 loss computes loss solely based on bounding box coordinates without sufficiently optimizing the internal distribution of regional features, resulting in limited learning capacity for object shapes and local information. In contrast, IoU loss optimizes the Intersection over Union of the predicted and ground-truth bounding boxes, ensuring a closer fit to actual targets. Consequently, IoU loss achieves significantly better performance than smooth L1 loss across all metrics, with mAP@50 and mAP@50-95 reaching 0.80 and 0.51, respectively. However, IoU loss focuses solely on the spatial overlap of bounding boxes without effectively modeling the relationships between local features. This limitation reduces its effectiveness in complex object shape recognition tasks. In comparison, the proposed neighborhood loss achieves the best results across all metrics, with mAP@50-95 reaching 0.65. This indicates that in addition to optimizing object detection, it effectively preserves the structural integrity of local shape information, improving model stability.
From a mathematical perspective, smooth L1 loss optimizes only bounding box regression error without considering feature relationships within neighboring regions, leading to suboptimal performance in complex environments. This loss function lacks contextual constraints, limiting its effectiveness in shape-dependent detection tasks. On the other hand, IoU loss directly optimizes the spatial alignment between predicted and ground-truth bounding boxes. Although IoU loss improves spatial alignment, its optimization capacity for small objects is weaker. When object regions are small, the gradient update for IoU computation becomes minimal, leading to slower model convergence. The proposed neighborhood loss constructs a local feature correlation matrix, ensuring that both bounding box optimization and internal feature consistency are maintained.
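The two conventional losses compared above can be written out as follows. This is a minimal sketch under stated assumptions: boxes are `(x1, y1, x2, y2)` tuples, smooth L1 uses a transition point `beta=1.0` and is averaged over the four coordinates, and IoU loss is taken in its common $1 - \text{IoU}$ form. The exact variants used in the experiments are not specified in the text, and the proposed neighborhood loss is not reproduced here.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def smooth_l1_loss(pred, target, beta=1.0):
    """Mean smooth L1 over box coordinates: quadratic below beta, linear above."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)

def iou_loss(pred, target):
    """IoU loss in the 1 - IoU form: 0 for a perfect match, 1 for no overlap."""
    return 1.0 - box_iou(pred, target)

pred, target = (0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0)
print(smooth_l1_loss(pred, target))  # 0.5
print(iou_loss(pred, target))
```

Smooth L1 treats the four coordinates as independent regression targets, while the IoU form scores the overlap of the two boxes as a whole; neither term looks at the features inside the box, which is the gap the neighborhood loss is described as addressing.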
The core idea behind neighborhood loss is to incorporate local constraints, ensuring greater stability in shape learning while avoiding reliance solely on global IoU matching, which often overlooks fine-grained details. The experimental results demonstrate that this loss function significantly enhances the accuracy and robustness of maize trait recognition, enabling the detection model to maintain high performance across different object shapes and scales. Compared to traditional loss functions, neighborhood loss not only improves spatial alignment but also strengthens feature consistency within local regions, making it particularly effective in maize trait recognition and other complex object detection tasks.
5. Conclusions
Maize trait recognition plays a critical role in agricultural production and breeding research. However, traditional object detection methods still face limitations in handling complex backgrounds and detecting small targets with highly variable shapes. In this study, a maize trait recognition method based on a neighborhood attention mechanism is proposed to enhance the feature extraction and attention modeling capabilities of object detection networks, thereby improving the accuracy of kernel and ear trait detection. Experimental results demonstrate that the proposed method outperforms existing object detection approaches across multiple key metrics, achieving high precision in different trait detection tasks while improving the robustness and generalization ability of object detection. The primary contribution of this study lies in the introduction of a neighborhood attention mechanism, which captures feature correlations within local regions, enabling the detection network to maintain high accuracy and stability even in complex environments. Ablation studies validate the effectiveness of the proposed mechanism, showing that compared to traditional channel attention and spatial attention methods, the proposed method improves mAP@50 and mAP@50-95 by approximately 11 and 8 percentage points, respectively (mAP@50-95 rises from 0.57 with spatial attention to 0.65). Additionally, the proposed neighborhood loss function enhances the stability of internal feature representations while optimizing bounding boxes. Compared to smooth L1 loss and IoU loss, mAP@50-95 improves by 31 and 14 percentage points, respectively (from 0.34 and 0.51 to 0.65). Further experiments on different maize traits confirm the effectiveness of the proposed method, demonstrating high detection accuracy for ear length, ear diameter, kernel row number, kernel size, and color characteristics. The overall mAP@50 reaches 0.92, and mAP@50-95 reaches 0.65, with a precision of 0.95 and a recall of 0.92.
By optimizing the attention mechanism and loss function, the proposed method not only improves object detection accuracy but also reduces computational resource dependence, making it more feasible for practical applications in agricultural production and intelligent breeding systems. Future research can further explore the application of this method in multimodal data fusion, such as integrating spectral information and improving robustness under high-illumination conditions, to enhance the adaptability and detection performance of the model.