Neighborhood Attention-Based Detection for Maize Traits in Precision Agriculture

Li, Zhongxu; Li, Juyi; Shen, Hongjun; Zhang, Mohan; Zhang, Hanwen; Zhou, Yi; Yang, Zhiyuan; Lv, Chunli

doi:10.3390/agronomy15040931

by Zhongxu Li, Juyi Li, Hongjun Shen, Mohan Zhang, Hanwen Zhang, Yi Zhou, Zhiyuan Yang and Chunli Lv *

China Agricultural University, Beijing 100083, China

* Author to whom correspondence should be addressed.

Agronomy 2025, 15(4), 931; https://doi.org/10.3390/agronomy15040931

Submission received: 6 March 2025 / Revised: 6 April 2025 / Accepted: 9 April 2025 / Published: 10 April 2025

(This article belongs to the Special Issue Advanced Machine Learning in Agriculture)

Abstract:

Maize trait recognition plays a crucial role in agricultural production and breeding research; however, the adaptability of existing object detection methods in complex environments remains limited. In this study, a maize trait detection method based on the neighborhood attention mechanism is proposed to enhance the accuracy of kernel and ear morphology detection. Experimental results demonstrate that the proposed method outperforms existing approaches across multiple evaluation metrics, achieving an overall mAP@50 of 0.92 and mAP@50-95 of 0.65, with precision and recall reaching 0.95 and 0.92, respectively. Compared to traditional attention mechanisms and loss functions, the proposed method significantly improves both detection accuracy and stability, providing reliable technical support for precision agriculture.

Keywords:

maize trait recognition; object detection in agriculture; digital agriculture; machine learning

1. Introduction

Maize is one of the most important staple crops worldwide, and its yield and quality are directly related to the economic benefits of agricultural production and food security [1,2]. In modern agricultural production, maize kernel recognition technology is widely used in seed quality inspection, genetic breeding research, and precision agriculture management. Accurately and rapidly identifying the morphological characteristics of maize kernels not only assists seed companies in variety screening and quality control but also provides agricultural researchers with efficient trait analysis tools. This, in turn, optimizes breeding strategies and enhances the efficiency of elite variety selection [3,4,5]. Therefore, developing high-precision and efficient maize kernel recognition methods is crucial for advancing intelligent agriculture. Traditional maize kernel recognition methods mainly rely on manual measurements and statistical analysis, such as using vernier calipers to measure kernel length, width, and thickness or employing optical microscopes for grain morphology observation [6]. Although these methods provide high measurement accuracy, they suffer from complex operations, long processing times, and low efficiency, making them unsuitable for large-scale seed testing and breeding experiments. Additionally, some studies have applied computer vision methods based on color, shape, and texture features, extracting maize kernel morphological parameters through image-processing techniques such as edge detection, morphological operations, and grayscale histogram analysis [7,8,9]. While these methods reduce human intervention, they are highly sensitive to lighting conditions and background complexity. Moreover, in cases of kernel adhesion, occlusion, or high varietal diversity, the detection accuracy significantly decreases.

Related research has evolved from traditional thresholding methods to deep learning and, more recently, to Transformer-based architectures. Initially, maize trait measurement relied on low-level features such as color, shape, and texture, using fixed or adaptive thresholds to segment target regions. These methods were simple but lacked adaptability in complex backgrounds [10,11,12,13]. Subsequently, Convolutional Neural Networks (CNNs) gradually replaced traditional approaches by enabling automatic feature extraction, significantly improving recognition accuracy and robustness. CNNs have shown strong performance in tasks such as object detection, segmentation, and variety classification, albeit at the cost of high computational demands [14]. In recent years, Transformer-based models, particularly the DEtection TRansformer (DETR) architecture, have emerged as a promising alternative. By introducing the multi-head self-attention mechanism, these models effectively capture long-range dependencies and overcome the limitations of CNNs’ local receptive fields, demonstrating superior performance in handling occlusion and adhesion challenges [15].

The Faster Region-based Convolutional Neural Network (Faster R-CNN), as a two-stage object detection method, first generates candidate regions and then performs precise classification and regression, achieving high detection accuracy [16]. Zhang et al. [17] described a rice panicle detection model based on an improved Faster R-CNN. Experimental results showed that the model achieved a mean average precision (mAP) of 92.47%, a substantial improvement over the original Faster R-CNN model (mAP of 40.96%). However, the method incurs high computational costs, making real-time applications challenging. The YOLO series, as a typical single-stage detector, achieves efficient detection through an end-to-end framework, balancing detection speed and accuracy [18]. Yang et al. [19] proposed a new high-accuracy and real-time maize pest and disease detection method called Maize-YOLO. Experimental results demonstrated that this method outperformed existing state-of-the-art YOLO-based object detection algorithms, achieving 76.3% mAP and 77.3% recall. Xia et al. [20] developed a maize seed surface mold detection model using the YOLOv5s deep learning algorithm based on machine vision technology. They further enhanced its portability for deployment on mobile devices, ultimately developing the improved YOLOv5s-ShuffleNet-CBAM model, which achieved an mAP50 value of 0.955. Yang et al. [21] designed a maize kernel detection and recognition model named “YOLOv7-MEF”. Experimental results showed that the improved algorithm achieved an accuracy of 98.94%, a recall of 96.42%, and a frame rate of 76.92 FPS. However, due to the small size and similar morphology of maize kernels, existing object detection models still face challenges in high-density small-object detection, particularly under kernel adhesion and partial occlusion conditions, where misdetections or missed detections frequently occur.

To address these issues, this study proposes a maize kernel detection method based on a neighborhood attention mechanism and neighborhood loss. First, during the feature extraction phase, the proposed method introduces a neighborhood attention mechanism, ensuring stable feature representation across different scales of kernel detection tasks. Additionally, to further optimize the spatial consistency of detection results, this study introduces neighborhood loss, which constructs a local region constraint loss, ensuring that the detection results of adjacent kernels are more consistent, effectively reducing misdetections and missed detections. The main contributions of this study are as follows:

Neighborhood attention mechanism: The neighborhood attention mechanism proposed in this study attends only to features within a local neighborhood and performs local self-attention through sliding windows, improving the model’s ability to capture local features in maize kernel detection tasks.
Neighborhood loss to optimize spatial consistency among objects: This study designs a neighborhood loss that introduces local region constraints, ensuring that adjacent objects have more consistent feature distributions, thereby reducing misdetections and missed detections.
Combining the advantages of Transformer and YOLO architectures: The proposed method integrates the global feature extraction capability of Transformer models with the rapid object detection characteristics of YOLO structures.

This study ensures high detection performance while optimizing computational complexity, allowing the method to operate efficiently in resource-constrained environments. Future research will further enhance the model’s lightweight characteristics and deploy and test it in real agricultural environments at the Jiangsu Academy of Agricultural Sciences, Germplasm Resources and Biotechnology Research Institute. This will validate its effectiveness in large-scale agricultural applications, improving the efficiency and reliability of automated seed quality inspection.

2. Materials and Methods

2.1. Data Collection

The data collection for this study was primarily conducted at the experimental field of the Jiangsu Academy of Agricultural Sciences, located in Nanjing, China. Additionally, publicly available online datasets were incorporated to ensure diversity and broad applicability. A high-resolution RGB camera, Canon EOS 5D Mark IV (Canon Inc., Tokyo, Japan), equipped with a 50 mm prime lens, was used for image acquisition to ensure clarity and detailed feature representation. The camera was mounted on an adjustable-height tripod to maintain stability across different shooting angles, as shown in Figure 1.

The shutter speed was set to 1/1000 s to prevent motion blur, and the aperture was configured at f/8 to achieve an optimal depth of field. It should be noted that image-processing results can be significantly influenced by several image acquisition parameters, including focal length, color depth, local magnification, object rotation, aperture, and threshold values. In this study, to ensure consistency and reproducibility, we employed a fixed imaging configuration consisting of a 50 mm lens, 1/1000 s shutter speed, and f/8 aperture. While this setup provides a controlled environment for evaluating the model, we acknowledge that such consistency may restrict the evaluation of robustness across more diverse imaging scenarios. Therefore, as part of our future work, we intend to explore additional experiments involving varied acquisition settings to assess the generalization and adaptability of the proposed method under complex real-world conditions.

To enhance consistency across images, automatic white balance mode was enabled, and multiple captures were taken at different times of the day to accommodate varying natural light conditions. Additionally, a multi-angle imaging strategy was employed to obtain comprehensive maize ear images, including top-view, side-view, and 45-degree oblique-view perspectives, facilitating subsequent object detection and morphological analysis. The data collection process followed a standardized protocol. Representative maize varieties were selected, including both conventional and specialty maize, such as high-anthocyanin and high-carotenoid varieties, resulting in a total of 1672 images. Furthermore, images of maize kernels with impurities and defects, including damaged and moldy kernels, were collected to enhance the model’s adaptability to complex environments. During image acquisition, special attention was given to key maize kernel traits, including ear length, ear diameter, kernel row number, kernel size, and color characteristics.

2.2. Dataset Annotation and Augmentation

The primary objective of dataset annotation is to provide accurate bounding boxes and class labels, ensuring that the supervised learning model correctly identifies and classifies maize kernels during training. In object detection tasks, each sample must be annotated with its target position and class information, including the bounding box coordinates (x, y, w, h), where (x, y) represents the center coordinates of the target, and w and h denote the width and height of the object. The annotation process typically involves image loading, bounding box drawing, and data format conversion. Given an input image I, the annotator manually selects the target region B_{i} and records its bounding box coordinates:

B_{i} = (x_{i}, y_{i}, w_{i}, h_{i}),

(1)

where i indexes the i-th target. In object detection models, normalized coordinates are usually used for training to mitigate the impact of image resolution on the model’s performance. The normalization process is as follows:

x_{i}^{'} = \frac{x_{i}}{W}, y_{i}^{'} = \frac{y_{i}}{H}, w_{i}^{'} = \frac{w_{i}}{W}, h_{i}^{'} = \frac{h_{i}}{H},

(2)

where W and H represent the width and height of the image, and (x_{i}^{'}, y_{i}^{'}, w_{i}^{'}, h_{i}^{'}) are the normalized bounding box coordinates.
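The normalization in Equation (2) and the one-hot labels described next can be illustrated with a minimal Python sketch. The function names and the example image size are hypothetical, chosen only for illustration, and the class index is zero-based here.

```python
def normalize_box(box, img_w, img_h):
    """Apply Eq. (2): divide the centre and size of a box by the image width and height."""
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image dimensions must be positive")
    x, y, w, h = box
    return (x / img_w, y / img_h, w / img_w, h / img_h)


def one_hot(class_index, num_classes):
    """Build a one-hot class vector of length K (zero-based class index)."""
    if not 0 <= class_index < num_classes:
        raise ValueError("class index out of range")
    return [1 if k == class_index else 0 for k in range(num_classes)]


# Hypothetical annotation: a box centred in a 1920 x 1080 image.
box = (960.0, 540.0, 192.0, 108.0)
print(normalize_box(box, 1920, 1080))  # (0.5, 0.5, 0.1, 0.1)
print(one_hot(2, 4))                   # [0, 0, 1, 0]
```

Because the normalized values are ratios, the same annotation remains valid if the image is later resized uniformly.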

During the annotation process, target class labels

c_{i}

must also be defined, typically represented using one-hot encoding:

c_{i} = [c_{1}, c_{2}, \dots, c_{K}],

(3)

where K is the total number of object classes, and

c_{k} = 1

indicates that the target belongs to class k, otherwise

c_{k} = 0

. During model training, the class labels are input into the network along with the bounding boxes for optimization. To enhance the model’s robustness and improve its adaptability to different lighting conditions, shapes, and backgrounds, various data augmentation methods are applied in this study, such as Mosaic, CutMix, and GridMask. The Mosaic augmentation method stitches together four different images, increasing background diversity and improving the model’s ability to adapt to complex environments. Given four original images

I_{1}, I_{2}, I_{3}, I_{4}

, a random stitching center point

(x_{m}, y_{m})

is first selected in the image space, and then the four sub-regions are partitioned as follows:

I_{new} (x, y) = \begin{cases} I_{1} (x, y), & x \leq x_{m}, y \leq y_{m} \\ I_{2} (x - x_{m}, y), & x > x_{m}, y \leq y_{m} \\ I_{3} (x, y - y_{m}), & x \leq x_{m}, y > y_{m} \\ I_{4} (x - x_{m}, y - y_{m}), & x > x_{m}, y > y_{m} \end{cases}.

(4)

Simultaneously, the bounding box coordinates must be adjusted accordingly to ensure alignment with the transformed image positions. The CutMix augmentation method enhances model robustness by randomly selecting a rectangular region in one image and replacing it with the corresponding region from another image, thereby simulating partial occlusion of the target. Given two original images I_{A} and I_{B}, a random rectangular region R = (x_{1}, y_{1}, x_{2}, y_{2}) is selected in I_{A} and replaced with the corresponding region from I_{B}:

I_{new} (x, y) = \begin{cases} I_{A} (x, y), & (x, y) \notin R \\ I_{B} (x, y), & (x, y) \in R \end{cases}.

(5)

The size of the rectangular region R is controlled by the hyperparameter λ, which is typically sampled from a Beta distribution Beta (α, α):

λ \sim Beta (α, α), A_{R} = W H (1 - λ),

(6)

where A_{R} represents the area of the replaced region, and W, H are the image width and height. Under this convention, λ is the fraction of I_{A} that remains visible, which matches the label weights below. To ensure the consistency of class labels, the new sample’s class label is computed as a weighted average:

y_{n e w} = λ y_{A} + (1 - λ) y_{B} .

(7)

This strategy ensures that the model retains good recognition capability even when encountering partially occluded targets. The GridMask augmentation method overlays regularly or randomly distributed grid occlusions on the image, encouraging the model to focus on more discriminative global features during training. Given an input image I, GridMask defines a grid structure M and applies occlusion based on a predefined ratio r:

I_{n e w} (x, y) = I (x, y) \cdot M (x, y),

(8)

where M (x, y) is defined as

M (x, y) = \begin{cases} 0, & (x \bmod d < r d) \text{ and } (y \bmod d < r d) \\ 1, & \text{otherwise} \end{cases}.

(9)

Here, d represents the grid cycle, and r controls the size of the occlusion region. When r is small, the occlusion region is minimal, having little impact on the image; when r is large, more areas are occluded, requiring the model to rely on stronger global feature extraction capabilities. By increasing local information loss, GridMask improves model generalization, ensuring that it maintains good recognition performance even when some target regions are missing or occluded. In the maize kernel recognition task, these data augmentation methods effectively enhance the model’s generalization ability, allowing it to stably detect targets even under complex backgrounds and varying lighting conditions.
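The CutMix and GridMask rules can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation. It follows Equation (7) in treating λ as the label weight of I_A, so the pasted rectangle is assumed to cover a share 1 − λ of the image. The function names and the square-root sizing of the rectangle are assumptions taken from common CutMix practice.

```python
import math


def cutmix_box(img_w, img_h, lam, cx, cy):
    """Return the clipped paste rectangle (x1, y1, x2, y2) and the adjusted weight of image A.

    Assumption: lam is the share of image A that stays visible (Eq. 7),
    so the rectangle centred at (cx, cy) covers roughly (1 - lam) of the image.
    """
    cut_w = int(img_w * math.sqrt(1.0 - lam))
    cut_h = int(img_h * math.sqrt(1.0 - lam))
    x1, y1 = max(cx - cut_w // 2, 0), max(cy - cut_h // 2, 0)
    x2, y2 = min(cx + cut_w // 2, img_w), min(cy + cut_h // 2, img_h)
    # Recompute the weight from the clipped area so the label matches the pixels.
    lam_adj = 1.0 - (x2 - x1) * (y2 - y1) / (img_w * img_h)
    return (x1, y1, x2, y2), lam_adj


def mix_labels(y_a, y_b, lam):
    """Eq. (7): weighted average of the two label vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]


def gridmask(height, width, d, r):
    """Eq. (9): 0 inside the occluded square of each d x d cell, 1 elsewhere."""
    return [[0 if (x % d < r * d and y % d < r * d) else 1 for x in range(width)]
            for y in range(height)]


box, lam = cutmix_box(100, 100, 0.75, 50, 50)
print(box, lam)                            # (25, 25, 75, 75) 0.75
print(mix_labels([1, 0], [0, 1], lam))     # [0.75, 0.25]
print(gridmask(4, 4, 2, 0.5)[0])           # [0, 1, 0, 1]
```

Recomputing the label weight after clipping keeps the mixed label consistent with the pixels actually replaced when the rectangle overlaps the image border.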

2.3. Proposed Method

2.3.1. Maize Kernel Shape Recognition Network Based on Object Detection Algorithm

As shown in Figure 2, the maize kernel shape recognition network, based on object detection algorithms, adopts a hierarchical feature extraction architecture consisting of three core subnetworks: Base-Net, Overview-Net, and Focus-Net. These subnetworks are responsible for initial feature extraction, global feature modeling, and fine-grained feature refinement, respectively. Base-Net, serving as the backbone network, receives the input RGB image

X \in R^{3 \times H \times W}

and first maps it into a lower-dimensional feature space through an embedding layer. The extracted hierarchical features are then processed through multiple basic blocks, with progressively increasing channel dimensions

C_{1}, C_{2}, C_{3}

while the spatial resolution is gradually reduced to

H / 4 \times W / 4

,

H / 8 \times W / 8

, and

H / 16 \times W / 16

. The structure of the basic block in this stage employs an improved ResNet module, mathematically represented as

Y = F (X, W) + X,

(10)

where

F (X, W)

represents the transformation mapping of convolutional layers combined with normalization and activation functions. The residual connection ensures stable gradient propagation, thereby improving network training efficiency. The output of Base-Net is used as high-level features, which are then forwarded to both Overview-Net and Focus-Net. Overview-Net is designed to construct global perceptual features, enhancing the expression of semantic information through additional basic blocks. The output feature map has a size of

H / 32 \times W / 32

with an expanded channel dimension of

C_{4}

. Furthermore, Overview-Net generates global contextual prior information, denoted as Context Prior, which is transmitted to Focus-Net via the Context Guidance Flow (represented by red arrows). This process enhances the semantic consistency of local details and is mathematically formulated as

P = G (Y),

(11)

where P represents the global prior information, and G (\cdot) denotes a nonlinear mapping function based on global average pooling and a multi-layer perceptron (MLP). Focus-Net is primarily responsible for fine-grained feature enhancement, addressing the challenges associated with detecting small objects. To accommodate variations in maize kernel shapes across different scales, this module incorporates dynamic blocks that utilize an adaptive receptive field strategy. Initially, Focus-Net receives feature maps of size H / 16 \times W / 16 from Base-Net, which are further processed through an embedding layer for dimensionality reduction. Dynamic blocks are then introduced for feature enhancement, with their core computation defined as

Z = A V,

(12)

where A represents the neighborhood attention weight matrix, and V denotes the value projection of the input features. The weight matrix A is computed adaptively based on regional information, leveraging global features to guide local attention aggregation. This mechanism enables the network to effectively capture maize kernel shapes under occlusion conditions. Finally, the output features of Focus-Net are transmitted back to Overview-Net to reinforce global background perception, ensuring robust object detection predictions. This architectural design provides multiple advantages. Base-Net establishes stable low-level features to retain fundamental morphological information, Overview-Net enhances global contextual modeling to improve shape perception, and Focus-Net specializes in precise small-object detection. This structure improves both the accuracy and robustness of maize kernel detection. Additionally, the introduction of Context Prior and Context Guidance Flow effectively establishes feature associations between global and local regions, further optimizing the recognition performance of maize kernel shapes. Compared to conventional single-object detection models, the proposed method achieves a balance between computational efficiency and detection accuracy, particularly enhancing the capability to detect small objects. This makes it well suited for precision agricultural tasks such as maize kernel shape recognition.

2.3.2. Neighborhood Attention Mechanism

As shown in Figure 3, the neighborhood attention mechanism improves on the computational efficiency and feature extraction capabilities of traditional self-attention by integrating local perception with global information interaction, demonstrating superior performance in object detection tasks. The computational complexity of self-attention is O (N^{2}), where N is the number of spatial positions in the input feature map, leading to excessive computational costs when processing high-resolution inputs. In contrast, the neighborhood attention mechanism reduces the complexity to O (N) for a fixed window size through a local window partitioning strategy, improving efficiency while preserving global contextual information. In this study, the core idea of the mechanism involves segmenting the input features into localized regions, computing feature similarity within each window, and dynamically adjusting representations using global perceptual information to enhance the robustness and accuracy of object detection. Adaptive pooling is then applied to the key projection K to reduce its dimensionality, lowering the computational complexity of attention while retaining the effectiveness of global features. During the computation of neighborhood attention, a global perceptual kernel is introduced to refine the locally computed results, thereby preserving comprehensive information. The output of the neighborhood attention mechanism is fused with the input features via a residual connection to obtain the final output Y:

Y = X + Z .

(13)

In the maize kernel shape recognition network, the neighborhood attention mechanism is integrated into the feature extraction module of the object detection backbone network. The specific structure consists of three main stages: the initial feature extraction layer, the local attention computation layer, and the global fusion layer. The initial feature extraction layer employs a modified ResNet as the backbone network, with channel dimensions set to

64, 128, 256, 512

, and spatial resolution progressively reduced from

H \times W

to

H / 16 \times W / 16

. The local attention computation layer utilizes a window partitioning strategy with a window size of

7 \times 7

, where neighborhood attention is computed within each window, while inter-window information is refined using a global perceptual kernel. Finally, the global fusion layer applies a

3 \times 3

normalization convolution to merge all neighborhood attention-enhanced features, improving the robustness of shape detection. This architectural design offers significant advantages, as it enables precise capturing of maize kernel morphological details while mitigating the high computational complexity associated with self-attention on large-scale feature maps. Consequently, the model effectively detects kernel shapes while maintaining real-time performance and computational efficiency.
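The local attention step and the residual connection of Equation (13) can be illustrated with a simplified one-dimensional sketch in plain Python. It is not the network described above: it uses a 1D sequence instead of 7 × 7 windows, identity query, key, and value projections, and no global perceptual kernel. All of these are simplifying assumptions, and the function name is hypothetical.

```python
import math


def neighborhood_attention_1d(x, radius):
    """Each position attends only to positions within `radius`, then adds a residual.

    x: list of N feature vectors (lists of floats).
    Assumption: queries, keys and values are the inputs themselves.
    """
    n, dim = len(x), len(x[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        # Scaled dot-product scores against the local neighbourhood only.
        scores = [scale * sum(a * b for a, b in zip(x[i], x[j])) for j in range(lo, hi)]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(weights)
        z = [sum(w / total * x[j][c] for w, j in zip(weights, range(lo, hi)))
             for c in range(dim)]
        out.append([xi + zi for xi, zi in zip(x[i], z)])  # Y = X + Z
    return out


features = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0], [2.0, 1.0]]
print(neighborhood_attention_1d(features, radius=1))
```

Each position touches at most 2 × radius + 1 neighbours, so the cost grows linearly with the number of positions for a fixed radius, which is the property that motivates the O(N) claim.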

2.3.3. Lightweight Design

The lightweight design plays a crucial role in the maize kernel shape recognition network, aiming to reduce computational complexity while maintaining detection accuracy, thereby enhancing the model’s applicability in real-world agricultural scenarios. This optimization focuses on network depth, channel configuration, and computational efficiency by reducing redundant parameters and improving inference speed. The proposed lightweight strategy is implemented in two main aspects: first, grouped convolution, depthwise separable convolution, and channel attention mechanisms are employed in Base-Net, Overview-Net, and Focus-Net to minimize computational overhead; second, pruning and low-rank decomposition are introduced to optimize weight storage and accelerate inference. In Base-Net, the input image

X \in R^{3 \times H \times W}

is mapped to a feature representation of dimension

C_{1}

through an embedding layer and subsequently downsampled across multiple stages. The hierarchical feature extraction is structured as follows:

X_{1} = f_{1} (X) \in R^{C_{1} \times \frac{H}{4} \times \frac{W}{4}},

(14)

X_{2} = f_{2} (X_{1}) \in R^{C_{2} \times \frac{H}{8} \times \frac{W}{8}},

(15)

X_{3} = f_{3} (X_{2}) \in R^{C_{3} \times \frac{H}{16} \times \frac{W}{16}},

(16)

where

f_{i} (\cdot)

represents different convolution transformations. To optimize efficiency,

f_{1}, f_{2}, f_{3}

adopt depthwise separable convolution, reducing the computational complexity from that of a standard convolution:

O (C_{in} \times C_{out} \times K^{2} \times H \times W),

(17)

to

O (C_{in} \times K^{2} \times H \times W + C_{in} \times C_{out} \times H \times W),

(18)

significantly decreasing the computational load while preserving feature extraction capability. In Overview-Net, features are further compressed to dimension

C_{4}

. The primary optimization in this module is the incorporation of a channel attention mechanism, which enhances computational efficiency by dynamically re-weighting feature channels. This mechanism utilizes global average pooling (GAP) to compute channel-wise importance, followed by a fully connected network to generate attention scores:

s = σ (W_{2} δ (W_{1} GAP (X))),

(19)

where

GAP (\cdot)

represents the global average pooling operation,

W_{1}

and

W_{2}

are the weights of fully connected layers,

δ (\cdot)

denotes the ReLU activation function, and

σ (\cdot)

represents the Sigmoid activation function. The computed channel weights are subsequently applied to adjust the input features:

X^{'} = s \cdot X .

(20)

This approach effectively suppresses redundant channels, improves computational efficiency, and ensures that critical features receive higher attention. In Focus-Net, a dynamic receptive field adjustment strategy is adopted, allowing each dynamic block to select convolution kernels of varying sizes based on input features instead of using fixed

3 \times 3

or

5 \times 5

kernel structures. This method is implemented using deformable convolution, mathematically expressed as

y (p) = \sum_{p_{i} \in R} w (p_{i}) \cdot x (p + p_{i} + Δ p_{i}),

(21)

where p denotes the current pixel location, R represents the set of sampling positions in the convolution window, p_{i} is the i-th position in R, and Δ p_{i} is the learned offset. This mechanism allows the convolution kernel to dynamically adjust sampling points based on local variations in the input features, enhancing detection accuracy while reducing redundant computations. The lightweight nature of Focus-Net is further reflected in channel control, where unnecessary intermediate feature channels are reduced to lower overall computational complexity while retaining sufficient feature representation capacity. The combination of depthwise separable convolution, channel attention, and deformable convolution enables the network to efficiently capture maize kernel shape details while maintaining real-time inference performance and computational efficiency.
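The saving from depthwise separable convolution in Equations (17) and (18) can be checked with a short calculation. The layer sizes below are hypothetical values chosen for illustration, not figures reported in this study.

```python
def standard_conv_cost(c_in, c_out, k, h, w):
    """Multiply-accumulate count of a standard convolution, Eq. (17)."""
    return c_in * c_out * k * k * h * w


def depthwise_separable_cost(c_in, c_out, k, h, w):
    """Eq. (18): depthwise k x k pass plus 1 x 1 pointwise pass."""
    return c_in * k * k * h * w + c_in * c_out * h * w


# Hypothetical layer: 64 -> 128 channels, 3 x 3 kernel, 160 x 160 output map.
c_in, c_out, k, h, w = 64, 128, 3, 160, 160
ratio = depthwise_separable_cost(c_in, c_out, k, h, w) / standard_conv_cost(c_in, c_out, k, h, w)
print(f"separable / standard = {ratio:.4f}")  # algebraically 1/c_out + 1/k**2
```

Dividing Equation (18) by Equation (17) gives 1/C_out + 1/K^2, so for a 3 × 3 kernel the separable form needs roughly one ninth of the operations once C_out is large.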

2.3.4. Neighborhood Loss

As shown in Figure 4, neighborhood loss is introduced in this study to address the balance between local and global features in object detection, particularly in maize kernel recognition tasks. Due to the small scale of the targets and the clustering of kernels with similar morphology within local regions, traditional loss functions often fail to effectively capture spatial relationships. Conventional loss functions such as mean squared error (MSE) and cross-entropy (CE) primarily focus on the prediction accuracy of individual pixels or objects while disregarding the similarity between adjacent regions. This leads to instability in predictions, particularly at object boundaries. To mitigate this issue, neighborhood loss incorporates local structural constraints, enforcing correlation between neighboring pixels or objects to enhance spatial consistency in detection results. Specifically, given the predicted output

\hat{Y}

and the ground truth Y, the traditional cross-entropy loss is formulated as

L_{CE} = - \sum_{i} [ Y_{i} \log {\hat{Y}}_{i} + (1 - Y_{i}) \log (1 - {\hat{Y}}_{i}) ] .

(22)

This function considers only individual point-wise predictions. In contrast, neighborhood loss introduces local region constraints, ensuring stronger semantic consistency among adjacent pixels within the target area. The neighborhood loss is defined as

L_{neighbor} = \sum_{i} \sum_{j \in N (i)} w_{i, j} \cdot d ({\hat{Y}}_{i}, {\hat{Y}}_{j}),

(23)

where

N (i)

denotes the neighborhood set of pixel i,

d (\cdot, \cdot)

represents a distance metric (e.g., Euclidean distance) between two predicted values, and

w_{i, j}

is a weighting coefficient that adjusts the contribution of different neighboring regions. To further improve boundary prediction in detection tasks, a boundary-aware loss is introduced, ensuring smoother transitions in boundary areas. The boundary-aware loss is formulated as

L_{boundary} = \sum_{i \in Ω} ( |\nabla_{x} {\hat{Y}}_{i} - \nabla_{x} Y_{i}| + |\nabla_{y} {\hat{Y}}_{i} - \nabla_{y} Y_{i}| ),

(24)

where

Ω

represents the set of boundary points in the target region, and

\nabla_{x}

and

\nabla_{y}

denote gradient computations in the x and y directions, respectively. The final form of neighborhood loss is derived by integrating the cross-entropy loss, neighborhood consistency loss, and boundary-aware loss as follows:

L_{final} = L_{CE} + λ_{1} L_{neighbor} + λ_{2} L_{boundary},

(25)

where

λ_{1}

and

λ_{2}

are balancing coefficients that control the relative contributions of the different loss components. This loss function design provides significant advantages in maize kernel detection. First, it ensures consistency in detecting adjacent kernels, reducing noise interference caused by illumination variations and occlusions. Second, the boundary-aware loss enhances the accuracy of object localization, enabling the model to distinguish adjacent kernels more effectively, thereby reducing misclassification and missed detections. Lastly, the improved robustness of the proposed loss function ensures better generalization across different maize varieties and kernel morphologies. Compared to conventional loss functions, neighborhood loss integrates local similarity constraints and boundary optimization, achieving high detection accuracy while improving stability, making maize kernel shape recognition more precise and reliable.
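The three terms of Equation (25) can be sketched on a small 2D prediction map in plain Python. Several choices here are assumptions made only for illustration: a 4-connected neighborhood, uniform weights w_{i,j}, absolute difference as the distance d, forward differences for the gradients, Ω taken as the pixels where the ground-truth gradient is non-zero, and λ_1 = λ_2 = 0.1.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-7):
    """Eq. (22): binary cross-entropy summed over all pixels."""
    total = 0.0
    for row_t, row_p in zip(y_true, y_pred):
        for t, p in zip(row_t, row_p):
            p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
            total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total


def neighbor_term(y_pred, weight=1.0):
    """Eq. (23) with a 4-connected neighbourhood and uniform weights (assumptions)."""
    rows, cols = len(y_pred), len(y_pred[0])
    total = 0.0
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    total += weight * abs(y_pred[r][c] - y_pred[rr][cc])
    return total


def _grad(img, r, c):
    """Forward differences in x and y, zero at the last column or row."""
    gx = img[r][c + 1] - img[r][c] if c + 1 < len(img[0]) else 0.0
    gy = img[r + 1][c] - img[r][c] if r + 1 < len(img) else 0.0
    return gx, gy


def boundary_term(y_true, y_pred):
    """Eq. (24), summed over pixels where the ground-truth gradient is non-zero."""
    total = 0.0
    for r in range(len(y_true)):
        for c in range(len(y_true[0])):
            tx, ty = _grad(y_true, r, c)
            if tx == 0 and ty == 0:
                continue  # not a boundary pixel under this assumption
            px, py = _grad(y_pred, r, c)
            total += abs(px - tx) + abs(py - ty)
    return total


def final_loss(y_true, y_pred, lam1=0.1, lam2=0.1):
    """Eq. (25) with hypothetical balancing coefficients."""
    return (cross_entropy(y_true, y_pred)
            + lam1 * neighbor_term(y_pred)
            + lam2 * boundary_term(y_true, y_pred))


truth = [[0.0, 1.0], [0.0, 1.0]]
flat = [[0.5, 0.5], [0.5, 0.5]]
print(final_loss(truth, flat))
```

With uniform weights the neighbor term also penalizes genuine edges between kernels, so in practice the weights w_{i,j} would need to be reduced across true object boundaries.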

2.4. Experimental Setup

2.4.1. Hardware and Software Configuration and Hyperparameters

The experiments in this study were conducted on a high-performance computing server. The computing platform utilizes an NVIDIA Tesla A100 GPU with a memory capacity of 40 GB to ensure efficient deep learning model training and inference. Additionally, the server is equipped with two AMD EPYC 7742 processors, each with a clock speed of 2.25 GHz, totaling 128 cores, and is configured with 1 TB of memory to support large-scale data loading and parallel computing. The storage system employs NVMe SSDs, providing high-speed data read and write capabilities to enhance training data access efficiency. The overall hardware environment is designed to ensure the efficient execution of the maize kernel recognition task, particularly in handling high-resolution images and complex network models by providing ample computational resources.

In terms of software environment, this experiment uses Ubuntu 20.04 as the operating system, along with CUDA 11.8 and cuDNN 8.6 to fully leverage GPU acceleration for the training process. The deep learning framework utilized is PyTorch 1.13.1, in combination with Torchvision 0.14.1 for data preprocessing and object detection tasks. To enhance data processing efficiency, the NVIDIA DALI library is employed to optimize the data preprocessing pipeline. The experimental code is implemented in Python 3.9, with environment management handled by Anaconda to ensure reproducibility across different experimental phases. Additionally, the model training process is managed using Weights and Biases (WandB) for logging and experiment monitoring, allowing for tracking of training progress and hyperparameter tuning effects.

Regarding hyperparameter settings, the optimizer used is AdamW with an initial learning rate of α = 1 \times 10^{-4}, dynamically adjusted using the cosine annealing strategy to improve model convergence speed and avoid local optima. The batch size is set to 32, and each training epoch iterates 100 times to ensure that the model fully learns the data distribution. The weight decay coefficient is set to 1 \times 10^{-4} to suppress overfitting and enhance model generalization ability. The momentum parameters β_{1} and β_{2} are set to 0.9 and 0.999, respectively, to optimize the gradient update process. Furthermore, mixed precision training is adopted to reduce GPU memory usage and improve computational efficiency. During training, five-fold cross-validation is used to evaluate model stability and generalization performance under different data splits.
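To make the cosine annealing strategy concrete, the sketch below evaluates the standard schedule without restarts, starting from the stated initial learning rate of 1 \times 10^{-4}. The total number of steps (100) and the minimum learning rate (0) are assumptions for illustration, as neither is specified in the text; the same formula underlies PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR`.

```python
import math


def cosine_annealing_lr(step, total_steps, lr_max=1e-4, lr_min=0.0):
    """Learning rate at `step` under cosine annealing without restarts.

    lr_max matches the stated initial learning rate; total_steps and lr_min
    are assumed values chosen only for this illustration.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = min(max(step, 0), total_steps)  # clamp to the schedule range
    cos_term = math.cos(math.pi * step / total_steps)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cos_term)


# One learning rate per step, from the initial value down to lr_min.
schedule = [cosine_annealing_lr(s, total_steps=100) for s in range(101)]
```

The rate decays slowly at first, fastest around the midpoint, and flattens again near the end of training.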

2.4.2. Dataset Partitioning and k-Fold Cross-Validation

The dataset in this study is divided into training, validation, and test sets in a ratio of 7:2:1. The training set is used for learning model parameters, the validation set is used for hyperparameter tuning, and the test set is employed to evaluate the final generalization performance of the model. To ensure model stability across different data splits and to minimize the impact of data distribution bias on experimental results, k-fold cross-validation is employed, where k is set to either 5 or 10, depending on the dataset size and computational resources. In the cross-validation process, the dataset is equally divided into k subsets, and in each round, k - 1 subsets are used as the training set, while the remaining subset serves as the validation set. This process is repeated for k rounds to ensure that the model fully utilizes the available data and achieves better generalization ability. This partitioning strategy effectively improves model robustness, ensuring high recognition accuracy even when facing different maize kernel varieties or varying lighting and background conditions, while also preventing overfitting or underfitting due to uneven data splits.
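The k-fold procedure described above can be sketched in a few lines. This is a minimal illustration using only sample indices; the function name `k_fold_splits`, the shuffling seed, and the handling of datasets whose size is not divisible by k are assumptions rather than details taken from the study.

```python
import random


def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train_indices, val_indices) for each of the k rounds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must satisfy 2 <= k <= n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed for repeatable splits

    # Fold sizes differ by at most one when n_samples is not divisible by k.
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(indices[start:start + size])
        start += size

    # Each fold serves as the validation set exactly once.
    for i in range(k):
        val = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val


splits = list(k_fold_splits(23, k=5))
```

Every sample appears in exactly one validation fold and in the training set of the other k - 1 rounds.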

2.4.3. Evaluation Metrics

In maize kernel recognition tasks, model performance evaluation is crucial. The primary metrics used include accuracy, recall, precision, and mAP to assess detection effectiveness. Accuracy measures the overall correctness of classification and is calculated as the proportion of correctly predicted samples among all samples. Recall reflects the model’s ability to detect all true targets, where a higher value indicates a lower miss rate, making it suitable for scenarios that require high detection completeness. Precision represents the proportion of correctly predicted positive samples among all predicted positives, where a higher value indicates a lower false detection rate. In object detection tasks, the commonly used mAP metric evaluates the accuracy of object localization and classification comprehensively. Specifically, mAP@50 refers to the mean average precision when the Intersection over Union (IoU) threshold is set to 0.5, while mAP@50-95 averages results over IoU thresholds from 0.5 to 0.95 in increments of 0.05, providing a more comprehensive evaluation of model performance under varying detection stringencies. The mathematical formulations of these metrics are as follows:

Accuracy = \frac{TP + TN}{TP + TN + FP + FN},

(26)

Recall = \frac{TP}{TP + FN},

(27)

Precision = \frac{TP}{TP + FP},

(28)

IoU = \frac{|B_{p} \cap B_{g}|}{|B_{p} \cup B_{g}|},

(29)

AP = \int_{0}^{1} P(R) \, dR,

(30)

mAP@50 = \frac{1}{N} \sum_{i=1}^{N} AP_{i},

(31)

mAP@50-95 = \frac{1}{10} \sum_{k=0}^{9} mAP@(0.5 + 0.05k).

(32)

Here, True Positive (TP) represents the number of correctly detected targets, True Negative (TN) denotes the number of correctly identified non-target samples, False Positive (FP) represents the number of false detections, and False Negative (FN) represents the number of missed targets. B_{p} represents the predicted bounding box, while B_{g} is the ground-truth bounding box, with ∩ indicating the intersection area and ∪ representing the union area. P(R) denotes precision as a function of recall (the precision-recall curve), N is the total number of classes, and AP_{i} represents the average precision of class i. Overall, these evaluation metrics allow for a multi-dimensional assessment of maize kernel detection model performance and provide direction for model optimization.
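The following sketch shows how Equations (27) to (32) might be computed in practice. It is a simplified illustration, not the evaluation code used in the study: boxes are assumed to be axis-aligned tuples `(x1, y1, x2, y2)`, the integral in Equation (30) is approximated by a rectangular sum over sampled precision-recall points (benchmark toolkits typically use an interpolated variant), and all function names are hypothetical.

```python
def box_iou(box_p, box_g):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2), Eq. (29)."""
    ix1, iy1 = max(box_p[0], box_g[0]), max(box_p[1], box_g[1])
    ix2, iy2 = min(box_p[2], box_g[2]), min(box_p[3], box_g[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_g = (box_g[2] - box_g[0]) * (box_g[3] - box_g[1])
    union = area_p + area_g - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from detection counts, Eqs. (27) and (28)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Rectangular approximation of Eq. (30); recalls must be ascending."""
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_over_thresholds(map_values):
    """Eq. (32): average of mAP at the ten IoU thresholds 0.50, ..., 0.95."""
    if len(map_values) != 10:
        raise ValueError("expected one mAP value per IoU threshold (10 total)")
    return sum(map_values) / 10
```

A detection counts as a true positive at mAP@50 when `box_iou` with its matched ground-truth box is at least 0.5; the stricter thresholds in Equation (32) progressively reject loosely localized boxes.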

2.5. Baseline

In this study, SSD [22], RetinaNet [23], YOLOv10 [24], YOLOv11 [25], Faster R-CNN [26], and DETR [27] were selected as baseline models due to their widespread application in object detection and their strong performance in terms of accuracy, speed, and robustness. SSD is a single-stage detector that predicts at multiple feature map scales, enabling it to detect objects of different sizes efficiently, making it suitable for real-time detection tasks. RetinaNet incorporates a Feature Pyramid Network (FPN) to enhance small-object detection capabilities and introduces focal loss to mitigate class imbalance issues, improving recognition accuracy for hard-to-detect objects. YOLOv10 and YOLOv11, as representatives of the YOLO series, employ an anchor-free structure and optimized loss functions to significantly improve detection speed while maintaining high accuracy, making them particularly suitable for resource-constrained scenarios. Faster R-CNN, a classic two-stage detector, employs a region proposal network (RPN) to generate candidate boxes and utilizes RoI Pooling for refined predictions. Although computationally intensive, it achieves high detection accuracy, making it well suited to tasks requiring precise object localization. DETR employs a Transformer architecture with self-attention mechanisms to model long-range dependencies, effectively handling occlusions and complex backgrounds. However, it has a high computational complexity and requires longer training times. The selection of these baseline models facilitates a comprehensive evaluation of different detection methods in maize kernel recognition, analyzing their advantages and disadvantages across various scenarios. This provides a solid benchmark for comparing the proposed method, ensuring the validity and practicality of model improvements.

3. Results and Analysis

3.1. Maize Trait Recognition

This experiment aims to evaluate the performance of different object detection models in maize trait recognition, assessing their precision, recall, and overall detection capability in complex agricultural environments. Maize trait recognition involves detecting various kernel and ear characteristics, posing challenges such as small-object detection, occlusion, and variations in lighting conditions. Therefore, multiple state-of-the-art object detection models were selected for comparative analysis, as shown in Table 1.

The experimental results demonstrate that DETR, based on the Transformer architecture, along with the YOLO series models, particularly YOLOv10, YOLOv11, and the proposed method, outperform traditional CNN-based object detection models such as Faster R-CNN, SSD, and RetinaNet across all metrics. In terms of precision and recall, Faster R-CNN and SSD exhibit relatively lower performance. For Faster R-CNN, this can be attributed to its region proposal-based detection approach, which may result in a higher number of false positives and false negatives when dealing with complex backgrounds, whereas SSD, being a single-stage detector, shows lower accuracy in small-object detection. RetinaNet, incorporating focal loss for loss adjustment, achieves better performance than Faster R-CNN and SSD. However, its effectiveness in high-density object detection remains limited. DETR improves recall and mAP@50-95 through its global relationship modeling capabilities using the Transformer framework, making it particularly effective in detecting occluded objects. However, its reliance on global attention computation results in relatively lower detection efficiency in practical applications. Among YOLO-based models, YOLOv10 and YOLOv11 enhance detection accuracy, recall, and mAP through an optimized backbone structure and an efficient anchor-free mechanism. YOLOv11 further refines its attention mechanism and multi-scale feature fusion, leading to an mAP@50 of 0.90 and an mAP@50-95 of 0.64, compared with 0.88 and 0.63 for YOLOv10. This suggests that YOLO-based models achieve a balance between computational efficiency and detection performance. The proposed method achieves the highest performance across all evaluation metrics, with a precision of 0.95, recall of 0.92, mAP@50 of 0.92, and mAP@50-95 of 0.65. This advantage is attributed to the incorporation of the neighborhood attention mechanism, which effectively enhances local feature representation.
Additionally, the proposed method integrates a neighborhood loss function at the optimization level, differing from conventional cross-entropy loss by improving spatial consistency among adjacent detected objects, leading to more stable detection at object boundaries. For farmers and breeders, this advancement translates to faster and more accurate identification of maize traits like kernel size, row alignment, and defects (e.g., mold or insect damage). Traditional methods rely on manual measurements, which are time-consuming and prone to human error. By automating these tasks, the proposed method enables rapid screening of thousands of seeds or ears, directly supporting precision breeding programs and reducing post-harvest losses. For instance, uniform kernel morphology is critical for hybrid seed production, and early defect detection can prevent contamination in storage facilities.

Mathematically, the proposed approach leverages the global modeling capability of the Transformer framework while maintaining the efficiency of YOLO-based structures. This enables simultaneous capture of both global and local features, ultimately achieving superior performance in maize trait recognition tasks.

3.2. Detection Results of Different Maize Traits Using the Proposed Method

This experiment aims to evaluate the performance of the proposed neighborhood attention mechanism in different maize trait recognition tasks, including ear length, ear diameter, kernel row number, kernel size, and color characteristics. Since maize trait recognition involves complex morphological structures, and different traits exhibit significant variations in image representation, the key challenge is to optimize the detection accuracy of multiple traits within a unified framework, as shown in Table 2.

The experimental results indicate that the proposed method demonstrates high precision, recall, and mAP in all trait detection tasks, with the best performance observed in kernel size and color characteristic detection, achieving mAP@50 values of 0.95 and 0.94 and mAP@50-95 values of 0.68 and 0.67, respectively. These results confirm that the neighborhood attention mechanism effectively enhances feature extraction, improving the model’s adaptability to different traits. In contrast, the detection accuracy of ear length and ear diameter is slightly lower, with mAP@50-95 values of 0.62 and 0.64, respectively. This may be due to the greater variability of these morphological traits across different maize varieties, increasing the complexity of the detection task. Additionally, kernel row number and kernel size exhibit superior performance compared to ear-level detection tasks, indicating that the neighborhood attention mechanism is particularly advantageous in detecting small localized traits. This can be attributed to its ability to fully exploit local features while avoiding the computational burden associated with traditional global attention mechanisms. For agricultural stakeholders, the ability to automatically quantify traits like kernel row number and color characteristics addresses critical bottlenecks in seed certification and breeding. For instance, kernel row number is a key yield predictor, and manual counting is laborious and error-prone. Automating this process allows breeders to screen thousands of ears efficiently, accelerating the selection of high-yield hybrids. Similarly, precise color detection (e.g., identifying anthocyanin-rich kernels) supports the development of specialty maize varieties with enhanced nutritional or market value.

From a mathematical perspective, the core principle of the neighborhood attention mechanism is to construct efficient attention mapping within local regions, ensuring more effective information propagation among neighboring features. Traditional self-attention mechanisms rely on global relationship modeling with a computational complexity of O(N^{2}), where N is the number of spatial positions in the feature map. In contrast, the neighborhood attention mechanism restricts attention to local windows, which, for a fixed window size, reduces the computational complexity to O(N). This optimization not only decreases computational costs but also enhances the model’s ability to capture fine-grained local features. In maize trait recognition, different traits exhibit distinct spatial distribution patterns. For example, color characteristics rely on local spectral information, while kernel row number detection depends on repetitive structural patterns. The proposed method dynamically adjusts attention distribution for different tasks through the neighborhood attention mechanism, ensuring high detection accuracy across diverse traits.
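The difference between quadratic and linear scaling can be made tangible by counting the query-key pairs each scheme has to score. The sketch below is only a back-of-the-envelope illustration: the 80 × 80 feature map and the 7 × 7 window are assumed values, and every position is assumed to attend to a full window.

```python
def attention_pair_counts(height, width, window=7):
    """Number of query-key pairs scored by global vs. neighborhood attention.

    Assumes each position attends to a full window x window neighborhood;
    the window size of 7 is an illustrative choice, not a reported setting.
    """
    n = height * width                    # N: number of spatial positions
    global_pairs = n * n                  # every position attends to all others
    neighborhood_pairs = n * window * window  # fixed-size window per position
    return global_pairs, neighborhood_pairs


global_pairs, local_pairs = attention_pair_counts(80, 80, window=7)
ratio = global_pairs / local_pairs
```

For this assumed configuration, global attention scores roughly 130 times more pairs than the windowed variant, and the gap widens as the feature map grows because only the global count depends on N squared.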

Furthermore, the incorporation of neighborhood constraints in the loss function enhances spatial smoothness in detection results, which is particularly beneficial for morphological traits such as ear diameter and kernel size. The experimental results validate the superiority of the proposed method over conventional object detection approaches, particularly in tasks requiring fine-grained trait recognition, further demonstrating the effectiveness of the neighborhood attention mechanism in agricultural object detection.

3.3. Ablation Study on Different Attention Mechanisms

This experiment aims to investigate the effectiveness of different types of attention mechanisms through an ablation study, validating the proposed neighborhood attention mechanism in maize kernel trait recognition. The experiment evaluates the performance of channel attention, spatial attention, and the proposed neighborhood attention mechanism in terms of precision, recall, accuracy, and mAP metrics, as shown in Table 3.

The experimental results indicate that channel attention performs the weakest, with precision, recall, and mAP@50-95 values of 0.63, 0.60, and 0.39, respectively. This suggests that relying solely on channel information is insufficient to effectively distinguish complex maize kernel shapes. In contrast, spatial attention achieves significant improvements in detection precision and recall, reaching 0.84 and 0.80, respectively, with mAP@50-95 increasing from 0.39 to 0.57. This improvement demonstrates that spatial attention effectively enhances feature extraction across different spatial regions, improving object localization accuracy. However, the proposed neighborhood attention mechanism outperforms all other approaches, achieving an mAP@50 of 0.92 and an mAP@50-95 of 0.65. This suggests that neighborhood attention further refines feature representation, providing enhanced robustness in shape recognition tasks.

The experimental results demonstrate that, compared to conventional attention mechanisms, neighborhood attention not only effectively captures local structural information but also enhances the global feature representation of the model, leading to superior overall detection performance. From a mathematical perspective, channel attention primarily relies on global average pooling (GAP) to compute the importance of different feature channels, adjusting channel weights across the entire feature map. Since this method models only the relationships between feature channels without considering spatial distributions, its performance is limited in shape-based object detection. In contrast, spatial attention mechanisms leverage two-dimensional convolution to highlight important spatial regions. While spatial attention enhances information representation in local regions, it does not model channel-wise dependencies, resulting in performance bottlenecks.

The proposed neighborhood attention mechanism constructs a regional correlation matrix to aggregate features within a local area. The key advantage of the neighborhood attention mechanism lies in its ability to compute attention relationships within spatial neighborhoods dynamically, thereby adapting feature representations to local contexts while preserving global semantic consistency. This enables the model to balance fine-grained local feature extraction with holistic object understanding, which is crucial for detecting complex maize kernel shapes. Consequently, compared to channel attention and spatial attention, the neighborhood attention mechanism achieves superior performance in maize kernel trait recognition, significantly improving both detection accuracy and robustness. The experimental results validate the effectiveness of this approach, further demonstrating its applicability in fine-grained object detection tasks.
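To illustrate the idea of aggregating features within a local area, the sketch below implements a deliberately simplified, one-dimensional form of neighborhood attention. It is not the authors' implementation: the input features serve directly as queries, keys, and values (no learned projections), the window is truncated at the sequence borders, and the name `neighborhood_attention_1d` and the `radius` parameter are hypothetical.

```python
import math


def neighborhood_attention_1d(features, radius=1):
    """Each position attends only to positions within `radius` of itself.

    `features` is a list of equal-length float vectors. Simplifications
    (assumptions): no learned projections, 1D layout, truncated borders.
    """
    n, dim = len(features), len(features[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        neighbors = range(lo, hi)
        # Scaled dot-product similarity between position i and its neighbors.
        scores = [scale * sum(q * k for q, k in zip(features[i], features[j]))
                  for j in neighbors]
        # Numerically stable softmax over the local window only.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted aggregation of neighbor features.
        output.append([sum(w * features[j][d] for w, j in zip(weights, neighbors))
                       for d in range(dim)])
    return output
```

Because the softmax is taken over a small window rather than the whole sequence, the cost per position stays constant as the input grows, while the attention weights still adapt to the local content.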

3.4. Ablation Study on Different Loss Functions

This experiment aims to evaluate the effectiveness of different loss functions through an ablation study, validating the proposed neighborhood loss in maize kernel trait recognition. Loss functions are critical in object detection tasks, as they determine how the model optimizes predictions to achieve higher detection accuracy and stability. In this study, the performance of smooth L1 loss, IoU loss, and the proposed neighborhood loss was compared, as shown in Table 4.

The experimental results indicate that smooth L1 loss performs the weakest, with precision and recall values of 0.64 and 0.61, and an mAP@50-95 of only 0.34. This is primarily due to the fact that smooth L1 loss computes loss solely based on bounding box coordinates without sufficiently optimizing the internal distribution of regional features, resulting in limited learning capacity for object shapes and local information. In contrast, IoU loss optimizes the Intersection over Union of the predicted and ground-truth bounding boxes, ensuring a closer fit to actual targets. Consequently, IoU loss achieves significantly better performance than smooth L1 loss across all metrics, with mAP@50 and mAP@50-95 reaching 0.80 and 0.51, respectively. However, IoU loss focuses solely on the spatial overlap of bounding boxes without effectively modeling the relationships between local features. This limitation reduces its effectiveness in complex object shape recognition tasks. In comparison, the proposed neighborhood loss achieves the best results across all metrics, with mAP@50-95 reaching 0.65. This indicates that in addition to optimizing object detection, it effectively preserves the structural integrity of local shape information, improving model stability.

From a mathematical perspective, smooth L1 loss optimizes only bounding box regression error without considering feature relationships within neighboring regions, leading to suboptimal performance in complex environments. This loss function lacks contextual constraints, limiting its effectiveness in shape-dependent detection tasks. On the other hand, IoU loss directly optimizes the spatial alignment between predicted and ground-truth bounding boxes. Although IoU loss improves spatial alignment, its optimization capacity for small objects is weaker. When object regions are small, the gradient update for IoU computation becomes minimal, leading to slower model convergence. The proposed neighborhood loss constructs a local feature correlation matrix, ensuring that both bounding box optimization and internal feature consistency are maintained.
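For reference, the two baseline losses discussed above can be written compactly as follows. This is a minimal sketch assuming boxes in `(x1, y1, x2, y2)` format, the common smooth L1 definition with a transition point `beta` of 1.0, and the simple `1 - IoU` form of the IoU loss; the exact variants used in the ablation are not specified in the text.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss over box coordinates (beta=1.0 is an assumption)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        # Quadratic near zero, linear for larger errors.
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def iou_loss(pred, target):
    """1 - IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], target[0]), max(pred[1], target[1])
    ix2, iy2 = min(pred[2], target[2]), min(pred[3], target[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_t = (target[2] - target[0]) * (target[3] - target[1])
    union = area_p + area_t - inter
    return 1.0 - (inter / union if union > 0 else 0.0)


# The same 2-pixel horizontal shift applied to a small and a large box.
small_pred, small_gt = (2, 0, 12, 10), (0, 0, 10, 10)
large_pred, large_gt = (2, 0, 102, 100), (0, 0, 100, 100)
```

Smooth L1 assigns both shifted boxes the same loss because it depends only on coordinate differences, whereas the IoU loss depends on overlap relative to box size. Neither term looks at the features inside or around the box, which is the gap the neighborhood loss is designed to address.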

The core idea behind neighborhood loss is to incorporate local constraints, ensuring greater stability in shape learning while avoiding reliance solely on global IoU matching, which often overlooks fine-grained details. The experimental results demonstrate that this loss function significantly enhances the accuracy and robustness of maize trait recognition, enabling the detection model to maintain high performance across different object shapes and scales. Compared to traditional loss functions, neighborhood loss not only improves spatial alignment but also strengthens feature consistency within local regions, making it particularly effective in maize trait recognition and other complex object detection tasks.

4. Discussion

4.1. Discussion with Other Maize Trait Recognition Methods

In order to comprehensively evaluate the performance of various object detection methods in maize trait recognition tasks, a series of comparative experiments were conducted based on the multi-trait detection demands in complex agricultural environments. The experimental results are presented in Table 1 and Table 2, and a systematic comparison and analysis were performed in reference to the representative methods reviewed in the introduction. First, among the two-stage detection models, Faster R-CNN generates candidate boxes through a region proposal network and performs classification and regression for precise detection. Although this model achieved a precision of 0.84, a recall of 0.80, and an mAP@50 of 0.81 in maize trait detection, its overall mAP@50-95 reached only 0.54, indicating significant limitations when addressing small objects, occlusion, and adhesion scenarios. This is consistent with the improved Faster R-CNN model proposed by Zhang et al. [17] for rice panicle detection, which achieved substantial mAP improvement; however, its high computational complexity restricts its suitability for real-time agricultural applications. In contrast, the YOLO series, as a representative class of single-stage detectors, achieves a balance between detection accuracy and speed. In the present experiments, YOLOv10 and YOLOv11 achieved precision values of 0.92 and 0.93, respectively, with mAP@50 scores of 0.88 and 0.90, and mAP@50-95 values of 0.63 and 0.64. Notably, YOLOv11 outperformed Maize-YOLO [19], which reported an mAP of 76.3% and a recall of 77.3%, indicating that the continuous optimization of the YOLO architecture enhances its adaptability to complex agricultural targets. Particularly, by integrating a neighborhood attention mechanism into YOLOv11, the proposed method further improved precision to 0.95, mAP@50 to 0.92, and mAP@50-95 to 0.65, demonstrating significant advantages in distinguishing small targets and enhancing overall detection robustness. 
In addition, when compared with the lightweight model constructed by Xia et al. [20], which is based on YOLOv5s, ShuffleNet, and CBAM, the proposed method showed superior detection performance despite being slightly less lightweight. Xia’s model was designed primarily for mold detection on maize seed surfaces and achieved an mAP@50 of 0.955, making it suitable for single-trait recognition tasks. In contrast, the proposed model targets multi-trait joint detection, achieving not only high precision but also strong cross-trait generalizability and adaptability. Similarly, the YOLOv7-MEF model proposed by Yang et al. [21] achieved an accuracy of 98.94% and a frame rate of 76.92 FPS, exhibiting excellent efficiency. However, as the model was designed for single maize kernel detection, it is difficult to directly generalize it to the multi-trait task considered in this study. The proposed method maintained a high accuracy of 0.93 and a recall of 0.92 while demonstrating effective multi-task detection capability. As shown in Table 2, further analysis of the detection results for individual traits reveals consistently high performance across all five trait types. In particular, a precision of 0.97 and an mAP@50-95 of 0.68 were achieved in the detection of color characteristics, highlighting the strength of the neighborhood attention mechanism in capturing local color variations. For kernel size and kernel row number, which involve dense structural features, mAP@50-95 values reached 0.67 and 0.65, respectively, outperforming most existing models and effectively mitigating the false detection issues caused by occlusion and adhesion.

4.2. Discussion on Kernel Trait Recognition

As shown in Figure 5, the proposed maize kernel trait recognition method provides a comprehensive visual representation of the recognition process on typical maize ear images, including original image input, kernel boundary segmentation, cob contour extraction, and kernel row structure modeling. To further evaluate the model’s performance in challenging scenarios involving densely packed targets, adhesive structures, and irregular arrangements, an in-depth experiment was conducted on kernel recognition as a representative “hard case”, followed by a systematic comparison with established methods [28,29,30].

Compared with micro-CT-based three-dimensional kernel reconstruction methods [29], the proposed method enables accurate extraction of kernel contours and structural attributes using only conventional images, without the need for expensive equipment. While micro-CT offers precise measurements of internal features such as kernel volume and endosperm cavity, its high cost and limited throughput hinder its application in large-scale field trials or germplasm evaluation. In contrast, the method presented here leverages deep modeling of 2D images with a structure-aware mechanism, achieving both high-throughput and non-destructive detection, while maintaining strong resolution of structural traits, making it suitable for large-sample phenotyping. Furthermore, in comparison with the kernel density estimation and structural disentanglement pipeline built upon MKNet and RMG [30], the proposed method maintains interpretability of structural outputs while eliminating the need for multi-stage reasoning. MKNet predicts the total number of kernels per ear via density map regression, and RMG reconstructs the structural arrangement through row mask generation. Although this approach achieves certain structural modeling capability, its overall pipeline remains complex and sensitive to noise in the density map. The method described here adopts an end-to-end neighborhood attention modeling strategy, enabling simultaneous localization of dense targets and structural alignment within a single stage. This significantly simplifies the processing pipeline and improves the completeness and consistency of kernel boundary prediction. In addition, compared with methods that integrate QTL and GWAS analyses [28], the current approach emphasizes the precision and throughput of phenotype acquisition. 
While genome-level studies have identified numerous loci associated with kernel size, row number, and internal cavity traits, and functional analyses have revealed that these candidate genes participate in developmental regulation and stress resistance pathways, such studies heavily rely on large-scale, well-structured, and stable phenotypic datasets. The proposed kernel recognition framework is designed to address this prerequisite, offering foundational support for phenotype analysis by enabling accurate extraction of kernel boundaries, row count, and local distribution patterns. The model generates clear segmentation boundaries between multiple kernel rows and precisely extracts cob contours, yielding structured outputs with strong spatial consistency. Without the need for complex post-processing steps, explicit expression of the “number of kernels per ear”, “number of rows per ear”, and “kernels per row” is achieved. These results demonstrate the model’s robustness and practical effectiveness in handling structurally difficult cases.

4.3. Limitations and Future Work

Despite the superior performance achieved by the proposed maize trait recognition method based on the neighborhood attention mechanism in multiple experiments, certain limitations remain. First, although the neighborhood attention mechanism effectively improves detection accuracy and enhances local feature representation, its performance may degrade under extreme illumination variations and severe occlusions. In large-scale field data collection, variations in lighting conditions, shadows, and background complexity may impact the stability of the detection model, leading to misidentification of certain targets. Second, while the proposed method exhibits high mAP and accuracy across multiple maize trait detection tasks, its generalization capability to different maize varieties requires further validation. In practical agricultural production, maize exhibits significant trait variations across different growth environments and developmental stages, necessitating further exploration of the method’s adaptability to unseen varieties.

Future research can focus on several optimization directions. First, the integration of multimodal data, such as hyperspectral imaging and depth information, can be explored to enhance the model’s robustness against varying illumination conditions and complex backgrounds. Second, more robust feature extraction mechanisms should be investigated to ensure high detection accuracy across different maize varieties and geographical regions. Additionally, future work will involve incorporating datasets that cover a broader spectrum of boundary conditions to thoroughly evaluate the method’s adaptability and stability under more complex environments. These boundary conditions include, but are not limited to, variations in lighting, scale transformations, occlusion levels, and background complexity. This approach will help establish a systematic set of boundary conditions to support the standardized application of the method across diverse scenarios, thereby confirming its robustness and reliability under a wider range of conditions.

5. Conclusions

Maize trait recognition plays a critical role in agricultural production and breeding research. However, traditional object detection methods still face limitations in handling complex backgrounds and detecting small targets with highly variable shapes. In this study, a maize trait recognition method based on a neighborhood attention mechanism is proposed to enhance the feature extraction and attention modeling capabilities of object detection networks, thereby improving the accuracy of kernel and ear trait detection. Experimental results demonstrate that the proposed method outperforms existing object detection approaches across multiple key metrics, achieving high precision in different trait detection tasks while significantly improving the robustness and generalization ability of object detection. The primary contribution of this study lies in the introduction of a neighborhood attention mechanism, which captures feature correlations within local regions, enabling the detection network to maintain high accuracy and stability even in complex environments. Ablation studies validate the effectiveness of the proposed mechanism, showing that compared to traditional channel attention and spatial attention methods, the proposed method improves mAP@50 and mAP@50-95 by approximately 11% and 8%, respectively. Additionally, the proposed neighborhood loss function enhances the stability of internal feature representations while optimizing bounding boxes. Compared to smooth L1 loss and IoU loss, the mAP@50 and mAP@50-95 are improved by 31% and 14%, respectively. Further experiments on different maize traits confirm the effectiveness of the proposed method, demonstrating high detection accuracy for ear length, ear diameter, kernel row number, kernel size, and color characteristics. The overall mAP@50 reaches 0.92, and mAP@50-95 achieves 0.65, with both precision and recall exceeding 0.92. 
By optimizing the attention mechanism and loss function, the proposed method not only improves object detection accuracy but also reduces computational resource dependence, making it more feasible for practical applications in agricultural production and intelligent breeding systems. Future research can further explore the application of this method in multimodal data fusion, such as integrating spectral information and improving robustness under high-illumination conditions, to enhance the adaptability and detection performance of the model.

Author Contributions

Conceptualization, Z.L., J.L., H.S. and C.L.; data curation, M.Z. and Y.Z.; formal analysis, H.Z., Y.Z. and Z.Y.; funding acquisition, C.L.; investigation, M.Z. and Z.Y.; methodology, Z.L., J.L. and H.S.; project administration, C.L.; resources, M.Z. and Y.Z.; software, Z.L., J.L. and H.S.; supervision, Z.Y. and C.L.; validation, H.Z.; visualization, H.Z.; writing—original draft, Z.L., J.L., H.S., M.Z., H.Z., Y.Z., Z.Y. and C.L.; Z.L., J.L. and H.S. contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to express their sincere gratitude to the Computer Association of China Agricultural University (ECC) for their valuable technical support. Upon the acceptance of this paper, the project code and the dataset will be made publicly available to facilitate further research and development in this field.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Xiao, F.; Hui, Z.; Rui, Z.; Lu, Q.; Dong, W.; Li, D.; Zhang, Y.; Zheng, G. Variety Recognition Based on Deep Learning and Double-Sided Characteristics of Maize Kernel. J. Syst. Simul. 2022, 33, 2983–2991.
2. Jia, Y.; Li, Z.; Gao, R.; Zhang, X.; Zhang, H.; Su, Z. Mildew recognition on maize seed by use of hyperspectral technology. Spectrosc. Lett. 2022, 55, 240–249.
3. Wang, Z.; Guan, B.; Tang, W.; Wu, S.; Ma, X.; Niu, H.; Wan, X.; Zang, Y. Classification of fluorescently labelled maize kernels using convolutional neural networks. Sensors 2023, 23, 2840.
4. Li, J.; Zhao, B.; Wu, J.; Zhang, S.; Lv, C.; Li, L. Stress-Crack detection in maize kernels based on machine vision. Comput. Electron. Agric. 2022, 194, 106795.
5. Dai, D.; Ma, Z.; Song, R. Maize kernel development. Mol. Breed. 2021, 41, 1–33.
6. Warman, C.; Sullivan, C.M.; Preece, J.; Buchanan, M.E.; Vejlupkova, Z.; Jaiswal, P.; Fowler, J.E. A cost-effective maize ear phenotyping platform enables rapid categorization and quantification of kernels. Plant J. 2021, 106, 566–579.
7. Song, K.; Zhang, Y.; Shi, T.; Yang, D. Rapid detection of imperfect maize kernels based on spectral and image features fusion. J. Food Meas. Charact. 2024, 18, 3277–3286.
8. Xu, P.; Yang, R.; Zeng, T.; Zhang, J.; Zhang, Y.; Tan, Q. Varietal classification of maize seeds using computer vision and machine learning techniques. J. Food Process Eng. 2021, 44, e13846.
9. Gebeyehu, S.; Shibeshi, Z.S. Maize seed variety identification model using image processing and deep learning. Indones. J. Electr. Eng. Comput. Sci. 2024, 33, 990–998.
10. Ali, A.; Mashwani, W.K.; Tahir, M.H.; Belhaouari, S.B.; Alrabaiah, H.; Naeem, S.; Nasir, J.A.; Jamal, F.; Chesneau, C. Statistical features analysis and discrimination of maize seeds utilizing machine vision approach. J. Intell. Fuzzy Syst. 2021, 40, 703–714.
11. Wang, H.; Pan, X.; Zhu, Y.; Li, S.; Zhu, R. Maize leaf disease recognition based on TC-MRSN model in sustainable agriculture. Comput. Electron. Agric. 2024, 221, 108915.
12. Yang, D.; Zhou, Y.; Jie, Y.; Li, Q.; Shi, T. Non-destructive detection of defective maize kernels using hyperspectral imaging and convolutional neural network with attention module. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2024, 313, 124166.
13. Zhang, H.; Li, Z.; Yang, Z.; Zhu, C.; Ding, Y.; Li, P.; He, X. Detection of the corn kernel breakage rate based on an improved mask region-based convolutional neural network. Agriculture 2023, 13, 2257.
14. Dönmez, E. Enhancing classification capacity of CNN models with deep feature selection and fusion: A case study on maize seed classification. Data Knowl. Eng. 2022, 141, 102075.
15. Bi, C.; Hu, N.; Zou, Y.; Zhang, S.; Xu, S.; Yu, H. Development of deep learning methodology for maize seed variety recognition based on improved swin transformer. Agronomy 2022, 12, 1843.
16. Rasmussen, C.B.; Kirk, K.; Moeslund, T.B. Anchor tuning in Faster R-CNN for measuring corn silage physical characteristics. Comput. Electron. Agric. 2021, 188, 106344.
17. Zhang, Y.; Xiao, D.; Liu, Y.; Wu, H. An algorithm for automatic identification of multiple developmental stages of rice spikes based on improved Faster R-CNN. Crop J. 2022, 10, 1323–1333.
18. Meng, Y.; Zhan, J.; Li, K.; Yan, F.; Zhang, L. A rapid and precise algorithm for maize leaf disease detection based on YOLO MSM. Sci. Rep. 2025, 15, 6016.
19. Yang, S.; Xing, Z.; Wang, H.; Dong, X.; Gao, X.; Liu, Z.; Zhang, X.; Li, S.; Zhao, Y. Maize-YOLO: A new high-precision and real-time method for maize pest detection. Insects 2023, 14, 278.
20. Xia, Y.; Shen, A.; Che, T.; Liu, W.; Kang, J.; Tang, W. Early Detection of Surface Mildew in Maize Kernels Using Machine Vision Coupled with Improved YOLOv5 Deep Learning Model. Appl. Sci. 2024, 14, 10489.
21. Yang, L.; Liu, C.; Wang, C.; Wang, D. Maize Kernel Quality Detection Based on Improved Lightweight YOLOv7. Agriculture 2024, 14, 618.
22. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14. Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
23. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
24. Guan, S.; Lin, Y.; Lin, G.; Su, P.; Huang, S.; Meng, X.; Liu, P.; Yan, J. Real-time detection and counting of wheat spikes based on improved YOLOv10. Agronomy 2024, 14, 1936.
25. Wei, J.; Ni, L.; Luo, L.; Chen, M.; You, M.; Sun, Y.; Hu, T. GFS-YOLO11: A Maturity Detection Model for Multi-Variety Tomato. Agronomy 2024, 14, 2644.
26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
27. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
28. Wang, C.; Li, H.; Long, Y.; Dong, Z.; Wang, J.; Liu, C.; Wei, X.; Wan, X. A Systemic Investigation of Genetic Architecture and Gene Resources Controlling Kernel Size-Related Traits in Maize. Int. J. Mol. Sci. 2023, 24, 1025.
29. Li, D.; Wang, J.; Zhang, Y.; Lu, X.; Du, J.; Guo, X. CT-based phenotyping and genome-wide association analysis of the internal structure and components of maize kernels. Agronomy 2023, 13, 1078.
30. Shi, M.; Zhang, S.; Lu, H.; Zhao, X.; Wang, X.; Cao, Z. Phenotyping multiple maize ear traits from a single image: Kernels per ear, rows per ear, and kernels per row. Comput. Electron. Agric. 2022, 193, 106681.

Figure 1. Overview of the data acquisition process for maize image collection. (a) Downloading maize images from the Internet to obtain a wide variety of samples, providing a broad range of maize appearances. (b) Utilizing a high-resolution camera mounted on a tripod to capture maize images from multiple angles (360°) and heights (50–80 cm), ensuring comprehensive documentation of maize characteristics. (c) Area map highlighting Nanjing City, Jiangsu Province, China, as the main location for data collection, with a representative maize field image demonstrating the local environment and maize planting setup.

Figure 2. Overview of the proposed maize trait recognition network architecture. The network consists of three main components: Base-Net, Overview-Net, and Focus-Net.

Figure 3. Illustration of the proposed neighborhood attention mechanism. The attention module computes feature interactions by leveraging adaptive pooling and 1 × 1 convolutions to generate query (Q) and key (K) representations.

Figure 4. Illustration of the proposed neighborhood loss function. The loss function consists of three components: cross-entropy loss, neighborhood consistency loss, and boundary-aware loss.

Figure 5. Visualization of maize kernel trait recognition results.

Table 1. Experimental results of maize trait detection models.

Model	Precision	Recall	Accuracy	mAP@50	mAP@50-95
Faster R-CNN [26]	0.84	0.80	0.82	0.81	0.54
SSD [22]	0.85	0.81	0.83	0.82	0.56
RetinaNet [23]	0.87	0.83	0.86	0.86	0.57
DETR [27]	0.89	0.86	0.87	0.84	0.60
YOLOv10 [24]	0.92	0.87	0.90	0.88	0.63
YOLOv11 [25]	0.93	0.91	0.92	0.90	0.64
Proposed Method	0.95	0.92	0.93	0.92	0.65
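
As a reading aid for Table 1, the following minimal sketch re-derives two quantities from the tabulated values: the F1 score implied by each precision/recall pair and the margin of the proposed method over the strongest baseline in a chosen column. The F1 values are derived here for illustration and are not reported in the paper; the names `TABLE_1`, `f1_score`, and `margin_over_best_baseline` are hypothetical helpers, not part of the authors' code.

```python
# Rows of Table 1 as (precision, recall, accuracy, mAP@50, mAP@50-95).
TABLE_1 = {
    "Faster R-CNN": (0.84, 0.80, 0.82, 0.81, 0.54),
    "SSD": (0.85, 0.81, 0.83, 0.82, 0.56),
    "RetinaNet": (0.87, 0.83, 0.86, 0.86, 0.57),
    "DETR": (0.89, 0.86, 0.87, 0.84, 0.60),
    "YOLOv10": (0.92, 0.87, 0.90, 0.88, 0.63),
    "YOLOv11": (0.93, 0.91, 0.92, 0.90, 0.64),
    "Proposed Method": (0.95, 0.92, 0.93, 0.92, 0.65),
}


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (derived, not reported in the paper)."""
    return 2 * precision * recall / (precision + recall)


def margin_over_best_baseline(table, proposed="Proposed Method", column=3):
    """Return (best baseline name, proposed minus best baseline) for one column.

    column indexes the row tuple: 3 is mAP@50 and 4 is mAP@50-95.
    """
    baselines = {name: row for name, row in table.items() if name != proposed}
    best = max(baselines, key=lambda name: baselines[name][column])
    return best, round(table[proposed][column] - baselines[best][column], 2)


if __name__ == "__main__":
    for name, (precision, recall, *_rest) in TABLE_1.items():
        print(f"{name:16s} F1 = {f1_score(precision, recall):.3f}")
    print("mAP@50 margin:   ", margin_over_best_baseline(TABLE_1, column=3))
    print("mAP@50-95 margin:", margin_over_best_baseline(TABLE_1, column=4))
```

Under these values the strongest baseline in both mAP columns is YOLOv11, and the proposed method leads it by 0.02 in mAP@50 and 0.01 in mAP@50-95.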

Table 2. Detection results of different maize traits using the proposed method.

Maize Trait	Precision	Recall	Accuracy	mAP@50	mAP@50-95
Ear Length	0.92	0.90	0.91	0.91	0.62
Ear Diameter	0.93	0.90	0.92	0.92	0.64
Kernel Row Number	0.95	0.92	0.93	0.93	0.65
Kernel Size	0.96	0.93	0.94	0.94	0.67
Color Characteristics	0.97	0.94	0.95	0.95	0.68
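
To relate the per-trait rows of Table 2 to the overall figures for the proposed method in Table 1, the sketch below computes the unweighted mean of each column. This is only an illustration: how the overall figures aggregate the per-trait results (for example, weighting by the number of instances per trait) is an assumption left open here, and `TABLE_2`, `OVERALL`, and `column_means` are hypothetical names.

```python
from statistics import mean

COLUMNS = ("precision", "recall", "accuracy", "mAP@50", "mAP@50-95")

# Rows of Table 2, in the column order given by COLUMNS.
TABLE_2 = {
    "Ear Length": (0.92, 0.90, 0.91, 0.91, 0.62),
    "Ear Diameter": (0.93, 0.90, 0.92, 0.92, 0.64),
    "Kernel Row Number": (0.95, 0.92, 0.93, 0.93, 0.65),
    "Kernel Size": (0.96, 0.93, 0.94, 0.94, 0.67),
    "Color Characteristics": (0.97, 0.94, 0.95, 0.95, 0.68),
}

# "Proposed Method" row of Table 1, used as the overall reference.
OVERALL = (0.95, 0.92, 0.93, 0.92, 0.65)


def column_means(table):
    """Unweighted mean of every metric column across the trait rows."""
    return {
        column: round(mean(row[i] for row in table.values()), 3)
        for i, column in enumerate(COLUMNS)
    }


if __name__ == "__main__":
    means = column_means(TABLE_2)
    for i, column in enumerate(COLUMNS):
        gap = round(means[column] - OVERALL[i], 3)
        print(f"{column:10s} trait mean = {means[column]:.3f}  overall = {OVERALL[i]:.2f}  gap = {gap:+.3f}")
```

The unweighted means come out at 0.946, 0.918, 0.930, 0.930, and 0.652, each within 0.01 of the corresponding overall value, which is consistent with the overall figures being some aggregate of the per-trait results.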

Table 3. Ablation study on different attention mechanisms.

Attention Mechanism	Precision	Recall	Accuracy	mAP@50	mAP@50-95
Channel Attention [26]	0.63	0.60	0.61	0.60	0.39
Spatial Attention [27]	0.84	0.80	0.82	0.81	0.57
Proposed Method	0.95	0.92	0.93	0.92	0.65

Table 4. Ablation study on different loss functions.

Loss Function	Precision	Recall	Accuracy	mAP@50	mAP@50-95
Smooth L1 Loss	0.64	0.61	0.62	0.61	0.34
IoU Loss	0.83	0.79	0.80	0.80	0.51
Proposed Method	0.95	0.92	0.93	0.92	0.65
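
The improvements discussed for the two ablation studies can be re-derived from Tables 3 and 4 as differences between table rows. The sketch below reports both the absolute gain (in mAP units) and the relative gain (as a percentage of the baseline value), since the two readings differ noticeably. The dictionaries copy the mAP columns of the tables; the function name `gains` and the output keys are hypothetical.

```python
# (mAP@50, mAP@50-95) per row, copied from Tables 3 and 4.
ATTENTION = {
    "Channel Attention": (0.60, 0.39),
    "Spatial Attention": (0.81, 0.57),
    "Proposed Method": (0.92, 0.65),
}
LOSS = {
    "Smooth L1 Loss": (0.61, 0.34),
    "IoU Loss": (0.80, 0.51),
    "Proposed Method": (0.92, 0.65),
}


def gains(table, proposed="Proposed Method"):
    """Absolute and relative mAP gains of the proposed row over every other row."""
    p50, p5095 = table[proposed]
    result = {}
    for name, (b50, b5095) in table.items():
        if name == proposed:
            continue
        result[name] = {
            "abs_mAP@50": round(p50 - b50, 2),
            "abs_mAP@50-95": round(p5095 - b5095, 2),
            "rel_mAP@50_pct": round(100 * (p50 - b50) / b50, 1),
            "rel_mAP@50-95_pct": round(100 * (p5095 - b5095) / b5095, 1),
        }
    return result


if __name__ == "__main__":
    for title, table in (("Attention ablation", ATTENTION), ("Loss ablation", LOSS)):
        print(title)
        for baseline, values in gains(table).items():
            print(f"  vs {baseline}: {values}")
```

With these inputs, the absolute gains over spatial attention are 0.11 (mAP@50) and 0.08 (mAP@50-95), the absolute mAP@50 gain over smooth L1 loss is 0.31, and the absolute mAP@50-95 gain over IoU loss is 0.14.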

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
