1. Introduction
Accurate fruit detection is a critical component of precision agriculture, facilitating reliable yield estimation, timely harvesting, and informed orchard management strategies [1,2]. While optical imagery has traditionally formed the backbone of these applications, its inherent dependence on illumination and weather conditions can limit data availability and quality [3]. In contrast, Synthetic Aperture Radar (SAR) imaging, with its capability to penetrate clouds and operate independently of daylight, provides consistent and robust data acquisition, making it particularly suited for continuous orchard monitoring [4]. A direct comparison of the same target captured by these two modalities (RGB vs. Ka-band near-field MIMO-SAR) is shown in Figure 1. However, applying state-of-the-art detection algorithms to SAR imagery for apple detection remains a significant challenge due to the modality’s distinct imaging characteristics and the subtle nature of apple signatures within complex orchard canopies.
Several fundamental factors complicate apple detection in SAR data. First, unlike optical images that offer rich color, texture, and intensity gradients, SAR data represent scenes primarily through microwave backscatter intensities [5,6]. This results in low-contrast, texture-deficient imagery, where apples often appear as weak scatterers easily confounded by speckle noise, foliage clutter, and complex branching structures. Second, apples may vary substantially in size and distribution within the orchard, necessitating robust multi-scale feature extraction techniques that can discriminate between small, partially obscured targets and heterogeneous backgrounds [7,8]. Third, contextual cues, commonly exploited in optical imagery to provide additional semantic guidance, are not readily available in SAR data [9]. The identification of subtle relationships between targets and their surroundings becomes more demanding, requiring sophisticated approaches to incorporate both global and local contextual information [10].
Existing methods have sought to mitigate these challenges through advanced feature extraction and fusion strategies as well as attention mechanisms [11,12,13]. Multi-scale representations, such as those enabled by Spatial Pyramid Pooling (SPP), have proven effective in optical domains [14,15]. However, fixed-scale SPP configurations often fail to accommodate the dynamic variability present in SAR orchard scenes, leading to suboptimal scale selections for apple detection. Similarly, single-pass feature fusion techniques may struggle with the inherently noisy and semantically ambiguous nature of SAR features, resulting in unstable or semantically inconsistent representations that degrade detection performance [8]. Furthermore, while attention-based methods enhance feature discriminability by highlighting relevant regions or channels, a lack of explicit modeling of global and local contextual relationships may prevent the detector from leveraging crucial relational cues necessary for distinguishing subtle targets from complex backgrounds [16].
To address these gaps, this work introduces a novel detection framework tailored to the unique demands of SAR-based apple detection. The main contributions are threefold:
Dynamic Spatial Pyramid Pooling (DSPP): A learnable, adaptive pooling mechanism that replaces fixed-scale pooling with a dynamic selection process, emphasizing the most informative scales and spatial regions. This approach overcomes the rigidity of traditional SPP methods, enabling more effective multi-scale feature representation in SAR orchard imagery.
Recursive Feature Fusion Network (RFN): An iterative, recurrent-based fusion strategy that refines multi-level features over multiple passes. By repeatedly integrating information from various network layers, RFN stabilizes feature representations, enhances semantic coherence, and mitigates the impact of noise and low-contrast targets.
Context-Aware Feature Enhancement (CAFE): An attention-driven module that explicitly incorporates global and local contextual cues into the detection pipeline. CAFE models long-range dependencies and relational patterns, allowing the framework to exploit subtle contextual information that improves discrimination between apples and surrounding clutter.
These components are jointly optimized in an end-to-end framework, resulting in a detector that is robust, adaptive, and contextually aware. Experimental results demonstrate that the integrated DSPP-RFN-CAFE approach significantly improves detection accuracy and robustness compared to baseline methods, paving the way for more reliable SAR-based orchard monitoring and advancing the state of the art in remote sensing applications for precision agriculture [5].
3. Data Collection and Imaging Methods
We employ a linear array mechanical scanning planar aperture near-field wideband three-dimensional imaging approach. As shown in Figure 2, taking a one-dimensional linear array scanning along the x-axis as an example, the antenna array is aligned parallel to the y-axis and includes both transmit and receive units operating in a monostatic mode. At each sampling position, signals are transmitted and received in sequence. A mechanical scanning device drives the antenna array to move steadily along the x-axis, thus forming a two-dimensional planar aperture. The center of the region to be imaged is set as the coordinate origin, and the z-axis is perpendicular to the planar aperture, representing the range direction.
Because the antenna array performs high-speed electronic scanning and the target is very close to the array, the horizontal movement during the vertical electronic scanning can be neglected. For each column of sampled data, it is assumed that the horizontal sampling positions remain constant. Under ideal conditions, we assume that the transmit and receive unit positions coincide [26,27,28]. Therefore, the ideal sampling positions corresponding to the two-dimensional aperture are $(x', y', -z_0)$, where $z_0$ is the vertical distance from the aperture plane to the coordinate origin. Let the coordinates of a target scattering point be $(x, y, z)$; the distance from the scattering point to the antenna element is then given by:
$$R = \sqrt{(x - x')^2 + (y - y')^2 + (z + z_0)^2}.$$
According to the wavenumber domain imaging algorithm, for a planar aperture, the baseband echo signal can be expressed as:
$$s(x', y', k) = \iiint \sigma(x, y, z)\, e^{-j 2 k R}\, \mathrm{d}x\, \mathrm{d}y\, \mathrm{d}z,$$
where $\sigma(x, y, z)$ denotes the scattering coefficient of each target scattering point, $f$ represents the operating frequency (or frequency band if multiple frequencies are used), and $c$ denotes the speed of electromagnetic wave propagation in free space. The product $k = 2\pi f / c$ gives the fundamental wavenumber for the imaging process.
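To make this forward model concrete, the minimal numpy sketch below simulates the baseband echo for a few hypothetical point scatterers; the frequency sweep, aperture sampling, standoff distance $z_0$, and target positions are illustrative assumptions rather than the prototype's actual parameters.

```python
import numpy as np

# Illustrative parameters (not the prototype's actual configuration)
c = 3e8                                   # propagation speed (m/s)
f = np.linspace(32e9, 37e9, 64)           # Ka-band frequency sweep (Hz)
k = 2 * np.pi * f / c                     # wavenumber k = 2*pi*f/c
xp = np.linspace(-0.25, 0.25, 101)        # aperture sampling along x' (m)
yp = np.linspace(-0.25, 0.25, 101)        # aperture sampling along y' (m)
z0 = 0.3                                  # array-to-origin distance (m)

# Hypothetical point scatterers (x, y, z, sigma) near the origin
targets = [(0.00, 0.00, 0.00, 1.0),
           (0.05, -0.03, 0.02, 0.7)]

# Baseband echo s(x', y', k) under the ideal monostatic model:
#   s = sum_i sigma_i * exp(-j * 2 * k * R_i)
X, Y, K = np.meshgrid(xp, yp, k, indexing="ij")
s = np.zeros_like(K, dtype=complex)
for (x, y, z, sigma) in targets:
    R = np.sqrt((x - X) ** 2 + (y - Y) ** 2 + (z + z0) ** 2)
    s += sigma * np.exp(-1j * 2 * K * R)
```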
Based on the wavenumber domain imaging algorithm [3], the target scattering coefficient distribution can be obtained by mapping $s(x', y', k)$ into the $(k_x, k_y)$ domain via two-dimensional Fourier transforms in $x'$ and $y'$. Once the data are expressed as $S(k_x, k_y, k)$, it becomes possible to perform a three-dimensional reconstruction via Stolt interpolation. Specifically, the spectrum is first compensated by the propagation phase referenced to the aperture standoff,
$$\tilde{S}(k_x, k_y, k) = S(k_x, k_y, k)\, e^{\,j k_z z_0},$$
where $z_0$ is the vertical distance from the array to the coordinate origin. In practical imaging applications, $z_0$ can be chosen according to the imaging range and other factors.
According to the dispersion relation, the signal wavenumber $k$ and the spatial wavenumber components $k_x$, $k_y$, and $k_z$ satisfy the following:
$$k_x^2 + k_y^2 + k_z^2 = (2k)^2.$$
To perform a three-dimensional inverse Fourier transform, $\tilde{S}(k_x, k_y, k)$ must be interpolated onto a uniformly distributed grid in the $(k_x, k_y, k_z)$ coordinate system, resulting in $S(k_x, k_y, k_z)$. This step is known as Stolt interpolation [29]. After interpolation, we obtain $S(k_x, k_y, k_z)$, which is uniformly distributed in the frequency domain. A three-dimensional inverse Fourier transform then yields the target scattering coefficient distribution $\sigma(x, y, z)$.
The steps of the wavenumber domain algorithm are described below.
First, a two-dimensional Fourier transform is applied along the horizontal and vertical dimensions of the echo signal $s(x', y', k)$ to obtain the frequency-domain signal $S(k_x, k_y, k)$. This frequency-domain signal is then multiplied by the reference function $\exp(j k_z z_0)$ to compensate for the propagation phase. Next, Stolt interpolation is performed to map $\tilde{S}(k_x, k_y, k)$ to $S(k_x, k_y, k_z)$. Finally, a three-dimensional inverse Fourier transform of $S(k_x, k_y, k_z)$ yields the target scattering coefficient distribution $\sigma(x, y, z)$. To provide a clearer mathematical expression, $\sigma(x, y, z)$ can be written as:
$$\sigma(x, y, z) = \mathrm{IFT}_{3D}\Big\{ \mathcal{S}\big[ \mathrm{FT}_{2D}\big( s(x', y', k) \big)\, e^{\,j k_z z_0} \big] \Big\},$$
where $\mathrm{FT}_{2D}$ denotes the two-dimensional Fourier transform in the horizontal and vertical directions ($x'$ and $y'$), $\mathcal{S}$ is the one-dimensional Stolt interpolation mapping $k$ to $k_z$, and $\mathrm{IFT}_{3D}$ represents the three-dimensional inverse Fourier transform. Different parts of the target exhibit distinct reflection coefficients; once the three-dimensional scattering distribution $\sigma(x, y, z)$ is obtained, target imaging can be readily performed.
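The chain $\mathrm{FT}_{2D}$ → phase compensation → Stolt interpolation → $\mathrm{IFT}_{3D}$ can be condensed into a short numerical sketch, shown below; the function and variable names are ours, scipy's interp1d stands in for the one-dimensional Stolt mapping $\mathcal{S}$, and the grid sizes are arbitrary.

```python
import numpy as np
from scipy.interpolate import interp1d

def wavenumber_reconstruct(s, xp, yp, k, z0, n_kz=64):
    """Omega-k reconstruction sketch: 2D FFT over the aperture, propagation-
    phase compensation, Stolt interpolation from k to a uniform k_z grid,
    and a 3D inverse FFT. FFT conventions and the handling of evanescent
    components are implementation assumptions."""
    nx, ny, nf = s.shape
    # Step 1: 2D Fourier transform over the aperture coordinates (x', y')
    S = np.fft.fftshift(np.fft.fft2(s, axes=(0, 1)), axes=(0, 1))
    kx = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(nx, d=xp[1] - xp[0]))
    ky = 2 * np.pi * np.fft.fftshift(np.fft.fftfreq(ny, d=yp[1] - yp[0]))
    KX, KY, K = np.meshgrid(kx, ky, k, indexing="ij")

    # Step 2: dispersion relation k_z^2 = 4k^2 - k_x^2 - k_y^2, then
    # multiply by the reference function exp(j * k_z * z0)
    kz2 = 4 * K ** 2 - KX ** 2 - KY ** 2
    kz = np.sqrt(np.maximum(kz2, 0.0))
    S = S * np.exp(1j * kz * z0) * (kz2 > 0)   # evanescent part discarded

    # Step 3: Stolt interpolation -- resample from k onto a uniform k_z grid
    kz_u = np.linspace(0.0, 2 * k.max(), n_kz)
    S_stolt = np.zeros((nx, ny, n_kz), dtype=complex)
    for i in range(nx):
        for j in range(ny):
            valid = kz2[i, j] > 0
            if valid.sum() < 2:
                continue
            S_stolt[i, j] = interp1d(kz[i, j][valid], S[i, j][valid],
                                     bounds_error=False, fill_value=0.0)(kz_u)

    # Step 4: 3D inverse Fourier transform yields |sigma(x, y, z)|
    sigma = np.fft.ifftn(np.fft.ifftshift(S_stolt, axes=(0, 1)))
    return np.abs(sigma)
```

Applied to the simulated echo from the previous sketch, `wavenumber_reconstruct(s, xp, yp, k, z0)` returns a three-dimensional magnitude volume whose peaks fall near the assumed scatterer positions.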
3.1. Prototype Composition
The experimental system mainly consists of four parts: the millimeter-wave array, the mechanical scanning device, the control system, and the data acquisition and processing system. The radar transmits Ka-band signals (with wavelengths ranging from 7.5 mm to 11.25 mm). The millimeter-wave array module integrates an antenna array, switching arrays, RF channels, and frequency synthesis modules. The mechanical scanning device moves the millimeter-wave array horizontally and provides real-time feedback on the array’s horizontal position. The control system coordinates all components, controlling the mechanical scanning for horizontal movement, the millimeter-wave array module for signal transmission and reception, and the data acquisition and processing system for data collection and final image output. An image of the mechanical scanning device is shown in Figure 3.
3.2. Experimental System Configuration and Parameter Settings
3.2.1. Prototype Parameters
Table 1 lists the parameters of the Ka-band near-field wideband three-dimensional imaging prototype. A linear frequency-modulated (LFM) signal is transmitted, with frequencies ranging from 32 GHz to 37 GHz. In practical applications, when the subject under inspection is close to the array and the array aperture exceeds the antenna element beam coverage, the azimuth resolution of the entire array’s full-resolution area is determined by the antenna element’s beamwidth. Additionally, because the target occupies a relatively small range, the range window can be appropriately narrowed. Under acceptable signal-to-noise ratios, the echo data can be suitably downsampled [4].
3.2.2. Experimental Dataset Acquisition
We collaborated with the Beijing Institute of Technology and the First Research Institute of the Ministry of Public Security to use a MIMO-SAR-based millimeter-wave imaging device for experimental exploration. The experimental results indicate that the device can complete fruit tree scanning and imaging within 10 s. Using the aforementioned experimental setup, we conducted data acquisition experiments.
Two real fruit-bearing trees were placed at a distance of about 30 cm directly in front of the experimental platform. The scanning mode was enabled, and the test interface is shown in Figure 4a. By continuously rotating the fruit trees and scanning from different angles, SAR images were obtained and stored once the test was completed. A smartphone mounted behind the device captured corresponding optical images after the SAR scanning finished, producing paired samples of optical and SAR images. In total, we collected 150 pairs of samples, covering about 15 fruit trees, yielding approximately 2000 SAR fruit samples. Figure 4b shows an example.
4. Method
This section describes the proposed methodology for detecting apples in Synthetic Aperture Radar (SAR) imagery, a task complicated by low signal-to-noise ratios, lack of intuitive texture or color cues, and the presence of speckle noise and clutter. To address these issues, the method integrates three key components: (1) Dynamic Spatial Pyramid Pooling (DSPP) for adaptive multi-scale feature extraction, (2) a Recursive Feature Fusion Network (RFN) for iterative multi-level feature refinement, and (3) a Context-Aware Feature Enhancement (CAFE) module for leveraging global and local contextual relationships. Together, these components form a detection framework suitable for handling the subtle and often low-contrast appearance of apples in SAR images, ultimately improving robustness and accuracy in challenging orchard environments.
4.1. Problem Setup
Let $I$ denote an input SAR image. The objective is to predict the positions and extents of apples within this image. An initial feature extraction step is performed using a deep convolutional backbone network (e.g., a ResNet variant), yielding a feature map $F \in \mathbb{R}^{H' \times W' \times C}$. This feature map serves as the input for the subsequent modules described below. The overall pipeline is end-to-end trainable, enabling joint optimization of all parameters with respect to a defined detection objective (e.g., a combination of classification and regression losses).
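As an illustration, the multi-level maps consumed by the later fusion stage (Section 4.3) can be exposed from a torchvision ResNet-50 as sketched below; the 512×512 input and the replication of the single SAR channel to three channels are assumptions made for this example.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Sketch: pull the four multi-level feature maps (C2..C5) from a ResNet-50
# backbone. Layer names follow torchvision's node-naming convention.
backbone = resnet50(weights=None)
extractor = create_feature_extractor(
    backbone,
    return_nodes={"layer1": "C2", "layer2": "C3",
                  "layer3": "C4", "layer4": "C5"})

sar_image = torch.randn(1, 1, 512, 512)        # hypothetical SAR input
x = sar_image.repeat(1, 3, 1, 1)               # replicate to 3 channels
feats = extractor(x)
print({name: tuple(f.shape) for name, f in feats.items()})
# e.g. C2: (1, 256, 128, 128) ... C5: (1, 2048, 16, 16)
```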
4.2. Dynamic Spatial Pyramid Pooling (DSPP)
In standard computer vision tasks, Spatial Pyramid Pooling (SPP) provides multi-scale representations by pooling features over increasingly fine subregions of the feature map. However, traditional SPP relies on predetermined scaling levels, which may not be optimal for SAR images of apple orchards characterized by widely varying target scales, cluttered backgrounds, and strong speckle noise. As illustrated in Figure 5, the proposed DSPP module introduces trainable attention mechanisms to adaptively select the most informative scales and spatial regions.
The DSPP module introduces adaptability into the spatial pyramid. Consider $L$ pyramid levels, each partitioning the feature map $F$ into a grid of $n_l \times n_l$ cells at level $l$. For each cell, a pooling operator (e.g., max or average pooling) aggregates the features:
$$p_{l,ij} = \mathrm{Pool}\big( F_{l,ij} \big),$$
where $F_{l,ij}$ represents the subregion of $F$ corresponding to the $(i, j)$-th cell at the $l$-th pyramid level. To introduce adaptability, trainable weights $w_l$ are used to produce normalized attention coefficients $\alpha_l$ through a softmax function:
$$\alpha_l = \frac{\exp(w_l)}{\sum_{l'=1}^{L} \exp(w_{l'})}.$$
Additionally, within each pyramid level, a spatial attention mechanism assigns attention scores $\beta_{l,ij}$ to emphasize informative regions:
$$\beta_{l,ij} = \frac{\exp\big( \phi(p_{l,ij}) \big)}{\sum_{i',j'} \exp\big( \phi(p_{l,i'j'}) \big)},$$
where $\phi(\cdot)$ is a learnable mapping. The DSPP output is formed as a weighted aggregation of the level-wise attention-pooled features:
$$F_{\mathrm{DSPP}} = \sum_{l=1}^{L} \alpha_l \sum_{i,j} \beta_{l,ij}\, p_{l,ij}.$$
This dynamic formulation enables the model to emphasize the most appropriate scales and spatial regions for apple detection in SAR images.
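A compact PyTorch sketch of these equations is given below; the pyramid levels (1, 2, 4), the use of average pooling, and the residual broadcast of the aggregated descriptor back onto the feature map are our illustrative assumptions, since the text does not fix how $F_{\mathrm{DSPP}}$ re-enters the downstream network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSPP(nn.Module):
    """Sketch of Dynamic Spatial Pyramid Pooling: fixed pyramid grids, but
    with learnable level weights (alpha) and per-cell spatial attention
    (beta), following the equations above."""
    def __init__(self, channels, levels=(1, 2, 4)):
        super().__init__()
        self.levels = levels
        self.level_logits = nn.Parameter(torch.zeros(len(levels)))  # w_l
        self.phi = nn.Conv2d(channels, 1, kernel_size=1)            # phi(.)

    def forward(self, x):                            # x: (B, C, H, W)
        alpha = torch.softmax(self.level_logits, 0)  # alpha_l over levels
        out = 0.0
        for idx, n in enumerate(self.levels):
            pooled = F.adaptive_avg_pool2d(x, n)      # p_{l,ij}: (B, C, n, n)
            scores = self.phi(pooled).flatten(2)      # (B, 1, n*n)
            beta = torch.softmax(scores, dim=-1)      # beta_{l,ij}
            agg = (pooled.flatten(2) * beta).sum(-1)  # (B, C) weighted cells
            out = out + alpha[idx] * agg
        # One plausible use of F_DSPP: broadcast the scale-weighted context
        # descriptor back onto the map as a residual (an assumption here).
        return x + out[..., None, None]
```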
4.3. Recursive Feature Fusion Network (RFN)
While DSPP provides adaptive multi-scale representations, single-pass feature fusion may be insufficient for SAR data, which often exhibit weak contrasts, non-distinctive textures, and significant noise. Therefore, we introduce a Recursive Feature Fusion Network (RFN) to iteratively refine multi-level features, leading to more stable and coherent representations. As illustrated in Figure 6, the RFN aggregates feature maps from multiple network stages in a recurrent manner.
Let $\{F_1, F_2, \ldots, F_M\}$ denote a set of feature maps extracted from $M$ different network stages or scales. In practice, $M$ corresponds to the number of multi-level feature maps that the backbone network outputs. In this study, ResNet-50 provides four principal feature scales (C2, C3, C4, and C5), and thus $M = 4$. This choice balances detection performance and computational overhead, though additional levels can be included when higher resolution or more nuanced feature representation is required. Initially, a fusion operator $\mathcal{G}$ aggregates these features:
$$F^{(0)} = \mathcal{G}\big( F_1, F_2, \ldots, F_M \big).$$
Rather than relying on a single-shot integration, the RFN applies multiple refinement iterations. At iteration $t$:
$$F^{(t)} = \mathcal{R}\big( F^{(t-1)}, \{F_m\}_{m=1}^{M} \big),$$
where $\mathcal{R}$ is a recurrent refinement operator. Implementations often employ recurrent layers such as ConvGRU or LSTM-inspired blocks adapted for feature maps. After $T$ iterations, the final recursive fusion output is as follows:
$$F_{\mathrm{RFN}} = F^{(T)}.$$
This iterative process allows low-level and high-level features to be repeatedly integrated, mitigating the effects of noise and strengthening the semantic consistency of the resulting representation, which is critical for accurately identifying small or partially obscured apples in SAR images.
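One possible realization is sketched below, with a minimal ConvGRU cell as the recurrent operator $\mathcal{R}$ and a projected, upsampled average as the fusion operator $\mathcal{G}$; the channel widths, $T = 3$, and fusing at the finest scale are assumptions rather than prescriptions from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGRUCell(nn.Module):
    """Minimal ConvGRU cell serving as the recurrent operator R."""
    def __init__(self, channels):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], 1))).chunk(2, 1)
        h_new = torch.tanh(self.cand(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * h_new              # gated update

class RFN(nn.Module):
    """Sketch of the Recursive Feature Fusion Network: project the M
    backbone maps to a common width, fuse them at the finest scale (the
    fusion operator G, here an average after 1x1 projection and bilinear
    upsampling), then refine recursively for T iterations."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=256, T=3):
        super().__init__()
        self.T = T
        self.proj = nn.ModuleList(nn.Conv2d(c, width, 1) for c in in_channels)
        self.cell = ConvGRUCell(width)

    def forward(self, feats):                       # feats: [C2, C3, C4, C5]
        size = feats[0].shape[-2:]
        fused = sum(F.interpolate(p(f), size=size, mode="bilinear",
                                  align_corners=False)
                    for p, f in zip(self.proj, feats)) / len(feats)
        h = fused                                   # F^(0)
        for _ in range(self.T):                     # F^(t) = R(F^(t-1), ...)
            h = self.cell(fused, h)
        return h                                    # F_RFN = F^(T)
```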
4.4. Context-Aware Feature Enhancement (CAFE)
Even after multi-scale adaptation and iterative fusion, detection in SAR imagery may still benefit from incorporating contextual information. Apples often appear as small, isolated reflectors embedded in complex background structures. As illustrated in Figure 7, the CAFE module introduces global and local contextual modeling to enhance feature discrimination, particularly in noisy and cluttered orchard environments.
This work adopts a self-attention mechanism to capture long-range dependencies. Starting from $F_{\mathrm{RFN}} \in \mathbb{R}^{H' \times W' \times C}$, the spatial dimensions are flattened into $N = H'W'$ positions $\{x_j\}_{j=1}^{N}$. Three linear transformations are applied:
$$q_i = W_q x_i, \qquad k_j = W_k x_j, \qquad v_j = W_v x_j,$$
where $x_j$ denotes the $j$-th position (or spatial location) in the flattened $F_{\mathrm{RFN}}$. Each matrix $W_q$, $W_k$, $W_v$ is a learnable mapping that enriches the feature representation for attention-based context aggregation. A similarity function, as follows:
$$a_{ij} = \frac{\exp\big( q_i^{\top} k_j \big)}{\sum_{j'=1}^{N} \exp\big( q_i^{\top} k_{j'} \big)},$$
measures the similarity between locations $i$ and $j$. Subsequently, contextual aggregation is performed:
$$y_i = \sum_{j=1}^{N} a_{ij}\, v_j.$$
Reshaping $Y$ back to the original dimensions and adding it to $F_{\mathrm{RFN}}$ yields the enhanced features:
$$F_{\mathrm{CAFE}} = F_{\mathrm{RFN}} + Y.$$
This operation incorporates context-driven cues, helping to highlight subtle apple signatures against complex backgrounds.
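The module can be sketched in PyTorch as follows; the reduced query/key dimension and the $1/\sqrt{d}$ softmax scaling are common stabilizing choices we assume here, as the similarity function above does not specify them.

```python
import torch
import torch.nn as nn

class CAFE(nn.Module):
    """Sketch of Context-Aware Feature Enhancement: single-head self-
    attention over all flattened spatial positions, added back residually
    as F_CAFE = F_RFN + Y."""
    def __init__(self, channels, dim=64):
        super().__init__()
        self.q = nn.Conv2d(channels, dim, 1)        # W_q
        self.k = nn.Conv2d(channels, dim, 1)        # W_k
        self.v = nn.Conv2d(channels, channels, 1)   # W_v

    def forward(self, x):                           # x: (B, C, H, W)
        B, C, H, W = x.shape
        q = self.q(x).flatten(2).transpose(1, 2)    # (B, N, d), N = H*W
        k = self.k(x).flatten(2)                    # (B, d, N)
        v = self.v(x).flatten(2).transpose(1, 2)    # (B, N, C)
        # a_ij = softmax_j(q_i . k_j): similarity between locations i and j
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, N, N)
        y = attn @ v                                # y_i = sum_j a_ij v_j
        y = y.transpose(1, 2).reshape(B, C, H, W)   # reshape Y back
        return x + y                                # residual enhancement
```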
4.5. Detection Head and Training Procedure
The final enhanced representation is forwarded to a detection head (e.g., anchor-based or anchor-free) to predict bounding boxes and classification scores. Standard loss functions such as a combination of focal loss (for classification) and Smooth L1 loss (for bounding box regression) are employed.
Training proceeds end-to-end. The backbone, DSPP, RFN, CAFE, and detection head parameters are jointly optimized via stochastic gradient-based methods (e.g., SGD or Adam). Appropriate data augmentation and regularization strategies are applied to improve generalization. During inference, the trained model processes a given SAR image through the entire pipeline and applies non-maximum suppression to produce the final apple detections.
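As an illustration of this objective, the sketch below combines torchvision's sigmoid focal loss with Smooth L1 regression on matched anchors; the RetinaNet-style normalization by the number of positives, and the existence of a matcher producing `pos_mask` and the targets, are assumptions.

```python
import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss

def detection_loss(cls_logits, box_preds, cls_targets, box_targets, pos_mask):
    """Sketch of the joint objective: focal loss for classification plus
    Smooth L1 for box regression on positive (matched) anchors, normalized
    by the number of positives. The anchor-matching logic that produces
    cls_targets, box_targets, and pos_mask is assumed to exist elsewhere."""
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets, reduction="sum")
    loss_box = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask],
                                reduction="sum")
    num_pos = pos_mask.sum().clamp(min=1)
    return (loss_cls + loss_box) / num_pos
```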
5. Experiments and Results Analysis
This section presents a comprehensive evaluation of the proposed near-field millimeter-wave SAR apple detection framework, which integrates Dynamic Spatial Pyramid Pooling (DSPP), a Recursive Feature Fusion Network (RFN), and the Context-Aware Feature Enhancement (CAFE) module. We begin by introducing the dataset and evaluation metrics employed in our experiments. We then compare the proposed approach with several representative object detection models, followed by an ablation study that examines the individual contributions and underlying mechanisms of each module. Additionally, we provide qualitative analyses, assess computational efficiency, and discuss potential future research directions for leveraging SAR data in precision agriculture fruit detection and yield estimation.
5.1. Dataset and Evaluation Metrics
The dataset used in this study was acquired using the near-field millimeter-wave MIMO-SAR imaging system described in Section 3. It contains approximately 150 paired SAR-optical image samples and around 2000 accurately annotated apple targets. During data collection, we varied the viewing angles by rotating the fruit-bearing trees and performing multi-angle scanning to ensure diversity and complexity. As a result, the dataset encompasses significant variability in apple size, shape, spatial distribution, and background scattering characteristics, effectively simulating the diverse conditions encountered in real-world orchard environments.
To facilitate objective performance evaluation, the dataset was split into training and test sets at a ratio of roughly 7:3. The training set, with its rich variety of samples, supports robust parameter learning and improved generalization, while the test set serves as an unbiased benchmark of final detection performance. During annotation, corresponding optical images were leveraged to accurately determine apple positions and sizes in the SAR imagery, ensuring annotation precision and consistency.
In terms of evaluation metrics, we adopt the commonly used Average Precision (AP) measure in object detection, computed at an IoU (Intersection over Union) threshold of 0.5. AP effectively quantifies how well predicted bounding boxes align with ground truth targets. In addition, we report Recall and F1-score to provide a more holistic assessment of detection performance under different precision–recall trade-offs.
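For reference, a simplified single-class computation of AP at an IoU threshold of 0.5 is sketched below; it matches predictions greedily by descending score and integrates the precision-recall curve numerically, which approximates rather than exactly reproduces standard interpolated AP.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(preds, gts, thr=0.5):
    """AP@thr for one class. preds: list of (score, box); gts: list of
    boxes. Predictions are matched greedily in descending score order.
    (In practice boxes also carry an image id; omitted for brevity.)"""
    preds = sorted(preds, key=lambda p: -p[0])
    matched, tp, fp = set(), np.zeros(len(preds)), np.zeros(len(preds))
    for i, (_, box) in enumerate(preds):
        best, best_j = 0.0, -1
        for j, g in enumerate(gts):
            v = iou(box, g)
            if j not in matched and v > best:
                best, best_j = v, j
        if best >= thr:
            tp[i] = 1
            matched.add(best_j)
        else:
            fp[i] = 1
    rec = np.cumsum(tp) / max(len(gts), 1)
    prec = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    return float(np.trapz(prec, rec))   # numeric PR-curve integration
```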
5.2. Baseline Models and Experimental Setup
To validate the effectiveness of our proposed framework, we compared it against several classical and widely used object detection models that serve as baseline references. The first model, Faster R-CNN [1], is a classic two-stage detector renowned for its success in optical imagery. Applying this model directly to SAR data establishes a lower bound on performance within this challenging domain. The second set of models, YOLOv3 and YOLOv5 [2,3], are representative single-stage, real-time detectors known for their speed and adaptability across multiple scales. Evaluating these models on SAR data, which often exhibit weak textures, provides insights into the stability and robustness of single-shot methods under complex conditions.
The third model, RetinaNet with a Feature Pyramid Network (FPN) [4], is a single-stage detector that integrates FPN for multi-scale feature fusion and employs Focal Loss to address class imbalance during training. Its proficiency in detecting small objects makes it a relevant benchmark for assessing our multi-scale approach. Additionally, we incorporated Faster R-CNN with a static Spatial Pyramid Pooling (SPP) module, which adds a traditional fixed-scale SPP to Faster R-CNN. This setup serves as a baseline for evaluating the effectiveness of a static multi-scale strategy on SAR data, providing a point of comparison for our dynamic DSPP approach.
All models were implemented using the PyTorch (version 2.0.0, developed by Meta AI, Menlo Park, CA, USA) framework and trained and tested on a workstation equipped with an NVIDIA GPU. To enhance model generalization, we employed various data augmentation techniques, including horizontal flipping, translation, and noise perturbation. For optimization, we utilized either SGD or Adam optimizers, initially increasing the learning rate gradually through a warm-up strategy. In the later stages of training, the learning rates were adjusted using step decay or cosine annealing to ensure stable convergence.
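A minimal sketch of the described schedule, assuming hypothetical hyperparameters, is shown below; the warm-up length, base learning rate, and annealing horizon are illustrative only.

```python
import torch
import torch.nn as nn

# Illustrative optimizer/scheduler setup for the strategy described above:
# a linear warm-up followed by cosine annealing. The detector is replaced
# by a stand-in module, and all hyperparameters are hypothetical.
model = nn.Linear(8, 8)                     # stand-in for the full detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=1e-4)
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.1, total_iters=500)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20000)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[500])

for step in range(3):                       # sketch of the update loop
    optimizer.step()                        # (loss.backward() omitted)
    scheduler.step()
```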
5.3. Quantitative Experiments
To evaluate the performance of different detection models on near-field MIMO-SAR apple detection, we measured their AP, Recall, F1-score, and average inference time under the same hardware conditions (an NVIDIA Tesla V100 with single-image inference).
Table 2 provides a comprehensive summary of these metrics for both classical optical-based detectors (Faster R-CNN, YOLOv3, YOLOv5, and RetinaNet) and two SAR-specific methods (LSDN-RC and YOLO-RC), along with our proposed DSPP + RFN + CAFE approach. The results reveal that standard models such as Faster R-CNN and YOLO variants typically yield AP values around 60–65%, indicating that weak-texture apple targets and cluttered backgrounds pose significant challenges for conventional detectors that depend heavily on visual priors. Although RetinaNet with FPN and Faster R-CNN supplemented by a static SPP module achieve moderate performance gains, their AP improvements remain relatively limited.
Meanwhile, LSDN-RC and YOLO-RC—both tailored to the range-compressed SAR domain—demonstrate improved performance over purely optical-based methods, with AP in the 69–72% range. However, they still fall short of effectively mitigating the weak-reflector problem inherent in near-field orchard imaging. In contrast, our proposed DSPP + RFN + CAFE framework achieves a remarkable AP of 81.6%, surpassing all competing models by at least 10 percentage points. Such a significant improvement underscores the effectiveness of dynamic multi-scale pooling (DSPP), iterative feature refinement (RFN), and context-aware enhancement (CAFE) in addressing the low-contrast, noise-prone conditions characteristic of SAR-based apple detection. This substantial performance boost paves the way for more reliable fruit monitoring in precision agriculture.
5.4. Ablation Study
To investigate the contribution of each module, we conducted an ablation study by progressively adding or removing DSPP, RFN, and CAFE, with the results presented in Table 3. Incorporating DSPP alone increased the Average Precision (AP) from 64.1% to 69.0%, affirming that a dynamic multi-scale strategy is crucial for handling the diverse scale variations in SAR orchard scenes. Introducing RFN on top of DSPP further raised the AP to 75.2%, demonstrating that recursive feature fusion effectively refines feature representations and enhances stability and semantic consistency amid weak textures. Similarly, combining CAFE with DSPP yielded an AP of 73.5%, illustrating that context-aware enhancement, through modeling global and local dependencies, provides essential semantic cues for robust detection. When DSPP, RFN, and CAFE were employed jointly, the AP reached 81.6%, a 17.5-point improvement over the baseline. This confirms that the three modules complement each other, collectively strengthening feature representation and contextual understanding in SAR-based apple detection.
5.5. Qualitative Analysis and Visualization
We conducted qualitative comparisons on selected test samples, as shown in Figure 8. Unlike traditional methods, such as Faster R-CNN, which often miss or misclassify weak-scattering targets, our framework produces more precise bounding boxes. Even in challenging conditions involving occlusion by leaves, complex backgrounds, or subtle scattering signatures, our method reliably locates apples. This performance gain is driven by three key factors: DSPP, which dynamically emphasizes scales relevant to various apple sizes; RFN, which iteratively refines features to suppress noise and ambiguity; and CAFE, which integrates context to leverage environmental cues within complex scenes. Together, these components enhance the framework’s ability to accurately detect apples in diverse and intricate SAR-based orchard environments.
5.6. Runtime Efficiency and Practical Implications
In practical agricultural applications, detection speed and energy efficiency are also important considerations. Although our method introduces a higher computational load compared to single-stage detectors, inference on a GPU-accelerated workstation still occurs within hundreds of milliseconds per SAR image. Future work could involve model pruning, distillation, quantization, or parallelization to meet large-scale and near-real-time monitoring demands.
This study offers a new technical pathway for fruit detection and yield estimation using near-field millimeter-wave SAR data. With ongoing advancements in imaging hardware, computational resources, and the potential fusion of optical or multi-spectral data, there is a promising future for comprehensive, automated fruit detection and decision support systems in precision agriculture.
5.7. Summary
In this section, we validated the effectiveness of our proposed approach on a dedicated near-field millimeter-wave SAR apple dataset. Compared to various classical and state-of-the-art detectors, our method achieves a pronounced lead in AP. Ablation studies confirmed the synergistic effects of DSPP, RFN, and CAFE in enabling dynamic multi-scale adaptation, iterative feature refinement, and context-aware modeling. Qualitative analyses further substantiated the framework’s ability to accurately identify apple targets amidst challenging scattering backgrounds. Overall, this work provides a valuable reference for high-precision fruit detection under stringent SAR imaging conditions and establishes a theoretical and experimental foundation for adopting such technologies in precision agriculture practices.
6. Conclusions
This study addresses the challenges of apple detection in SAR imagery, characterized by weak scattering, low contrast, and complex backgrounds, by proposing an integrated detection framework that incorporates Dynamic Spatial Pyramid Pooling (DSPP), a Recursive Feature Fusion Network (RFN), and a Context-Aware Feature Enhancement (CAFE) module. Based on our analysis and empirical validation, the key conclusions and contributions are as follows:
Effectiveness of Dynamic Multi-Scale Representations: The DSPP module overcomes the limitations of traditional fixed-scale spatial pyramid pooling. By introducing adaptive and learnable mechanisms, it highlights the most informative scales and regions of the image. This adaptability significantly improves the model’s robustness to varying apple sizes and distributions within the complex SAR orchard environment.
Refinement Through Recursive Feature Fusion: The RFN module employs iterative feature integration, progressively refining features to mitigate noise and semantic uncertainty. Compared to a single-pass fusion, this recursive approach stabilizes feature representations and enhances semantic clarity, resulting in more reliable detection performance under low signal-to-noise conditions.
Enhancing Discriminative Power With Context-Aware Modeling: The CAFE module explicitly integrates global and local contextual information into the detection pipeline, compensating for the lack of intuitive texture and color cues in SAR images. By modeling both long-range dependencies and localized relationships, the module leverages environmental cues to improve the distinction between apples and cluttered backgrounds.
Extensive experiments on a dedicated near-field millimeter-wave SAR apple dataset demonstrate that our proposed framework significantly outperforms baseline methods, achieving substantial improvements in AP, Recall, and F1-score. Ablation studies confirm the synergistic benefits of DSPP, RFN, and CAFE in terms of multi-scale adaptability, iterative feature refinement, and context-aware modeling. Qualitative analyses further validate the capability of our approach to consistently detect apples, even under challenging conditions such as foliage occlusion and weak scattering signatures. In summary, this research provides a promising technical pathway for realizing high-precision fruit detection and yield estimation using SAR data, paving the way for future integration with advanced imaging modalities and data fusion strategies. The work lays a solid theoretical and experimental foundation for the adoption of SAR-based detection solutions in precision agriculture, ultimately contributing to more efficient orchard management and improved agricultural productivity.