Article

A Shooting Distance Adaptive Crop Yield Estimation Method Based on Multi-Modal Fusion

by Dan Xu 1, Ba Li 1, Guanyun Xi 1, Shusheng Wang 2, Lei Xu 3 and Juncheng Ma 1,*

1 College of Water Resources and Civil Engineering, China Agricultural University, Beijing 100083, China
2 Lushan Botanical Garden, Chinese Academy of Sciences, Nanchang 332900, China
3 Jiangxi Daduo Technology Co., Ltd., Nanchang 330029, China
* Author to whom correspondence should be addressed.
Agronomy 2025, 15(5), 1036; https://doi.org/10.3390/agronomy15051036
Submission received: 13 March 2025 / Revised: 23 April 2025 / Accepted: 24 April 2025 / Published: 25 April 2025
(This article belongs to the Special Issue Smart Farming Technologies for Sustainable Agriculture—2nd Edition)

Abstract:
To address the low estimation accuracy of deep learning-based crop yield image recognition methods under untrained shooting distances, this study proposes a shooting distance adaptive crop yield estimation method by fusing RGB and depth image information through multi-modal data fusion. Taking strawberry fruit fresh weight as an example, RGB and depth image data of 348 strawberries were collected at nine heights ranging from 70 to 115 cm. First, based on RGB images and shooting height information, a single-modal crop yield estimation model was developed by training a convolutional neural network (CNN) after cropping strawberry fruit images using the relative area conversion method. Second, the height information was expanded into a data matrix matching the RGB image dimensions, and multi-modal fusion models were investigated through input-layer and output-layer fusion strategies. Finally, two additional approaches were explored: direct fusion of RGB and depth images, and extraction of average shooting height from depth images for estimation. The models were tested at two untrained heights (80 cm and 100 cm). Results showed that when using only RGB images and height information, the relative area conversion method achieved the highest accuracy, with R2 values of 0.9212 and 0.9304, normalized root mean square error (NRMSE) of 0.0866 and 0.0814, and mean absolute percentage error (MAPE) of 0.0696 and 0.0660 at the two untrained heights. By further incorporating depth data, the highest accuracy was achieved through input-layer fusion of RGB images with extracted average height from depth images, improving R2 to 0.9475 and 0.9384, reducing NRMSE to 0.0707 and 0.0766, and lowering MAPE to 0.0591 and 0.0610. Validation using a developed shooting distance adaptive crop yield estimation platform at two random heights yielded MAPE values of 0.0813 and 0.0593. This model enables adaptive crop yield estimation across varying shooting distances, significantly enhancing accuracy under untrained conditions.

1. Introduction

The accurate monitoring of crop yield is fundamental to the optimal control of greenhouse climate [1] and to precise decision-making in outdoor vegetable management [2]. Traditional yield estimation methods, such as destructive sampling and manual measurements, remain labor-intensive and time-consuming [3]. Recent advancements in remote sensing [4] and computer vision [5] have transformed agricultural phenotyping by enabling non-destructive, high-throughput data acquisition. With the rapid development of artificial intelligence and the accumulation of big data, machine learning now plays a pivotal role in image recognition [6]. Traditional machine learning methods require the empirical extraction of plant features related to crop yield [7], and such hand-crafted features are prone to bias from personal experience. Deep learning has revolutionized crop yield prediction by outperforming traditional machine learning models in processing complex spatial patterns from imaging data [8].
Convolutional neural networks (CNNs) have demonstrated remarkable success in estimating biomass parameters across diverse crops, including lettuce [9], wheat [10], and strawberries [11]. For example, Kim et al. [12] developed a machine vision-based deep learning system for non-destructive weight prediction of butterhead lettuce, achieving high accuracy (R2 = 0.95) through optimized CNN models and real-time processing, enabling automated yield monitoring in plant factories. Ma et al. [13] proposed a UAV-RGB-CNN model for wheat yield prediction under limited training samples, achieving high accuracy (R2 = 0.89) with the state-of-the-art Efficientnetv2_s model and demonstrating low-cost aerial monitoring potential in precision agriculture. Zheng et al. [14] developed a deep learning framework using high-resolution imagery for precise strawberry canopy delineation and biomass prediction, with R2 values of 0.76 and 0.79 for dry biomass prediction with VGG-16 and ResNet-50 models. Lee et al. [15] developed a UAS-based multi-sensor deep learning model integrating RGB and spectral data to predict Napa cabbage fresh weight and identify optimal harvest timing, with the highest accuracy (R2 = 0.82, RMSE = 0.47 kg) compared with support vector machine and random forest models during the mid-to-late rosette growth stage. Chaudhary et al. [16] introduced a transfer learning framework utilizing pre-trained CNNs for berry yield prediction under limited datasets, achieving R2 of 0.78 for both strawberry and raspberry yield prediction. Okada et al. [17] integrated UAV remote sensing with deep learning to estimate soybean biomass through conventional traits and latent feature extraction, enhancing the high-throughput phenotyping accuracy (R2 = 0.935 to 0.940) of dry weight estimation.
The integration of multimodal data—particularly RGB-D (RGB plus depth) fusion—has emerged as a promising strategy to enhance biomass estimation accuracy by synergizing spectral information with three-dimensional structural data [18]. Depth sensors provide critical spatial context, enabling precise quantification of plant volume and architectural features that exhibit strong correlations with fresh weight [19]. For example, Quan et al. [20] demonstrated that dual-stream neural networks combining RGB and depth features improved weed biomass prediction accuracy with root mean square error (RMSE) from 1.011 to 0.398 at the late stage. Xu et al. [21] introduced a novel loss function mean squared percentage error (MSPE) and fused the RGB and depth data at the input layer, lowering the lettuce fresh weight estimation error to 8.47%. With the open-source RGB and depth data of lettuce from the third Autonomous Greenhouse Challenge [22], many researchers [23,24,25,26] investigated the potential of RGB-D fusion for lettuce phenotyping estimation. In addition, Lin et al. [27] built an open-source multi-modal CropNet dataset for crop yield predictions and evaluated the prediction performance for corn, cotton, soybean, and winter wheat. Togninalli et al. [28] proposed a machine learning model that leverages both genotype and phenotype measurements by fusing genetic variants with multiple data sources collected by unmanned aerial systems, obtaining a 13.5% improvement over the linear baseline on wheat yield prediction. Toledo et al. [29] introduced a multi-modal deep learning architecture that fused high-resolution hyperspectral imagery, LiDAR point clouds, and environmental data to forecast maize crop yields with R2 ranging from 0.82 to 0.93. Miranda et al. [30] incorporated local neighborhood information using convolutional neural networks and geographical coordinates and estimated soybean yield with an R2 of 0.86. Mena et al. [31] presented a novel multi-modal learning approach that included multi-spectral optical images and weather data to predict crop yield for different crops (soybean, wheat, and rapeseed) and regions (Argentina, Uruguay, and Germany) with R2 of 0.80 across different countries. Yewle et al. [32] introduced a deep ensemble model RicEns-Net to predict crop yields by integrating the synthetic aperture radar and optical remote sensing data, achieving a mean absolute error (MAE) of 341 kg/Ha for rice crop yield prediction.
Current multi-modal approaches face the challenge of limited generalization across imaging heights due to insufficient training data covering multiple distance gradients. This limitation arises from scale variance in object appearance across different imaging heights and the inability of conventional architectures to adapt to these variations. Liu et al. [33] evaluated the impact of UAV image resolution on potato above-ground biomass (AGB) estimation. High-resolution imagery, combined with spectral and texture features, significantly enhanced model accuracy, identifying optimal flight heights (10–30 m) and demonstrating fusion methods to mitigate spectral saturation. Zhang et al. [34] demonstrated that multi-angle UAV imaging at a 40 m height combined with a crop volume model (CVM) achieved optimal rapeseed biomass estimation (r = 0.792). That study revealed comparable accuracy between multi-angle and vertical imaging, offering a cost-effective solution for high-throughput phenotyping in dense field crops through 3D structural analysis. Zhang et al. [35] demonstrated that integrating color indices and texture features from UAV-RGB images via stepwise regression optimally predicted cotton yield at 30 m height during flowering (R2 = 0.978). Tan et al. [36] introduced PosNet, a deep learning model integrating shallow features and positional data from oblique images, achieving high accuracy (R2 = 0.922, RMSE = 1.89) for lettuce fresh weight estimation in plant factories. However, that work did not address compensating for the apparent shrinking of lettuce located farther from the camera in oblique images. These findings underscore the urgent need for adaptive frameworks capable of generalizing across variable imaging geometries.
To date, no RGB-D multi-modal deep learning approach has been reported that automatically estimates crop fresh weight across different shooting heights. This paper takes strawberry fruits as an example to investigate a shooting distance adaptive crop yield estimation method. The contributions of this paper include: (1) investigating multi-modal data fusion approaches (input-level and output-level) combining RGB and depth images to improve the model’s robustness against variations in shooting distances; (2) designing a depth image-based average height extraction algorithm to enhance the interpretability of distance-related features; and (3) implementing an adaptive crop yield estimation framework and evaluating its generalization ability across uncalibrated shooting distances.

2. Materials and Methods

2.1. Data Acquisition

The crop cultivation experiments were conducted between April and July 2023 in an intelligent greenhouse at the Xixia Branch of the Lushan Botanical Garden, Chinese Academy of Sciences, Nanchang City, Jiangxi Province. A total of 3173 strawberry plants of different cultivars were cultivated, as illustrated in Figure 1.
During the experiment, environmental conditions within the greenhouse were strictly controlled: air temperature was maintained at 18 °C to 32 °C, soil temperature at 20 °C to 28 °C, air humidity was regulated between 50% and 70%, and soil moisture was kept within 35% to 50% through the temperature control. To ensure adequate water supply, an automated irrigation system was utilized for periodic watering. The plants received 10 to 12 h of artificial lighting daily to supplement natural sunlight. Pollination was conducted manually to improve fruit setting rates. To optimize photosynthesis, the CO2 concentration was controlled within 800–1200 ppm: when solar radiation was above 50 W·m−2, on–off control of CO2 enrichment was implemented to keep the concentration within this range. Additionally, regular disease monitoring and preventive measures were implemented to safeguard plant health and promote robust growth.
An Intel RealSense D415 camera was employed to capture RGB and depth images of 348 strawberry fruits from five cultivars: Hongyan, Yuexiu, Tianxianzui, Ningyu, and Miaohe. These fruits were sourced from greenhouse cultivation and local markets. Image acquisition was conducted at nine distinct heights ranging from 70 cm to 115 cm from the ground. Before imaging, each strawberry was thoroughly cleaned to remove surface contaminants that might compromise image quality. The cleaned fruits were positioned under the acquisition device for synchronized image capture and fresh weight measurement. A KFS-A electronic scale with a precision of 0.1 g was used for weighing. The built-in sensor fusion technology of the camera ensured precise alignment of the RGB and depth images to facilitate subsequent processing and analysis. Fifteen strawberry fruits were imaged simultaneously above a black background plate, which reduced the imaging labor and simplified segmentation. Examples of acquired images are shown in Figure 2.

2.2. Dataset Construction

To explore the potential of future automatic segmentation under complex backgrounds, the YOLOv8 model (the latest YOLO version at the time of image processing) was employed to detect and crop the RGB and depth images collected from nine different heights. The constructed dataset comprised 348 images, which were divided into training, validation, and test sets in a ratio of 6.4:1.6:2. The strawberry fruits in the images were precisely annotated using Labelme v5.1.1 software and converted into the format required by the YOLOv8 model. Additionally, a YAML configuration file was created to specify dataset paths and directories, facilitating model training and testing. The images were uniformly cropped and resized to 284 × 284 pixels to enhance strawberry recognition accuracy. The detection and cropping procedures are illustrated in Figure 3.
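A minimal sketch of this detection-and-cropping step is given below, assuming the ultralytics YOLOv8 package and a detector fine-tuned on the annotated strawberry images; the weights path and file names are hypothetical, and each detected box is also applied to the aligned depth image so that RGB and depth crops stay paired.

```python
from pathlib import Path

import cv2
from ultralytics import YOLO

# Hypothetical paths: a fine-tuned strawberry detector and one aligned RGB/depth pair.
model = YOLO("runs/detect/strawberry/weights/best.pt")
rgb = cv2.imread("samples/rgb_070cm_001.png")
depth = cv2.imread("samples/depth_070cm_001.png", cv2.IMREAD_UNCHANGED)

out_dir = Path("crops")
out_dir.mkdir(exist_ok=True)

# Detect strawberries in the RGB image and crop the same box from the depth image.
results = model(rgb)
for i, box in enumerate(results[0].boxes.xyxy.cpu().numpy().astype(int)):
    x1, y1, x2, y2 = box
    rgb_crop = cv2.resize(rgb[y1:y2, x1:x2], (284, 284))
    depth_crop = cv2.resize(depth[y1:y2, x1:x2], (284, 284),
                            interpolation=cv2.INTER_NEAREST)
    cv2.imwrite(str(out_dir / f"rgb_{i}.png"), rgb_crop)
    cv2.imwrite(str(out_dir / f"depth_{i}.png"), depth_crop)
```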
A total of 3132 paired RGB and depth images were generated across nine distinct heights (70 cm, 75 cm, 80 cm, 85 cm, 90 cm, 100 cm, 105 cm, 110 cm, and 115 cm), with 348 images captured at each height. The dataset was partitioned as follows: the training set contained 1998 images (222 per height), the validation set included 504 images (56 per height), and the test set comprised 630 images (70 per height). The fresh weight of the strawberry fruits ranged from 6.3 g to 26.3 g, as illustrated in Figure 4.

2.3. RGB Image-Based Relative Area Transformation Estimation Model

To address the issue of reduced estimation accuracy caused by variations in the relative area of strawberry images due to different acquisition heights, a relative area transformation-based estimation method was developed. This approach involves resizing images to adapt to the camera’s viewing angle and height during capture, ensuring consistent relative areas of strawberries across varying heights. Specifically, using 70 cm as the baseline height, image dimensions were adjusted according to relative heights based on the camera’s field of view (FOV). For instance, images captured at 75 cm were resized to 304 × 304 pixels, those at 80 cm to 324 × 324 pixels, and images at 115 cm reached a maximum of 466 × 466 pixels. Subsequently, all images were center-cropped to 284 × 284 pixels to standardize the relative area of strawberries across heights. The effects of this transformation before and after application are illustrated in Figure 5.
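The transformation can be sketched as follows; the truncating division is an assumption, but it reproduces the resized dimensions reported above (304 pixels at 75 cm, 324 pixels at 80 cm, and 466 pixels at 115 cm).

```python
from PIL import Image

BASELINE_CM = 70   # reference shooting height
TARGET = 284       # network input size in pixels

def relative_area_transform(img: Image.Image, height_cm: float) -> Image.Image:
    """Upscale a cropped strawberry image captured at height_cm so the fruit
    occupies roughly the same relative area as at the 70 cm baseline, then
    center-crop back to 284 x 284."""
    new_size = int(TARGET * height_cm / BASELINE_CM)   # 304 at 75 cm, 466 at 115 cm
    img = img.resize((new_size, new_size), Image.BILINEAR)
    left = (new_size - TARGET) // 2
    return img.crop((left, left, left + TARGET, left + TARGET))

# Example: an image captured at 115 cm is resized to 466 x 466 and center-cropped.
patch = relative_area_transform(Image.open("crops/rgb_0.png"), height_cm=115)
```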
In our previous study on RGB-D fusion-based lettuce fresh weight estimation, the CNN_284 model [21] demonstrated superior accuracy compared to VGG16, ResNet18, MobileNet, and EfficientNet. Therefore, this study adopts the CNN_284 architecture to develop an adaptive strawberry fresh weight estimation model capable of handling varying acquisition heights. The model comprises six convolutional layers, five pooling layers, and one fully connected layer. The convolutional layers employ 5 × 5 kernels to extract image features, with the number of kernels increasing sequentially through 16, 32, 64, 128, 256, and 512. Each convolutional layer is followed by a normalization layer to stabilize training and accelerate convergence. Average pooling layers with a 2 × 2 window and a stride of 2 are utilized for down-sampling. Finally, the fully connected layer outputs the estimated fresh weight of the strawberries.
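A PyTorch sketch of this backbone is given below. The use of unpadded convolutions is an inference from the 1 × 1 × 513 fused feature size mentioned in Section 2.5 (512 features plus one height neuron), so the exact layer hyperparameters should be read as assumptions.

```python
import torch
import torch.nn as nn

class CNN284(nn.Module):
    """Sketch of the CNN_284 backbone: six 5 x 5 convolutions with
    16-32-64-128-256-512 kernels, each followed by batch normalization and
    ReLU, five 2 x 2 average-pooling layers, and one fully connected output
    neuron for fresh weight. With unpadded convolutions the feature map
    shrinks from 284 x 284 to 1 x 1 x 512 before the fully connected layer."""

    def __init__(self, in_channels: int = 3):
        super().__init__()
        chans = [in_channels, 16, 32, 64, 128, 256, 512]
        layers = []
        for i in range(6):
            layers += [nn.Conv2d(chans[i], chans[i + 1], kernel_size=5),
                       nn.BatchNorm2d(chans[i + 1]),
                       nn.ReLU(inplace=True)]
            if i < 5:  # five pooling layers, none after the last convolution
                layers.append(nn.AvgPool2d(kernel_size=2, stride=2))
        self.features = nn.Sequential(*layers)
        self.fc = nn.Linear(512, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)          # (N, 512, 1, 1)
        return self.fc(x.flatten(1))  # (N, 1) estimated fresh weight

# Sanity check on a dummy batch of 284 x 284 RGB crops.
print(CNN284(3)(torch.zeros(2, 3, 284, 284)).shape)  # torch.Size([2, 1])
```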

2.4. RGB-H Input Layer Fusion Estimation Model

To address the limitations of existing fresh weight estimation models in accuracy and adaptability due to variations in acquisition heights, this study proposes incorporating acquisition height data (H) to enhance model performance. Specifically, a height matrix matching the pixel dimensions of the RGB images is constructed, where each matrix element equals the acquisition height. The RGB-H input layer fusion directly combines the height matrix with the RGB image to form a four-channel input matrix for the CNN_284 model, as shown in Figure 6. In Figure 6, the RGB image is fused in the input layer with a height matrix of the same size; each white block represents a convolutional layer with batch normalization and a ReLU activation function, the red blocks represent average pooling layers, and the blue block represents the fully connected layer. This enhancement enables the model to better adapt to data captured at varying heights, thereby significantly improving estimation accuracy and stability.
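A sketch of the four-channel RGB-H input construction, reusing the CNN284 class sketched in Section 2.3; scaling both the pixel values and the height channel to [0, 1] is an assumption, as the paper only specifies expanding the height into a matrix and concatenating it with the RGB channels.

```python
import numpy as np
import torch

def rgbh_input(rgb: np.ndarray, height_cm: float, max_height_cm: float = 115.0) -> torch.Tensor:
    """Expand the shooting height into a constant matrix with the same spatial
    size as the RGB crop and stack it as a fourth input channel."""
    rgb_t = torch.from_numpy(rgb).float().permute(2, 0, 1) / 255.0  # (3, 284, 284)
    h_map = torch.full(rgb_t.shape[1:], height_cm / max_height_cm)  # (284, 284)
    return torch.cat([rgb_t, h_map.unsqueeze(0)], dim=0)            # (4, 284, 284)

# The CNN_284 backbone is reused with a four-channel stem.
model = CNN284(in_channels=4)
x = rgbh_input(np.zeros((284, 284, 3), dtype=np.uint8), height_cm=85)
fresh_weight = model(x.unsqueeze(0))
```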

2.5. RGB-H Output Layer Fusion Estimation Model

Two output layer fusion methods were implemented. The first is the dual-branch fusion method (Dual_CNN_284), where RGB samples and the height matrix are introduced as two independent input streams into the network. This approach aims to integrate features from both streams to ensure accurate estimation across varying heights. The architecture of the Dual_CNN_284 model is illustrated in Figure 7.
The second method, termed Blend_CNN_284, adds a 1 × 1 height neuron before the output layer. This neuron holds the acquisition height of the image and is concatenated with the 512 feature neurons derived from the convolutional layers to form a new 1 × 1 × 513 feature tensor. This tensor is subsequently processed by a fully connected layer to output the estimated fresh weight of the strawberries. The architecture of the Blend_CNN_284 model is shown in Figure 8.
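A sketch of the Blend_CNN_284 idea, again reusing the CNN284 backbone from Section 2.3; normalizing the height neuron by the maximum acquisition height is an assumption.

```python
import torch
import torch.nn as nn

class BlendCNN284(nn.Module):
    """The 1 x 1 x 512 feature vector from the convolutional stack is
    concatenated with a single height neuron, and the resulting
    513-dimensional tensor feeds the fully connected output layer."""

    def __init__(self):
        super().__init__()
        self.backbone = CNN284(in_channels=3)  # convolutional stack sketched earlier
        self.fc = nn.Linear(512 + 1, 1)        # 513 inputs -> estimated fresh weight

    def forward(self, rgb: torch.Tensor, height_cm: torch.Tensor) -> torch.Tensor:
        feats = self.backbone.features(rgb).flatten(1)  # (N, 512)
        h = (height_cm / 115.0).view(-1, 1)             # normalization is an assumption
        return self.fc(torch.cat([feats, h], dim=1))    # (N, 1)

est = BlendCNN284()(torch.zeros(2, 3, 284, 284), torch.tensor([80.0, 100.0]))
```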

2.6. RGB-D Input Layer Fusion and Output Layer Fusion Estimation Model

Given that depth images contain richer height-related information, we explore replacing the height matrix (H) with depth image data (D) to enhance estimation accuracy. Two fusion strategies are investigated: the first is input-layer fusion, where the depth image (D) is integrated with the RGB image at the input stage (Figure 6), forming a four-channel input matrix analogous to the previous RGB-H fusion approach. The second is dual-branch output layer fusion (Dual), where the depth image (D) is processed as an independent input stream alongside RGB data, following the dual-branch architecture outlined in Figure 7. Both methods leverage depth information to improve robustness against height variations while maintaining the core CNN_284 backbone architecture.

2.7. Model Training and Evaluation

The dataset was partitioned into training, validation, and test sets with ratios of 6.4:1.6:2. Images captured at the untrained heights (80 cm and 100 cm) were exclusively reserved for testing; these two heights were chosen because they lie in the interior of the nine acquisition heights. For the training and validation sets, data augmentation techniques (rotation, flipping, and illumination adjustment) were applied to expand the sample size by 30 times [37], resulting in 46,620 augmented training samples and 11,760 validation samples. The data distribution and augmentation are shown in Table 1 and Figure 9.
During model training, the ReLU (rectified linear unit) activation function was employed. A mini-batch size of 128 and the Adam optimizer were adopted, with an initial learning rate of 0.0001. Training proceeded for a maximum of 300 epochs to ensure sufficient convergence while mitigating overfitting.
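A sketch of this training configuration, using the MSPE loss formulated in Equation (1) below; train_loader is a hypothetical PyTorch DataLoader yielding mini-batches of 128 augmented inputs and their measured fresh weights.

```python
import torch

def mspe_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean squared percentage error, Equation (1)."""
    return torch.mean(((target - pred) / target) ** 2)

model = CNN284(in_channels=4)  # e.g., the RGB-H input layer fusion variant
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(300):  # up to 300 epochs
    model.train()
    for inputs, weights in train_loader:  # hypothetical DataLoader, batch size 128
        optimizer.zero_grad()
        loss = mspe_loss(model(inputs).squeeze(1), weights)
        loss.backward()
        optimizer.step()
```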
The mean absolute percentage error (MAPE) served as the primary evaluation metric, defined as the arithmetic average of the absolute differences between predicted and actual values normalized by the actual values. Compared to the mean squared error (MSE), MAPE provides a scale-invariant measure of prediction error, enabling more intuitive and dimensionally consistent interpretation. Accordingly, the mean squared percentage error (MSPE) loss function [37] was utilized, as formulated in Equation (1), where $S_i$ is the true value of the $i$-th sample, $E_i$ is the estimated value of the $i$-th sample, $m$ is the number of samples, and $\bar{S}$ is the mean of the true values. The coefficient of determination (R2), normalized root mean square error (NRMSE), and mean absolute percentage error (MAPE) are used to evaluate the model performance, as shown in Equations (2)–(4).
$$\mathrm{MSPE} = \frac{1}{m}\sum_{i=1}^{m}\left(\frac{S_i - E_i}{S_i}\right)^2 \tag{1}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{m}\left(S_i - E_i\right)^2}{\sum_{i=1}^{m}\left(S_i - \bar{S}\right)^2} \tag{2}$$

$$\mathrm{NRMSE} = \frac{1}{\bar{S}}\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(S_i - E_i\right)^2} \tag{3}$$

$$\mathrm{MAPE} = \frac{1}{m}\sum_{i=1}^{m}\frac{\left|S_i - E_i\right|}{S_i} \tag{4}$$
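A sketch of the three evaluation metrics of Equations (2)–(4), computed on arrays of measured and estimated fresh weights:

```python
import numpy as np

def evaluate(true_w: np.ndarray, est_w: np.ndarray) -> dict:
    """Compute R2, NRMSE, and MAPE as defined in Equations (2)-(4)."""
    resid = true_w - est_w
    r2 = 1.0 - np.sum(resid ** 2) / np.sum((true_w - true_w.mean()) ** 2)
    nrmse = np.sqrt(np.mean(resid ** 2)) / true_w.mean()
    mape = np.mean(np.abs(resid) / true_w)
    return {"R2": r2, "NRMSE": nrmse, "MAPE": mape}

# Example with arbitrary fresh-weight values in grams.
print(evaluate(np.array([12.4, 18.0, 9.7]), np.array([11.9, 18.6, 10.1])))
```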

3. Results and Discussion

3.1. Results Based on RGB Image Data

By comparing the R2, NRMSE, and MAPE metrics of the models trained with RGB images and height information (as detailed in Section 2.3, Section 2.4 and Section 2.5), the fresh weight estimation results for strawberries at the untrained heights (80 cm and 100 cm) are summarized in Table 2. The relative area transformation method demonstrated superior performance at both heights, achieving higher R2 values alongside lower NRMSE and MAPE, indicating robust generalization capability and prediction accuracy. The fitting results are visualized in Figure 10.
The RGB-H input layer fusion and output layer dual fusion methods exhibited a marginal improvement in R2 at 100 cm. However, their prediction errors remained slightly higher than those of the relative area transformation method, suggesting inherent limitations in precision. In contrast, the output layer blend fusion method underperformed at both heights, necessitating further optimization to enhance its predictive efficacy. To further investigate the causes of this decline and understand the model’s attention to different features and its weight distribution during learning, this study visualized the weight parameters of the first-layer convolutional kernels for both the RGB-H output layer dual and blend fusion models, as shown in Figure 11. These heatmaps are obtained by extracting the first-layer convolutional kernel weights from the trained models; the RGB image is converted to grayscale and then convolved with these kernels. The weight heatmaps of the RGB-H output layer dual fusion exhibit darker color intensity in strawberry regions, indicating that the model assigns higher weights to features in these areas. In contrast, with the RGB-H output layer blend fusion, the extraction of height-related features fails to focus on these regions, leading to reduced accuracy.

3.2. Results Based on RGB-D Data

Results based on RGB-D data (as detailed in Section 2.6) are shown in Table 3. Both models exhibited unsatisfactory performance at both the 80 cm and 100 cm heights. This limitation stems from the insufficient height gradients in the training dataset (only seven heights), which hindered the models’ ability to adapt to varying heights. To improve strawberry fresh weight estimation accuracy, expanding the dataset with additional height gradients, or supplying richer depth-related information, is essential.
To address the low prediction accuracy caused by inadequate depth gradients, this study proposes extracting the average depth value (avgD) from the depth images to construct an average depth matrix. This approach simplifies the depth information and enhances the linear fitting capability for shooting heights. Specifically, the average depth value is calculated by summing all pixel-wise depth values and dividing by the total number of pixels. A new matrix is then generated in which all pixels share this average depth value; its fusion with the RGB image is denoted RGB-avgD.
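A sketch of the avgD construction; the fused four-channel input then follows the same input layer concatenation as in Section 2.4, and the depth scaling shown is an assumption.

```python
import numpy as np

def average_depth_matrix(depth: np.ndarray) -> np.ndarray:
    """Replace a cropped depth image by a constant matrix holding its mean
    depth value (avgD), computed over all pixels."""
    avg_d = depth.astype(np.float32).mean()
    return np.full_like(depth, avg_d, dtype=np.float32)

# Fuse with the RGB crop as a fourth channel (scale factors are assumptions).
def rgb_avgd_input(rgb: np.ndarray, depth: np.ndarray) -> np.ndarray:
    avg_d = average_depth_matrix(depth) / 1000.0                      # millimetres -> metres
    return np.concatenate([rgb / 255.0, avg_d[..., None]], axis=-1)   # (284, 284, 4)
```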
By replacing the raw depth images with the average depth matrix and retraining models using the input layer fusion and output layer dual fusion methods (results in Table 4), the simplified depth representation significantly improved the height-related linear fitting. The input layer fusion method achieved lower prediction errors than the output layer dual fusion, attributed to its synchronized extraction of height features. The fitting results for the RGB-avgD input layer fusion model are visualized in Figure 12.

3.3. Comparisons Between RGB and RGB-avgD Models

A comparison between the RGB-based relative area transformation method and the RGB-avgD input-layer fusion method (Table 5) revealed that the latter achieved lower prediction errors at untrained heights. This demonstrates that leveraging average height information from depth images surpasses RGB-only approaches in accuracy. Thus, the RGB-avgD input layer fusion method exhibits superior performance among the investigated models.
A total of six models are investigated in this paper: RGB relative area transformation, RGB-H input layer fusion, RGB-H output layer dual fusion, RGB-H output layer blend fusion, RGB-avgD input layer fusion, and RGB-avgD output layer dual fusion. The first four models use only RGB images and height information. Because the RGB-H output layer blend fusion model performed poorly, this structure was not used when exploring the automatic crop yield estimation method based on RGB-D images. Because the raw depth data proved ineffective, replacing it with the average depth information was explored instead.

3.4. Validation of the Crop Fresh Weight Estimation Platform

To address efficiency and accuracy challenges in strawberry fresh weight estimation, a real-time three-degree-of-freedom (up–down, left–right, front–back) estimation platform was developed. The platform integrates a Raspberry Pi 4B embedded board, four 42BYGH34 stepper motors, four TB6600 stepper motor drivers, and an Intel RealSense D415 camera. The Raspberry Pi 4B board controls the stepper motors, captures the images, and processes them; it incorporates the pre-trained YOLOv8 object detection and cropping model, along with the height-adaptive RGB-avgD fresh weight estimation model. The four stepper motors move the camera with three degrees of freedom. The hardware devices are shown in Figure 13.
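A minimal capture sketch for the platform using the pyrealsense2 API: aligned RGB and depth frames are streamed from the D415 and handed to the YOLOv8 cropper and the RGB-avgD estimator described above. The stream resolution and frame rate are illustrative.

```python
import numpy as np
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.color, 640, 480, rs.format.bgr8, 30)
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)
pipeline.start(config)
align = rs.align(rs.stream.color)  # align depth frames to the color frame

try:
    frames = align.process(pipeline.wait_for_frames())
    rgb = np.asanyarray(frames.get_color_frame().get_data())
    depth = np.asanyarray(frames.get_depth_frame().get_data())
    # rgb and depth are then cropped per fruit and passed to the estimation model.
finally:
    pipeline.stop()
```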
To validate the platform’s adaptability, two random heights (H1 = 72 cm and H2 = 103 cm) were selected for sampling 12 strawberry fruits, as shown in Figure 14. The MAPE values at the two heights were 0.0813 and 0.0593, confirming the trained model’s adaptability to untrained acquisition heights. The fitting results are shown in Figure 15. The system is low cost (the image processing and crop yield estimation models are embedded in a low-cost ARM board), flexible (the camera moves with three degrees of freedom to better capture the crop images), and accurate (crop yield is estimated automatically from RGB-D images shot at random heights). Compared with destructive manual yield measurement, the developed system is fast and non-destructive. Compared with traditional algorithms, the developed deep learning-based algorithm is more accurate and adaptive to different shooting distances.
Real-time crop yield estimation in the field will be researched in the future. Two main challenges need to be tackled to facilitate this application. On the one hand, efficient automatic crop segmentation from complex backgrounds must be solved; this procedure was simplified in this paper by employing a black background so as to focus on the height-adaptive techniques. On the other hand, occlusions will occur when imaging crops in the field; to handle this problem, image restoration is being researched by the authors.
The crop estimation model trained on the dataset covering seven heights enables automatic extraction of shooting distance-related features. In the future, a larger dataset at more random distances, obtained from a patrol robot, will be used to validate its field applications. Meanwhile, more sophisticated models that incorporate attention mechanisms will also be investigated [38,39].

4. Conclusions

1. Three shooting distance adaptive crop fresh weight estimation methods were proposed based on RGB images: the relative area transformation method, RGB-H input layer fusion, and RGB-H output layer fusion (dual and blend). The results demonstrated that the relative area transformation method achieved the highest accuracy, with errors of 6.65% and 6.60% at the two untrained heights. In contrast, the output layer blend fusion method exhibited the lowest accuracy due to insufficient extraction of height-related features.
2. RGB-D-based approaches further improved accuracy compared to RGB-only methods. However, the direct use of raw depth images yielded unsatisfactory adaptability, owing to limited height gradients. By introducing the average depth matrix (RGB-avgD), errors were reduced to 5.91% and 6.10% at untrained heights, outperforming the RGB-based relative area transformation method. This enhancement stems from the improved linear fitting capability of convolutional neural networks to height information through simplified depth representation.
3. The developed three-degree-of-freedom crop fresh weight estimation platform was tested at two random heights (72 cm and 103 cm). The platform achieved robust image recognition and cropping performance, with low errors (8.13% and 5.93%) in small-sample testing, confirming the effectiveness of the proposed height-adaptive models and system.
4. Future work should focus on further validation and refinement in practical production environments. A larger dataset at more random distances, more sophisticated models that incorporate the attention mechanisms, and the picture restoration of the occlusions caused by field shooting will be researched in the future.

Author Contributions

Conceptualization, D.X.; Methodology, D.X.; Software, B.L. and G.X.; Investigation, D.X. and B.L.; Resources, S.W., L.X. and J.M.; Data curation, B.L. and G.X.; Writing—original draft, D.X., B.L. and G.X.; Supervision, D.X. and S.W.; Funding acquisition, D.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was financially supported by the National Key Research and Development Program of China (2024YFD2000800), the Key Technology Research and Development Program of Shandong (2022CXGC020708), the National Natural Science Foundation of China (32371998 and U20A2020), the National Modern Agricultural Technology System Construction Project (CARS-23-D02), the Beijing Innovation Consortium of Agriculture Research System (BAIC01-2025), and the 2115 Talent Development Program of China Agricultural University.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The cooperation with Jiangxi Daduo Technology Co., Ltd. will facilitate the future application of the developed algorithms.

Conflicts of Interest

Author Lei Xu was employed by the company Jiangxi Daduo Technology Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Xu, D.; Du, S.; van Willigenburg, G. Double closed-loop optimal control of greenhouse cultivation. Control Eng. Pract. 2019, 85, 90–99. [Google Scholar] [CrossRef]
  2. Murphy, K.M.; Ludwig, E.; Gutierrez, J.; Gehan, M.A. Deep learning in image-based plant phenotyping. Annu. Rev. Plant Biol. 2024, 75, 771–795. [Google Scholar] [CrossRef]
  3. Li, L.; Zhang, Q.; Huang, D. A review of computer vision technologies for plant phenotyping. Comput. Electron. Agric. 2020, 176, 105672. [Google Scholar] [CrossRef]
  4. Han, H.; Liu, Z.; Li, J.; Zeng, Z. Challenges in remote sensing based climate and crop monitoring: Navigating the complexities using AI. J. Cloud Comput. 2024, 13, 34. [Google Scholar] [CrossRef]
  5. Chen, Y.; Huang, Y.; Zhang, Z.; Wang, Z.; Liu, B.; Liu, C.; Huang, C.; Dong, S.; Pu, X.; Wan, F.; et al. Plant image recognition with deep learning: A review. Comput. Electron. Agric. 2023, 212, 108072. [Google Scholar] [CrossRef]
  6. Van Klompenburg, T.; Kassahun, A.; Catal, C. Crop yield prediction using machine learning: A systematic literature review. Comput. Electron. Agric. 2020, 177, 105709. [Google Scholar] [CrossRef]
  7. Paudel, D.; Boogaard, H.; de Wit, A.; Janssen, S.; Osinga, S.; Pylianidis, C.; Athanasiadis, I.N. Machine learning for large-scale crop yield forecasting. Agric. Syst. 2021, 187, 103016. [Google Scholar] [CrossRef]
  8. Nevavuori, P.; Narra, N.; Lipping, T. Crop yield prediction with deep convolutional neural networks. Comput. Electron. Agric. 2019, 163, 104859. [Google Scholar] [CrossRef]
  9. Zhang, L.; Xu, Z.; Xu, D.; Ma, J.; Chen, Y.; Fu, Z. Growth monitoring of greenhouse lettuce based on a convolutional neural network. Hortic. Res. 2020, 7, 124. [Google Scholar] [CrossRef]
  10. Wang, J.; Wang, P.; Tian, H.; Tansey, K.; Liu, J.; Quan, W. A deep learning framework combining CNN and GRU for improving wheat yield estimates using time series remotely sensed multi-variables. Comput. Electron. Agric. 2023, 206, 107705. [Google Scholar] [CrossRef]
  11. Chen, Y.; Lee, W.S.; Gan, H.; Peres, N.; Fraisse, C.; Zhang, Y.; He, Y. Strawberry yield prediction based on a deep neural network using high-resolution aerial orthoimages. Remote Sens. 2019, 11, 1584. [Google Scholar] [CrossRef]
  12. Kim, J.S.G.; Moon, S.; Park, J.; Kim, T.; Chung, S. Development of a machine vision-based weight prediction system of butterhead lettuce (Lactuca sativa L.) using deep learning models for industrial plant factory. Front. Plant Sci. 2024, 15, 1365266. [Google Scholar] [CrossRef] [PubMed]
  13. Ma, J.; Wu, Y.; Liu, B.; Zhang, W.; Wang, B.; Chen, Z.; Guo, A. Wheat yield prediction using unmanned aerial vehicle RGB-imagery-based convolutional neural network and limited training samples. Remote Sens. 2023, 15, 5444. [Google Scholar] [CrossRef]
  14. Zheng, C.; Abd-Elrahman, A.; Whitaker, V.M.; Dalid, C. Deep learning for strawberry canopy delineation and biomass prediction from high-resolution images. Plant Phenomics 2022, 2022, 9850486. [Google Scholar] [CrossRef]
  15. Lee, D.H.; Park, J.H. Development of a UAS-Based Multi-Sensor Deep Learning Model for Predicting Napa Cabbage Fresh Weight and Determining Optimal Harvest Time. Remote Sens. 2024, 16, 3455. [Google Scholar] [CrossRef]
  16. Chaudhary, M.; Gastli, M.S.; Nassar, L.; Karray, F. Transfer learning application for berries yield forecasting using deep learning. In Proceedings of the 2021 International Joint Conference on Neural Networks (IJCNN), Virtual, 18–22 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–8. [Google Scholar]
  17. Okada, M.; Barras, C.; Toda, Y.; Hamazaki, K.; Ohmori, Y.; Yamasaki, Y.; Iwata, H. High-throughput phenotyping of soybean biomass: Conventional trait estimation and novel latent feature extraction using UAV remote sensing and deep learning models. Plant Phenomics 2024, 6, 0244. [Google Scholar] [CrossRef]
  18. El Sakka, M.; Ivanovici, M.; Chaari, L.; Mothe, J. A Review of CNN Applications in Smart Agriculture Using Multimodal Data. Sensors 2025, 25, 472. [Google Scholar] [CrossRef]
  19. Mortensen, A.K.; Bender, A.; Whelan, B.; Barbour, M.M.; Sukkarieh, S.; Karstoft, H.; Gislum, R. Segmentation of lettuce in coloured 3D point clouds for fresh weight estimation. Comput. Electron. Agric. 2018, 154, 373–381. [Google Scholar] [CrossRef]
  20. Quan, L.; Li, H.; Li, H.; Jiang, W.; Lou, Z.; Chen, L. Two-stream dense feature fusion network based on RGB-D data for the real-time prediction of weed aboveground fresh weight in a field environment. Remote Sens. 2021, 13, 2288. [Google Scholar] [CrossRef]
  21. Xu, D.; Chen, J.; Li, B.; Ma, J. Improving lettuce fresh weight estimation accuracy through RGB-D fusion. Agronomy 2023, 13, 2617. [Google Scholar] [CrossRef]
  22. Petropoulou, A.S.; van Marrewijk, B.; de Zwart, F.; Elings, A.; Bijlaard, M.; van Daalen, T.; Hemming, S. Lettuce production in intelligent greenhouses—3D imaging and computer vision for plant spacing decisions. Sensors 2023, 23, 2929. [Google Scholar] [CrossRef] [PubMed]
  23. Buxbaum, N.; Lieth, J.H.; Earles, M. Non-destructive plant biomass monitoring with high spatio-temporal resolution via proximal RGB-D imagery and end-to-end deep learning. Front. Plant Sci. 2022, 13, 758818. [Google Scholar] [CrossRef] [PubMed]
  24. Lin, Z.; Fu, R.; Ren, G.; Zhong, R.; Ying, Y.; Lin, T. Automatic monitoring of lettuce fresh weight by multi-modal fusion based deep learning. Front. Plant Sci. 2022, 13, 980581. [Google Scholar] [CrossRef] [PubMed]
  25. Zhang, Q.; Zhang, X.; Wu, Y.; Li, X. TMSCNet: A three-stage multi-branch self-correcting trait estimation network for RGB and depth images of lettuce. Front. Plant Sci. 2022, 13, 982562. [Google Scholar] [CrossRef]
  26. Hou, L.; Zhu, Y.; Wang, M.; Wei, N.; Dong, J.; Tao, Y.; Zhang, J. Multimodal data fusion for precise lettuce phenotype estimation using deep learning algorithms. Plants 2024, 13, 3217. [Google Scholar] [CrossRef]
  27. Lin, F.; Guillot, K.; Crawford, S.; Zhang, Y.; Yuan, X.; Tzeng, N.F. An open and large-scale dataset for multi-modal climate change-aware crop yield predictions. arXiv 2024, arXiv:2406.06081. [Google Scholar]
  28. Togninalli, M.; Wang, X.; Kucera, T.; Shrestha, S.; Juliana, P.; Mondal, S.; Poland, J. Multi-modal deep learning improves grain yield prediction in wheat breeding by fusing genomics and phenomics. Bioinformatics 2023, 39, btad336. [Google Scholar] [CrossRef]
  29. Aviles Toledo, C.; Crawford, M.M.; Tuinstra, M.R. Integrating multi-modal remote sensing, deep learning, and attention mechanisms for yield prediction in plant breeding experiments. Front. Plant Sci. 2024, 15, 1408047. [Google Scholar] [CrossRef]
  30. Miranda, M.; Pathak, D.; Nuske, M.; Dengel, A. Multi-modal fusion methods with local neighborhood information for crop yield prediction at field and subfield levels. In Proceedings of the IGARSS 2024, Athens, Greece, 7–12 July 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 4307–4311. [Google Scholar]
  31. Mena, F.; Pathak, D.; Najjar, H.; Sanchez, C.; Helber, P.; Bischke, B.; Dengel, A. Adaptive fusion of multi-modal remote sensing data for optimal sub-field crop yield prediction. Remote Sens. Environ. 2025, 318, 114547. [Google Scholar] [CrossRef]
  32. Yewle, A.D.; Mirzayeva, L.; Karakuş, O. Multi-modal data fusion and deep ensemble learning for accurate crop yield prediction. arXiv 2025, arXiv:2502.06062. [Google Scholar]
  33. Liu, Y.; Feng, H.; Sun, Q.; Yang, F.; Yang, G. Estimation study of above ground biomass in potato based on UAV digital images with different resolutions. Spectrosc. Spectr. Anal. 2021, 41, 1470–1476. [Google Scholar]
  34. Zhang, J.; Xie, T.; Wei, X.; Wang, Z.; Liu, C.; Zhou, G.; Wang, B. Estimation of feed rapeseed biomass based on multi-angle oblique imaging technique of unmanned aerial vehicle. Acta Agron. Sin. 2021, 47, 1816–1823. [Google Scholar]
  35. Zhang, J.; Guo, S.; Han, Y.; Lei, Y.; Xing, F.; Du, W.; Li, Y.; Feng, L. Estimation of cotton yield based on unmanned aerial vehicle RGB images. J. Agric. Sci. Technol. 2022, 24, 112–120. [Google Scholar]
  36. Tan, J.; Hou, J.; Xu, W.; Zheng, H.; Gu, S.; Zhou, Y.; Qi, L.; Ma, R. PosNet: Estimating lettuce fresh weight in plant factory based on oblique image. Comput. Electron. Agric. 2023, 213, 108263. [Google Scholar] [CrossRef]
  37. Xu, D.; Li, S.; Chen, J.; Cui, T.; Zhang, Y.; Ma, J. Image recognition of lettuce fresh weight through group estimation. J. China Agric. Univ. 2024, 29, 173–183. [Google Scholar]
  38. Ma, J.; Liu, B.; Ji, L.; Zhu, Z.; Wu, Y.; Jiao, W. Field-scale yield prediction of winter wheat under different irrigation regimes based on dynamic fusion of multimodal UAV imagery. Int. J. Appl. Earth Obs. Geoinf. 2023, 118, 103292. [Google Scholar] [CrossRef]
  39. Tian, H.; Wang, P.; Tansey, K.; Wang, J.; Quan, W.; Liu, J. Attention mechanism-based deep learning approach for wheat yield estimation and uncertainty analysis from remotely sensed variables. Agric. For. Meteorol. 2024, 356, 110183. [Google Scholar] [CrossRef]
Figure 1. The experimental greenhouse and the growth of strawberries.
Figure 2. Strawberry images at nine different heights.
Figure 3. RGB image recognition and cropping effects based on YOLOv8.
Figure 4. Distribution of strawberry fresh weight.
Figure 5. Relative area transformation at 9 heights.
Figure 6. The input layer fusion estimation model.
Figure 7. Dual_CNN_284 fresh weight estimation model.
Figure 8. Blend_CNN_284 fresh weight estimation model.
Figure 9. Augmentation procedure.
Figure 10. Fitting results with the relative area transformation method.
Figure 11. Weight heatmaps for RGB-H output layer dual and blend fusion models.
Figure 12. Fitting results with RGB-avgD input layer fusion model.
Figure 13. Hardware devices.
Figure 14. Crop fresh weight estimation platform.
Figure 15. Fitting results of the platform.
Table 1. Number of strawberry images at different heights.

Height | Training dataset (original and augmented) | Validation dataset (original and augmented) | Test dataset
70 cm | 222 (46,620) | 56 (11,760) | 70
75 cm | 222 (46,620) | 56 (11,760) | 70
80 cm | 0 | 0 | 70
85 cm | 222 (46,620) | 56 (11,760) | 70
90 cm | 222 (46,620) | 56 (11,760) | 70
100 cm | 0 | 0 | 70
105 cm | 222 (46,620) | 56 (11,760) | 70
110 cm | 222 (46,620) | 56 (11,760) | 70
115 cm | 222 (46,620) | 56 (11,760) | 70
Table 2. Results based on RGB images.

Height | Model | R2 | NRMSE | MAPE
80 cm | RGB relative area transformation | 0.9155 | 0.0761 | 0.0665
80 cm | RGB-H input layer fusion | 0.9145 | 0.0902 | 0.0716
80 cm | RGB-H output layer dual fusion | 0.9054 | 0.0953 | 0.0785
80 cm | RGB-H output layer blend fusion | 0.6629 | 0.1791 | 0.1341
100 cm | RGB relative area transformation | 0.9304 | 0.0814 | 0.0660
100 cm | RGB-H input layer fusion | 0.9326 | 0.0801 | 0.0691
100 cm | RGB-H output layer dual fusion | 0.9328 | 0.0816 | 0.0695
100 cm | RGB-H output layer blend fusion | 0.7116 | 0.1657 | 0.1018
Table 3. Results based on RGB-D images.

Height | Model | R2 | NRMSE | MAPE
80 cm | RGB-D input layer fusion | −2.9297 | 0.6117 | 0.6640
80 cm | RGB-D output layer dual fusion | −2.6640 | 0.5906 | 0.6506
100 cm | RGB-D input layer fusion | −0.7466 | 0.4078 | 0.2777
100 cm | RGB-D output layer dual fusion | −1.0919 | 0.4463 | 0.3363
Table 4. Results based on RGB-avgD images.

Height | Model | R2 | NRMSE | MAPE
80 cm | RGB-avgD input layer fusion | 0.9475 | 0.0707 | 0.0591
80 cm | RGB-avgD output layer dual fusion | 0.9227 | 0.0858 | 0.0655
100 cm | RGB-avgD input layer fusion | 0.9384 | 0.0766 | 0.0610
100 cm | RGB-avgD output layer dual fusion | 0.9257 | 0.0841 | 0.0658
Table 5. Results of relative area transformation and RGB-avgD input layer fusion models.

Height | Model | R2 | NRMSE | MAPE
80 cm | RGB relative area transformation | 0.9155 | 0.0761 | 0.0665
80 cm | RGB-avgD input layer fusion | 0.9475 | 0.0707 | 0.0591
100 cm | RGB relative area transformation | 0.9304 | 0.0814 | 0.0660
100 cm | RGB-avgD input layer fusion | 0.9384 | 0.0766 | 0.0610
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
