1. Introduction
Lettuce, Lactuca sativa L., is a popular vegetable crop widely grown around the world, with high nutritional and economic value. Today, lettuce is ubiquitous in supermarkets, markets, and restaurants, and China, the United States, and Europe are the regions with the highest lettuce production value [1]. Research in this domain centers on producing lettuce rapidly, in large quantities, and with high quality. Plant phenotyping is an important branch of agriculture [2]. The morphological phenotypic traits of lettuce are crucial indicators for assessing its growth condition and quality, serving as the basis for breeding, cultivation, and harvesting. Traditional methods for measuring these traits rely on labor-intensive, time-consuming manual sampling and measurement [3]. Furthermore, the analytical results depend heavily on the operator's experience, often leading to inconsistent and inaccurate results [4]. Additionally, such methods can compromise the integrity and freshness of the lettuce, impacting its subsequent usability. Therefore, a rapid, accurate, non-destructive, and high-throughput method for measuring lettuce phenotypic traits is of great significance for improving the efficiency and quality of lettuce production.
In recent years, with the development of computer vision and deep learning techniques, image-based methods for measuring plant phenotypic traits have received widespread attention and application. Image sensors such as RGB cameras [5,6,7,8,9], depth cameras [10,11,12,13], and optical spectrometers [14,15] collect image information from plants, which is then analyzed and processed with deep learning models to enable automated, precise, and intelligent measurement of plant phenotypic traits. Li et al. [16] used UAV multispectral data as input to predict winter wheat yield; the convolutional neural network (CNN) model gave the best results, with an R² of 0.752 and an NMSE of 0.404 t·ha⁻¹. Giuffrida et al. [17] presented a Pheno-Deep Counter that predicts leaf count with an accuracy of 88.5%, using RGB, FMP, and NIR images as inputs. Zhang et al. [7] employed a CNN to estimate the fresh weight, dry weight, and leaf area of lettuce. The estimates agreed well with the measured values, with R² values of 0.8938, 0.8910, and 0.9156, and normalized root mean square error (NRMSE) values of 26.00%, 22.07%, and 19.94%, respectively. On this basis, Xu et al. [18] raised the fresh-weight R² to 0.9725 and decreased the MAPE to 8.47% by fusing RGB images with depth images and introducing the MSPE function. Rasti et al. [19] utilized deep learning to estimate the growth stages of barley and wheat; transfer learning based on the VGG19 model achieved the highest accuracy for both crops. Guo et al. [5] coupled deep learning with a novel directed search algorithm to obtain stem-related phenotypes of soybean. The Pearson correlation coefficients (R) of plant height, pitch number, internodal length, main stem length, stem curvature, and branching angle were 0.9904, 0.9853, 0.9861, 0.9925, 0.9084, and 0.9391, respectively. These studies demonstrate that deep learning is an effective technique for plant phenotype research.
Deep learning has advanced significantly over the years, giving rise to various classical algorithms, such as Mask R-CNN [20], YOLO [21], and hybrid deep learning models [22]. Owing to its excellent performance, scalability, and multi-tasking capability, Mask R-CNN is widely used in agriculture for identifying and estimating the characteristics of crops [23,24,25,26,27], fish [28,29], and trees [30], as well as for detecting diseases and pests [31,32]. Zhang et al. [33] delineated the chrysanthemum canopy contour and estimated the crown diameter using masks obtained from Mask R-CNN, achieving an R² of 0.9629 and a root mean square error (RMSE) of 2.2949 cm. Zheng et al. [34] used a Mask R-CNN model to segment RGB-NIR and RGB images of strawberries and established a regression model relating canopy leaf area to dry biomass for biomass prediction. Li et al. [35] used Mask R-CNN for leaf segmentation, classification, and counting of watermelon plug seedlings. Gao et al. [36] used UAV images and Mask R-CNN to extract maize seedling information; the average accuracy of the seedling emergence rate was 98.87%, and the model also showed good transferability. Hao et al. [37] combined Mask R-CNN with a canopy height model (CHM) to detect tree height and crowns; the tree height estimates were highly correlated with heights derived from UAV images (R² = 0.97). In these studies, phenotypic traits are mainly predicted from the segmentation results of Mask R-CNN; few studies output phenotypic traits directly from the model.
This paper presents a new method for estimating phenotypic traits of lettuce using RGB images and a Mask R-CNN model. The method estimates five phenotypic traits simultaneously while also producing object detection and segmentation results. The main contributions are as follows: (1) a phenotypic branch is added to the Mask R-CNN model to enable phenotypic trait estimation (sketched in code below); (2) the backbone network is replaced with RepVGG [38], improving speed and reducing the parameter count; (3) end-to-end estimation of five lettuce phenotypic traits is achieved. This work marks a significant advance in digital phenotypic analysis, paving the way for artificial intelligence applications in fresh vegetable production.
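To make contribution (1) concrete, the following is a minimal PyTorch sketch of such a trait-regression branch operating on Mask R-CNN's ROI-pooled features. The class name PhenoBranch, the 256-channel 7×7 ROI feature shape (torchvision's FPN default), and the exact layer layout are illustrative assumptions rather than this paper's released architecture; the eight convolutional layers mirror the depth selected in Section 4.2.

```python
# Hypothetical sketch of a phenotypic-trait regression branch for Mask R-CNN.
# Assumes 256-channel 7x7 ROI-pooled features (torchvision's FPN default);
# names and layout are illustrative, not the paper's exact design.
import torch
import torch.nn as nn

class PhenoBranch(nn.Module):
    """Eight conv layers + global pooling + a linear layer mapping each
    ROI to five traits (fresh/dry weight, height, diameter, leaf area)."""
    def __init__(self, in_channels: int = 256, num_traits: int = 5, depth: int = 8):
        super().__init__()
        convs, ch = [], in_channels
        for _ in range(depth):
            convs += [nn.Conv2d(ch, 256, kernel_size=3, padding=1),
                      nn.BatchNorm2d(256),
                      nn.ReLU(inplace=True)]
            ch = 256
        self.convs = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_traits)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        x = self.convs(roi_feats)          # (N, 256, 7, 7)
        x = self.pool(x).flatten(1)        # (N, 256)
        return self.fc(x)                  # (N, 5) trait estimates per ROI

# Smoke test on dummy ROI features.
if __name__ == "__main__":
    print(PhenoBranch()(torch.randn(4, 256, 7, 7)).shape)  # torch.Size([4, 5])
```

During training, such a branch's regression loss (e.g., an MSE- or MSPE-style term) would simply be added to Mask R-CNN's existing multi-task loss.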
4. Discussion
4.1. Comparison of Different Backbones
To thoroughly validate the selection of RepVGG as the backbone network for object detection and morphological phenotypic trait prediction, we performed ablation experiments. Other backbone networks evaluated include ResNet50 (used in the classic Mask R-CNN), MobileNetV3 (popular for its lightweight design), and EfficientNet (known for its balance between accuracy and speed).
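As a note on mechanics before the results: in recent torchvision versions, swapping the Mask R-CNN backbone amounts to passing any module that exposes an out_channels attribute. The sketch below illustrates the mechanism with a stock ResNet50-FPN stand-in; wiring in RepVGG, as done in this paper, would require wrapping it the same way and is not shown.

```python
# Illustrative backbone swap for Mask R-CNN (recent torchvision assumed).
# A ResNet50-FPN stands in here; the paper's RepVGG would be wrapped so
# that it exposes `out_channels` in the same way.
import torch
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)
model = MaskRCNN(backbone, num_classes=2)  # background + lettuce
model.eval()

# Dummy forward pass: a list of CHW images scaled to [0, 1].
with torch.no_grad():
    preds = model([torch.rand(3, 256, 256)])
print(preds[0].keys())  # dict with boxes, labels, scores, masks
```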
Table 3 and Table 4 show the average results of the five-fold cross-validation for each backbone network, with the best result in each column highlighted in bold. RepVGG is the best-performing backbone network overall. In terms of detection and segmentation, the AP metrics for all backbone networks exceed 0.81, with segmentation metrics performing better than detection metrics.
EfficientNet shows slightly better AP than RepVGG and a similar AP50, but RepVGG outperforms EfficientNet on AP75. ResNet50 ranks third, with a detection AP of 0.8236 and a segmentation AP of 0.8787, and MobileNetV3 performs the poorest of the four backbone networks. For the phenotypic trait regression metrics, RepVGG delivers the best overall performance as the backbone network; however, the difference between the best and worst values in each column is only around 0.01. This suggests that the choice of backbone network has limited influence on the regression results, and that the information that can be mined from the data for regression prediction may be close to saturation.
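For reference, the regression columns in such tables are typically computed as below. This is a minimal numpy sketch; the sample arrays are placeholders, not data from this study.

```python
# Minimal sketch of the regression metrics (R2 and MAPE); placeholder data.
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([120.0, 95.0, 150.0, 80.0])   # e.g., fresh weight (g)
y_pred = np.array([115.0, 99.0, 142.0, 84.0])
print(f"R2 = {r2(y_true, y_pred):.4f}, MAPE = {mape(y_true, y_pred):.4f}")
```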
We assessed network performance and efficiency in terms of model parameters, inference time, and FLOPs (floating-point operations, i.e., the number of operations performed during the model's forward pass). Among the models examined, ResNet50 had the longest inference time (0.0233 s), the largest parameter size (585.8 MB), and the highest FLOPs. MobileNetV3 and EfficientNet had similar parameter sizes (331 MB and 344.4 MB, respectively) and similar inference times (0.0155 s and 0.0169 s); these two models had the lowest and second-lowest FLOPs. RepVGG ranked third in FLOPs but had the smallest parameter size (127 MB) and a slightly faster inference time than MobileNetV3. This indicates that RepVGG, as a backbone network, offers few parameters and fast inference. In practical applications, the model could therefore be deployed on a Raspberry Pi to build an automatic monitoring platform for lettuce phenotypes.
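The parameter-size and latency figures above can be reproduced along the following lines. This is a hedged sketch: a stock ResNet50 serves as a placeholder model, float32 (4 bytes per parameter) is assumed for the MB conversion, and FLOPs would come from an external profiler such as thop or fvcore rather than a hand count.

```python
# Hedged sketch of the efficiency measurements: parameter size in MB and
# mean forward-pass latency. The model and input shape are placeholders.
import time
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()  # stand-in backbone
x = torch.randn(1, 3, 512, 512)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params * 4 / 1024**2:.1f} MB")  # float32 assumed

with torch.no_grad():
    for _ in range(5):                     # warm-up runs
        model(x)
    runs, t0 = 20, time.perf_counter()
    for _ in range(runs):
        model(x)
    print(f"inference time: {(time.perf_counter() - t0) / runs:.4f} s")
```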
4.2. Comparison of the Phenotypic Trait Branch with Different Numbers of Convolutional Layers
To identify the most effective number of convolutional layers for the phenotypic trait branch, an ablation experiment was conducted with CNN depths of 6, 8, and 10 layers. The averaged outcomes of five-fold cross-validation for the different layer counts are shown in Table 5 and Table 6, with the best result in each column highlighted in bold.
The tables show that overall performance is optimal with eight CNN layers: the eight-layer model consistently outperforms the others in both detection and segmentation. The gap between six and eight layers is significant, while the gap between eight and ten layers is negligible. For phenotypic trait prediction, the eight-layer model again performs best: the six-layer model performs worst, and the eight-layer model slightly outperforms the ten-layer model. These results demonstrate that an adequate number of convolutional layers is essential for good performance, but excessive layers can diminish the results. An eight-layer CNN was therefore chosen for the phenotypic prediction branch.
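A minimal sketch of how such a depth ablation can be assembled is given below; the head builder, channel width, and parameter-count comparison are illustrative assumptions, and a real ablation would of course retrain and cross-validate each variant as done here.

```python
# Illustrative depth ablation for the trait head: build 6-, 8-, and 10-layer
# variants and compare parameter counts on dummy ROI features.
import torch
import torch.nn as nn

def make_head(depth: int, in_ch: int = 256, num_traits: int = 5) -> nn.Sequential:
    layers, ch = [], in_ch
    for _ in range(depth):
        layers += [nn.Conv2d(ch, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = 256
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_traits)]
    return nn.Sequential(*layers)

for depth in (6, 8, 10):
    head = make_head(depth)
    n = sum(p.numel() for p in head.parameters())
    out = head(torch.randn(2, 256, 7, 7))
    print(f"{depth} conv layers: {n / 1e6:.2f} M params, output {tuple(out.shape)}")
```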
5. Conclusions
To assess the growth status and quality of lettuce, morphological phenotypic characteristics were selected for evaluation. In this study, we replaced the backbone network of the Mask R-CNN framework and incorporated a phenotypic trait branch to estimate fresh weight, dry weight, plant height, canopy diameter, and leaf area. The results show an average AP of 0.8684 for detection and 0.8803 for segmentation. The R² values for all phenotypic traits exceed 0.91, with varying MAPE values: trait dsw has the highest MAPE (0.1522), while trait d has the lowest (0.0548).
This study has the following limitations: it does not consider the effect of leaf curling; prediction accuracy could be improved by fusing depth images into the model; and data collection should account for the challenges posed by changes in the shooting environment and shooting angle.