Article

Study on Utilizing Mask R-CNN for Phenotypic Estimation of Lettuce’s Growth Status and Optimal Harvest Timing

by Lixin Hou 1, Yuxia Zhu 1, Ning Wei 1, Zeye Liu 1, Jixuan You 1, Jing Zhou 1,* and Jian Zhang 2,3,*

1 College of Information and Technology, Jilin Agricultural University, Changchun 130118, China
2 Faculty of Agronomy, Jilin Agricultural University, Changchun 130118, China
3 Department of Biology, University of British Columbia, Okanagan, Kelowna, BC V5K 1K5, Canada
* Authors to whom correspondence should be addressed.
Agronomy 2024, 14(6), 1271; https://doi.org/10.3390/agronomy14061271
Submission received: 20 May 2024 / Revised: 6 June 2024 / Accepted: 11 June 2024 / Published: 12 June 2024

Abstract

Lettuce is an annual plant of the family Asteraceae. It is most often grown as a leaf vegetable, but sometimes for its stem and seeds, and its growth status and quality are evaluated based on its morphological phenotypic traits. However, traditional measurement methods are labor-intensive and time-consuming because they rely on manual sampling, and they can be less accurate. In this study, we proposed a new method that uses RGB images and a Mask Region-based Convolutional Neural Network (Mask R-CNN) to estimate critical phenotypic traits of lettuce. Leveraging a publicly available dataset, we employed an improved Mask R-CNN model to perform a phenotypic analysis of lettuce images, estimating five phenotypic traits simultaneously: fresh weight, dry weight, plant height, canopy diameter, and leaf area. The enhanced Mask R-CNN model involved two key aspects: (1) replacing the ResNet backbone with RepVGG to enhance computational efficiency and performance; (2) adding a phenotypic branch and constructing a multi-task regression model to achieve end-to-end estimation of lettuce phenotypic traits. Experimental results demonstrated that the proposed method achieved high accuracy and stable results in lettuce image segmentation, detection, and phenotypic estimation tasks, with APs for detection and segmentation of 0.8684 and 0.8803, respectively. Additionally, the R2 values for the five phenotypic traits are 0.9600, 0.9596, 0.9329, 0.9136, and 0.9592, with corresponding mean absolute percentage errors (MAPEs) of 0.1072, 0.1522, 0.0757, 0.0548, and 0.0899, respectively. This study presents a technical advancement in digital phenotypic analysis and evaluation of lettuce quality, which could lay the foundation for artificial intelligence applications in fresh vegetable production.

1. Introduction

Lettuce (Lactuca sativa L.) is a popular vegetable crop widely grown around the world, with high nutritional value and important economic value. Today, lettuce is ubiquitous in supermarkets, markets, and restaurants. China, the United States, and Europe are the regions with the highest lettuce production value [1]. Research in this domain centers on producing lettuce rapidly, with high quality, and in large quantities. Plant phenotyping is an important branch of agriculture [2]. The morphological phenotypic traits of lettuce are crucial indicators for assessing its growth condition and quality, serving as the basis for breeding, cultivation, and harvesting. Traditional methods for measuring these traits rely on labor-intensive manual sampling and measurement, which are time-consuming [3]. Furthermore, the analytical results depend heavily on the operator's experience, often leading to inconsistent and inaccurate results [4]. Additionally, such methods compromise the integrity and freshness of the lettuce, impacting its subsequent usability. Therefore, a rapid, accurate, non-destructive, and high-throughput method for measuring lettuce phenotypic traits is of great significance and value for improving the efficiency and quality of lettuce production.
In recent years, with the development of computer vision and deep learning techniques, image-based methods for measuring plant phenotypic traits have received widespread attention and application. Image sensors such as RGB cameras [5,6,7,8,9], depth cameras [10,11,12,13], and optical spectrometers [14,15] collect image information of plants, which is then analyzed and processed with deep learning models to enable automated, precise, and intelligent measurement of plant phenotypic traits. Li et al. [16] used UAV multispectral data as input to predict winter wheat yield; the convolutional neural network (CNN) model gave the best results, with an R2 of 0.752 and an NMSE of 0.404 t·ha−1. Giuffrida et al. [17] presented the Pheno-Deep Counter, which predicts leaf count with an accuracy of 88.5% using RGB, FMP, and NIR images as inputs. Zhang et al. [7] employed a CNN to estimate the fresh weight, dry weight, and leaf area of lettuce. The estimated values showed good agreement with the measured values, with R2 values of 0.8938, 0.8910, and 0.9156, and normalized root mean square error (NRMSE) values of 26.00%, 22.07%, and 19.94%, respectively. Building on this, Xu et al. [18] raised the fresh weight R2 to 0.9725 and reduced the MAPE to 8.47% by fusing RGB images and depth images and introducing the MSPE function. Rasti et al. [19] used deep learning to estimate the growth stages of barley and wheat; transfer learning based on the VGG19 model achieved the highest accuracy for both crops. Guo et al. [5] used a deep learning method coupled with a novel directed search algorithm to obtain stem-related phenotypes of mature soybean. The Pearson correlation coefficients (R) of plant height, pitch number, internodal length, main stem length, stem curvature, and branching angle were 0.9904, 0.9853, 0.9861, 0.9925, 0.9084, and 0.9391, respectively. These studies demonstrate that deep learning is an effective technique for plant phenotype research.
Deep learning has advanced significantly over the years, giving rise to various classical algorithms, such as Mask R-CNN [20], YOLO [21], and hybrid deep learning models [22]. Mask R-CNN is widely used in agriculture for tasks such as identifying and estimating the characteristics of crops [23,24,25,26,27], fish [28,29], and trees [30]. It is also used for detecting diseases and pests [31,32] owing to its excellent performance, scalability, and multi-tasking capabilities. Zhang et al. [33] delineated the chrysanthemum canopy contour and estimated the crown diameter using the mask obtained from Mask R-CNN; the results showed an R2 of 0.9629 and a root mean square error (RMSE) of 2.2949 cm. Zheng et al. [34] used a Mask R-CNN model to segment RGB-NIR and RGB images of strawberries and established a regression model relating canopy leaf area to dry biomass for biomass prediction. Li et al. [35] used Mask R-CNN for leaf segmentation, classification, and counting of watermelon plug seedlings. Gao et al. [36] used UAV images and Mask R-CNN to extract maize seedling information; the average accuracy of the seedling emergence rate was 98.87%, and the model also showed good transferability. Hao et al. [37] combined the Mask R-CNN model with a canopy height model (CHM) to detect tree height and crowns; the estimated tree height was highly correlated with the height derived from UAV images (R2 = 0.97). In these studies, phenotypic traits are mainly predicted from the segmentation results of Mask R-CNN; few studies output phenotypic traits directly from the model.
This paper presents a new method for estimating phenotypic traits of lettuce using RGB images and a Mask R-CNN model. The method estimates five phenotypic traits simultaneously while producing object detection and segmentation results. The main contributions are as follows: (1) the addition of a phenotypic branch to the Mask R-CNN model to enable phenotypic trait estimation; (2) the replacement of the backbone network with RepVGG [38], resulting in improved speed and a reduced parameter count; (3) the achievement of end-to-end estimation of five lettuce phenotypic traits. This work represents an advancement in digital phenotypic analysis, paving the way for artificial intelligence applications in fresh vegetable production.

2. Materials and Methods

2.1. Datasets

The open dataset came from the third Autonomous Greenhouse Challenge [39] organized by Tencent and Wageningen University and Research. The lettuce was grown in the laboratory of Wageningen University in the Netherlands. A RealSense D415 depth sensor suspended 0.9 m above the crop was used to capture natural-light RGB and depth images. The images were saved in PNG format with an original resolution of 1920 × 1080 pixels. The dataset contains 96 images of Lugano lettuce, 102 images of Salanova lettuce, 92 images of Aphylion lettuce, and 98 images of Satine lettuce, as illustrated in Figure 1.
Fresh weight, dry weight, plant height, canopy diameter, and leaf area were obtained via destructive measurement. The fresh weight was determined by weighing the lettuce from the point where the first leaf was attached. The dry weight was obtained after three days of drying. Plant height was measured from the first leaf to the highest point of the plant. The canopy diameter was the principal diameter of the canopy projection on a horizontal surface. The leaf area was computed by projecting the leaf surface area onto a plane without considering leaf bending. The units of these traits are "g/plant", "g/plant", "cm", "cm", and "cm2", respectively.
The RGB images were cropped to 1024 × 1024 pixels around the center point of the plant and then scaled to 800 × 800 pixels. The VIA annotation tool was used to manually label the dataset. Each RGB image contained one lettuce target labeled with one of four varieties: Lugano, Salanova, Aphylion, and Satine. To improve the learning and generalization ability of the neural network model, data augmentation was applied, mainly horizontal flipping, vertical flipping, and a 10% brightness change. Data augmentation improves detection performance, and the augmented dataset contains 1548 images. K-fold cross-validation was adopted, with the data divided into five folds. For each training run, one fold was used as the test set, the remaining four folds formed the training set, and 10% of the training set was held out as the validation set.
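The following is a minimal sketch of the preprocessing, augmentation, and five-fold splitting described above. The folder name, the exact brightness factor, and the center-of-image crop (the paper crops around the plant center, which requires the plant coordinates) are illustrative assumptions.

```python
from pathlib import Path

from PIL import Image, ImageEnhance
from sklearn.model_selection import KFold, train_test_split

def center_crop_resize(img, crop=1024, size=800):
    # The paper crops a 1024 x 1024 window around the plant center point;
    # cropping around the image center is used here as a simplification.
    w, h = img.size
    left, top = (w - crop) // 2, (h - crop) // 2
    return img.crop((left, top, left + crop, top + crop)).resize((size, size))

def augment(img):
    # Horizontal flip, vertical flip, and a ~10% brightness change.
    return [
        img.transpose(Image.FLIP_LEFT_RIGHT),
        img.transpose(Image.FLIP_TOP_BOTTOM),
        ImageEnhance.Brightness(img).enhance(1.10),
    ]

image_paths = sorted(Path("lettuce_rgb").glob("*.png"))  # hypothetical folder layout
kfold = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(kfold.split(image_paths)):
    # 10% of each training fold is held out as the validation set.
    train_idx, val_idx = train_test_split(train_idx, test_size=0.1, random_state=42)
    print(f"fold {fold}: train={len(train_idx)} val={len(val_idx)} test={len(test_idx)}")
```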

2.2. Method (Improvement of the Model)

Mask R-CNN is a classical two-stage instance segmentation method in deep learning, extended from Faster R-CNN [40]. It mainly comprises a feature extraction network (ResNet [41]), a multi-scale feature fusion network (FPN [42]), a region proposal network (RPN), a region-of-interest alignment layer (RoIAlign), a classification sub-network, and a segmentation sub-network. The improved Mask R-CNN adds a third branch for phenotypic trait prediction and replaces the backbone network (ResNet) with RepVGG. The overall model structure is shown in Figure 2.
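For orientation, the sketch below assembles a Mask R-CNN with a swappable backbone using torchvision. It covers only the standard detection and segmentation branches; the RepVGG backbone and the phenotypic branch added in this work are not part of torchvision, so a small stand-in backbone is used, and the anchor sizes and class count (four lettuce varieties plus background) are assumptions.

```python
import torch
import torchvision
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.anchor_utils import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# Stand-in backbone; MaskRCNN only needs a module that returns a feature map
# and exposes an `out_channels` attribute (the paper swaps RepVGG in here).
backbone = torchvision.models.mobilenet_v2().features
backbone.out_channels = 1280

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
box_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)
mask_roi_pool = MultiScaleRoIAlign(featmap_names=["0"], output_size=14, sampling_ratio=2)

model = MaskRCNN(backbone, num_classes=5,  # 4 lettuce varieties + background
                 rpn_anchor_generator=anchor_generator,
                 box_roi_pool=box_roi_pool,
                 mask_roi_pool=mask_roi_pool)
model.eval()
with torch.no_grad():
    preds = model([torch.rand(3, 800, 800)])  # list of dicts: boxes, labels, scores, masks
```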

2.2.1. Backbone Network

The feature extraction part of the model is called the backbone network, and extracting sufficiently rich features is crucial for good results. The traditional Mask R-CNN uses a deep residual network to extract features. To balance model accuracy and speed, and to enable deployment in the field, RepVGG was selected as the new backbone network.
RepVGG is a re-parameterized backbone network proposed in 2021. Its design is inspired by ResNet: during training it uses a multi-branch structure with identity, 3 × 3, and 1 × 1 branches, while during inference the multi-branch structure is converted into a single stack of 3 × 3 convolutions. Its training and inference structures are shown in Figure 3.
RepVGG uses the multi-branch structure to enhance its representational capacity during training and applies the re-parameterization technique to reduce parameters and improve inference speed without sacrificing accuracy. Specifically, a network structure with shortcuts is equivalently transformed into a structure without shortcuts built only from 3 × 3 convolutions, so that the model uses only 3 × 3 convolution kernels and ReLU activations during inference. There are two possible branch compositions during training. In the first case, there are three branches: the 3 × 3 branch, the 1 × 1 branch, and the identity branch. The formulation is as follows:
$$M^{(2)} = \mathrm{bn}\big(M^{(1)} * W^{(3)}, \mu^{(3)}, \sigma^{(3)}, \gamma^{(3)}, \beta^{(3)}\big) + \mathrm{bn}\big(M^{(1)} * W^{(1)}, \mu^{(1)}, \sigma^{(1)}, \gamma^{(1)}, \beta^{(1)}\big) + \mathrm{bn}\big(M^{(1)}, \mu^{(0)}, \sigma^{(0)}, \gamma^{(0)}, \beta^{(0)}\big) \tag{1}$$
Here, W denotes the kernel of the convolutional layer; µ, σ, γ, and β denote the accumulated mean, standard deviation, learned scaling factor, and bias of the BN layer following the convolution, respectively; and the superscripts (3), (1), and (0) indicate the 3 × 3 branch, the 1 × 1 branch, and the identity branch. In the second case, there are only two branches, the 3 × 3 branch and the 1 × 1 branch, and the formulation contains only the first two terms. In the re-parameterization process, each BN layer and its preceding convolutional layer are first fused into a convolution with a bias vector:
$$W'_{i,:,:,:} = \frac{\gamma_i}{\sigma_i} W_{i,:,:,:}, \qquad b'_i = -\frac{\mu_i \gamma_i}{\sigma_i} + \beta_i \tag{2}$$
Then, the formula for BN can be written as
$$\mathrm{bn}(M * W, \mu, \sigma, \gamma, \beta)_{:,i,:,:} = (M * W')_{:,i,:,:} + b'_i \tag{3}$$
This transformation also applies to the identity branch, because the identity mapping can be viewed as a 1 × 1 convolution with an identity matrix as its kernel. After fusion, there are three bias vectors, two 1 × 1 kernels, and one 3 × 3 kernel. The three bias vectors are added to obtain the final bias, the two 1 × 1 kernels are zero-padded to 3 × 3 kernels, and the three convolution kernels are added, as illustrated in Figure 4. Note that for this conversion to be equivalent, the 3 × 3 and 1 × 1 layers must have the same stride, and the padding of the 1 × 1 layer must be one pixel smaller than that of the 3 × 3 layer. For instance, for a 3 × 3 layer that pads its input by one pixel (the common case), the 1 × 1 layer should have zero padding.
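The following sketch illustrates this fusion for a single block, simplified relative to the official RepVGG implementation (which also handles grouped convolutions): each BN layer is folded into its preceding convolution as in Equations (2) and (3), the 1 × 1 kernel is zero-padded to 3 × 3, and the branches are summed into one 3 × 3 kernel and one bias vector. The channel count in the usage example is arbitrary.

```python
import torch
import torch.nn.functional as F
from torch import nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    """Fold a BN layer into its preceding bias-free convolution (Equation (2))."""
    std = torch.sqrt(bn.running_var + bn.eps)
    w = conv.weight * (bn.weight / std).reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * bn.weight / std
    return w, b

def reparameterize(conv3, bn3, conv1, bn1, bn_id=None, channels=None):
    """Merge the 3x3, 1x1, and (optional) identity branches into one 3x3 conv."""
    w3, b3 = fuse_conv_bn(conv3, bn3)
    w1, b1 = fuse_conv_bn(conv1, bn1)
    w1 = F.pad(w1, [1, 1, 1, 1])               # zero-pad the 1x1 kernel to 3x3
    w, b = w3 + w1, b3 + b1
    if bn_id is not None:                      # identity branch: identity kernel + BN
        wid = torch.zeros_like(w3)
        for i in range(channels):
            wid[i, i, 1, 1] = 1.0
        std = torch.sqrt(bn_id.running_var + bn_id.eps)
        w = w + wid * (bn_id.weight / std).reshape(-1, 1, 1, 1)
        b = b + bn_id.bias - bn_id.running_mean * bn_id.weight / std
    return w, b

# Usage: a block with 8 input/output channels and all three branches.
c = 8
conv3, bn3 = nn.Conv2d(c, c, 3, padding=1, bias=False), nn.BatchNorm2d(c)
conv1, bn1 = nn.Conv2d(c, c, 1, bias=False), nn.BatchNorm2d(c)
w, b = reparameterize(conv3, bn3, conv1, bn1, bn_id=nn.BatchNorm2d(c), channels=c)
y = F.conv2d(torch.rand(1, c, 16, 16), w, b, padding=1)  # single fused 3x3 convolution
```

In evaluation mode (using the BN running statistics), the fused convolution reproduces the sum of the three training-time branches, which is what makes the conversion lossless.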

2.2.2. Phenotype Traits Head

Few studies obtain phenotypic trait predictions directly from the model; previous work has mostly relied on segmentation of the target image as the basis for deriving phenotypic traits. The original Mask R-CNN paper introduced an additional branch head to estimate human pose, which inspired us to extend the framework for phenotypic morphological trait prediction. This branch is responsible for extracting morphological characteristics of the plant from images and outputs five phenotypic traits: fresh weight, dry weight, plant height, canopy diameter, and leaf area.
The phenotypic trait branch runs in parallel with the original two branches. First, the RoIAlign layer crops and aligns the feature map; features are then extracted by eight convolutional layers; finally, two fully connected layers output the predicted values of the five phenotypic traits. To optimize the phenotypic trait branch, we used the Huber loss function.
The Huber loss is a continuous and differentiable regression loss function. It combines the advantages of the squared error (MSE) and the absolute error (MAE), which reduces its sensitivity to outliers. Its formula is as follows:
$$L\big(y, f(x)\big) = \begin{cases} \dfrac{1}{2}\big(y - f(x)\big)^2, & \text{if } |y - f(x)| \le \delta \\[4pt] \delta\,|y - f(x)| - \dfrac{1}{2}\delta^2, & \text{if } |y - f(x)| > \delta \end{cases} \tag{4}$$
The hyperparameter δ controls the switch between the two cases in Equation (4): the squared error is used when the absolute prediction error is at most δ (similar to MSE), and the linear error is used when it exceeds δ (similar to MAE).
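A minimal sketch of such a head is given below. The channel width, the 14 × 14 RoI resolution, and the hidden size of the fully connected layers are assumptions, since the paper does not list them; nn.HuberLoss is PyTorch's built-in implementation of Equation (4), with δ as a tunable hyperparameter.

```python
import torch
from torch import nn

class PhenotypeHead(nn.Module):
    """RoIAlign features -> eight 3x3 convolutions -> two FC layers -> five traits."""
    def __init__(self, in_channels=256, num_traits=5):
        super().__init__()
        layers = []
        for _ in range(8):                                   # eight convolutional layers
            layers += [nn.Conv2d(in_channels, 256, 3, padding=1), nn.ReLU(inplace=True)]
            in_channels = 256
        self.convs = nn.Sequential(*layers)
        self.fc = nn.Sequential(                             # two fully connected layers
            nn.Flatten(),
            nn.Linear(256 * 14 * 14, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, num_traits),
        )

    def forward(self, roi_features):                         # (N, C, 14, 14) from RoIAlign
        return self.fc(self.convs(roi_features))

head = PhenotypeHead()
pred = head(torch.rand(2, 256, 14, 14))                      # two RoIs -> (2, 5) traits
loss = nn.HuberLoss(delta=1.0)(pred, torch.rand(2, 5))       # Equation (4)
```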

2.2.3. Experimental Environment and Training Strategies

This study used the Ubuntu 22.04.3 operating system on a computer configured with an NVIDIA GeForce RTX 4070 Ti GPU (12 GB), 32 GB of RAM, and an Intel® Core™ i7-13700KF processor. The model was implemented in Python 3.8 and PyTorch 1.10.
Model training proceeded in the following stages: (i) train the entire model; (ii) train all branch sub-networks while freezing the other network layers; (iii) train the backbone network while keeping the branches frozen; (iv) finally, train the first two layers of the backbone network while keeping all branches frozen.
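The staged schedule can be implemented by toggling requires_grad, as sketched below. The attribute names (backbone, roi_heads) follow the torchvision-style model sketched in Section 2.2 and are illustrative, not the authors' exact code; in practice, the optimizer is rebuilt at each stage from the parameters that remain trainable.

```python
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

# (i) train the entire model
set_trainable(model, True)
# (ii) train all branch heads while freezing the other layers
set_trainable(model, False)
set_trainable(model.roi_heads, True)     # box, mask (and phenotype) branches
# (iii) train the backbone while keeping the branches frozen
set_trainable(model, False)
set_trainable(model.backbone, True)
# (iv) fine-tune only the first two backbone layers, branches still frozen
set_trainable(model, False)
for layer in list(model.backbone.children())[:2]:
    set_trainable(layer, True)
```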

2.2.4. Evaluation Index

COCO evaluation metrics are used in this study for the detection and segmentation tasks. These metrics fall into two main categories: average precision (AP) and average recall (AR). To measure the degree of overlap between a detection box (or segmentation mask) and the ground-truth label, the COCO metrics use ten IoU thresholds ranging from 0.5 to 0.95 in increments of 0.05, which assesses the detector's localization ability.
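As a toy illustration of these thresholds (not the COCO evaluation code itself, which additionally accumulates per-category precision-recall curves), the snippet below checks a single predicted box against the ten IoU thresholds; the box coordinates are made up.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

thresholds = np.arange(0.5, 1.0, 0.05)                    # 0.50, 0.55, ..., 0.95
iou = box_iou((100, 100, 400, 420), (110, 95, 405, 400))  # IoU ~ 0.88
print(iou, iou >= thresholds)                             # True up to 0.85, False at 0.90/0.95
```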
For the regression prediction of lettuce phenotypic traits, the R2 and MAPE indexes are used. R2, the coefficient of determination, is an evaluation index used in regression analysis that reflects how well the predictions explain the variation of the measured values. Its value typically ranges from 0 to 1, with a higher value indicating a better fit of the regression model. R2 is calculated by the following equation:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2} \tag{5}$$
The mean absolute percentage error (MAPE) is an evaluation metric used in regression analysis to indicate the proportional deviation between the predicted and true values. MAPE ranges from 0 to infinity: it is 0% when the predicted values exactly equal the true values, and larger values indicate worse predictions. The formula for MAPE is as follows:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \times 100\% \tag{6}$$
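Both metrics can be computed directly from Equations (5) and (6). The sketch below uses made-up fresh-weight values for illustration and, like the values quoted in this paper, reports MAPE as a fraction rather than a percentage.

```python
import numpy as np

def r2_score(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    ss_res = np.sum((y - y_hat) ** 2)                  # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)             # total sum of squares
    return 1.0 - ss_res / ss_tot

def mape(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs((y - y_hat) / y))            # e.g. 0.1072 corresponds to ~10.7%

y_true = [120.0, 85.0, 240.0]                          # e.g. fresh weight in g/plant
y_pred = [112.0, 90.0, 255.0]
print(r2_score(y_true, y_pred), mape(y_true, y_pred))
```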

3. Results

3.1. Results of the Proposed Method

The performance of the model was tested on the test set, and the results are displayed in Figure 5. The x-axis represents the estimated values, while the y-axis indicates the measured values. The predicted outcomes are shown as discrete points on the graph. The five subplots in the figure correspond to the growth traits of fresh weight (fsw), dry weight (dsw), plant height (h), canopy diameter (d), and leaf area (la). Each subplot shows the results of the five-fold cross-validation and provides the evaluation score and fitted equation for each fold. When the R2 value falls between 0 and 1, a higher R2 value is an indication of better estimation performance. On the other hand, a smaller MAPE value indicates better network performance.
Our proposed method achieves an average R2 of 0.9600 for fsw predictions with an average MAPE of 0.1072. For dsw predictions, the average R2 is 0.9596 with an average MAPE of 0.1522. The h predictions exhibit an average R2 of 0.9329 and an average MAPE of 0.0757, while the d predictions show an average R2 of 0.9136 and an average MAPE of 0.0548. The la predictions have an average R2 of 0.9592, with a corresponding average MAPE of 0.0899. According to Zhang et al. [12], fsw, dsw, and la are strongly correlated (above 0.9) in their data analysis, and the R2 results for fsw, dsw, and la are indeed similar here. Interestingly, d has the lowest R2 among all traits, but its MAPE is also the lowest. The second least predictable trait is h, which could be attributed to the less abundant feature information about height that can be extracted from top-view RGB images compared with the other traits. In Figure 5, the fitting results deteriorate during the later growth period, which is related to the uneven distribution of sample numbers: before data augmentation, the 387 samples grouped by fresh weight (0–100 g, 100–200 g, 200–300 g, and greater than 300 g) numbered 219, 83, 54, and 31, respectively. It is noticeable that fsw, dsw, and la show good prediction performance during the initial growth stages, whereas the prediction patterns of d and h in the early growth stages do not follow those of the other three traits. Among the five phenotypic traits, dry weight has the highest MAPE. We attribute this to a limitation of MAPE: dry weight values are relatively small (126 samples fall in the 0–2 g range), and MAPE is sensitive to small errors, amplifying the errors of samples with small true values.
Table 1 displays the outcomes of object detection and segmentation obtained through Mask R-CNN on the lettuce dataset. The average values of detection metrics AP, AP50, and AP75 for the five-fold cross-validation are 0.8684, 0.9964, and 0.9854, respectively. Additionally, the segmentation results show average values of 0.8803, 0.9964, and 0.9854, respectively. It has been noticed that the fourth fold generally performs better than other folds. This superiority may be attributed to dataset partitioning.

3.2. The Application Results of Different Species

A confusion matrix was used to analyze the model's detection accuracy for different varieties through five-fold cross-validation. K-fold cross-validation makes effective use of the limited data and avoids the limitations and particularities of a fixed dataset partition. As shown in Figure 6, most misclassifications occur between Aphylion and Lugano, and Lugano is misclassified as Aphylion more often than Aphylion is misclassified as Lugano. The second most common error is misclassification between Salanova and Satine, followed by misclassifying Satine as Aphylion. Upon examining the images, we observe that Lugano leaves are more coiled than Aphylion leaves, but the two varieties share several other similarities. Although Salanova and Satine have similar colors during their early growth stages, the central part of Satine plants tends to turn yellow in later stages, and there is a clear contrast in leaf morphology between the two varieties: Salanova has flat, elongated leaves, while Satine has curly, unevenly shaped leaves.
Table 2 presents the predicted results of our model for the five phenotypic traits of the four lettuce varieties, averaged over the five test sets, and reveals variations across varieties. Of all the varieties, Salanova shows the weakest prediction of fsw, dsw, and h, with the largest MAPE for dsw (0.2278) and the lowest R2 for h (0.8717), although the gap for fsw is comparatively small. For d and la, Satine performs worst, with its R2 for d being approximately 5% lower than that of the other varieties; for la, however, the differences between the four varieties in both metrics are relatively small. Figure 7 shows the inference outputs for the four lettuce varieties. Segmentation works well in the early growth stages for all varieties, but as the plants grow and develop different morphologies, the segmentation performance begins to degrade for all of them: the mask shape fails to match the lettuce outline because of leaf curling.

4. Discussion

4.1. Comparison of Different Backbone

To validate the selection of RepVGG as the backbone network for object detection and morphological phenotypic trait prediction, we performed ablation experiments. The other backbone networks evaluated were ResNet50 (used in the classic Mask R-CNN), MobileNetV3 (popular for its lightweight design), and EfficientNet (known for its balance between accuracy and speed). Table 3 and Table 4 show the average results of the five-fold cross-validation for each backbone network, with the best result in each column highlighted in bold. Overall, RepVGG is the best-performing backbone network. In terms of detection and segmentation results, the AP metrics of all backbone networks exceed 0.81, with segmentation metrics higher than detection metrics.
EfficientNet shows a slightly better segmentation AP than RepVGG, and the two models have similar AP50 results, but RepVGG outperforms EfficientNet in AP75. ResNet50 ranks third in performance, with a detection AP of 0.8236 and a segmentation AP of 0.8787. Among the four backbone networks, MobileNetV3 performs the poorest, with detection and segmentation APs of 0.8165 and 0.8460, respectively. Regarding the phenotypic trait regression metrics, RepVGG has the best overall performance as the backbone network; however, the difference between the best and worst value in each column is only around 0.01. This suggests that the choice of backbone network has limited influence on the regression results and that the information that can be mined from the data for regression prediction may already be close to its limit.
We assessed network performance and efficiency in terms of model parameters, inference time, and FLOPs (floating-point operations, the number of operations performed during the model's forward pass). Among the models examined, ResNet50 had the longest inference time (0.0233 s), the largest parameter size (585.8 MB), and the highest FLOPs. MobileNet_V3 and EfficientNet had similar parameter sizes (331 MB and 344.4 MB, respectively) and similar inference times (0.0155 s and 0.0169 s), and they had the lowest and second-lowest FLOPs. RepVGG ranked third-lowest in FLOPs, but it had the smallest parameter size (127 MB) and a slightly faster inference time than MobileNet_V3. This indicates that RepVGG, as a backbone network, offers fewer parameters and faster inference. In practical applications, the model could be deployed on a Raspberry Pi to build an automatic monitoring platform for lettuce phenotypes.
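The parameter size and inference time reported here can be measured as sketched below; FLOPs additionally require a profiler (for example, thop or fvcore), and the paper does not state which one was used. The `model` is assumed to be a Mask R-CNN instance such as the one sketched in Section 2.2, and the number of timing runs is an arbitrary choice.

```python
import time
import torch

def model_size_mb(model):
    # Parameter storage in MB (weights only, excluding optimizer state).
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1024 ** 2

def mean_inference_time(model, runs=50, size=800):
    model.eval()
    device = next(model.parameters()).device
    x = [torch.rand(3, size, size, device=device)]
    with torch.no_grad():
        model(x)                                        # warm-up pass
        start = time.perf_counter()                     # for GPU timing, add torch.cuda.synchronize()
        for _ in range(runs):
            model(x)
    return (time.perf_counter() - start) / runs

print(f"{model_size_mb(model):.1f} MB, {mean_inference_time(model):.4f} s per image")
```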

4.2. Comparison of Phenotypic Traits Branching with Different Convolutional Layers

To identify the most effective number of convolutional layers for the phenotypic trait branch, ablation experiments were conducted with 6, 8, and 10 convolutional layers. The averaged outcomes of the five-fold cross-validation for the different layer counts are shown in Table 5 and Table 6, with the best-performing result in each column highlighted in bold.
From the tables, the overall performance is optimal when the number of CNN layers is eight. In both detection and segmentation tasks, the eight-layer model consistently performs better than other models. The difference in performance between six and eight layers is significant, while the difference between eight and ten layers is negligible. Regarding phenotypic trait prediction, the eight-layer model exhibits superior performance. The six-layer model has the poorest performance, while the eight-layer model slightly outperforms the ten-layer model. These results demonstrate that an adequate number of convolutional layers is essential for optimal performance, but excessive layers potentially lead to diminished results. As a result, an eight-layer CNN was ultimately chosen for the phenotypic prediction branch.

5. Conclusions

To assess the growth status and quality of lettuce, morphological and phenotypic characteristics were selected for evaluation. In this study, we replaced the backbone network and incorporated a phenotypic trait branch to estimate fresh weight, dry weight, plant height, canopy diameter, and leaf area in the Mask R-CNN framework. The results show that the average AP is 0.8684 for detection and 0.8803 for segmentation. Additionally, the R2 results for phenotypic traits are all above 0.91, but with varying MAPE values. In particular, trait dsw has the highest MAPE value of 0.1522, while trait d has the lowest MAPE value of 0.0548.
This study has several limitations: the effect of leaf curling is not considered; depth images could be fused into the model to improve prediction accuracy; and data collection should account for the challenges posed by changes in the imaging environment and shooting angle.

Author Contributions

Conceptualization, L.H. and J.Z. (Jing Zhou); methodology, L.H., Y.Z. and N.W.; software, Y.Z. and Z.L.; validation, N.W. and J.Y.; formal analysis, Z.L. and J.Y.; investigation, Y.Z., N.W. and J.Y.; data curation, N.W.; writing—original draft preparation, L.H. and Y.Z.; writing—review and editing, L.H., J.Z. (Jing Zhou) and J.Z. (Jian Zhang); visualization, N.W., Z.L. and J.Y.; supervision, L.H. and J.Z. (Jing Zhou); project administration, L.H.; funding acquisition, L.H., J.Z. (Jing Zhou) and J.Z. (Jian Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Jilin Province Science and Technology Development Plan Project (No. 20230101343JC; No. 20230202042NC); Jilin Agricultural University high-level researcher grant (JLAUHLRG20102006).

Data Availability Statement

This study used the Third Autonomous Greenhouse Challenge: Online Challenge Lettuce Images dataset publicly available at 4TU.ResearchData [39].

Acknowledgments

We sincerely appreciate Jian Zhang for his invaluable support during the field experiments.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Simko, I.; Hayes, R.J.; Mou, B.; McCreight, J.D. Lettuce and Spinach. In Yield Gains in Major U.S. Field Crops; Smith, S., Diers, B., Specht, J., Carver, B., Eds.; American Society of Agronomy, Inc.: Madison, WI, USA; Crop Science Society of America, Inc.: Madison, WI, USA; Soil Science Society of America, Inc.: Madison, WI, USA, 2014; pp. 53–85. [Google Scholar]
  2. Xiong, J.; Yu, D.; Liu, S.; Shu, L.; Wang, X.; Liu, Z. A Review of Plant Phenotypic Image Recognition Technology Based on Deep Learning. Electronics 2021, 10, 81. [Google Scholar] [CrossRef]
  3. Casadesús, J.; Villegas, D. Conventional digital cameras as a tool for assessing leaf area index and biomass for cereal breeding. J. Integr. Plant Biol. 2014, 56, 7–14. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, L.; Verma, B.; Stockwell, D.; Chowdhury, S. Density Weighted Connectivity of Grass Pixels in image frames for biomass estimation. Expert Syst. Appl. 2018, 101, 213–227. [Google Scholar] [CrossRef]
  5. Guo, Y.; Gao, Z.; Zhang, Z.; Li, Y.; Hu, Z.; Xin, D.; Chen, Q.; Zhu, R. Automatic and Accurate Acquisition of Stem-Related Phenotypes of Mature Soybean Based on Deep Learning and Directed Search Algorithms. Front. Plant Sci. 2022, 13, 906751. [Google Scholar] [CrossRef] [PubMed]
  6. Du, W.; Liu, P. Instance Segmentation and Berry Counting of Table Grape before Thinning Based on AS-SwinT. Plant Phenomics 2023, 5, 0085. [Google Scholar] [CrossRef] [PubMed]
  7. Zhang, L.; Xu, Z.; Xu, D.; Ma, J.; Chen, Y.; Fu, Z. Growth monitoring of greenhouse lettuce based on a convolutional neural network. Hortic. Res. 2020, 7, 124. [Google Scholar] [CrossRef] [PubMed]
  8. Ye, Z.; Tan, X.; Dai, M.; Lin, Y.; Chen, X.; Nie, P.; Ruan, Y.; Kong, D. Estimation of rice seedling growth traits with an end-to-end multi-objective deep learning framework. Front. Plant Sci. 2023, 14, 1165552. [Google Scholar] [CrossRef]
  9. Du, J.; Lu, X.; Fan, J.; Qin, Y.; Yang, X.; Guo, X. Image-Based High-Throughput Detection and Phenotype Evaluation Method for Multiple Lettuce Varieties. Front. Plant Sci. 2020, 11, 563386. [Google Scholar] [CrossRef]
  10. Buxbaum, N.; Lieth, J.H.; Earles, M. Non-destructive Plant Biomass Monitoring with High Spatio-Temporal Resolution via Proximal RGB-D Imagery and End-to-End Deep Learning. Front. Plant Sci. 2022, 13, 758818. [Google Scholar] [CrossRef]
  11. Quan, L.; Li, H.; Li, H.; Jiang, W.; Lou, Z.; Chen, L. Two-Stream Dense Feature Fusion Network Based on RGB-D Data for the Real-Time Prediction of Weed Aboveground Fresh Weight in a Field Environment. Remote Sens. 2021, 13, 2288. [Google Scholar] [CrossRef]
  12. Zhang, Q.; Zhang, X.; Wu, Y.; Li, X. TMSCNet: A three-stage multi-branch self-correcting trait estimation network for RGB and depth images of lettuce. Front. Plant Sci. 2022, 13, 982562. [Google Scholar] [CrossRef] [PubMed]
  13. Milella, A.; Marani, R.; Petitti, A.; Reina, G. In-field high throughput grapevine phenotyping with a consumer-grade depth camera. Comput. Electron. Agric. 2019, 156, 293–306. [Google Scholar] [CrossRef]
  14. Moghimi, A.; Yang, C.; Anderson, J.A. Aerial hyperspectral imagery and deep neural networks for high-throughput yield phenotyping in wheat. Comput. Electron. Agric. 2020, 172, 105299. [Google Scholar] [CrossRef]
  15. Ampatzidis, Y.; Partel, V. UAV-Based High Throughput Phenotyping in Citrus Utilizing Multispectral Imaging and Artificial Intelligence. Remote Sens. 2019, 11, 410. [Google Scholar] [CrossRef]
  16. Li, Z.; Chen, Z.; Cheng, Q.; Fei, S.; Zhou, X. Deep Learning Models Outperform Generalized Machine Learning Models in Predicting Winter Wheat Yield Based on Multispectral Data from Drones. Drones 2023, 7, 505. [Google Scholar] [CrossRef]
  17. Giuffrida, M.V.; Doerner, P.; Tsaftaris, S.A. Pheno-Deep Counter: A unified and versatile deep learning architecture for leaf counting. Plant J. 2018, 96, 880–890. [Google Scholar] [CrossRef] [PubMed]
  18. Xu, D.; Chen, J.; Li, B.; Ma, J. Improving Lettuce Fresh Weight Estimation Accuracy through RGB-D Fusion. Agronomy 2023, 13, 2617. [Google Scholar] [CrossRef]
  19. Rasti, S.; Bleakley, C.J.; Silvestre, G.C.M.; Holden, N.M.; Langton, D.; O’Hare, G.M.P. Crop growth stage estimation prior to canopy closure using deep learning algorithms. Neural Comput. Appl. 2021, 33, 1733–1743. [Google Scholar] [CrossRef]
  20. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  21. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 2016; pp. 779–788. [Google Scholar]
  22. Yazdinejad, A.; Dehghantanha, A.; Parizi, R.M.; Epiphaniou, G. An optimized fuzzy deep learning model for data classification based on NSGA-II. Neurocomputing 2023, 522, 116–128. [Google Scholar] [CrossRef]
  23. Tang, C.; Chen, D.; Wang, X.; Ni, X.; Liu, Y.; Liu, Y.; Mao, X.; Wang, S. A fine recognition method of strawberry ripeness combining Mask R-CNN and region segmentation. Front. Plant Sci. 2023, 14, 1211830. [Google Scholar] [CrossRef]
  24. Siricharoen, P.; Yomsatieankul, W.; Bunsri, T. Recognizing the sweet and sour taste of pineapple fruits using residual networks and green-relative color transformation attached with Mask R-CNN. Postharvest Biol. Technol. 2023, 196, 112174. [Google Scholar] [CrossRef]
  25. Wang, D.; He, D. Fusion of Mask RCNN and attention mechanism for instance segmentation of apples under complex background. Comput. Electron. Agric. 2022, 196, 106864. [Google Scholar] [CrossRef]
  26. Cong, P.; Li, S.; Zhou, J.; Lv, K.; Feng, H. Research on Instance Segmentation Algorithm of Greenhouse Sweet Pepper Detection Based on Improved Mask RCNN. Agronomy 2023, 13, 196. [Google Scholar] [CrossRef]
  27. Wang, L.; Jia, K.; Fu, Y.; Xu, X.; Fan, L.; Wang, Q.; Zhu, W.; Niu, Q. Overlapped tobacco shred image segmentation and area computation using an improved Mask RCNN network and COT algorithm. Front. Plant Sci. 2023, 14, 1108560. [Google Scholar] [CrossRef] [PubMed]
  28. Yu, G.; Luo, Y.; Deng, R. Automatic segmentation of golden pomfret based on fusion of multi-head self-attention and channel-attention mechanism. Comput. Electron. Agric. 2022, 202, 107369. [Google Scholar] [CrossRef]
  29. Han, B.; Hu, Z.; Su, Z.; Bai, X.; Yin, S.; Luo, J.; Zhao, Y. Mask_LaC R-CNN for measuring morphological features of fish. Measurement 2022, 203, 111859. [Google Scholar] [CrossRef]
  30. Zhang, C.; Zhou, J.; Wang, H.; Tan, T.; Cui, M.; Huang, Z.; Wang, P.; Zhang, L. Multi-species individual tree segmentation and identification based on improved mask R-CNN and UAV imagery in mixed forests. Remote Sens. 2022, 14, 874. [Google Scholar] [CrossRef]
  31. Li, H.; Shi, H.; Du, A.; Mao, Y.; Fan, K.; Wang, Y.; Shen, Y.; Wang, S.; Xu, X.; Tian, L. Symptom recognition of disease and insect damage based on Mask R-CNN, wavelet transform, and F-RNet. Front. Plant Sci. 2022, 13, 922797. [Google Scholar] [CrossRef]
  32. Wang, H.; Mou, Q.; Yue, Y.; Zhao, H. Research on Detection Technology of Various Fruit Disease Spots Based on Mask R-CNN. In Proceedings of the 2020 IEEE International Conference on Mechatronics and Automation (ICMA), Beijing, China, 13–16 October 2020; pp. 1083–1087. [Google Scholar]
  33. Zhang, J.; Lu, J.; Zhang, Q.; Qi, Q.; Zheng, G.; Chen, F.; Chen, S.; Zhang, F.; Fang, W.; Guan, Z. Estimation of Garden Chrysanthemum Crown Diameter Using Unmanned Aerial Vehicle (UAV)-Based RGB Imagery. Agronomy 2024, 14, 337. [Google Scholar] [CrossRef]
  34. Zheng, C.; Abd-Elrahman, A.; Whitaker, V.M.; Dalid, C. Deep learning for strawberry canopy delineation and biomass prediction from high-resolution images. Plant Phenomics 2022, 2022, 9850486. [Google Scholar] [CrossRef]
  35. Li, L.; Bie, Z.; Zhang, Y.; Huang, Y.; Peng, C.; Han, B.; Xu, S. Nondestructive Detection of Key Phenotypes for the Canopy of the Watermelon Plug Seedlings Based on Deep Learning. Hortic. Plant J. 2023, in press. [Google Scholar] [CrossRef]
  36. Gao, X.; Zan, X.; Yang, S.; Zhang, R.; Chen, S.; Zhang, X.; Liu, Z.; Ma, Y.; Zhao, Y.; Li, S. Maize seedling information extraction from UAV images based on semi-automatic sample generation and Mask R-CNN model. Eur. J. Agron. 2023, 147, 126845. [Google Scholar] [CrossRef]
  37. Hao, Z.; Lin, L.; Post, C.J.; Mikhailova, E.A.; Li, M.; Chen, Y.; Yu, K.; Liu, J. Automated tree-crown and height detection in a young forest plantation using mask region-based convolutional neural network (Mask R-CNN). ISPRS J. Photogramm. Remote Sens. 2021, 178, 112–123. [Google Scholar] [CrossRef]
  38. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13728–13737. [Google Scholar]
  39. Hemming, S.; de Zwart, H.F.; Elings, A.; Bijlaard, M.; van Marrewijk, B.; Petropoulou, A. 3rd Autonomous Greenhouse Challenge: Online Challenge Lettuce Images. Available online: https://doi.org/10.4121/15023088.v1 (accessed on 2 March 2022).
  40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  41. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  42. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
Figure 1. Sample images of the four lettuce varieties; in total, there were 96, 102, 92, and 98 image pairs for Lugano, Salanova, Aphylion, and Satine, respectively.
Figure 2. Structure of the improved model.
Figure 3. Structure of ResNet and the training and inference structure of RepVGG.
Figure 4. Structural re-parameterization of a RepVGG block.
Figure 5. Fitting of five phenotypic traits for five-fold cross-validation.
Figure 6. Confusion matrix for five-fold cross-validation.
Figure 7. Model output of four varieties of lettuce: (a) Aphylion; (b) Lugano; (c) Salanova; (d) Satine.
Table 1. Detection and segmentation results of the five-fold cross-validation experiment.

| Fold | Det AP | Det AP50 | Det AP75 | Seg AP | Seg AP50 | Seg AP75 |
|---|---|---|---|---|---|---|
| k1 | 0.8674 | 0.9947 | 0.9832 | 0.8797 | 0.9947 | 0.9865 |
| k2 | 0.8608 | 0.9887 | 0.9862 | 0.8721 | 0.9887 | 0.9862 |
| k3 | 0.8697 | 0.9997 | 0.9866 | 0.8818 | 0.9997 | 0.9948 |
| k4 | 0.8743 | 0.9993 | 0.9899 | 0.8851 | 0.9993 | 0.9993 |
| k5 | 0.8699 | 0.9998 | 0.9812 | 0.8832 | 0.9998 | 0.9998 |
| Average | 0.8684 | 0.9964 | 0.9854 | 0.8803 | 0.9964 | 0.9933 |
Table 2. Results of phenotypic traits for different varieties of lettuce.

| Variety | fsw R2 | fsw MAPE | dsw R2 | dsw MAPE | h R2 | h MAPE | d R2 | d MAPE | la R2 | la MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| Lugano | **0.9656** | **0.0903** | 0.9587 | 0.1202 | 0.9121 | 0.0836 | **0.9188** | 0.0531 | 0.9509 | 0.0873 |
| Salanova | 0.9466 | 0.1448 | 0.9406 | 0.2278 | 0.8717 | 0.0914 | 0.9151 | **0.0494** | 0.9545 | 0.1016 |
| Aphylion | 0.9586 | 0.0975 | 0.9661 | 0.1346 | **0.9547** | **0.0597** | 0.9112 | 0.0591 | **0.9651** | **0.0798** |
| Satine | 0.9647 | 0.0957 | **0.9670** | **0.1201** | 0.9355 | 0.0670 | 0.8600 | 0.0570 | 0.9494 | 0.0919 |

The bold numbers are the best results for each column.
Table 3. Results of detection and segmentation based on different backbone networks.

| Backbone | Det AP | Det AP50 | Det AP75 | Seg AP | Seg AP50 | Seg AP75 | Model Parameters (MB) | Inference Time (s) | FLOPs (G) |
|---|---|---|---|---|---|---|---|---|---|
| RepVGG | **0.8684** | **0.9964** | **0.9854** | 0.8804 | **0.9964** | **0.9933** | **127** | **0.0154** | 82.486 |
| MobileNet_V3 | 0.8165 | 0.9851 | 0.9587 | 0.8460 | 0.9852 | 0.9727 | 331 | 0.0155 | **67.684** |
| EfficientNet | 0.8489 | **0.9964** | 0.9836 | **0.8840** | **0.9964** | 0.9884 | 344.4 | 0.0169 | 70.193 |
| ResNet50 | 0.8236 | 0.9904 | 0.9672 | 0.8787 | 0.9904 | 0.9810 | 585.8 | 0.0233 | 122.350 |

The bold numbers are the best results for each column.
Table 4. Results of phenotyping traits based on different backbone networks.

| Backbone | fsw R2 | fsw MAPE | dsw R2 | dsw MAPE | h R2 | h MAPE | d R2 | d MAPE | la R2 | la MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| RepVGG | **0.9600** | 0.1073 | 0.9596 | **0.1522** | 0.9329 | **0.0757** | **0.9136** | **0.0548** | **0.9592** | **0.0899** |
| MobileNet_V3 | **0.9600** | 0.1196 | **0.9628** | 0.1569 | **0.9337** | 0.0780 | 0.9100 | 0.0574 | 0.9523 | 0.0987 |
| EfficientNet | 0.9587 | **0.1063** | 0.9596 | 0.1531 | 0.9252 | 0.0828 | 0.9041 | 0.0570 | 0.9549 | 0.0936 |
| ResNet50 | 0.9504 | 0.1193 | 0.9560 | 0.1676 | 0.9115 | 0.0890 | 0.8867 | 0.0647 | 0.9529 | 0.0981 |

The bold numbers are the best results for each column.
Table 5. Results of detection and segmentation with different convolutional layers.

| Number of Layers | Det AP | Det AP50 | Det AP75 | Seg AP | Seg AP50 | Seg AP75 |
|---|---|---|---|---|---|---|
| 6 | 0.8566 | 0.9951 | 0.9832 | 0.8734 | 0.9951 | 0.9886 |
| 8 | **0.8684** | **0.9964** | **0.9854** | **0.8803** | **0.9964** | **0.9933** |
| 10 | 0.8642 | 0.9941 | 0.9805 | 0.8781 | 0.9941 | 0.9887 |

The bold numbers are the best results for each column.
Table 6. Results of phenotyping traits with different convolutional layers.

| Number of Layers | fsw R2 | fsw MAPE | dsw R2 | dsw MAPE | h R2 | h MAPE | d R2 | d MAPE | la R2 | la MAPE |
|---|---|---|---|---|---|---|---|---|---|---|
| 6 | 0.9558 | 0.1092 | 0.9589 | 0.1514 | 0.9168 | 0.0815 | 0.9119 | 0.0563 | 0.9549 | 0.0916 |
| 8 | **0.9600** | 0.1073 | 0.9596 | 0.1522 | **0.9329** | **0.0757** | **0.9136** | **0.0548** | **0.9592** | 0.0899 |
| 10 | 0.9570 | **0.1061** | **0.9599** | **0.1463** | 0.9211 | 0.0800 | 0.9020 | 0.0561 | 0.9590 | **0.0892** |

The bold numbers are the best results for each column.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
