1. Introduction
Lettuce, Lactuca sativa L., is a popular vegetable crop widely grown around the world, with high nutritional and economic value. Today, lettuce is ubiquitous in supermarkets, markets, and restaurants, and China, the United States, and Europe are the regions with the highest lettuce production value [1]. Research in this domain centers on producing lettuce rapidly, in large quantities, and with high quality. Plant phenotyping is an important branch of agriculture [2]. The morphological phenotypic traits of lettuce are crucial indicators for assessing its growth condition and quality, serving as the basis for breeding, cultivation, and harvesting. Traditional methods for measuring these traits rely on labor-intensive, time-consuming manual sampling and measurement [3]. Furthermore, the analytical results depend heavily on the operator's experience, often leading to inconsistent and inaccurate results [4]. Additionally, such methods can compromise the integrity and freshness of the lettuce, impacting its subsequent usability. Therefore, a rapid, accurate, non-destructive, and high-throughput method for measuring lettuce phenotypic traits is of great significance for improving the efficiency and quality of lettuce production.
In recent years, with the development of computer vision and deep learning techniques, image-based methods for measuring plant phenotypic traits have received widespread attention and application. Image sensors such as RGB cameras [5,6,7,8,9], depth cameras [10,11,12,13], and optical spectrometers [14,15] collect image information from plants, which is then analyzed and processed with deep learning models to enable automated, precise, and intelligent measurement of plant phenotypic traits. Li et al. [16] used UAV multispectral data as input to predict winter wheat yield; the convolutional neural network (CNN) model gave the best results, with an R² of 0.752 and an NMSE of 0.404 t·ha⁻¹. Giuffrida et al. [17] presented a Pheno-Deep Counter that predicts leaf count with an accuracy of 88.5%, using RGB, FMP, and NIR images as inputs. Zhang et al. [7] employed a CNN to estimate the fresh weight, dry weight, and leaf area of lettuce. The estimates agreed well with the measured values, with R² values of 0.8938, 0.8910, and 0.9156, and normalized root mean square error (NRMSE) values of 26.00%, 22.07%, and 19.94%, respectively. On this basis, Xu et al. [18] raised the fresh-weight R² to 0.9725 and decreased the MAPE to 8.47% by fusing RGB images with depth images and introducing the MSPE function. Rasti et al. [19] utilized deep learning to estimate the growth stages of barley and wheat; transfer learning based on the VGG19 model achieved the highest accuracy for both crops. Guo et al. [5] coupled deep learning with a novel directed search algorithm to obtain stem-related phenotypes of soybean. The Pearson correlation coefficients (R) of plant height, pitch number, internodal length, main stem length, stem curvature, and branching angle were 0.9904, 0.9853, 0.9861, 0.9925, 0.9084, and 0.9391, respectively. These studies demonstrate that deep learning is an effective technique for plant phenotype research.
Deep learning has advanced significantly over the years, giving rise to various classical algorithms, such as Mask R-CNN [20], YOLO [21], and hybrid deep learning models [22]. Owing to its excellent performance, scalability, and multi-tasking capability, Mask R-CNN is widely used in agriculture for identifying and estimating the characteristics of crops [23,24,25,26,27], fish [28,29], and trees [30], as well as for detecting diseases and pests [31,32]. Zhang et al. [33] delineated the chrysanthemum canopy contour and estimated the crown diameter using masks obtained from Mask R-CNN, achieving an R² of 0.9629 and a root mean square error (RMSE) of 2.2949 cm. Zheng et al. [34] used a Mask R-CNN model to segment RGB-NIR and RGB images of strawberries and established a regression model relating canopy leaf area to dry biomass for biomass prediction. Li et al. [35] used Mask R-CNN for leaf segmentation, classification, and counting of watermelon plug seedlings. Gao et al. [36] used UAV images and Mask R-CNN to extract maize seedling information; the average accuracy of the seedling emergence rate was 98.87%, and the model also showed good transferability. Hao et al. [37] combined Mask R-CNN with a canopy height model (CHM) to detect tree height and crowns; the tree height estimates were highly correlated with heights derived from UAV images (R² = 0.97). In these studies, phenotypic traits are mainly predicted from the segmentation results of Mask R-CNN; few studies output phenotypic traits directly from the model.
This paper presents a new method for estimating phenotypic traits of lettuce using RGB images and a Mask R-CNN model. The method estimates five phenotypic traits simultaneously while also producing object detection and segmentation results. The main contributions are as follows: (1) a phenotypic branch is added to the Mask R-CNN model to enable phenotypic trait estimation (sketched in code below); (2) the backbone network is replaced with RepVGG [38], improving speed and reducing the parameter count; (3) end-to-end estimation of five lettuce phenotypic traits is achieved. This work marks a significant advance in digital phenotypic analysis, paving the way for artificial intelligence applications in fresh vegetable production.
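To make contribution (1) concrete, the following is a minimal PyTorch sketch of such a trait-regression branch operating on Mask R-CNN's ROI-pooled features. The class name PhenoBranch, the 256-channel 7×7 ROI feature shape (torchvision's FPN default), and the exact layer layout are illustrative assumptions rather than this paper's released architecture; the eight convolutional layers mirror the depth selected in Section 4.2.

```python
# Hypothetical sketch of a phenotypic-trait regression branch for Mask R-CNN.
# Assumes 256-channel 7x7 ROI-pooled features (torchvision's FPN default);
# names and layout are illustrative, not the paper's exact design.
import torch
import torch.nn as nn

class PhenoBranch(nn.Module):
    """Eight conv layers + global pooling + a linear layer mapping each
    ROI to five traits (fresh/dry weight, height, diameter, leaf area)."""
    def __init__(self, in_channels: int = 256, num_traits: int = 5, depth: int = 8):
        super().__init__()
        convs, ch = [], in_channels
        for _ in range(depth):
            convs += [nn.Conv2d(ch, 256, kernel_size=3, padding=1),
                      nn.BatchNorm2d(256),
                      nn.ReLU(inplace=True)]
            ch = 256
        self.convs = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(256, num_traits)

    def forward(self, roi_feats: torch.Tensor) -> torch.Tensor:
        x = self.convs(roi_feats)          # (N, 256, 7, 7)
        x = self.pool(x).flatten(1)        # (N, 256)
        return self.fc(x)                  # (N, 5) trait estimates per ROI

# Smoke test on dummy ROI features.
if __name__ == "__main__":
    print(PhenoBranch()(torch.randn(4, 256, 7, 7)).shape)  # torch.Size([4, 5])
```

During training, such a branch's regression loss (e.g., an MSE- or MSPE-style term) would simply be added to Mask R-CNN's existing multi-task loss.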
4. Discussion
4.1. Comparison of Different Backbones
To thoroughly validate the selection of RepVGG as the backbone network for object detection and morphological phenotypic trait prediction, we performed ablation experiments. Other backbone networks evaluated include ResNet50 (used in the classic Mask R-CNN), MobileNetV3 (popular for its lightweight design), and EfficientNet (known for its balance between accuracy and speed).
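As a note on mechanics before the results: in recent torchvision versions, swapping the Mask R-CNN backbone amounts to passing any module that exposes an out_channels attribute. The sketch below illustrates the mechanism with a stock ResNet50-FPN stand-in; wiring in RepVGG, as done in this paper, would require wrapping it the same way and is not shown.

```python
# Illustrative backbone swap for Mask R-CNN (recent torchvision assumed).
# A ResNet50-FPN stands in here; the paper's RepVGG would be wrapped so
# that it exposes `out_channels` in the same way.
import torch
from torchvision.models.detection import MaskRCNN
from torchvision.models.detection.backbone_utils import resnet_fpn_backbone

backbone = resnet_fpn_backbone(backbone_name="resnet50", weights=None)
model = MaskRCNN(backbone, num_classes=2)  # background + lettuce
model.eval()

# Dummy forward pass: a list of CHW images scaled to [0, 1].
with torch.no_grad():
    preds = model([torch.rand(3, 256, 256)])
print(preds[0].keys())  # dict with boxes, labels, scores, masks
```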
Table 3 and Table 4 show the average results of the five-fold cross-validation for each backbone network, with the best result in each column highlighted in bold. RepVGG is the best-performing backbone network overall. In terms of detection and segmentation, the AP metrics for all backbone networks exceed 0.81, with segmentation metrics performing better than detection metrics.
EfficientNet shows slightly better AP than RepVGG and a similar AP50, but RepVGG outperforms EfficientNet on AP75. ResNet50 ranks third, with a detection AP of 0.8236 and a segmentation AP of 0.8787, and MobileNetV3 performs the poorest of the four backbone networks. For the phenotypic trait regression metrics, RepVGG delivers the best overall performance as the backbone network; however, the difference between the best and worst values in each column is only around 0.01. This suggests that the choice of backbone network has limited influence on the regression results, and that the information that can be mined from the data for regression prediction may be close to saturation.
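For reference, the regression columns in such tables are typically computed as below. This is a minimal numpy sketch; the sample arrays are placeholders, not data from this study.

```python
# Minimal sketch of the regression metrics (R2 and MAPE); placeholder data.
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true))

y_true = np.array([120.0, 95.0, 150.0, 80.0])   # e.g., fresh weight (g)
y_pred = np.array([115.0, 99.0, 142.0, 84.0])
print(f"R2 = {r2(y_true, y_pred):.4f}, MAPE = {mape(y_true, y_pred):.4f}")
```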
We assessed network performance and efficiency in terms of model parameters, inference time, and FLOPs (floating-point operations, i.e., the number of operations performed during the model's forward pass). Among the models examined, ResNet50 had the longest inference time (0.0233 s), the largest parameter size (585.8 MB), and the highest FLOPs. MobileNetV3 and EfficientNet had similar parameter sizes (331 MB and 344.4 MB, respectively) and similar inference times (0.0155 s and 0.0169 s); these two models had the lowest and second-lowest FLOPs. RepVGG ranked third in FLOPs but had the smallest parameter size (127 MB) and a slightly faster inference time than MobileNetV3. This indicates that RepVGG, as a backbone network, offers few parameters and fast inference. In practical applications, the model could therefore be deployed on a Raspberry Pi to build an automatic monitoring platform for lettuce phenotypes.
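The parameter-size and latency figures above can be reproduced along the following lines. This is a hedged sketch: a stock ResNet50 serves as a placeholder model, float32 (4 bytes per parameter) is assumed for the MB conversion, and FLOPs would come from an external profiler such as thop or fvcore rather than a hand count.

```python
# Hedged sketch of the efficiency measurements: parameter size in MB and
# mean forward-pass latency. The model and input shape are placeholders.
import time
import torch
import torchvision

model = torchvision.models.resnet50(weights=None).eval()  # stand-in backbone
x = torch.randn(1, 3, 512, 512)

n_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {n_params * 4 / 1024**2:.1f} MB")  # float32 assumed

with torch.no_grad():
    for _ in range(5):                     # warm-up runs
        model(x)
    runs, t0 = 20, time.perf_counter()
    for _ in range(runs):
        model(x)
    print(f"inference time: {(time.perf_counter() - t0) / runs:.4f} s")
```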
4.2. Comparison of the Phenotypic Trait Branch with Different Numbers of Convolutional Layers
To identify the most effective number of convolutional layers for the phenotypic trait branch, an ablation experiment was conducted with CNN depths of 6, 8, and 10 layers. The averaged outcomes of five-fold cross-validation for the different layer counts are shown in Table 5 and Table 6, with the best result in each column highlighted in bold.
The tables show that overall performance is optimal with eight CNN layers: the eight-layer model consistently outperforms the others in both detection and segmentation. The gap between six and eight layers is significant, while the gap between eight and ten layers is negligible. For phenotypic trait prediction, the eight-layer model again performs best: the six-layer model performs worst, and the eight-layer model slightly outperforms the ten-layer model. These results demonstrate that an adequate number of convolutional layers is essential for good performance, but excessive layers can diminish the results. An eight-layer CNN was therefore chosen for the phenotypic prediction branch.
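A minimal sketch of how such a depth ablation can be assembled is given below; the head builder, channel width, and parameter-count comparison are illustrative assumptions, and a real ablation would of course retrain and cross-validate each variant as done here.

```python
# Illustrative depth ablation for the trait head: build 6-, 8-, and 10-layer
# variants and compare parameter counts on dummy ROI features.
import torch
import torch.nn as nn

def make_head(depth: int, in_ch: int = 256, num_traits: int = 5) -> nn.Sequential:
    layers, ch = [], in_ch
    for _ in range(depth):
        layers += [nn.Conv2d(ch, 256, kernel_size=3, padding=1), nn.ReLU(inplace=True)]
        ch = 256
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_traits)]
    return nn.Sequential(*layers)

for depth in (6, 8, 10):
    head = make_head(depth)
    n = sum(p.numel() for p in head.parameters())
    out = head(torch.randn(2, 256, 7, 7))
    print(f"{depth} conv layers: {n / 1e6:.2f} M params, output {tuple(out.shape)}")
```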
5. Conclusions
To assess the growth status and quality of lettuce, morphological phenotypic characteristics were selected for evaluation. In this study, we replaced the backbone network of the Mask R-CNN framework and incorporated a phenotypic trait branch to estimate fresh weight, dry weight, plant height, canopy diameter, and leaf area. The results show an average AP of 0.8684 for detection and 0.8803 for segmentation. The R² values for all phenotypic traits exceed 0.91, with varying MAPE values: trait dsw has the highest MAPE (0.1522), while trait d has the lowest (0.0548).
This study has the following limitations: it does not consider the effect of leaf curling; prediction accuracy could be improved by fusing depth images into the model; and data collection should account for the challenges posed by changes in the shooting environment and shooting angle.