1. Introduction
As the global population continues to increase and climate change intensifies, sufficient and stable food supplies are facing enormous challenges [1]. Greenhouse planting is an agricultural production method that can isolate crops from adverse external environmental conditions and achieve year-round production. However, it also cuts off rainfall, making artificial irrigation the only source of water for greenhouse crops. The transpiration intensity of tomato is closely related to the amount of irrigation, and many studies have found that crop evapotranspiration (ETc) decreases with decreasing irrigation [2]. Therefore, reasonable irrigation regulation according to changes in the greenhouse environment and crop growth stage is an important means of ensuring the growth of greenhouse crops [3]. Accurate and rapid prediction of the transpiration rate (Tr) of greenhouse crops is the key to making such irrigation regulation possible. By establishing a Tr prediction model, we can better understand the water use and growth patterns of greenhouse crops, provide a decision-making basis and technical support for scientific irrigation [4], and thereby improve the water-use efficiency of greenhouse crops.
At present, the traditional method for calculating crop transpiration is the crop coefficient model proposed in FAO-56, which has been widely applied to greenhouse crops such as tomato, eggplant, and lettuce [5,6,7,8]. However, in addition to meteorological factors such as air temperature, relative humidity, and solar radiation, the crop coefficient model requires parameters that are difficult to obtain, such as canopy resistance and aerodynamic resistance, which limits its wide application. In addition, the crop coefficient (Kc), an important parameter in the calculation formula, is often set to an empirical value and is affected by climatic conditions and soil characteristics, so it introduces significant error in practical applications. Some studies have shown that the mean squared error (MSE) of Kc over the entire growth stage of tomatoes can reach 11.9–71.4% [9]. With the increasing scarcity of water resources and the development of precision irrigation technology, higher requirements have been put forward for real-time irrigation regulation. Analyzing and predicting changes in transpiration water consumption closer to real time, with higher-frequency Tr analysis, is therefore more meaningful for on-demand irrigation regulation. However, traditional calculation methods lose accuracy and goodness of fit when applied to the nonlinear, rapidly changing behavior of Tr.
Machine learning (ML) is a class of methods that uses data and algorithms to achieve automated learning and reasoning. Its rise provides new possibilities for predicting crop Tr [10,11]. These algorithms can better capture hidden patterns in data, adapt to different environments and crop conditions, and improve the accuracy and efficiency of predictions [12]. Tunalı et al. [13] used artificial neural networks (ANNs) to estimate the actual crop evapotranspiration (ETc) of tomatoes in soilless cultivation systems and compared the results with the traditional “two-step” method based on reference evapotranspiration (ETo) and Kc; they found that the accuracy of the ANN model for site-specific ETc prediction in soilless cultivation was 30% higher than that of the traditional method. Nam et al. [14] used ANNs to estimate Tr and found an annual RMSE of 0.08–0.10 g·m−2·min−1, which is clearly better than the accuracy of traditional estimation methods. However, there are still some challenges in developing crop transpiration models based on machine learning algorithms. Most of the ANNs used in current research require massive amounts of training data to avoid over-fitting and under-fitting [15]. In practical agricultural production, the available data sets are often limited, so it is necessary to explore algorithms and methods that are better suited to small-sample data sets.
In addition, tomato Tr is affected by the growth stage and environmental factors, shows nonlinear and complex variation, and differs considerably between growth stages [16]. To improve prediction accuracy, this study introduced a feature extraction algorithm and a clustering algorithm to extract the data characteristics of the crop growth stages, divide the data into feature intervals, and construct a corresponding prediction model for each interval [17], thereby exploring a predictive modeling method that takes the crop growth process into account.
To achieve this goal, we took the following steps: (1) used competitive adaptive reweighted sampling (CARS) to select characteristic variables from the environmental variables and determine the best combination of input variables; (2) used the t-distributed stochastic neighbor embedding (t-SNE) algorithm to cluster the data and divide it into intervals with distinct Tr characteristics; and (3) established a tomato Tr prediction model based on the CatBoost algorithm and verified the feasibility of the model with data from a tomato planting experiment.
2. Materials and Methods
The experiment was carried out in a solar-powered greenhouse at the National Precision Agriculture Demonstration Base in Changping District, Beijing (116°27′26.557″ E, 40°11′10.779″ N, 50 m above sea level) from April to July 2022 (Figure 1A). The greenhouse is 60 m long with a span of 8 m and a total area of 480 m², of which the test area measured 22 m × 7 m. The tomato plants were planted in two rows of 36 plants each, with a plant spacing of 30 cm and a planting density of 4.6 plants/m². The tomatoes were planted on 28 April at the “six leaves and one heart” stage. The experiment began after the plants entered the flowering stage on 20 May; the plants entered the fruiting stage on 17 June and were removed on 28 July. They were grown in coconut coir substrate and irrigated by drip irrigation under plastic film. Irrigation was controlled by a radiation accumulation threshold: the threshold for each irrigation event was set to 120 kJ·m−2·h−1, and when it was reached, the irrigation controller started the water pump and opened the solenoid valve to begin irrigation.
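The threshold logic described above can be illustrated with a minimal sketch. It assumes that radiation is summed reading by reading and that the accumulator is reset to zero after each irrigation event; the reset rule, the function name `irrigation_events`, and the sample readings are illustrative assumptions and are not taken from the irrigation controller used in the experiment.

```python
def irrigation_events(radiation_readings, threshold=120.0):
    """Return the indices of the readings at which irrigation would start.

    radiation_readings: radiation received in each sampling interval
    threshold: accumulated radiation that triggers one irrigation event
    """
    events = []
    accumulated = 0.0
    for index, reading in enumerate(radiation_readings):
        accumulated += reading
        if accumulated >= threshold:
            events.append(index)
            # Assumption: the accumulator restarts from zero after irrigation.
            accumulated = 0.0
    return events


# Hypothetical readings, in the same unit as the threshold.
sample = [30.0, 50.0, 45.0, 20.0, 60.0, 70.0]
print(irrigation_events(sample))  # -> [2, 5]
```

With this rule, irrigation frequency follows the radiation load: events are closer together on bright days and further apart on overcast days.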
2.1. Data Collection and Processing
2.1.1. Data Collection
The actual value of tomato Tr was measured with an on-line substrate weighing system developed by the National Agricultural Intelligent Equipment Engineering Technology Research Center. Six tomato plants with consistent growth status were selected. The system monitored the change in weight of the substrate in which each sample plant was growing and the flow rate of the return liquid, and tomato Tr was then calculated using Equation (1) [18]:
In the formula, Tr is the transpiration rate of the plant (mm·h−1); IREF is the amount of irrigation water supplied between times T1 and T2, as measured by the electronic water meter (g); BWT1 and BWT2 are the substrate weights at T1 and T2, respectively (g); DLiquid is the accumulated return liquid collected by the return flow meter between T1 and T2 (g); and A is the crop leaf coverage area (m²).
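The water balance implied by these definitions can be sketched as follows. This is an assumed form, not a transcription of Equation (1) in [18]: it takes the water transpired between T1 and T2 to be the irrigation input plus the loss in substrate weight minus the return liquid, and converts grams to millimetres using the fact that 1 g of water spread over 1 m² is 0.001 mm. The function name, the `hours` argument, and the sample numbers are hypothetical.

```python
def transpiration_rate(iref_g, bw_t1_g, bw_t2_g, d_liquid_g, area_m2, hours=1.0):
    """Estimate Tr (mm per hour) from a substrate water balance.

    iref_g:     irrigation water supplied between T1 and T2 (g)
    bw_t1_g:    substrate weight at T1 (g)
    bw_t2_g:    substrate weight at T2 (g)
    d_liquid_g: return liquid collected between T1 and T2 (g)
    area_m2:    crop leaf coverage area (m2)
    hours:      length of the interval T1..T2 (h), assumed to be 1 h
    """
    water_lost_g = iref_g + bw_t1_g - bw_t2_g - d_liquid_g
    # 1 g of water over 1 m2 corresponds to a depth of 0.001 mm.
    depth_mm = water_lost_g / 1000.0 / area_m2
    return depth_mm / hours


# Hypothetical hour: 500 g irrigated, substrate 100 g heavier, 150 g drained.
print(transpiration_rate(500.0, 10000.0, 10100.0, 150.0, area_m2=0.5))  # -> 0.5
```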
The greenhouse sensing system included the following: a greenhouse environment sensor (Figure 1B) (National Agricultural Intelligent Equipment Engineering Technology Research Center, Beijing, China), installed about 20 cm above the crop growing point, which collected air temperature (°C), relative air humidity (%), light intensity (μmol·m−2·s−1), and CO2 concentration (ppm) in real time; a TEROS 12 sensor (METER Group, Pullman, WA, USA), used to measure substrate temperature (°C), with the probe buried about 5 cm horizontally and 10 cm vertically from the arrow dripper; and a total radiation sensor (Wuhan Hanqin in System Science & Technology Co., Ltd., Wuhan, China), installed 2 m above the ground, which collected accumulated radiation data in the greenhouse.
The weight and environmental sensing data were collected every 10 min and uploaded to the agricultural data platform (http://envsys.nxagricloud.com/ (accessed on 15 January 2023)) via a 4G module. The last value collected in each hour was used as the parameter value for that hour.
In addition, vapor pressure deficit (VPD, kPa) is one of the main variables used in constructing transpiration models [19]. The VPD value is low in the dry summer conditions of greenhouse cultivation, and the difference between day and night is large, so it can be used as an environmental parameter in this study. VPD was obtained with the following calculation formula:
In the formula, Tm is the average air temperature (°C), RH is the average relative humidity (%), and e is the base of the natural logarithm.
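A widely used expression that matches these symbols is the FAO-56 form, in which the saturation vapor pressure is 0.6108·e^(17.27·Tm/(Tm + 237.3)) kPa and VPD is that value multiplied by (1 − RH/100). The sketch below assumes this form and these coefficients; the exact formula used in the study may differ, and the function name is hypothetical.

```python
import math


def vapor_pressure_deficit(t_mean_c, rh_mean_pct):
    """Return VPD in kPa from mean air temperature (deg C) and mean RH (%).

    Assumes the FAO-56 saturation vapor pressure expression.
    """
    saturation_kpa = 0.6108 * math.exp(17.27 * t_mean_c / (t_mean_c + 237.3))
    return saturation_kpa * (1.0 - rh_mean_pct / 100.0)


# At 25 deg C the saturation vapor pressure is about 3.17 kPa,
# so 50 % relative humidity gives a VPD of roughly 1.58 kPa.
print(round(vapor_pressure_deficit(25.0, 50.0), 2))
```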
2.1.2. Data Processing
Table 1 lists the parameters required for the experiment. When data were monitored and transmitted by the IoT sensors, issues such as device stability and signal quality caused partial data loss, duplication, data imbalance, and inconsistent data types. Abnormal data accounted for 1.7% of the total data set. To improve data quality and ensure the training speed and prediction accuracy of the model, the data were processed as follows: (1) isolated missing values were filled by linear interpolation; (2) where a whole block of data was missing, the data for that period were deleted; and (3) the data were normalized using the following formula so that all variables were on the same scale:

$$X_{nom} = \frac{X - X_{min}}{X_{max} - X_{min}} \qquad (3)$$

where Xnom is the normalized value, X is the original value, and Xmin and Xmax are the minimum and maximum values of the original data, respectively.
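Steps (1) and (3) can be sketched in a few lines of plain Python. The helper names `fill_gaps_linear` and `min_max_normalize` are hypothetical, missing values are represented as `None`, and gaps at the very start or end of a series are left unfilled because they have no neighbor on one side; how the study handled those edge cases is not stated.

```python
def fill_gaps_linear(values):
    """Fill interior runs of None by linear interpolation between neighbors."""
    filled = list(values)
    n = len(filled)
    i = 0
    while i < n:
        if filled[i] is None:
            start = i - 1                  # last known value before the gap
            end = i
            while end < n and filled[end] is None:
                end += 1                   # first known value after the gap
            if start >= 0 and end < n:
                step = (filled[end] - filled[start]) / (end - start)
                for k in range(i, end):
                    filled[k] = filled[start] + step * (k - start)
            i = end
        else:
            i += 1
    return filled


def min_max_normalize(values):
    """Scale values to [0, 1] with (X - Xmin) / (Xmax - Xmin)."""
    lo, hi = min(values), max(values)
    if hi == lo:                           # constant series: avoid dividing by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


temperature = [20.0, None, None, 26.0, 24.0]
print(fill_gaps_linear(temperature))       # -> [20.0, 22.0, 24.0, 26.0, 24.0]
print(min_max_normalize([2.0, 4.0, 6.0]))  # -> [0.0, 0.5, 1.0]
```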
Descriptive statistics of the environmental variables were examined with box plots, in which the thresholds and abnormal values of each variable can be observed directly. As shown in Figure 2, air temperature and relative air humidity show no obvious abnormal values. The maximum thresholds of Rn and VPD are 209.47 kJ·m−2·h−1 and 5.04 kPa, respectively; their abnormal values are few and are concentrated above the maximum, at 209.14–297.27 kJ·m−2·h−1 and 5.10–6.14 kPa, respectively. The abnormal values of light intensity are 769.60–1464 μmol·m−2·s−1, and the data are generally concentrated between the upper quartile and the 90th percentile. The abnormal values of CO2 concentration are distributed in the ranges 558.50–774.55 ppm and 223.30–311.65 ppm.
2.2. Tr Prediction Model Construction
This study adopts CatBoost as the base algorithm for the tomato Tr prediction model, and optimizing the input data is the key to model construction. The construction process of the CARS-CatBoost model is shown in Figure 3. The model combines the strong ability of CARS to select feature variables, the ability of CatBoost to produce good results without extensive training data, and the clustering advantage of the t-SNE algorithm for nonlinear variables to improve model accuracy. The structure of the CARS-CatBoost prediction model is shown in Figure 4. The specific steps for building the CARS-CatBoost model can be summarized as follows:
(1) Combine the continuous tomato Tr and meteorological data so that the individual time series are converted into a two-dimensional matrix. In this gridded data matrix, the columns contain tomato Tr and the meteorological variables from left to right, and the rows are ordered in time from earliest to latest from top to bottom. With m time steps and n meteorological variables, the gridded time-series data can be represented as:

$$D = \begin{bmatrix} T_{r,1} & x_{1,1} & \cdots & x_{1,n} \\ T_{r,2} & x_{2,1} & \cdots & x_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ T_{r,m} & x_{m,1} & \cdots & x_{m,n} \end{bmatrix}$$

(2) Pre-process the data collected and calculated from the various sensors, use the CARS algorithm to iteratively retain and eliminate variables, and select the variable subset with the smallest root mean square error of cross-validation (RMSECV) as the optimal combination of variables. In this study, the CARS algorithm was used to screen the environmental data in the training set, with the number of Monte Carlo sampling runs set to 100.
(3) Use t-SNE to map the high-dimensional features into two-dimensional space to form clusters, and build a model for each cluster.
(4) Because the CatBoost parameters need to be tuned during model building, the processed data set was randomly divided into a training set and a test set. The training set is used for parameter tuning, and the best model parameters are chosen according to the model evaluation metrics. The test set is used to check the generalization performance of the model; keeping it independent of the parameters obtained from the training set makes the evaluation more robust.
(5) Construct the CARS-CatBoost Tr prediction model, use it and the single crop coefficient model to predict Tr over the tomato growth stages and within the characteristic intervals identified by t-SNE, and evaluate and discuss the prediction accuracy of the model.
2.2.1. CARS Variable Selection
The input variables of the model directly determine the accuracy and computational efficiency of the prediction. In this study, variables with large absolute regression coefficients in the partial least squares (PLS) model were retained through adaptive reweighted sampling (ARS), yielding multiple variable subsets. The optimal combination of variables related to Tr was then selected from these subsets by cross-validation. CARS variable selection was run in MATLAB 2019b; the optimal number of latent variables was selected by Monte Carlo cross-validation, and the optimal variable subset was selected according to the RMSECV obtained from PLS cross-validation modeling (Figure 4A). In this way, the variables were compressed, the model structure was simplified, and the model performance was improved [20].
Assume that Y is the m × 1 target attribute vector of the samples and X is the m × n sample matrix, where m is the number of samples and n is the number of variables. $W$ is the combination coefficient matrix; $T$ is the score matrix, a linear combination of X with coefficients $W$; $c$ is the regression coefficient vector of the PLS model built from Y and T; and $b = Wc = [b_1, b_2, \ldots, b_n]^{T}$ and $e$ are the n-dimensional regression coefficient vector and the sample prediction residual, respectively. Formulas (5) and (6) are assumed to hold:

$$T = XW \qquad (5)$$

$$Y = Tc + e = XWc + e = Xb + e \qquad (6)$$

In Formula (6), the contribution of the i-th variable to Y is represented by the absolute value $|b_i|$ of the i-th element of the regression coefficient vector $b$. The weight $w_i$, which is the proportion of $|b_i|$ in the total contribution of all variables, is used as the variable preference index to evaluate the importance of each variable. The larger the value of $w_i$, the more important the variable, as shown in Formula (7):

$$w_i = \frac{|b_i|}{\sum_{i=1}^{n} |b_i|}, \quad i = 1, 2, \ldots, n \qquad (7)$$
Each calculation of the weights is, in effect, an evaluation of variable importance. In each sampling run, the variables with larger weights are retained, ARS is used to draw a new variable subset from them, and a PLS model is built on that subset to calculate its RMSECV. With the number of sampling runs set to N, this procedure is repeated N times; when sampling ends, the variable subset with the smallest RMSECV is taken as the optimal variable subset.
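The weighting and retention step of one sampling run can be sketched as follows. This is a simplified illustration, not the MATLAB procedure used in the study: it takes a vector of regression coefficients as given (fitting the PLS model is omitted), keeps a fixed fraction of variables by rank instead of drawing them by weighted random sampling, and uses hypothetical coefficient values. The helper names are assumptions.

```python
def variable_weights(coefficients):
    """Weight of each variable: |b_i| divided by the sum of all |b_i|."""
    total = sum(abs(b) for b in coefficients)
    return [abs(b) / total for b in coefficients]


def retain_top_variables(names, coefficients, keep_ratio):
    """Keep the highest-weighted fraction of variables (at least one)."""
    weights = variable_weights(coefficients)
    n_keep = max(1, round(len(names) * keep_ratio))
    ranked = sorted(zip(names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_keep]


# Hypothetical regression coefficients for four environmental variables.
names = ["Rn", "VPD", "Tm", "RHm"]
coefs = [0.50, -0.30, 0.15, -0.05]

for name, weight in retain_top_variables(names, coefs, keep_ratio=0.5):
    print(name, round(weight, 2))
```

In the full algorithm this step is repeated in every sampling run with a shrinking retention ratio, and the subset whose PLS model gives the smallest RMSECV is kept.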
2.2.2. t-SNE Visual Analysis
Because tomato transpiration varied greatly between growth stages during the experiment, interval-based modeling was considered to improve the prediction accuracy of the model. t-SNE was used to reduce the output of the test data set to two- or three-dimensional space. The data points in the t-SNE plot were colored by cluster, and the distances between points visually display the similarities and differences between samples and distinguish the differences in Tr between tomato growth stages (Figure 4A).
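The specific steps of the t-SNE algorithm are as follows. Given a set $X = \{x_1, x_2, \ldots, x_N\}$ containing N sample points, for any two samples i and j the algorithm expresses the distance between the samples as the probability $p_{ij}$, given by Formula (8):

$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N} \qquad (8)$$

The conditional probability $p_{j|i}$ between sample points is defined by Formula (9):

$$p_{j|i} = \frac{\exp\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)} \qquad (9)$$

where $\sigma_i$ is the standard deviation of the Gaussian distribution centered on $x_i$. The sample set $Y = \{y_1, y_2, \ldots, y_N\}$ obtained after t-SNE dimension reduction is the mapping from the high-dimensional space X to the low-dimensional space Y, and the distance $q_{ij}$ between sample points in Y can be expressed as Formula (10):

$$q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}} \qquad (10)$$

The final optimization objective of the t-SNE algorithm is the KL divergence, expressed as Formula (11):

$$C = KL(P \parallel Q) = \sum_{i} \sum_{j} p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (11)$$

Generally, $p_{ii}$ and $q_{ii}$ are set to 0. Since minimizing the KL divergence is a non-convex optimization problem, gradient descent can be used to solve it. The gradient of the KL divergence is given by Formula (12):

$$\frac{\partial C}{\partial y_i} = 4 \sum_{j} \left(p_{ij} - q_{ij}\right)\left(y_i - y_j\right)\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1} \qquad (12)$$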
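The quantities in the t-SNE objective can be computed directly for a handful of points, which helps make the definitions concrete. The sketch below is an illustration only: it uses one fixed σ for every point, whereas t-SNE implementations normally tune each σ_i to a target perplexity, and it evaluates the KL divergence for a given embedding without performing the gradient descent. All function names and sample coordinates are hypothetical.

```python
import math


def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))


def high_dim_affinities(points, sigma=1.0):
    """Symmetric joint probabilities p_ij built from Gaussian conditionals."""
    n = len(points)
    conditional = []
    for i in range(n):
        row = [0.0 if j == i
               else math.exp(-squared_distance(points[i], points[j]) / (2 * sigma ** 2))
               for j in range(n)]
        total = sum(row)
        conditional.append([value / total for value in row])
    return [[(conditional[i][j] + conditional[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]


def low_dim_affinities(embedding):
    """Joint probabilities q_ij from a Student-t kernel with one degree of freedom."""
    n = len(embedding)
    kernel = [[0.0 if i == j
               else 1.0 / (1.0 + squared_distance(embedding[i], embedding[j]))
               for j in range(n)] for i in range(n)]
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def kl_divergence(p, q):
    """Sum of p_ij * log(p_ij / q_ij), skipping the zero diagonal entries."""
    return sum(p_ij * math.log(p_ij / q_ij)
               for p_row, q_row in zip(p, q)
               for p_ij, q_ij in zip(p_row, q_row)
               if p_ij > 0 and q_ij > 0)


# Three hypothetical samples with three features, and a candidate 2-D embedding.
samples = [(0.0, 0.1, 0.2), (0.1, 0.0, 0.3), (1.5, 1.4, 1.6)]
embedding = [(0.0, 0.0), (0.2, 0.1), (2.0, 2.0)]

p = high_dim_affinities(samples)
q = low_dim_affinities(embedding)
print(kl_divergence(p, q))
```

A lower value means that the embedding preserves the neighborhood structure of the original samples more faithfully.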
2.2.3. Classification Gradient Boosting Model (CatBoost)
Considering the complexity of Tr changes and the small size of the driving data set, a decision tree-based machine learning model, CatBoost, was established. Thanks to its gradient boosting technique, CatBoost is fast to compute, is less prone to overfitting than other algorithms, and can learn the relationship between crop Tr and other variables from a small amount of historical data. Because the same split criterion is used for all nodes at a given level, the trees it builds are symmetric and balanced. CatBoost also introduces an algorithm called Ordered boosting [21] (shown in Algorithm 1). For the input data set $D = \{(X_i, Y_i)\}_{i=1,\ldots,n}$, random permutations are performed, and for each sample the average label value of the samples with the same category value that precede it in the permutation is calculated (Figure 4B). Finally, the following formula will replace all categorical features:
In the formula, the parameter $a$ is the prior weight, which suppresses noise from low-frequency categories; $P$ is the prior value; $Y_i$ is the target; and $X_i$ is the feature. The main parameters of the CatBoost model used in this paper are shown in Table 2.
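The categorical encoding described above can be illustrated with a short sketch. It assumes the commonly documented form of the ordered target statistic, in which each sample is encoded as (sum of the targets of earlier samples in the same category + a·P) / (number of such earlier samples + a). It also uses a single pass in the given order in place of random permutations. The function name, the category labels, and the numbers are hypothetical.

```python
def ordered_target_statistic(categories, targets, prior, prior_weight=1.0):
    """Encode each category value using only the samples that come before it."""
    running_sum = {}
    running_count = {}
    encoded = []
    for category, target in zip(categories, targets):
        seen_sum = running_sum.get(category, 0.0)
        seen_count = running_count.get(category, 0)
        encoded.append((seen_sum + prior_weight * prior) / (seen_count + prior_weight))
        # Update the history only after encoding, so a sample never sees its own target.
        running_sum[category] = seen_sum + target
        running_count[category] = seen_count + 1
    return encoded


stages = ["flowering", "fruiting", "flowering", "flowering"]
labels = [1.0, 0.0, 1.0, 0.0]
print(ordered_target_statistic(stages, labels, prior=0.5))
```

The first sample of every category receives the prior itself, which is how the prior weight damps the noise of rarely seen categories.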
To overcome the conditional shift that may occur when the structure and distribution of the training and test data sets differ, CatBoost uses the Ordered boosting algorithm [22]. For a sample $x_i$, if its gradient is estimated with a model that was trained without that sample, the estimate can be regarded as unbiased.
Algorithm 1: Ordered boosting
2.3. Model Training Environment and Evaluation Metrics
The models were trained on a machine with an AMD Ryzen 5 3600 CPU @ 3.60 GHz, an NVIDIA GeForce GTX 1660 SUPER GPU, and 16 GB of RAM. Anaconda was used as the base platform for machine learning training.
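The performance of the model during training and testing was evaluated with three statistical indicators: root mean square error (RMSE), mean absolute error (MAE), and coefficient of determination (R²). They are calculated as shown in Formulas (14)–(16). In addition, the single crop coefficient model [6,23] was used to calculate crop Tr for comparison with the predictions of the hybrid prediction model proposed in this study.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(\hat{y}_i - y_i\right)^2} \qquad (14)$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left|\hat{y}_i - y_i\right| \qquad (15)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \qquad (16)$$

In the above formulas, $\hat{y}_i$ represents the predicted value, $y_i$ represents the actual value, $\bar{y}$ represents the average of the actual values, and $n$ is the number of samples.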
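The three indicators can be computed with a few lines of plain Python. This sketch uses the standard definitions of RMSE, MAE, and R² (with R² taken as one minus the ratio of residual to total sum of squares); the sample values are hypothetical and are not results from this study.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total


# Hypothetical measured and predicted Tr values (mm per hour).
measured = [1.0, 2.0, 3.0, 4.0]
estimated = [1.1, 1.9, 3.2, 3.8]

print(round(rmse(measured, estimated), 3))       # -> 0.158
print(round(mae(measured, estimated), 3))        # -> 0.15
print(round(r_squared(measured, estimated), 3))  # -> 0.98
```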
4. Discussion
In this study, nine environmental variables (Tmax, Tm, Tmin, RHmax, RHm, RHmin, VPD, Rn, and Ts) were used as input variables. The correlation analysis between the environmental variables and Tr showed that during the tomato florescence and fruiting stages, the main environmental variables affecting Tr were Rn, T, RH, and VPD, which is consistent with the conclusions of previous studies [25,26], and the correlation with Rn was the most significant. This is because, as solar radiation increased, the temperature in the greenhouse gradually rose and the relative humidity gradually fell, so the water vapor pressure difference between the leaf surface and the air increased, accelerating the transpiration of plants in the greenhouse. At night, solar radiation disappears, and the air temperature decreases and the relative humidity increases over time. The water vapor pressure difference between the leaves and the air then decreases, which inhibits the transpiration of tomatoes [27,28,29].
According to the analysis results, the CARS-CatBoost model is more accurate than the single crop coefficient model, mainly for two reasons. First, Kc in single crop coefficient models is usually based on empirical values, which may deviate from actual values because of factors such as climate and environment. Ghuman et al. [9] found that the MSE of Kc can reach 11.9–71.4%. Reis et al. [30] found that the estimation error in the early tomato fruiting stage can reach 38%, resulting in up to 20% water waste [31]. Second, compared with the single crop coefficient model, the machine learning model can use the entire data set for training, minimize information loss, and still provide high prediction accuracy when some variables are missing. Kim et al. [32] proposed a CNN-CatBoost hybrid model for solar radiation prediction and concluded that its prediction accuracy and stability were better than those of CNN or CatBoost alone. Niu et al. [33] introduced a machine learning method for weather forecasting based on wavelet packet denoising and CatBoost, which uses feature selection and added spatio-temporal features to improve forecasting performance; their results show that the CatBoost model combined with wavelet packet denoising achieves a shorter convergence time and higher forecasting accuracy than forecasting models using deep learning or machine learning algorithms alone. All of these studies dealt with nonlinear and complex environmental changes, which is similar to the subject of this study. Therefore, we adopted the CatBoost model to predict the Tr of tomato and achieved satisfactory prediction results.
In addition, the visual comparison of the measured and predicted values of tomato Tr over time in Figure 10 shows that the maximum prediction errors of CARS-CatBoost and the single crop coefficient model over the whole growth stage both occurred at noon and were 0.056 mm·h−1 and 0.212 mm·h−1, respectively. The predicted values of the CARS-CatBoost model track the measured values significantly better than those of the single crop coefficient model. With partition modeling in particular, the predicted curves for the florescence stage, the fruiting stage, and the fruiting stage at night agree more closely with the measured curves, with smaller differences. This shows that the prediction model can improve estimation accuracy by dividing the data into different time intervals, and it highlights the advantages of transpiration prediction based on Tr characteristic intervals.
This study used multiple advanced sensors developed by the National Agricultural Intelligent Equipment Engineering Technology Research Center to obtain a large amount of high-precision data, with the aim of constructing a prediction model that closely approximates the actual transpiration rate of tomato. Because the greenhouse environment and crop species strongly influence the transpiration rate, optimizing traditional crop models has shown limited effectiveness in improving accuracy. In contrast, data-driven machine learning methods can achieve high-precision modeling by continuously collecting greenhouse data and training on it, ultimately meeting the demand for precision irrigation in facility agriculture [34].
5. Conclusions
In this study, we analyzed and extracted the main environmental variables affecting tomato transpiration and established a hybrid prediction model for tomato Tr based on the CatBoost algorithm (the CARS-CatBoost model). From the results, we draw the following conclusions:
(1) The correlation analysis of tomato Tr and the environmental variables showed that temperature, VPD, and Rn were positively correlated with Tr, whereas relative humidity was negatively correlated with Tr; Rn had the highest correlation with Tr.
(2) Over the whole growth stage, the RMSE and MAE of the CARS-CatBoost prediction model were 72.1% and 72.0% lower, respectively, than those of the traditional single crop coefficient model, indicating that the CARS-CatBoost model predicts Tr better than the single crop coefficient model.
(3) Within the CARS-CatBoost framework, the RMSE of the partition models established for the three characteristic intervals (the florescence stage, the fruiting stage, and the fruiting stage at night) was 13.1%, 18.5%, and 97.0% lower, respectively, than that of the whole-growth-stage model, indicating that partition modeling can further improve the prediction performance of the CARS-CatBoost model.
This study provides useful guidance for exploring precise irrigation schedules for the different stages of the greenhouse tomato growth cycle.