1. Introduction
Agriculture has become increasingly important in recent years due to the environmental, social, and economic challenges it faces in producing food. The United Nations has projected that the global population will grow to 9.7 billion by 2050 [1], making it essential to produce enough food. However, the available agricultural land is limited, requiring significant efficiency gains to maximize production with scarce resources. Furthermore, special attention should be paid to climate change, which will require reducing greenhouse gas emissions, as their atmospheric concentration is projected to double by 2030 and cause significant global temperature increases [2,3].
Climate-Smart Agriculture (CSA) is a system that utilizes the latest technological advancements to tackle the challenges faced in agriculture. The Food and Agriculture Organization of the United Nations (FAO) defines CSA as an approach that enhances national food security and development goals. It involves various new technologies that work together to achieve sustainable productivity, resilience, and the reduction or removal of greenhouse gas emissions [4,5].
The CSA concept encompasses the Smart Farming (SF) initiative, which promotes eco-friendly agricultural practices that rely on science and technology. SF is comparable to Industry 4.0's smart factories, utilizing Information and Communication Technologies (ICT) such as the Internet of Things (IoT), the Global Positioning System (GPS), Cloud Computing (CC), Fog Computing (FC), and Big Data (BD) analysis to monitor and manage farms and farming activities [6,7,8]. These technologies are driving the transformation of the agricultural sector into a smarter and more sustainable one, helping to address the significant problems facing agriculture [9].
Smart Farming includes the greenhouse concept, which is essential for profitable and sustainable agricultural practices. Greenhouses control important factors affecting crop growth, such as solar radiation, temperature, humidity, light intensity, and carbon dioxide levels, helping to increase yields throughout the year and protect crops from harsh weather conditions and pests [10,11]. However, unstable conditions within a greenhouse can negatively impact plant growth and ultimately reduce crop production. This issue can be mitigated using Artificial Intelligence (AI) tools that regulate the greenhouse variables [12,13,14].
Machine Learning (ML) is the subfield of AI that allows a system to learn from previous experience and adjust accordingly. By analyzing large amounts of data, ML can make accurate predictions. This technology helps reduce pre-harvest crop loss, often caused by inadequate irrigation and climate conditions. By utilizing data collected from sensors and an automated watering system, farmers can optimize their crop yields and minimize losses [15,16].
Various agricultural tasks can benefit significantly from ML algorithms, which have produced state-of-the-art results. The most prominent models include Linear Regression (LR), Support Vector Machines (SVMs), K-Nearest Neighbors (k-NN), Neural Networks (NNs), Random Forest (RF) classifiers, and Decision Trees (DTs) [17]. Additionally, emerging forecasting methods such as Recurrent Neural Networks (RNNs) and Extreme Gradient Boosting (XGBoost) help analyze time series data. The Long Short-Term Memory (LSTM)-RNN model and XGBoost are particularly effective at avoiding the issue of vanishing and exploding gradients during training [18,19].
Predicting microclimate conditions in greenhouses through climate forecasting has become crucially important. It is made possible by advanced sensors and systems that enable exact measurements and evaluations of the microclimate within seconds. Although different techniques have been developed to model temperature behavior inside greenhouses, they often rely on variable monitoring rather than forecasting, limiting automatic control to corrective rather than preventive actions, which may only partially satisfy the needs of greenhouse growers. In contrast, AI-based algorithms have been developed to act preventively by adjusting heating/cooling systems, ventilation, and carbonic fertilization supply through installed actuators, ensuring optimal growth, maintenance, and pest control of crops in greenhouses [20].
Integrating a preventive model into the automatic control system of greenhouses is vital to ensuring optimal crop growth. Extreme temperatures can negatively affect crop morphology and physiological processes, resulting in impaired floral formation, leaf burn, poor fruit quality, excess transpiration, and a shortened crop lifespan. By effectively controlling the microclimate within the greenhouse, the risk of developing pathogens and damaging crops can be minimized [21].
Previous research by García-Vázquez et al. [22] focused on creating an accurate temperature prediction model for a greenhouse using linear and support vector regression techniques. In contrast, this paper proposes using the same greenhouse data set to predict the internal temperature up to one hour in advance by applying supervised learning algorithms, namely LSTM-RNN and XGBoost. These algorithms are used because LSTM excels at processing sequential data: it can remember long-term patterns and learn temporal dependencies through specialized memory units [23]. XGBoost, on the other hand, combines multiple weak models to form a more robust one, allowing it to learn complex relationships between features and obtain more accurate results [24]. Both techniques provide effective time series forecasting and are therefore suitable for enabling preventive control of the greenhouse.
The contributions of this paper can be summarized as follows: In order to prevent issues like crop diseases and poor fruit quality caused by uncontrolled microclimate changes inside greenhouses, meteorological data can be analyzed using supervised learning techniques in a controlled environment. By utilizing predictive models and intelligent control systems in agriculture, crop efficiency and productivity can be significantly enhanced while minimizing the risks associated with sudden changes in greenhouse conditions. The comparison between different ML models lets researchers and producers analyze which one better fits their needs, for example, adding more computational power if higher prediction accuracy is needed, or accepting lower prediction accuracy with less computational power, considering that the LSTM-RNN requires considerably more computational power to achieve the level of accuracy reported in Section 4.
A study on data segmentation based on the annual climate seasons shows that one algorithm per season benefits temperature prediction due to the particularities of each season, and that prediction errors tend to increase when projecting toward extended periods. This technology also has the potential to benefit the agricultural industry by improving crop quality and yield while reducing the consumption of resources such as energy.
The paper is organized as follows: Section 2 reviews existing work related to SF in greenhouses. Section 3 details the proposed workflow using the Team Data Science Process (TDSP) methodology and explains how forecasting in the greenhouse is implemented. The results are presented in Section 4, while Section 5 provides a discussion. The paper closes with Section 6, which summarizes the findings and presents the conclusions.
2. Related Work
This section explores current agricultural systems and discusses ways to improve them by leveraging data from different sources. It presents the latest technological advancements in agriculture, focusing on machine learning applications and the prediction of climatological variables within greenhouses, which is valuable for anyone seeking to enhance agricultural practices and increase productivity.
The authors in [25] implemented an LSTM-RNN to forecast the temperature and relative humidity inside a greenhouse from time series, considering microclimatic parameters as input components. Solar radiation, external temperature, external humidity, and wind direction were treated as variables impacting the forecast. On this basis, the proposal focuses on observing the behavior of the LSTM-RNN as the number of hidden layers is varied, analyzing the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Coefficient of Determination (R²) to gauge the network's efficiency in forecasting temperature and relative humidity.
In [26], the authors presented an RNN-Back-Propagation (BP) model to predict the temperature and humidity of a solar greenhouse in northern China. Climate data recorded over eight days, such as air and substrate temperature, humidity, lighting, and carbon dioxide (CO₂) concentration, were used to build and validate climate prediction models. The results show that the model provides reasonably good predictions, with RMSEs of 0.751 for temperature and 0.781 for humidity and both R² values above 0.9, outperforming the other models in their study.
The study in [27] introduces a solar greenhouse evapotranspiration (ET) estimation model called PSO (Particle Swarm Optimization)-XGBoost. This model uses meteorological and soil moisture data from a greenhouse in China where two tomato crops were grown. The PSO-XGBoost model accurately estimates evapotranspiration under different data configurations, although the study found that its accuracy decreases as input variables are removed. Solar radiation and vapor pressure deficit were identified as the most critical variables.
The work presented in [28] aims to manage and predict the air temperature in greenhouses using a wireless sensor network for data collection. The prediction model is based on RNN techniques, and the algorithm utilizes weather data to fill in missing values when sensor data are unavailable. The experimental results show that the RNN-based prediction algorithm can efficiently forecast the air temperature in the greenhouse with an RMSE of 1.50 °C, a Mean Absolute Percentage Error (MAPE) of 4.91%, and an R² of 0.965.
The study in [29] discusses using an RNN model as a deep learning architecture that can retain recent memories of input patterns, achieved by feeding the hidden layer outputs back into themselves. The RNN model is commonly used for time-series-related work, particularly in predicting climate variables, and can analyze input and output data from other identification models to make accurate predictions.
The authors in [30] evaluated Artificial Neural Networks (ANNs), Nonlinear Autoregressive Exogenous (NARX) models, and LSTM-RNN models to predict environmental changes in temperature, humidity, and CO₂ in greenhouses. The best model across all data sets was the LSTM-RNN, which sustained prediction accuracy even after 30 min, obtaining an R² of 0.96 for temperature, 0.80 for humidity, and 0.81 for CO₂.
The work in [31] utilized five models to analyze a data set collected from 27 greenhouses in South Korea. The variables analyzed included internal temperature, relative humidity, radiation, CO₂ concentration, and external temperature. The most efficient model was the Bidirectional LSTM (BiLSTM), which achieved an average R² of 0.78 and 0.81 for the pepper and tomato data sets, respectively. The model effectively adapted pre-trained deep learning models and improved their prediction ability in data-limited greenhouse microclimates.
A research study described in [32] aims to improve the accuracy of predicting crop yields in greenhouses. Such precision is essential for making informed decisions about the planning and management of greenhouse agriculture. The researchers developed a new algorithm to achieve this goal by combining a Temporal Convolution Network (TCN) and an RNN. They evaluated the algorithm's performance using data from several greenhouses cultivating tomatoes. The results demonstrate that this new technique outperforms traditional machine learning methods and other deep neural networks in accuracy. The study also highlights that historical performance information is crucial for accurately predicting future crop yields.
In [33], researchers proposed a Multivariate Long Short-Term Memory (MV-LSTM) neural network model for predicting wind speed. The model is based on feature selection via the Pearson correlation coefficient and utilizes temperature, humidity, and air pressure data to predict the wind speed at two observation stations in Beijing for the next hour. The study compared the MV-LSTM model with the Auto-Regressive Integrated Moving Average (ARIMA) and LSTM methods and demonstrated that the MV-LSTM model outperforms the other models in prediction accuracy.
In [34], a study was conducted to predict wind energy over ultra-short periods. The researchers developed a hybrid Spatio-Temporal Correlation Model (STCM) using Convolutional Neural Networks (CNN) and LSTM. The model takes multiple meteorological factors as input and reconstructs the data into a matrix, with the CNN extracting spatial correlation and the LSTM extracting temporal correlation. The model was tested on a wind farm in China and showed a significant improvement in accuracy compared to using CNN or LSTM alone, reducing the error rate by around 30%. However, the study recognizes the need to investigate the impact of different meteorological factors and improve the model's optimization algorithm.
Research in [35] was conducted to estimate crop ET. Since ET is not easy to measure directly, a simulation model based on XGBoost Regression (XGBR) was developed to obtain it. The researchers used three years of data and eight meteorological factors (net solar radiation, mean temperature, minimum temperature, maximum temperature, relative humidity, minimum relative humidity, maximum relative humidity, and wind speed) to train the XGBR-ET model. They compared it with seven other standard regression models. The results showed that net solar radiation had the highest positive correlation with ET, while wind speed had the lowest correlation.
In [36], the authors proposed a model that utilizes the Light Gradient Boosting Machine (LightGBM) algorithm to predict the internal temperature of a greenhouse. The model uses control and temporal data collected over five years. The study reports that the LightGBM model has better tuning capacity and is significantly faster to train than other models such as neural networks, BP, RNN, XGBoost, and Stochastic Gradient Boosting (SGB). These results indicate that the LightGBM model has great potential for real-time prediction and control of the greenhouse environment.
The work in [37] uses time series data to predict ET and humidity in tomato greenhouses. The researchers applied an LSTM model to predict ET and compared it with the Stanghellini model. The developed ET model had an RMSE of 0.00317 during the training phase and 0.00356 during the testing phase, with percentage errors of 5.76% and 6.45%, respectively. In addition, a humidity prediction model performed better than traditional LSTM-RNN models.
In [38], the researchers aimed to predict the peak energy consumption of a smart farm using various algorithms. They analyzed energy data collected from a smart pepper farm in South Korea between 2019 and 2021. The study compared several machine learning algorithms, including ANN, SVM, RF, XGBoost, k-NN, Gradient Boosting Machine (GBM), and ARIMA. The most successful model was based on RF, achieving an accuracy of 92%. Additionally, the research analyzed variable importance, identifying that internal humidity, dew point, and external temperature are critical factors in predicting energy consumption on the smart farm.
A study in [39] presented a model to predict the temperature inside a greenhouse using time series analysis and LightGBM. The model considered environmental factors such as humidity, air pressure inside and outside the greenhouse, external temperature, and time series data to make the temperature predictions. The study found that incorporating time series features improved the R² and reduced the Mean Square Error (MSE) and the MAE across several prediction models. Other models, such as RNN, SVR, and LR, were also compared, but LightGBM outperformed them all in terms of model fit.
In [40], a Dense Neural Network (DNN) framework was proposed to predict temperature and relative humidity measurements. According to the results, the framework shows a high correlation of 0.91 and 0.85 for temperature and humidity, respectively. The framework significantly reduced prediction errors, with a 68.67% reduction for temperature and 46.21% for relative humidity compared to an approach without the DNN model.
Several investigations, such as [41,42,43,44], have also contributed relevant studies to agricultural areas using artificial intelligence focused on the geotechnical properties of soil. Researchers have used machine learning techniques to predict crucial soil properties like thermal conductivity and mechanical behavior. Methods like LR, Gaussian process regression, SVR, DT, RF, and adaptive boosting have been employed to predict soil thermal conductivity. Furthermore, researchers have utilized neural network techniques like LSTM to model the mechanical behavior of frozen soils. This comprehensive approach has provided a deeper understanding of soil properties and has become an essential tool in geotechnical engineering.
3. Materials and Methods
The methodology used to develop the temperature forecast is based on TDSP. This is an agile and iterative approach that helps to develop predictive analytics solutions and intelligent applications efficiently. By providing a structured lifecycle, TDSP guides the development of data science projects [45].
TDSP is a leading approach in the technical field of machine learning and data science. This methodology is used in this study because of its structured and project-oriented approach that covers all critical stages of the project lifecycle, from data preparation to model implementation and deployment. Additionally, TDSP incorporates continuous iteration techniques that allow agile adaptation of the model based on feedback and changes in data. With an emphasis on interdisciplinary collaboration and effective communication, TDSP ensures an accurate understanding of real-world problems [45].
TDSP follows a series of steps to develop a data science project. The initial steps, which concentrate on business understanding and data acquisition, were outlined in [22]. For data acquisition, the focus was on obtaining information from a curved-roof-type greenhouse between July 2020 and June 2021. The data collected included various variables obtained from the sensors of a weather system; these are listed in Table 1.
The third step of the TDSP methodology focuses on modeling, which is based on feature engineering, training, and model evaluation.
3.1. Feature Engineering
Feature engineering plays a vital role in developing ML models and data analysis. Its main objective is to enhance the models' performance and efficiency by selecting, transforming, and creating significant features from the initial data. This crucial process can significantly impact the accuracy and efficiency of the ML algorithms [46].
For data pre-processing, the first step was to detect the null values within the data sequence; these occurred among the climatic variables of interest and prevented the proposed algorithms from being applied, since they lacked a valid or consistent value. Secondly, the null data were replaced using linear interpolation techniques, and the database was grouped into the seasons of the year (3-month intervals), which helps prevent underfitting during model training. Finally, the data were divided into training and validation sets for the LSTM-RNN model.
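As an illustration of this pre-processing, the following is a minimal sketch assuming the readings sit in a pandas DataFrame indexed by timestamp; the file name and column names are hypothetical.

```python
# A minimal pre-processing sketch, assuming a timestamp-indexed pandas
# DataFrame; the file and column names below are illustrative only.
import pandas as pd

df = pd.read_csv("greenhouse_weather.csv", parse_dates=["timestamp"],
                 index_col="timestamp")  # hypothetical file/column names

# Replace null readings by linear interpolation between valid neighbors.
df = df.interpolate(method="linear")

# Group the records into the four annual seasons (3-month intervals).
season_of = {12: "winter", 1: "winter", 2: "winter",
             3: "spring", 4: "spring", 5: "spring",
             6: "summer", 7: "summer", 8: "summer",
             9: "fall", 10: "fall", 11: "fall"}
df["season"] = df.index.month.map(season_of)
seasonal_sets = {s: g.drop(columns="season") for s, g in df.groupby("season")}
```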
The database analysis determined that each season has distinct patterns and varying quantities of samples, as shown in Figure 1. Research by García-Vázquez et al. [22] has indicated that dividing the database into annual seasons helps prevent underfitting, owing to the varying temperature trends observed throughout the year. After considering this information, it was decided to proceed with prediction models that use the seasonal division of data.
Striking a balance between incorporating informative variables and avoiding unrelated ones is important. Incorporating informative variables can enhance the output, while unrelated ones can introduce unnecessary noise to the model. Therefore, the database variables were analyzed to identify significant correlations and determine which variables should be used in the models to ensure accurate predictions.
The correlational analysis can be seen in Figure 2, where each variable is shown with respect to each season of the year. This enabled us to identify the following key points:
Relative humidity decreases as the internal temperature increases;
Internal temperature increases as the solar radiation increases;
The dew point increases as the relative humidity increases;
The internal temperature is affected by the external temperature.
In order to create a prediction model for the internal temperature of a greenhouse, this research focuses on the following independent variables: To, Ho, Hi, Di, and Rs, with Ti as the dependent variable. The correlation maps show positive correlations between two or three variables; thus, for a sequence of three input elements and one output element, the many-to-one configuration can be used to perform the analysis. By analyzing these variable combinations, irrelevant or redundant features can be eliminated, reducing the risk of overfitting and leading to a better-performing model.
Equation (1) provides the number of possible combinations for arranging three input elements drawn from the independent variables:

$$C(n, r) = \frac{n!}{r!\,(n - r)!} \quad (1)$$

where $n$ represents the number of independent variables and $r$ represents the number of input variables considered for the model. With $n = 5$ and $r = 3$, this yields ten combinations. Table 2 displays the ten combinations used for testing the models in the many-to-one analysis.
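The enumeration itself is straightforward; a short sketch using Python's itertools confirms the count, with variable names following the paper's notation:

```python
# Enumerate the C(5, 3) = 10 input combinations from Equation (1).
from itertools import combinations

independent = ["To", "Ho", "Hi", "Di", "Rs"]  # candidate input variables
for combo in combinations(independent, 3):    # many-to-one: 3 inputs -> Ti
    print(combo)
# Prints 10 combinations in total, matching Table 2.
```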
3.2. Model Training
This paper proposes using the LSTM-RNN and XGBoost algorithms to forecast the internal temperature of a greenhouse due to their adaptability in applications with time series. The subsequent points outline the composition of each algorithm and the method for choosing the hyperparameters of these techniques.
3.2.1. LSTM-RNN
An ANN is a computing system inspired by biological neural processing research. The model consists of interconnected processing nodes, or neurons, that work together to solve complex problems. During training, the weights of the connections between nodes are adjusted. A typical ANN, called a multiple-layer perceptron network, includes three layers: input, hidden, and output [47]. This is illustrated in Figure 3, and the formulas for these layers are expressed in Equations (2) and (3):

$$\hat{y}_t = f(W_{hy} h_t + b_y) \quad (2)$$

$$h_t = g(W_{xh} x_t + b_h) \quad (3)$$

where $f$ and $g$ represent the activation functions for the hidden-to-output and input-to-hidden layers, respectively. The weight parameters are represented by the matrices $W_{hy}$ and $W_{xh}$, and the biases for each layer are indicated by $b_y$ and $b_h$. The hidden state vector at step $t$ is represented by the variable $h_t$.
RNNs differ from traditional neural networks in that they use feedback loops: the output from each layer is fed back into the RNN to influence the current stage's results. RNNs are effective in classification analysis and in working with time series data, allowing for nonlinear trajectory prediction and dynamic system modeling, and are therefore frequently selected for processing sequential data. Figure 4 depicts an RNN unrolled in time to represent one element of the modeled sequence.
The hidden state vector for RNNs in Equation (3) can be adjusted according to Equation (4):

$$h_t = g(W_{xh} x_t + W_{hh} h_{t-1} + b_h) \quad (4)$$

where $W_{hh}$ is the weight parameter matrix for the prior hidden state.
RNNs come in several variants, with LSTM being the most commonly used for time series forecasting. LSTM includes memory cells with three types of gates: input, output, and forget gates. The input gate regulates the flow of input activations into the memory cell, while the output gate regulates the flow of cell activations out to the rest of the network. The forget gate scales the cell's internal state before feeding it back to the cell through its self-recurrent connection, thus adaptively forgetting or resetting the cell's memory. The memory cell is the key to the LSTM framework, as it runs directly down the entire chain with the capacity to add or remove cell state information, tightly controlled by structures called gates. These gates are optional ways to let information through, consisting of a sigmoid neural net layer and a pointwise multiplication [47], as shown in Figure 5.
The first step in LSTM is determining which data will be removed from the cell state. This decision is made by the forget gate $f_t$, as shown in Equation (5):

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (5)$$

where $\sigma$ is the sigmoid function. The next step is determining what new information should be saved in the cell state through two actions. The first is the input gate ($i_t$) selecting which values to update, as expressed in Equation (6). The second is a hyperbolic tangent (tanh) layer that generates a vector of new candidate values $\tilde{C}_t$, based on Equation (7):

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (6)$$

$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \quad (7)$$

Then, Equation (8) is used to update the previous cell state ($C_{t-1}$) to the new state ($C_t$):

$$C_t = f_t \oplus C_{t-1} + i_t \oplus \tilde{C}_t \quad (8)$$

The output gate ($o_t$) selects the components of the cell state that will be generated as output, as shown in Equation (9). The cell state then goes through a tanh layer and is multiplied by the output gate using Equation (10):

$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (9)$$

$$h_t = o_t \oplus \tanh(C_t) \quad (10)$$

where $W$ and $b$ are the weight parameter matrices and biases, and $\oplus$ is the pointwise multiplication.
3.2.2. XGBoost
The XGBoost algorithm was created to find better ways to enhance decision trees. Initially, it was designed as a self-contained program that generated prediction models from input data. However, XGBoost's integration with standard interface systems enabled it to evolve into a more robust package that utilizes computational resources to produce accurate predictions in less time [48].
Boosting is an ensemble method that combines different weak models into a robust model to minimize training errors. Individually, weak models have limited accuracy, so combining several of them additively yields a better model. In boosting, a random sample of data is selected, and models are applied and trained sequentially, where each model tries to compensate for the weak points of the previous one [49], as demonstrated in Figure 6.
XGBoost is a variation of the gradient boosting algorithm that adds predictors to an ensemble in sequence to correct the errors of previous predictors. This method combines gradient descent and boosting. XGBoost uses this method with minor modifications that improve the regularized target function. Considering a database $\mathcal{D} = \{(x_i, y_i)\}$ with $m$ features and $n$ examples ($x_i \in \mathbb{R}^m$, $y_i \in \mathbb{R}$), XGBoost can be used to create an ensemble tree model that uses $K$ additive functions to predict the output $\hat{y}_i$, as expressed in Equation (11) [50]:

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i), \quad f_k \in \mathcal{F} \quad (11)$$

where $\mathcal{F} = \{ f(x) = w_{q(x)} \}$ represents the space of Classification and Regression Trees (CARTs). In this context, $q$ refers to the structure of a tree that maps an instance to a specific leaf index, and $T$ represents the total number of leaves on each tree. Meanwhile, each $f_k$ pertains to a distinct tree structure $q$ with leaf weights $w$, and $K$ denotes the number of trees utilized in the model.
Equation (11) can be solved by finding the best set of functions that minimizes the regularized cost function, as expressed in Equation (12):

$$\mathcal{L}(\phi) = \sum_{i} l(\hat{y}_i, y_i) + \sum_{k} \Omega(f_k), \quad \Omega(f) = \gamma T + \tfrac{1}{2} \lambda \lVert w \rVert^2 \quad (12)$$

In Equation (12), $l$ stands for the cost function, which calculates the difference between the predicted value $\hat{y}_i$ and the target value $y_i$. The term $\Omega$ penalizes the model's complexity, which helps avoid overfitting by smoothing out the final learned weights. The objective function reduces to traditional gradient tree boosting when the regularization term is zero.
The tree ensemble model described in Equation (11) contains functions treated as parameters that cannot be optimized using traditional methods in a Euclidean space. To continue training after adding a new function $f$ to the model, another function (tree) is added at the $t$-th iteration, as shown in Equation (13):

$$\mathcal{L}^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) \quad (13)$$

Then, a second-order approximation is used to optimize the objective in a general way, where $g_i = \partial_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ and $h_i = \partial^2_{\hat{y}^{(t-1)}} l(y_i, \hat{y}^{(t-1)})$ are the first- and second-order gradient statistics of the cost function. This results in a simple objective function for step $t$, as expressed in Equation (14):

$$\tilde{\mathcal{L}}^{(t)} = \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \tfrac{1}{2} h_i f_t^2(x_i) \right] + \Omega(f_t) \quad (14)$$
Defining $I_j = \{\, i \mid q(x_i) = j \,\}$ as the set of instances assigned to leaf $j$, Equation (14) can be rewritten as Equation (15):

$$\tilde{\mathcal{L}}^{(t)} = \sum_{j=1}^{T} \left[ \left( \sum_{i \in I_j} g_i \right) w_j + \tfrac{1}{2} \left( \sum_{i \in I_j} h_i + \lambda \right) w_j^2 \right] + \gamma T \quad (15)$$

For a fixed tree structure $q(x)$, the optimal value of the weight $w_j^{\ast}$ of leaf $j$ is obtained with Equation (16):

$$w_j^{\ast} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda} \quad (16)$$
Enumerating all possible tree structures $q$ is often complicated. Therefore, a greedy algorithm begins with a single leaf and gradually adds branches to the tree. When a node is split, the sets $I_L$ and $I_R$ represent the instances of the left and right child nodes, respectively, and $I = I_L \cup I_R$ represents the instances of the parent node. The loss reduction after a split can be expressed using Equation (17):

$$\mathcal{L}_{split} = \frac{1}{2} \left[ \frac{\left(\sum_{i \in I_L} g_i\right)^2}{\sum_{i \in I_L} h_i + \lambda} + \frac{\left(\sum_{i \in I_R} g_i\right)^2}{\sum_{i \in I_R} h_i + \lambda} - \frac{\left(\sum_{i \in I} g_i\right)^2}{\sum_{i \in I} h_i + \lambda} \right] - \gamma \quad (17)$$

When learning trees, a significant issue is determining the best split; thus, the XGBoost algorithm examines all possible splits for continuous features. To accomplish this efficiently, the algorithm first sorts the data based on the feature values and then scans the sorted data to accumulate the gradient statistics for the score in Equation (17).
It is common for the input x to be sparse in various situations. This means that x may have missing values, null values, frequent zero entries in the statistics, and artifacts from one-hot encoding. Therefore, the algorithm must be able to handle sparse data. Although the block structure helps optimize the computational complexity of the node split search, the algorithm requires an indirect lookup of gradients and Hessians for each row, since these values are accessed by feature in order. XGBoost addresses these challenges by utilizing the processor's cache memory: the algorithm caches the gradients and Hessians needed to compute the similarity scores and output values, improving the computational time of model training.
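As a concrete counterpart to the formulation above, the following is a minimal sketch of an XGBoost regressor using the library's scikit-learn wrapper; the hyperparameter values are illustrative, with reg_lambda and gamma corresponding to the λ and γ regularization terms above.

```python
# A minimal XGBoost regression sketch for the many-to-one task; values
# are illustrative, and the input arrays are placeholders.
import numpy as np
from xgboost import XGBRegressor

X = np.random.rand(100, 9)  # placeholder: 3 lags x 3 input variables, flattened
y = np.random.rand(100)     # next-hour internal temperature Ti

model = XGBRegressor(
    n_estimators=300,   # K additive trees (Equation (11))
    max_depth=6,
    learning_rate=0.1,
    reg_lambda=1.0,     # L2 penalty on leaf weights (the Omega term)
    gamma=0.0,          # minimum loss reduction required to split (Equation (17))
)
model.fit(X, y)
pred = model.predict(X[:5])
```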
3.2.3. Hyperparameters
It is essential to adjust hyperparameters correctly when creating a mathematical prediction model to ensure accurate results. Hyperparameters can be selected by using the default values provided by the software or by configuring them manually. Additionally, data-dependent hyperparameter optimization strategies, such as grid or random search, can be employed; these search over candidate hyperparameter configurations to minimize the expected error of the model.
On the other hand, Bayesian optimization is a more complex iterative strategy that can be used to identify the best hyperparameters due to its efficiency and effective uncertainty management. Using probability models, Bayesian optimization guides the search in a focused manner, reducing the number of evaluations necessary. Furthermore, its ability to handle uncertainty and balance exploitation and exploration ensures adaptability to various objective functions, even non-linear ones or those expensive to evaluate. This systematic and versatile methodology maximizes model performance, making it an essential tool in hyperparameter optimization for complex machine learning problems [51].
Various hyperparameters can be modified when utilizing the LSTM-RNN technique, as listed in Table 3 [52,53]. Likewise, Table 4 details the hyperparameters that must be set for the XGBoost algorithm [54,55].
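This study applies Bayesian optimization to select hyperparameters but does not name an implementation; the following sketch uses the Optuna library (an assumption) to tune two illustrative XGBoost hyperparameters against validation RMSE.

```python
# A hedged Bayesian-optimization sketch using Optuna (library choice is an
# assumption); data arrays are placeholders.
import numpy as np
import optuna
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

X_train, y_train = np.random.rand(80, 9), np.random.rand(80)
X_val, y_val = np.random.rand(20, 9), np.random.rand(20)

def objective(trial):
    # Two illustrative hyperparameters; a real study would tune more.
    params = {
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
    }
    model = XGBRegressor(n_estimators=200, **params).fit(X_train, y_train)
    return mean_squared_error(y_val, model.predict(X_val)) ** 0.5  # RMSE

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=30)
print(study.best_params)
```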
3.3. Model Evaluation
Evaluating a model is essential in developing any ML algorithm; therefore, this paper proposes using four metrics to measure and analyze the performance of the LSTM-RNN and XGBoost algorithms and determine their suitability for predicting the internal temperature in the greenhouse.
The proposed metrics are as follows: R² (Equation (18)), RMSE (Equation (19)), MAE (Equation (20)), and MAPE (Equation (21)):

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad (18)$$

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \quad (19)$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert \quad (20)$$

$$MAPE = \frac{100}{n} \sum_{i=1}^{n} \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert \quad (21)$$

where $y_i$ represents the actual observed value, $\hat{y}_i$ is the value predicted by the model, and $n$ is the total number of observations in the data set.
R² measures the proportion of the variance in the dependent variable explained by the independent variables, computed from the Residual Sum of Squares (RSS) and the Total Sum of Squares (TSS). The RMSE is derived from the Mean Square Error (MSE) so that the error shares the units of the measured variable; the MSE measures how well a model fits the training data, and the RMSE is helpful because it gives greater weight to large individual errors, which significantly impact the overall error when a prediction is far off. The MAE evaluates the average distance between the regressor and the real points; it does not heavily penalize outliers because its norm smooths out all errors, providing a generic and bounded performance measure for the model. MAPE is used when relative variations matter more for the estimate than absolute values. However, this metric is biased toward low forecasts and is therefore unsuitable for evaluating tasks where errors of significant magnitude are expected [56,57].
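For reference, Equations (18)-(21) translate directly into a few lines of NumPy; the sample arrays below are placeholders.

```python
# The four evaluation metrics (Equations (18)-(21)) implemented with NumPy.
import numpy as np

def r2(y, y_hat):
    rss = np.sum((y - y_hat) ** 2)             # residual sum of squares
    tss = np.sum((y - np.mean(y)) ** 2)        # total sum of squares
    return 1.0 - rss / tss

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):
    return 100.0 * np.mean(np.abs((y - y_hat) / y))

y = np.array([20.1, 21.3, 22.0])       # placeholder observed Ti (°C)
y_hat = np.array([20.0, 21.5, 21.8])   # placeholder predictions
print(r2(y, y_hat), rmse(y, y_hat), mae(y, y_hat), mape(y, y_hat))
```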
3.4. Development
As part of the TDSP methodology, the development stage involves constructing and training the machine learning model. This stage includes the following:
Analyze models: The LSTM-RNN and XGBoost supervised learning techniques were used to predict the internal temperature of the greenhouse. Both models were designed to forecast temperatures hourly to maintain a ±2 °C hysteresis.
Data split: The initial data set was divided into the four seasons of the year. Moreover, the data were split into 80% for training and 20% for testing (see the sketch after this list).
Model construction: An analysis was conducted to determine the optimal combination of input variables and obtain the best response in predicting internal temperature. This was based on three input and one output variable (many-to-one). In addition, a study was conducted to identify the hyperparameters that can enhance the models’ performance.
Model experimentation: An experimental setup has been designed to determine the number of experiments required and the prediction time window. For this purpose, eight experiments were performed, corresponding to the four seasons of the year, using two proposed algorithms: LSTM-RNN and XGBoost. The prediction window chosen for these experiments is one hour, which allows for capturing internal temperature changes and making decisions based on the predictions in the greenhouse.
Model validation: The R², RMSE, MAE, and MAPE metrics were used to validate the performance of the predictions. In addition, the prediction graphs were plotted to analyze whether there is overfitting or underfitting.
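As referenced in the data-split item above, the following sketch shows one way to build the many-to-one windows and apply a chronological 80/20 split; the array contents are placeholders.

```python
# Sketch of many-to-one windowing and a chronological 80/20 split;
# shapes follow the 3-step, 3-variable setup described above.
import numpy as np

def make_windows(series, targets, n_steps=3):
    """Build (samples, n_steps, n_features) inputs and next-step targets."""
    X, y = [], []
    for i in range(len(series) - n_steps):
        X.append(series[i:i + n_steps])
        y.append(targets[i + n_steps])
    return np.array(X), np.array(y)

features = np.random.rand(500, 3)  # placeholder: Hi, Di, Rs readings
ti = np.random.rand(500)           # internal temperature Ti

X, y = make_windows(features, ti)
split = int(0.8 * len(X))          # 80% train / 20% test, time order preserved
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]
```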
5. Discussion
In this study, two supervised learning models, namely XGBoost and LSTM-RNN, were utilized to predict the internal temperature of a greenhouse up to an hour in advance. The objective was to maintain an acceptable temperature range with a hysteresis of ±2 °C. The MAE metric was used to quantify the deviation between the predicted and actual values. The results indicated that the XGBoost algorithm did not meet the acceptability criterion, as the range in all seasons varied between 2.3 °C and 2.4 °C, surpassing the established ±2 °C hysteresis. On the other hand, LSTM-RNN showed an MAE ranging from 0.4 °C to 1.2 °C, indicating that the acceptability criterion was met, with the fall season yielding the best results.
In Section 2, a series of studies focused on predicting greenhouse variables was reviewed. Each investigation had specific objectives, and not all centered on making predictions in advance; some studies make predictions to detect patterns and find the best adjustments. This comparison is shown in Table 11.
The study by García-Vázquez et al. uses the same database as the present study. However, it takes a different approach, predicting the internal temperature of a greenhouse using regression methods such as the polynomial SVR algorithm. The results obtained show an R² of 0.9998 and an MAE of 0.0422. Codeluppi et al. used an ANN to predict air temperature with an R² of 0.96 and a MAPE of 0.49. Meanwhile, Hongkang et al. combined RNN and BP to predict internal temperature, achieving a model with an R² of 0.95 and an MAE of 0.42.
The studies mentioned use a forward prediction approach rather than relying on future or anticipated information in the modeling process. This approach involves predicting the value of a variable of interest at the next time step based solely on the historical data available up to the time of the prediction. The absence of advance information in the prediction process ensures that the obtained results represent the model's ability to make accurate and reliable predictions in real time, without prior knowledge of future events or data. Therefore, the high R² values and low errors, such as MAE and MAPE, in these studies indicate the effectiveness of the models in making accurate predictions without relying on advance information. However, depending on the type of system, these predictions may not be as valuable as those from preventive automatic control models for greenhouses.
The work by Wu et al. used the CNN-LSTM algorithm to predict wind energy, a starting point for analyzing how prediction behaves for variables other than internal temperature. They used different prediction steps ranging from 5 to 60 min, with an average MAE of 2.573 across all steps. The research conducted by Jung et al. focuses on predicting internal humidity using LSTM-RNN. Their metrics showed an R² value greater than 0.8, making it the study most consistent with our research; note, however, that their predictions concerned a different variable and a time frame of 30 min in advance. Singh et al., on the other hand, used an ANN to predict the air temperature inside a greenhouse, obtaining an R² of 0.98 and an MAE of 0.558 with predictions made 24 h in advance. Compared to these studies, our LSTM-RNN approach demonstrated outstanding performance in predicting the internal greenhouse temperature with a lead time of one hour. The evaluation metrics (R² = 0.9994, MAE = 0.1449, MAPE = 0.0041, and RMSE = 0.2698) indicate a good fit of the model to the greenhouse data. These results suggest that our model has accurate and reliable predictive ability, supporting the claim that the approach achieves an excellent fit when predicting the greenhouse temperature one hour in advance.
6. Conclusions
The focus of this research was to predict the internal temperature of a greenhouse up to an hour in advance. Supervised learning algorithms, LSTM-RNN and XGBoost, were utilized to generate models capable of establishing preventive control of greenhouse conditions. The database used in this research included internal variables, such as temperature, humidity, and dew point, as well as external variables, such as temperature, humidity, and solar radiation. This paper presented a methodology for constructing the models to which machine learning was applied. Based on the LSTM-RNN and XGBoost algorithms, the best combination of input variables (internal humidity, internal dew point, and solar radiation: Hi-Di-Rs) was found with respect to the output variable (internal temperature, Ti). Bayesian optimization was used in each of the algorithms to find the best hyperparameters and apply them to the models. This model construction led to eight experiments covering the two algorithms and each season of the year. The results were evaluated using the R², RMSE, MAE, and MAPE metrics, which showed that LSTM-RNN performed better than XGBoost in all seasons. LSTM-RNN had its best result in the summer season with an R² value of 0.9994, while XGBoost had its lowest result in the summer season with an R² value of 0.8605.
Based on the prediction results of the LSTM-RNN and XGBoost algorithms, it is possible to develop a system that can accurately anticipate the internal conditions of a greenhouse up to an hour in advance. These models enable the control system to make proactive decisions and provide instructions to the various actuators in the greenhouse. However, several challenges impact the models’ accuracy and the required computational resources when implementing predictive models based on these algorithms. Two fundamental limitations can be mentioned in particular. Firstly, as the forecasting capacity of the model increases, there is a corresponding increase in computational cost. Secondly, prediction errors such as RMSE and MAE tend to increase when projecting towards extended periods. Therefore, it is essential to balance the model’s predictive capacity and the available computational resources to ensure optimal performance.