Article

Comparison of Different Machine Learning Models for Modelling the Higher Heating Value of Biomass

1
Faculty of Agriculture, University of Zagreb, Svetošimunska Cesta 25, 10000 Zagreb, Croatia
2
Institute of General and Physical Chemistry, University of Belgrade, Studentski trg 12/V, 11000 Belgrade, Serbia
*
Author to whom correspondence should be addressed.
Mathematics 2023, 11(9), 2098; https://doi.org/10.3390/math11092098
Submission received: 3 April 2023 / Revised: 25 April 2023 / Accepted: 26 April 2023 / Published: 28 April 2023

Abstract

The aim of this study was to investigate the potential of using structural analysis parameters, i.e., information on the composition of cellulose, lignin, and hemicellulose, for estimating the higher heating value (HHV) of biomass. To achieve this goal, several nonlinear mathematical models were developed for predicting HHV, including polynomials, support vector machines (SVM), random forest regression (RFR), and artificial neural networks (ANN). The "goodness of fit" statistical analysis showed that the ANN model had the best performance in terms of the coefficient of determination (R2 = 0.90) and the lowest model error for the parameters χ2 (0.25), RMSE (0.50), and MPE (2.22). The ANN model was therefore identified as the most appropriate model for determining the HHV of different biomasses from the specified input parameters. In conclusion, the results of this study demonstrate the potential of using structural analysis parameters as inputs for HHV modeling, which is a promising approach for the field of biomass energy production. The development of the ANN model and the comparative analysis of the different models provide important insights for future research in this field.

1. Introduction

With the increasing use of renewable energy sources to meet the growing demand for energy, biomass will play a central role in the coming years. This is particularly important given the rising cost of conventional fuels and the need to mitigate climate change [1]. Biomass, which refers to biodegradable residues from agricultural production, various types of organic waste, residues of biological origin from agriculture and forestry, and biological residues from plant and animal production [2], is one of the most common renewable sources for energy production [3]. Research has shown that the use of energy crops for energy production can replace existing conventional fuels and slow down negative climate trends [4]. Lignocellulosic biomass is widely recognized as an effective and efficient renewable resource for energy production [5]. In recent times, the use of renewable energy resources has become a critical component of energy security.

A crucial factor in determining fuel quality is the higher heating value (HHV), but its determination takes a long time and requires special laboratory equipment. For this reason, various mathematical models have been developed to predict the HHV from the input parameters of different analyses [6]. In addition to mathematical models, numerous machine learning (ML) techniques facilitate the creation of more efficient forecasting models. Recently, artificial neural networks (ANNs) have been increasingly used in the field of modeling and represent a suitable tool for the investigation and evaluation of biomass energy parameters [7]. The basic features of an ANN are its structure, "learning" algorithm, and the activation function used to transfer the computed values from one neuron to another. By selecting the input parameters, ANN models can fit and generalize the data according to the desired output value. The effectiveness of an ANN is determined by comparing experimental and computed data. There are several types of ANN, but in application, especially for regression, the most efficient are multilayer perceptrons (MLP-ANN) [8,9,10].

RFR models, in addition to making predictions, allow the size of the prediction interval to be estimated directly, instead of using a separate dataset for calibration. They also provide reliability and a much higher level of predictive effectiveness than existing linear methods [11]. RFR is thus a useful tool for predicting desired output values. When using an RFR model, it is important to determine intervals that contain values with a certain prediction probability; similar to other forecasting approaches, these models are typically used to produce point predictions that are not accompanied by information on the deviation from the actual values [12]. SVM, as a model for regression analysis, uses hyperplane classifiers by mapping the input data into a multidimensional space and comparing it with the output data [13].

Garcia Nieto et al. [14] conducted a study to predict the higher heating value of biomass and compared the performance of different models. The cubic-SVM model showed the strongest correlation between predicted and actual values, with the highest R2 value of 0.94 and the lowest RMSE and MAE values of 0.39 and 0.32, respectively. Its MBE value was close to zero (0.0012), indicating minimal bias in the predictions.
In contrast, the random forest model showed the weakest performance, with the lowest R2 value (0.59) and the highest RMSE (1.06) and MAE (0.86) values, indicating lower accuracy and precision in predicting the higher heating value of biomass. In their study, Afolabi et al. (2022) [15] compared different ML models, including decision tree (DT), random forest (RF), and artificial neural networks (ANN), using statistical measures as criteria. They report the mean absolute error (MAE), mean square error (MSE), and root mean square error (RMSE) for each model, allowing an evaluation of their performance. The DT model had an MAE of 1.48, an MSE of 4.36, and an RMSE of 2.09. The RF model had an MAE of 1.01, an MSE of 1.87, and an RMSE of 1.37, indicating better performance compared with the DT model. The ANN model had an MAE of 1.21, an MSE of 2.43, and an RMSE of 1.56, indicating better performance than the DT model but slightly worse than the RF model. Overall, the random forest model showed the best performance among the three models, based on the lowest values of MAE, MSE, and RMSE. Liu et al. (2022) [16] used an RFR to predict the HHV of torrefied biomass. The RF model was trained with 10-fold cross-validation to fit the hyperparameters and achieved high prediction accuracy, with an R2 value of 0.91 for the test dataset. The results of that study show that the random forest (RF) model can estimate the higher heating value (HHV) of torrefied biomass with a high degree of precision. Dubey and Guruviah (2022) [17] constructed a support vector machine (SVM) model for predicting the HHV of agricultural biomass based on proximate analysis data.

The main objective of this paper is to develop nonlinear machine learning models in the form of higher-degree polynomials, SVM, RFR, and ANN, and to investigate the possibility of modeling HHV in terms of the input parameters of biomass structural analysis, i.e., the variables cellulose, hemicellulose, and lignin. For a better understanding of the overall concept of the work, Figure 1 shows the flowchart of the research, which was conducted with the aim of determining the most appropriate machine learning model in terms of both predictive ability and the error rate of each model. The present study is characterized by the comparative evaluation of different machine learning approaches, namely polynomial functions, support vector machines (SVM), random forest regression (RFR), and artificial neural networks (ANN), with the aim of predicting the higher heating value (HHV) of biomass from structural analysis parameters.
Using Yoon's global sensitivity method (based on the ANN model), the influence of the input variables on the output values is investigated. To compare the above models in terms of the different modeling errors, the literature data are compared with the values calculated by each model. The coefficient of determination is taken as the main parameter of model comparison, through which the most appropriate model for estimating the HHV of biomass is determined.

2. Materials and Methods

2.1. Data Collection

Data from different types of biomasses were used for the study, with the percentages of cellulose, lignin, and hemicellulose as input variables and HHV on a dry basis as the output. The data used for the analysis were taken from published papers [1,18,19] and are listed in Table S1 (235 samples in total). The large differences between the minimum and maximum values also allow for the construction of a more universal model for the prediction of HHV from the above variables.
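For orientation, the sketch below shows how such a dataset might be loaded and split; it assumes that Table S1 has been exported to a CSV file named biomass_structural.csv with columns Cel, Lig, Hem, and HHV (the file name and column names are illustrative assumptions, not taken from the paper):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical export of Table S1: 235 rows with cellulose, lignin and
# hemicellulose contents (% dry basis) and HHV (MJ kg-1).
df = pd.read_csv("biomass_structural.csv")

# Inspect the ranges of the structural variables and HHV.
print(df[["Cel", "Lig", "Hem", "HHV"]].describe())

# 70:30 training/testing split, as used later in Section 2.2.
X_train, X_test, y_train, y_test = train_test_split(
    df[["Cel", "Lig", "Hem"]], df["HHV"], test_size=0.30, random_state=42
)
```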

2.2. Data Processing

The Python (Python 3.10) [20] libraries Pandas, Seaborn, Matplotlib.pyplot, and NumPy [21,22,23,24] were imported to create a pair plot for the dataset used. These libraries are used extensively for data analysis and plotting [25]. The corr() function, which calculates the pairwise correlation between all columns, was used to compute the correlations between the features of the data frame. To display only one triangle of the heatmap, a triangular mask was created with the NumPy module. The heatmap was created with the Seaborn heatmap() function, with the correlation matrix as input, the triangular mask applied, and the correlation values annotated. To show the distribution of each feature and the pairwise correlations, a pair plot was created using Seaborn, and the two plots were combined into a single figure using the Matplotlib library.

To evaluate the effectiveness and performance of the support vector machine (SVM), random forest regression (RFR), polynomial, and artificial neural network (ANN) models in calculating the higher heating value (HHV) from the structural analysis components, several statistical parameters were calculated: the reduced chi-square (χ2) (1), root mean square error (RMSE) (2), coefficient of determination (R2) (3), mean bias error (MBE) (4), mean percentage error (MPE) (5), sum of squared errors (SSE) (6), and average absolute relative deviation (AARD) (7). The RMSE values indicate the efficiency of the model by comparing calculated values with experimentally measured values, while the MBE values are used to determine the systematic deviation between the predicted and measured values [26]. These statistical parameters were calculated using the following equations [27,28,29]. In addition, Yoon's global sensitivity method (8) was used to evaluate the direct influence of the input parameters on the output variables through the weighting coefficients (w) of the ANN model [30].
\chi^{2}=\frac{\sum_{i=1}^{N}\left(x_{\mathrm{predicted},i}-x_{\mathrm{experimental},i}\right)^{2}}{N-n}\qquad(1)

\mathrm{RMSE}=\left[\frac{1}{N}\sum_{i=1}^{N}\left(x_{\mathrm{predicted},i}-x_{\mathrm{experimental},i}\right)^{2}\right]^{1/2}\qquad(2)

R^{2}=1-\frac{\sum_{i=1}^{n}\left(x_{i}^{\mathrm{predicted}}-x_{i}^{\mathrm{experimental}}\right)^{2}}{\sum_{i=1}^{n}\left(x_{i}^{\mathrm{predicted}}-x_{m}\right)^{2}},\qquad x_{m}=\frac{\sum_{i=1}^{n}x_{i}^{\mathrm{experimental}}}{n}\qquad(3)

\mathrm{MBE}=\frac{1}{N}\sum_{i=1}^{N}\left(x_{\mathrm{predicted},i}-x_{\mathrm{experimental},i}\right)\qquad(4)

\mathrm{MPE}=\frac{100}{N}\sum_{i=1}^{N}\frac{\left|x_{\mathrm{predicted},i}-x_{\mathrm{experimental},i}\right|}{x_{\mathrm{experimental},i}}\qquad(5)

\mathrm{SSE}=\sum_{i=1}^{N}\left(x_{\mathrm{predicted},i}-x_{\mathrm{experimental},i}\right)^{2}\qquad(6)

\mathrm{AARD}=\frac{100}{n}\sum_{i=1}^{n}\frac{\left|x_{i}^{\mathrm{predicted}}-x_{i}^{\mathrm{experimental}}\right|}{x_{i}^{\mathrm{experimental}}}\qquad(7)

RI_{ij}(\%)=\frac{\sum_{k=0}^{n}\left(w_{ik}\cdot w_{kj}\right)}{\sum_{i=0}^{m}\left|\sum_{k=0}^{n}\left(w_{ik}\cdot w_{kj}\right)\right|}\cdot 100\%\qquad(8)
where N is the population size, n is the sample size, xpredicted denotes the predicted value, and xexperimental the experimental value. In this study, the C++ programming language [31] was used to implement the ML models (ANN, SVM, and RFR). C++ was chosen because of its high performance and efficient memory management, which are essential for processing large and complex datasets [32]. The low-level control of the language also enabled the optimization of the algorithms and the implementation of advanced techniques in the ML models. Before the models were created, all data were divided into a training part and a testing part in a 70:30 ratio.
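As an illustration (not the authors' C++ implementation), the goodness-of-fit metrics of Equations (1)–(7) can be computed with NumPy as follows; the function follows the equations exactly as printed above, with n passed in as in Equation (1):

```python
import numpy as np

def goodness_of_fit(y_exp, y_pred, n):
    """Metrics of Eqs. (1)-(7) for experimental and predicted 1-D arrays."""
    y_exp = np.asarray(y_exp, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    N = y_exp.size
    resid = y_pred - y_exp
    sse = np.sum(resid ** 2)                                # Eq. (6)
    chi2 = sse / (N - n)                                    # Eq. (1)
    rmse = np.sqrt(sse / N)                                 # Eq. (2)
    r2 = 1.0 - sse / np.sum((y_pred - y_exp.mean()) ** 2)   # Eq. (3), as printed
    mbe = resid.mean()                                      # Eq. (4)
    mpe = 100.0 / N * np.sum(np.abs(resid) / y_exp)         # Eq. (5)
    aard = 100.0 / N * np.sum(np.abs(resid) / y_exp)        # Eq. (7), same form with n = N
    return {"chi2": chi2, "RMSE": rmse, "R2": r2, "MBE": mbe,
            "MPE": mpe, "SSE": sse, "AARD": aard}
```

Called as goodness_of_fit(y_test, model.predict(X_test), n=3), for example, it returns the kinds of quantities reported later in Table 4 (the exact values there depend on the authors' implementation and choice of n).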

2.3. SVM Modelling

SVM models are based on statistical learning theory and are supervised learning algorithms that can be used for regression. As regression models, SVMs make predictions after the data are split into training and testing parts and are suitable for predicting the HHV of biomass. For nonlinear applications, the input vectors in the low-dimensional space must first be transformed with a nonlinear function Φ (9) [14]:
f(x)=w^{T}\Phi(x)+b\qquad(9)
where w and b represent the weight vector and the intercept of the model, respectively.
The SVM model was created to predict the HHV of biomass from the input parameters of the structural analysis. The model was created as a type 1 regression with a capacity (training) constant of 10, the epsilon value was set to 0.1, the gamma value of the radial basis function kernel was set to 1.00, and the total number of model iterations was 10,000.
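The models themselves were written in C++; purely as an illustration, an approximately equivalent configuration in scikit-learn (an assumed substitute library, with type 1 regression interpreted as epsilon-SVR) might look like this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Hypothetical CSV export of Table S1 (see the sketch in Section 2.1).
df = pd.read_csv("biomass_structural.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[["Cel", "Lig", "Hem"]], df["HHV"], test_size=0.30, random_state=42
)

# Epsilon-SVR with an RBF kernel: training constant C = 10, epsilon = 0.1,
# gamma = 1.00, capped at 10,000 iterations.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma=1.0, max_iter=10000)
svr.fit(X_train, y_train)

print("Test R2:", round(svr.score(X_test, y_test), 2))
print("Support vectors:", len(svr.support_))
```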

2.4. Polynomial Regression Model

The polynomial model relates the structural analysis variables of the biomass to the HHV output values. To account for possible sources of variation, a statistical experimental design was used to examine the effects of three variables (factors) on an outcome variable while controlling for a grouping variable (block). The polynomial model of this experimental design was as follows:
\mathrm{HHV}=\beta_{0}+\beta_{1}Cel+\beta_{2}Cel^{2}+\beta_{3}Lig+\beta_{4}Lig^{2}+\beta_{5}Hem+\beta_{6}Hem^{2}+\beta_{7}Cel\cdot Lig+\beta_{8}Cel\cdot Hem+\beta_{9}Lig\cdot Hem\qquad(10)
where HHV is the response variable; Cel, Lig, and Hem are the three factors; β0 is the intercept; β1–β6 are the main (linear and quadratic) effects of the factors; and β7–β9 are the two-way interactions between the factors.
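The paper does not state which software was used to fit Equation (10); a hedged sketch using the statsmodels formula interface, which returns coefficient estimates and standard errors analogous to Table 1, is shown below:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical CSV export of Table S1 (see the sketch in Section 2.1).
df = pd.read_csv("biomass_structural.csv")

# Second-order terms and two-way interactions, as in Eq. (10).
formula = ("HHV ~ Cel + I(Cel**2) + Lig + I(Lig**2) + Hem + I(Hem**2)"
           " + Cel:Lig + Cel:Hem + Lig:Hem")
poly_fit = smf.ols(formula, data=df).fit()

print(poly_fit.params)   # intercept, main effects and interactions
print(poly_fit.bse)      # standard errors of the coefficients
```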

2.5. RFR Modelling

RFR algorithms combine multiple random decision trees and make predictions based on the average value of the individual trees; they include classification and regression methods and use a certain number of random trees [33,34,35]. During the evaluation of the RFR model, the number of trees was set to 100, 200, 300, 400, 500, and 10,000.
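Again only as an approximation of the authors' C++ implementation, the corresponding random forest configuration in scikit-learn, scanning the tree counts listed above, could be sketched as follows (the 50% subsample per tree reflects the setting mentioned later in Section 3.3):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical CSV export of Table S1 (see the sketch in Section 2.1).
df = pd.read_csv("biomass_structural.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[["Cel", "Lig", "Hem"]], df["HHV"], test_size=0.30, random_state=42
)

# Evaluate forests with the tree counts used during model evaluation
# (the 10,000-tree run is slow but included for completeness).
for n_trees in (100, 200, 300, 400, 500, 10000):
    rfr = RandomForestRegressor(n_estimators=n_trees, max_samples=0.5,
                                random_state=42)
    rfr.fit(X_train, y_train)
    print(n_trees, "trees -> test R2:", round(rfr.score(X_test, y_test), 3))
```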

2.6. ANN Modelling

The MLP is one of the most widely used types of ANN. The main advantage of this type of network is that it can "learn" the relationships between input and output data, which is very useful for predicting nonlinear problems in areas where large amounts of data have to be processed [36,37,38]. The number of artificial neurons in the hidden layer may vary and is determined by trial and error. The learning process of a neural network involves processing input data, which are then converted into the desired output data [39]. The two basic types of network learning are supervised and unsupervised. In supervised learning, the model is provided with the target output data, against which it compares the values it obtains [40]. The developed ANN model was trained 100,000 times with a random number of neurons in the hidden layer (5–20), using different activation functions and random initial values for the weighting coefficients and biases. The Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm was used to solve the nonlinear optimization problem during ANN modeling [41]. The neural network model, written in matrix notation, contains the weight coefficients and biases of the hidden and output layers, represented by the matrices and vectors W1, B1, W2, and B2, respectively. Y is the output value, f1 and f2 denote the transfer functions of the hidden and output layers, respectively, and X is the matrix of the input layer [42]:
Y=f_{1}\left(W_{2}\cdot f_{2}\left(W_{1}\cdot X+B_{1}\right)+B_{2}\right)\qquad(11)
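For illustration only, an approximately equivalent multilayer perceptron in scikit-learn is sketched below. It uses the L-BFGS solver as a stand-in for the BFGS algorithm and tanh as a placeholder hidden activation, because MLPRegressor does not offer the exponential activation used in the authors' C++ network (the output activation is identity, as in the paper):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical CSV export of Table S1 (see the sketch in Section 2.1).
df = pd.read_csv("biomass_structural.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df[["Cel", "Lig", "Hem"]], df["HHV"], test_size=0.30, random_state=42
)

# 3-4-1 architecture: 3 inputs, 4 hidden neurons, 1 output.
mlp = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(4,), activation="tanh",
                 solver="lbfgs", max_iter=10000, random_state=42),
)
mlp.fit(X_train, y_train)

print("Training R2:", round(mlp.score(X_train, y_train), 2))
print("Test R2:", round(mlp.score(X_test, y_test), 2))
```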

3. Results

3.1. Data Distribution

According to the collected literature data, the cellulose concentration ranged between 10.66 and 56.62%, the lignin content between 2.39 and 22%, and the hemicellulose content between 5.97 and 37%, while the HHV values were between 12.54 and 19.25 MJ kg−1. The extreme values of the collected data varied greatly because of the different types of biomasses, which are characterized by different chemical compositions.
Figure 2 shows the distribution of the individual variables and the correlation between the observed variables and the HHV biomass.
The correlation coefficients between the observed variables are positive and statistically significant at p ≤ 0.01. The correlation coefficient (r) between HHV and Hem is 0.74, while it is 0.88 for lignin and 0.89 for cellulose. The distribution reveals certain patterns in the observed values. The largest part of the dataset used to build the models consists of data on the energy crop Miscanthus (192 samples), whose calorific value, for example, varies from 15.53 to 19.52 MJ kg−1, while the other biomass samples have a lower average calorific value (12.54–17.07 MJ kg−1), which can explain the uneven distribution of the data.
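The pair plot and masked correlation heatmap of Figure 2 can be reproduced in outline with the libraries listed in Section 2.2; a sketch, again assuming the hypothetical CSV export of Table S1:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical CSV export of Table S1 (see the sketch in Section 2.1).
df = pd.read_csv("biomass_structural.csv")
cols = ["Cel", "Lig", "Hem", "HHV"]

# Pairwise Pearson correlations and a mask that hides the upper triangle.
corr = df[cols].corr()
mask = np.triu(np.ones_like(corr, dtype=bool))

plt.figure(figsize=(5, 4))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f", cmap="vlag")
plt.title("Correlations of structural variables and HHV")

# Pair plot of the distributions and pairwise scatter plots (combined with
# the heatmap into a single figure in the published version).
sns.pairplot(df[cols])
plt.show()
```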

3.2. Polynomial Regression Model

For the proposed higher-degree polynomial model, the intercept value, the main effects of the factors, and the two-way interaction effects were calculated and are presented in Table 1.

3.3. RFR Model

During the development of the RFR model for predicting HHV values, a large number of decision trees was constructed (1940). For the RFR, the data were split with a random (test) sample proportion of 30% and a subsample proportion of 50%.

3.4. ANN Model

The proposed ANN model consists of an input layer, a hidden layer, and an output layer with architecture 3-4-1 (number of artificial neurons in the input, hidden, and output layers). The weights and biases (Table 2) were determined by randomly searching for values that would make the model sufficiently accurate in modeling the output values.
The presented model showed the greatest predictive ability with the architecture containing four neurons in the hidden layer.
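Equation (11), together with the weights and biases in Table 2 and the activation functions listed in Table 3 (exponential hidden, identity output), defines the trained network, and the forward pass can be written out directly as below. Note that the published coefficients are rounded and the input/output scaling used during training is not reported, so this reconstruction is illustrative only and will not reproduce the model's predictions exactly:

```python
import numpy as np

# Rounded weights and biases from Table 2 (4 hidden neurons).
W1 = np.array([[ 8.70, -5.06, -1.02],   # neuron 1: Cel, Lig, Hem weights
               [-3.00, -1.80, -1.91],   # neuron 2
               [ 3.35, -2.33,  0.37],   # neuron 3
               [ 2.38, -1.87,  0.45]])  # neuron 4
B1 = np.array([-3.82, 1.55, -0.53, -0.02])   # hidden-layer biases
W2 = np.array([-0.47, -0.24, 1.72, -1.74])   # hidden-to-output weights
B2 = 1.46                                    # output bias

def ann_forward(x):
    """Eq. (11) with exponential hidden and identity output activation.

    x is a length-3 vector of *scaled* Cel, Lig, Hem values; the scaling
    applied in the original training is not reported, so raw percentages
    will not give meaningful HHV values here.
    """
    hidden = np.exp(W1 @ x + B1)
    return float(W2 @ hidden + B2)
```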

3.5. Model Performance

The results (Table 3) show that the MLP neural network model outperforms the other three models in terms of RMSE, AARD, and R2. The low RMSE value indicates that the ANN model predicts the output variable with high accuracy, and its comparatively low AARD value indicates a low relative error, so the model is acceptable for practical applications. Several measures, such as R2 and MBE, indicate that the RFR model also performs quite well; the low MBE value indicates that the RFR model predicts the output variable with low bias. Nevertheless, the RMSE and AARD values show that this model has a larger error than the MLP model. For most measures, the SVM model performed worse than the other models. The SVM model has a higher relative error than the other models, indicating that it is less suitable for real-world applications, and its low R2 value indicates that the input variables explain a smaller proportion of the variation in the output variable. The skewness value indicates that the error distribution of the SVM model is nearly symmetric.
The ANN was trained using the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization technique with an exponential hidden-layer activation function and an identity output activation function. With training and testing errors of 0.15 and 0.07, respectively, the ANN achieved a training performance of 0.88 and a testing performance of 0.95. The training performance of the RFR model was 0.89 and its test performance 0.92, the SVM model achieved training and test values of 0.85 and 0.89, and the polynomial regression model achieved 0.85 and 0.92, respectively. Overall, the MLP neural network performed better than the other models with the specified architecture and training conditions.
The scatter plot is one of the most commonly used types of visualization, showing the behavior of the data in the x–y plane [43,44]. Figure 3 shows the scatter plots of the overlap between the predicted data and the experimentally determined values for the developed polynomial, SVM, RFR, and ANN models.
Figure 4 shows the ability of the models to predict the HHV of biomass as a function of the number of observations. As can be seen, the models generally agree well with the observed data, with the highest agreement between observations 50 and 150, i.e., where the data are clustered and there is little variability. Among the models listed, the ANN model had the lowest estimation error with respect to the number of observed samples.

3.6. Global Sensitivity Analysis of the Developed ANN Model

The global sensitivity analysis was performed according to Yoon's method, which calculates the direct influence of the input parameters on the output values. Figure 5 shows the relative importance (%) of each structural analysis variable for the output value of HHV; the relevance factor ranges between −1 and 1. According to the analysis, HHV increases with increasing lignin and hemicellulose values and with decreasing cellulose values.
Figure 5 shows the influence of Cel (−56.78%), Lig (40.27%), and Hem (2.94%). Considering these influencing factors and their relative importance, it can be concluded that the total output value of HHV is increased by higher input values of lignin and hemicellulose and by lower input values of cellulose. Overall, the MLP neural network model performed best in predicting the HHV of different types of biomasses, and the global sensitivity analysis revealed that the most important parameters positively affecting HHV were lignin and hemicellulose. Future research could focus on improving model accuracy by adding more diverse datasets and conducting controlled experiments to reduce the effects of external influences.
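Yoon's relative importance (Equation (8)) needs only the input-to-hidden and hidden-to-output weight matrices of the trained ANN; a minimal sketch follows, where the small 3-4-1 weight matrices are hypothetical placeholders rather than the trained values:

```python
import numpy as np

def yoon_relative_importance(W_in, W_out):
    """Eq. (8): relative importance (%) of each input on the output.

    W_in  : (n_hidden, n_inputs) input-to-hidden weight matrix
    W_out : (n_hidden,)          hidden-to-output weight vector
    """
    # For each input i, sum over hidden neurons k of w_ik * w_kj.
    contributions = W_out @ W_in              # shape (n_inputs,)
    return 100.0 * contributions / np.sum(np.abs(contributions))

# Hypothetical 3-4-1 weights (placeholders, not the published values).
W_in = np.array([[ 0.8, -0.5,  0.1],
                 [-0.3,  0.7,  0.2],
                 [ 0.5,  0.4, -0.6],
                 [-0.2,  0.9,  0.3]])
W_out = np.array([0.6, -0.4, 0.7, 0.5])

print(yoon_relative_importance(W_in, W_out))  # % importance of Cel, Lig, Hem
```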

3.7. Goodness of Fit

To determine the ability of the models to predict the HHV of biomass from the input parameters of the structural analysis, statistical parameters were calculated to assess the predictive ability of each model and to compare the individual models, so that the most accurate nonlinear forecasting model could be identified.
Table 4 shows the calculated "goodness of fit" statistics for the polynomial, SVM, RFR, and ANN models with respect to the calculation of the HHV value from the input parameters of the structural analysis of biomass.
The ANN model had the best performance based on several metrics, such as χ2, RMSE, AARD, and R2. The low RMSE value indicates that the ANN model has high accuracy in predicting the output variable, and its comparatively low AARD value indicates a low relative error, making it suitable for practical applications. The R2 value indicates that a high proportion of the variance in the output variable can be explained by the input variables. The RFR model performed relatively well on some metrics, such as R2 and MBE; the low MBE value indicates that the RFR model has a low bias in predicting the output variable. However, its RMSE and AARD values indicate higher errors compared with the ANN model. The SVM model had lower performance than the other models on most metrics. Its high AARD value indicates higher relative errors compared with the other models, making it less suitable for practical applications, and its lower R2 value indicates that a smaller proportion of the variance in the output variable can be explained by the input variables. The skewness value indicates that the error distribution of the SVM model is nearly symmetric, while the low χ2 and RMSE values indicate good agreement between the predicted and actual values. An MBE close to zero shows that the model predictions do not tend to systematically overestimate or underestimate the actual values. The developed SVM model achieved its best results with 39 support vectors and the associated weight vectors.

4. Discussion

Analyzing the composition of biomass and its calorific value is crucial for understanding and optimizing its use as a renewable energy source. Dai et al. (2021) [45] state that there is a need to develop models for predicting the energy properties of biomass from various analyses, enabling the use of biomass resources in energy applications. Given the specific connections between the input parameters and the output HHV values of biomass, ML models have proved to be more accurate in application than the existing linear models. Xing et al. (2019) [46] examined the applicability of ML models to the estimation of the HHV of biomass based on ultimate and proximate analysis. Unlike empirical models, which showed a lower predictive ability (R2 < 0.70), the ML models (ANN, SVM, and RF) showed better performance in HHV modeling (R2 > 0.90). Afolabi et al. (2022) [15] used different ML models to estimate the HHV of different biomass classes, with MAE, MSE, and RMSE as statistical measures of model error. Among others, RF and ANN models were created, which had satisfactory performance in terms of modeling error for the statistically calculated parameters MAE, MSE, and RMSE (1.01, 1.87, and 1.37 for RF; 1.21, 2.43, and 1.56 for ANN).

Chen et al. (2022) [47] reported results on the estimation of the HHV of biochar. Gradient-boosting regression (GBR), RF, and SVM algorithms, as well as linear regression methods, were developed through modeling; 52 samples were collected and 97 were taken from published literature sources so that the models could be optimized. Based on the 52 experimental data points, the machine learning (ML) methods showed better predictive capabilities (training R2 ≥ 0.96) for the higher heating value (HHV) of biochar than multiple linear regression (MLR) (training R2 < 0.94). The gradient-boosting regression (GBR) algorithm successfully predicted the HHV of biochar on the test dataset using ultimate and proximate analysis, with R2 = 0.98, MAE = 0.83, and RMSE = 1.08 when trained with the experimental dataset. The random forest (RF) and support vector machine (SVM) models performed similarly well in predicting HHV, with R2 = 0.97, MAE = 0.93, and RMSE = 1.22, and R2 = 0.97, MAE = 0.93, and RMSE = 1.23, respectively. Hosseinpour et al. (2018) [49] developed an iterative network-based fuzzy partial least squares approach coupled with principal component analysis (PCA-INFPLS) to estimate the HHV of biomass from the input parameters of fixed carbon, volatile matter, and ash content. The developed PCA-INFPLS model shows high performance in predicting the HHV, with modeling errors of R2 > 0.96, MSE < 0.51, and MAPE < 2.5%, so it can be concluded that the proposed model is suitable for this purpose. Ghugare et al. (2014) [48] developed genetic programming (GP) and multilayer perceptron (MLP) models to evaluate the fuel properties of solid biomass. The GP and MLP models showed good predictive performance in terms of accuracy and generalization, achieving high correlation coefficients (>0.95) and low MAPE (<4.5%) when comparing experimental and model-predicted higher heating values (HHV).
In this research, four models were developed and compared: polynomial regression, support vector machines (SVM), random forest regression (RFR), and artificial neural networks (ANN). The ANN model, specifically a multilayer perceptron (MLP), showed superior performance in terms of RMSE (0.50), AARD (100.87), and R2 (0.90) compared with the other models, indicating high accuracy and low relative error. The RFR model also performed well, but with higher errors than the ANN model. The SVM model showed lower performance, with higher relative errors and a lower R2 value (0.86), indicating that it is less suitable for practical applications. The global sensitivity analysis showed that the most important parameters positively affecting HHV are lignin and hemicellulose, while cellulose has a negative influence. These results have implications for optimizing biomass composition to achieve higher heating values. The ANN model, trained with the Broyden–Fletcher–Goldfarb–Shanno (BFGS) optimization method and an exponential hidden-layer activation function, showed the best performance in predicting the HHV of different types of biomasses.
Future research should focus on improving the accuracy of the models by including more diverse datasets and conducting controlled experiments to minimize the influence of external factors. In addition, exploring other modeling techniques and refining current models, especially the ANN, can help develop more accurate predictive tools for biomass heating value. Such improvements can facilitate better decision-making for the efficient use of biomass as a renewable energy source, help address energy challenges, and mitigate climate change.

5. Conclusions

  • Recently, more attention has been paid to the development of various models for predicting the energy parameters of biomass fuels. The factors cellulose, hemicellulose, and lignin influence the HHV.
• Using Yoon's method of global sensitivity, the HHV of biomass was found to increase with increasing lignin and hemicellulose content and decreasing cellulose content.
  • Four developed nonlinear models showed high performance in estimating HHV biomass: ANN (R2 = 0.90), RFR (R2 = 0.89), SVM (R2 = 0.86), and polynomial (R2 = 0.87).
• Based on the statistical "goodness of fit" test, the ANN model showed the smallest errors in estimating HHV, as determined from the calculated parameters Χ2, RMSE, MBE, MPE, SSE, and AARD.
• Among the developed models, the ANN showed the best ability to fit, generalize, and predict the data.
• To reduce the error rate of ML models for estimating the energy values of biomass, future research should focus on expanding the database, categorizing the data, and developing new algorithms.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/math11092098/s1, Table S1: supporting information: data and codes.

Author Contributions

Conceptualization, I.B. and L.P.; methodology, I.B.; software, L.P.; validation, N.V., N.B. and J.Š.; formal analysis, A.P.; investigation, I.B.; resources, A.P.; data curation, L.P.; writing—original draft preparation, I.B.; writing—review and editing, N.B.; supervision, N.V.; project administration, N.V. All authors have read and agreed to the published version of the manuscript.

Funding

The publication was supported by the Open Access Publication Fund of the Croatian Academic and Research Libraries Consortium (CARLC).

Data Availability Statement

Not applicable.

Acknowledgments

The publication was supported by the Croatian Science Foundation, under project No. IP-2018-01-7472 “Sludge management via energy crops production” within the project “Young Researchers’ Career Development Project—Training of Doctoral Students”, and cofinanced by the European Union, under the OP “Efficient Human Resources 2014–2020” from the ESF funds.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Callejón-Ferre, A.; Carreño-Sánchez, J.; Suárez-Medina, F.; Pérez-Alonso, J.; Marti, B.V. Prediction models for higher heating value based on the structural analysis of the biomass of plant remains from the greenhouses of Almería (Spain). Fuel 2014, 116, 377–387. [Google Scholar] [CrossRef]
  2. Demirbas, A. Higher heating values of lignin types from wood and non-wood lignocellulosic biomasses. Energy Sources Part A Recovery Util. Environ. Eff. 2017, 39, 592–598. [Google Scholar] [CrossRef]
  3. Scarlat, N.; Dallemand, J.F.; Taylor, N.; Banja, M.; Sanchez, L.J.; Avraamides, M. Brief on Biomass for Energy in the European Union | EU Science Hub; EC Publication: Luxembourg, 2019; pp. 1–8. [Google Scholar]
  4. Koçar, G.; Civaş, N. An overview of biofuels from energy crops: Current status and future prospects. Renew. Sustain. Energy Rev. 2013, 28, 900–916. [Google Scholar] [CrossRef]
  5. Olatunji, O.; Akinlabi, S.; Oluseyi, A.; Peter, M.; Madushele, N. Experimental investigation of thermal properties of Lignocellulosic biomass: A review. IOP Conf. Ser. Mater. Sci. Eng. 2018, 413, 012054. [Google Scholar] [CrossRef]
  6. Boumanchar, I.; Charafeddine, K.; Chhiti, Y.; Alaoui, F.E.M.; Sahibed-Dine, A.; Bentiss, F.; Jama, C.; Bensitel, M. Biomass higher heating value prediction from ultimate analysis using multiple regression and genetic programming. Biomass-Convers. Biorefin. 2019, 9, 499–509. [Google Scholar] [CrossRef]
  7. Obafemi, O.; Stephen, A.; Ajayi, O.; Nkosinathi, M. A survey of Artificial Neural Network-based Prediction Models for Thermal Properties of Biomass. Procedia Manuf. 2019, 33, 184–191. [Google Scholar] [CrossRef]
  8. Grossi, E.; Buscema, P.M. Introduction to artificial neural networks. Eur. J. Gastroenterol. Hepatol. 2007, 19, 1046–1054. [Google Scholar] [CrossRef]
  9. Kartal, F.; Özveren, U. A deep learning approach for prediction of syngas lower heating value from CFB gasifier in Aspen plus®. Energy 2020, 209, 118457. [Google Scholar] [CrossRef]
  10. Pattanayak, S.; Loha, C.; Hauchhum, L.; Sailo, L. Application of MLP-ANN models for estimating the higher heating value of bamboo biomass. Biomass-Convers. Biorefin. 2020, 11, 2499–2508. [Google Scholar] [CrossRef]
  11. Johansson, U.; Boström, H.; Löfström, T.; Linusson, H. Regression conformal prediction with random forests. Mach. Learn. 2014, 97, 155–176. [Google Scholar] [CrossRef] [Green Version]
  12. Zhang, H.; Zimmerman, J.; Nettleton, D.; Nordman, D.J. Random Forest Prediction Intervals. Am. Stat. 2019, 74, 392–406. [Google Scholar] [CrossRef]
  13. Sun, J.; Zhang, J.; Gu, Y.; Huang, Y.; Sun, Y.; Ma, G. Prediction of permeability and unconfined compressive strength of pervious concrete using evolved support vector regression. Constr. Build. Mater. 2019, 207, 440–449. [Google Scholar] [CrossRef]
  14. Nieto, P.J.G.; García-Gonzalo, E.; Paredes-Sánchez, J.P.; Sánchez, A.B.; Fernández, M.M. Predictive modelling of the higher heating value in biomass torrefaction for the energy treatment process using machine-learning techniques. Neural Comput. Appl. 2018, 31, 8823–8836. [Google Scholar] [CrossRef]
  15. Afolabi, I.C.; Epelle, E.I.; Gunes, B.; Güleç, F.; Okolie, J.A. Data-Driven Machine Learning Approach for Predicting the Higher Heating Value of Different Biomass Classes. Clean Technol. 2022, 4, 1227–1241. [Google Scholar] [CrossRef]
  16. Liu, X.; Yang, H.; Yang, J.; Liu, F. Application of Random Forest Model Integrated with Feature Reduction for Biomass Torrefaction. Sustainability 2022, 14, 16055. [Google Scholar] [CrossRef]
  17. Dubey, R.; Guruviah, V. Machine learning approach for categorical biomass higher heating value prediction based on proximate analysis. Energy Sources Part A Recovery Util. Environ. Eff. 2022, 44, 3381–3394. [Google Scholar] [CrossRef]
  18. Mansor, A.M.; Lim, J.S.; Ani, F.N.; Hashim, H.; Ho, W.S. Characteristics of cellulose, hemicellulose and lignin of MD2 pineapple biomass. Chem. Eng. Trans. 2019, 72, 79–84. [Google Scholar] [CrossRef]
  19. Voća, N.; Leto, J.; Karažija, T.; Bilandžija, N.; Peter, A.; Kutnjak, H.; Šurić, J.; Poljak, M. Energy Properties and Biomass Yield of Miscanthus x Giganteus Fertilized by Municipal Sewage Sludge. Molecules 2021, 26, 4371. [Google Scholar] [CrossRef]
  20. Van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
  21. McKinney, W. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; Volume 445, pp. 51–56. [Google Scholar]
  22. Seaborn. Available online: https://seaborn.pydata.org/ (accessed on 20 January 2023).
  23. Hunter, J.D. Matplotlib: A 2D graphics environment. Comput. Sci. Eng. 2007, 9, 90–95. [Google Scholar] [CrossRef]
  24. Harris, C.R.; Millman, K.J.; van der Walt, S.J.; Gommers, R.; Virtanen, P.; Cournapeau, D.; Wieser, E.; Taylor, J.; Berg, S.; Smith, N.J.; et al. Array programming with NumPy. Nature 2020, 585, 357–362. [Google Scholar] [CrossRef]
  25. El Hachimi, C.; Belaqziz, S.; Khabba, S.; Chehbouni, A. Data Science Toolkit: An all-in-one python library to help researchers and practitioners in implementing data science-related algorithms with less effort. Softw. Impacts 2022, 12, 100240. [Google Scholar] [CrossRef]
  26. Demirbas, A. Relationships between Heating Value and Lignin, Moisture, Ash and Extractive Contents of Biomass Fuels. Energy Explor. Exploit. 2002, 20, 105–111. [Google Scholar] [CrossRef]
  27. Khalil, S.A.; Shaffie, A. A comparative study of total, direct and diffuse solar irradiance by using different models on horizontal and inclined surfaces for Cairo, Egypt. Renew. Sustain. Energy Rev. 2013, 27, 853–863. [Google Scholar] [CrossRef]
  28. Arsenović, M.; Pezo, L.; Stanković, S.; Radojević, Z. Factor space differentiation of brick clays according to mineral content: Prediction of final brick product quality. Appl. Clay Sci. 2015, 115, 108–114. [Google Scholar] [CrossRef]
  29. Dashti, A.; Bahrololoomi, A.; Amirkhani, F.; Mohammadi, A.H. Estimation of CO2 adsorption in high capacity metal−organic frameworks: Applications to greenhouse gas control. J. CO2 Util. 2020, 41, 101256. [Google Scholar] [CrossRef]
  30. Aćimović, M.; Pezo, L.; Tešević, V.; Čabarkapa, I.; Todosijević, M. QSRR Model for predicting retention indices of Satureja kitaibelii Wierzb. ex Heuff. essential oil composition. Ind. Crops Prod. 2020, 154, 112752. [Google Scholar] [CrossRef]
  31. Stroustrup, B. The C++ Programming Language; Addison-Wesley: Reading, MA, USA, 1986. [Google Scholar]
  32. Rassokhin, D. The C++ programming language in cheminformatics and computational chemistry. J. Cheminform. 2020, 12, 10. [Google Scholar] [CrossRef] [Green Version]
  33. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  34. Scornet, E.; Biau, G.; Vert, J.-P. Consistency of random forests. Ann. Stat. 2015, 43, 1716–1741. [Google Scholar] [CrossRef]
  35. Biau, G.; Scornet, E. A random forest guided tour. Test 2016, 25, 197–227. [Google Scholar] [CrossRef] [Green Version]
  36. Aitkin, M.; Foxall, R. Statistical modelling of artificial neural networks using the multi-layer perceptron. Stat. Comput. 2003, 13, 227–239. [Google Scholar] [CrossRef]
  37. Talebi, N.; Nasrabadi, A.M.; Mohammad-Rezazadeh, I. Estimation of effective connectivity using multi-layer perceptron artificial neural network. Cogn. Neurodyn. 2017, 12, 21–42. [Google Scholar] [CrossRef] [PubMed]
  38. Cao, W.; Wang, X.; Ming, Z.; Gao, J. A review on neural networks with random weights. Neurocomputing 2018, 275, 278–287. [Google Scholar] [CrossRef]
  39. Brandić, I.; Pezo, L.; Bilandžija, N.; Peter, A.; Šurić, J.; Voća, N. Artificial Neural Network as a Tool for Estimation of the Higher Heating Value of Miscanthus Based on Ultimate Analysis. Mathematics 2022, 10, 3732. [Google Scholar] [CrossRef]
  40. Ozveren, U. An artificial intelligence approach to predict gross heating value of lignocellulosic fuels. J. Energy Inst. 2017, 90, 397–407. [Google Scholar] [CrossRef]
  41. Rajković, D.; Jeromela, A.M.; Pezo, L.; Lončar, B.; Zanetti, F.; Monti, A.; Špika, A.K. Yield and Quality Prediction of Winter Rapeseed—Artificial Neural Network and Random Forest Models. Agronomy 2021, 12, 58. [Google Scholar] [CrossRef]
  42. Pezo, L.L.; Ćurčić, B.L.; Filipović, V.S.; Nićetin, M.R.; Koprivica, G.B.; Mišljenović, N.M.; Lević, L.B. Artificial neural network model of pork meat cubes osmotic dehydratation. Hem. Ind. 2013, 67, 465–475. [Google Scholar] [CrossRef]
  43. Geladi, P.; Manley, M. Scatter plotting in multivariate data analysis. J. Chemom. 2003, 17, 503–511. [Google Scholar] [CrossRef]
  44. Keim, D.A.; Hao, M.C.; Dayal, U.; Janetzko, H.; Bak, P. Generalized Scatter Plots. Inf. Vis. 2009, 9, 301–311. [Google Scholar] [CrossRef] [Green Version]
  45. Dai, Z.; Chen, Z.; Selmi, A.; Jermsittiparsert, K.; Denić, N.M.; Nеšić, Z. Machine learning prediction of higher heating value of biomass. Biomass-Convers. Biorefin. 2021, 13, 3659–3667. [Google Scholar] [CrossRef]
  46. Xing, J.; Luo, K.; Wang, H.; Gao, Z.; Fan, J. A comprehensive study on estimating higher heating value of biomass from proximate and ultimate analysis with machine learning approaches. Energy 2019, 188, 116077. [Google Scholar] [CrossRef]
  47. Chen, J.; Ding, L.; Wang, P.; Zhang, W.; Li, J.; Mohamed, B.A.; Chen, J.; Leng, S.; Liu, T.; Leng, L.; et al. The Estimation of the Higher Heating Value of Biochar by Data-Driven Modeling. J. Renew. Mater. 2022, 10, 1555–1574. [Google Scholar] [CrossRef]
  48. Ghugare, S.B.; Tiwary, S.; Elangovan, V.; Tambe, S.S. Prediction of Higher Heating Value of Solid Biomass Fuels Using Artificial Intelligence Formalisms. Bioenergy Res. 2013, 7, 681–692. [Google Scholar] [CrossRef]
  49. Hosseinpour, S.; Aghbashlo, M.; Tabatabaei, M. Biomass higher heating value (HHV) modeling on the basis of proximate analysis using iterative network-based fuzzy partial least squares coupled with principle component analysis (PCA-INFPLS). Fuel 2018, 222, 1–10. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the conducted research.
Figure 2. Pair plot with correlation coefficient of the observed values from the structural analysis of the biomass (statistical significance: * p ≤ 0.01).
Figure 3. Comparison of experimentally obtained values with nonlinear models ((a) ANN, (b) RFR, (c) SVM, and (d) polynomial) predicted values for training and testing data.
Figure 4. Comparison of nonlinear models ((a) ANN, (b) RFR, (c) SVM, and (d) polynomial) in the estimation of HHV biomass regarding the observation number.
Figure 5. Relative importance (%) of structural analysis variables on HHV.
Table 1. Estimated effects of factors on HHV output.

Factor   Effect    ε
β0        17.38    0.18
β1         0.71    0.54
β2         0.48    0.76
β3         5.07    0.76
β4        −1.61    1.95
β5         0.16    0.79
β6         0.44    1.67
β7        −2.57    2.29
β8         1.51    1.67
β9        −2.28    2.29

β0—intercept value; β1–β9—main effects; ε—standard error.
Table 2. Weights and biases of input and output layers of the developed ANN model.

Artificial Neuron   Input layer weight coefficients   Input layer bias   Output layer weight coefficient (HHV)   Output layer bias
                    Cel      Lig      Hem
1                    8.70    −5.06    −1.02            −3.82              −0.47                                    1.46
2                   −3.00    −1.80    −1.91             1.55              −0.24
3                    3.35    −2.33     0.37            −0.53               1.72
4                    2.38    −1.87     0.45            −0.02              −1.74

Cel—cellulose; Lig—lignin; Hem—hemicellulose; HHV—higher heating value.
Table 3. Performance of developed ML models.

Model        Net. Name   Training Perf.   Test Perf.   Training Error   Test Error   Training Algorithm   Error Function   Hidden Activation   Output Activation
ANN          MLP 3-4-1   0.88             0.95         0.15             0.07         BFGS 82              SOS              Exponential         Identity
RFR          -           0.89             0.92         -                -            -                    -                -                   -
SVM          -           0.85             0.89         -                -            -                    -                -                   -
Polynomial   -           0.85             0.92         -                -            -                    -                -                   -

ANN—artificial neural network; RFR—random forest regression; SVM—support vector machine.
Table 4. Statistical test "Goodness of fit".

Model        Χ2     RMSE   MBE    MPE    SSE     AARD     R2     Skew    Kurt   SD     Var
ANN          0.25   0.50   0.03   2.22   57.98   100.87   0.90   −0.90   5.20   0.50   0.25
RFR          0.29   0.54   0.01   2.45   68.26   113.90   0.89   −0.50   2.47   0.54   0.29
SVM          0.35   0.59   0.03   2.74   80.97   158.04   0.86   −0.03   1.72   0.59   0.35
Polynomial   0.32   0.56   0.00   2.62   74.89   230.35   0.87   −0.23   2.37   0.57   0.32

Χ2—chi-squared test; RMSE—root mean square error; MBE—mean bias error; MPE—mean percentage error; SSE—sum squared error; AARD—average absolute relative deviation; R2—coefficient of determination; Skew—skewness; Kurt—kurtosis; SD—standard deviation; Var—variance.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

