Article

Machine Learning Alternatives to Response Surface Models

1 CNRS, AMSE, Aix Marseille Université, 13001 Marseille, France
2 CNRS, Aix Marseille Université, I2M UMR 7373, 13009 Marseille, France
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(15), 3406; https://doi.org/10.3390/math11153406
Submission received: 29 June 2023 / Revised: 28 July 2023 / Accepted: 2 August 2023 / Published: 4 August 2023
(This article belongs to the Section Probability and Statistics)

Abstract

In the Design of Experiments, we seek to relate response variables to explanatory factors. Response Surface Methodology (RSM) approximates the relation between output variables and a polynomial transform of the explanatory variables using a linear model. Some researchers have tried to fit other types of models, mainly nonlinear and nonparametric ones. We present a large panel of Machine Learning approaches that may be good alternatives to the classical RSM approximation. The state of the art of such approaches is given, including classification and regression trees, ensemble methods, support vector machines, neural networks and also direct multi-output approaches. We survey the subject and illustrate the use of ten such approaches using simulations and a real use case. In our simulations, the underlying model is linear in the explanatory factors for one response and nonlinear for the others. We focus on the advantages and disadvantages of the different approaches and show how their hyperparameters may be tuned. Our simulations show that even when the underlying relation between the response and the explanatory variables is linear, the RSM approach is outperformed by the direct neural network multivariate model for every sample size considered (n ≤ 50), and even more so for very small samples (15 or 20 observations). When the underlying relation is nonlinear, the RSM approach is outperformed by most of the machine learning approaches for small samples (n ≤ 30).

1. Introduction

Design of Experiments (DoE) is used in any manufacturing process where, on the one hand, input parameters called factors affect a process or a formula and can be controlled, and on the other hand, output parameters of this process, which therefore represent the results, are called responses. For example, input parameters can be a temperature or a pH, and output parameters can be a yield or an impurity rate. Experiments must be performed to determine a relationship between factors and responses. DoE allows us to select these experiments in such a way that the position of the experimental points in the space to be studied is optimal. In particular, spatial modeling can be used to better understand the process and to predict the variation of responses throughout the variation range of the factors. In these experiments, responses and factors may be quantitative, qualitative, or a mix of both. The number of experiments is often very small, because each may need several days or even weeks, and the cost can be considerable. Among the important tasks that are used in the DoE approach, response surface methodology (RSM) estimates the values of all responses for any combination of factors’ values. This is achieved by means of a regression of the responses on the factors. The most common approach considers each response separately and then uses polynomial regression, including or excluding interactions between factors. Polynomial regression suffers from several limitations, as follows: (i) the model is linear in the coefficients; (ii) to get a good approximation of the surfaces, higher degrees of polynomials may be necessary together with interactions, and this considerably increases the number of variables to include in the model and therefore the number of experiments to be performed; (iii) the model may only be estimated if the number of experiments is larger than the number of factors included in the model; and (iv) polynomial models are inappropriate in the case of nonlinear systems [1].
Several papers have suggested machine learning (ML) approaches as an alternative to polynomial regression. The approaches that are considered in the literature include Support Vector Machines (SVM) [1,2,3,4,5,6], Neural Networks (NN) [1,2,4,6,7,8,9], Random Forests (rf) [1,3,4,10], Boosting and its extension [1,3], Extra Tree regression (ET) [3,4] and Classification And Regression Trees (CART) [1]. In these papers, the authors compared some of these approaches only on real datasets, using specific settings with respect to the nature and dimensions of the data. As well as focusing on a specific dataset, the comparisons considered very few approaches at the same time and were based on very different metrics. In this work, we will mainly consider the contribution of ML with respect to the RSM; these approaches have been shown to be very efficient in various fields, such as ecology [11] and epidemiology [12].
In this paper, we give a state of the art of the ML approaches that have often been used as alternatives to polynomial regression in RSM, including direct multi-output approaches. We will briefly describe these approaches and compare them using simulation models in various situations with different sample sizes and over a real DoE case study. The simulations were actually limited to numerical variables; including categorical ones as input or output did not have any significant effect on our main results or conclusions. The advantage and the contribution of our work lies, on one hand, in the choices of the simulations; some responses are in favor of the RSM approach and others are not. On the other hand, to our knowledge, direct multivariate output approaches have never been used in this context.
This paper is organized as follows. Section 2 gives a summary of the polynomial RSM-based approach, and Section 3 gives a brief description of the most well-known ML approaches. Section 4 reviews the applications of these approaches in various domains where DoE is practiced. In Section 5, we compare all of the ML approaches to polynomial regression on simulated multiple-output datasets, varying the sample sizes and optimizing the models' hyperparameters; a real dataset is used for similar experiments. The last section presents our discussion and conclusions.

2. Response Surface Methodology

Among the many objectives of experimental design, Response Surface Methodology (RSM) [13] is used to determine the value of one or more outputs at any point in the experimental domain of interest [14] without carrying out an infinite number of experiments. This is achieved using a regression model of the form:
Y = f(X_1, X_2, \ldots, X_K) + \epsilon
where f is the unknown regression function. To estimate f, we need to make some hypothesis about its shape. Polynomial models are most commonly used to address this issue; using polynomials of sufficiently high degree, f may be approximated accurately. This is a linear model in the coefficients. With K factors and degree d, the number of coefficients of the model is \frac{(K+d)!}{K! \, d!}. According to the type of study, polynomial models of degree 1 or 2 are used, because the number of observations is, in general, very small (often less than 20).
Let U_k be the original K controlled factors, which are also called the natural variables, whose space is called the experimental domain. These variables are generally on different scales, and they may bias the modeling results. Thus, they are often normalized and linearly transformed into codified variables X_1, X_2, \ldots, X_K, which are dimensionless quantities with the same range. The most commonly used transformation is X_k = (U_k - U_k^0) / \Delta U_k, where U_k^0 = (\max U_k + \min U_k)/2 is called the central value and \Delta U_k = (\max U_k - \min U_k)/2 is the step. The transformed variables (X_k)_{k=1,\ldots,K} lie in the interval [-1, 1], and the n \times K matrix X = (X_k)_{k=1,\ldots,K} is called the experimental design, where n is the sample size. A polynomial model of degree 2 may be written as:
Y = \beta_0 + \sum_{k=1}^{K} \beta_k X_k + \sum_{k=1}^{K} \beta_{kk} X_k^2 + \sum_{1 \le j < k \le K} \beta_{j,k} X_j X_k
where Y is the response and β are the unknown coefficients. Once the design of the experiments is carried out and the data are available, the coefficients of the model are estimated by ordinary least squares, and their estimate is given by:
b = (X'X)^{-1} X'Y
where X = (1, X_1, X_2, \ldots, X_K, X_1^2, X_2^2, \ldots, X_K^2, X_1 X_2, X_1 X_3, \ldots, X_{K-1} X_K) is the model matrix, X' is the transpose of X, Y is the response vector, X'X is the information matrix, and (X'X)^{-1} is the dispersion matrix.
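As an illustration, here is a minimal R sketch of this pipeline on simulated data (the variable names and values are hypothetical, not taken from the paper): the natural variables are coded into [-1, 1] and the degree-2 model is estimated by ordinary least squares.

```r
# Minimal sketch: coding natural variables and fitting a degree-2 RSM in R.
# U1, U2 and the response y are simulated, illustrative data.
set.seed(1)
n  <- 12
U1 <- runif(n, 60, 100)   # natural variable, e.g., a temperature
U2 <- runif(n, 4, 8)      # natural variable, e.g., a pH
code <- function(U) (U - (max(U) + min(U)) / 2) / ((max(U) - min(U)) / 2)
X1 <- code(U1); X2 <- code(U2)              # codified variables in [-1, 1]
y  <- 5 + 2 * X1 - X2 + 3 * X1 * X2 + rnorm(n, sd = 0.5)

# lm() builds the model matrix (1, X1, X2, X1^2, X2^2, X1X2) and computes
# the least-squares estimate b = (X'X)^{-1} X'Y.
fit <- lm(y ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2)
coef(fit)
```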
Test points may be used to validate the model, or an analysis of the variance may be performed. After validation, the model is used to predict the variation of responses throughout the experimental domain represented by a response surface (Figure 1). These predictions can then be used to construct a Design Space, which is a region in which the response specifications will be achieved with a fixed probability [15]. Nevertheless, due to the small sample sizes available in experimental designs, polynomial models of degree 2 may give very poor estimation for surface responses.
Alternative models have been proposed for several years. ML models present more flexible methods of estimating a response surface function: they are nonparametric and nonlinear models, and may even be efficient when the number of experiments is small compared to the number of factors.

3. Machine Learning Approaches

In this section, we will give a brief description of the most commonly used approaches in ML, as follows: k-nearest neighbors (knn), CART, Ensemble models (Bagging, Boosting, and rf), SVM, and NN.

3.1. knn

This is one of the simplest ML approaches and may be used in regression and in classification [17]. It is also quite different from all of the other approaches, in that there is no separate model-fitting step: estimation and prediction are intertwined. Once the number of neighbors k is fixed, for any new observation x, the method seeks the k nearest neighbors of x within the learning set. The prediction for x is the average of the neighbors' outputs in regression, or the majority vote in classification. The decision rule may be refined using, for instance, weighted averages or weighted majority votes, where the weights may be inversely proportional to the distances of the neighbors from x. Note that the distance used to identify the neighbors is a hyperparameter of this approach, together with k.
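A minimal R sketch of knn regression on made-up data; knnreg() from the caret package, used later in the experiments, implements the unweighted version of this rule with the Euclidean distance:

```r
library(caret)
set.seed(1)
train <- data.frame(x1 = runif(30), x2 = runif(30))
train$y <- sin(pi * train$x1) + train$x2 + rnorm(30, sd = 0.1)

# Prediction for a new point = average of the y values of its k nearest
# neighbours (Euclidean distance on x1, x2); k is the tuned hyperparameter.
fit <- knnreg(y ~ x1 + x2, data = train, k = 3)
predict(fit, newdata = data.frame(x1 = 0.5, x2 = 0.5))
```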

3.2. CART

Classification and Regression Trees [18] are based on recursive partitioning of the input space into rectangles. These partitions are represented by a sequence of binary splitting rules of the form X_m < s, where X_m is one of the input variables and s is a threshold over it. The split at each stage of the partitioning is optimized by an exhaustive search over (variable, threshold) pairs, looking for the pair that minimizes the heterogeneity of the two resulting subsamples. Heterogeneity is measured with respect to the output y. When the output is continuous, the deviance is used (the variance multiplied by the sample size). For discrete outputs, the entropy or Gini criteria are used.
CART is the most popular nonlinear and nonparametric method when one wants to understand how the model relates the outputs to the explanatory variables. It is one of the few methods that is graphically representable.
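A minimal R sketch using rpart, a standard CART implementation (the data are simulated for illustration); minsplit, minbucket and cp are the hyperparameters tuned in Section 5:

```r
library(rpart)
set.seed(1)
d <- data.frame(x1 = runif(40, -1, 1), x2 = runif(40, -1, 1))
d$y <- abs(d$x1 - 0.5) + abs(d$x2 - 0.5) + rnorm(40, sd = 0.1)

# method = "anova" uses the deviance criterion for a continuous output;
# minsplit/minbucket control node sizes, cp controls pruning.
tree <- rpart(y ~ x1 + x2, data = d, method = "anova",
              control = rpart.control(minsplit = 5, minbucket = 2, cp = 0.01))
plot(tree); text(tree)   # the fitted tree is directly representable
predict(tree, newdata = data.frame(x1 = 0.5, x2 = 0.5))
```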

3.3. Ensemble Methods

The idea of ensemble methods, in their simplest form, is to build K bootstrap samples of the dataset at hand, adjust a model chosen from within a class of functions (e.g., decision trees) for each bootstrap sample (denoted f_k), and then use the K models for predictions. In regression, the prediction of the ensemble is a weighted average of the K predictions given by the f_k. In classification, it is a weighted majority vote over the K predictions given by the models f_k.
Several variants of ensemble methods exist, differing either in the way in which the bootstrap samples are generated, in the estimation of each f k , in the way in which the individual models f k are combined, or in the algorithm used to estimate the whole process.
Random Forests (rf) [19] are among the most well-known and widely used approaches. This approach combines trees built over bootstrap samples by simple averaging in regression and majority vote in classification. The trees built in rf may be very deep, because they are not pruned, and the choice of the split at each node of a tree is made over a small subset of variables chosen at random. Because the bootstrap samples are independent, random forests may be parallelized.
Boosting [20] is quite different because the trees built over bootstrap samples are obtained sequentially; weights w_i are associated to each observation in the original dataset, and at each step of the algorithm, a tree f_k is built and tested over the original dataset. Observations whose prediction at step k is wrong have their weight increased. The modified weights are used at step k + 1 to draw the next bootstrap sample. Each tree built in the ensemble has a weight W_k related to its performance, and the final output is a weighted majority vote in classification and a weighted average in regression. The trees' estimates and their weights may be obtained using a gradient approach, as is applied in the gradient boosting method (gbm) [21].
Extreme gradient boosting (xgboost [22]) is another gradient boosting algorithm, which employs some additional tricks to estimate the parameters (i.e., trees and their weights). The optimization is carried out globally at the level of each split in the different trees and, as in rf, splits are optimized over a randomly sampled subset of covariates. The loss function optimized in the algorithm uses regularization over the weights and, optionally, a shrinkage of each tree added to the ensemble.
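A minimal R sketch of two of these ensembles on simulated data (the hyperparameter values shown are illustrative only):

```r
library(randomForest)
library(xgboost)
set.seed(1)
d <- data.frame(x1 = runif(50, -1, 1), x2 = runif(50, -1, 1))
d$y <- d$x1^3 * exp(sin(d$x2)) + rnorm(50, sd = 0.1)

# Random forest: many unpruned trees on bootstrap samples, mtry variables
# drawn at random at each split; predictions are averaged.
rf_fit <- randomForest(y ~ x1 + x2, data = d, ntree = 500, mtry = 1)

# Gradient boosting (xgboost): trees added sequentially, each one shrunk by
# the learning rate eta; nrounds and max_depth are tuned hyperparameters.
xgb_fit <- xgboost(data = as.matrix(d[, c("x1", "x2")]), label = d$y,
                   nrounds = 50, max_depth = 2, eta = 0.3,
                   objective = "reg:squarederror", verbose = 0)
predict(xgb_fit, as.matrix(d[1:3, c("x1", "x2")]))
```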

3.4. Support Vector Machines

Support vector machines [23] seek a linear separation of the observations, like linear regression, but minimize a loss function based on the margins. In binary classification, the optimal separator is the hyperplane that perfectly separates the two classes while staying as far as possible from the nearest points of each class. This hyperplane may easily be found if the classes are linearly separable. If they are not, the dataset is projected, using nonlinear transformations, into a much higher-dimensional space where linear separation may be guaranteed. The nonlinear transformations used in this case are expressed through kernel functions, among which the most commonly used are the polynomial and radial kernels. These kernels, together with the optimization algorithm used to find the optimal hyperplane, depend in general on several hyperparameters that must be fixed or tuned by the user. We will denote by svmPoly (resp. svmRadial) the support vector regression using the polynomial (resp. radial) kernel.
Support Vector Machines may show high performance when the data are linearly separable, and are very efficient for large datasets. They also have a very solid mathematical justification.
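A minimal R sketch of both kernels using kernlab (which caret's svmPoly and svmRadial methods wrap); the data and hyperparameter values are illustrative:

```r
library(kernlab)
set.seed(1)
d <- data.frame(x1 = runif(50, -1, 1), x2 = runif(50, -1, 1))
d$y <- d$x1^3 * exp(sin(d$x2)) + rnorm(50, sd = 0.1)

# Support vector regression with a radial kernel ("svmRadial");
# sigma and the cost C are the tuned hyperparameters.
fit_rad <- ksvm(y ~ x1 + x2, data = d, kernel = "rbfdot",
                kpar = list(sigma = 0.5), C = 1)
# Polynomial kernel ("svmPoly"); degree, scale and C are tuned.
fit_poly <- ksvm(y ~ x1 + x2, data = d, kernel = "polydot",
                 kpar = list(degree = 2, scale = 1, offset = 1), C = 1)
predict(fit_rad, d[1:3, ])
```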

3.5. Neural Networks

NNs, and specifically the multi-layer perceptron (mlp) [17], are designed to handle multiple outputs. An mlp is composed of successive layers, each having a fixed number of neurons. Each neuron of a layer receives information (its inputs) from all of the neurons of the previous layer and outputs a nonlinear transformation ϕ of a linear combination of its inputs. In Figure 2, the left-hand panel shows a simple neuron having d inputs x_1, x_2, \ldots, x_d. This neuron will output the value:
\phi\Big(\sum_{j=0}^{d} w_j x_j\Big)
where the w_j are the weights (with w_0 = b and x_0 = 1) and \phi is a nonlinear function. The right-hand panel in Figure 2 shows an mlp with one hidden layer containing three neurons. Various types of layers may also be used to design a NN; we shall mainly use fully connected layers (also called dense layers).
Once the structure of the network is specified (number of layers and number of neurons per layer), all of the weights of the NN that are unknown are estimated. To achieve this, a loss function should be specified (mean squared error for regression), together with an optimization algorithm (typically gradient-based algorithms). In some situations, specific activation layers are added to the networks, mainly for the last layer (softmax for multi-class learning, sigmoid for multi-label learning).
Neural networks are among the most difficult to tune because of the large number of hyperparameters that may be involved in the structure of the network and in the choice of the optimization algorithms. For small datasets, they may be unstable due to random initializations of the weights within the training.
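A minimal R sketch with the nnet package, which fits a single-hidden-layer mlp (the networks used later may have more layers; all values here are illustrative):

```r
library(nnet)
set.seed(1)
d <- data.frame(x1 = runif(50, -1, 1), x2 = runif(50, -1, 1))
d$y <- abs(d$x1 - 0.5) + abs(d$x2 - 0.5) + rnorm(50, sd = 0.1)

# One hidden layer with 5 neurons; linout = TRUE gives a linear output unit
# for regression. Results depend on the random initial weights, hence
# set.seed(); size and decay are typical tuned hyperparameters.
fit <- nnet(y ~ x1 + x2, data = d, size = 5, linout = TRUE,
            decay = 0.01, maxit = 500, trace = FALSE)
predict(fit, d[1:3, ])
```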

3.6. Multidimensional Output Approaches

In many situations, the output variable to be predicted may be multidimensional, of dimension q: multi-target regression, multi-label learning, distribution learning, semantic segmentation (for images), or time series. Various approaches may be used to tackle this situation. Direct multi-output regression is the simplest one, consisting of using one model per output component; in this case, one assumes that the output components are independent. Chained regression is another approach, which accounts for dependence among the output variables. The output variables are first ranked, and one model is learnt for each output, as follows: for output variable j, the original explanatory variables X are used as input, together with the predictions of the j − 1 outputs already modelled. In this approach, the result depends strongly on the order chosen for the responses.
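A minimal sketch of chained regression with linear base models (the data and variable names are made up for illustration):

```r
# Chained regression: each output model also receives the predictions of the
# previously modelled outputs; the ordering y1, y2, y3 is arbitrary and, as
# noted above, it affects the result.
set.seed(1)
X <- data.frame(x1 = runif(40), x2 = runif(40))
Y <- data.frame(y1 = X$x1 + rnorm(40, sd = 0.1),
                y2 = X$x2 + rnorm(40, sd = 0.1),
                y3 = X$x1 * X$x2 + rnorm(40, sd = 0.1))

models <- list()
aug <- X                                            # covariates, augmented below
for (j in seq_along(Y)) {
  models[[j]] <- lm(Y[[j]] ~ ., data = aug)         # model for output j
  aug[[paste0("pred", j)]] <- predict(models[[j]])  # feed prediction forward
}
# At prediction time, the same chain is applied to new X, each model
# receiving the predictions of the previous ones as extra inputs.
```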
Some supervised learning algorithms can also jointly consider the q output variables within the same model; this is the case for NNs and also for regression trees [24].
In classical regression trees, the output variable y is used to compute the deviance, that is, the splitting criterion of a node t:
\mathrm{Deviance}(t) = \sum_{i \in t} (y_i - \bar{y}_t)^2
where \bar{y}_t is the empirical mean of y in node t (which corresponds to a subset of the original sample). When y is multidimensional, y \in \mathbb{R}^q, the deviance is easily generalized [24]:
\mathrm{Deviance}(t) = \sum_{i \in t} \| y_i - \bar{y}_t \|_2^2
where \| \cdot \|_2 stands for the Euclidean norm in \mathbb{R}^q. In this case, the values assigned to each node and each leaf of the tree are the q-dimensional vectors \bar{y}_t. The rest of the algorithm is similar to the one-dimensional output case. We will denote this approach as CART_mv.
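A minimal R sketch of this multivariate splitting criterion (not an implementation of the full tree-growing algorithm):

```r
# For a candidate split X_m < s, the summed multivariate deviance of the two
# children is computed; the (variable, threshold) pair minimising it is kept.
mv_deviance <- function(Y) {            # Y: n x q matrix of outputs in a node
  sum(sweep(Y, 2, colMeans(Y))^2)       # sum_i || y_i - ybar_t ||^2
}
best_threshold <- function(x, Y) {      # all splits on one covariate x
  s_cand <- sort(unique(x))[-1]         # keeps both children non-empty
  crit <- sapply(s_cand, function(s)
    mv_deviance(Y[x < s, , drop = FALSE]) +
    mv_deviance(Y[x >= s, , drop = FALSE]))
  c(threshold = s_cand[which.min(crit)], deviance = min(crit))
}

# Example on simulated data with q = 2 outputs:
set.seed(1)
x <- runif(30)
Y <- cbind(y1 = x + rnorm(30, sd = 0.1), y2 = x^2 + rnorm(30, sd = 0.1))
best_threshold(x, Y)
```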
The main advantage of direct multiple-output approaches is that they may account for dependencies between the different outputs. CART_mv and its variants have been widely used in various applications.

3.7. Hyperparameters and Their Tuning

The performance of any ML algorithm highly depends on the choice of its hyperparameters. Each approach uses several hyperparameters, from one to almost ten. In most cases, these parameters have to be optimized, which is often achieved using cross-validation. Tuning may require excessive computation times, depending on the number of hyperparameters to be adjusted. Details about the tuning process and the hyperparameters used will be given in the experiments section.

3.8. Metrics Used to Compare the Models

Analysis of variance is generally used to evaluate the performance of polynomial models in RSM. Depending on the problem under study, it may reveal that the variation in the outputs is not explained by the variation in the inputs. Different criteria may be used to assess and compare the accuracy of the predictions obtained using any regression model. Let n be the size of the sample used to evaluate the models, y_i be the observed value of the response for observation i, \bar{y} be the average of the responses in the sample, and \hat{y}_i be the predicted value of the response for observation i. Among the most used metrics, we can find:
  • Root Mean Square Error, RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
  • Mean Absolute Error, MAE = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|
  • Mean Absolute Percentage Error, MAPE = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|
  • Determination Coefficient, R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}
  • Akaike Information Criterion, AIC = 2k - 2\ln(L), where k is the number of parameters to be estimated in the model and L is the maximized likelihood of the model
Other, less well-known or used metrics also appear in the literature:
  • The explained variance of the model is the proportion of the variance due to the factors, EV = 1 - \frac{\mathrm{var}(y - \hat{y})}{\mathrm{var}(y)}
  • The Nash and Sutcliffe Efficiency is equivalent to R^2 but uses absolute rather than quadratic differences, NSE = 1 - \frac{\sum_{i=1}^{n}|y_i - \hat{y}_i|}{\sum_{i=1}^{n}|y_i - \bar{y}|}, with -\infty < NSE \le 1
  • The agreement index d is a standardized measure of the degree of model prediction error, d = 1 - \frac{\sum_{i=1}^{n}|y_i - \hat{y}_i|}{\sum_{i=1}^{n}(|y_i - \bar{y}| + |\hat{y}_i - \bar{y}|)}, with 0 < d \le 1
  • The average absolute deviation from a central point (this metric is defined and used in [8], but the name given by the authors does not seem appropriate to us, as it is not in line with their definition), AAD(\%) = 100 \times \frac{1}{n}\sum_{i=1}^{n}\frac{\hat{y}_i - y_i}{y_i}
The objective is to minimize RMSE, MAE, MAPE, AIC, and AAD, and to maximize R^2, EV, NSE, and d. The results encountered in the literature show that, in general, when different metrics are used for comparisons, they are very often concordant with respect to the objectives.
When the output y is multidimensional, any of these criteria may be used by computing its average over all the dimensions.
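For reference, these metrics translate directly into R (a sketch; y is the observed vector and yhat the predicted one):

```r
# Direct R translations of the metrics above.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
mae  <- function(y, yhat) mean(abs(y - yhat))
mape <- function(y, yhat) mean(abs((y - yhat) / y))
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)
ev   <- function(y, yhat) 1 - var(y - yhat) / var(y)
nse  <- function(y, yhat) 1 - sum(abs(y - yhat)) / sum(abs(y - mean(y)))
d_ix <- function(y, yhat)
  1 - sum(abs(y - yhat)) / sum(abs(y - mean(y)) + abs(yhat - mean(y)))

# For a q-dimensional output, apply a metric column-wise and average:
avg_metric <- function(Y, Yhat, metric)
  mean(mapply(metric, as.data.frame(Y), as.data.frame(Yhat)))
```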

4. Comparisons between RSM and ML Approaches

A comparison of these response surface models with ML has been made, mainly in the field of mechanics and materials development.
ML models were compared to the traditional RSM polynomial approximation on a mechanical engineering case study [1]. The authors studied 63 continuous factors and 8 continuous responses, combined to produce a one-dimensional output, with 56,512 observations. Polynomial models of degree 1 and 2, the Least Absolute Shrinkage Selection Operator [25], the Generalized Linear Model [26] and nonlinear ML models (Random Forest, Gradient Boosting Decision Tree [17], Multiple Layer Perceptrons and Support Vector Regression) were tested. Three evaluation criteria were used for the comparisons: Explained Variance (EV), Mean Absolute Percentage Error (MAPE) and Root Mean Square Error (RMSE). The results obtained show that all ML models outperform the polynomial models of the RSM approach. The authors also note that for larger training set sizes, the nonlinear ML models were more accurate. A Monte Carlo simulation of the data is proposed to improve the estimation of the response surface function. The comparisons made in this work are based on a real dataset with large samples, so the true underlying relation between the response and the factors is unknown; moreover, the eight original responses were combined into a one-dimensional output.
In [2], ML models were used to predict coating thickness in a nonlinear electrostatic spray deposition process and compared with the conventional RSM model. Three continuous factors and one continuous response were considered with 30 observations. A polynomial model of degree 2 (RSM), NN (Back-Propagation algorithm [27]) and Support Vector Machine models were used. MAPE was calculated to evaluate the performance of these models. The results suggest that the SVM model gives the best prediction accuracy.
The objective of [3] was to predict the viscosity of nano-polymers used in an enhanced oil recovery method, using the response surface methodology and supervised machine learning approaches. Five continuous factors and one continuous response were available, with 57 observations. The authors applied a polynomial model of degree 2 and analysis of variance, which did not show inadequacy for this model. However, the covariance matrix analysis showed that there is no linear relationship between the factors. Therefore, the authors also used nonlinear ML models, namely ridge regression [28], lasso regression, Support Vector Machine, Decision Tree, random forest, Extra tree regression [29], Gradient Boosting Regression and eXtreme Gradient Boosting [22]. Akaike Information Criterion (AIC), Mean Absolute Error (MAE), RMSE and coefficient of determination were used as evaluation and comparison criteria. The results show that the ensemble models give more accurate results than the RSM. Furthermore, the eXtreme Gradient Boosting model was the best model for response prediction.
In another study, where the objective was to predict the microhardness of a synthesized electroless Ni-P-TiO2 coated aluminium composite, five continuous factors and one continuous response were considered with 36 observations [4]. The authors tested a polynomial model of degree 2 and four ML models (NN, SVM, rf and Extra Trees). Mean Square Error (MSE), MAE and the determination coefficient were used as evaluation criteria to compare these models. The results show that the ET model presents the lowest MSE and MAE values. This model also had the greatest R^2.
Hybrid regression and ML models were tested to predict the ultimate condition of fiber-reinforced polymer-confined concrete [5]. The authors used an open database with eight continuous factors, two continuous responses and 765 observations. The authors compared some existing physical empirical models developed by other authors with the RSM and SVR models, together with a hybrid model combining SVR and RSM models. The following metrics were used to evaluate these models: RMSE, MAE, Nash and Sutcliffe Efficiency (NSE) and agreement index d. The authors show that the hybrid model presents a lower RMSE, lower MAE, higher NSE and higher d than other models.
Similar comparisons with ML approaches may also be found in other fields, such as pharmaceutical development. Ref. [6] worked on the effect of the core/shell technique on improving powder compactability. In this paper, two continuous factors, two binary factors and two continuous responses were measured in 28 experiments. The authors used the RSM approach, adjusting one model for each combination of the two binary factors' levels, thus four models for each response. For the ML approach, the authors compared an SVR model and four types of NN models: Backpropagation Neural Network (BPNN), Genetic Algorithm Based BPNN (GA-BPNN) [30], Mind Evolutionary Algorithm Based BPNN (MEA-BPNN) [31], and Extreme Learning Machine (ELM) [32]. The evaluation criteria were the coefficient of variation of the RMSE and the RMSE for all models, and the AIC for the NN models only. The results show that the NN models provide better prediction accuracy than the other models.
The objective of the authors of [10] was to apply RSM and ML combined with data simulation to estimate metal recovery from freshwater sediments. In this work, three continuous factors were studied, six continuous responses were measured and 18 experiments were carried out. A polynomial model of degree 2 was estimated and used for data simulation. ML models, namely Lazy KStar [33] and rf algorithms, were also tested. A comparison of the models was made by means of RMSE and coefficient of determination. The results show that RSM models overestimate the observed responses by 19% compared to ML models.
Ref. [8] compared RSM and ML approaches (NN) to predict the efficient extraction of artemisinin (a precursor molecule of the most powerful antimalarial drugs on the market). In this work, three continuous factors were studied, one continuous response was measured and a central composite design was constructed with 20 experiments. For the RSM approach, ANOVA was calculated and shows that the variability of the response cannot be adequately predicted by the RSM model depending on the factors studied. This result can be explained by a complex relationship between variables. Multiple Layer Perceptrons (feed forward NN) were also tested. A comparison between these two models was performed by means of Absolute Average Deviation (AAD). The authors concluded that the NN model has better prediction accuracy.
Ref. [9] analysed NN models as an alternative modeling technique for datasets showing nonlinear relationships, using data from a tablet compression study. In this work, six continuous factors were studied, two continuous responses were measured, and 102 experiments were performed. The authors calculated a polynomial model of degree 2 with only important terms (p-value < 0.05) for the RSM approach and calculated a “generalized feed forward multiple layer perceptron network” for the ML approach. A comparison of the determination coefficients of these two models suggests that the NN model can more accurately predict the variation of the response.
A summary of these papers comparing RSM and ML approaches is given in Table 1.

5. Experiments

In this section, we compare all of the ML approaches described in Section 3 to the polynomial model over a simulated dataset and a real dataset. We describe the datasets that we used, and how the different models were tuned and compared. All of the experiments were conducted using R with the RStudio environment [34] and the package "caret" [35].

5.1. Simulated Model

The simulated model that we use is based on two independent continuous factors, X_1 and X_2, following a uniform distribution over [-1, 1]. The three responses, denoted Y_1, Y_2 and Y_3, are computed as follows:
Y_1 = 20 - X_1 - X_2 + 18 X_1^2 + X_2^2 + 5 X_1 X_2 + \epsilon
Y_2 = X_1^3 \times \exp(\sin(X_2)) + \epsilon
Y_3 = |X_1 - 0.5| + |X_2 - 0.5| + \epsilon
where \epsilon is a random noise with zero mean and standard deviation \sigma. The responses were deliberately chosen to present various types of functions: polynomial of degree 2 (Y_1), nonlinear (Y_2), and nonlinear with discontinuities (Y_3). A graphical representation of these responses is shown in Figure 3.
Using this simulated model, we generated a training sample of size n and a test sample of the same size. Different values of n were tested, as follows: 15, 20, 30 and 50 observations. Small samples were used for conformity with true DoE applications. For each simulated dataset of size n, we computed the RMSE metric. Simulations were repeated K = 50 times, and RMSE values were averaged over the different runs. We also varied the variance of the random noise, testing two values, σ = 1 and σ = 2 .
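A minimal R sketch of this simulation protocol, using the response formulas as written above (shown here with the RSM baseline on Y_1; the seed and code organization are ours):

```r
# Generator for one simulated data set, following the formulas above.
simulate_doe <- function(n, sigma = 1) {
  X1 <- runif(n, -1, 1); X2 <- runif(n, -1, 1)
  data.frame(X1, X2,
    Y1 = 20 - X1 - X2 + 18 * X1^2 + X2^2 + 5 * X1 * X2 + rnorm(n, sd = sigma),
    Y2 = X1^3 * exp(sin(X2)) + rnorm(n, sd = sigma),
    Y3 = abs(X1 - 0.5) + abs(X2 - 0.5) + rnorm(n, sd = sigma))
}

set.seed(1)
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))
res <- replicate(50, {                      # K = 50 repetitions
  train <- simulate_doe(30); test <- simulate_doe(30)
  fit <- lm(Y1 ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2, data = train)
  rmse(test$Y1, predict(fit, test))         # RSM baseline on Y1
})
mean(res); sd(res)
```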

5.1.1. Tuning the Models

We compared the RSM approach with all of the ML approaches described in Section 3: knn, CART, rf, gbm, xgboost, svmPoly, svmRadial, NN, CARTmv and NNmv. The models were trained using the model matrix as input, thus, the true factors together with their squares and interactions.
All ML models depend on several hyperparameters. To optimize the values of the hyperparameters, we applied the grid search approach. Thus, a grid was defined for each model’s hyperparameter, and three-fold cross-validation with five repeats was used for the optimization. The optimal values of the hyperparameters were then used to learn the model over the training set and evaluate it on the test sample. Table 2 lists the hyperparameters for each method, together with a short description and the grid used for its optimization.
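A minimal R sketch of this tuning scheme with caret (the grid shown, k = 1, ..., 10 for knn, is illustrative and not necessarily the grid of Table 2):

```r
library(caret)
set.seed(1)
train <- data.frame(X1 = runif(30, -1, 1), X2 = runif(30, -1, 1))
train$Y2 <- train$X1^3 * exp(sin(train$X2)) + rnorm(30)

# 3-fold cross-validation repeated 5 times, over an explicit grid.
ctrl <- trainControl(method = "repeatedcv", number = 3, repeats = 5)
fit  <- train(Y2 ~ X1 + X2 + I(X1^2) + I(X2^2) + X1:X2, data = train,
              method = "knn", metric = "RMSE", trControl = ctrl,
              tuneGrid = data.frame(k = 1:10))
fit$bestTune   # optimal k; caret refits with it on the whole training set
```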
The average of the optimal values of the hyperparameters and their standard deviation for each approach and each response are given in Appendix A.1.

5.1.2. Results

Table 3, Table 4, Table 5 and Table 6 show the average and standard deviation of the RMSE scores computed over the test samples for each response, averaged over the K = 50 repetitions (columns S_1, S_2 and S_3), and the average RMSE over the responses (column S_ave). We also report the rank of each model with respect to its RMSE for each response (R_1, R_2 and R_3) and for the average over responses (R_ave). A model has rank 1 for a response if it has the smallest RMSE, and is thus the best. Each table corresponds to a different sample size n. Boxplots of the RMSE over the K = 50 runs for all the models and outputs are shown in Appendix A.2.
For n = 50 (Table 3), the best models for the three responses were the multivariate NN (for Y_1), svm with polynomial kernel (for Y_2) and NN (for Y_3). The linear model was ranked (2, 5, 5) over the three responses.
For n = 30 (Table 4), the best models for the three responses were, respectively, the multivariate NN (for Y_1) and NN (for Y_2 and Y_3). With respect to the average error over all responses, svm with polynomial kernel is the best model in this case. Linear regression is at position 2 for Y_1, but at position 8 for the other two responses. These first results seem to be in agreement with the state of the art: ML approaches may provide better prediction estimates for the responses.
We obtained similar results for smaller sample sizes, n = 20 (Table 5) and n = 15 (Table 6): the multivariate NN (for Y_1), knn (for Y_2) and NN (for Y_3) were the best models. Linear regression gets the worst results as the sample size decreases, particularly for Y_2 and Y_3, which are nonlinear responses.
Globally, the RSM approach seems to be inefficient when compared to ML approaches, mainly for small sample sizes and for responses that are nonlinear in the factors. Even for large samples (n = 50, large compared to what is typically available in DoE), RSM is outperformed in our experiments for the linear response. Finally, multivariate NNs show excellent performance in various situations, mainly for the linear response (Y_1), where they outperform RSM for any sample size.

5.2. Use Case

We used the pharmaceutical application described in [16]. This article is based on the development of a high-performance liquid chromatography analytical method to quantify verapamil hydrochloride (a molecule used in treatments for headaches) and its impurities in tablets. In this use case, three continuous factors (buffer pH, ammonium hydroxide concentration in the mobile phase and injection volume of the test solutions) and five continuous responses (capacity factor for the first eluted peak, denoted Y_1; resolution between two impurities, denoted Y_2; signal-to-noise ratio for an impurity, denoted Y_3; resolution between two other impurities, denoted Y_4; and retention time difference between the oxidative impurity peak and the closest impurity, denoted Y_5) were available. The authors used the RSM approach to determine the optimal chromatographic conditions.
For the real case study, we conducted similar experiments as for the simulated dataset, replacing random generation with random splitting. Thus, for each of the 50 repetitions, the original dataset was randomly split using proportions 0.7 and 0.3 for learning and testing, respectively. The results that we obtained are given in Table 7. The best models with respect to the RMSE were xgb for Y 1 , gbm for Y 2 , multivariate NN for Y 3 , NN for Y 4 and knn for Y 5 . For the overall error, multivariate NN had the smallest error. For this real dataset, and for all responses, ML models gave better results than the RSM approach.

6. Conclusions

In this work, using extensive simulations and a real use case, we have shown that ML approaches present a very interesting alternative to response surface modeling in DoE. We have tested a large panel of ML approaches, together with direct multi-output regression models (decision trees and NNs), which had never previously been used in this context. Various ML approaches clearly outperformed RSM in all of our experiments. RSM nevertheless remains much simpler to use than the ML approaches, since the latter require several hyperparameters to be tuned correctly, which calls for optimization procedures over grids. The caret package is a good wrapper for this task and for most of the approaches that we have used, except for multi-output regression, for which we implemented the grid search algorithm directly. Another constraint of caret is that not all of the hyperparameters involved in each approach may be tuned, and it may thus be necessary to use other packages for this purpose.
In our implementation, we used the factors together with their squares and interactions in the ML approaches. Very few papers are clear about this choice. We also tested the models including only the factors, and in some cases a loss in performance appeared for some ML algorithms, but the main conclusion was observed: several ML approaches outperformed classical RSM.
Many ML and deep learning approaches were not explored in this work (like [36,37,38,39]) and may have some advantages and drawbacks compared to those we have used. Due to the very large choice of such approaches, it is impossible to be exhaustive, so we have selected the ones most used and cited in the literature. In addition, further higher-polynomial-order factors could be included as input in the different models. Together with feature selection approaches, this may give better performance for various ML approaches.

Author Contributions

Conceptualization, B.G. and D.M.; methodology, B.G. and D.M.; software, D.M.; validation, B.G. and D.M.; formal analysis, B.G. and D.M.; investigation, B.G. and D.M.; resources, B.G. and D.M.; writing—original draft preparation, B.G. and D.M.; writing—review and editing, B.G. and D.M.; visualization, B.G. and D.M.; supervision, B.G.; project administration, D.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

On request by email.

Acknowledgments

The authors gratefully acknowledge Claude Deniau and Sophie Declomesnil for their constructive comments on the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Supplementary Material

Appendix A.1. Mean and Standard Deviation of Hyperparameter Values

In each of the K = 50 runs, the optimized values of the parameters were computed. The following tables give the mean and standard deviation of the hyperparameters for each response.
Table A1. Average and standard deviation (SD) of the hyperparameters for all responses with σ = 1 and n = 50.

Hyperparameter | Average Y1 | SD Y1 | Average Y2 | SD Y2 | Average Y3 | SD Y3
k | 4.06 | 1.268 | 9.34 | 2.21 | 8.4 | 2.304
minsplit | 3 | 2.474 | 1.3 | 1.199 | 1.2 | 0.99
minbucket | 1.8 | 1.852 | 5.4 | 1.641 | 5.6 | 1.37
cp | 0.003 | 0.002 | 0.023 | 0.011 | 0.018 | 0.012
minsize | 5.46 | 3.382 | 5.54 | 3.57 | 5.44 | 3.144
mtry | 3.2 | 0.7 | 2.3 | 0.839 | 2.64 | 1.208
n.trees | 93 | 58.038 | 50 | 0 | 54 | 28.284
interaction.depth | 1.96 | 0.88 | 1.08 | 0.34 | 1.08 | 0.396
shrinkage | 0.1 | 0 | 0.1 | 0 | 0.1 | 0
n.minobsinnode | 10 | 0 | 10 | 0 | 10 | 0
nrounds | 70 | 47.38 | 50 | 0 | 51 | 7.071
max_depth | 1.2 | 0.495 | 1.36 | 1.064 | 1.12 | 0.521
eta | 0.3 | 0 | 0.306 | 0.024 | 0.304 | 0.02
gamma | 0 | 0 | 0 | 0 | 0 | 0
colsample_bytree | 0.712 | 0.1 | 0.684 | 0.1 | 0.664 | 0.094
min_child_weight | 1 | 0 | 1 | 0 | 1 | 0
subsample | 0.743 | 0.178 | 0.828 | 0.187 | 0.833 | 0.197
degree | 1.32 | 0.653 | 1.76 | 0.87 | 1.42 | 0.758
scale | 2.312 | 3.661 | 0.141 | 0.292 | 0.412 | 1.977
C_poly | 1.66 | 1.358 | 1.535 | 1.445 | 0.48 | 0.756
sigma | 0.537 | 0.42 | 0.525 | 0.515 | 0.497 | 0.503
C_rad | 5.08 | 7.359 | 0.35 | 0.196 | 0.46 | 1.1
layer1 | 4.96 | 1.355 | 5.04 | 1.525 | 3.64 | 1.549
layer2 | 5.12 | 1.223 | 4.64 | 1.306 | 3.64 | 1.601
layer3 | 4.92 | 1.523 | 5.04 | 1.355 | 3.72 | 1.715
minsplit (CARTmv) | 2.8 | 2.424 | 2.8 | 2.424 | 2.8 | 2.424
minbucket (CARTmv) | 2.4 | 2.268 | 2.4 | 2.268 | 2.4 | 2.268
cp (CARTmv) | 0.006 | 0.006 | 0.006 | 0.006 | 0.006 | 0.006
layer1 (NNmv) | 4 | 1.512 | 4 | 1.512 | 4 | 1.512
layer2 (NNmv) | 4.04 | 1.484 | 4.04 | 1.484 | 4.04 | 1.484
layer3 (NNmv) | 3.92 | 1.712 | 3.92 | 1.712 | 3.92 | 1.712
Table A2. Average and standard deviation (SD) of the hyperparameters for all responses with σ = 1 and n = 30.

Hyperparameter | Average Y1 | SD Y1 | Average Y2 | SD Y2 | Average Y3 | SD Y3
k | 3.08 | 1.047 | 8.28 | 2.241 | 8.52 | 2.297
minsplit | 1.9 | 1.94 | 1.1 | 0.707 | 1.3 | 1.199
minbucket | 1.3 | 1.199 | 5.7 | 1.199 | 5.7 | 1.199
cp | 0.006 | 0.006 | 0.013 | 0.011 | 0.013 | 0.013
minsize | 5.6 | 3.169 | 5.64 | 3.269 | 5.04 | 3.597
mtry | 3.78 | 1.016 | 2.22 | 0.737 | 2.34 | 0.917
n.trees | 113 | 62.93 | 57 | 26.74 | 55 | 17.071
interaction.depth | 1.58 | 0.499 | 1.28 | 0.454 | 1.26 | 0.443
shrinkage | 0.1 | 0 | 0.1 | 0 | 0.1 | 0
n.minobsinnode | 10 | 0 | 10 | 0 | 10 | 0
nrounds | 106 | 73.29 | 68 | 51.27 | 51 | 7.071
max_depth | 2.06 | 1.449 | 1.86 | 1.414 | 1.74 | 1.44
eta | 0.31 | 0.03 | 0.318 | 0.039 | 0.322 | 0.042
gamma | 0 | 0 | 0 | 0 | 0 | 0
colsample_bytree | 0.704 | 0.101 | 0.656 | 0.091 | 0.684 | 0.1
min_child_weight | 1 | 0 | 1 | 0 | 1 | 0
subsample | 0.69 | 0.189 | 0.745 | 0.197 | 0.735 | 0.198
degree | 1.4 | 0.639 | 1.56 | 0.787 | 1.66 | 0.895
scale | 1.824 | 3.359 | 0.139 | 0.293 | 0.252 | 1.421
C_poly | 1.555 | 1.226 | 1.135 | 1.292 | 0.995 | 1.243
sigma | 0.397 | 0.27 | 0.518 | 0.635 | 0.494 | 0.339
C_rad | 6.16 | 6.973 | 0.68 | 2.24 | 0.345 | 0.214
layer1 | 4.96 | 1.228 | 4.64 | 1.588 | 3.68 | 1.684
layer2 | 5.52 | 0.953 | 4.96 | 1.228 | 4.24 | 1.791
layer3 | 5.32 | 1.186 | 4.04 | 1.784 | 4.12 | 1.637
minsplit (CARTmv) | 2.7 | 2.393 | 2.7 | 2.393 | 2.7 | 2.393
minbucket (CARTmv) | 1.5 | 1.515 | 1.5 | 1.515 | 1.5 | 1.515
cp (CARTmv) | 0.009 | 0.008 | 0.009 | 0.008 | 0.009 | 0.008
layer1 (NNmv) | 4.12 | 1.637 | 4.12 | 1.637 | 4.12 | 1.637
layer2 (NNmv) | 3.88 | 1.686 | 3.88 | 1.686 | 3.88 | 1.686
layer3 (NNmv) | 4.28 | 1.512 | 4.28 | 1.512 | 4.28 | 1.512
Table A3. Average and standard deviation (SD) of the hyperparameters for all responses with σ = 1 and n = 20.

Hyperparameter | Average Y1 | SD Y1 | Average Y2 | SD Y2 | Average Y3 | SD Y3
k | 2.88 | 0.849 | 8.3 | 2.485 | 8.54 | 2.786
minsplit | 1.8 | 1.852 | 1.7 | 1.753 | 1.7 | 1.753
minbucket | 1 | 0 | 4.7 | 2.215 | 5.2 | 1.852
cp | 0.005 | 0.005 | 0.012 | 0.012 | 0.006 | 0.009
minsize | 5.58 | 3.308 | 5.24 | 3.384 | 4.48 | 3.209
mtry | 4.08 | 1.007 | 2.62 | 1.159 | 2.34 | 0.848
n.trees | 72 | 55.476 | 51 | 7.071 | 59 | 40.013
interaction.depth | 1 | 0 | 1 | 0 | 1 | 0
shrinkage | 0.1 | 0 | 0.1 | 0 | 0.1 | 0
n.minobsinnode | 10 | 0 | 10 | 0 | 10 | 0
nrounds | 110 | 74.231 | 62 | 38.545 | 59 | 34.538
max_depth | 2.72 | 1.356 | 2.26 | 1.601 | 1.68 | 1.332
eta | 0.324 | 0.043 | 0.32 | 0.04 | 0.32 | 0.04
gamma | 0 | 0 | 0 | 0 | 0 | 0
colsample_bytree | 0.716 | 0.1 | 0.656 | 0.091 | 0.68 | 0.099
min_child_weight | 1 | 0 | 1 | 0 | 1 | 0
subsample | 0.67 | 0.169 | 0.75 | 0.192 | 0.725 | 0.192
degree | 1.32 | 0.621 | 1.58 | 0.758 | 1.26 | 0.6
scale | 2.33 | 3.652 | 0.488 | 1.975 | 0.223 | 1.411
C_poly | 1.415 | 1.173 | 1.675 | 1.581 | 0.675 | 0.892
sigma | 0.308 | 0.222 | 0.323 | 0.219 | 0.695 | 1.371
C_rad | 7 | 9.318 | 0.455 | 0.562 | 0.485 | 0.64
layer1 | 4.76 | 1.333 | 4.08 | 1.614 | 3.68 | 1.731
layer2 | 5.36 | 1.174 | 4.44 | 1.68 | 4.04 | 1.641
layer3 | 5.6 | 0.808 | 3.52 | 1.644 | 4.08 | 1.85
minsplit (CARTmv) | 2.1 | 2.092 | 2.1 | 2.092 | 2.1 | 2.092
minbucket (CARTmv) | 1 | 0 | 1 | 0 | 1 | 0
cp (CARTmv) | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01
layer1 (NNmv) | 3.92 | 1.563 | 3.92 | 1.563 | 3.92 | 1.563
layer2 (NNmv) | 4.2 | 1.471 | 4.2 | 1.471 | 4.2 | 1.471
layer3 (NNmv) | 3.88 | 1.48 | 3.88 | 1.48 | 3.88 | 1.48
Table A4. Average and standard deviation (SD) of the hyperparameters for all responses with σ = 1 and n = 15.

Hyperparameter | Average Y1 | SD Y1 | Average Y2 | SD Y2 | Average Y3 | SD Y3
k | 2.74 | 0.853 | 7.16 | 2.691 | 7.84 | 2.985
minsplit | 1.2 | 0.99 | 1.4 | 1.37 | 1.7 | 1.753
minbucket | 1.1 | 0.707 | 4.8 | 2.157 | 4.5 | 2.315
cp | 0.008 | 0.009 | 0.005 | 0.007 | 0.005 | 0.008
minsize | 5.72 | 3.084 | 6.32 | 3.365 | 6.48 | 3.346
mtry | 4.52 | 0.863 | 2.66 | 1.099 | 2.5 | 1.055
n.trees | 107 | 90.356 | 81 | 70.631 | 55 | 29.014
interaction.depth | 1 | 0 | 1 | 0 | 1 | 0
shrinkage | 0.1 | 0 | 0.1 | 0 | 0.1 | 0
n.minobsinnode | 10 | 0 | 10 | 0 | 10 | 0
nrounds | 115 | 76.432 | 62 | 42.33 | 68 | 52.255
max_depth | 2.6 | 1.539 | 2.52 | 1.581 | 2.76 | 1.611
eta | 0.314 | 0.035 | 0.332 | 0.047 | 0.336 | 0.048
gamma | 0 | 0 | 0 | 0 | 0 | 0
colsample_bytree | 0.728 | 0.097 | 0.652 | 0.089 | 0.664 | 0.094
min_child_weight | 1 | 0 | 1 | 0 | 1 | 0
subsample | 0.72 | 0.185 | 0.68 | 0.194 | 0.685 | 0.188
degree | 1.16 | 0.468 | 1.7 | 0.814 | 1.32 | 0.683
scale | 2.492 | 3.814 | 0.872 | 2.726 | 0.257 | 1.42
C_poly | 1.715 | 1.489 | 1.175 | 1.311 | 0.74 | 1.087
sigma | 0.262 | 0.184 | 0.346 | 0.23 | 0.489 | 0.55
C_rad | 3.5 | 3.066 | 1.435 | 3.188 | 1.51 | 3.361
layer1 | 4.52 | 1.446 | 3.92 | 1.614 | 4.04 | 1.784
layer2 | 5.32 | 1.186 | 4.72 | 1.604 | 4.52 | 1.607
layer3 | 5.44 | 0.993 | 3.48 | 1.798 | 3.72 | 1.807
minsplit (CARTmv) | 2.3 | 2.215 | 2.3 | 2.215 | 2.3 | 2.215
minbucket (CARTmv) | 1.2 | 0.99 | 1.2 | 0.99 | 1.2 | 0.99
cp (CARTmv) | 0.007 | 0.007 | 0.007 | 0.007 | 0.007 | 0.007
layer1 (NNmv) | 4.04 | 1.641 | 4.04 | 1.641 | 4.04 | 1.641
layer2 (NNmv) | 4.12 | 1.586 | 4.12 | 1.586 | 4.12 | 1.586
layer3 (NNmv) | 4.64 | 1.367 | 4.64 | 1.367 | 4.64 | 1.367
Table A5. Average and standard deviation (SD) of the hyperparameters for all responses of the DoE case study.

Hyperparameter | Average Y1 | SD Y1 | Average Y2 | SD Y2 | Average Y3 | SD Y3 | Average Y4 | SD Y4 | Average Y5 | SD Y5
k | 6.893 | 2.833 | 7.398 | 2.736 | 6.867 | 2.983 | 6.928 | 2.832 | 7.354 | 2.723
minsplit | 1.595 | 1.629 | 1.301 | 1.197 | 1.482 | 1.485 | 1.602 | 1.638 | 1.305 | 1.204
minbucket | 4.274 | 2.392 | 3.771 | 2.500 | 4.133 | 2.433 | 4.253 | 2.398 | 3.744 | 2.503
cp | 0.006 | 0.008 | 0.007 | 0.009 | 0.008 | 0.010 | 0.006 | 0.008 | 0.007 | 0.009
minsize | 5.905 | 3.442 | 5.157 | 3.042 | 5.964 | 3.355 | 5.952 | 3.435 | 5.159 | 3.061
mtry | 5.262 | 2.929 | 5.831 | 3.177 | 4.470 | 2.973 | 5.217 | 2.918 | 5.841 | 3.195
n.trees | 83.330 | 72.950 | 69.280 | 56.220 | 70.480 | 58.450 | 83.730 | 73.300 | 69.510 | 56.520
interaction.depth | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000
shrinkage | 0.100 | 0.000 | 0.100 | 0.000 | 0.100 | 0.000 | 0.100 | 0.000 | 0.100 | 0.000
n.minobsinnode | 10.000 | 0.000 | 10.000 | 0.000 | 10.000 | 0.000 | 10.000 | 0.000 | 10.000 | 0.000
nrounds | 138.100 | 80.520 | 147.000 | 85.310 | 150.000 | 79.630 | 136.700 | 80.050 | 148.200 | 85.150
max_depth | 2.786 | 1.440 | 2.928 | 1.536 | 3.000 | 1.514 | 2.771 | 1.443 | 2.939 | 1.542
eta | 0.356 | 0.050 | 0.351 | 0.050 | 0.354 | 0.050 | 0.355 | 0.050 | 0.351 | 0.050
gamma | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000
colsample_bytree | 0.721 | 0.098 | 0.752 | 0.086 | 0.721 | 0.098 | 0.723 | 0.098 | 0.754 | 0.085
min_child_weight | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000 | 1.000 | 0.000
subsample | 0.625 | 0.178 | 0.619 | 0.155 | 0.604 | 0.143 | 0.626 | 0.178 | 0.619 | 0.156
degree | 1.560 | 0.750 | 1.675 | 0.783 | 1.422 | 0.665 | 1.566 | 0.752 | 1.683 | 0.784
scale | 1.235 | 3.072 | 1.226 | 3.093 | 1.695 | 3.612 | 1.237 | 3.090 | 1.241 | 3.109
C_poly | 0.970 | 1.020 | 1.280 | 1.452 | 0.895 | 1.190 | 0.970 | 1.026 | 1.247 | 1.429
sigma | 0.057 | 0.007 | 0.057 | 0.007 | 0.056 | 0.006 | 0.057 | 0.007 | 0.057 | 0.007
C_rad | 3.330 | 4.385 | 3.602 | 5.383 | 3.873 | 7.118 | 3.322 | 4.411 | 3.634 | 5.408
layer1 | 4.024 | 1.657 | 3.639 | 1.566 | 4.313 | 1.696 | 4.000 | 1.653 | 3.659 | 1.565
layer2 | 3.905 | 1.580 | 3.976 | 1.638 | 3.880 | 1.541 | 3.928 | 1.576 | 3.951 | 1.632
layer3 | 3.786 | 1.584 | 3.373 | 1.651 | 3.518 | 1.611 | 3.807 | 1.581 | 3.366 | 1.659
minsplit (CARTmv) | 1.833 | 1.875 | 1.723 | 1.769 | 1.843 | 1.884 | 1.843 | 1.884 | 1.732 | 1.778
minbucket (CARTmv) | 4.929 | 2.064 | 4.976 | 2.030 | 4.795 | 2.151 | 4.916 | 2.073 | 4.963 | 2.039
cp (CARTmv) | 0.009 | 0.011 | 0.009 | 0.011 | 0.008 | 0.011 | 0.009 | 0.011 | 0.009 | 0.011
layer1 (NNmv) | 4.238 | 1.625 | 4.072 | 1.636 | 4.289 | 1.597 | 4.241 | 1.635 | 4.073 | 1.646
layer2 (NNmv) | 4.381 | 1.567 | 4.169 | 1.629 | 4.289 | 1.627 | 4.386 | 1.576 | 4.171 | 1.639
layer3 (NNmv) | 3.571 | 1.615 | 3.735 | 1.646 | 3.735 | 1.616 | 3.590 | 1.616 | 3.756 | 1.645

Appendix A.2. Boxplots of Errors for All Approaches and Responses

Figure A1. Response Y1: R = 50, n = 50, sd = 1.
Figure A2. Response Y2: R = 50, n = 50, sd = 1.
Figure A3. Response Y3: R = 50, n = 50, sd = 1.
Figure A4. Response Y1: R = 50, n = 30, sd = 1.
Figure A5. Response Y2: R = 50, n = 30, sd = 1.
Figure A6. Response Y3: R = 50, n = 30, sd = 1.
Figure A7. Response Y1: R = 50, n = 20, sd = 1.
Figure A8. Response Y2: R = 50, n = 20, sd = 1.
Figure A9. Response Y3: R = 50, n = 20, sd = 1.
Figure A10. Response Y1: R = 50, n = 15, sd = 1.
Figure A11. Response Y2: R = 50, n = 15, sd = 1.
Figure A12. Response Y3: R = 50, n = 15, sd = 1.
Figure A13. Response Y1: DoE case study, R = 50.
Figure A14. Response Y2: DoE case study, R = 50.
Figure A15. Response Y3: DoE case study, R = 50.
Figure A16. Response Y4: DoE case study, R = 50.
Figure A17. Response Y5: DoE case study, R = 50.

References

  1. Zhang, Y.; Wu, Y. Introducing Machine Learning Models to Response Surface Methodologies. In Response Surface Methodology in Engineering Science; IntechOpen: London, UK, 2021. [Google Scholar]
  2. Paturi, U.M.R.; Reddy, N.S.; Cheruku, S.; Narala, S.K.R.; Cho, K.K.; Reddy, M.M. Estimation of coating thickness in electrostatic spray deposition by machine learning and response surface methodology. Surf. Coat. Technol. 2021, 422, 127559. [Google Scholar] [CrossRef]
  3. Lashari, N.; Ganat, T.; Otchere, D.; Kalam, S.; Ali, I. Navigating viscosity of GO-SiO2/HPAM composite using response surface methodology and supervised machine learning models. J. Pet. Sci. Eng. 2021, 205, 108800. [Google Scholar] [CrossRef]
  4. Shozib, I.A.; Ahmad, A.; Rahaman, M.A.; Alam, M.; Beheshti, M.; Taufiqurrahman, I. Modelling and optimization of microhardness of electroless Ni-P-TiO2 composite coating based on machine learning approaches and RSM. J. Mater. Res. Technol. 2021, 12, 1010–1025. [Google Scholar] [CrossRef]
  5. Keshtegar, B.; Gholampour, A.; Thai, D.K.; Taylan, O.; Trung, N.T. Hybrid regression and machine learning model for predicting ultimate condition of FRP-confined concrete. Compos. Struct. 2021, 262, 113644. [Google Scholar] [CrossRef]
  6. Lou, H.; Chung, J.I.; Kiang, Y.H.; Xiao, L.Y.; Hageman, M.J. The application of machine learning algorithms in understanding the effect of core/shell technique on improving powder compactability. Int. J. Pharm. 2019, 555, 368–379. [Google Scholar] [CrossRef] [Green Version]
  7. Haque, S.; Khan, S.; Wahid, M.; Dar, S.A.; Soni, N.; Mandal, R.K.; Singh, V.; Tiwari, D.; Lohani, M.; Areeshi, M.Y.; et al. Artificial Intelligence vs. Statistical Modeling and Optimization of Continuous Bead Milling Process for Bacterial Cell Lysis. Front. Microbiol. 2016, 7, 1852. [Google Scholar] [CrossRef]
  8. Pilkington, J.L.; Preston, C.; Gomes, R.L. Comparison of response surface methodology (RSM) and artificial neural networks (ANN) towards efficient extraction of artemisinin from Artemisia annua. Ind. Crops Prod. 2014, 58, 15–24. [Google Scholar] [CrossRef]
  9. Bourquin, J.; Schmidli, H.; van Hoogevest, P.; Leuenberger, H. Advantages of Artificial Neural Networks (ANNs) as alternative modelling technique for data sets showing non-linear relationships using data from a galenical study on a solid dosage form. Eur. J. Pharm. Sci. 1998, 7, 5–16. [Google Scholar] [CrossRef] [PubMed]
  10. Souza Lima, E.; Lima, V.; Almeida, C.; Justi, K. Application of response surface methodology and machine learning combined with data simulation to metal determination of freshwater sediment. Water Air Soil Pollut. 2017, 228, 370. [Google Scholar] [CrossRef]
  11. Bi, Q.; Goodman, K.E.; Kaminsky, J.; Lessler, J. What is Machine Learning? A Primer for the Epidemiologist. Am. J. Epidemiol. 2019, 188, 2222–2239. [Google Scholar] [CrossRef]
  12. Crisci, C.; Terra, R.; Pacheco, J.; Ghattas, B.; Bidegain, M.; Goyenola, G.; Lagomarsino, J.; Méndez, G.; Mazzeo, N. Multi-model approach to predict phytoplankton biomass and composition dynamics in a eutrophic shallow lake governed by extreme meteorological events. Ecol. Model. 2017, 360, 80–93. [Google Scholar] [CrossRef]
  13. Myers, R.H.; Montgomery, D.C.; Anderson-Cook, C.M. Response Surface Methodology: Process and Product in Optimization Using Designed Experiments; John Wiley and Sons: New York, NY, USA, 1995. [Google Scholar]
  14. Sarabia, L.; Ortiz, M. 1.12—Response Surface Methodology. In Comprehensive Chemometrics; Brown, S.D., Tauler, R., Walczak, B., Eds.; Elsevier: Oxford, UK, 2009; pp. 345–390. [Google Scholar] [CrossRef]
  15. Manzon, D.; Claeys-Bruno, M.; Declomesnil, S.; Carité, C.; Sergent, M. Quality by Design: Comparison of Design Space construction methods in the case of Design of Experiments. Chemom. Intell. Lab. Syst. 2020, 200, 104002. [Google Scholar] [CrossRef]
  16. dos Moreira, C.S.; Lourenço, F.R. Development and optimization of a stability-indicating chromatographic method for verapamil hydrochloride and its impurities in tablets using an analytical quality by design (AQbD) approach. Microchem. J. 2020, 154, 104610. [Google Scholar] [CrossRef]
  17. Hastie, T.; Tibshirani, R.; Friedman, J.; Franklin, J. The elements of statistical learning: Data mining, inference, and prediction. Math. Intell. 2004, 27, 83–85. [Google Scholar] [CrossRef]
  18. Breiman, L.; Friedman, J.; Stone, C.J.; Olshen, R.A. Classification and Regression Trees; Chapman and Hall/CRC: Boca Raton, FL, USA, 1984. [Google Scholar]
  19. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef] [Green Version]
  20. Freund, Y.; Schapire, R.E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef] [Green Version]
  21. Natekin, A.; Knoll, A. Gradient boosting machines, a tutorial. Front. Neurorobot. 2013, 7, 21. [Google Scholar] [CrossRef] [Green Version]
  22. Chen, T.; He, T. xgboost: eXtreme Gradient Boosting. 2021. Available online: https://cran.r-project.org/web/packages/xgboost/vignettes/xgboost.pdf (accessed on 27 July 2023).
  23. Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: Cambridge, UK, 2000. [Google Scholar] [CrossRef]
  24. Nerini, D.; Durbec, J.; Mante, C.; Garcia, F.; Ghattas, B. Forecasting physicochemical variables by a classification tree method: Application to the Berre Lagoon (South France). Acta Biotheor. 2001, 48, 181–196. [Google Scholar] [CrossRef]
  25. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  26. Nelder, J.A.; Wedderburn, R.W.M. Generalized linear models. J. R. Stat. Soc. Ser. (Gen.) 1972, 135, 370. [Google Scholar] [CrossRef]
  27. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning representations by back-propagating errors. Nature 1986, 323, 533–536. [Google Scholar] [CrossRef]
  28. Marquardt, D.W.; Snee, R.D. Ridge regression in practice. Am. Stat. 1975, 29, 3–20. [Google Scholar] [CrossRef]
  29. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely randomized trees. Mach. Learn. 2006, 63, 3–42. [Google Scholar] [CrossRef] [Green Version]
  30. Schaffer, J.; Whitley, D.; Eshelman, L. Combinations of genetic algorithms and neural networks: A survey of the state of the art. In Proceedings of the COGANN-92: International Workshop on Combinations of Genetic Algorithms and Neural Networks, Baltimore, MD, USA, 6 June 1992; pp. 1–37. [Google Scholar] [CrossRef]
  31. Jie, J.; Zeng, J.; Han, C. An extended mind evolutionary computation model for optimizations. Appl. Math. Comput. 2007, 185, 1038–1049. [Google Scholar] [CrossRef]
  32. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  33. Aljazzar, H.; Leue, S. K*: A heuristic search algorithm for finding the k shortest paths. Artif. Intell. 2011, 175, 2129–2154. [Google Scholar] [CrossRef] [Green Version]
  34. RStudio Team. RStudio: Integrated Development Environment for R; RStudio, PBC: Boston, MA, USA, 2020. [Google Scholar]
  35. Kuhn, M. Building predictive models in R using the caret package. J. Stat. Softw. 2008, 28, 1–26. [Google Scholar] [CrossRef] [Green Version]
  36. Dai, Y.; Yang, C.; Liu, Y.; Yao, Y. Latent-Enhanced Variational Adversarial Active Learning Assisted Soft Sensor. IEEE Sens. J. 2023, 23, 15762–15772. [Google Scholar] [CrossRef]
  37. Zhu, J.; Jia, M.; Zhang, Y.; Deng, H.; Liu, Y. Transductive transfer broad learning for cross-domain information exploration and multigrade soft sensor application. Chemom. Intell. Lab. Syst. 2023, 235, 104778. [Google Scholar] [CrossRef]
  38. Jia, M.; Xu, D.; Yang, T.; Liu, Y.; Yao, Y. Graph convolutional network soft sensor for process quality prediction. J. Process Control 2023, 123, 12–25. [Google Scholar] [CrossRef]
  39. Liu, K.; Zheng, M.; Liu, Y.; Yang, J.; Yao, Y. Deep Autoencoder Thermography for Defect Detection of Carbon Fiber Composites. IEEE Trans. Ind. Inform. 2023, 19, 6429–6438. [Google Scholar] [CrossRef]
Figure 1. Example of surface responses [16].
Figure 2. Left-hand panel: a neuron with d inputs. Right-hand panel: a NN with a one-dimensional input and output, and one hidden layer containing three neurons.
Figure 3. Response surfaces of the simulated model.
Table 1. Summary of RSM vs. ML studies—PM: Polynomial Model, LASSO: Least Absolute Shrinkage and Selection Operator, GLM: Generalized Linear Model, rf: random forest, GBDT: Gradient Boosting Decision Tree, MLP: Multiple Layer Perceptron, SVR: Support Vector Regression, BPNN: Back-Propagation Neural Network, SVM: Support Vector Machine, CART: Classification and Regression Tree, ET: Extra Tree regression, GBR: Gradient Boosting Regression, xgboost: eXtreme Gradient Boosting, NN: Neural Network, GA-BPNN: Genetic Algorithm Back-Propagation Neural Network, MEA-BPNN: Mind Evolutionary Algorithm-Based BPNN, ELM: Extreme Learning Machine, EV: Explained Variance, MAPE: Mean Absolute Percentage Error, RMSE: Root Mean Square Error, AIC: Akaike Information Criterion, MAE: Mean Absolute Error, R²: coefficient of determination, MSE: Mean Square Error, NSE: Nash and Sutcliffe Efficiency, d: agreement index, CV RMSE: Coefficient of Variation of the RMSE, AAD: Absolute Average Deviation.

| References | # Factors | # Responses | n | Models | Criteria |
|---|---|---|---|---|---|
| [1] | 6 | 3 | 156,512 | PM (1), PM (2), linear ML model, LASSO, GLM, rf, GBDT, MLP, SVR | EV, MAPE, RMSE |
| [2] | 3 | 1 | 30 | PM (2), BPNN, SVM | MAPE |
| [3] | 5 | 1 | 57 | PM (2), Ridge regression, LASSO, SVM, CART, rf, ET, GBR, xgboost | AIC, MAE, RMSE, R² |
| [4] | 5 | 1 | 36 | PM (2), NN, SVM, rf, ET | MSE, MAE, R² |
| [5] | 8 | 2 | 765 | PM (2), SVR, RSM-SVR ¹ | RMSE, MAE, NSE, d |
| [6] | 2 (+2) ² | 2 | 28 | PM (2), BPNN, GA-BPNN, MEA-BPNN, ELM | CV RMSE, RMSE, AIC (for NN) |
| [10] | 3 | 6 | 18 | PM (2), Lazy KStar, rf | RMSE, R² |
| [8] | 3 | 1 | 20 | PM (2), MLP | AAD |
| [9] | 6 | 2 | 102 | PM (2), MLP | R² |

¹ hybrid model. ² binary factors.
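Most of the error criteria listed in Table 1 are simple functionals of the residuals. As a reference, the following R sketch spells out four of the most common ones, assuming numeric vectors y (observed) and yhat (predicted) of equal length.

```r
# Error criteria from Table 1, written out for observed y and predicted yhat.
rmse <- function(y, yhat) sqrt(mean((y - yhat)^2))         # Root Mean Square Error
mae  <- function(y, yhat) mean(abs(y - yhat))              # Mean Absolute Error
mape <- function(y, yhat) 100 * mean(abs((y - yhat) / y))  # Mean Absolute Percentage Error
r2   <- function(y, yhat) 1 - sum((y - yhat)^2) / sum((y - mean(y))^2)  # coefficient of determination
```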
Table 2. Hyperparameter values tested for each method. knn: k-nearest neighbors, rf: random forest, svmPoly: support vector machine with polynomial kernel, svmRadial: support vector machine with radial basis function kernel, CART: Classification And Regression Tree, gbm: gradient boosting machine, xgboost: eXtreme Gradient Boosting, NN: Neural Network, CARTmv: Classification and Regression Tree with multivariate responses, NNmv: Neural Network with multivariate responses.

| Method | Hyperparameter grid | Description |
|---|---|---|
| knn | k: {1, …, 10} | number of neighbors |
| CART | minsplit: {1, 5, 10} | minimum number of observations to attempt a split |
| | cp: {0.002, 0.005, 0.01, 0.015, 0.02, 0.03} | complexity parameter |
| | minbucket: {1, 5, 10} | minimum number of observations in a leaf |
| rf | minsize: {2, 5, 10} | minimum size of a node |
| | mtry: {1, …, 5} | number of variables to test for each split |
| gbm | n.trees: {50, 100, 150, 200, 250} | total number of trees to fit |
| | interaction.depth: {1, 2, 3, 4, 5} | maximum depth of trees |
| | shrinkage: 0.1 | learning rate |
| | n.minobsinnode: 10 | minimum number of observations in leaves |
| xgboost | nrounds: {50, 100, 150, 200, 250} | maximum number of boosting iterations |
| | max_depth: {1, 2, 3, 4, 5} | maximum depth of a tree |
| | eta: {0.3, 0.4} | controls the learning rate |
| | gamma: 0 | minimum loss reduction |
| | colsample_bytree: {0.6, 0.8} | subsample ratio of columns when constructing each tree |
| | min_child_weight: 1 | minimum sum of instance weights needed in a child |
| | subsample: {0.5, 0.625, 0.750, 0.875, 1} | subsample ratio of the training instances |
| svmPoly | degree: {0.25, 0.5, 1, 2, 4} | polynomial degree |
| | scale: {0.001, 0.01, 0.1, 1, 10} | polynomial scaling factor |
| | C_poly: {1, 2, 3} | control parameter |
| svmRadial | sigma: constant value | bandwidth of the kernel function |
| | C_rad: {0.25, 0.5, 1, 2, 4, 8, 16, 32} | control parameter |
| NN | layer1, layer2, layer3: {2, 4, 6, 8, 10} | number of hidden neurons in each layer |
| CARTmv | minsplit: {1, 5, 10} | minimum number of observations to attempt a split |
| | cp: {0.002, 0.005, 0.01, 0.015, 0.02, 0.03} | complexity parameter |
| | minbucket: {1, 5, 10} | minimum number of observations in a leaf |
| NNmv | layer1, layer2, layer3: {2, 4, 6, 8, 10} | number of hidden neurons in each layer |
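For concreteness, here is a minimal R sketch, using the caret package [35], of how one of the grids in Table 2 (the knn grid) can be tuned by cross-validated RMSE. This is an illustration, not the authors' exact script; the data frame doe and its columns are hypothetical stand-ins for a DoE dataset with one response.

```r
# Minimal sketch: tuning the knn grid of Table 2 with caret [35].
# 'doe' is a hypothetical stand-in (two factors x1, x2, one response y).
library(caret)

set.seed(1)
doe <- data.frame(x1 = runif(30), x2 = runif(30))
doe$y <- 2 * doe$x1 - doe$x2 + rnorm(30, sd = 0.1)

grid <- expand.grid(k = 1:10)                    # k: {1, ..., 10}, as in Table 2
ctrl <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

fit <- train(y ~ ., data = doe, method = "knn",
             tuneGrid = grid, trControl = ctrl, metric = "RMSE")
fit$bestTune  # number of neighbors with the lowest cross-validated RMSE
```

The same call pattern applies to the other univariate methods in Table 2 by changing method and tuneGrid.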
Table 3. RMSE and rank for each method and each response (n = 50). Gray cells highlight the best model.

| Method | S1 | S2 | S3 | S_ave | R1 | R2 | R3 | R_ave |
|---|---|---|---|---|---|---|---|---|
| lm | 1.067 ± 0.118 | 1.070 ± 0.136 | 1.116 ± 0.116 | 1.072 | 2 | 5 | 5 | 2 |
| knn | 1.629 ± 0.190 | 1.063 ± 0.112 | 1.093 ± 0.108 | 1.262 | 7 | 3 | 7 | 7 |
| CART | 1.715 ± 0.246 | 1.155 ± 0.154 | 1.192 ± 0.130 | 1.354 | 9 | 10 | 11 | 9 |
| rf | 1.338 ± 0.177 | 1.092 ± 0.138 | 1.123 ± 0.126 | 1.184 | 5 | 7 | 8 | 5 |
| gbm | 1.296 ± 0.161 | 1.067 ± 0.133 | 1.085 ± 0.132 | 1.149 | 4 | 4 | 6 | 4 |
| xgb | 1.390 ± 0.198 | 1.121 ± 0.147 | 1.150 ± 0.141 | 1.220 | 6 | 8 | 9 | 6 |
| svmpoly | 1.085 ± 0.120 | 1.054 ± 0.127 | 1.056 ± 0.129 | 1.065 | 3 | 1 | 2 | 1 |
| svmrad | 1.706 ± 0.626 | 1.072 ± 0.121 | 1.066 ± 0.119 | 1.281 | 8 | 6 | 3 | 8 |
| NN | 6.102 ± 1.870 | 1.061 ± 0.127 | 1.047 ± 0.109 | 2.737 | 10 | 2 | 1 | 10 |
| CARTmv | 8.331 ± 0.676 | 1.324 ± 0.147 | 1.151 ± 0.154 | 3.602 | 11 | 11 | 10 | 11 |
| NNmv | 1.063 ± 0.117 | 1.122 ± 0.130 | 1.070 ± 0.112 | 1.085 | 1 | 9 | 4 | 3 |
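The rank columns in Tables 3–6 order the methods by mean RMSE within each response, and R_ave ranks the row means. The short R sketch below illustrates this bookkeeping on a hypothetical three-method excerpt of Table 3; the ranks it produces are relative to this excerpt only, not to the full table.

```r
# Deriving rank columns from a matrix of mean RMSEs
# (rows = methods, columns = responses S1, S2, S3).
rmse_mat <- matrix(c(1.067, 1.070, 1.116,   # lm      (mean RMSEs from Table 3)
                     1.629, 1.063, 1.093,   # knn
                     1.085, 1.054, 1.056),  # svmpoly
                   ncol = 3, byrow = TRUE,
                   dimnames = list(c("lm", "knn", "svmpoly"),
                                   c("S1", "S2", "S3")))

ranks <- apply(rmse_mat, 2, rank)           # R1, R2, R3: rank within each response
s_ave <- rowMeans(rmse_mat)                 # S_ave: mean RMSE over the responses
cbind(ranks, R_ave = rank(s_ave))           # R_ave: rank of the row means
```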
Table 4. RMSE and rank for each method and each response (n = 30). Gray cells highlight the best model.

| Method | S1 | S2 | S3 | S_ave | R1 | R2 | R3 | R_ave |
|---|---|---|---|---|---|---|---|---|
| lm | 1.177 ± 0.184 | 1.138 ± 0.184 | 1.123 ± 0.189 | 1.146 | 2 | 8 | 8 | 3 |
| knn | 2.036 ± 0.527 | 1.088 ± 0.129 | 1.067 ± 0.149 | 1.397 | 8 | 2 | 4 | 7 |
| CART | 2.003 ± 0.451 | 1.167 ± 0.192 | 1.140 ± 0.166 | 1.437 | 7 | 9 | 9 | 9 |
| rf | 1.587 ± 0.385 | 1.121 ± 0.135 | 1.117 ± 0.152 | 1.275 | 5 | 7 | 7 | 5 |
| gbm | 1.550 ± 0.327 | 1.116 ± 0.140 | 1.100 ± 0.159 | 1.255 | 4 | 5 | 6 | 4 |
| xgb | 1.601 ± 0.380 | 1.216 ± 0.155 | 1.194 ± 0.193 | 1.337 | 6 | 10 | 11 | 6 |
| svmpoly | 1.199 ± 0.207 | 1.100 ± 0.189 | 1.055 ± 0.135 | 1.118 | 3 | 4 | 2 | 1 |
| svmrad | 2.057 ± 0.798 | 1.093 ± 0.152 | 1.059 ± 0.146 | 1.403 | 9 | 3 | 3 | 8 |
| NN | 6.248 ± 1.930 | 1.084 ± 0.144 | 1.038 ± 0.142 | 2.790 | 10 | 1 | 1 | 10 |
| CARTmv | 8.104 ± 0.895 | 1.335 ± 0.222 | 1.184 ± 0.203 | 3.541 | 11 | 11 | 10 | 11 |
| NNmv | 1.164 ± 0.172 | 1.117 ± 0.165 | 1.081 ± 0.152 | 1.121 | 1 | 6 | 5 | 2 |
Table 5. RMSE and rank for each method and each response (n = 20). Gray cells highlight the best model.

| Method | S1 | S2 | S3 | S_ave | R1 | R2 | R3 | R_ave |
|---|---|---|---|---|---|---|---|---|
| lm | 1.205 ± 0.260 | 1.203 ± 0.233 | 1.196 ± 0.212 | 1.201 | 2 | 9 | 10 | 3 |
| knn | 2.309 ± 0.549 | 1.110 ± 0.195 | 1.064 ± 0.174 | 1.494 | 9 | 1 | 4 | 9 |
| CART | 2.091 ± 0.589 | 1.179 ± 0.231 | 1.139 ± 0.210 | 1.470 | 7 | 8 | 8 | 8 |
| rf | 1.761 ± 0.485 | 1.147 ± 0.180 | 1.091 ± 0.177 | 1.333 | 5 | 6 | 6 | 4 |
| gbm | 1.915 ± 0.444 | 1.141 ± 0.196 | 1.099 ± 0.189 | 1.385 | 6 | 5 | 7 | 6 |
| xgb | 1.706 ± 0.388 | 1.258 ± 0.197 | 1.174 ± 0.183 | 1.379 | 4 | 10 | 9 | 5 |
| svmpoly | 1.236 ± 0.236 | 1.127 ± 0.192 | 1.048 ± 0.164 | 1.137 | 3 | 3 | 3 | 2 |
| svmrad | 2.207 ± 0.979 | 1.123 ± 0.184 | 1.045 ± 0.149 | 1.458 | 8 | 2 | 2 | 7 |
| NN | 5.397 ± 2.040 | 1.140 ± 0.191 | 1.014 ± 0.153 | 2.517 | 10 | 4 | 1 | 10 |
| CARTmv | 7.959 ± 1.100 | 1.404 ± 0.254 | 1.270 ± 0.252 | 3.545 | 11 | 11 | 11 | 11 |
| NNmv | 1.172 ± 0.241 | 1.154 ± 0.196 | 1.064 ± 0.177 | 1.130 | 1 | 7 | 5 | 1 |
Table 6. RMSE and rank for each method and each response (n = 15). Gray cells highlight the best model.

| Method | S1 | S2 | S3 | S_ave | R1 | R2 | R3 | R_ave |
|---|---|---|---|---|---|---|---|---|
| lm | 1.436 ± 0.491 | 1.371 ± 0.512 | 1.293 ± 0.314 | 1.367 | 3 | 10 | 11 | 3 |
| knn | 2.670 ± 0.818 | 1.094 ± 0.192 | 1.063 ± 0.199 | 1.611 | 8 | 1 | 3 | 8 |
| CART | 2.424 ± 0.693 | 1.220 ± 0.302 | 1.151 ± 0.193 | 1.598 | 6 | 8 | 8 | 6 |
| rf | 2.149 ± 0.730 | 1.133 ± 0.218 | 1.079 ± 0.170 | 1.454 | 5 | 4 | 5 | 4 |
| gbm | 2.697 ± 0.741 | 1.187 ± 0.287 | 1.100 ± 0.161 | 1.661 | 9 | 6 | 7 | 9 |
| xgb | 1.910 ± 0.530 | 1.289 ± 0.290 | 1.221 ± 0.209 | 1.473 | 4 | 9 | 9 | 5 |
| svmpoly | 1.401 ± 0.425 | 1.194 ± 0.426 | 1.045 ± 0.199 | 1.213 | 2 | 7 | 2 | 2 |
| svmrad | 2.576 ± 1.080 | 1.162 ± 0.274 | 1.087 ± 0.229 | 1.608 | 7 | 5 | 6 | 7 |
| NN | 5.624 ± 2.260 | 1.102 ± 0.174 | 1.007 ± 0.182 | 2.577 | 10 | 2 | 1 | 10 |
| CARTmv | 8.267 ± 1.370 | 1.437 ± 0.282 | 1.247 ± 0.292 | 3.650 | 11 | 11 | 10 | 11 |
| NNmv | 1.332 ± 0.455 | 1.109 ± 0.185 | 1.077 ± 0.193 | 1.173 | 1 | 3 | 4 | 1 |
Table 7. RMSE and rank for each method and each response (DOE case study). Gray cells highlight the best model.

| Method | S1 | S2 | S3 | S4 | S5 | S_ave | R1 | R2 | R3 | R4 | R5 | R_ave |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lm | 0.274 ± 0.115 | 7.69 ± 3.060 | 13.35 ± 5.080 | 5.59 ± 2.740 | 0.843 ± 0.338 | 5.55 | 8 | 11 | 11 | 10 | 11 | 11 |
| knn | 0.268 ± 0.081 | 3.15 ± 0.832 | 8.33 ± 1.540 | 4.80 ± 2.000 | 0.310 ± 0.078 | 3.37 | 6 | 8 | 2 | 6 | 1 | 4 |
| CART | 0.292 ± 0.094 | 3.10 ± 0.903 | 9.85 ± 2.130 | 6.30 ± 1.940 | 0.378 ± 0.109 | 3.98 | 9 | 6 | 10 | 11 | 9 | 10 |
| rf | 0.258 ± 0.086 | 2.54 ± 0.767 | 9.38 ± 1.880 | 4.99 ± 1.640 | 0.350 ± 0.075 | 3.50 | 5 | 3 | 7 | 7 | 7 | 8 |
| gbm | 0.256 ± 0.078 | 2.17 ± 0.778 | 8.90 ± 2.500 | 5.49 ± 1.460 | 0.376 ± 0.099 | 3.44 | 4 | 1 | 4 | 8 | 8 | 5 |
| xgb | 0.198 ± 0.085 | 2.23 ± 0.903 | 9.20 ± 2.140 | 4.28 ± 1.730 | 0.343 ± 0.095 | 3.25 | 1 | 2 | 6 | 3 | 5 | 2 |
| svmpoly | 0.254 ± 0.087 | 3.36 ± 1.510 | 9.20 ± 1.720 | 4.33 ± 1.770 | 0.347 ± 0.129 | 3.50 | 3 | 10 | 5 | 4 | 6 | 7 |
| svmrad | 0.242 ± 0.077 | 2.80 ± 0.985 | 8.89 ± 1.550 | 4.40 ± 1.550 | 0.329 ± 0.093 | 3.33 | 2 | 4 | 3 | 5 | 3 | 3 |
| NN | 0.273 ± 0.072 | 3.18 ± 1.090 | 9.77 ± 2.430 | 3.70 ± 1.770 | 0.316 ± 0.060 | 3.45 | 7 | 9 | 9 | 1 | 2 | 6 |
| CARTmv | 0.311 ± 0.077 | 2.89 ± 1.090 | 9.54 ± 1.600 | 5.52 ± 1.310 | 0.339 ± 0.088 | 3.72 | 10 | 5 | 8 | 9 | 4 | 9 |
| NNmv | 0.491 ± 0.616 | 3.13 ± 1.280 | 6.98 ± 1.940 | 3.98 ± 1.660 | 0.614 ± 0.467 | 3.04 | 11 | 7 | 1 | 2 | 10 | 1 |