1. Introduction
Concrete is the most widely used material in civil construction worldwide, serving as a fundamental component of countless structures and infrastructure systems. A total of 4.1 billion tons of cement, the primary raw material for concrete, was produced in 2022 [1], and the demand for concrete is projected to reach 18 billion tons by 2050 [2]. This demand is linked to the characteristics that make concrete an attractive structural material: its ability to be cast into different shapes, its long service life, its low maintenance demands, and its resistance to adverse weather conditions [3,4]. High rates of urbanization, economic development, and population growth explain the increase in the use of concrete over the years, and these same factors have increased the demand for cement in recent years [5,6].
The mechanical and physical properties of concrete are essential information during the structural design phase, since each type of project has specific requirements, such as compressive strength, durability, and specific weight [7]. There are numerous types of concrete, each with a particular application and with characteristics suited to project requirements. Mechanical properties are linked to the components of the concrete mix and the quantity of each. Concrete mixtures contain the following essential components: fine and coarse aggregates, water, and cement. Additions and additives can also be included in different mixtures. Mineral materials with cementing or pozzolanic properties are typically used as additions, and chemical products can be used as additives to reduce the water–cement ratio [8].
The mechanical properties of concrete are usually determined through laboratory tests. These tests are standardized and enable the determination of properties such as compressive strength, modulus of elasticity, and tensile strength, among others [9]. One of the most important mechanical properties to be defined at the time of structural design is the compressive strength that a given concrete must possess in order to support the load of the structure and other loads to which it will be subjected [10,11]. A possible way to reduce the need for laboratory tests is to establish a relationship between the amount of each component in the mixture and the mechanical properties of the concrete. This approach allows the mixing parameters to be defined from a value established in the project for a given mechanical property, such as compressive strength. Defining this relationship is not a trivial task, however, since it is highly nonlinear [12]. Machine learning methods have been used to address this difficulty [13,14,15,16,17].
Machine learning methods have been used to solve problems in different research areas, creating models capable of modeling highly nonlinear problems in science and engineering [
18,
19,
20]. In the case of predicting the compressive strength of concrete, the input parameters include information relating to the components of the mixture and other extra information, such as the curing time of the specimen; the output parameter is the compressive strength [
21,
22].
In the literature, one can find works that use machine learning methods to estimate the mechanical properties of concrete [23]. Abd and Abd [24] carried out a study using nonstandard regression methods, a multivariate linear model, and a support vector machine (SVM) to predict the compressive strength of concrete. The dataset used in this work came from 150 mixtures of lightweight foamed, or cellular, concrete. Using a least-squares loss function, the multivariate nonlinear regression model obtained a correlation coefficient of 0.958, indicating good agreement between the actual and predicted values. The SVM models using the radial basis function, linear, polynomial, and sigmoid kernels yielded correlation coefficients of 0.986, 0.951, 0.976, and 0.851, respectively.
Ahmadi-Nedushan [25] analyzed the ability of the k-nearest neighbors (KNN) algorithm to predict the compressive strength of concrete using four variations of the KNN algorithm, alongside generalized regression neural networks, stepwise regression, and a modular neural network. The dataset was composed of 104 samples of high-performance concrete mixes; each sample was made up of seven input values related to the elements of the mixture and the compressive strength of the concrete after 28 days. The KNN-derived algorithms implemented by the author used a set of weights to express the significance of each attribute. This approach presented the best results among those evaluated, reaching a correlation coefficient of 0.984 and a root mean squared error (RMSE) of 1.174.
The prediction of compressive strength for high-performance concretes was also addressed by Al-Shamiri et al. [
26]. In this paper, a comparison between a neural network trained with the backpropagation algorithm and an extreme learning machine (ELM) was performed. The dataset used was composed of 324 samples and consisted of five parameters of the mixture and the compressive strength at 28 days. The tests carried out with both approaches obtained satisfactory results, with correlation coefficients in the order of 0.990. An algorithm based on decision trees was used by Behnood et al. [
27] to predict the compressive strength of normal and high-performance concrete. The correlation coefficient was 0.900, indicating that strategies using trees are appropriate for this type of problem.
Gilan et al. [
28] used the particle swarm optimization (PSO) algorithm in conjunction with an SVM to predict the compressive strength of concrete mixtures containing metakaolin. The results obtained by the authors indicate that the use of the PSO-SVM hybrid model guarantees a greater ability to predict the compressive strength of concrete mixtures. In the work of Qi et al. [
29], PSO was used to optimize the architecture parameters of an ANN. The results indicate that the use of PSO in conjunction with the ANN guaranteed good forecast quality, since the values predicted by the model were close to the experimental values.
Ly et al. [30] optimized artificial neural networks (ANNs) for faster and more accurate prediction of key properties of fly ash composites (FC). Using particle swarm optimization (PSO), the study fine-tuned the structure and parameters of the ANN. The results showed excellent prediction accuracy, with a strong correlation between the predicted and actual values. Additionally, the study identified the most influential factors affecting the properties under study, providing valuable insights for FC design and optimization.
Huang et al. [
31] optimized the hyperparameters of a random forest (RF) model using the firefly algorithm (FA) to achieve significant performance gains. The resulting hybrid FA-RF model demonstrated high accuracy in predicting the concrete compressive strength, as evidenced by the strong R-squared and low root mean squared error (RMSE) values and the close alignment between the predicted and actual values.
Zhu et al. [32] presented two hybrid support vector regression models, AOSVR and ALSVR, optimized by advanced algorithms. These models accurately predicted concrete compressive strength from ingredient data, with AOSVR achieving the higher precision. Implementing these models can reduce testing costs and improve concrete characterization analysis.
Understanding and forecasting scenarios of interest to experts, academics, and decision-makers has become extremely difficult due to the growing complexity of contemporary issues and the exponential growth of data from diverse processes. Models that can depict nonlinear relationships between inputs and outputs and are resilient to data noise and uncertainties are necessary for complex machine learning and data science challenges.
The use of stacked models, which have the ability to improve the accuracy displayed by individual models, can help address these issues [
33]. Additionally, these models function in accordance with their topology, enabling the aggregation of models with various capacities. Because of this, it is possible to create multiple stacking models by using different algorithms in the first layer of the stacked architecture. This process enables various algorithms to identify patterns in the training data, combining the models in the first layer to produce accurate results [
34].
Because it uses ensemble learning to combine the strengths of several predictive models, stacking is a useful technique for predicting concrete compressive strength that may provide better accuracy and generalization than individual models. Although current techniques like neural networks, decision trees, and regression models have demonstrated promising outcomes, stacking potentially surpasses them by lowering model bias and variance by combining various algorithms.
This paper aims to combine consolidated strategies with an approach that has still been little explored, known as stacking, to predict the compressive strength of concrete. The stacking approach is a layered learning strategy in which, in the first layer, a set of machine learning algorithms is individually trained on the available data. In the second layer, a metamodel is responsible for making the final prediction, taking as input the predictions made by the first-layer models [35]. It is worth noting that stacking can have a greater number of layers, but in this paper only two layers are used, as shown in the schematic model in Figure 1.
The objectives of this paper are as follows:
To investigate the efficiency of the use of stacking as a technique for predicting the compressive strength of concrete mixtures;
To examine whether the results obtained with stacking are at least similar to those obtained with the individual use of computational intelligence techniques;
To investigate the best strategy for utilizing PSO as a tool for optimizing the parameters of machine learning models.
The PSO algorithm was chosen due to its simple usage, requiring fewer parameters to adjust compared to other algorithms. In addition, PSO strikes a good balance between exploring the search space for possible solutions and refining the most promising ones, which helps avoid getting stuck in suboptimal results. It is also effective at searching the entire solution space, similar to genetic algorithms, but with simpler and faster update mechanisms.
Furthermore, while other algorithms such as genetic algorithms (GA) [
36] or pattern search (PS) [
37] were considered, they have limitations. GA often requires more fine-tuning of parameters, such as mutation and crossover rates, and can be slower due to its reliance on random processes. PS, on the other hand, is more suited for local optimization and does not explore the solution space globally as effectively as PSO or GA. PSO has been used in some other works in the literature for problems related to concrete, such as optimizing concrete composition [
38] and predicting mechanical properties [
39].
The remainder of this paper is organized as follows. Section 2 presents the datasets used to test the efficiency of the proposed approaches, details the machine learning methods employed together with the ensemble learning and optimization strategies, and describes the metrics and statistical tests used to evaluate the results. Section 3 presents and analyzes the results. Finally, Section 4 presents the conclusions.
2. Materials and Methods
2.1. Experimental Data
In this paper, four datasets are used to analyze the prediction capacity of the proposed model. Each dataset consists of experimental data on different types of concrete mixtures, as described below.
The first dataset (D1) was obtained from [40] and contains information on 104 high-performance concrete samples. The cylindrical specimens (
mm) were removed from the molds after 24 h, cured for 28 days, and subsequently tested to verify the compressive strength.
Table 1 shows the mixture parameters, their respective maximum and minimum values, and the corresponding compressive strength values.
Figure 2 shows the correlation matrix; the closer the value is to one, the greater the degree of correlation is, and the closer the value is to negative one, the greater the inverse correlation degree is. It can be observed that AE and SP have a high correlation with CS and AE is also highly correlated with SP. Additionally, AE, SP, and CS are inversely correlated with w/c.
The second dataset (D2) was collected from [41], which investigated the influence of fly ash and silica fume on the compressive strength of high-performance concrete. A total of 24 different mixtures with varying quantities of the components are presented in
Table 2. Compression tests were performed for the following six different curing periods: 3, 7, 28, 56, 90, and 180 days. In total, 144 tests were carried out, thus composing the dataset used in this study. The correlation matrix for the dataset D2 parameters can be found in
Figure 3. It can be seen that the correlation values are lower than for D1.
The D3 dataset was provided by [
42], who investigated artificial neural networks to predict the compressive strength of self-compacting concrete containing fly ash. A compatible dataset was used for the materials present in the mixtures. Dataset D3 includes 80 samples, and the components of the mixtures are presented in
Table 3.
Figure 4 displays the correlation matrix for dataset D3; we can observe that the correlation values are low for all components of the mixtures and also for their relationships with compressive strength.
Finally, dataset D4 was extracted from [26]. The dataset has 324 samples, each comprising six parameters: five mixture parameters and the compressive strength at 28 days. The dataset’s maximum and minimum values are presented in
Table 4.
Figure 5 presents the correlation matrix for dataset D4. C is the component of the mixtures that correlates most highly with CS, followed by SP.
2.2. Regression Methods
The following five regression models were used in this paper: an artificial neural network (ANN), decision trees (DTs), an extreme learning machine (ELM), K-nearest neighbors (KNNs), and support vector machines (SVMs). These machine learning methods were chosen due to their widespread use in modeling engineering problems and their availability in machine learning libraries in different programming languages, allowing research reproducibility.
2.2.1. Artificial Neural Networks (ANNs)
Artificial neural networks are machine learning algorithms, inspired by the nervous system of animals, that are capable of learning, generalizing, and organizing data [43]. The smallest unit of an ANN is the artificial neuron, shown in Figure 6; the interconnection of neurons represents the synaptic communication process of biological neurons.
In addition, the processing unit of a neuron can be represented by an activation function, which takes input information and generates output information sent to other neurons. Three activation functions were employed, as presented in
Table 5.
A multilayer perceptron (MLP) is a neural network that has at least one hidden layer; since it can have multiple hidden layers, the MLP is widely used and recommended for solving nonlinear problems. In this work, the neural network has three hidden layers, each composed of up to one hundred neurons, with a simple unidirectional feedforward topology.
The learning process of an ANN consists of adjusting the weights ($w$) to minimize the network prediction error. Several algorithms in the literature are capable of carrying out network training. In this work, L-BFGS is used, since it can solve problems with a large number of variables while controlling the amount of memory used.
2.2.2. Decision Trees (DTs)
Decision trees (DTs) are machine learning algorithms capable of generating expert systems to solve classification and regression problems. A DT is built from a set of tests performed on the input data. The internal nodes of the tree represent these tests; in regression problems, the tests are quantitative and compare the value of an attribute with a split value. Tests are typically performed by checking whether the value of a feature is greater or less than the split value. The leaf nodes of the tree store the value returned for a given input. The return value of a leaf node is defined as the mean of the output values of all training samples that reach that node.
The classification and regression trees (CART) algorithm was used to induce binary trees, using the input variables and a threshold to achieve the greatest information gain at each node, thus reducing node impurity [44]. In regression problems, the impurity of a node is defined by a function whose value should be minimized. In the case of CART, the mean squared error (MSE) is defined as follows:
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2 \tag{1}$$
The CART algorithm determines which divisions will be created and the topology of the tree [45]. Knowing that the input data $(x_i, y_i)$, with $i = 1, 2, \ldots, N$ and $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$, will generate $K$ partitions in regions $R_1, R_2, \ldots, R_K$, and that the output is given by a constant $c_k$ in each region, one can write the output as follows:
$$f(x) = \sum_{k=1}^{K} c_k \, I(x \in R_k) \tag{2}$$
where $I(\cdot)$ is the indicator function. As the minimization criterion adopted is the MSE, one can verify that the optimal $\hat{c}_k$ is the average of the $y_i$ belonging to the region $R_k$, as follows:
$$\hat{c}_k = \operatorname{ave}\left(y_i \mid x_i \in R_k\right) \tag{3}$$
Then, to obtain the division variable $h$ and the division point $s$, the minimization problem is given by Equation (4), as follows:
$$\min_{h,\,s}\left[\min_{c_1}\sum_{x_i \in R_1(h,s)}\left(y_i - c_1\right)^2 + \min_{c_2}\sum_{x_i \in R_2(h,s)}\left(y_i - c_2\right)^2\right] \tag{4}$$
where $R_1(h,s) = \{x \mid x_h \le s\}$ and $R_2(h,s) = \{x \mid x_h > s\}$.
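The search for the division variable and division point can be illustrated with a short exhaustive search. This is only a minimal sketch under stated assumptions: the function name `best_split`, the use of midpoints between consecutive feature values as candidate thresholds, and the toy data are illustrative choices, not details taken from the paper.

```python
def best_split(X, y):
    """Return (feature index h, threshold s, cost) of the best binary split.

    The cost is the summed squared error of both regions when each region
    predicts the mean of its own outputs (the optimal constant under MSE).
    """
    def sse(values):
        # Sum of squared errors around the mean of the region.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best = None
    for h in range(len(X[0])):
        # Candidate thresholds (assumption): midpoints between distinct values.
        column = sorted(set(row[h] for row in X))
        for low, high in zip(column, column[1:]):
            s = (low + high) / 2.0
            left = [yi for row, yi in zip(X, y) if row[h] <= s]
            right = [yi for row, yi in zip(X, y) if row[h] > s]
            cost = sse(left) + sse(right)
            if best is None or cost < best[2]:
                best = (h, s, cost)
    return best


# Made-up toy data: feature 0 separates the two strength levels, feature 1 is noise.
X = [[0.30, 5.0], [0.35, 1.0], [0.50, 4.0], [0.55, 2.0]]
y = [60.0, 62.0, 40.0, 42.0]
h, s, cost = best_split(X, y)
```

A full CART implementation would apply this search recursively to each resulting region until a stopping rule (for example, a maximum depth) is met.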
2.2.3. Extreme Learning Machine (ELM)
The extreme learning machine (ELM) is a neural network with only one hidden layer that differs from other networks in its weight adjustment strategy [46]. In this strategy, the weights of the hidden layer are randomly assigned, while the output layer weights are obtained analytically. The output of an ELM model is defined as follows:
$$\hat{y}(x) = \sum_{i=1}^{N} w_i H_i(x) \tag{5}$$
where $N$ is the number of neurons in the hidden layer, $w_i$ is the weight of the output layer, and $H_i(x)$ is the output of the activation function of the $i$-th neuron in the hidden layer. The activation functions applied are described in Table 6.
In the ELM training process, the hidden layer parameters (weights ($a$) and bias ($b$)) are defined randomly, ensuring a significant reduction in the training time for this type of network. The weights of the output layer ($w$) are defined by minimizing the error over the $M$ training samples, as shown in Equation (6) below:
$$\min_{w} \sum_{j=1}^{M} \left( \hat{y}(x_j) - y_j \right)^2 \tag{6}$$
which is found by solving the problem
$$\sum_{i=1}^{N} w_i H_i(x_j) = y_j, \quad j = 1, \ldots, M \tag{7}$$
which can be written in matrix form, as shown in Equation (8) below:
$$\mathbf{H}\mathbf{w} = \mathbf{y} \tag{8}$$
The weights that minimize the error are found by solving $\mathbf{w} = \mathbf{H}^{\dagger}\mathbf{y}$, where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse of $\mathbf{H}$.
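The ELM training procedure, random hidden-layer parameters followed by an analytical solution for the output weights, can be sketched in a few lines of NumPy. The function names, the `tanh` activation, the number of hidden neurons, and the synthetic data are assumptions made for illustration; the activation functions actually used in the paper are those listed in Table 6.

```python
import numpy as np


def elm_fit(X, y, n_hidden=20, seed=0):
    """Fit an ELM: random hidden layer, output weights via the pseudoinverse."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(X.shape[1], n_hidden))  # random hidden weights
    b = rng.normal(size=n_hidden)                # random hidden biases
    H = np.tanh(X @ a + b)                       # hidden-layer output matrix
    w = np.linalg.pinv(H) @ y                    # least-squares output weights
    return a, b, w


def elm_predict(X, a, b, w):
    return np.tanh(X @ a + b) @ w


# Synthetic data (made up) standing in for mixture parameters and strength.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(50, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
a, b, w = elm_fit(X, y)
train_rmse = float(np.sqrt(np.mean((elm_predict(X, a, b, w) - y) ** 2)))
```

Because only the output weights are fitted, and they come from a single linear least-squares solve, no iterative training loop is needed.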
2.2.4. K-Nearest Neighbors (KNN)
The K-nearest neighbors (KNN) algorithm is an instance-based method, which makes predictions by comparing sets of attributes that have similar outputs [47]. KNN relies on the principle that samples that are close in the attribute space will also be close in the response space. Starting from this point, the algorithm finds the $K$ nearest neighbors of a set of attributes and uses their values of the target variable to make the prediction.
To find the K-nearest neighbors, different strategies can be used, the simplest of which is based on the Euclidean distance. In this work, tree-based data structures, such as KD-trees and ball trees, are used to speed up the search for neighbors. The prediction for a feature set can be given by the average of the response values of the $K$ nearest neighbors, which is defined as follows:
$$\hat{y}(x) = \frac{1}{K} \sum_{x_i \in N_K(x)} y_i \tag{9}$$
where $N_K(x)$ is the set of attributes of the $K$-nearest neighbors and $y_i$ is the response value of each attribute set.
To weight the response values of neighbors, one can add a weight ($w_i$) that aims to value the closest neighbors, giving Equation (10), as follows:
$$\hat{y}(x) = \frac{\sum_{x_i \in N_K(x)} w_i \, y_i}{\sum_{x_i \in N_K(x)} w_i} \tag{10}$$
The weight values can be calculated as the inverse of the distance between the compared instances. Here, the weights $w_i$ are uniform for all the $K$ neighbors.
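The difference between uniform and inverse-distance weighting can be seen in a small brute-force sketch. The function name `knn_predict` and the toy data are hypothetical, and the exhaustive distance computation replaces the KD-tree and ball-tree searches mentioned above purely for brevity.

```python
import math


def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict the response of `x` from its k nearest training samples."""
    # Euclidean distance from the query to every training sample (brute force).
    dists = [(math.dist(xi, x), yi) for xi, yi in zip(X_train, y_train)]
    neighbors = sorted(dists, key=lambda pair: pair[0])[:k]
    if not weighted:
        # Uniform weights: plain average of the neighbor responses.
        return sum(yi for _, yi in neighbors) / len(neighbors)
    # Inverse-distance weights; an exact match returns its own response.
    for d, yi in neighbors:
        if d == 0.0:
            return yi
    weights = [1.0 / d for d, _ in neighbors]
    weighted_sum = sum(w * yi for w, (_, yi) in zip(weights, neighbors))
    return weighted_sum / sum(weights)


# Made-up one-dimensional example.
X_train = [[0.0], [1.0], [2.0], [10.0]]
y_train = [10.0, 20.0, 30.0, 100.0]
uniform = knn_predict(X_train, y_train, [1.5], k=2)
inverse = knn_predict(X_train, y_train, [1.25], k=2, weighted=True)
```

With uniform weights both neighbors count equally, whereas the inverse-distance variant pulls the prediction toward the closer neighbor.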
2.2.5. Support Vector Machine (SVM)
Support vector machines (SVMs) were developed by Vapnik in 1995 [48]. The technique is based on statistical learning theory, which aims to reduce the generalization error [49]. SVMs have been used in many different fields and activities, such as image recognition, text categorization, and bioinformatics, and have yielded results comparable or sometimes superior to those of techniques such as ANNs. This fact can be justified by the SVM’s ability to deal with large datasets. The support vector regression (SVR) algorithm was used in this study. This version of the SVM is capable of working with continuous output values.
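The SVR algorithm is based on the principle of a linear machine that maps the input values while minimizing the generalization error. This machine is defined as follows:
$$f(x) = (w \cdot x) + b \tag{11}$$
where $f(x)$ is the prediction based on the input $x$, $w$ is the weight vector, $b$ represents the bias, and $(\cdot)$ is the inner product.
In the SVR optimization problem, two slack variables ($\xi_i$ and $\xi_i^{*}$) are included, and the problem formulation is as follows:
$$\min_{w,\,b,\,\xi,\,\xi^{*}} \; \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N}\left(\xi_i + \xi_i^{*}\right) \quad \text{subject to} \quad \begin{cases} y_i - (w \cdot x_i) - b \le \varepsilon + \xi_i \\ (w \cdot x_i) + b - y_i \le \varepsilon + \xi_i^{*} \\ \xi_i,\ \xi_i^{*} \ge 0 \end{cases} \tag{12}$$
where the parameter $C$ regulates the penalty applied to deviations larger than the tolerated margin ($\varepsilon$). The choice of $C$ also influences the complexity of the resulting model.
The regression problem can be solved in its dual form, where $w$ can be replaced by $w = \sum_{i=1}^{N}(\alpha_i - \alpha_i^{*})\,x_i$, with $\alpha_i$ and $\alpha_i^{*}$ being the Lagrange multipliers. The SVR output is given by Equation (13), as follows:
$$f(x) = \sum_{i=1}^{N}\left(\alpha_i - \alpha_i^{*}\right)(x_i \cdot x) + b \tag{13}$$
To perform nonlinear regression, a function $K(x_i, x) = (\varphi(x_i) \cdot \varphi(x))$ is defined, where $\varphi$ is a nonlinear transformation. This function is referred to as the kernel. In this way, the output of the nonlinear SVR is given by Equation (14), as follows:
$$f(x) = \sum_{i=1}^{N}\left(\alpha_i - \alpha_i^{*}\right)K(x_i, x) + b \tag{14}$$
The kernel function used is the radial basis function (RBF) given by $K(x_i, x) = \exp\left(-\gamma \lVert x_i - x \rVert^{2}\right)$. The parameter $\gamma$ is a coefficient to be defined for the kernel through an optimization process.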
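To make the role of the RBF kernel in the nonlinear SVR output concrete, the sketch below evaluates a prediction from a given set of support vectors and dual coefficients. It does not solve the SVR optimization problem; the function names and all numeric values (support vectors, coefficients, bias, kernel coefficient) are made up for illustration.

```python
import math


def rbf_kernel(xi, x, gamma):
    """Radial basis function kernel between two feature vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, x))
    return math.exp(-gamma * squared_distance)


def svr_predict(x, support_vectors, dual_coefs, bias, gamma):
    """Kernel expansion of the SVR output.

    Each entry of `dual_coefs` plays the role of the difference between the
    two dual variables associated with one support vector.
    """
    return sum(c * rbf_kernel(sv, x, gamma)
               for sv, c in zip(support_vectors, dual_coefs)) + bias


# Hypothetical fitted model with two support vectors.
support_vectors = [[0.0, 0.0], [1.0, 1.0]]
dual_coefs = [2.0, -1.0]
value = svr_predict([0.0, 0.0], support_vectors, dual_coefs, bias=30.0, gamma=0.5)
```

A larger kernel coefficient makes each support vector influence a narrower neighborhood, which is why it is treated as a parameter to be tuned.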
2.3. Stacking
Techniques that combine the prediction capacity of models generated by learning algorithms to achieve better results than traditional models individually are known as ensembles [
50,
51]. Different ensemble techniques can be divided into categories according to their objectives in improving prediction results, strategies for combining individual results, and types of individual algorithms used [
52,
53]. The last category can be subdivided into heterogeneous and homogeneous combinations. Techniques such as bagging [
44] and boosting [
54] are examples of combination strategies that use homogeneous algorithms. In this work, the ensemble strategy adopted is stacking, which uses heterogeneous algorithms.
Stacking was defined by Wolpert [55] as a layered scheme (Figure 7): the algorithms belonging to the first level (level-0) are trained with a set of samples and generate predictions that are used as the training set for the algorithm in the second level (level-1), also called the metamodel, which generates the final predictions.
The objective of stacking is to reduce generalization errors through the use of model cascading. It is based on the premise that each individual model is less capable of making a good prediction than the set of these models combined.
An advantage of stacking over other ensemble methods, or at least a difference, is the way in which the predictions of level-0 models are combined [56]. While techniques such as bagging and boosting use simpler ways to carry out combinations, such as using the average of individual forecasts as the final forecast, stacking uses more robust models to make the final prediction, for example, linear regression or other learning algorithms.
Here, linear regression is used as the metamodel. This choice is justified because linear regression makes it possible to evaluate more clearly the contribution of each level-0 model’s prediction to the final prediction.
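The two-level scheme can be sketched with NumPy as follows. This is a simplified illustration, not the configuration used in the paper: the two toy level-0 models (a linear fit and a 1-nearest-neighbor regressor), the helper names, the use of out-of-fold predictions to build the level-1 training set, and the synthetic data are all assumptions.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear model with intercept; returns a predict function."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return lambda Z: np.column_stack([Z, np.ones(len(Z))]) @ coef


def fit_1nn(X, y):
    """1-nearest-neighbor regressor; returns a predict function."""
    def predict(Z):
        d = ((Z[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        return y[d.argmin(axis=1)]
    return predict


def out_of_fold_predictions(fitters, X, y, k=5):
    """Level-0 predictions for each sample, made by models that never saw it."""
    all_idx = np.arange(len(X))
    Z = np.zeros((len(X), len(fitters)))
    for test_idx in np.array_split(all_idx, k):
        train_idx = np.setdiff1d(all_idx, test_idx)
        for j, fit in enumerate(fitters):
            Z[test_idx, j] = fit(X[train_idx], y[train_idx])(X[test_idx])
    return Z


# Synthetic data (made up) in place of a concrete dataset.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(60, 2))
y = 3.0 * X[:, 0] + np.sin(6.0 * X[:, 1])

level0 = [fit_linear, fit_1nn]
Z = out_of_fold_predictions(level0, X, y)   # level-1 training set
metamodel = fit_linear(Z, y)                # linear regression metamodel
stacked = metamodel(Z)                      # final predictions
```

The fitted coefficients of the metamodel indicate how much each level-0 model contributes to the final prediction, which is the interpretability argument made for linear regression.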
2.4. Particle Swarm Optimization (PSO)
Particle swarm optimization (PSO) is inspired by the behavior of groups of animals, e.g., flocks of birds [57]. The algorithm is stochastic and is based on a very simple concept: each individual in the group moves through the search space with a velocity that is adjusted through its own experience and the experience of the group. In this scenario, one can define the set of individuals as the population, with each individual representing a candidate solution to the minimization problem. Each individual is assessed by the objective function to determine how close its position is to the optimum.
The objective function is defined according to the problem to be solved. In problems where the parameters of a regression model have to be optimized, an error function can be used that reflects the results of the model. To delimit the search space, the maximum and minimum values are defined according to the problem to be solved.
Individuals in the population are endowed with a memory that records the best position of the individual, called pBest, and a collective memory that records the best position already reached by the set of individuals, called gBest.
These characteristics allow individuals to evaluate their next position within the search space, aiming to reach points that represent better solutions to the problem. The learning process of the algorithm is defined by the memory capacity of individuals, which allows one to update the velocity ($v$) and position ($x$) as follows:
$$v_i^{t+1} = v_i^{t} + c_1 r_1 \left(pBest_i - x_i^{t}\right) + c_2 r_2 \left(gBest - x_i^{t}\right) \tag{15}$$
$$x_i^{t+1} = x_i^{t} + v_i^{t+1} \tag{16}$$
In Equation (15), the constants $c_1$ and $c_2$ represent the speed rate in the direction of the best individual position and the best global position, respectively. The variables $r_1$ and $r_2$ are introduced in Equation (15) and generate randomness in the algorithm’s learning process. These variables are generated randomly and range from 0 to 1. This randomness promotes a more complete exploration of the search space.
PSO can be used for continuous or discrete search spaces, as in the case of optimizing the parameters of an MLP, where the number of hidden layers and the activation function can be optimized. In the case of discrete numeric parameters, the PSO response is converted to an integer value; in the case of discrete categorical parameters, however, an encoding strategy must be used. One such strategy is to assign an integer value to each of the parameter options, so that the same procedure used for discrete numeric parameters can be applied to select the option.
The algorithm may use as a stopping criterion the maximum number of iterations or a tolerance value for the variation in the value of the objective function. The maximum number of iterations was used as the stopping criterion in this study.
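A compact PSO loop with a maximum number of iterations as the stopping criterion might look like the sketch below. Several details are assumptions rather than the paper's settings: the inertia weight `w` (a common variant that damps the previous velocity), the values of `c1` and `c2`, the clamping of positions to the bounds, and the sphere function used as a stand-in objective.

```python
import random


def pso_minimize(objective, bounds, n_particles=20, n_iter=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize `objective` inside the box `bounds` with a basic particle swarm."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the search space; zero initial velocity.
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]                      # pBest of each particle
    p_best_val = [objective(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_best_val[i])
    g_best, g_best_val = p_best[g][:], p_best_val[g]  # gBest of the swarm

    for _ in range(n_iter):                           # stop at max iterations
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                lo, hi = bounds[d]
                # Keep the particle inside the delimited search space.
                x[i][d] = min(max(x[i][d] + v[i][d], lo), hi)
            val = objective(x[i])
            if val < p_best_val[i]:
                p_best[i], p_best_val[i] = x[i][:], val
                if val < g_best_val:
                    g_best, g_best_val = x[i][:], val
    return g_best, g_best_val


def sphere(position):
    # Stand-in objective; in practice this would be a model's validation error.
    return sum(p * p for p in position)


best, best_val = pso_minimize(sphere, bounds=[(-5.0, 5.0), (-5.0, 5.0)])
```

For a discrete parameter, one coordinate of `best` could be rounded to an integer and used as an index into the list of options, following the encoding strategy described above.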
2.5. Cross-Validation
The cross-validation method is a technique used to evaluate the performance of estimators using all the data in the dataset. This method is employed to minimize prediction errors caused by overfitting and is also indicated for the validation of datasets that have a reduced number of sample data points.
The cross-validation technique used is $k$-fold, which consists of dividing the dataset into $k$ sets of equal size, fitting the model to $k-1$ sets, and validating it on the remaining set. This process is carried out $k$ times, so that the model is validated on each part of the dataset [58].
The parameter k must be adjusted appropriately to avoid negatively impacting the final result. The value of k is normally chosen to be between 5 and 10. The choice of this value depends directly on the size of the dataset used, since selecting a large value for k can produce validation sets that are too small to represent all the characteristics of the dataset.
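The partitioning step of k-fold cross-validation can be sketched as an index generator. The function name is hypothetical, and the sketch assumes contiguous folds without shuffling; in practice the sample order is usually shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Fold sizes differ by at most one when n_samples is not divisible by k.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, 5))
```

Each sample appears in exactly one validation fold, so every part of the dataset is used once to validate a model fitted on the remaining folds.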
2.6. Proposed Approaches
Two approaches are proposed to predict the compressive strength of concrete. The stacking topology in both approaches has two levels. The first level comprises SVM, MLP, ELM, KNN, and DT models. The second level is the metamodel, a linear regression algorithm. Figure 8 presents a graphical model of the topology used.
The main difference between the two approaches lies in how PSO optimizes the first-level model parameters and how the parameters of the linear regression metamodel are adjusted. As shown in Figure 8, in the first approach (Stacking 1 (ST-1)), the first-layer models are optimized individually via PSO, and the predictions from the optimized models serve as input data for the metamodel, whose parameters are adjusted using the least squares method. In the second approach (ST-2), the first-layer models are optimized together with the parameters of the metamodel, with the objective function of the optimization being the mean squared error of the final prediction.
2.6.1. ST-1
In ST-1, the optimization algorithm, PSO, is executed for each method in the first layer. The optimization of method parameters in this approach aims to minimize the mean squared error of each model individually. The parameters used for PSO are presented in Table 7. The population size varied according to the method used, but the other parameters remained the same.
Each method present in the first layer of stacking has a set of parameters whose adjustment directly influences the prediction quality.
Table 8 lists the parameters that are optimized via PSO for each method. The values presented in
Table 8 were obtained from the literature and previous experiences.
During the process of obtaining the optimized parameters, the models in the first layer are trained using the cross-validation strategy. This strategy is justified mainly due to the reduced number of samples in the datasets studied. The metamodel is a linear regression method adjusted by the least squares approach. The model is adjusted on the basis of the predictions performed by the first-level models obtained through the optimization process.
2.6.2. ST-2
In ST-2, unlike in ST-1, the parameters of the first-layer models and the parameters of the linear regression metamodel are optimized together. In this approach, the optimization aims to minimize the mean squared error of the final prediction made by the metamodel. The PSO algorithm is applied to the whole stacking model to optimize the parameters of the first-level models and adjust the metamodel. The parameters of the PSO algorithm for this approach are presented in Table 9.
The output of the metamodel, and consequently of stacking, is given in Equation (17), as follows:
$$\hat{y} = \sum_{i=1}^{N} \beta_i \hat{y}_i + b \tag{17}$$
where $N$ is the number of first-level models, $\hat{y}_i$ are the predictions of the first-level models, $\beta_i$ are the metamodel parameters of the linear regression, and $b$ is the bias.
In the optimization process, the parameters $\beta_i$ can assume values between 0 and 1, and $b$ can assume values between −10 and 10. The parameters of the first-layer methods can assume values according to Table 8. The PSO algorithm in this strategy has a search space of twenty-two dimensions, which justifies the increase in the population size and the maximum number of iterations.
To enable the analysis of the influence of the prediction of each first-level model on the final result of stacking, a restriction was used in the PSO (Equation (18)) so that the sum of the parameters $\beta_i$ was 1 with a tolerance ($t$) of 0.05:
$$\left| \sum_{i=1}^{N} \beta_i - 1 \right| \le t \tag{18}$$
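The ST-2 metamodel output and the restriction on the sum of its parameters can be sketched as follows. The function names and the toy predictions are hypothetical, and handling infeasible weights with a large additive penalty is an assumption: the text does not state how the restriction is enforced inside PSO.

```python
def stacking_output(level0_predictions, weights, bias):
    """Linear metamodel: weighted sum of the level-0 predictions plus a bias."""
    return sum(w * p for w, p in zip(weights, level0_predictions)) + bias


def weights_feasible(weights, tol=0.05):
    """Check that the metamodel weights sum to 1 within the tolerance."""
    return abs(sum(weights) - 1.0) <= tol


def penalized_mse(y_true, level0_matrix, weights, bias, penalty=1e6):
    """Objective for the optimizer: MSE of the final prediction.

    Assumption: infeasible weight vectors receive a large additive penalty
    so that the swarm is steered back toward the feasible region.
    """
    preds = [stacking_output(row, weights, bias) for row in level0_matrix]
    mse = sum((yt - yp) ** 2 for yt, yp in zip(y_true, preds)) / len(y_true)
    return mse if weights_feasible(weights) else mse + penalty


# Made-up predictions from five level-0 models for a single sample (MPa).
level0_matrix = [[40.0, 42.0, 38.0, 41.0, 39.0]]
y_true = [40.0]
equal_weights = [0.2] * 5
output = stacking_output(level0_matrix[0], equal_weights, bias=0.0)
```

Because the weights are constrained to sum to approximately 1, each weight can be read as the share of the corresponding level-0 model in the final prediction.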
2.7. Assessment Metrics
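The following three metrics, which were used in this study, are commonly applied in the literature: the coefficient of determination ($R^2$), the root mean squared error (RMSE), and the mean absolute percentage error (MAPE).
Setting $\hat{y}$ as the estimated output, $y$ as the sample label, $\bar{y}$ as the average of the sample labels, and $N$ as the number of samples, the metrics are defined as stated below.
The coefficient of determination ($R^2$) is defined in Equation (19), where $R^2$ typically varies from 0 to 1, with a value closer to 1 indicating better generalization quality. The RMSE is obtained using Equation (20), and the MAPE is defined by Equation (21), as follows:
$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2} \tag{19}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{20}$$
$$\mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i - \hat{y}_i}{y_i}\right| \tag{21}$$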
For the error metrics (RMSE and MAPE), the closer the obtained value is to zero, the closer the predictions are to the true values. The MAPE is expressed as a percentage, so a value closer to 0% indicates better predictions.
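As a minimal sketch, the three metrics can be computed directly with NumPy using their standard definitions. The function names `r2`, `rmse`, and `mape` and the toy values are hypothetical and are not taken from the study.

```python
import numpy as np


def r2(y, y_hat):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y, y_hat):
    """Root mean squared error, in the same unit as the labels."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


def mape(y, y_hat):
    """Mean absolute percentage error, in percent (labels must be nonzero)."""
    return 100.0 * np.mean(np.abs((y - y_hat) / y))


# Toy compressive strengths (MPa): labels and predictions.
y = np.array([20.0, 40.0, 50.0])
y_hat = np.array([22.0, 38.0, 50.0])

print(r2(y, y_hat), rmse(y, y_hat), mape(y, y_hat))
```

For these toy values, the relative errors are 10%, 5%, and 0%, so the MAPE is 5%.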
2.8. Statistical Tests
Statistical tests were used mainly to determine whether the results of the first-layer models and of stacking differ significantly.
To determine the existence of similarity between three or more groups, parametric or nonparametric methods were used, depending on whether the samples were normally distributed. The Shapiro–Wilk test was used to determine whether the normality hypothesis holds for a given sample [59]. The
p-values indicate whether the sample can be considered to have a normal distribution. A
p-value less than 0.05 indicates that the normality hypothesis is rejected.
The Lilliefors test was also used to assess the normality of a given sample. This test is adapted from the Kolmogorov–Smirnov test and uses the same statistic [60], namely the maximum difference between the empirical distribution function and the theoretical cumulative distribution function. The null hypothesis for this test was that the sample follows a normal distribution, which was not rejected if the
p-value in the test result was greater than 0.05.
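The normality checks can be sketched with SciPy as follows. The sample is synthetic (hypothetical metric values for 35 runs) and is not data from the study. `scipy.stats.shapiro` returns the Shapiro–Wilk statistic and p-value. For the Lilliefors test, only the statistic is computed here, as the Kolmogorov–Smirnov distance to a normal distribution whose mean and standard deviation are estimated from the sample. Its p-value needs the Lilliefors reference distribution, which `statsmodels.stats.diagnostic.lilliefors` provides; that function is not called in this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for one metric over 35 independent runs (assumption).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.95, scale=0.01, size=35)

# Shapiro-Wilk: normality is rejected when the p-value is below 0.05.
sw_stat, sw_p = stats.shapiro(sample)
reject_normality = sw_p < 0.05

# Lilliefors statistic: maximum distance between the empirical distribution
# function and the normal CDF with parameters estimated from the sample.
# The p-value reported by kstest is NOT valid in this setting and is ignored.
mu, sigma = sample.mean(), sample.std(ddof=1)
lilliefors_stat = stats.kstest(sample, "norm", args=(mu, sigma)).statistic

print(sw_stat, sw_p, reject_normality, lilliefors_stat)
```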
The choice between parametric and nonparametric tests was based on the results of the Shapiro–Wilk and Lilliefors tests. Parametric tests are used when the samples originate from normal distributions, allowing inference about the parameters that characterize the distribution of origin. Nonparametric tests are used when the distributions of origin are not determined; in this case, inference is made about the center of the distribution.
The nonparametric test used throughout this work was the Kruskal–Wallis test. It is used to determine whether there is a significant difference between the medians of two or more independent groups for a continuous or ordinal variable. This test is an alternative to the parametric ANOVA test [61], which assumes normality and homoscedasticity (equal variances across the samples). In the Kruskal–Wallis test, as with other tests, a statistic is calculated and compared with the critical value defined by the significance level, which is normally 0.05. The hypotheses tested were as follows: H0, the population medians are equal; H1, the population medians are different.
The use of tests such as the Kruskal–Wallis test and ANOVA allowed us to indicate whether there was a statistically significant difference between the groups of samples tested. The groups of samples that presented significant differences between each other were defined using post hoc tests.
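A minimal sketch of the Kruskal–Wallis test with `scipy.stats.kruskal` is shown below. The three groups are hypothetical RMSE values for three methods over ten runs each, chosen so that the groups do not overlap. They are not results from the study.

```python
import numpy as np
from scipy import stats

# Hypothetical RMSE values for three methods over ten runs each (assumption).
method_a = np.arange(1, 11) / 10.0    # 0.1 ... 1.0
method_b = np.arange(11, 21) / 10.0   # 1.1 ... 2.0
method_c = np.arange(21, 31) / 10.0   # 2.1 ... 3.0

# H0: the population medians are equal; H1: they are different.
h_stat, p_value = stats.kruskal(method_a, method_b, method_c)
significant = p_value < 0.05

print(h_stat, p_value, significant)
```

Because the groups are fully separated, the test rejects H0 here. A post hoc test is then needed to identify which pairs of groups differ.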
The Dunn test is a nonparametric post hoc test used to compare pairs of sample groups and identify whether there is a significant difference between the pairs [62]. This test was used whenever the Kruskal–Wallis test indicated a significant difference between the compared groups. The test yields a
p-value for each pair of compared samples; a
p-value less than 0.05 indicates that the two samples differ significantly.
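The pairwise comparison can be illustrated with a small hand-written version of Dunn's test based on pooled ranks. This sketch makes simplifying assumptions: it applies no tie correction and no multiple-comparison adjustment of the p-values, and the text does not state which settings were used in the study. The function name `dunn_test` and the data are hypothetical.

```python
import itertools

import numpy as np
from scipy import stats


def dunn_test(groups):
    """Pairwise Dunn z-tests on pooled ranks.

    groups maps a name to a 1-D sequence of values. Returns a dict mapping
    (name_a, name_b) to a two-sided p-value. Assumptions: no tie correction
    and no p-value adjustment.
    """
    names = list(groups)
    sizes = [len(groups[name]) for name in names]
    pooled = np.concatenate([np.asarray(groups[name], dtype=float) for name in names])
    ranks = stats.rankdata(pooled)  # ranks over all observations together
    n_total = len(pooled)

    # Mean rank of each group, sliced in the order the groups were pooled.
    mean_ranks, start = {}, 0
    for name, size in zip(names, sizes):
        mean_ranks[name] = ranks[start:start + size].mean()
        start += size

    p_values = {}
    for (a, n_a), (b, n_b) in itertools.combinations(zip(names, sizes), 2):
        std_err = np.sqrt(n_total * (n_total + 1) / 12.0 * (1.0 / n_a + 1.0 / n_b))
        z = (mean_ranks[a] - mean_ranks[b]) / std_err
        p_values[(a, b)] = 2.0 * stats.norm.sf(abs(z))
    return p_values


# Hypothetical RMSE values over ten runs per method (assumption).
groups = {
    "stacking": np.arange(1, 11, dtype=float),
    "svm": np.arange(1, 11, dtype=float) + 0.5,
    "knn": np.arange(21, 31, dtype=float),
}
p_values = dunn_test(groups)
for pair, p in p_values.items():
    print(pair, round(p, 4), "different" if p < 0.05 else "similar")
```

In this toy example, stacking and the SVM are statistically similar, whereas both differ significantly from the KNN.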
3. Results and Discussion
The results are presented using the average and standard deviation of the metrics for the thirty-five independent runs. The results of ST-1 applied to the four datasets are presented first, followed by those of ST-2.
The computational experiments were conducted on a computer with the following specifications: Intel(R) Core(TM) i7-9700F (eight cores at 3 GHz and 6 MB of cache memory), 32 GB of RAM, and the Ubuntu 22 Linux operating system. The codes were implemented in Python, based on the pandas [
63], NumPy [
64], matplotlib [
65], seaborn [
66], scikit-learn [
67], and scipy [
68] libraries.
3.1. Results for ST-1 Stacking
The results for the simulations using ST-1 are presented in
Table 10,
Table 11,
Table 12 and
Table 13. The average values are presented, with the standard deviations shown in parentheses. The best average results among the first-layer methods and stacking are highlighted in the tables.
The average values of the assessment metrics show that stacking does not yield the best results for all the datasets; however, it always ranks at least second.
Statistical tests were used to determine whether there was a significant difference between the evaluation metrics of stacking and those of the other machine learning methods. To decide whether parametric or nonparametric tests should be used, the Shapiro–Wilk and Lilliefors tests were performed to indicate whether the results of the evaluation metrics presented a normal distribution. In
Table 14 and
Table 15,
p-values for normality tests are presented.
The results presented in
Table 16 indicate that the metrics of the first-layer models and of stacking differ significantly, since the
p-values for all tests were less than 0.05. Once it was verified that there was a significant difference between the methods, Dunn’s post hoc test was used to identify where the difference was located. The main interest of this work was to verify whether the results obtained by stacking differed significantly from those of the first-layer models. For simplicity, comparisons between the first-layer models were omitted.
Comparing the results presented in
Table 17 with those in
Table 10 to
Table 13, one can observe that stacking yields results that are statistically similar to those of the first-layer models. To assist in the analysis of variance,
Figure 9,
Figure 10,
Figure 11 and
Figure 12 present boxplots of the metric results for each dataset.
The graphs confirm what was observed in the results: stacking has low variance, which supports the prediction reliability of the method. For datasets D1, D3, and D4, the SVM has the variance closest to that of stacking; for dataset D2, the closest variance is that of the MLP.
Table 18 presents the models’ best parameters for datasets D1, D2, D3, and D4, for reproducibility. It can be observed that, for dataset D1, the results were better and, in most cases, the model was simpler. D3 was the dataset on which the models had the most difficulty learning.
Table 19 presents the results obtained using the parameters shown in
Table 18.
To evaluate the participation of each first-layer model in the final stacking prediction, linear regression was used as a metamodel, as explained previously.
Table 20 presents the averages of the regression coefficients ($w_i$) associated with each first-layer method and the intercept term ($b$) for the executions performed for each dataset.
According to the results presented in
Table 20, there is a direct relationship between the quality of the individual prediction of each first-layer method and its participation in the prediction made by the metamodel. Taking the results for dataset D1 as an example, it can be observed that SVM has the greatest participation (58.2%) in the final prediction, which is justified since it was the model with the best individual result for this dataset. The participations of the MLP (23.5%) and ELM (10.7%) algorithms are likewise in line with their individual performances, as are those of the KNN (3.2%) and DT (4.1%) algorithms.
The KNN and DT methods present low participation in the final stacking prediction in the four datasets, contributing less than 10% on average. Despite this general pattern, in dataset D2, the DT has more representation in the final prediction than the SVM, and in dataset D3, the KNN has a representation close to that obtained by the MLP.
3.2. Results for ST-2 Stacking
The results of the simulations using ST-2 are presented in
Table 21,
Table 22,
Table 23 and
Table 24. The data in the tables follow the same format used for ST-1: average and standard deviation over the 35 executions.
Evaluating only the average values of the metrics presented in
Table 21,
Table 22,
Table 23 and
Table 24, it can be seen that stacking does not yield the best results in all cases. However, stacking achieves at least the second-best result when compared with the first-layer models.
As with the ST-1 results, statistical tests were used to determine whether there was a significant difference between the stacking evaluation and the other machine learning methods. The Shapiro–Wilk and Lilliefors tests were performed to determine whether the tests should be parametric or nonparametric. These tests indicate whether the results of the evaluation metrics exhibit a normal distribution. In
Table 25 and
Table 26,
p-values for normality tests are presented.
The results indicate that none of the metrics showed a normal distribution in any of the samples. These results indicate the need to apply nonparametric tests. As in ST-1, the Kruskal–Wallis test was applied to determine whether there was a significant difference between the first-layer models and the stacking model. In
Table 27, the
p-values for the tests are presented.
The results in
Table 27 indicate that there is a significant difference between the first-layer models and stacking. Given that there is a difference between the methods, it is necessary to identify which methods differ. As in the analysis for ST-1, Dunn’s post hoc test was used.
Table 28 presents comparisons between the first-layer models and stacking.
By observing the results presented in
Table 28 and comparing them with the results presented in
Table 21,
Table 22,
Table 23 and
Table 24, one can find that stacking is statistically similar to the first-layer models that present the best results. For datasets D2 and D4, stacking was similar to a larger number of first-layer models, which may be due to the high variance of the metrics in this approach. To evaluate the variance of the evaluation metric results for the models in the first layer and for stacking, boxplots were generated; the results are presented in
Figure 13,
Figure 14,
Figure 15 and
Figure 16.
In the executions carried out for the second approach with the four datasets, the variance of the evaluation metrics shows that the first-layer models MLP and ELM have higher variance values. Observing
Figure 13,
Figure 14,
Figure 15 and
Figure 16, it is clear that the boxplots for the MLP and ELM contain outlier points, which shows that these models had low precision. The variances of the metrics for the stacking results are low for the executions with datasets D1, D3, and D4 and higher for dataset D2, taking as a reference the lowest variance among the models. The SVM, KNN, and DT models presented lower variance values.
To analyze the influence of each method on the final stacking prediction,
Table 29 presents the average regression coefficients ($w_i$) associated with each first-layer method and the intercept term ($b$) for the executions performed for each dataset.
The metamodel coefficients in this approach were obtained through optimization using the PSO algorithm. By comparing the results presented in
Table 29 and in
Table 21,
Table 22,
Table 23 and
Table 24, one can verify that there is no direct relationship between the participation of the first-layer models in the stacking prediction and the evaluation metrics of each model. Taking the results for dataset D1 as an example, when examining
Table 21, one can see that the SVM model yields the best individual results. However, its participation in the stacking prediction is smaller than that of ELM, which, in turn, is only the third-best individual model when comparing evaluation metrics.
3.3. Comparison Between ST-1 and ST-2
By analyzing the results of the evaluation of first-layer models and stacking using the data presented in
Table 10,
Table 11,
Table 12,
Table 13 and
Table 21,
Table 22,
Table 23,
Table 24, one can observe that, in general, the results presented by ST-1 are better on average. There are only two exceptions: for dataset D1, the KNN model in ST-2 yields better MAPE values than in ST-1, and for dataset D3, the DT model in ST-2 yields better MAPE and RMSE values. According to
Table 10,
Table 11,
Table 12,
Table 13 and
Table 21,
Table 22,
Table 23,
Table 24, both approaches yield satisfactory stacking results: the evaluation metrics for the stacking predictions are either the best among all models or statistically equal to those of the best first-layer model.
The approaches can also be compared by analyzing how the first-layer models contribute to the final stacking prediction. In ST-1, the individual results of the first-layer models were directly related to their participation in the final stacking prediction: the better the model’s result, the greater its participation. This allows us to infer that the adjustment of the metamodel in ST-1 was satisfactory. In ST-2, there is no direct relationship, or any other identifiable relationship, between the individual results of the first-layer models and their participation in the final stacking prediction. This observation alone does not show that the fit of the metamodel was unsatisfactory, but it indicates the greater complexity of adjusting the metamodel in ST-2.
3.4. Limitations of the Proposed Approach and Possibilities for Future Works
The results and analysis presented in this paper consistently show that combining stacking with PSO contributes to an automated and precise computational method. However, the study has some limitations. The datasets used for training and the types of models employed restrict the generalization of the findings. More specifically, the stacking technique depends on the quality and diversity of the concrete mixtures used; however, in the current study, only three types of concrete mixtures were evaluated. Because stacking integrates several models, it adds complexity to the prediction process. This makes it harder to interpret how each model contributes to the stacked prediction, so the approach is less transparent than other machine learning approaches. In addition, although stacking can incorporate a wide variety of machine learning models, this diversity was not fully explored in this study. Since the selection of the base models is one of the most important factors for the quality of the results, future work could use a wider range of model classes.
This work can be further extended with more advanced machine learning models, such as deep learning. These models are better suited to capture complex and potentially non-linear relationships and can also accommodate high-dimensional and time-varying data more effectively, which may enhance the predictive accuracy of compressive strength for concrete mixtures. Data augmentation techniques, such as synthetic data generation, may also be helpful in overcoming the challenges of smaller datasets and improving model robustness. Transfer learning, where a model is trained to predict strengths of one type of concrete and then applied to predict strengths of others, may allow stacking techniques to generalize to concrete mixtures outside of those provided in training. This would allow for much more scalable and generalizable models.
4. Conclusions
This work investigated the ability of the stacking learning method to predict the compressive strength of concrete specimens with different characteristics. Stacking was used in conjunction with the PSO algorithm to optimize the parameters of the machine learning methods in the first layer of the stacking model and to improve the performance of the linear regression model used as the metamodel in the second layer. Five different machine learning methods were used in the first layer: MLP, SVM, ELM, KNN, and DT. The models were trained and validated using the K-fold cross-validation method. The training, validation, and calculation of the evaluation metrics were carried out over 35 independent runs to enable a statistical analysis of the results. Two approaches were proposed in this work to optimize the parameters of the first-layer methods and adjust the metamodel parameters. The results achieved in ST-1 indicated a good performance for stacking in predicting compressive strength.
Stacking delivered good results on average, with a MAPE of approximately 11% in the worst case; this result was obtained for dataset D3. However, for dataset D1, the result was 2%.
For the stacking models, the $R^2$ values were approximately 0.85 in the worst case, which was, again, dataset D3; for the other datasets, the values were above 0.95.
The statistical tests carried out, using the results of the evaluation metrics, indicated that stacking gives results as good as the best first-layer method.
For ST-2, the results differ; however, from a qualitative point of view, the two approaches present similar results.
Among the results for ST-2, stacking presented a worst-case average MAPE of approximately 15%, in this case, for dataset D3.
The best result was 2.4%, which was obtained for dataset D1.
The result for $R^2$ was 0.79 in the worst case, namely for dataset D3; for the other datasets, the results exceeded 0.9.
For this approach, the stacking results were also statistically similar to the results obtained by the best first-layer model.
By analyzing the variance of the metric results, one can verify that, similarly to ST-1, stacking yields better accuracy than first-layer models.
The first-layer models that presented the best results were the SVM and the MLP.
These results were statistically similar to those for stacking in both ST-1 and ST-2.
Therefore, it can be concluded that the computational framework created by stacking and PSO was efficient and promising.
Further investigations should employ feature selection techniques, including Boruta feature selection (BFS) and recursive feature elimination (RFE). Additionally, metaheuristic algorithms, such as gray wolf optimization (GWO), artificial bee colony (ABC), and natural exponential differential evolution (DE), should be implemented to enable comparative analysis and the identification of the most suitable methods, in order to guide the search for increasingly accurate predictions.