2.1. Multi Expression Programming
A common objective in research studies is to present a computational model that explains and predicts a specific phenomenon or behavior. Numerous computational techniques, such as Evolutionary Programming (EP), Multi-Expression Programming (MEP), Genetic Algorithm (GA), and Gene Expression Programming (GEP), have been established to support these activities [72,73]. The prime focus of AI modeling is to develop feasible and accurate mathematical representations that predict outputs from pre-specified input parameters. Genetic Programming (GP), based on the Darwinian principle of natural selection, was proposed by [74] as an evolution of the genetic algorithm (GA). The principal difference between these methods is that GA operates on fixed-length binary strings, whereas GP replaces these with nonlinear parse trees. Several linear variants have been suggested over the past few years by various evolutionary algorithms (EAs). Individuals can also be represented as variable-length entities, as suggested by [75,76] for the MEP case. The output of MEP is represented as a linear string of instructions, where each string combines mathematical functions and variables. The MEP schematic process is demonstrated below in Figure 1. The MEP evolutionary process begins with the random generation of a population of chromosomes. Each generation then proceeds by selecting two parents through a binary tournament; the selected parents are recombined with a crossover probability to produce two offspring, which subsequently undergo mutation; finally, the worst-performing individuals in the population are replaced with the improved offspring. This cycle is repeated until the algorithm converges [77].
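The generation cycle described above (binary tournament selection, crossover, mutation, and replacement of the worst individual) can be sketched as a small toy program. This is a minimal illustrative sketch, not the MEPX implementation: the gene encoding, population settings, rates, and toy target function are all assumptions made for demonstration.

```python
import random
import operator

# Toy MEP sketch: a chromosome is a linear string of genes; gene 0 is a
# terminal, and later genes may reference any earlier gene, so one chromosome
# encodes several candidate expressions at once (the "multi-expression" trait).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
N_GENES, POP_SIZE = 8, 30
random.seed(0)

def random_gene(pos, n_vars):
    if pos == 0 or random.random() < 0.4:            # terminal: an input variable
        return ("var", random.randrange(n_vars))
    op = random.choice(list(OPS))                    # function of two earlier genes
    return (op, random.randrange(pos), random.randrange(pos))

def evaluate(chrom, x):
    vals = []
    for gene in chrom:
        if gene[0] == "var":
            vals.append(x[gene[1]])
        else:
            vals.append(OPS[gene[0]](vals[gene[1]], vals[gene[2]]))
    return vals                                      # one value per encoded expression

def fitness(chrom, data):
    # Score every expression encoded in the chromosome; keep the best one.
    errs = [0.0] * N_GENES
    for x, target in data:
        vals = evaluate(chrom, x)
        for g in range(N_GENES):
            errs[g] += (vals[g] - target) ** 2
    return min(errs)

def tournament(pop, fits):
    a, b = random.randrange(len(pop)), random.randrange(len(pop))
    return pop[a] if fits[a] < fits[b] else pop[b]

def evolve(data, n_vars, generations=50, p_cross=0.9, p_mut=0.1):
    pop = [[random_gene(i, n_vars) for i in range(N_GENES)] for _ in range(POP_SIZE)]
    for _ in range(generations):
        fits = [fitness(c, data) for c in pop]
        p1, p2 = tournament(pop, fits), tournament(pop, fits)
        cut = random.randrange(1, N_GENES)           # one-point crossover
        child = p1[:cut] + p2[cut:] if random.random() < p_cross else p1[:]
        child = [random_gene(i, n_vars) if random.random() < p_mut else g
                 for i, g in enumerate(child)]       # mutation of offspring
        worst = max(range(POP_SIZE), key=fits.__getitem__)
        if fitness(child, data) < fits[worst]:       # replace worst individual
            pop[worst] = child
    return min(fitness(c, data) for c in pop)

# Hypothetical toy target: y = x0 * x1 + x0
data = [((a, b), a * b + a) for a in range(1, 5) for b in range(1, 5)]
best_err = evolve(data, n_vars=2)
```

Because references always point to earlier gene positions, one-point crossover and per-gene mutation always yield syntactically valid chromosomes, which is one practical advantage of the linear encoding.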
Most work in recent years has concentrated on computational techniques, particularly neural networks and GEP, for modeling various civil engineering problems. Nevertheless, MEP holds certain advantages over similar intelligence techniques. Typically, an extensive database is required to forecast the characteristic behavior of concrete. GP uses tree-crossing genetic operators, which produce an immense population of processed components (derivation trees); this increases model production time and requires extra storage [74]. Furthermore, because of its nonlinear structure, GP does not distinguish between genotype and phenotype, which makes it strenuous for the algorithmic process to identify reliable mathematical operators for the intended expression [51].
In contrast, MEP can distinguish between genotype and phenotype owing to its linear representation [78]. In GP, the success rate increases with the number of genes per chromosome up to a threshold value; beyond that limit, however, over-fitting tends to appear in the forecasted properties, which restricts the feasibility of such models in the construction sector [26,79,80]. Conversely, MEP excels when the complexity of the final expression is unknown in advance, which is a normal situation in material science problems, where a slight alteration in parameters can change the results considerably [75]. MEP is capable of encoding multiple solutions in a single chromosome. In addition, the linear structure of its chromosomes helps the algorithm explore a wide search space when forecasting the target. These advantages over other computational algorithms make MEP able to establish rigorous and robust models for the construction industry. A few past studies have employed MEP for the systematic classification of soil based on the Atterberg limits (which distinguish the consistency states at the plastic and liquid limits), gravel occurrence, soil color, the volume of fine-grained particles, and sand percentage [77], and non-piecewise models have been proposed to aid in determining the degree of soil consolidation [81]. Other studies include the formulation of the tangent modulus (E) of normal- and high-strength concrete [78], models for concrete columns confined with aramid fiber-reinforced polymer (AFRP) [71], evaluation of the soil deformation modulus [82], formulation of suction caisson uplift capacity [83,84], and prediction of the compressive strength (CS) of Portland cement based on 28-day strength [85].
In the present study, a strength model for CFRP-confined concrete has been developed using the MEP approach. The modeling is complemented by comprehensive analytical and descriptive studies to ensure the validity and efficacy of the created model. The development of credible models will encourage the building industry to use CFRP-confined concrete, since it eliminates the sophisticated and laborious experimental procedures required to test such an unusual construction material. The developed approach will help strengthen infrastructure through retrofitting and rehabilitation, and will also promote viable construction and resource conservation by preventing infrastructure deterioration. Furthermore, the suggested modeling strategy will enable accurate future simulations of similarly complex engineering phenomena.
2.2. Experimental Database
For modeling purposes, a thorough database of the mechanical and geometrical parameters of CFRP-confined concrete was compiled from published experimental studies. The compiled database provides an extensive dataset of 828 specimens covering all critical parameters related to the strength enhancement of FRP-confined concrete. Incorporating such a varied data collection helps ensure the development of a universal and robust model. Some of the investigations used cube samples to evaluate mechanical characteristics; their cube compressive strengths were converted to cylinder compressive strengths using the UNESCO conversion coefficients [86] to ensure data conformity and consistency. To determine the parameters likely to affect the characteristics of CFRP-confined concrete, thorough literature research and statistical data analysis were performed.
Table 2 shows the range and statistical information of the parameters incorporated in the model's construction. The proposed parameters comprise five inputs and one target component, i.e., f′cc = f(d, h, nt, f′co, EFRP), where f′cc is the confined compressive strength of CFRP-confined concrete, and d, h, nt, f′co, and EFRP are, respectively, the section diameter, the corresponding specimen height, the thickness of the CFRP layers, the unconfined concrete strength, and the elastic modulus of the fibers. These appear to be potentially effective parameters in predicting the ultimate load values and are therefore utilized as the input parameters to establish the model. Moreover, εco and εcc are the corresponding strain values of unconfined concrete and CFRP-confined concrete for the respective specimens.
The distribution of the input parameters has a significant influence on the generalization capacity of the generated model. In Figure 2, the data are represented using frequency histograms to depict the distribution of each variable. As shown in Figure 2, the distributions of the input variables are not uniform, although the frequency of observations for each input parameter is relatively high; a higher frequency of observations across a parameter's range generally yields a better model. The statistics and ranges of the individual variables used in the model are summarized in Table 2 to make the data more comprehensible. The table reports the data's center (mean and median), dispersion (standard deviation and variance), extremes (maxima and minima), and distribution shape (skewness and kurtosis), making data interpretation relatively straightforward. The results reveal that the suggested machine learning models apply to a wide range of input data, boosting their utility.
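The descriptive statistics reported in Table 2 can be reproduced with a short routine like the one below. The strength values shown are hypothetical placeholders for illustration only, not entries from the actual database; moment-based skewness and excess kurtosis are computed directly from their definitions.

```python
import statistics as st

def describe(xs):
    """Center, dispersion, extremes, and shape statistics for one variable."""
    n = len(xs)
    m = st.mean(xs)
    sd = st.pstdev(xs)  # population sd, used for the moment-based shape measures
    skew = sum((x - m) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - m) ** 4 for x in xs) / (n * sd ** 4) - 3  # excess kurtosis
    return {"mean": m, "median": st.median(xs), "std": st.stdev(xs),
            "variance": st.variance(xs), "min": min(xs), "max": max(xs),
            "skewness": skew, "kurtosis": kurt}

# Hypothetical unconfined strength values (MPa), for illustration only
fco = [25.0, 30.0, 32.5, 38.0, 41.0, 45.5, 50.0, 62.0]
stats = describe(fco)
```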
The multicollinearity problem, which emerges from interdependence among input parameters, is a prevalent challenge in the execution of machine learning techniques [49]. It can artificially strengthen the apparent correlations between variables, thus lowering the effectiveness of the produced model. To avoid multicollinearity, it is recommended that the coefficient of correlation (R) between any two input parameters remain below 0.8 [87]. R was therefore evaluated for all potential input variable combinations, as given in Table 3. The table shows that |R|, whether the correlation is negative or positive, is smaller than the stipulated limit (0.8) for every pair, indicating that multicollinearity among variables is unlikely during modeling.
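The pairwise screening behind Table 3 can be sketched as below: compute Pearson's R for every pair of input columns and flag any pair at or above the 0.8 limit. The column names and values here are hypothetical stand-ins, not the study's data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def multicollinearity_flags(columns, limit=0.8):
    """Return (name_i, name_j, R) for every input pair with |R| >= limit."""
    names = list(columns)
    flags = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson_r(columns[names[i]], columns[names[j]])
            if abs(r) >= limit:
                flags.append((names[i], names[j], r))
    return flags

# Hypothetical input columns, for illustration only; "h" is nearly
# proportional to "d", so that pair should be flagged.
cols = {
    "d":   [100, 150, 150, 200, 100, 150],
    "h":   [200, 300, 300, 400, 205, 310],
    "fco": [25.0, 40.0, 32.0, 30.0, 48.0, 35.0],
}
flags = multicollinearity_flags(cols, limit=0.8)
```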
2.3. Modeling Parameters
As previously stated, several fitting parameters for MEP must be specified prior to modeling to build a robust and comprehensive model. These fitting parameters were adopted based on past suggestions and refined through a hit-and-trial procedure [88]. In addition, the population size, i.e., the number of programs in the population, must be fixed beforehand: a model with a very large population would be difficult, complicated, and time-consuming to converge, while expanding the model's size beyond a certain point may give rise to overfitting. In this study, the subset (subpopulation) size was set to 100.
Table 4 shows the parameters chosen for the model produced in this study. For convenience, the function set contains the mathematical operations of addition, subtraction, multiplication, and division, as well as certain trigonometric functions, to ensure that the final expressions are robust and accurate. The algorithm's accuracy level depends on the number of generations completed before termination: running with as many generations as possible yields the best model, and consequently one with the fewest anomalies.
Similarly, the mutation and crossover rates represent the likelihood that offspring undergo these genetic operations; the crossover percentage ranges between 50% and 95%. The data were subjected to multiple combinations of modeling configurations, and the optimum combination was adopted based on an overall subjective evaluation of the models, as shown in Table 4. A common issue often encountered in AI modeling is data overfitting, whereby a model performs efficiently on the source data but poorly on unknown data. It is therefore strongly recommended that the trained model be tested on an unknown (test) dataset to avoid conflicts arising from this problem [89,90]. Accordingly, the entire database was arbitrarily separated into training, validation, and testing sets.
The training and validation sets were used during modeling, and the verified model was then evaluated on the third dataset, the test set, which was not involved in the model's construction. It must be ensured that the data distribution is uniform across all data sets. To maintain consistency in the presented work, 70%, 15%, and 15% of the data were assigned to the training, validation, and test sets, respectively. The final models exhibit good performance over all datasets. For this purpose, the commercially accessible computational tool MEPX v1.0 was acquired to employ the MEP algorithm.
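The 70/15/15 partition of the 828 specimens can be sketched as follows. The seed and helper name are assumptions for reproducibility of the illustration; the splitting procedure actually used with MEPX may differ.

```python
import random

def split_dataset(records, ratios=(0.70, 0.15, 0.15), seed=42):
    """Shuffle and split records into train / validation / test subsets."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)    # reproducible shuffle
    n = len(shuffled)
    n_train = int(round(n * ratios[0]))
    n_val = int(round(n * ratios[1]))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

records = list(range(828))                   # stand-in for the 828 specimens
train, val, test = split_dataset(records)
```

With 828 records this yields 580 training, 124 validation, and 124 test samples, matching the 70/15/15 proportions.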
The algorithm commences by producing an initial population of viable alternatives. The mechanism is recursive and converges toward a conclusion with every new generation. In each generation, fitness is first appraised across the solution population. A major concern in machine learning algorithms, however, is model overfitting caused by excessive training on the data: overfitting eventually causes the testing error to increase even as the training error decreases continuously [91]. Therefore, to curb the effects of model overfitting, an objective function (OF), also known as a fitness function, is introduced.
Moreover, the literature [49,92] proposes that the best model be selected based on a minimized objective function (OF). In the current study, OF is assessed for each trained model to demonstrate its overall efficacy, as it accounts for the effects of R, RMSE, and the quantity of input data. The model developed by MEP transforms persistently until no further change is recorded in the pre-established fitness function, i.e., RMSE or the coefficient of determination. The process is repeated until convergence to achieve an accurate and robust model across the three datasets (training, validation, and testing), eventually by expanding the number and size of the subpopulations. Finally, model selection is made based on the minimum value of OF. Some models showed superior performance on the training set compared to the testing set, which indicates overfitting that must be countered accordingly. It should be noted that the accuracy of the developed model is affected by the evolution period, i.e., the number of generations evolved; with the inclusion of each new variable in the program, the model evolves continuously. Therefore, in this research, a run was terminated either upon reaching 10,000 generations or when the change in the fitness function became acceptable, i.e., less than 0.1 percent.
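The termination rule described above can be expressed as a small helper: stop at the generation cap, or when the relative change in the best fitness between successive generations falls below 0.1%. This is a sketch; the exact comparison used by MEPX is not specified in the text.

```python
def should_stop(history, max_generations=10_000, tol=0.001):
    """history: best fitness value recorded at each generation so far.
    Returns True at the generation cap or when the relative change in
    fitness between the last two generations is below tol (0.1%)."""
    gen = len(history)
    if gen >= max_generations:
        return True
    if gen >= 2 and history[-2] != 0:
        rel_change = abs(history[-1] - history[-2]) / abs(history[-2])
        return rel_change < tol
    return False
```

For example, a fitness sequence that moves from 5.0 to 4.9999 (a 0.002% change) triggers termination, whereas a drop from 5.0 to 4.0 does not.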
Furthermore, an optimal model must satisfy multiple performance indicators, as explained in the following discussion. These performance indicators assess the efficacy of the proposed model by evaluating statistical errors and model indices. The measures include the coefficient of correlation (R), relative squared error (RSE), relative root mean square error (RRMSE), RMSE, mean absolute error (MAE), the fitness function, and the performance index (ρ). Equations (1)–(7) present the relationships for these statistical indicators.
Here, ei and mi denote the ith actual (experimental) and estimated values, respectively; ē and m̄ denote the mean experimental and average estimated values, respectively; and n denotes the total number of observations utilized for modeling. The subscripts T and TE reflect the train and test sets, respectively.
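Under common definitions of these indicators, and assuming the performance index takes the form ρ = RRMSE/(1 + R) frequently used in related GEP/MEP studies (an assumption, since Equations (1)–(7) are not reproduced here), the measures can be computed as below. The actual/predicted values are hypothetical.

```python
import math

def metrics(actual, predicted):
    """R, RMSE, MAE, RSE, RRMSE, and performance index rho for one dataset."""
    n = len(actual)
    ma = sum(actual) / n
    mp = sum(predicted) / n
    sq_err = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    rmse = math.sqrt(sq_err / n)
    mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
    rse = sq_err / sum((a - ma) ** 2 for a in actual)       # relative squared error
    rrmse = rmse / abs(ma)                                  # RMSE relative to the mean
    cov = sum((a - ma) * (p - mp) for a, p in zip(actual, predicted))
    r = cov / (math.sqrt(sum((a - ma) ** 2 for a in actual))
               * math.sqrt(sum((p - mp) ** 2 for p in predicted)))
    rho = rrmse / (1.0 + r)   # performance index (assumed form)
    return {"R": r, "RMSE": rmse, "MAE": mae,
            "RSE": rse, "RRMSE": rrmse, "rho": rho}

# Hypothetical confined-strength values (MPa), for illustration only
actual = [30.0, 42.0, 55.0, 61.0, 48.0]
predicted = [31.5, 40.0, 56.0, 59.5, 49.0]
m = metrics(actual, predicted)
```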
Furthermore, several criteria must be observed when evaluating the validity of constructed models. At a minimum, a model should meet the standards outlined in the literature as follows [93,94,95,96,97,98,99]:
A correlation between the observed and predicted values exists when 0.2 < |R| < 0.8.
If |R| < 0.2, the correlation between the actual and predicted values is weak.
|R| must be larger than 0.8 to maintain a strong correlation between expected and actual values.
Furthermore, a model with a strong R and limited predictive errors is considered reliable. In general, the |R| value is an important parameter to consider when evaluating a model, and researchers have suggested using R to assess linear relations between inputs and outputs [22,83]. However, R alone does not evaluate the overall efficiency of a model, because it is insensitive to division or multiplication of the output by a constant. The average magnitude of the errors is calculated using the RMSE and MAE measures, each with its own implications and restrictions. For instance, in RMSE the errors are squared before averaging, which gives greater weight to larger deviations.
Accordingly, a large RMSE value indicates that some outputs have errors far higher than anticipated, which must be minimized. In comparison to RMSE, MAE assigns lower weight to larger errors, leading to a smaller value. Other researchers, such as Despotovic et al. (2016) [100], have recommended that the RRMSE value for excellent modeling be between 0 and 0.10; a model with RRMSE between 0.11 and 0.20 is considered good. Other indices, such as ρ and OF, lie between 0 and infinity; however, for a model to be regarded as good, ρ and OF must be less than 0.2 [92]. The parameter OF is of significant importance, as it considers the effect of three main statistical parameters of the training and testing datasets, i.e., RRMSE, R, and the relative proportion of data.
Furthermore, a lower value of OF indicates that the proposed model's efficiency is preferably sufficient. The OF computed in the presented study is close to the criteria stated for a good model. As explained earlier, numerous trials were carried out until the model converged to yield the lowest OF value. Finally, the developed model was externally validated through standards suggested by other scholars, which are presented in Table 5.