**3. Methodology**

### *3.1. Artificial Neural Networks*

ANNs are nonlinear mathematical models that aim to simulate the processing schemes of the human brain [56–59]. To a first approximation, an ANN is the result of the weighted and biased connections of logistic regression units, i.e., the artificial neurons. In a multilayer feedforward network, such neurons are organized into hidden layers between an input layer, consisting of source nodes (the components of the input vector), and an output layer composed of one or more computational neurons that compute the response of the network. Each neuron is characterized by an activation function that constrains its response to a continuous range of values (e.g., between 0 and 1 for the logistic unit).

The input, i.e., the feature vector or the previous hidden layer's output, is weighted, meaning that a specific weight is associated with each input signal (the components of the input feature vector or the individual responses of the hidden neurons). These network parameters are learned through a supervised training process so that the network itself (at this point interpretable as a nonlinear parametric function) can perform a specific task. In regression problems, the network's task is to predict the experimental target associated with a given feature vector, i.e., to represent the implicit relationship between the input and the ground-truth output.

In formal terms, given a data set of input–output vector pairs:

$$D = \left\{ \left( \mathbf{x}^{(d)}, \mathbf{y}^{(d)} \right) \right\}_{d=1}^{|D|} \tag{1}$$

$$\text{where } \mathbf{x}^{(d)} = \left[ x_0^{(d)}, \dots, x_{F-1}^{(d)} \right] \in \mathbb{R}^F \text{ and } \mathbf{y}^{(d)} = \left[ y_0^{(d)}, \dots, y_{T-1}^{(d)} \right] \in \mathbb{R}^T \tag{2}$$

the input layer of an ANN is identified with the input feature vector $\mathbf{x}^{(d)}$. The output of the first hidden layer is the vector:

$$\mathbf{h}^{(1)} = \left[ h_0^{(1)}, \dots, h_{N-1}^{(1)} \right] \tag{3}$$

where each element $h_j^{(1)}$ is obtained as:

$$h_j^{(1)} = \phi\left(\sum_{i=0}^{F-1} w_{ij}^{(1)} \, x_i^{(d)} + b_j^{(1)}\right). \tag{4}$$

In this expression, $w_{ij}^{(1)}$ are the weights of the connections between the neurons $x_i^{(d)}$ and the neurons $h_j^{(1)}$, $b_j^{(1)}$ are the biases, and $\phi$ is the activation function. The subsequent layer, $\mathbf{h}^{(2)}$, is computed in a similar way, by considering the elements of the vector $\mathbf{h}^{(1)}$ as the input layer, with corresponding weights $w_{ij}^{(2)}$ and biases $b_j^{(2)}$:

$$h_j^{(2)} = \phi\left(\sum_{i=0}^{N-1} w_{ij}^{(2)} \, h_i^{(1)} + b_j^{(2)}\right). \tag{5}$$

This process is repeated to compute the activations of each layer $\mathbf{h}^{(l)}$, $l \in \{1, 2, \dots, L\}$, and the output $\hat{\mathbf{y}}^{(d)} = \phi^{(out)}\left(\mathbf{h}^{(L)}\right)$. All the learnable parameters form the set:

$$\mathcal{W} = \left\{ w_{ij}^{(l)}, b_j^{(l)} \right\}_{l \in \{1, 2, \dots, L\}}. \tag{6}$$
In the proposed ANN, the number of source nodes is equal to the number of input features ($F$), namely, the bitumen content (BC) and the filler-to-bitumen ratio (Ftb), along with a categorical variable that distinguishes the bituminous mixtures in terms of bitumen type, maximum nominal grain size, aggregate type, and production site (values: 1 for M1, 2 for M2, 3 for M3, and so on). These features are fundamental parameters of pavement engineering, related to the composition of the bituminous mixtures.

Each hidden layer is equipped with $N$ neurons, and the activation function $\phi(\cdot)$ is either an ELU, TanH, or ReLU function (Figure 5). The output layer consists of $T = 4$ neurons (corresponding to MS, MF, ITSM, and AV), with the identity function as $\phi^{(out)}(\cdot)$. In previous studies, the Marshall parameters, as well as the ITSM (but not the volumetric properties), have always been determined separately, by means of different neural models [29,45–48,55], whereas in the current paper all of them can be predicted simultaneously by a single multi-output ANN model.

**Figure 5.** Activation functions: exponential linear (ELU), hyperbolic tangent (TanH), rectified linear (ReLU).
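To make the architecture concrete, the following is a minimal PyTorch sketch of such a multi-output feedforward network (PyTorch being the framework used in this study, see Section 3.6). The class name, argument names, and the example values (e.g., three input features and the $L = 5$, $N = 37$ configuration reported later) are illustrative, not the authors' original code.

```python
import torch
import torch.nn as nn

class FeedforwardANN(nn.Module):
    """Multilayer feedforward network: F input features -> L hidden layers
    of N neurons each -> T outputs with identity (linear) activation."""

    ACTIVATIONS = {"elu": nn.ELU, "tanh": nn.Tanh, "relu": nn.ReLU}

    def __init__(self, n_features: int, n_layers: int, n_neurons: int,
                 n_targets: int = 4, activation: str = "tanh"):
        super().__init__()
        phi = self.ACTIVATIONS[activation]
        layers, in_size = [], n_features
        for _ in range(n_layers):
            layers += [nn.Linear(in_size, n_neurons), phi()]  # Eqs. (4)-(5)
            in_size = n_neurons
        layers.append(nn.Linear(in_size, n_targets))  # identity output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Illustrative use: a batch of feature vectors (e.g., BC, Ftb, mixture code)
model = FeedforwardANN(n_features=3, n_layers=5, n_neurons=37, activation="tanh")
y_hat = model(torch.randn(8, 3))  # 8 samples -> 8 x 4 predictions (MS, MF, ITSM, AV)
```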

### *3.2. ANN Optimization*

The network parameters $\mathcal{W}$ are identified by means of a supervised training process, which is divided into two successive steps, i.e., a forward and a backward phase. In the latter, a backpropagation algorithm [58] is exploited to update the ANN's weights and biases:

$$\mathcal{W}^{(e)} = \mathcal{W}^{(e-1)} - \alpha \, \nabla L\left[\mathcal{W}^{(e-1)}\right], \quad e \in \{1, \dots, E\} \tag{7}$$

where $E$ is the number of training iterations, $\alpha$ is the learning rate, and $\mathcal{W}^{(e-1)}$ are the parameter values at iteration $e-1$. A detailed and comprehensive discussion of Equation (7) has already been provided, e.g., by Baldo et al. [29]. At the end of the training process (i.e., after the $E$ iterations have been completed), the parameters $\mathcal{W}$ of the $e$-th iteration that produced the minimum of the loss function $L$ are selected. Such a loss function is usually defined as the squared 2-norm of the difference between the network output $\hat{\mathbf{y}}^{(d)}$ and the ground-truth vector $\mathbf{y}^{(d)}$, and is therefore called Mean Squared Error (MSE):

$$L\left(\hat{\mathbf{y}}^{(d)}, \mathbf{y}^{(d)}\right) = \left\| \hat{\mathbf{y}}^{(d)} - \mathbf{y}^{(d)} \right\|_2^2 \tag{8}$$

The training process just outlined is known as Stochastic Gradient Descent (SGD). This approach is commonly adopted in the well-known [60,61] and widely used MATLAB® ANN Toolbox [22,23,29,45,46,50].
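A compact sketch of the SGD loop corresponding to Equations (7) and (8) is given below; it assumes a PyTorch model such as the one sketched above and standardized tensors `X` and `Y`, and the snapshot logic mirrors the selection of the best-scoring iteration described above. Function and variable names are illustrative.

```python
import copy
import torch

def train_sgd(model, X, Y, epochs=1000, lr=0.01):
    """Plain SGD on the MSE loss; keeps the parameters of the best iteration."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # update rule of Eq. (7)
    criterion = torch.nn.MSELoss()                           # loss of Eq. (8)
    best_loss, best_state = float("inf"), None
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), Y)   # forward phase
        loss.backward()                 # backward phase (backpropagation)
        optimizer.step()                # W(e) = W(e-1) - alpha * gradient
        if loss.item() < best_loss:     # retain the best-performing iteration
            best_loss = loss.item()
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_loss
```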

With the aim of increasing the convergence speed of the learning algorithm, several improvements have been proposed in recent years. The first remarkable one, called the RAdam optimizer [62], addresses the need for warmup heuristics, which are otherwise required to avoid falling into bad local minima. This problem can occur when using adaptive optimizers such as Adam [63]. According to the authors, the variance of the adaptive learning rate in the initial weight updates can be intractable and direct the solution search towards local minima with poor performance. Adding a rectifier operation to the adaptive learning rate was shown to reduce this variance and lead to better solutions. The second remarkable improvement, Lookahead [64], is a method suited to reducing both the need for extensive hyperparameter searches and the variance of the parameter updates. The Lookahead algorithm maintains two copies of the parameters, the "fast weights" and the "slow weights". The former are first updated k times using a standard SGD-based optimizer; then, the slow weights are updated in the direction of the last computed fast weights. Intuitively, as the name states, the first kind of update looks ahead to acquire information about the loss function's surface. Once this information is obtained, the updates point in a more accurate direction towards the loss function's minimum. The combination of the two strategies is known as the Ranger optimizer. In this work, this optimization algorithm was used to calculate the optimal weights of the implemented ANN, in contrast with the common procedures of neural modeling used in previous literature studies [22,23,29,31,45,46,50].
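As a rough illustration of how the two components fit together, the sketch below wraps a minimal Lookahead step around `torch.optim.RAdam` (available in recent PyTorch releases). Dedicated Ranger implementations exist as separate packages, so this is a simplified stand-in rather than the optimizer actually used, and the `k` and `alpha` values are illustrative.

```python
import torch

class Lookahead:
    """Minimal Lookahead wrapper: the inner ("fast") optimizer runs k steps,
    then the slow weights move a fraction alpha towards the fast weights."""

    def __init__(self, params, base_optimizer, k: int = 6, alpha: float = 0.5):
        self.params = list(params)
        self.fast_opt = base_optimizer            # e.g., torch.optim.RAdam
        self.k, self.alpha, self.counter = k, alpha, 0
        self.slow = [p.detach().clone() for p in self.params]

    def zero_grad(self):
        self.fast_opt.zero_grad()

    @torch.no_grad()
    def step(self):
        self.fast_opt.step()                      # fast-weight update (RAdam)
        self.counter += 1
        if self.counter % self.k == 0:            # every k steps: slow update
            for p, s in zip(self.params, self.slow):
                s += self.alpha * (p - s)         # move slow weights towards fast ones
                p.copy_(s)                        # restart fast weights from slow ones

# Ranger-style combination (illustrative hyperparameters):
# inner = torch.optim.RAdam(model.parameters(), lr=1e-2, weight_decay=1e-6)
# optimizer = Lookahead(model.parameters(), inner, k=6, alpha=0.5)
```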

### *3.3. ANN Regularization*

To perform proper neural modeling, the overfitting phenomenon must be avoided: a model becomes overfitted when it starts to adapt excessively to the training data and stops regressing smoothly on the selected validation data. In the current study setup, the weight decay technique [65] (also known as L2 regularization) has been implemented to overcome this problem. It consists of adding an additional term to the optimization objective, calculated as the squared 2-norm of the parameters of the network. In this way, the global optimization objective becomes:

$$L_{\oplus}\left(\hat{\mathbf{y}}^{(d)}, \mathbf{y}^{(d)}, \mathcal{W}^{(e-1)}\right) = L\left(\hat{\mathbf{y}}^{(d)}, \mathbf{y}^{(d)}\right) + \beta \left\| \mathcal{W}^{(e-1)} \right\|_2^2 \tag{9}$$

where *β* is a hyperparameter to control the magnitude of the term.
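In PyTorch, this penalty can be included either by adding the term of Equation (9) explicitly to the loss or through the optimizer's `weight_decay` argument; a minimal sketch, with a purely illustrative β value, follows.

```python
import torch

beta = 1e-6  # regularization strength (illustrative value)

def regularized_loss(model, y_hat, y):
    """MSE plus beta times the squared L2 norm of all learnable parameters (Eq. 9)."""
    mse = torch.nn.functional.mse_loss(y_hat, y)
    l2 = sum(p.pow(2).sum() for p in model.parameters())
    return mse + beta * l2

# Alternative: let the optimizer apply the decay directly.
# optimizer = torch.optim.RAdam(model.parameters(), lr=1e-2, weight_decay=beta)
```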

### *3.4. k-Fold Cross Validation*

A fixed training–validation split of the data is usually performed [60,61]: this technique, also known as the "hold-out" method, can result in biased model performance as a consequence of the different descriptive statistics characterizing the training and validation data sets [49].

Conversely, k-fold Cross Validation (k-fold CV) represents an effective alternative for evaluating the actual generalization capabilities of a model. This resampling method consists of splitting the data set of interest into k equally sized folds. In the current study setup, a 5-fold CV was implemented: 5 alternative partitions of the data set were obtained and, in turn, 4 folds were used to train the neural model and the remaining one to validate it. This involved running 5 experiments, at the end of which the obtained validation scores were averaged to evaluate the general performance of the proposed model [49,51].
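A minimal sketch of such a k-fold loop, using scikit-learn's `KFold` for the index splitting, is shown below; `train_and_validate` is a placeholder for the training and scoring routine and is not part of the original code.

```python
import numpy as np
from sklearn.model_selection import KFold

def cross_validate(X, Y, train_and_validate, k=5, seed=0):
    """Run k training/validation experiments and average the validation scores."""
    kfold = KFold(n_splits=k, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in kfold.split(X):
        score = train_and_validate(X[train_idx], Y[train_idx],
                                   X[val_idx], Y[val_idx])
        scores.append(score)
    return float(np.mean(scores))   # e.g., average Pearson R over the k folds
```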

### *3.5. Bayesian Hyperparameters Optimization*

Modeling by machine learning (ML) approaches involves the accurate setting of several hyperparameters: these parameters are used to define the topology and size of a neural network (e.g., the number of hidden layers and neurons) and to control the learning process (e.g., the learning rate). Standard procedures involve a grid or random (i.e., sampling-based) search for the best combination of hyperparameters within variation intervals carefully defined by the experimenter on the basis of his/her experience [23,29,31,45,46,48,50]. However, to obviate the significant time demands and computational resources of the abovementioned methods, a semi-automatic strategy has been implemented in the current study, namely, Bayesian Optimization (BO). This method, based on Bayesian statistics, aims to find the maximum (for maximization problems) or the minimum (for minimization problems) of a function $f(\mathbf{x})$, $\mathbf{x} = [x_p]$, $p \in \{0, \dots, P\}$, $P \in \mathbb{N}$, where $x_p$ is a parameter in a bounded set $X_p \subset \mathbb{R}$. This mathematical problem is solved by optimization algorithms that define a probabilistic model over $f(\mathbf{x})$, in order to decide at each iteration which point in $X$ is the most likely to maximize (or minimize) $f(\cdot)$. In this context, Snoek et al. [66] were the first to use BO for the search of ML model hyperparameters: since the trend of the objective function is unknown, the authors treated $f(\cdot)$ as a random function and placed a Gaussian process (GP) prior [67] over it, to capture its behavior. During the optimization process, the prior is updated based on the results of ML experiments, produced by different hyperparameter combinations, to form the posterior distribution over the function $f(\cdot)$. The latter, in turn, is exploited by an acquisition function to determine the next evaluation point.

In practice (Figure 6), a set of $O$ observations of the form $\left\{\mathbf{x}^{(o)}, y^{(o)}\right\}_{o=1}^{O}$, with $y^{(o)} = \mathcal{N}\left(f\left(\mathbf{x}^{(o)}\right), \nu\right)$, where $\nu$ is the variance of the noise in the observation of $f(\cdot)$, is exploited to determine a multivariate Gaussian distribution over $\mathbb{R}^O$ through a mean function $m : X \to \mathbb{R}$ and a covariance function $K : X \times X \to \mathbb{R}$. The next set of hyperparameters $\mathbf{x}_{next} \in X$ to be evaluated during the optimization process is determined by solving $\mathbf{x}_{next} = \operatorname{argmax}_{\mathbf{x}} a(\mathbf{x})$. The function $a : X \to \mathbb{R}^+$ is called the acquisition function and generally depends on both the previously observed samples $\left\{\mathbf{x}^{(o)}, y^{(o)}\right\}_{o=1}^{O}$ and the GP parameters $\theta$; it is formulated as $a\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right)$. In the GP prior setting, $a(\cdot)$ relies on the predictive mean $\mu\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right)$ and variance $\sigma^2\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right)$ functions of the GP model. Several definitions of $a(\cdot)$ exist [68,69], but the GP Upper Confidence Bound (UCB) [70] has been proven to efficiently reduce the number of function evaluations needed to determine the global optimum of several black-box multimodal functions. UCB implements the acquisition function as $a_{UCB}\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right) = \mu\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right) - \kappa\,\sigma\left(\mathbf{x}; \left\{\mathbf{x}^{(o)}, y^{(o)}\right\}, \theta\right)$, where $\kappa$ is a hyperparameter that controls the balance between exploitation (i.e., favoring parts that the model predicts as promising) and exploration.

**Figure 6.** Flow chart of the optimization process.

In the actual problem, namely the prediction of the mechanical and volumetric properties of bituminous mixtures for road pavements, $f(\cdot)$ is defined as $f : X_L \times X_N \times X_{act} \times X_{\alpha} \times X_{\beta} \times X_E \to [-1, 1]$ and has to be maximized. Therefore, given the $P = 6$ hyperparameters $L$, $N$, $act$, $\alpha$, $\beta$, $E$, $f(\cdot)$ constructs an ANN with $L$ layers and $N$ neurons per layer and performs a 5-fold CV experiment in which the ANN is trained for $E$ iterations with $\alpha$ and $\beta$ as the learning rate and weight decay parameter, respectively. $f(\cdot)$ returns a scalar value that expresses the average Pearson coefficient (R) obtained by the ANN on the 5 validation folds. The BO algorithm performs 500 iterations: at each iteration, $f(\cdot)$ runs an experiment based on the hyperparameter combination sampled by the UCB algorithm from the posterior distribution; the result is used to update this probabilistic model and improve the sampling of the next points. The step-by-step procedure is described in Figure 7: the modeling procedure begins with the mixture design and the execution of laboratory tests for the definition of the feature and target variables, which are assumed to be representative of the physical problem treated; the input–target fitting is performed by a neural network, whose structure and algorithmic functioning are not known a priori; the search for the topology and the values of the training process parameters that minimize the prediction error of the ANN is handled by the Bayesian optimizer, by comparing network outputs and experimental targets; once the optimal hyperparameter combination has been identified, the model can be put into service for the designed application, by training it on the entire data set. In the current study setup, the hyperparameters to be optimized by the Bayesian approach can range in the following integer or logarithmic intervals:

**Figure 7.** Step-by-step procedure followed in this study: it starts with the mix design (left side) and testing processes (bottom side); these tests define the set of target variables (lower right side); the input-target fitting is performed by a neural network, whose structure and algorithmic functioning are searched by the Bayesian optimizer (upper side) comparing network outputs and experimental targets (right side).


These ranges were defined in the optimizer implementation [67] and have been left unchanged.
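As an illustration of how such a search could be wired up with the Bayesian optimization package [71], a condensed sketch follows. The bounds shown are placeholders rather than the actual intervals referenced above, the logarithmic hyperparameters are represented here as base-10 exponents, the objective body is a stub, and the `maximize` signature with `acq`/`kappa` corresponds to the 1.x releases of the package.

```python
from bayes_opt import BayesianOptimization

def objective(L, N, act, lr_exp, wd_exp, E):
    """Stand-in for the real routine: build an ANN with the rounded
    hyperparameters, run a 5-fold CV experiment (Section 3.4), and return
    the average Pearson R on the validation folds."""
    n_layers, n_neurons, epochs = int(round(L)), int(round(N)), int(round(E))
    activation = ["elu", "tanh", "relu"][int(round(act))]
    lr, weight_decay = 10.0 ** lr_exp, 10.0 ** wd_exp
    # ... train with 5-fold CV and return the mean validation Pearson R ...
    return 0.0   # placeholder so the snippet runs end-to-end

# Placeholder bounds: the real integer/logarithmic intervals are those of the
# optimizer implementation referenced in the text.
pbounds = {"L": (1, 10), "N": (1, 100), "act": (0, 2),
           "lr_exp": (-4, -1), "wd_exp": (-8, -2), "E": (100, 5000)}

bo = BayesianOptimization(f=objective, pbounds=pbounds, random_state=1)
bo.maximize(init_points=10, n_iter=490, acq="ucb", kappa=2.5)  # UCB acquisition
print(bo.max)   # best average R and the corresponding hyperparameters
```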

### *3.6. Implementation Details*

Before being fed to the ANN, each feature contained in the feature vectors $\mathbf{x}^{(d)}$ was standardized, i.e., the respective mean was subtracted and a division by the respective standard deviation was applied, with the statistics calculated on $D$. The same procedure was performed for the target vectors $\mathbf{y}^{(d)}$: each target variable had its mean subtracted and was divided by its standard deviation, both computed on $D$. The BO observations $\mathbf{x}^{(o)}$ were transformed in a similar fashion; in this case, each hyperparameter was standardized by the mean and standard deviation of its respective range.
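A minimal sketch of this standardization step, assuming the features and targets are stored as NumPy arrays, is the following:

```python
import numpy as np

def standardize(values: np.ndarray):
    """Z-score each column using statistics computed on the whole data set D."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)
    return (values - mean) / std, mean, std

# X (|D| x F feature matrix) and Y (|D| x T target matrix) are standardized
# column by column; the BO observations are transformed analogously.
# X_std, x_mean, x_std = standardize(X)
# Y_std, y_mean, y_std = standardize(Y)
```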

The source code was written in Python. The ANNs were implemented using the PyTorch machine learning framework, while the hyperparameter BO procedure was realized with the Bayesian optimization package [71]. The code was run on a machine equipped with an Intel(R) Xeon(R) W-2125 4 GHz CPU and 32 GB of RAM, running Ubuntu 18.04. Each experiment lasted circa 24 h.

### **4. Results and Discussion**

Figure 8 shows the k-fold CV score of the neural models designed in the 500 iterations of the BO algorithm. The Pearson's R coefficient averaged over the 5 validation folds tended to become uniform in value among the different combinations tested by the BO optimizer during at least the first 350 iterations (Figure 8), but showed more marked variations afterwards. The algorithm detected a region of the search space where the validation error was not changing significantly and, having assessed an over-exploration of that specific zone, decided to focus its search on an unexplored area that might have contained high-performing solutions (while, however, identifying worse combinations). This result highlights the regulation between exploitation and exploration performed by the UCB algorithm.

**Figure 8.** Average R-score on the 5 validation folds for the 500 algorithm iterations.

Despite the large size of the search space, the optimal model (i.e., the one showing the minimum validation loss $MSE_{CV} = 0.249$ and thus the maximum value of the Pearson coefficient $R_{CV} = 0.868$) was found at iteration 54 (Figure 8). The hyperparameters discovered by the BO algorithm defined an ANN with $L = 5$ layers, $N = 37$ neurons, and hyperbolic tangent activation function (TanH), trained for $E = 3552$ iterations with a learning rate of $\alpha = 0.01$ and a weight decay of $\beta = 1 \cdot 10^{-6}$. Table 5 shows the validation MSE (second column) and the R-score of the optimal model for each of the 5 folds (last column). In addition, the final average results for each mechanical characteristic and volumetric property are reported in the last row of Table 5. Figure 9 shows the relation between network output and experimental target for each fold.


**Table 5.** Validation MSE and R-scores of the optimal model for each of the 5 folds and for each predicted property.

**Figure 9.** Regression analysis on the 5 folds.

Although there is a high variability in the data set, which can be explained by the different properties of the mixtures analysed, the optimal BO model returns very successful results (Figure 9). The Pearson coefficient averaged over the 5 folds is always greater than 0.819 for each mechanical and physical parameter (last row of Table 5). Fluctuations in the values of the MSE and R parameters (second and seventh columns in Table 5) can be attributed to the different distributions of training and test data that characterize each fold. This result shows how a fixed split of the data set can lead to an incorrect assessment of the prediction error, which can be worse ($R_4 = 0.829$) or better ($R_3 = 0.906$) than the most likely situation represented by the k-fold CV ($R_{CV} = 0.868$). Random and grid searches based on the prediction error of a fixed training–test split may therefore find solutions that are not optimal [49], owing to fluctuations resulting from considering one partition rather than another. Figure 10 shows the comparison between experimental targets and predicted outputs for the four parameters analyzed, as regards fold 4. The values calculated by the ANN model characterized by the highest prediction error ($MSE_{mean} = 0.346$, Table 5) are very close to the experimental data, whatever variable is considered. This result is relevant from an engineering point of view, because it proves that ANNs can be an accurate method to model (even simultaneously) the mechanical response and physical properties of bituminous mixtures, even when these are very different in terms of composition.

**Figure 10.** Experimental vs. predicted data, related to fold 4.

Although the presented modeling procedure is conceptually and computationally more complex than a simple grid search, it overcomes some of the problems (such as biased performance evaluations and sub-optimal hyperparameter selection) inherent in traditional methods (i.e., grid and random search) or in the most frequently used toolboxes (such as the MATLAB® ANN Toolbox), as the results reported above suggest.
