#### *2.3. Surrogate Models*

In this section, we give an overview of the surrogate models used and their general assumptions; to highlight the differences as well as the advantages and disadvantages among them, we present a detailed mathematical background in Appendix A. We have selected models from different learning paradigms: gradient-boosted decision tree regression (GBDTR), K-nearest-neighbor regression (KNNR), Gaussian process regression (GPR), support vector regression (SVR), multilayer perceptrons (MLP), and physics-informed neural networks (PINN).


#### **3. Results**

For evaluation, we split the data into a training and test dataset to fit and test our surrogate models; see Table 6 for the dataset sizes and Table 2 for more details regarding the data split.

As a next step, we need to define hyperparameters for each model and each use case. We performed hyperparameter optimization using only training data; no test data were used. In our PINN approaches, the adaptation of hyperparameters was based on the work of [21,22]. Our MLPs were designed to be similar to our PINNs to allow for fair comparisons. We varied the hyperparameters of our neural network approaches (MLP and PINN) following best practices and guidelines, optimizing the number of hidden layers, the number of neurons per hidden layer, the activation function, the validation split, the early-stopping patience and the batch size per training epoch. For the rest of our models, we applied a grid search with five-fold cross-validation on the training data to obtain the best hyperparameters. The hyperparameters for each use case are listed in Appendix B, Tables A1–A6.
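To illustrate, the following minimal sketch shows how such a grid search with five-fold cross-validation can be set up in scikit-learn. The data and parameter ranges are hypothetical stand-ins; the actual grids are listed in Appendix B.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the training split (see Table 2).
rng = np.random.default_rng(0)
X_train = rng.uniform(size=(200, 4))
y_train = X_train @ np.array([1.0, -2.0, 0.5, 0.0])

# Hypothetical grid; the actual ranges are listed in Appendix B.
param_grid = {
    "n_neighbors": [2, 4, 8, 16],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    scoring="r2",
    cv=5,  # five-fold cross-validation, as described above
)
search.fit(X_train, y_train)  # only training data enter the search
print(search.best_params_)
```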

Our evaluation is based on R2-scores with respect to the FEM results and on inference time. For models that contain inherent randomness, such as MLPs, GBDTR and PINNs, a five-fold cross-validation was conducted; for these models, we report the mean and standard deviation of the R2-score. For the sake of brevity, we report only the average R2-scores across all 13 targets in this section (Tables 7–9); the R2-scores for individual targets are provided in Appendix C. The inference times are the mean of three measurements, run on a machine with 16 GB RAM, 8 CPUs and an Intel(R) i7-8565 2.0 GHz processor. To compare the inference time of our surrogate models with the computation time required to run FEM simulations, we also include the latter in Tables 7–9.
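The averaged and per-target R2-scores can be obtained directly from scikit-learn; the following sketch uses synthetic stand-in arrays in place of the FEM targets and surrogate predictions.

```python
import numpy as np
from sklearn.metrics import r2_score

# Synthetic stand-ins: rows are evaluation points, columns the 13 targets.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(1000, 13))
y_pred = y_true + rng.normal(scale=0.1, size=y_true.shape)

# Average R2-score across all 13 targets, as reported in Tables 7-9 ...
avg_r2 = r2_score(y_true, y_pred, multioutput="uniform_average")
# ... and the per-target scores, as listed in Appendix C.
per_target_r2 = r2_score(y_true, y_pred, multioutput="raw_values")
print(round(avg_r2, 4), per_target_r2.shape)
```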


**Table 7.** Plate: averaged results; bold values indicate the best-performing surrogate models. Values in parentheses are the corresponding standard deviations of the average R2-scores over repeated runs of the models with inherent randomness. For further information concerning the simulations, see Table 2.

For the graphical results, we chose simulations that represent the error situation well, in order to make statements about the performance of each model. In addition to the absolute errors (Figures 5a–f to 10a–f), the corresponding ground-truth FEM simulations are shown in Figures 5g–10g.

**Figure 5.** Elongation of a perforated plate, Simulation 1 (extrapolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *σxy*.

The GBDTR, KNNR, GPR and SVR algorithms were implemented with the scikit-learn library version 0.24.0 in Python. The SVR and GBDTR algorithms were wrapped in the scikit-learn MultiOutputRegressor API to fit one regressor per target. Regarding our DL algorithms, the MLPs were implemented with the keras API version 2.4.3 and our PINNs with the sciann API version 0.5.5.0 in Python 3.8.5. We used the PDEs from [21,22], but instead of the inversion part, we additionally trained our PINNs with plastic strain data, as for the rest of the surrogate models.
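A minimal sketch of this per-target setup, with placeholder hyperparameters and synthetic stand-in data, looks as follows (the tuned hyperparameter values are listed in Appendix B):

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data: 4 input features, 13 output components.
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 4))
Y = rng.uniform(size=(200, 13))

# MultiOutputRegressor fits one independent regressor per target column;
# the hyperparameters here are placeholders. GBDTR is wrapped the same way.
svr = MultiOutputRegressor(SVR(kernel="rbf", C=1.0))
gbdtr = MultiOutputRegressor(GradientBoostingRegressor(n_estimators=100))

svr.fit(X, Y)
print(svr.predict(X).shape)  # (200, 13): one prediction per target
```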


**Figure 6.** Elongation of a perforated plate, Simulation 4 (interpolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *σzz*.

In the elongation of a perforated plate use case, our approach is based on a total of nine FEM simulations. We used five simulations for training and four simulations to evaluate the fitted models; see Table 2. We report the average R2-scores across all outputs, together with the corresponding inference times, in Table 7.

Regarding extrapolation, the absolute errors of each surrogate model with respect to *σxy* of Simulation 1 are shown in Figure 5. As an example of interpolation, we plot the absolute errors of each surrogate model for *σzz* of Simulation 4 in Figure 6. In both figures, we additionally show the ground truth of the corresponding output variable obtained from the FEM simulation. For both interpolation and extrapolation, the errors are large near the shear band. As far as extrapolation is concerned, most models additionally have significant errors near the maximum negative xy shear stresses; see the blue areas in Figure 5g. GBDTR performs well overall, though its error increases in various locations; while PINNs have a similar average performance, they perform better outside the shear band in terms of absolute errors. The MLP shows the best overall results, followed by KNNR.

In the bending beam use case, similar to the perforated plate use case, we trained our models on five simulations and tested them on the remaining four; see Table 2. We present the average R2-scores across all outputs and the inference times for the test Simulations 1, 4, 6 and 9 in Table 8.

**Table 8.** Beam: averaged results; bold values indicate the best-performing surrogate models. Values in parentheses are the corresponding standard deviations of the average R2-scores over repeated runs of the models with inherent randomness. For further information concerning the simulations, see Table 2.


We provide a graphical representation of the absolute errors of the surrogate models regarding *ε<sup>t</sup><sub>yy</sub>* in Figure 7a–f, with the FEM simulation result in Figure 7g, as one instance of interpolation. The absolute errors of the surrogate models regarding *ε<sup>p</sup><sub>xx</sub>* under extrapolation are shown in Figure 8. Overall, higher errors can be observed for some models near the encastre boundary condition of the beam for this output. While the PINN shows a competitive average R2-score regarding interpolation, its performance on this single target shows significant weaknesses.


**Figure 7.** Bending of a beam, Simulation 6 (interpolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *ε<sup>t</sup><sub>yy</sub>*.

**Figure 8.** Bending of a beam, Simulation 9 (extrapolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *ε<sup>p</sup><sub>xx</sub>*.

The compression of a block with four perforations use case presents a more complex setting, because we generalize across two variables (yield stress and block width). Therefore, we utilize more training data for this use case; see Table 2. We report the average R2-scores with the corresponding standard deviations, where applicable, in Table 9.


**Table 9.** Block: averaged results; bold values indicate the best-performing surrogate models. Values in parentheses are the corresponding standard deviations of the average R2-scores over repeated runs of the models with inherent randomness. For further information concerning the simulations, see Table 2.

As an instance of interpolation, the absolute errors regarding *ε<sup>p</sup><sub>xx</sub>* can be seen in Figure 9a–f, with the Abaqus FEM simulation result in Figure 9g. Correspondingly, an instance of extrapolation is shown in Figure 10, with absolute errors (a–f) and the FEM ground truth (g). Some models show higher prediction errors near shear bands (high *ε<sup>p</sup><sub>xx</sub>* regions) in the interpolation task. However, SVR and GPR cannot extract meaningful information from the training data, especially in the region free of plastic deformation; this is indicated by their low average R2-scores compared to the other models. Considering the absolute errors of *σxy* under extrapolation, the MLP, which otherwise performs well, shows weaknesses and is generally outperformed by the KNNR.


**Figure 9.** Compression of a block, Simulation 7 (interpolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *ε<sup>p</sup><sub>xx</sub>*.

**Figure 10.** Compression of a block, Simulation 13 (extrapolation): absolute errors of different surrogate models (**a**–**f**) and ground truth Abaqus FEM simulation (**g**) of *σxy*.

#### **4. Discussion**

All classes of surrogate models that we considered in this work share several key characteristics: (1) they are mesh-free and can thus deliver results with infinite resolution; (2) the computation time required to obtain the target values at predefined positions is orders of magnitude lower than for FEM simulations; (3) even though a different mesh is created for each FEM simulation setup in which the geometry changes, our results indicate that all classes of surrogate models generalize (interpolate) reasonably well across training data positions; (4) furthermore, all surrogate model classes generalize at least to some extent across use case parameters, such as changes in geometry or material parameters. Finally, all surrogate model classes must be used with care, as they do not extrapolate well to data positions and/or use case parameters unseen during training. Our findings show this in the extrapolation result of the beam use case, Simulation 9: due to the greater yield stress, almost no plastic deformation occurs; thus, the surrogate models are not able to learn such material behavior. Similar findings can be seen in the extrapolation results of the block use case, Simulations 1, 2, 12 and 13: the PINN and SVR approaches are not able to predict acceptable strain components, leading to overall worse averaged R2-scores. In general, the surrogate models show similar behavior with respect to inter- and extrapolation, but differ with respect to individual components, i.e., some models are better than others at predicting individual components (e.g., strains) for unseen generalization variables (e.g., yield stress).

Another caveat is the symmetric nature of a use case, which makes it redundant to evaluate, e.g., stresses at negative x-positions; the proposed surrogate models will nevertheless return such stress values, which consequently cannot be considered meaningful. Similarly, while the surrogate models can readily be evaluated at physically meaningless use case parameters, e.g., negative radii, the results obtained in this way must be considered meaningless as well. Therefore, all surrogate models must be treated with this in mind, which is a fundamental difference from FEM simulations, which do not offer such failure modes. With these considerations in mind, we now turn to discuss specific characteristics of each surrogate model class.

Our KNNR approach, which can be considered simple compared to the other algorithms, gave competitive results; moreover, this approach showed the best results regarding extrapolation (i.e., Simulations 1, 2, 12 and 13) in the block use case.

Algorithms we constructed with MultiOutputRegressor (SVR and GBDTR) could give better results if the hyperparameters were tuned to each target separately. However, we did not do this for fairness reasons, since our other algorithms are also fitted to the overall use case and not to each target individually. We intend to investigate this in the future.

In our setting, the GPR algorithm did not deliver good results. Tuning the kernel function could deliver better results; however, we do not believe that it would be practical to modify it for each new simulation use case. Thus, we do not intend to head in this direction. However, we plan to investigate whether other Bayesian methods (e.g., Bayesian neural networks [30] or neural processes [31]) could be beneficial.

Our MLP approaches delivered the overall best results in our comparison, especially regarding interpolation (i.e., Simulations 4 and 6 in the plate and beam use cases and Simulation 7 in the block use case). They achieved high accuracies (R2-score > 0.992) while reducing the inference time by a factor of over 100 in comparison to FEM simulations. As mentioned before, designing the architecture is not a straightforward process; however, if the network is deep enough and suitable optimization methods are available (e.g., the Adam optimizer), the network can also be trained efficiently utilizing early stopping.

As already reported in the literature [32–35], we experienced in our setting that PINNs are not straightforward to design and train. Due to several plateaus in the loss function, early stopping did not prove to be effective; therefore, we set a fixed number of training epochs. One reason for our observation could be the existence of a non-convex Pareto frontier [36]. In the multi-objective optimization problem, the optimizer might attempt to adjust the model parameters while situated between the different losses, leading it to favor one loss at the expense of the other [37]. Possible approaches to overcome this problem are adaptive optimizers [38], adaptive losses [39], and adaptive activation functions [40]. Moreover, PINNs are the subject of current research and will likely gain further attention in the future. Besides other fundamental methods, we plan to pursue this direction for improved surrogate modeling.

#### **5. Conclusions**

In this work, we deliver a comprehensive evaluation of generalizable and mesh-free ML and DL surrogate models based on FEM simulations and show that surrogate modeling leads to fast predictions with infinite resolution for practical use. In the context of our evaluation, we show which ML and DL models are suitable at which level of complexity with respect to prediction accuracy and inference time; this can serve as a basis for the practical implementation of surrogate models (for example, in production for real-time prediction, cyber–physical systems, and process design).

In future work, we plan to conduct more complex experiments, e.g., generalizing across more input variables regarding geometry (e.g., consideration of all component dimensions) and material parameters (e.g., non-perfect nonlinear material behavior, time-dependent material properties, grain growth, and phase transformation). We will moreover explore extended surrogate models with more complex output variables (e.g., grain size, grain structure, and phase transformation).

**Author Contributions:** Conceptualization, J.G.H. and P.O.; methodology, J.G.H. and B.C.G.; software, J.G.H.; validation, J.G.H., B.C.G. and P.O.; formal analysis, J.G.H.; investigation, J.G.H.; resources, J.G.H., B.C.G., P.O. and R.K.; data curation, J.G.H.; writing—original draft preparation, J.G.H.; writing—review and editing, J.G.H., B.C.G., P.O. and R.K.; visualization, J.G.H.; supervision, R.K.; project administration, J.G.H. and P.O.; funding acquisition, J.G.H., P.O. and R.K. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was funded by Österreichische Forschungsförderungsgesellschaft (FFG) Grant No. 881039.

**Institutional Review Board Statement:** Not applicable.

**Informed Consent Statement:** Not applicable.

**Data Availability Statement:** Not applicable.

**Acknowledgments:** The project BrAIN—Brownfield Artificial Intelligence Network for Forging of High Quality Aerospace Components (FFG Grant No. 881039) is funded in the framework of the program 'TAKE OFF', which is a research and technology program of the Austrian Federal Ministry of Transport, Innovation and Technology. The Know-Center is funded within the Austrian COMET Program—Competence Centers for Excellent Technologies—under the auspices of the Austrian Federal Ministry of Transport, Innovation and Technology, the Austrian Federal Ministry of Economy, Family and Youth and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG. The authors would also like to thank the developers of the sciann API for making their work available and for responding promptly to our questions.

**Conflicts of Interest:** The authors declare no conflict of interest.

## **Appendix A. Surrogate Models**

We follow the notation introduced in Section 2.2 with data instance *Xi* = (*xi*, *yi*) containing input vector *xi* and output vector *yi*; the number of training instances is *n* and the number of test instances is *m*. Notation specific to individual models is introduced when needed.

#### *Appendix A.1. GBDTR*

Boosting methods are powerful techniques in which the final "strong" regressor model is based on an iteratively formed ensemble of "weak" base regressor models [41]. The main idea behind boosting is to sequentially add new models to the ensemble, iteratively refining the output. In GBDTR models, boosting is applied to arbitrary differentiable loss functions. In general, GBDTR models are additive models, where the samples are modified so that the labels are set to the negative gradient, while the distribution is held constant [42].

The additive method of GBDTR is the following:

$$\hat{y}_i = F_G(\mathbf{x}_i) = \sum_{g=1}^{G} h_g(\mathbf{x}_i) \tag{A1}$$

where *ŷi* is the prediction for a given input *xi*, and *hg* are the fitted base tree regressors. The constant *G* is the number of base tree regressors. The GBDTR algorithm is greedy: a newly added tree regressor *hg* is fitted to minimize the loss *Lg* of the resulting ensemble *Fg* = *Fg−1* + *hg*, i.e.,

$$h_g = \arg\min_h L_g = \arg\min_h \sum_{i=1}^{n} l(y_i, F_{g-1}(\mathbf{x}_i) + h(\mathbf{x}_i)) \tag{A2}$$

Here, *l*(*yi*, *F*(*xi*)) is defined by the loss parameters, and *h*(*xi*) is the candidate base regressor. With the utilization of a first-order Taylor approximation:

$$l(z) \approx l(a) + (z - a)\frac{\partial l(a)}{\partial a} \tag{A3}$$

where *z* corresponds to *Fg*−1(*xi*) + *hg*(*xi*) and *a* corresponds to *Fg*−1(*xi*), we can approximate the value of *l* with the following:

$$l(y_i, F_{g-1}(\mathbf{x}_i) + h_g(\mathbf{x}_i)) \approx l(y_i, F_{g-1}(\mathbf{x}_i)) + h_g(\mathbf{x}_i) \left[ \frac{\partial l(y_i, F(\mathbf{x}_i))}{\partial F(\mathbf{x}_i)} \right]_{F = F_{g-1}} \tag{A4}$$

We denote the derivative of the loss with *gi* and remove constant terms:

$$h_g \approx \arg\min_h \sum_{i=1}^{n} h(\mathbf{x}_i)\, g_i \tag{A5}$$

This objective is minimized if *h*(*xi*) is fitted to predict a value proportional to the negative gradient −*gi*.
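The following illustrative sketch implements this additive scheme for the squared loss, where the negative gradient reduces to the residual; note that our experiments used scikit-learn's GradientBoostingRegressor rather than this hand-rolled variant.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch of Eqs. (A1)-(A5) for the squared loss, where the negative
# gradient -g_i is simply the residual y_i - F_{g-1}(x_i).
def gbdtr_fit(X, y, G=100, lr=0.1):
    F = np.full(len(y), y.mean())          # initial constant model
    trees = []
    for _ in range(G):
        residual = y - F                   # negative gradient (Eq. A5)
        h = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        F = F + lr * h.predict(X)          # F_g = F_{g-1} + h_g (Eq. A2)
        trees.append(h)
    return y.mean(), lr, trees

def gbdtr_predict(model, X):
    base, lr, trees = model
    return base + lr * sum(h.predict(X) for h in trees)  # Eq. (A1)

rng = np.random.default_rng(0)
X = rng.uniform(size=(300, 2))
y = np.sin(3.0 * X[:, 0]) + X[:, 1]
model = gbdtr_fit(X, y)
print(np.mean((gbdtr_predict(model, X) - y) ** 2))  # small training MSE
```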

#### *Appendix A.2. KNNR*

The KNNR algorithm is mathematically a relatively simple method compared to the other algorithms presented here. The model stores all available instances from the training dataset *D* and predicts the numerical target *ŷj* of a test query instance *xj* with *n* < *j* ≤ (*n* + *m*) based on a similarity measure (e.g., distance functions). The algorithm computes the (optionally distance-weighted) average of the numerical targets of the *K* nearest neighbors of *xj* in *D* [43].

Specifically, we introduce a distance metric *d* that measures the distance between each training instance *xi* with *i* ≤ *n* and a test instance *xj*. Next, the training instances are sorted in ascending order of their distance to the test instance, i.e., there is a permutation *πj* of the training indices *i* such that *d*(*xπj(1)*, *xj*) ≤ *d*(*xπj(2)*, *xj*) ≤ ··· ≤ *d*(*xπj(n)*, *xj*). Then, the estimate *ŷj*(*xj*) is given as follows:

$$\hat{y}_j(\mathbf{x}_j) = \frac{1}{K} \sum_{i=1}^{K} y_{\pi_j(i)} \tag{A6}$$

where *K* must be specified as a hyperparameter.
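A minimal sketch of the estimate in Equation (A6) follows, assuming a Euclidean distance metric and uniform weights (one common choice for the similarity measure):

```python
import numpy as np

def knnr_predict(X_train, y_train, x_query, K):
    """Uniformly weighted K-nearest-neighbor estimate per Eq. (A6)."""
    d = np.linalg.norm(X_train - x_query, axis=1)  # distances to x_j
    nearest = np.argsort(d)[:K]                    # pi_j(1), ..., pi_j(K)
    return y_train[nearest].mean()                 # (1/K) sum of targets

rng = np.random.default_rng(0)
X_train = rng.uniform(size=(100, 3))
y_train = X_train.sum(axis=1)
print(knnr_predict(X_train, y_train, np.array([0.5, 0.5, 0.5]), K=5))
```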

#### *Appendix A.3. GPR*

Gaussian process regression modeling is a non-parametric Bayesian approach [44]. In general, a Gaussian process is a generalization of the Gaussian distribution. The Gaussian distribution describes random variables or random vectors, while a Gaussian process describes a function *f*(*x*) [45].

In general, a Gaussian process is completely specified by its mean function *μ*(*x*) and covariance function *K*(*x*, *x*′) (also called kernel).

If the function *f*(*x*) under consideration is modeled by a Gaussian process, i.e., if *f*(*x*) ∼ GP(*μ*(*x*), *K*(*x*, *x*′)), then we have the following:

$$\mathbb{E}[f(x)] = \mu(x) \tag{A7}$$

$$\mathbb{E}[(f(\mathbf{x}) - \mu(\mathbf{x}))(f(\mathbf{x}') - \mu(\mathbf{x}'))] = K(\mathbf{x}, \mathbf{x}') \tag{A8}$$

for all *x* and *x*′. Thus, we can define the Gaussian process as follows:

$$f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), K(\mathbf{x}, \mathbf{x}')) \tag{A9}$$

We use the notation that matrix *D* = (*XD*, *YD*) contains the training data with input data matrix *XD* = (*x*1, ... , *xn*) and output data matrix *YD* = (*y*1, ... , *yn*), and matrix *T* = (*XT*, *YT*) contains the test data with input *XT* = (*xn*+1, ... , *xn*+*m*) and output *YT* = (*yn*+1, ... , *yn*+*m*). Under the prior distribution, the training and test outputs are jointly Gaussian with zero mean:

$$\begin{bmatrix} \mathbf{Y}_D \\ \mathbf{Y}_T \end{bmatrix} \sim \mathcal{N}\left(\mathbf{0},\; \begin{bmatrix} K(\mathbf{X}_D, \mathbf{X}_D) & K(\mathbf{X}_D, \mathbf{X}_T) \\ K(\mathbf{X}_T, \mathbf{X}_D) & K(\mathbf{X}_T, \mathbf{X}_T) \end{bmatrix}\right) \tag{A10}$$

The Gaussian process makes a prediction *YT* for *XT* in a probabilistic way, where, as stated before, the posterior distribution is fully described by its mean and covariance:

$$\begin{aligned} \mathbf{Y}_T \mid \mathbf{X}_T, \mathbf{X}_D, \mathbf{Y}_D \sim \mathcal{N}\big(&K(\mathbf{X}_T, \mathbf{X}_D)\, K(\mathbf{X}_D, \mathbf{X}_D)^{-1}\, \mathbf{Y}_D, \\ &K(\mathbf{X}_T, \mathbf{X}_T) - K(\mathbf{X}_T, \mathbf{X}_D)\, K(\mathbf{X}_D, \mathbf{X}_D)^{-1}\, K(\mathbf{X}_D, \mathbf{X}_T)\big) \end{aligned} \tag{A11}$$
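The posterior in Equation (A11) can be computed directly with a few matrix operations. The following numpy sketch assumes a squared-exponential kernel, which is one common choice; the kernels used in our experiments were selected via grid search.

```python
import numpy as np

def rbf_kernel(A, B, length=1.0):
    """Squared-exponential kernel -- one common choice for K(x, x')."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length**2)

def gp_posterior(X_D, Y_D, X_T, jitter=1e-8):
    """Posterior mean and covariance per Eq. (A11), zero prior mean."""
    K_DD = rbf_kernel(X_D, X_D) + jitter * np.eye(len(X_D))
    K_TD = rbf_kernel(X_T, X_D)
    K_TT = rbf_kernel(X_T, X_T)
    K_DD_inv = np.linalg.inv(K_DD)  # a Cholesky solve is preferable in practice
    mean = K_TD @ K_DD_inv @ Y_D
    cov = K_TT - K_TD @ K_DD_inv @ K_TD.T
    return mean, cov

rng = np.random.default_rng(0)
X_D = rng.uniform(size=(20, 1))
Y_D = np.sin(3.0 * X_D[:, 0])
X_T = np.linspace(0.0, 1.0, 5)[:, None]
mean, cov = gp_posterior(X_D, Y_D, X_T)
print(mean.shape, cov.shape)  # (5,), (5, 5)
```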

#### *Appendix A.4. SVR*

The SVR approach is a generalization of the SVM classification problem, obtained by introducing an *ε*-insensitive region around the approximated function, also called an *ε*-tube. The optimization task in SVR contains two steps: first, finding a convex *ε*-insensitive loss function to be minimized, and second, finding the smallest *ε*-tube that contains the most training instances.

The convex optimization problem has a unique solution and is solved using numerical optimization algorithms. One of the main advantages of SVR is that the computational complexity does not depend on the dimensionality of the input space [46]. To deal with otherwise intractable constraints of the optimization problem, we introduce slack variables *ξi* and *ξi*<sup>∗</sup> [47]. The positive constant *C* determines the trade-off between the flatness of the function and the magnitude up to which deviations greater than *ε* are tolerated. The primal quadratic optimization problem of SVR is defined as follows:

$$\underset{\boldsymbol{\omega},\, b}{\text{minimize}}\; \frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) \tag{A12}$$

$$\text{subject to:} \begin{cases} y_i - \boldsymbol{\omega}^T \mathbf{x}_i - b & \le \; \varepsilon + \xi_i \\ \boldsymbol{\omega}^T \mathbf{x}_i + b - y_i & \le \; \varepsilon + \xi_i^* \\ \xi_i,\, \xi_i^* & \ge \; 0 \end{cases} \tag{A13}$$

Here, *ω* is the weight and *b* the bias to be adjusted. The constrained quadratic optimization problem can be solved by minimizing the Lagrangian with non-negative Lagrange multipliers *λi*, *λi*<sup>∗</sup>, *αi*, *αi*<sup>∗</sup>, *i* ∈ {1, . . . , *n*}:

$$\begin{aligned} \mathcal{L}(\boldsymbol{\omega}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^*, \boldsymbol{\lambda}, \boldsymbol{\lambda}^*, \boldsymbol{\alpha}, \boldsymbol{\alpha}^*) = {}& \frac{1}{2}\|\boldsymbol{\omega}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*) + \sum_{i=1}^{n} \alpha_i \left( y_i - \boldsymbol{\omega}^T \mathbf{x}_i - b - \varepsilon - \xi_i \right) \\ & + \sum_{i=1}^{n} \alpha_i^* \left( \boldsymbol{\omega}^T \mathbf{x}_i + b - y_i - \varepsilon - \xi_i^* \right) - \sum_{i=1}^{n} \left( \lambda_i \xi_i + \lambda_i^* \xi_i^* \right) \end{aligned} \tag{A14}$$

The minimum of L can be found by taking the partial derivatives with respect to the variables and setting them equal to zero (*Karush–Kuhn–Tucker* (KKT) conditions). From the final KKT condition, we can state the following:

$$\begin{aligned} \alpha_i \left( y_i - \boldsymbol{\omega}^T \mathbf{x}_i - b - \varepsilon - \xi_i \right) &= 0 \\ \alpha_i^* \left( \boldsymbol{\omega}^T \mathbf{x}_i + b - y_i - \varepsilon - \xi_i^* \right) &= 0 \\ \lambda_i \xi_i &= 0 \\ \lambda_i^* \xi_i^* &= 0 \end{aligned} \tag{A15}$$

The Lagrange multipliers that are zero correspond to instances inside the *ε*-tube, while the support vectors have non-zero Lagrange multipliers. The function estimate depends only on the support vectors; hence, this representation is sparse. More specifically, we can derive the following function approximation to predict *ŷj*(*xj*):

$$\hat{y}_j(\mathbf{x}_j) = \sum_{i=1}^{n_{SV}} (\alpha_i^* - \alpha_i)\, \mathbf{x}_i^T \mathbf{x}_j + b \tag{A16}$$

with *αi*, *αi*<sup>∗</sup> ∈ [0, *C*] and the number of support vectors *nSV*. For nonlinear SVR, we replace *ωTxi* in (A12)–(A15) by *ωTφ*(*xi*) and the inner product in (A16) by the kernel *K*(*xi*, *xj*).
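As a consistency check of the sparse expansion above, the following sketch recovers scikit-learn's linear-kernel SVR predictions from its stored support vectors, dual coefficients (the multiplier differences, up to sign convention) and bias; the data are synthetic stand-ins.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(size=(100, 2))
y = X[:, 0] - 2.0 * X[:, 1]

svr = SVR(kernel="linear", C=1.0, epsilon=0.05).fit(X, y)

# Reproduce the expansion of Eq. (A16) by hand: only the support vectors
# contribute. dual_coef_ stores the per-SV multiplier differences and
# intercept_ the bias b.
x_j = np.array([[0.3, 0.7]])
manual = svr.dual_coef_ @ (svr.support_vectors_ @ x_j.T) + svr.intercept_
print(np.allclose(manual.ravel(), svr.predict(x_j)))  # True
```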
