Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs

Wang, Yanlin; Cheng, Zhijun; Wang, Zichen

doi:10.3390/math12182923

Open AccessFeature PaperArticle

Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs

by

Yanlin Wang

,

Zhijun Cheng

^* and

Zichen Wang

College of Systems Engineering, National University of Defense Technology, Changsha 410073, China

^*

Author to whom correspondence should be addressed.

Mathematics 2024, 12(18), 2923; https://doi.org/10.3390/math12182923

Submission received: 24 August 2024 / Revised: 17 September 2024 / Accepted: 18 September 2024 / Published: 20 September 2024

(This article belongs to the Section E1: Mathematics and Computer Science)

Download

Browse Figures

Versions Notes

Abstract

:

Multi-output regression aims to utilize the correlation between outputs to achieve information transfer between dependent outputs, thus improving the accuracy of predictive models. Although the Bayesian support vector machine (BSVR) can provide both the mean and the predicted variance distribution of the data to be labeled, which has a large potential application value, its standard form is unable to handle multiple outputs at the same time. To solve this problem, this paper proposes a multi-output Bayesian support vector machine model (MBSVR), which uses a covariance matrix to describe the relationship between outputs and outputs and outputs and inputs simultaneously by introducing a semiparametric latent factor model (SLFM) in BSVR, realizing knowledge transfer between outputs and improving the accuracy of the model. MBSVR integrates and optimizes the parameters in BSVR and those in SLFM through Bayesian derivation to effectively deal with the multi-output problem on the basis of inheriting the advantages of BSVR. The effectiveness of the method is verified using two function cases and four high-dimensional real-world data with multi-output.

Keywords:

multiple dependent outputs; support vector regression; Bayesian inference; semiparametric latent factor model; multi-output Bayesian support vector regression (MBSVR)

MSC:

62J02; 68Txx

1. Introduction

Regression models, also known as response models, can accurately predict the output of other features by establishing a mapping relationship between data features and outputs [1]. The commonly used regression models are single-output models, i.e., models with one or more inputs but only one output. However, in real engineering problems, there are often multiple outputs. For example, in novel battery material discovery, simultaneous and comprehensive prediction of the multidimensional properties of battery electrode materials is needed to help accelerate material discovery and design [2]. In environmental forecasting, there is a need to simultaneously predict particulate matter concentrations at different air quality monitoring stations, which often have potentially nonlinear spatial correlations. Reliable and accurate predictions help in crisis response and can reduce health risks [3]. Ultra-high-performance fiber-reinforced concrete (UHPFRC) is used in a variety of civil engineering applications, and its structural behavior is closer to that of steel. To investigate the effect of component dosage on its strain and energy absorption capacity under peak tension and to optimize the material dosage, both outputs need to be predicted simultaneously [4]. Multiple outputs can be processed separately, but this method ignores the potential correlation between the outputs and results in information loss. Therefore, the correlation between the outputs can be used to build a multi-output model, which is also known as a multi-response or multi-task model.

Support vector machine (SVR) was first proposed by Vapnik based on the principle of structural risk minimization [5]. SVR uses quadratic programming to obtain predictions for a single output. Compared to other models, SVR has superior performance due to its structural risk minimization principle, which allows it to avoid overfitting and achieve better output approximation [6,7]. The Bayesian support vector machine (BSVR) introduces Gaussian process assumptions and Bayesian inference on the basis of SVR to obtain the predicted values and their distributions. The BSVR model not only obtains an estimate of the unknown sample points but also has the advantages of adaptivity and prediction error distributions of Bayesian methods [8]. Meanwhile, SVR has shown superior performance in dealing with nonlinear problems and avoiding overfitting with good generalization ability [9]. Therefore, Bayesian support vector machines have received a lot of attention in the past time, such as [9,10,11] and references therein.

Multiple output regression aims to establish a mapping from multivariate inputs to multiple outputs [12]. Despite the potential utility of BSVR, its standard form cannot handle multiple outputs. The simplest way to deal with the multiple-output problem is to model multiple outputs individually. For each output indicator, a model can be built independently. This treatment is simpler but does not take into account the correlation between outputs and is suitable for scenarios where no correlation exists between outputs [13]. Another method is chain modeling [14]. This method predicts an output, followed by predicting the next output using the predicted output as input, and so on. However, the use of chain modeling requires determining the order of the outputs and the dependencies between them.

Considering that multiple outputs are correlated and being modeled individually can lead to information loss, more and more multi-output modeling approaches have been proposed. Multi-output modeling takes advantage of the correlation between outputs so that a single output can utilize information from other outputs to obtain more accurate predictions [13]. Methods have been developed to extend the support vector machine model so that it can handle multiple outputs simultaneously. Pérez-Cruz et al. [14] transformed the pipeline in a pipeline-based model

ε - S V R

into a hyper-square pipeline by equalizing the output values of the data points located outside the pipeline. This hyperspherical insensitive zone is designed to be more effective than modeling it individually. Zhang et al. [15] proposed an extended LSSVR (ELS-SVR), which extends the original feature space using vector virtualization to represent the multi-output case as an equivalent single-output case in the extended feature space and solved using a least-squares support vector machine. Inspired by multi-task learning, Xu et al. [16] changed the weight vector of a least-squares support vector machine from one to two. One carries generic information and the other carries specific information, thus characterizing the correlation between the two outputs, which is referred to as multi-output LS-SVR (MLS-SVR). Literature [17] gives an overview of the correlation methods and also analyzes the disadvantages of the above methods: the hyperspherical ϵ-tube of M-SVR does not exhibit an advantage over a hypercubic one, ELSSVR cannot handle the negative correlations, MLSSVR does not handle well the case of only partial correlations, and the above methods do not consistently outperform single-output support vector machines.

The above methods distribute modifications to

ε - S V R

and LSSVR so that they can solve the multi-output problem. Among them, the support vectors in

ε - S V R

are sparse and only some of the samples are involved in the model construction. LSSVR transforms convex quadratic optimization problems into linear systems of equations problems, in which all the samples are involved in the model construction. These methods better utilize the correlation between the outputs and improve the model accuracy to some extent. However, these methods cannot obtain a prediction distribution similar to BSVR, which can quantify uncertainty and has good application prospects. In addition, BSVR is based on Bayesian theory, which can systematically and effectively infer the optimal hyperparameters [8]. In terms of describing the correlation of multiple outputs, the method based on

ε - S V R

does not have an accurate structure to describe the correlation of outputs. ELS-SVR only describes the correlation through a parameter greater than 0, so it cannot describe the negative correlation. MLS-SVR describes the shared information through the disassembly of the weight vector. However, as the weight describes the correlation of multiple outputs in a unified manner, it cannot describe the partial correlation of the outputs.

Therefore, this paper introduces the multi-output Gaussian process assumption based on the Bayesian support vector machine model (BSVR) while considering the variability of multiple outputs in terms of SVR trade-off parameters. A Bayesian framework is used to systematically and comprehensively optimize the original BSVR hyperparameters and the hyperparameters of the Gaussian process, which in turn provides the predicted values and probability distributions of multiple outputs. The difference between the multi-output Bayesian support vector machine (MBSVR) and single-output Bayesian support vector machine (BSVR) mainly lies in the kernel function. MBSVR uses the semiparametric latent factor model in the new kernel function, which describes the relationship between the inputs and outputs, and the outputs and outputs at the same time through the linear combinations of implicit functions so that information between them can be transferred to improve the accuracy of the model. The main contributions of our work are as listed:

The method inherits the advantages of support vector machines in nonlinear, high-dimensional problems by introducing Bayesian derivation in support vector machines.
Compared to other SVR-based multi-output regression methods. Based on Bayesian theory, the predicted mean and its probability distribution (uncertainty) can be obtained, and the hyperparameter optimization can be performed systematically and effectively.
Compared with BSVR, the method combines the SLFM structure with BSVR for comprehensive optimization of parameters through information transfer between outputs and uses the shared information to improve model accuracy.
The use of a trade-off parameter makes the method sensitive to outliers and allows for more robust performance on real datasets than multi-output Gaussian process.

The rest of the paper is structured as follows: in Section 2, the Bayesian support vector machine model and the semiparametric latent factor model are introduced. In Section 3, the new multi-output Bayesian support vector machine model is introduced. In Section 4, the model evaluation is carried out using function arithmetic and real datasets, and Section 5 concludes.

2. Related Description and Basic Theories

2.1. Single-Output Bayesian Support Vector Machine

Single-output Bayesian support vector machine (BSVR) introduces Gaussian process assumptions based on SVR to optimize the hyperparameters by Bayesian derivation [8]. BSVR is widely used because it can obtain the mean and the prediction variance of the prediction through rigorous derivation to measure the uncertainty of prediction while maintaining the advantages of SVR. In BSVR, the mapping relationship between outputs and factors can be expressed as [8]:

y_{j} = g (x_{j}) + δ_{j}

(1)

where

δ_{j}

is an independently and identically distributed random error.

g (x_{j})

is the support vector machine regression model. It is a zero-mean Gaussian stochastic process whose covariance between the two outputs of different x can be expressed as:

k (\tilde{g} (x), \tilde{g} (x') = k (x, x') = \prod_{j = 1}^{n} \exp (- θ_{j} {(x_{j} - x_{j}^{'})}^{2})

(2)

where

θ = (θ_{1}, θ_{2} \dots θ_{n})

is the hyperparameter to be adjusted and the covariance between the two outputs is equal to the value of the corresponding kernel function. Under the assumptions of the Gaussian process, the a priori of the outputs

G = {\tilde{g} (x_{1}), \tilde{g} (x_{2}), \dots, \tilde{g} (x_{N})}^{T}

can be described as:

p (G | γ) = \frac{1}{Z_{G}} \exp (- \frac{1}{2} {(G - b)}^{T} K^{- 1} (G - b))

(3)

Z_{G} = {(2 π)}^{N / 2} | K |^{1 / 2}

(4)

where

K \in ℝ^{N \times N}

is the input covariance matrix in the sample where training was performed,

b = [b, b, \dots, b] \in ℝ^{N}

.

γ

denotes all hyperparameters. In this problem, the hyperparameters include the hyperparameters

θ = (θ_{1}, θ_{2} \dots θ_{n})

in the kernel function and the trade-off parameters

C

. Since the noise is assumed to be an independent and identically distributed random variable, the likelihood function of the sample output for the training set of samples can be expressed as [8]:

p (Y | G, γ) = \prod_{j = 1}^{N} p (y_{j} - g (x_{j}) | G, γ) = \prod_{j = 1}^{N} p (δ_{j})

(5)

where

p (δ_{j})

is the probability distribution of

δ_{j}

with the expression:

p (δ) = \frac{1}{Z_{δ}} \exp (- C l (δ))

(6)

Z_{δ} = \int \exp (- C l (δ)) d δ = \sqrt{2 π / C}

(7)

where

l (δ)

is the loss function of the model and

C

is a trade-off constant. According to Bayesian theory, the posterior distribution is obtained by synthesizing the information of the existing samples with the prior distribution. In BSVR, the likelihood function characterizes the training set sample information, while

p (G | γ)

is the prior distribution of the samples. Since the prior distribution satisfies the Gaussian process assumption, it can be represented by Equation (3). In summary, the posterior distribution of satisfies [18].

p (G | Y, γ) = \frac{p (Y | G, γ) p (G | γ)}{p (Y | γ)}

(8)

p (Y | γ)

is a normalization constant. Bringing (3) and (5) into (8) yields:

p (G | Y, γ) = \frac{1}{Z} \exp (- C \sum_{i = 1}^{N} l (y_{i} - \tilde{g} (x)) - \frac{1}{2} {(G - b)}^{T} K^{- 1} (G - b))

(9)

Z = \int \exp (- S (G)) d G

(10)

S (G) = \sum_{j = 1}^{N} C l (y_{j} - \tilde{g} (x_{j})) + \frac{1}{2} {(G - b)}^{T} K^{- 1} (G - b)

(11)

Thus, according to the principle of the great likelihood method of solution, maximizing the posterior distribution in (11) can be equated to:

\min_{G} \sum_{i = 1}^{N} C l (y_{i} - \tilde{g} (x_{i})) + \frac{1}{2} {(G - b)}^{T} K^{- 1} (G - b)

(12)

where

C

is the equilibrium parameter. The loss function of a support vector machine can be represented in a variety of ways, and one is the squared loss function:

l (δ) = \frac{1}{2} δ^{2}

(13)

Minimizing the squared loss function is essentially equivalent to great likelihood estimation under the assumption that the error follows a Gaussian distribution. Bringing the loss function expression into (12) yields the new objective function as:

\min_{G} \sum_{j = 1}^{N} \frac{C}{2} e_{j}^{2} + \frac{1}{2} {(G - b)}^{T} K^{- 1} (G - b)

(14)

where

y_{j} = g (x_{j}) + e_{j}

. Solving yields an estimate of

G

as:

\hat{G} = K {(K + I / C)}^{- 1} Y = K β + b

(15)

where

I

is the unit matrix,

β = [β_{1}, β_{2}, \dots β_{N}] = {(K + I / C)}^{- 1} Y

.

For the output

g (x)

to be predicted, its joint distribution with the training set satisfies:

(\begin{array}{l} g (x) \\ G \end{array}) \sim N (\{\begin{array}{l} 0 \\ 0 \end{array}\}, \{\begin{matrix} k (x, x) & k (x, X) \\ k (X, x) & k (X, X) \end{matrix}\})

(16)

k (X, x) = k {(x, X)}^{T} = {[k (x_{1}, x), \dots, k (x_{N}, x)]}^{T}

(17)

(17) denotes the variance between

G

and

g (x)

,

k (X, X) = K

. The prior of

g (x)

still obeys Gaussian distribution:

\begin{matrix} p (g (x) | Y) = \int p (\tilde{g} (x) | Y) p (G | Y) d G \\ = N (μ (x), Σ^{2} (x)) \end{matrix}

(18)

\begin{array}{l} μ (x) & = k (x, X) k {(X, X)}^{- 1} \hat{G} \\ = k (x, X) {(K + I / C)}^{- 1} Y \\ = \sum_{j = 1}^{N} β_{j} k (x, x_{j}) \end{array}

(19)

Σ^{2} (x) = k (x, x) - k (x, X) k {(X, X)}^{- 1} k (X, x)

(20)

For the Bayesian support vector machine with a squared loss function, the parameters to be optimized include

θ

and

C

, and the optimal values of these hyperparameters are determined by the maximum posteriori probability. The specific formula derivation and calculation methods are described in [10].

2.2. Semiparametric Latent Factor Model (SLFM)

The most used multi-output covariance structure in multi-output Gaussian processes is the linear model of coregionalization (LMC) [19]. The semiparametric latent factor model (SLFM) is a special form of LMC. The model assumes that there are

Q

shared potential Gaussian processes, and generally, the number of potential Gaussian processes is smaller than the number of outputs. SLFM represents the output as a linear combination of

Q

Gaussian processes. Taking two outputs as an example, two Gaussian implicit functions are chosen,

N_{f} = Q = 2

. Assuming that

u_{1} (x)

and

u_{2} (x)

are obtained by sampling from the two Gaussian processes respectively, where

u_{1} (x) \sim G P (0, k_{1} (x, x'))

,

u_{2} (x) \sim G P (0, k_{2} (x, x'))

,

g_{1} (x)

and

g_{2} (x)

are obtained by linearly transforming

u_{1} (x)

and

u_{2} (x)

:

g_{1} (x) = a_{1, 1} u_{1} (x) + a_{2, 2} u_{2} (x)

(21)

g_{2} (x) = a_{2, 1} u_{1} (x) + a_{2, 2} u_{2} (x)

(22)

where

u_{1} (x)

and

u_{2} (x)

obey a Gaussian process distribution with different covariance functions and are independent of each other, i.e., in the case of

q \neq q'

,

u_{q} (x) ⊥ u_{q'} (x')

,

cov (u_{q} (x), u_{q'} (x')) = 0

. Since a new covariance function can be obtained by linearly combining several covariance functions.

g_{1} (x)

and

g_{2} (x)

can be written as:

g (x) = a_{1} u_{1} (x) + a_{2} u_{2} (x)

(23)

where

g (x) = {[\begin{matrix} g_{1} (x) & g_{2} (x) \end{matrix}]}^{T}

,

a_{1} = {[\begin{matrix} a_{1, 1} & a_{2, 1} \end{matrix}]}^{T}

,

a_{2} = {[\begin{matrix} a_{1, 2} & a_{2, 2} \end{matrix}]}^{T}

, the variances of

g (x)

and

g (x')

can be calculated:

\begin{array}{l} k_{M} (x, x') & = cov (g (x), g (x^{'})) \\ = a_{1} {(a_{1})}^{T} cov (u_{1} (x), u_{1} (x')) + a_{2} {(a_{2})}^{T} cov (u_{2} (x), u_{2} (x')) \\ = a_{1} {(a_{1})}^{T} k_{1} (x, x') + a_{2} {(a_{2})}^{T} k_{2} (x, x') \end{array}

(24)

Defining

B_{1} = a_{1} {(a_{1})}^{T}

,

B_{2} = a_{2} {(a_{2})}^{T}

, we can obtain

k_{M} (x, x') = B_{1} k_{1} (x, x') + B_{2} k_{2} (x, x')

(25)

[\begin{matrix} g_{1} \\ g_{2} \end{matrix}] = [\begin{matrix} g_{1} (x_{1}) \\ ⋮ \\ g_{1} (x_{N}) \\ g_{2} (x_{1}) \\ ⋮ \\ g_{2} (x_{N}) \end{matrix}] \sim N ([\begin{matrix} 0 \\ 0 \end{matrix}], B_{1} \otimes K_{1} + B_{2} \otimes K_{2})

(26)

Furthermore, consider a plurality of outputs

{g_{i} (x)}_{i = 1}^{N_{f}}

in a more generalized form:

g_{i} (x) = \sum_{q = 1}^{Q} a_{i, q} u_{q} (x)

(27)

(27) can be described using a matrix as:

g (x) = A u (x)

(28)

g (x) = {g_{1} (x), g_{2} (x), \dots g_{N_{f}} (x)}^{T}

(29)

u (x) = [u_{1} (x), u_{2} (x), \dots, u_{Q} (x)]

(30)

A = [\begin{matrix} a_{1, 1} & \dots & a_{1, Q} \\ ⋮ & ⋱ & ⋮ \\ a_{N_{f}, 1} & \dots & a_{N_{f}, Q} \end{matrix}] \in ℝ^{N_{f} \times Q}

(31)

Similarly, since multiple Gaussian processes are independent for multiple outputs

g (x) = {[g_{1} (x) \dots g_{N_{f}} (x)]}^{T}

, the covariance of multiple outputs can be expressed as:

k_{M} (x, x') = \sum_{q = 1}^{Q} A_{q} A_{q}^{T} k_{q} (x, x')

(32)

[\begin{matrix} g_{1} \\ ⋮ \\ g_{N_{f}} \end{matrix}] = [\begin{matrix} g_{1} (x_{1}) \\ ⋮ \\ g_{1} (x_{N}) \\ g_{2} (x_{1}) \\ ⋮ \\ g_{2} (x_{N}) \\ ⋮ \\ g_{N_{f}} (x_{1}) \\ ⋮ \\ g_{N_{f}} (x_{N}) \end{matrix}] \sim N ([\begin{matrix} 0 \\ ⋮ \\ 0 \end{matrix}], \sum_{q = 1}^{Q} A_{q} A_{q}^{T} \otimes K_{q})

(33)

where in

A_{q} \in ℝ^{N_{f} \times N_{f}}

, the elements corresponding to the

i

output is

A_{i i'}^{q} = a_{i, q} a_{i', q}

. To further characterize the multi-output Gaussian process, it is necessary to determine the number of implicit functions. It has been found that the larger

Q

is, the more flexible the model is and the more variability can be described. Some scholars have determined

Q

to be two or the number of outputs. The increase of

Q

will also bring about a further increase in computational cost. In order to balance the flexibility and accuracy of the model and the computational overhead, this paper will take the value of

Q

as

N_{f}

. Then, the correlation can be further expressed as:

k_{M} (x, x') = R d i a g [k_{1} (x, x'), \dots, k_{Q} (x, x')] R^{T}

(34)

A_{q} = r_{q} {r_{q}}^{T}

(35)

R = [r_{1}, r_{2}, \dots, r_{Q}]

(36)

A = [\begin{matrix} a_{1, 1} & \dots & a_{1, Q} \\ ⋮ & ⋱ & ⋮ \\ a_{N_{f}, 1} & \dots & a_{N_{f}, Q} \end{matrix}] = R R^{T}

(37)

and

Q = N_{f}

,

k_{q} (x, x') = \prod_{d = 1}^{D} \exp (- θ_{d} {(x_{d} - x_{d}^{'})}^{2})

. The kernel function of a Gaussian process has a set of parameters

θ \in ℝ^{D}

to be optimized,

D

is the sample dimension, then for each of the

Q

implicit functions, there is a set of

θ

, which is used to measure the importance of the inputs as equivalent to the specified outputs. Then the parameters of the kernel function to be optimized include

θ_{M} = [θ_{1}, θ_{2}, \dots, θ_{N_{f}}] \in ℝ^{N_{f} \times N}

.

\sum_{0} = A A^{T} \in ℝ^{N_{f} \times N_{f}}

is used to describe the covariance between multiple outputs.

A

is the upper triangular matrix and also a set of unknown hyperparameters to be optimized.

3. Multi-Output Bayesian Support Vector Regression Model

The structure of the MBSVR model is shown in Figure 1, where the left side is the SLFM structure and

g (x)

combines linear combination of

Q

implicit functions, according to which the variance between

g (x)

can be quantitatively described. The right side represents the trade-off parameters in the support vector machine; for each output, there is a corresponding trade-off parameter, which is used to trade off the complexity and error of the model. The

i t h

output can be expressed as (27), where

a_{i, q}

is the parameter to be optimized,

u_{q} (x)

is Gaussian process implicit function. Based on the expression of

g_{i} (x)

, the model can be expressed as:

Y_{i} = g (x_{i}) + δ_{i}

(38)

where

x_{j}

is the

j t h

sample and

δ

is an independently and identically distributed random error. C is a number greater than 0, which determines the degree of tolerance for error in the model. When C is large, the model will not allow for errors, the complexity is high, and it may be overfitted with poor generalization ability. When C is small, the model does not focus on the presence of errors, the model is simpler, and it is easy to be underfitted.

MSVR combines the SLFM structure with the support vector machine model through the Bayesian assumptions. Through the Gaussian process assumption and Bayesian derivation, the correlation between the outputs is effectively delineated, and finally, the predicted mean and probability distribution of multiple outputs are obtained.

3.1. Bayesian Assumptions for MBSVR

Assume that a multi-output modeling problem consists of

N_{f}

outputs and

N

samples. Define a vector

Y

, which characterizes the outputs of the sample points and contains

N_{f} \times N

elements in the vector. Multi-output Bayesian support vectors aim to approximate the

N_{f}

outputs

{g_{i} (x)}_{1 \leq i \leq N_{f}}

simultaneously. A more accurate model is built by considering the correlation between the outputs. In a multi-output Bayesian support vector machine, for a certain

x_{j} \in ℝ^{d}

, the relationship between the outputs and the factors can be expressed as (38). where

δ_{j} \in ℝ^{N_{f}}

is an independently and identically distributed random error whose distribution form is usually unknown.

Y_{j}

is the values of multiple outputs.

g (x)

is a support vector machine, which is a multi-output Gaussian process. Since there are multiple outputs, the multiple outputs of all samples

g (x) = {g_{1} (x), g_{2} (x), \dots g_{N_{f}} (x)}^{T}

are satisfied:

g (x) \sim G P (0, \sum_{0})

(39)

According to the SLFM principle,

\sum_{0} = A A^{T}

is the parameter to be optimized.

N_{f}

Gaussian process outputs can be expressed as:

{\{\begin{matrix} {\tilde{g}}_{1} (x_{1}), {\tilde{g}}_{1} (x_{2}), \dots, {\tilde{g}}_{1} (x_{N}) \\ {\tilde{g}}_{2} (x_{1}), {\tilde{g}}_{2} (x_{2}), \dots, {\tilde{g}}_{2} (x_{N}) \\ ⋮ \\ {\tilde{g}}_{N_{f}} (x_{1}), {\tilde{g}}_{N_{f}} (x_{2}), \dots, {\tilde{g}}_{N_{f}} (x_{N}) \end{matrix}\}}^{T}

(40)

In order to make the model satisfy the Gaussian process assumptions and to facilitate the solution, the Gaussian process output is stored using a stack as:

G = {\{{\tilde{g}}_{1} (x_{1}), {\tilde{g}}_{2} (x_{1}), \dots, {\tilde{g}}_{N_{f}} (x_{1}) \dots, {\tilde{g}}_{1} (x_{N}), {\tilde{g}}_{2} (x_{N}), \dots, {\tilde{g}}_{N_{f}} (x_{N})\}}^{T} \in ℝ^{N_{f} \times N}

. To facilitate the derivation of the formula, it is denoted as

G = {g_{1}, g_{2}, \dots, g_{N_{f} \times N}}^{T}

. Then, the likelihood function of

G

can be expressed as:

p (G | γ) = \frac{1}{{(2 π)}^{N_{f} \times N / 2} {(| K_{M} |)}^{1 / 2}} \exp [- \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b)]

(41)

where

b = [b, b, \dots, b] \in ℝ^{N_{f} N}

denotes the mean vector of the

N_{f} \times N

elements and

K_{M} \in ℝ^{N_{f} N \times N_{f} N}

denotes the covariance matrix of

Y

.

γ

denotes the parameter vector to be optimized. The definition of the covariance matrix

K_{M}

is the key element that distinguishes the multi-output Gaussian process from the multi-output correlation, and the description of the multi-output correlation is also included in the covariance matrix.

Since the noise is assumed to be an independent and identically distributed random variable, the likelihood function of the sample output for a given training set of samples can be expressed as:

p (Y | G, γ) = \prod_{r = 1}^{r = N_{f} N} p (y_{r} - g_{r} | G, γ) = \prod_{r = 1}^{r = N_{f} N} p (δ_{r}) = \prod_{i = 1}^{i = N_{f}} \prod_{j = 1}^{j = N} p (δ_{i j})

(42)

where

p (δ)

is the probability distribution of

δ

and the expression is:

p (δ_{i}) = \frac{1}{Z_{δ_{i}}} \exp (- C_{i} l (δ_{i}))

(43)

Z_{δ_{i}} = \int \exp (- C_{i} l (δ_{i})) d δ = \sqrt{2 π / C_{i}}

(44)

where

l (δ_{i}), i = 1, 2, \dots, N_{f}

is the loss function of the support vector machine.

C = {[C_{1}, C_{2}, \dots, C_{N_{f}}]}^{T}

is the trade-off constant. For each output, there is a corresponding trade-off constant. According to Bayesian theory, the posterior distribution of

G

satisfies:

p (G | Y, γ) = \frac{p (Y | G, γ) p (G | γ)}{p (Y | γ)}

(45)

p (Y | γ)

is a normalization constant, and further,

p (G | Y, γ)

can be expressed as (See Appendix A for more details):

p (G | Y, γ) = \frac{1}{Z} \exp (- C_{M} l {(y - g)}^{T} - \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))

(46)

where

C = {[C_{1}, C_{2}, \dots, C_{N_{f}}]}^{T}

,

C_{M} = 1 D \otimes C

,

1 D = [1, 1, \dots, 1] \in ℝ^{N_{f}}

,

\otimes

is Kronecker products.

Z = \int \exp (- S (G)) d G

(47)

S (G) = C_{M} l {(y - g)}^{T} + \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b)

(48)

Therefore, maximizing the posterior distribution according to the principle of the great likelihood method of solution can be equated to:

\min_{G} C_{M} l {(y - g)}^{T} + \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b)

(49)

Similar to the original support vector machine, the first term of the objective function denotes the empirical risk, and the second term, which denotes the smoothness of the function,

C_{M}

is an expansion of the trade-off parameters.

3.2. Model Construction for MBSVR

As with single-output Bayesian support vector machines, MBSVR still uses the squared loss function:

l (δ) = \frac{1}{2} δ^{2}

(50)

The squared loss function actually obeys a Gaussian probability density function [20]. Bringing in the loss function expression yields the new objective function as

\begin{array}{l} \min_{G} \frac{1}{2} C_{M} e^{2} + \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b) \\ s . t . y_{r} = g_{r} + e_{r} \end{array}

(51)

where

e = {e_{1}, e_{2}, \dots, e_{N_{f} N}}

.

e_{r} = y_{r} - g_{r}, r = 1, 2, \dots, N_{f} N

.

C_{M} = 1 D \otimes C

. The estimate of

G

is (See Appendix B for more details):

\hat{G} = K_{M} {(K_{M} + C_{h})}^{- 1} Y + b = K_{M} β + b

(52)

where

C_{h}

satisfies

C_{h} ⊙ d i a g (C_{M}) = I

,

d i a g (C_{M})

is in diagonal form of

C_{M}

,

I \in ℝ^{N_{f} N \times N_{f} N}

is a unit matrix,

β = [β_{1}, β_{2}, \dots β_{N_{f} \times N}] = {(K_{M} + C_{h})}^{- 1} Y

.

⊙

is the Hadamard product, denoting the element-by-element multiplication of the matrix. For the output to be predicted

g (x)

, its joint distribution with the training set is satisfied:

(\begin{array}{l} g (x) \\ G \end{array}) \sim N (\{\begin{array}{l} b_{0} \\ b \end{array}\}, \{\begin{matrix} k_{M} (x, x) & k_{M} (x, X) \\ k_{M} (X, x) & k_{M} (X, X) \end{matrix}\})

(53)

k_{M} (X, x) = k_{M} {(x, X)}^{T} = {[k_{M} (x_{1}, x), \dots, k_{M} (x_{N}, x)]}^{T}

(54)

k_{M} (X, X) = [\begin{matrix} k_{11} (X, X) & \dots & k_{1 N_{f}} (X, X) \\ ⋮ & ⋱ & ⋮ \\ k_{N_{f} 1} (X, X) & \dots & k_{N_{f} N_{f}} (X, X) \end{matrix}]

(55)

where

k_{M} (X, X) = K_{M}

,

b_{0} = [b, b, \dots, b] \in ℝ^{N_{f}}

.The prior of

g (x)

still obeys a multi-output Gaussian distribution:

p (g (x) | G) = N (μ (x), Σ^{2} (x))

(56)

μ (x) = b_{0} + k_{M} (x, X) {(k_{M} (X, X) + C_{h})}^{- 1} (G - b)

(57)

Σ^{2} (x) = k_{M} (x, x) - k_{M} (x, X) (k_{M} (X, X) + C_{h}))^{- 1} k_{M} (X, x)

(58)

where

b_{0} = [b, b, \dots, b] \in ℝ^{N_{f}}

,

k_{M} (x, X) \in ℝ^{N_{f} \times N_{f} N}

,the expression is:

k_{M} (x, X) = [\begin{matrix} k_{11} (x, X) & \dots & k_{1 N_{f}} (x, X) \\ ⋮ & ⋱ & ⋮ \\ k_{N_{f} 1} (x, X) & \dots & k_{N_{f} N_{f}} (x, X) \end{matrix}]

(59)

k_{i i'} (x, X) = [k_{i i'} (x, x_{1}), k_{i i'} (x, x_{2}), \dots, k_{i i'} (x, x_{N})], i = 1, 2, \dots N_{f}

(60)

k_{M} (x, x) = [\begin{matrix} k_{11} (x, x) & \dots & k_{1 N_{f}} (x, x) \\ ⋮ & ⋱ & ⋮ \\ k_{N_{f} 1} (x, x) & \dots & k_{N_{f} N_{f}} (x, x) \end{matrix}] \in ℝ^{N_{f} \times N_{f}}

(61)

The variance of the

i t h

diagonal element of

Σ^{2} (x)

corresponds to the variance of

i t h

output of

x

.

3.3. Optimized Solution of Parameters

In MBSVR, the parameters to be optimized include the kernel function parameter

θ_{M}

; the trade-off parameter

C

; and the matrix

A

, which describes the correlations between outputs. For computational convenience, the specific implementation is decomposed by Cholesky

A

into

R R^{T}

. The optimal values of these hyperparameters are determined by the maximum a posteriori probability:

p (γ) = \frac{p (Y | γ) p (γ)}{p (Y)}

(62)

where

γ = {θ_{M}, C, R}

,

p (γ)

is the prior distribution of the hyperparameters, and

p (Y)

is a regularization constant that in general specifies a uniform probability distribution for the hyperparameters. Therefore, its prior distribution

p (γ)

is a constant. Therefore, it is only necessary to maximize

p (Y | γ)

to achieve the purpose of great likelihood estimation of the parameters:

p (Y | γ) = \int p (Y | G, γ) p (G | γ) d G = \frac{1}{Z_{G} \prod_{i = 1}^{N_{f}} Z_{δ_{i}}^{N}} \int \exp (- S (G)) d G

(63)

where

Z_{G} = {(2 π)}^{N_{f} \times N / 2} {(| K_{M} |)}^{1 / 2}

,

Z_{δ_{i}} = \int \exp (- C_{i} l (δ_{i})) d δ_{i}

.

S (G)

can be expressed as

S (G) = S (\hat{G}) + \frac{1}{2} {(G - \hat{G})}^{T} ({K_{M}}^{- 1} + d i a g (C_{M}) ⊙ I) (G - \hat{G})

(64)

Bringing (64) into the probability distribution in (63) yields the following equation:

- \ln (p (Y | γ)) = \frac{1}{2} C_{M} e^{2} + \frac{1}{2} β^{T} K_{M} β + \frac{1}{2} \ln | I + d i a g (C_{M}) ⊙ K_{M} | + N \sum_{i = 1}^{N_{f}} \ln Z_{δ_{i}}

(65)

The hyperparameters are obtained by solving according to the minimized likelihood function. The nonlinear programming problem is solved using the “fmincon” function in MATLAB2022b. Given the initial solution, iterative optimization is performed to obtain the optimal hyperparameters. In general, the method can find the global optimal solution of the objective function, and the initial value of the parameters on the optimal solution of the parameters has less influence.

4. Numerical Experiments

4.1. Performance Metrics and Experiment Settings

The single-response Bayesian support vector machine [10] and a multi-output Gaussian process model [21] are used for comparison with the multi-output Bayesian support vector machine model (MBSVR). The three methods are denoted by (independent regression) IND, (multi-output Gaussian process) MGP, and MBSVR, respectively. The lhsdesign function in MATLAB is used to generate the training set and test set, and for the model parameters, the kernel function parameter in MBSVR is initialized to 1, the trade-off parameter C is initialized to 1000, and the range of optimality search is

[0.01, 100]

and

[1, 10^{6}]

. The kernel function parameter of the multi-output Gaussian process is initialized to 1. In the multi-output Bayesian support vector machine,

A_{q}

is initialized to a unit matrix. For the same hyperparameters in all three models, the same initial values and optimization ranges are assigned to allow for a fairer comparison.

To measure the generalization effect of the model, the error criterion normalized root mean square error (NRMSE) is used, and its expression is:

N R M S E_{i} = \frac{\sqrt{\frac{1}{T n} \sum_{t = 1}^{T n} (y_{i t} - {\tilde{y}}_{i t})}}{\max (y_{i}) - \min (y_{i})}

(66)

where

T n

is the size of the test set,

i

denotes

i t h

output,

y_{i t}

is the true value of the

t

test set for the

i

first output, and

{\tilde{y}}_{i t}

is the corresponding predicted value. The smaller the value of the error metric, the more accurate the model is. The performance of the model is affected not only by the training set size but also by the specific sample points. Therefore, in order to ensure the diversity of the experiments and fully reflect the model performance, repeat the experiments 100 times; 100 training sets are generated with the same number of points and statistically characterize the results. In this paper, we use the Pearson correlation coefficient for correlation analysis of the output before proceeding with the model construction, which is the most used [13].

4.2. Datasets Description

In order to evaluate the modeling effect of the three models, two function cases and four real data are selected to verify the effect of the proposed method. The functions are based on the original cases with some changes so that their outputs have a certain degree of relevance, and the selected function cases include the one-dimensional case of Forrester and the two-dimensional case of Branin, and all data details are shown in Table 1.

The specific expressions of Forrester and Branin are given in (Appendix C), and the information of Energy_efficiency, Polymer, Broomcorn, and Sarcos are as follows:

Energy_efficiency [22]. This dataset contains a total of two outputs for heat load and cooling load demand for building energy efficiency. It includes eight factors, such as lighting area, roof area, and overall height. There is a total of 768 sample points in this dataset.

Polymer [16]. The polymer dataset contains ten inputs, such as temperature and feed rate, and contains four outputs of measurements. The dataset contains a total of 61 samples.

Broomcorn [22]. This dataset is a sorghum sample from the Institute of Crop Seed Resources, Chinese Academy of Agricultural Sciences, containing a total of 128 sample points. It contains 19 inputs and 3 outputs. The outputs are the protein, lysine, and starch fractional content of the sorghum samples, respectively.

Sarcos [13,23]. This dataset is a high-dimensional, large-scale dataset. This dataset is an inverse dynamics modeling problem for a 7-degree-of-freedom anthropomorphic robotic arm that has 21 inputs (7 joint positions, 7 joint velocities, and 7 joint accelerations) and corresponding moments at 7 joints as outputs. Only the modeling results for six responses are shown.

4.3. Results and Discussions

4.3.1. Numerical Functions

From Forrester’s expression, it can be found that the output f1 is a nonlinear variation of f2. As shown in Figure 2, the actual output of the two functions is computationally highly correlated, with a Pearson correlation coefficient of 0.95.

Figure 3 shows the modeling results for the two outputs of Forrester. The bottom and top of the boxplot indicate the lower and upper quartiles, respectively, while the center depression indicates the median of the metrics. The maximum length of the vertical line at the end of the boxplot is 1.5 times the interquartile spacing. A red plus sign indicates an outlier that is outside the boundaries of the vertical line. It can be seen that when the number of sample points is small (4), the three modeling methods are roughly equivalent. As the number of training sets increases, the two modeling methods MBSVR and MGP, which consider output correlation. This phenomenon may be due to the fact that when the number of samples is small, the current information cannot support the accurate solution of relevant parameters such as correlation. Moreover, MBSVR has the best results and the most significant advantage when the size of the training set is 5 or 6, after which the modeling results of MBSVR are gradually comparable to the modeling results of MGP. This may be related to the small-sample modeling capability of SVR.

Next, a two-dimensional function Branin with two outputs is used to validate method validity. Output 2 is the original Branin function, and Output 1 is a fine-tuning of the original Branin with a linear translation added, as shown in Figure 4. These two outputs are also strongly correlated, with a Pearson correlation coefficient of 0.69.

Figure 5 shows the modeling results of Branin. Overall, the modeling method considering correlation outperforms the independent modeling method, and for both Output 1 and Output 2, the accuracy of the MBSVR is higher than that of the MGP for moderate training set sizes (11, 14, 17). For training set size 20, the accuracy of the MBSVR model is roughly comparable to that of the MGP model, and even lower than that of the MGP in some sample point cases are even lower than MGP. This phenomenon, which is the modeling advantage of MBSVR becoming progressively less significant as the training set increases, is consistent with the one-dimensional arithmetic case.

4.3.2. Real-World Dataset

According to the results of the correlation analysis, the Pearson correlation coefficient of Energy_efficiency is greater than 0.97. The Pearson correlation coefficients of the four outputs of Polymer are large, as shown in Figure 6a, and there is a strong correlation between Output 1 and Output 2 as well as between Output 3 and Output 4. As for Broomcorn, the Pearson correlation coefficient of Output 1 and Output 2 is 0.39, the correlation coefficient of Output 1 and Output 3 is −0.67, and the correlation coefficient of Output 2 and Output 3 is −0.37. The results of the output correlation analysis of Sarcos are shown in Figure 6b. The correlation of Output 4 and Output 7 is the highest, and the correlation of Output 1 and Output 2 is the lowest. Output 6 has low correlation with all other outputs.

Figure 7, Figure 8, Figure 9 and Figure 10 shows the modeling results for each dataset. As the number of sample points increases, the model accuracy gradually improves but the rate of improvement gradually decreases, as can be seen from the modeling results:

In most cases, modeling methods that consider correlation outperform independent modeling methods, and in aggregate, MBSVR works better. For example, MGP and MBSVR outperform the metric modeling approach BSVR in Output 1 for Energy_efficiency, Output 2 and Output 3 for Broomcorn, and most of the outputs for Sarcos. Except for Output 1 for Broomcorn, where the accuracy of MBSVR is significantly lower than that of the independent modeling approach (IND). In conclusion, MBSVR is more accurate in many problems than other modeling methods.

In some cases, independent modeling methods will outperform methods that consider correlations, such as polymer’s output 3 and Broomcorn’s output 1. For Broomcorn’s output 1, the independent modeling approach significantly outperforms the other. Observing the correlation coefficients, it can be found that the Pearson correlation coefficient between output 1 and output 2 is 0.39, and the correlation coefficient between output 1 and output 3 is −0.67. This disadvantage in accuracy may be due to the lack of obvious correlation, or it may be due to the model’s inability to accurately approximate the real shared information.

As the training size increases, the model accuracy advantage shows two different trends. In Output 1 and Output 4 for polymer and Output 2 and Output 3 for Broomcorn, the advantage of MBSVR over other modeling methods is more pronounced as the training size increases. However, there are also cases where the advantage of MBSVR is not obvious as the training size increases, e.g., in the Sacros dataset, the modeling accuracy of MBSVR gradually converges to the modeling accuracy of MGP as the training set increases. Theoretically, as the training set increases, the hyperparameter estimates should get closer and closer to the true values, and if the model assumptions are correct, the hyperparameter estimates are accurate, and the model accuracy improves as the training set increases.

From the results of the numerical analysis, it can be seen that in some arithmetic cases, due to the large number of hyperparameters that need to be optimized by MBSVR, the discrepancy between the shared information obtained from modeling and the actual accurate information leads to limited room for improvement in model accuracy. The final modeling accuracy is comparable to that of MGP. This discrepancy also leads to the fact that in some outputs, the accuracy of modeling methods that consider correlation may be lower than that of independent modeling methods. Additionally, MBSVR requires more optimized hyperparameters, resulting in lower time efficiency.

5. Conclusions

This paper investigates the multi-output modeling problem, aiming to improve the model accuracy by quantitatively describing the correlation between the outputs and using the information between the outputs to construct the model for multiple outputs simultaneously. To inherit the advantages of a single-output Bayesian support vector machine, based on it, the SLFM model is introduced, combined with Bayesian derivation, and the hyperparameters are optimized comprehensively to get the multi-output model that can predict multiple output means and probability distributions at the same time. Model validation is carried out on simple function arithmetic cases and real datasets, and overall, the MBSVR accuracy is higher due to the single-output modeling and the multi-output Gaussian process model.

Due to the large number of hyperparameters that need to be optimized in MBSVR, the efficiency of the algorithm is low. In addition, inaccurate hyperparameters make a difference between the shared information and the actual accurate information, resulting in limited room for improvement in model accuracy. Therefore, achieving efficient and accurate parameter optimization is the main problem that needs to be solved in the future. In addition, how to simplify the correlation description structure and further improve the applicability and optimization efficiency of the model is also the direction of further research.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; validation, Y.W. and Z.W.; writing—original draft preparation, Y.W.; writing—review and editing, Y.W., Z.C. and Z.W.; visualization, Y.W. and Z.W.; supervision, Z.C. and Z.W.; funding acquisition, Z.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 72171231).

Data Availability Statement

All data generated or analyzed during this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

The posterior distribution in (48) considering multiple outputs can be described as:

\begin{array}{l} p (G| Y, γ) & = \frac{p (Y | G, γ) p (G | γ)}{p (Y | γ)} \\ = \prod_{i = 1}^{i = N_{f}} \prod_{j = 1}^{j = N} p (δ_{i j}) \times \frac{1}{{(2 π)}^{N_{f} \times N / 2} {(| K_{M} |)}^{1 / 2}} \exp [- \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b)] \\ = \prod_{i = 1}^{N_{f}} \frac{1}{Z_{δ_{i}}} \exp (- C_{i} l (δ_{i})) \times \frac{1}{{(2 π)}^{N_{f} \times N / 2} {(| K_{M} |)}^{1 / 2}} \exp [- \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b)] \\ = \frac{\exp (\sum_{i = 1}^{N_{f}} - C_{i} l (δ_{i}) - \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))}{\prod_{i = 1}^{N_{f}} Z_{δ_{i}} \times {(2 π)}^{N_{f} \times N / 2} {(| K_{M} |)}^{1 / 2}} \\ = \frac{\exp (\sum_{i = 1}^{N_{f}} - C_{i} l (y_{i} - g_{i}) - \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))}{\prod_{i = 1}^{N_{f}} \int \exp (- C_{i} l (y_{i} - g_{i})) d G \times \int \exp (- \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))) d G} \\ = \frac{\exp (\sum_{i = 1}^{N_{f}} - C_{i} l (y_{i} - g_{i}) - \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))}{\int \exp (\sum_{i = 1}^{N_{f}} (- C_{i} l (y_{i} - g_{i}) - \frac{1}{2} {(G - b)}^{T} {K_{M}}^{- 1} (G - b))) d G} \end{array}

Appendix B

The Lagrangian function of the primal optimization problem in (53) reads:

L (G, α, e, b) = \frac{1}{2} (G - b) K_{M}^{- 1} (G - b) + \frac{1}{2} C_{M} e^{2} + \sum_{r = 1}^{N_{f} N} β_{r} (y_{r} - g_{r} - e_{r})

where

e = {[e_{1}, e_{2}, \dots, e_{N_{f} N}]}^{T}

,

β = {[β_{1}, β_{2}, \dots, β_{N_{f} N}]}^{T}

is Lagrange multiplier. According to the Karush–Kuhn–Tucker conditions, we can obtain:

\frac{\partial L (G, β, e, b)}{\partial G} = 0 \to G = K_{M} β + b

\frac{\partial L (G, β, e, b)}{\partial e_{r}} = 0 \to \sum_{r = 1}^{N_{f} N} β_{r} = C_{M} e

\frac{\partial L (G, β, e, b)}{\partial β_{r}} = 0 \to y_{r} = g_{r} + e_{r}

\frac{\partial L (G, β, e, b)}{\partial b} = 0 \to \sum_{r = 1}^{N_{f} N} β_{r} = 0

Solving the above equation, we obtain the optimal values of G as

\hat{G} = K_{M} β + b

Appendix C

The expression of the Forrester function is:

f_{1} (x) = 1.5 (x + 2.5) \sqrt{{(6 x - 2)}^{2} \sin (12 x - 4) + 10}, x \in [0, 1]

f_{2} (x) = {(6 x - 2)}^{2} \sin (12 x - 4) + 10, x \in [0, 1]

The expression of the Branin function is:

f_{1} (x) = {(x_{2} - \frac{3}{4 π} x_{1}^{2} + \frac{4}{π} x_{1} - 6)}^{2} + 10 (1 - \frac{1}{8 π}) \cos (x_{1}) + 2 x_{1} - 9 x_{2} + 32

f_{2} (x) = {(x_{2} - \frac{5.1}{4 π} x_{1}^{2} + \frac{5}{π} x_{1} - 6)}^{2} + 10 (1 - \frac{1}{8 π}) \cos (x_{1}) + 10

x_{1} \in [- 5, 10], x_{2} \in [0, 15]

References

Ghattas, B.; Manzon, D. Machine Learning Alternatives to Response Surface Models. Mathematics 2023, 11, 3406. [Google Scholar] [CrossRef]
Yu, H.; Yang, K.; Zhang, L.; Wang, W.; Ouyang, M.; Ma, B.; Yang, S.; Li, J.; Liu, X. Multi-Output Ensemble Deep Learning: A Framework for Simultaneous Prediction of Multiple Electrode Material Properties. Chem. Eng. J. 2023, 475, 146280. [Google Scholar] [CrossRef]
Zhou, Y.; Chang, F.-J.; Chang, L.-C.; Kao, I.-F.; Wang, Y.-S.; Kang, C.-C. Multi-Output Support Vector Machine for Regional Multi-Step-Ahead PM2.5 Forecasting. Sci. Total Environ. 2019, 651, 230–240. [Google Scholar] [CrossRef] [PubMed]
Nguyen, N.-H.; Abellán-García, J.; Lee, S.; Nguyen, T.-K.; Vo, T.P. Simultaneous Prediction the Strain and Energy Absorption Capacity of Ultra-High Performance Fiber Reinforced Concretes by Using Multi-Output Regression Model. Constr. Build. Mater. 2023, 384, 131418. [Google Scholar] [CrossRef]
Vapnik, V.N. The Nature of Statistical Learning Theory; Springer: New York, NY, USA, 2000; ISBN 978-1-4419-3160-3. [Google Scholar]
Roy, A.; Chakraborty, S. Support Vector Machine in Structural Reliability Analysis: A Review. Reliab. Eng. Syst. Saf. 2023, 233, 109126. [Google Scholar] [CrossRef]
Chen, L.; Pan, Y.; Zhang, D. Prediction of Carbon Emissions Level in China’s Logistics Industry Based on the PSO-SVR Model. Mathematics 2024, 12, 1980. [Google Scholar] [CrossRef]
Chu, W.; Keerthi, S.S.; Ong, C.J. Bayesian Support Vector Regression Using a Unified Loss Function. IEEE Trans. Neural Netw. 2004, 15, 29–44. [Google Scholar] [CrossRef] [PubMed]
Wang, J.; Li, C.; Xu, G.; Li, Y.; Kareem, A. Efficient Structural Reliability Analysis Based on Adaptive Bayesian Support Vector Regression. Comput. Methods Appl. Mech. Eng. 2021, 387, 114172. [Google Scholar] [CrossRef]
Cheng, K.; Lu, Z. Adaptive Bayesian Support Vector Regression Model for Structural Reliability Analysis. Reliab. Eng. Syst. Saf. 2021, 206, 107286. [Google Scholar] [CrossRef]
Cheng, K.; Lu, Z. Active Learning Bayesian Support Vector Regression Model for Global Approximation. Inf. Sci. 2021, 544, 549–563. [Google Scholar] [CrossRef]
Gao, H.; Ma, Z. Geometric Metric Learning for Multi-Output Learning. Mathematics 2022, 10, 1632. [Google Scholar] [CrossRef]
Liu, H.; Cai, J.; Ong, Y.-S. Remarks on Multi-Output Gaussian Process Regression. Knowl.-Based Syst. 2018, 144, 102–121. [Google Scholar] [CrossRef]
Perez-Cruz, F.; Camps-Valls, G.; Soria-Olivas, E.; Perez-Ruixo, J.J.; Figueiras-Vidal, A.R.; Artes-Rodrıguez, A. Multi-Dimensional Function Approximation and Regression Estimation; Springer: Berlin/Heidelberg, Germany, 2002. [Google Scholar]
Zhang, W.; Liu, X.; Ding, Y.; Shi, D. Multi-Output LS-SVR Machine in Extended Feature Space. In Proceedings of the 2012 IEEE International Conference on Computational Intelligence for Measurement Systems and Applications (CIMSA) Proceedings, Tianjin, China, 2–4 July 2012; pp. 130–134. [Google Scholar]
Xu, S.; An, X.; Qiao, X.; Zhu, L.; Li, L. Multi-Output Least-Squares Support Vector Regression Machines. Pattern Recognit. Lett. 2013, 34, 1078–1084. [Google Scholar] [CrossRef]
Tran, N.K.; Kühle, L.C.; Klau, G.W. A Critical Review of Multi-Output Support Vector Regression. Pattern Recognit. Lett. 2024, 178, 69–75. [Google Scholar] [CrossRef]
Bayes, T. An Essay towards Solving a Problem in the Doctrine of Chances. Reson 2003, 8, 80–88. [Google Scholar] [CrossRef]
Fricker, T.E.; Oakley, J.E.; Urban, N.M. Multivariate Gaussian Process Emulators with Nonseparable Covariance Structures. Technometrics 2013, 55, 47–56. [Google Scholar] [CrossRef]
Karal, O. Maximum Likelihood Optimal and Robust Support Vector Regression with Lncosh Loss Function. Neural Netw. 2017, 94, 1–12. [Google Scholar] [CrossRef] [PubMed]
Li, M.; Hu, C. Multivariate System Reliability Analysis Considering Highly Nonlinear and Dependent Safety Events. Reliab. Eng. Syst. Saf. 2018, 180, 189–200. [Google Scholar] [CrossRef]
Zhu, X.; Gao, Z. An Efficient Gradient-Based Model Selection Algorithm for Multi-Output Least-Squares Support Vector Regression Machines. Pattern Recognit. Lett. 2018, 111, 16–22. [Google Scholar] [CrossRef]
Liu, H.; Ding, J.; Xie, X.; Jiang, X.; Zhao, Y.; Wang, X. Scalable Multi-Task Gaussian Processes with Neural Embedding of Coregionalization. Knowl.-Based Syst. 2022, 247, 108775. [Google Scholar] [CrossRef]

Figure 1. Model structure of MBSVR.

Figure 2. Two outputs for Forrester.

Figure 3. Modeling results for Forrester: (a) Results for Output 1; (b) results for Output 2.

Figure 4. Two outputs for Branin.

Figure 5. Modeling results for Branin: (a) Results for Output 1; (b) results for Output 2.

Figure 6. Pearson correlation coefficient for polymer and Sarcos: (a) Pearson correlation coefficient of polymer. (b) Pearson correlation coefficient of Sarcos.

Figure 7. Modeling results for Energy_efficiency: (a) Results for Output 1; (b) results for Output 2.

Figure 8. Modeling results for polymer: (a) results for Output 1; (b) results for Output 2; (c) results for Output 3; (d) results for Output 4.

Figure 9. Modeling results for Broomcorn: (a) results for Output 1; (b) results for Output 2. (c) results for Output 3.

Figure 10. Modeling results for Sarcos (f1–f6): (a) results for Output 1; (b) results for Output 2. (c) Results for Output 3; (d) results for Output 4; (e) results for Output 5. (f) results for Output 6.

Table 1. Information about the test dataset.

Type	Name	D	$N_{f}$	Dataset Size	Train Size	Test Size
Numerical function	Forrester	1	2		4:1:8	100
Numerical function	Branin	2	2		5:3:20	5000
Real-world datasets	Energy_efficiency	8	2	768	10:5:40	500
	Polymer	10	4	61	10:5:35	20
	Broomcorn	19	3	128	30:10:90	30
	Sarcos	21	7	44,484	10:5:50	4449

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Wang, Y.; Cheng, Z.; Wang, Z. Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs. Mathematics 2024, 12, 2923. https://doi.org/10.3390/math12182923

AMA Style

Wang Y, Cheng Z, Wang Z. Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs. Mathematics. 2024; 12(18):2923. https://doi.org/10.3390/math12182923

Chicago/Turabian Style

Wang, Yanlin, Zhijun Cheng, and Zichen Wang. 2024. "Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs" Mathematics 12, no. 18: 2923. https://doi.org/10.3390/math12182923

APA Style

Wang, Y., Cheng, Z., & Wang, Z. (2024). Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs. Mathematics, 12(18), 2923. https://doi.org/10.3390/math12182923

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Output Bayesian Support Vector Regression Considering Dependent Outputs

Abstract

1. Introduction

2. Related Description and Basic Theories

2.1. Single-Output Bayesian Support Vector Machine

2.2. Semiparametric Latent Factor Model (SLFM)

3. Multi-Output Bayesian Support Vector Regression Model

3.1. Bayesian Assumptions for MBSVR

3.2. Model Construction for MBSVR

3.3. Optimized Solution of Parameters

4. Numerical Experiments

4.1. Performance Metrics and Experiment Settings

4.2. Datasets Description

4.3. Results and Discussions

4.3.1. Numerical Functions

4.3.2. Real-World Dataset

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

Appendix B

Appendix C

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI