Abstract
One of the main tasks in kernel methods is the selection of an adequate mapping into a higher-dimensional space in order to improve class separation. However, this selection tends to be time consuming, and it may not yield the best separation between classes. Therefore, there is a need for better methods that are able to extract distance and class-separation information directly from the data. This work presents a novel approach for learning such mappings by using locally stationary kernels, spectral representations and Gaussian mixtures.
1. Introduction
During the 1990s, the use of kernels [1,2,3,4,5] in Machine Learning received considerable attention for their ability to improve the performance of linear classifiers. By using kernels, Support Vector Machines and other kernel methods [6,7] can classify complex data sets by implicitly mapping them into high-dimensional spaces. However, an underlying issue exists, summarized by a simple question: which kernel should be used? [8].
Kernel selection is not a trivial task and depends strongly on the problem to be solved. A first idea is to evaluate a small set of candidate kernels using leave-one-out cross-validation and to select the kernel with the best classification performance. Nevertheless, this becomes time consuming when the number of samples ranges in the thousands. A better idea is to combine kernels in order to build new kernels with better classification properties. Methods using this type of technique are called Multiple Kernel Learning (MKL) [9]. For example, Lanckriet et al. [10] use Semi-Definite Programming (SDP) to find the best conic combination of multiple kernels. However, these methods still require some pre-selected set of kernels as input. A further improvement is to exploit the distance information contained in the class data themselves. For example, Hoi et al. [11] find a kernel Gram matrix by building the Laplacian graph [12] of the data; then, an SDP is applied to find the best combination of kernels.
However, none of these methods is scalable, given that their Gram matrix needs to be built explicitly: the computational complexity of building a Gram matrix is $O(N^2)$, where N is the number of samples. As a possible solution, it has been proposed to use Sequential Minimal Optimization (SMO) [13] to reduce complexity. This allows the quadratic programming problem to be decomposed into smaller quadratic programming sub-problems. For example, Bach et al. [14] use an SDP setup and solve the problem with an SMO algorithm. Other techniques [15] use a random sample of the training set and an approximation to the Gram matrix to reduce complexity (with m the sub-problem size). Expanding on this idea, Rahimi and Recht [16] approximate kernel functions using samples of the spectral distribution, but only for stationary kernels. On the other hand, Ghiasi-Shirazi et al. [17] propose a method for learning m stationary kernels in the MKL setting. The method has a main advantage: its ability to learn the m kernels in an unsupervised way while reducing the complexity of the output function. Furthermore, it reduces the complexity of evaluating the classifier output by using the m learned kernels. Finally, Oliva et al. [18] make use of Bayesian methods to learn a stationary kernel in a non-parametric way.
In this work, we propose to learn locally stationary kernels from data, given that stationary kernels are a subset of the locally stationary kernels, by using a spectral representation and Gaussian mixtures [19]. This improves classification and regression by viewing the kernel as the result of a sampling process over a spectral representation. This paper is structured in the following way: in Section 2, we present the basic theory needed to understand stationary and locally stationary kernels. In Section 3, the proposed algorithm is developed by using a Fourier basis and sampling; additionally, a theorem about the performance of the spectral representation is given. In Section 4, we review the experiments on classification and regression tasks that test the robustness of the proposed algorithm. Finally, we present an analysis of the advantages of the proposed algorithm and possible avenues of research in Section 5.
2. The Concept of Kernels
The main idea of kernel methods is to obtain the similarity between samples in a higher-dimensional space while avoiding the explicit mapping of the samples into that space and the computation of the inner product there. In other words, let $\mathcal{X}$ be the input set, where $\mathcal{X} \subseteq \mathbb{R}^d$, let $\mathcal{H}$ be a feature space, and suppose the feature mapping function is defined as $\phi : \mathcal{X} \to \mathcal{H}$. Hence, the kernel function, $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, has the following property:

$$k(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}},$$
where $x, y \in \mathcal{X}$. Thus, the feature map $\phi$ and the feature space can be defined implicitly. Now, let $X = \{x_1, \dots, x_N\}$ be the set of samples, $k$ be a valid kernel, and $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ be a well-defined inner product. Then, the elements of the Gram matrix, $K$, are computed using the mapping $K_{ij} = k(x_i, x_j)$. Given this definition, Genton [20] makes an in-depth study of classes of kernels from a statistics perspective, i.e., viewing the kernel function as a covariance function. He points out that kernels have a spectral representation which can be used to represent their Gram matrix. Based on this representation, the proposed algorithm learns the Gram matrix by using a Gibbs sampler to obtain the structure of such a matrix.
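As a concrete illustration of the kernel trick and the Gram matrix, the following minimal sketch (in Python, using a Gaussian/RBF kernel as an assumed example) builds the Gram matrix of a small sample set and checks numerically that it is positive semi-definite, which is what guarantees that an implicit feature map exists.

```python
import numpy as np

# Minimal sketch: Gram matrix of an RBF kernel and a numerical PSD check.
def rbf_kernel(x, y, gamma=1.0):
    return np.exp(-gamma * np.sum((x - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # 50 samples in R^3

K = np.array([[rbf_kernel(xi, xj) for xj in X] for xi in X])
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # ~0 or positive
```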
2.1. Stationary Kernels
Stationary kernels [20] are defined as $k(x, y) = k_S(x - y)$. An important factor in this definition is its dependency only on the lag vector $\tau = x - y$; such kernels can be interpreted as generalizations of the Gaussian probability density functions used to represent distributions [15]. Additionally, Bochner [21] proved that a symmetric function $k_S(\tau)$ is positive definite on $\mathbb{R}^d$ if and only if it has the form:

$$k_S(\tau) = \int_{\mathbb{R}^d} e^{\,i \omega^{T} \tau}\, F(d\omega), \qquad (1)$$

where $F$ is a positive finite measure. Equation (1) is called the spectral representation of $k_S$. Now, suppose $F$ has a density $f$, i.e., $F(d\omega) = f(\omega)\, d\omega$. Thus, it is possible to obtain:

$$k_S(\tau) = \int_{\mathbb{R}^d} e^{\,i \omega^{T} \tau} f(\omega)\, d\omega, \qquad f(\omega) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i \omega^{T} \tau}\, k_S(\tau)\, d\tau.$$
In other words, the kernel function $k_S$ and its spectral density $f$ are Fourier duals of each other. Furthermore, if $f$ is non-negative and $F$ is a probability measure, then $k_S$ defines a valid covariance for a Gaussian process; this condition ensures that the kernel and the density $f$ are correctly related.
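To make the duality tangible, the following short sketch assumes, for illustration, the Gaussian kernel $k_S(\tau) = \exp(-\|\tau\|^2/2)$, whose spectral density is a standard normal, and checks Equation (1) by Monte Carlo, comparing $\mathbb{E}_{\omega \sim f}[\cos(\omega^{T} \tau)]$ with the closed form.

```python
import numpy as np

# Monte Carlo check of the spectral representation (1) for the Gaussian kernel:
# k_S(tau) = exp(-||tau||^2 / 2) has spectral density N(0, I), so
# k_S(tau) = E_{w ~ N(0, I)}[cos(w^T tau)].
rng = np.random.default_rng(1)
d, M = 3, 200_000
tau = rng.normal(size=d)

w = rng.normal(size=(M, d))            # samples from the spectral density
mc_estimate = np.cos(w @ tau).mean()
exact = np.exp(-0.5 * np.sum(tau ** 2))
print(mc_estimate, exact)              # the two values should agree closely
```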
2.2. Locally Stationary Kernels
Extending the previous concept, Silverman [22] defines locally stationary kernels as:

$$k(x, y) = k_1\!\left(\frac{x + y}{2}\right) k_2(x - y), \qquad (2)$$

where $k_1$ is a non-negative function and $k_2$ is a stationary kernel. This type of kernel increases the power of the representation by allowing the similarity to vary across the input space through $k_1$. Furthermore, we can see from Equation (2) that locally stationary kernels include all stationary kernels: setting $k_1 \equiv c$, where $c$ is a positive constant, gives $k(x, y) = c\, k_2(x - y)$, a positive multiple of a stationary kernel. Furthermore, the variance of a locally stationary kernel is given by $k(x, x)$; thus, the variance is defined as:

$$k(x, x) = k_1(x)\, k_2(0).$$

This means that the variance of a locally stationary kernel is governed by the non-negative function $k_1$.
The spectral representation of a locally stationary kernel is also given in [22], and it is defined as:

$$k(x, y) = \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} e^{\,i (\omega_1^{T} x - \omega_2^{T} y)}\, f_1\!\left(\frac{\omega_1 + \omega_2}{2}\right) f_2(\omega_1 - \omega_2)\, d\omega_1\, d\omega_2.$$

Furthermore, by setting $k_1$ equal to a positive constant, this representation reduces to the stationary case of Equation (1). Consequently, in order to define a locally stationary kernel, $f_1$ and $f_2$ must be integrable functions. Additionally, an important fact is that the representation has a well-defined inverse, given by:

$$f_1\!\left(\frac{\omega_1 + \omega_2}{2}\right) f_2(\omega_1 - \omega_2) = \frac{1}{(2\pi)^{2d}} \int_{\mathbb{R}^d} \int_{\mathbb{R}^d} e^{-i (\omega_1^{T} x - \omega_2^{T} y)}\, k(x, y)\, dx\, dy.$$

Moreover, $f_1$ is the Fourier transform of $k_2$ and $f_2$ is the Fourier transform of $k_1$; that is, the roles of the two factors are interchanged in the spectral domain. Thus, if we introduce the two dummy variables $\omega = (\omega_1 + \omega_2)/2$ and $\nu = \omega_1 - \omega_2$, it is possible to obtain:

$$f_1(\omega) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i \omega^{T} \tau}\, k_2(\tau)\, d\tau$$

and

$$f_2(\nu) = \frac{1}{(2\pi)^d} \int_{\mathbb{R}^d} e^{-i \nu^{T} u}\, k_1(u)\, du.$$

With this in mind, it is possible to use the ideas in [16] to approximate locally stationary kernels.
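To ground Equation (2), the following sketch assumes, for illustration, Gaussian choices $k_1(u) = \exp(-\|u\|^2)$ and $k_2(\tau) = \exp(-\|\tau\|^2)$ (widths chosen so that the product remains a valid kernel), builds the locally stationary kernel and verifies positive semi-definiteness numerically.

```python
import numpy as np

# Sketch of Equation (2): k(x, y) = k1((x + y)/2) * k2(x - y), with k1 a
# nonnegative "power" factor and k2 a stationary Gaussian kernel.
def k1(u):                      # nonnegative factor, controls the local variance
    return np.exp(-np.sum(u ** 2))

def k2(tau):                    # stationary part, depends only on the lag
    return np.exp(-np.sum(tau ** 2))

def locally_stationary(x, y):
    return k1((x + y) / 2.0) * k2(x - y)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 2))
K = np.array([[locally_stationary(xi, xj) for xj in X] for xi in X])
print("smallest eigenvalue:", np.linalg.eigvalsh(K).min())   # >= 0 up to rounding
```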
3. Approximating Stationary Kernels
Rahimi and Recht [16] make use of (1) to approximate stationary kernels. That is, if we define $\zeta_\omega(x) = e^{\,i \omega^{T} x}$, then Equation (1) becomes:

$$k_S(x - y) = \int_{\mathbb{R}^d} \zeta_\omega(x)\, \zeta_\omega(y)^{*} f(\omega)\, d\omega = \mathbb{E}_\omega\!\left[\zeta_\omega(x)\, \zeta_\omega(y)^{*}\right],$$

where $(\cdot)^{*}$ denotes the complex conjugate and $\omega \sim f$. Now, using Monte Carlo integration and taking $\omega_1, \dots, \omega_M \sim f$, the kernel can be approximated as

$$k_S(x - y) \approx \frac{1}{M} \sum_{m=1}^{M} \zeta_{\omega_m}(x)\, \zeta_{\omega_m}(y)^{*}. \qquad (3)$$

In particular, if the kernel is real-valued, then Equation (3) becomes

$$k_S(x - y) \approx \frac{1}{M} \sum_{m=1}^{M} \cos\!\left(\omega_m^{T} (x - y)\right), \qquad (4)$$

where $\omega_m \sim f$. A side effect of (4) is that we can compute $k_S(x - y)$ as an inner product $z(x)^{T} z(y)$ of explicit random feature vectors built from the sampled frequencies. This means that a function of the form $f(x) = \sum_{i=1}^{N} \alpha_i\, k(x_i, x)$ can be approximated as

$$f(x) \approx \left(\sum_{i=1}^{N} \alpha_i\, z(x_i)\right)^{T} z(x) = w^{T} z(x),$$

where $w$ is a constant vector. This constant makes it possible to avoid some of the operations needed to obtain the Gram matrix.
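A compact sketch of this random feature map, again assuming the Gaussian kernel (so the spectral density is a standard normal), is the following; the cosine/sine pairing makes $z(x)^{T} z(y)$ equal to the average in (4).

```python
import numpy as np

# Random Fourier features in the spirit of (4): draw M frequencies from the
# spectral density and use z(x) = (1/sqrt(M)) [cos(w_i^T x), sin(w_i^T x)],
# so that z(x) . z(y) = (1/M) sum_i cos(w_i^T (x - y)) ~= k_S(x - y).
rng = np.random.default_rng(3)
d, M = 5, 2000
W = rng.normal(size=(M, d))          # spectral samples for k(tau) = exp(-||tau||^2/2)

def z(x):
    proj = W @ x
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(M)

x, y = rng.normal(size=d), rng.normal(size=d)
approx = z(x) @ z(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
print(approx, exact)                 # close for moderately large M
```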
3.1. Approximating Locally Stationary Kernels
As we know, $k_2$ is a stationary kernel, which allows us to approximate it as presented above in Section 3. Now, to obtain the locally stationary kernel, we also need to approximate $k_1$. For this, we define $\xi_\nu(x) = e^{\,i \nu^{T} x / 2}$:

$$k_1\!\left(\frac{x + y}{2}\right) = \int_{\mathbb{R}^d} \xi_\nu(x)\, \xi_\nu(y)\, f_2(\nu)\, d\nu = \mathbb{E}_\nu\!\left[\xi_\nu(x)\, \xi_\nu(y)\right],$$

where $\nu \sim f_2$. Using Monte Carlo integration and taking $\nu_1, \dots, \nu_M \sim f_2$, it is possible to approximate $k_1$ as:

$$k_1\!\left(\frac{x + y}{2}\right) \approx \frac{1}{M} \sum_{m=1}^{M} \xi_{\nu_m}(x)\, \xi_{\nu_m}(y). \qquad (5)$$

To approximate the output of the locally stationary kernel, we can use Equations (3) and (5) as follows:

$$k(x, y) \approx \left[\frac{1}{M} \sum_{m=1}^{M} \xi_{\nu_m}(x)\, \xi_{\nu_m}(y)\right] \left[\frac{1}{M} \sum_{m=1}^{M} \zeta_{\omega_m}(x)\, \zeta_{\omega_m}(y)^{*}\right],$$

where $\omega_m \sim f_1$ and $\nu_m \sim f_2$. In particular, if our kernel is real-valued, then the previous equation becomes

$$k(x, y) \approx \phi_1(x, y)\, \phi_2(x, y), \qquad (6)$$

where

$$\phi_1(x, y) = \frac{1}{M} \sum_{m=1}^{M} \cos\!\left(\nu_m^{T}\, \frac{x + y}{2}\right) \quad \text{and} \quad \phi_2(x, y) = \frac{1}{M} \sum_{m=1}^{M} \cos\!\left(\omega_m^{T} (x - y)\right),$$

and $\omega_m \sim f_1$, $\nu_m \sim f_2$. Thus, the advantage of representing the locally stationary kernel as in Equation (6) is the possibility of computing the output function as

$$f(x) = \sum_{i=1}^{N} \alpha_i\, k(x_i, x) \approx w^{T} z(x),$$

where $z(x)$ collects the cosine and sine features built from the sampled frequencies $\{\omega_m\}$ and $\{\nu_m\}$, and $w$ is a constant vector. Given this representation, we only need to compute $w$ once, avoiding the use of the full Gram matrix representation.
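A rough numerical sketch of this approximation follows. It assumes Gaussian choices for $k_1$ and $k_2$ so that both spectral densities are standard normals; the exact feature-map form used later in the paper may differ.

```python
import numpy as np

# Monte Carlo approximation of a locally stationary kernel, in the spirit of
# Equations (5)-(6): one set of frequencies for the stationary part k2 and an
# independent set for the power factor k1 (both Gaussian here).
rng = np.random.default_rng(4)
d, M = 4, 5000
W = rng.normal(size=(M, d))          # frequencies for k2(tau) = exp(-||tau||^2/2)
V = rng.normal(size=(M, d))          # frequencies for k1(u)  = exp(-||u||^2/2)

def k_approx(x, y):
    stationary = np.cos(W @ (x - y)).mean()           # ~ k2(x - y)
    local_power = np.cos(V @ ((x + y) / 2.0)).mean()  # ~ k1((x + y)/2)
    return local_power * stationary

x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-0.5 * np.sum(((x + y) / 2) ** 2)) * np.exp(-0.5 * np.sum((x - y) ** 2))
print(k_approx(x, y), exact)         # the two values should be close
```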
Now, it is necessary to remark on an interesting property of this representation. Using Hoeffding's inequality [23], the approximation converges to the true kernel almost everywhere. Given this, it is possible to obtain the following type of inequality: for any $\epsilon > 0$, taking $M$ samples $\{\omega_m\}$ and $\{\nu_m\}$ from $f_1$ and $f_2$, respectively, the probability that the pointwise approximation error exceeds $\epsilon$ decays exponentially in $M$.
Therefore, the proposed representation of the kernel allows us to obtain a good approximation to $k(x, y)$. Furthermore, the following theorem gives a tighter, uniform bound, making it possible to say: the larger $M$ is, the less likely a large approximation error becomes.
Theorem 1.
Approximation of a locally stationary kernel.
Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with finite diameter. Then the uniform approximation error of the kernel over $\mathcal{M}$ is bounded by an expression that decays exponentially in the number of samples $M$, as derived in the proof below.
Proof of Theorem 1.
Define $g(x, y)$ as the difference between the locally stationary kernel and its random-feature approximation; then $\mathbb{E}[g(x, y)] = 0$. Given that $k_2$ is shift invariant, the error can be written in terms of the lag $x - y$ and the midpoint $(x + y)/2$. Let $\mathcal{M}$ be a compact bounded subset of $\mathbb{R}^d$, so that its diameter is finite. With this in mind, it is possible to define an $\epsilon$-net that covers $\mathcal{M}$ with at most $T$ balls of radius $r$. Let $c_1, \dots, c_T$ denote the centers of these balls, and let $L_g$ be the Lipschitz constant of $g$. Therefore, the error at any point is controlled by its value at the nearest center plus $L_g r$, so it suffices to bound $|g|$ at the centers and to bound $L_g$. Thus, it is possible to say:
Now, given that both expectations involved are positive, the expected Lipschitz constant can be bounded in terms of $\sigma^2$, the second moment of the Fourier transform of the kernel. Thus, Markov's inequality bounds the probability that $L_g$ is large.
Finally, using Boole's inequality (the union bound over the $T$ centers), we obtain a bound on the probability that $|g|$ is large at any center.
With this at hand, it is possible to say:
This means that we need to solve the following equation,
where
The solution of (7) is given by the optimal radius $r^{*}$. Then, plugging this result back and working through some algebra, it is possible to obtain:
Using this equality, we get (8) and (9).
Now, under the stated condition on the diameter of $\mathcal{M}$, the bound simplifies. Finally:
□
3.2. Learning Locally Stationary Kernel, GaBaSR
In this section, we explain how to learn the proposed kernel. The learning algorithm is based on the work presented in [18], the Bayesian Nonparametric Kernel (BaNK) algorithm. However, given its greater representational capacity, we propose learning a Gaussian mixture distribution over the frequencies to improve the performance of the algorithm. For this reason, we name this model Gaussian Mixture Bayesian Nonparametric Kernel Learning using Spectral Representation (GaBaSR). Furthermore, to learn the Gaussian mixture, the proposed algorithm uses ideas from [15], together with a different way to learn the kernel in the classification task. Additionally, one of its main advantages is the use of vague/non-informative priors [15,24], as well as having fewer hyperparameters for learning the kernels.
3.2.1. GaBaSR Algorithm
Based on the previous ideas, the algorithm can be summarized by the following high-level description.
- Learn all the parameters of the Gaussian mixture:
- Let $\{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$ be the current parameters of the Gaussian Mixture Model (GMM), where $\pi_k$ is the prior probability of the kth component, $\mu_k$ is its mean and $\Sigma_k$ is its covariance matrix; the GMM density is then $p(\omega) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\omega \mid \mu_k, \Sigma_k)$. The output of this step is a new set of sampled parameters for the GMM.
- Take M samples from the GMM, i.e., frequencies $\omega_1, \dots, \omega_M$, for the spectral representation.
- Here the inputs are the parameters of the GMM and the current frequencies, and the output is the new set of sampled frequencies.
- Approximate the kernel using the sampled frequencies, as in Equation (4) (or Equation (6) for the locally stationary case).
- Predict the new samples:
- (a) If the task is regression, use the regression model of Equation (10).
- (b) If the task is classification, use the classification model of Equation (11).
In this work, we use a Markov chain Monte Carlo (MCMC) algorithm, the Gibbs sampler [25], to learn the model and to predict new inputs. The entire process is described in the following subsections.
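As a point of reference only, the following deliberately simplified, non-Bayesian stand-in for the loop above (a hypothetical sketch: the Gibbs steps are replaced by a GMM point estimate and a ridge fit, and the resampling is not conditioned on the data likelihood) illustrates the flow of one regression run over a few sweeps.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import Ridge

# Simplified stand-in for the GaBaSR loop (regression case): fit a GMM to the
# current frequencies, resample frequencies from it, build the random-feature
# map of Equation (4), and fit the linear weights.
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=300)

M, K = 200, 3
W = rng.normal(size=(M, X.shape[1]))                               # initial frequencies

for sweep in range(5):
    gmm = GaussianMixture(n_components=K, random_state=0).fit(W)   # "learn the GMM"
    W, _ = gmm.sample(M)                                           # resample frequencies
    Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(M) # feature map
    model = Ridge(alpha=1e-3).fit(Z, y)                            # linear weights

print("train MSE:", np.mean((model.predict(Z) - y) ** 2))
```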
3.2.2. Learning the Gaussian Mixture
In order to learn the parameters $\{\pi_k, \mu_k, \Sigma_k\}$ of the Gaussian mixture, we take the following steps (a simplified sketch of one such sweep is given after this list):
- First, sample the indicator $c_m$, which indicates the component of the Gaussian mixture from which the random frequency $\omega_m$ is drawn. For $m = 1, \dots, M$ do:
- (a) The element $\omega_m$ belongs to an existing (represented) component $k$ with probability proportional to the number of frequencies already assigned to that component times the likelihood of $\omega_m$ under it.
- (b) The element $\omega_m$ belongs to an unrepresented component with the remaining probability, in which case the parameters $\mu$ and $\Sigma$ of the new component are sampled from their priors, where the priors are vague/non-informative.
- Second, sample $\mu_k$ and $\Sigma_k$: for $k = 1, \dots, K$, sample $\mu_k$ and $\Sigma_k$ from their conditional posteriors given the frequencies currently assigned to component $k$.
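The following hedged sketch shows one Gibbs sweep for a finite Gaussian mixture over the frequencies, with fixed unit covariances and a standard-normal prior on the means. The infinite-mixture step that opens new components, and the sampling of full covariance matrices, are omitted for brevity; these simplifications are assumptions of the sketch rather than the paper's exact updates.

```python
import numpy as np
from scipy.stats import multivariate_normal

# One simplified Gibbs sweep over a finite GMM for the frequencies W.
rng = np.random.default_rng(6)
M, d, K = 200, 2, 3
W = rng.normal(size=(M, d))                       # current frequencies
pi = np.full(K, 1.0 / K)                          # mixture weights
mu = rng.normal(size=(K, d))                      # component means

# 1) sample the indicators c_m | pi, mu  (unit covariances assumed)
logp = np.stack([np.log(pi[k]) + multivariate_normal.logpdf(W, mu[k]) for k in range(K)], axis=1)
probs = np.exp(logp - logp.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
c = np.array([rng.choice(K, p=p) for p in probs])

# 2) sample mu_k | W, c  (N(0, I) prior on the mean, unit likelihood covariance)
for k in range(K):
    Wk = W[c == k]
    post_var = 1.0 / (1.0 + len(Wk))
    post_mean = post_var * Wk.sum(axis=0)
    mu[k] = rng.normal(post_mean, np.sqrt(post_var))

# 3) sample pi | c  from its Dirichlet posterior (symmetric prior alpha = 1)
counts = np.bincount(c, minlength=K)
pi = rng.dirichlet(1.0 + counts)
```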
3.2.3. Sampling to Approximate the Kernel
As we established earlier, the kernel can be represented as in Equation (4) (or Equation (6)),
where each frequency $\omega_m$ is sampled from the learned Gaussian mixture. In order to approximate the kernel, for each random frequency we propose a candidate frequency $\omega^{*}$, drawn from the mixture, and accept it with probability $r$.
If the task is regression, the likelihood of Equation (10) is used; for classification, that of Equation (11). Then, we take a random number $u \sim \mathrm{Uniform}(0, 1)$ and accept $\omega^{*}$ if $u < r$; otherwise we reject $\omega^{*}$. For this, it is clear that we need to evaluate the likelihood of the data given the current set of frequencies.
In order to compute this likelihood, it is necessary to identify which type of task is being solved: regression or classification.
- In the case of regression, a Gaussian likelihood of the targets given the random-feature design matrix and the linear weights is used.
- In the classification task, an approximation of the logistic regression likelihood is used, and the likelihood is approximated accordingly.
- Computing $r$: the candidate $\omega^{*}$ is accepted with probability $r = \min\!\left(1,\; p(y \mid \omega^{*}, \boldsymbol{\omega}_{-m}, X)\, /\, p(y \mid \boldsymbol{\omega}, X)\right)$, i.e., the ratio of the data likelihoods with and without the proposed frequency. A hedged sketch of this accept/reject step is given below.
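The sketch below illustrates the accept/reject step for one frequency in the regression case. The marginal-likelihood form (Bayesian linear regression on the random features with unit prior and unit noise variance) is an illustrative assumption, not necessarily the exact likelihood of Equation (10).

```python
import numpy as np

# Hedged sketch: Metropolis-style update of one frequency w_j using a Gaussian
# marginal likelihood over the random-feature map (unit prior / unit noise).
def log_marginal_likelihood(X, y, W):
    Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(W.shape[0])
    C = Z @ Z.T + np.eye(len(y))                 # marginal covariance of y
    sign, logdet = np.linalg.slogdet(C)
    return -0.5 * (logdet + y @ np.linalg.solve(C, y))

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
W = rng.normal(size=(50, 2))                     # current frequencies

j = 0                                            # index of the frequency to update
w_star = rng.normal(size=2)                      # proposal (drawn from the GMM in the paper)
W_new = W.copy()
W_new[j] = w_star

log_r = log_marginal_likelihood(X, y, W_new) - log_marginal_likelihood(X, y, W)
if np.log(rng.uniform()) < min(0.0, log_r):      # accept w* with probability r
    W = W_new
```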
3.2.4. Learning Locally Stationary Kernels
In order to learn locally stationary kernels, we use a similar process, but we compute the kernel using Equation (6) instead of Equation (4). Equation (6) needs two sets of frequencies: $\{\omega_m\}$, approximating the stationary part $k_2$, and $\{\nu_m\}$, approximating the power factor $k_1$. To learn $\{\omega_m\}$ we use the algorithm shown above; to learn $\{\nu_m\}$, we also approximate the spectral density of $k_1$ as an infinite Gaussian mixture. This means that we need to learn the mixture parameters and the frequencies that approximate $k_1$. Learning these variables is very similar to the stationary case, with a slight modification:
- Sample the indicators: this step is analogous to the stationary case, but using the frequencies $\{\nu_m\}$ instead of $\{\omega_m\}$.
- Sample the component parameters: this step is analogous to the previous section, again with $\{\nu_m\}$ instead of $\{\omega_m\}$.
- Sample the frequencies $\nu_m$: we use the same accept/reject step as before, but the likelihood is computed with the locally stationary kernel of Equation (6) instead of the stationary kernel. This simple change adds more learning capability to GaBaSR.
3.2.5. Complexity of GaBaSR
- Complexity of sampling all indicators: sampling one indicator requires evaluating K Gaussian densities in dimension d. Thus, the complexity of sampling all M indicators grows with M, K and d, where d is the dimension of the input vectors, M is the number of samples used to approximate the kernel and K is the number of Gaussians found by the algorithm.
- Complexity of sampling all $\mu_k$ and $\Sigma_k$:
  - Complexity of computing $\mu_k$: to take a sample we need to compute the sufficient statistics of the frequencies assigned to component k and then sample from the resulting normal posterior.
  - Complexity of computing $\Sigma_k$: to take a sample we need to compute the scatter matrix of the assigned frequencies and then invert a $d \times d$ matrix, which takes $O(d^3)$, so this step is dominated by the inversion.
  - The complexity of computing both $\mu_k$ and $\Sigma_k$ is bounded by the sum of these two costs. We need to take K samples, so sampling all $\mu_k$ and $\Sigma_k$ is bounded by K times this cost.
- Complexity of sampling the frequencies: the cost of evaluating the likelihood (regardless of whether it is a regression or a classification task) is dominated by building the random-feature matrix and solving for the linear weights. Taking M frequency samples therefore costs M times the cost of one likelihood evaluation. A small numerical illustration of why this random-feature formulation scales better than an explicit Gram matrix is given after this list.
- Complexity of one sweep (loop) of the algorithm: summing the three complexities above gives the per-sweep cost, which is dominated by the frequency-sampling step.
- Complexity of s sweeps (loops) of the algorithm: if we perform s sweeps, then the total complexity of GaBaSR is bounded by s times the per-sweep cost.
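As a rough illustration of the scaling argument (timings are machine-dependent and the sizes below are arbitrary choices), the following snippet compares solving for the linear weights in the random-feature representation, a $2M \times 2M$ system, against a Gram-matrix formulation, an $N \times N$ system:

```python
import numpy as np
import time

# Rough scaling illustration: feature-space solve is O(N M^2 + M^3), while a
# Gram-matrix solve of the same problem is O(N^3).
rng = np.random.default_rng(8)
N, d, M = 3000, 10, 200
X = rng.normal(size=(N, d))
y = rng.normal(size=N)
W = rng.normal(size=(M, d))
Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(M)

t0 = time.perf_counter()
w_feat = np.linalg.solve(Z.T @ Z + np.eye(2 * M), Z.T @ y)   # (2M x 2M) system
t1 = time.perf_counter()
K = Z @ Z.T                                                   # stand-in Gram matrix
alpha = np.linalg.solve(K + np.eye(N), y)                     # (N x N) system
t2 = time.perf_counter()
print(f"feature-space solve: {t1 - t0:.3f}s, Gram-matrix solve: {t2 - t1:.3f}s")
```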
4. Experiments
In this work, the experiments are performed without data cleaning, i.e., no normalization or removal of outliers is applied. Additionally, we use vague/non-informative priors to test the robustness of GaBaSR; the prior covariance hyperparameters are set to identity matrices of the appropriate dimension.
Using non-informative priors, together with the fact that there is no need to preprocess the data, can be seen as one of the advantages of GaBaSR. Finally, the main idea of kernel methods is to give more power to linear machines via the kernel trick. For this reason, we designed the experiments to compare GaBaSR with purely linear machines. Unfortunately, when trying to collect the original datasets used by Oliva et al. [18], we found that they are no longer available online. Thus, the comparison between GaBaSR and Oliva's algorithm could not be performed.
4.1. Classification
The first dataset is the XOR problem in 2D. After five sweeps with 300 frequencies (M), the proposed method obtained an AUC of 0.98. The result of this experiment is shown in Figure 1.
Figure 1.
The XOR problem with the probability of belonging to class 1 (orange). If the probability is greater than 0.5, the sample is assigned to class 1; otherwise, it is assigned to class 2.
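For orientation, the following sketch generates XOR-style data and fits a linear classifier on fixed Gaussian random features. It is only a baseline stand-in (the frequency scale and the use of plain logistic regression are assumptions), since GaBaSR additionally learns the frequency distribution with the Gibbs sampler described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# XOR data, fixed random Fourier features, and a linear classifier on top.
rng = np.random.default_rng(9)
N, M = 1000, 300
X = rng.uniform(-1, 1, size=(N, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)          # XOR labels

W = rng.normal(scale=3.0, size=(M, 2))                   # fixed frequencies
Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(M)

clf = LogisticRegression(max_iter=1000).fit(Z, y)
print("train AUC:", roc_auc_score(y, clf.predict_proba(Z)[:, 1]))
```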
All of the GaBaSR results reported here use 500 samples (M) and 5 sweeps each. For the classification problems we use several small datasets: Breast Cancer, Credit-g, Blood Transfusion [27], Electricity [28], EEG-eye-state and Kr vs. Kp. The Breast Cancer dataset is the Breast Cancer Wisconsin dataset from the UCI repository [26]. The Credit-g dataset also comes from the UCI repository [26] and classifies people as good or bad credit risks from a set of attributes. The Electricity dataset was downloaded from openml.org and contains data from the Australian New South Wales electricity market. The EEG-eye-state dataset was downloaded from UCI and indicates whether the eye is closed (1) or open (0). The Kr vs. Kp dataset, also from UCI, is the King+Rook versus King+Pawn chess endgame dataset, where the task is to predict whether the side to move (King+Rook) wins or does not win.
Table 1, Table 2 and Table 3 show the results obtained with the perceptron, the SVM and GaBaSR, respectively. From those tables, we can see that on Kr vs. Kp and Electricity GaBaSR has an AUC similar to that of the SVM, and it is important to notice that on Blood Transfusion GaBaSR performs better than both the perceptron and the SVM.
Table 1.
Results of the Perceptron.
Table 2.
Results SVM Classification.
Table 3.
Results of GaBaSR Classification.
In other words, comparing Table 3 with Table 2, it is possible to observe that the results for SVM classification are, in general, better than for GaBaSR classification. An AUC equal to 0.5 indicates that the classifier behaves randomly and therefore does not fulfil its function. From this comparison, we can conclude that SVM classification works properly for all tested datasets except Blood Transfusion, whereas GaBaSR classification works properly only for Kr vs. Kp and Electricity. The results for Blood Transfusion obtained by GaBaSR are better than those of the SVM but still not satisfactory. A negative result is still an informative result.
In general, the accuracy was good: in most cases it was above 0.8. For example, on the Breast Cancer dataset we obtained an accuracy of 0.89, and on Credit-g we obtained an accuracy of 0.91 using only 500 frequencies and 5 sweeps.
4.2. Regression
For the regression experiments, we first use a synthetic dataset. Samples are taken from the Gaussian mixture distribution shown in Equation (12). The number 501 arises because an extended feature vector is used: the random features drawn from the spectral representation are augmented with a constant component. An instance of this problem is shown in Figure 2. We use 250 samples and vague/non-informative priors to learn the function, and the result is shown in Figure 3.
Figure 2.
An instance of the samples taken from Equation (12).
Figure 3.
The real vs. predicted using GaBaSR.
All of the GaBaSR results reported here use 500 samples (M) and 5 sweeps each. For the regression problems we use several small datasets: Mauna Loa CO2, California Houses, Boston house-price and Diabetes. The Mauna Loa CO2 dataset [29], from the global monitoring laboratory, collects the monthly mean CO2 concentration; as can be seen in Figure 4, this series is quasi-periodic, with seasonal repetitions and a steady increment. The California Houses dataset consists of 20,640 rows with 8 columns. The Boston house-price dataset was collected in 1978 from various suburbs of Boston. The Diabetes dataset has ten variables and, as a target, the progression of the disease one year later.
Figure 4.
Mauna Loa CO2 concentration from 1958 to 2001 and the corresponding prediction.
In this section we show three tables of results: Table 4 shows the results of our algorithm, while Table 5 and Table 6 present the results of linear regression and of the Support Vector Machine with a linear kernel, respectively.
Table 4.
Results GaBaSR Regression.
Table 5.
Results of Linear Regression.
Table 6.
Results of SVM Regression with a Linear Kernel.
The experiments in this subsection are performed on these datasets without preprocessing; the results of GaBaSR and of a simple linear regression are shown in Table 4 and Table 5.
As can be seen, an important result is the one given by the Mauna Loa CO2 dataset, which contains data from 1958 to 2001. For this experiment the algorithm is trained to learn a stationary kernel, performing five sweeps. After the model has been trained, it is possible to assess its performance: the achieved MSE is 0.6052, which allows a useful estimation of the CO2 outputs, with individual predictions close to the real measurements. The complete results for Mauna Loa CO2 are shown in Figure 4 and Table 4.
We use the following datasets: (1) Mauna Loa CO2 from [29], (2) California Houses from [30], (3) Boston house-price from [30] and (4) Diabetes from [30].
5. Conclusions
Although GaBaSR's results are promising, there is still quite a lot of work to do. For example, sampling the frequencies is quite slow, since the feature matrix must be updated for each single sample. Oliva et al. [18] state that this can be done using low-rank updates; however, they do not present any procedure to perform such a task. Low-rank updates are being considered for the next phase of GaBaSR, and it is also necessary to research how many samples M are required in order to obtain a good low-rank approximation.
In the experiments, it is possible to observe that GaBaSR is more accurate on classification tasks than on regression tasks. This is an opportunity for improvement: the regression model needs more research to improve its performance, or perhaps a different model for the regression task is needed.
Author Contributions
Conceptualization, L.R.P.-L. and A.M.-V.; methodology, L.R.P.-L. and R.O.G.-M.; software, L.R.P.-L. and A.G.; validation, R.O.G.-M.; formal analysis, L.R.P.-L. and A.M.-V.; resources, A.M.-V.; data curation, A.G.; writing—original draft preparation, L.R.P.-L. and R.O.G.-M.; writing—review and editing, A.M.-V. and A.G.; supervision, A.M.-V. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
All of the data used are cited and can be downloaded from the cited sources.
Acknowledgments
The authors wish to thank the National Council for Science and Technology (CONACyT) in Mexico and the Escuela Militar de Mantenimiento y Abastecimiento, Fuerza Aérea Mexicana, Zapopan.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AUC | Area Under the Curve |
| MSE | Mean Squared Error |
| SVM | Support Vector Machine |
| R² | Coefficient of determination |
| MKL | Multiple Kernel Learning |
| SDP | Semi-Definite Programming |
| SMO | Sequential Minimal Optimization |
| BaNK | Bayesian Nonparametric Kernel |
| GaBaSR | Gaussian Mixture Bayesian Nonparametric Kernel Learning using Spectral Representation |
| GMM | Gaussian Mixture Model |
| MCMC | Markov Chain Monte Carlo |
| UCI | University of California, Irvine |
References
- Smola, A.J.; Schölkopf, B. Learning with Kernels; Citeseer: Princeton, NJ, USA, 1998; Volume 4. [Google Scholar]
- Soentpiet, R. Advances in Kernel Methods: Support Vector Learning; MIT Press: Cambridge, MA, USA, 1999. [Google Scholar]
- Anand, S.S.; Scotney, B.W.; Tan, M.G.; McClean, S.I.; Bell, D.A.; Hughes, J.G.; Magill, I.C. Designing a kernel for data mining. IEEE Expert 1997, 12, 65–74. [Google Scholar] [CrossRef]
- Schölkopf, B.; Smola, A.; Müller, K.R. Kernel principal component analysis. In Proceedings of the International Conference on Artificial Neural Networks, Lausanne, Switzerland, 8–10 October 1997; pp. 583–588. [Google Scholar]
- Zien, A.; Rätsch, G.; Mika, S.; Schölkopf, B.; Lengauer, T.; Müller, K.R. Engineering support vector machine kernels that recognize translation initiation sites. Bioinformatics 2000, 16, 799–807. [Google Scholar] [CrossRef] [PubMed]
- Tipping, M.E. The relevance vector machine. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 652–658. [Google Scholar]
- Junli, C.; Licheng, J. Classification mechanism of support vector machines. In Proceedings of the WCC 2000-ICSP 5th International Conference on Signal Processing Proceedings 16th World Computer Congress, Beijing, China, 21–25 August 2000; Volume 3, pp. 1556–1559. [Google Scholar]
- Bennett, K.P.; Campbell, C. Support vector machines: Hype or hallelujah? Acm Sigkdd Explor. Newsl. 2000, 2, 1–13. [Google Scholar] [CrossRef]
- Gönen, M.; Alpaydın, E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011, 12, 2211–2268. [Google Scholar]
- Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the Kernel Matrix with Semidefinite Programming. J. Mach. Learn. Res. 2004, 5, 27–72. [Google Scholar]
- Hoi, S.C.; Jin, R.; Lyu, M.R. Learning Nonparametric Kernel Matrices from Pairwise Constraints. In Proceedings of the 24th International Conference on Machine Learning ACM, Corvallis, OR, USA, 20–24 June 2007; pp. 361–368. [Google Scholar]
- Cvetkovic, D.M.; Doob, M.; Sachs, H. Spectra of Graphs; Academic Press: New York, NY, USA, 1980; Volume 10. [Google Scholar]
- Platt, J. Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines; Microsoft: Redmond, WA, USA, 1998. [Google Scholar]
- Bach, F.R.; Lanckriet, G.R.; Jordan, M.I. Multiple Kernel Learning, Conic Duality, and the SMO Algorithm. In Proceedings of the Twenty-First international Conference on Machine Learning ACM, Banff, AB, Canada, 4–8 July 2004; p. 6. [Google Scholar]
- Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
- Rahimi, A.; Recht, B. Random features for large-scale kernel machines. In Proceedings of the Advances in Neural Information Processing Systems, Vancouver, BC, Canada, 3–5 December 2007; pp. 1177–1184. [Google Scholar]
- Ghiasi-Shirazi, K.; Safabakhsh, R.; Shamsi, M. Learning translation invariant kernels for classification. J. Mach. Learn. Res. 2010, 11, 1353–1390. [Google Scholar]
- Oliva, J.B.; Dubey, A.; Wilson, A.G.; Póczos, B.; Schneider, J.; Xing, E.P. Bayesian Nonparametric Kernel-Learning. In Proceedings of the Artificial Intelligence and Statistics, Cadiz, Spain, 9–11 May 2016; pp. 1078–1086. [Google Scholar]
- Rasmussen, C. The infinite Gaussian mixture model. In Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000; pp. 554–560. [Google Scholar]
- Genton, M. Classes of kernels for machine learning: A statistics perspective. J. Mach. Learn. Res. 2001, 2, 299–312. [Google Scholar]
- Bochner, S. Harmonic Analysis and the Theory of Probability; California University Press: Berkeley, CA, USA, 1955. [Google Scholar]
- Silverman, R. Locally stationary random processes. IRE Trans. Inf. Theory 1957, 3, 182–187. [Google Scholar] [CrossRef]
- Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding; Springer: Berlin, Germany, 1994; pp. 409–426. [Google Scholar]
- Gelman, A.; Carlin, J.B.; Stern, H.S.; Dunson, D.B.; Vehtari, A.; Rubin, D.B. Bayesian Data Analysis; Chapman and Hall/CRC: Boca Raton, FL, USA, 2013. [Google Scholar]
- Geman, S.; Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. [Google Scholar] [CrossRef] [PubMed]
- Dua, D.; Graff, C. UCI Machine Learning Repository. Open J. Stat. 2017, 10. [Google Scholar]
- Yeh, I.C.; Yang, K.J.; Ting, T.M. Knowledge discovery on RFM model using Bernoulli sequence. Expert Syst. Appl. 2009, 36, 5866–5871. [Google Scholar] [CrossRef]
- Gama, J. Electricity Dataset. 2004. Available online: http://www.inescporto.pt/~{}jgama/ales/ales_5.html (accessed on 6 August 2019).
- Carbon, D. Mauna LOA CO2. 2004. Available online: https://cdiac.ess-dive.lbl.gov/ftp/trends/CO2/sio-keel-flask/maunaloa_c.dat (accessed on 6 August 2019).
- Buitinck, L.; Louppe, G.; Blondel, M.; Pedregosa, F.; Mueller, A.; Grisel, O.; Niculae, V.; Prettenhofer, P.; Gramfort, A.; Grobler, J.; et al. API design for machine learning software: Experiences from the scikit-learn project. In Proceedings of the ECML PKDD Workshop: Languages for Data Mining and Machine Learning, Prague, Czech Republic, 23–27 September 2013; pp. 108–122. [Google Scholar]
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).