1. Introduction
Gaussian Mixture Models (GMMs) play an important role in various artificial intelligence and machine learning tasks, such as computer vision, natural language processing, data clustering, classification, and recognition, owing to their simplicity and their ability to model any probability density function when the number of modes is known. GMMs are successfully used in automatic speech recognition [1], speaker verification and recognition [2], image retrieval [3], pattern recognition [4], genre classification [5], and age and gender recognition [6], as well as in economics [7], mechanics [8], robotics [9], and numerous other research areas. They are also exceptionally popular as input and/or output data representations in deep learning [10]. At the same time, complex machine learning systems often operate on high-dimensional feature representations. Finding efficient and accurate similarity measures between GMMs, together with proper dimensionality reduction techniques that reduce computational complexity and, consequently, execution times, has therefore become imperative.
Numerous GMM similarity measures have been proposed in the literature. Informational distances between probability distributions, such as the Chernoff, Bhattacharyya, or Matusita distances [11], have been thoroughly analyzed and explored. Nevertheless, the Kullback–Leibler (KL) divergence [12] emerged as the most natural and effective informational distance measure between two probability distributions. The KL divergence between two Gaussian components has an analytic, i.e., closed-form, solution. However, a closed-form solution for the KL divergence between arbitrary GMMs cannot be analytically expressed, nor does any computationally efficient exact algorithm exist. Consequently, various more or less computationally expensive approximations have been proposed in recent years. The Monte Carlo sampling method [13] offers an arbitrarily accurate but computationally expensive solution, often inappropriate for real-time classification and/or recognition tasks. Other approximations of the KL divergence between two GMMs have also been proposed [14], such as the variational approximation and the variational upper bound, an approximation based on the unscented transform, and the matched bound approximation, which performs poorly on GMMs comprising a few low-probability components. In [15], the so-called Gaussian Quadratic Form Distance (GQFD), with a closed-form solution for GMMs with diagonal covariance matrices, is presented. A multivariate online Kernel Density Estimation (KDE) has been proposed in [16]; it enables building PDFs from data by observing only a single data point at a time. In [17], a metric on the space of multivariate Gaussian distributions is proposed, based on the fundamental idea of parametrizing this space as a Riemannian symmetric space. In [18], a robust and efficient sparse representation-based Earth Mover's Distance (EMD) is presented. It uses an effective pair-wise-based method to learn EMD metrics between GMMs, along with two ground distances between Gaussian components based on information geometry, obtained by embedding the space of Gaussians into a Lie group or regarding it as a product of Lie groups, in order to measure the intrinsic distance between Gaussians in the underlying Riemannian manifold. The method is further improved and extended in [19], which also includes a study of various image features for GMM matching, such as the Gabor filter, the Local Binary Pattern descriptor, SIFT, the covariance descriptor, and high-level features extracted by deep convolutional networks. The computational efficiency, as well as the accuracy, of these approximations has been confirmed in most cases by experiments on both real and synthetic data.
Defining a proper feature space dimensionality reduction technique to resolve computational complexity and cope with the problems of data sparsity and the curse of dimensionality is another challenging issue. In miscellaneous natural language processing, speech recognition and synthesis, emotion recognition, image recognition and retrieval, and other machine learning tasks, feature vectors contain hundreds or even thousands of features. Diverse dimensionality reduction techniques have been designed to reduce computational complexity while aiming to keep the same, or at least highly comparable, accuracy. For instance, Principal Geodesic Analysis (PGA) tries to map SPD matrices into a tangent space by maximizing the variability of the mapped data points [20]. Nonlinear dimensionality reduction algorithms such as Locally Linear Embedding (LLE) and Laplacian Eigenmaps (LE) provide embeddings into a lower dimensional space based on Riemannian geometry [21]. The LE method uses the connection between the Laplace–Beltrami operator and the graph Laplacian to construct representations with locality-preserving properties [22]. Locality Preserving Projections (LPP), an approach based on the LE, was developed as an alternative to Principal Component Analysis (PCA). The LPP learns linear projective maps by solving a variational problem that optimally preserves the local neighborhood structure of the original dataset in the transformed space [23]. On the other hand, kernel approaches, like those presented in [24], embed feature matrices in a Reproducing Kernel Hilbert Space; dimensionality reduction is then performed using various kernel-based methods. Finally, the idea behind manifold learning techniques is to increase the discrimination of the transformed features by projecting them onto a lower dimensional manifold embedded in a subspace of the original high dimensional feature space [25]. Supervised methods, such as Linear Discriminant Analysis (LDA) or the Maximum Margin Criterion (MMC), as well as unsupervised methods, like PCA, are some of the most popular manifold learning representatives. Unlike PCA, which aims to preserve the global structure of the data, the LPP tends to preserve its local structure. Therefore, it may keep more discriminating information, assuming that samples from the same class are close to each other in the input space. Neighborhood Preserving Embedding (NPE) is yet another methodology aiming to preserve the local neighborhood structure on a data manifold. The NPE is able to learn not only the projective matrix but also the weights extracting the neighborhood information in the original feature space [26].
Two different dimensionality reduction methods have been analyzed and compared in our previous work. The first one was based on an LPP-like projection of the parameter space [27]. The LPP-based GMM similarity measure was used to calculate the linear transformation matrix that projects the vectorized parameters of Gaussian components of arbitrary GMMs into a low-dimensional Euclidean space, while the distinctiveness of the original feature space is preserved in the lower-dimensional space. Both the symmetric and the nonsymmetric versions of the LPP-based GMM similarity measure have been developed, utilizing the symmetric or the one-sided KL divergence between the Gaussian components of the corresponding GMMs. The second method assumes that the parameters of full covariance Gaussians lie close to each other on a lower-dimensional surface embedded in the cone of positive definite matrices, contrary to the assumption that the data themselves lie on a lower-dimensional manifold embedded in the feature space [28]. The NPE-based idea has been employed to evaluate the projection matrix, which is then applied to the parameter space of Gaussian components, projecting the parameters of Gaussian components into lower-dimensional Euclidean vectors.
Recently, a novel geometry-aware dimensionality reduction technique has been presented [29]. This technique preserves the local structure of the data through Distance Preservation to the Local Mean (DPLM), taking into account the geometry of the SPD matrices. Based on this approach, a novel GMM similarity measure is proposed in this paper. The method utilizes the fact that the space of multivariate Gaussians is a Riemannian manifold that can be embedded into the cone of SPD matrices. Both the supervised and the unsupervised versions of the DPLM algorithm have been employed. Baseline KL-based GMM similarity measures are then applied over the low-dimensional feature matrices, i.e., the GMM projections, preserving the locality induced by the manifold structure of the original parameter space while achieving a significantly lower computational cost. A much better trade-off between recognition accuracy and computational complexity has been achieved in comparison to KL-based distance approximations calculated between GMMs in the original parameter space. Experiments are conducted within a texture recognition task, but the proposed method is suitable for any big-data artificial intelligence system using a large number of GMMs as well as high dimensional features.
The paper is organized as follows. In Section 2, we start with a review of baseline KL-based GMM similarity measures presented in the literature. We then propose a novel GMM similarity measure motivated by the geometry-aware dimensionality reduction algorithm presented in [29], which projects the original feature space into a low-dimensional feature space. Computational complexities in the recognition phase are also estimated. In Section 3, we compare and discuss the results obtained using the proposed DPLM-based and the baseline KL-based similarity measures within a texture recognition task conducted on three publicly available image databases (UIUC [30], KTH-TIPS [31], and UMD [32]). In all examined cases, the proposed method achieved a far better trade-off between accuracy and computational complexity than all baseline methods. The paper is summarized in Section 4. Author contributions, funding, and a data availability statement are provided at the end of the manuscript.
2. Materials and Methods
In this section, we discuss baseline GMM similarity measures based on some of the most popular KL divergence approximations presented in the literature and propose a novel GMM similarity measure constructed using a nonlinear geometry-aware dimensionality reduction algorithm for the manifold of SPD matrices. At the end of the section, we estimate the computational complexities of the proposed and the baseline GMM similarity measures.
2.1. KL-Based GMM Similarity Measures
The KL divergence measures how much one probability distribution differs from another probability distribution [33]. Although there are other loss functions, the KL divergence is used as a fundamental equation in information theory and the most natural solution in many machine learning tasks dealing with probability distributions. For two probability distributions $p$ and $q$, the measure is defined as
$$\mathrm{KL}(p\|q)=\int p(x)\log\frac{p(x)}{q(x)}\,dx.$$
For two single Gaussians $g_1=\mathcal{N}(\mu_1,\Sigma_1)$ and $g_2=\mathcal{N}(\mu_2,\Sigma_2)$, it can be computed easily using an intuitive closed-form solution given by
$$\mathrm{KL}(g_1\|g_2)=\frac{1}{2}\left[\log\frac{|\Sigma_2|}{|\Sigma_1|}+\mathrm{tr}\!\left(\Sigma_2^{-1}\Sigma_1\right)+(\mu_2-\mu_1)^{T}\Sigma_2^{-1}(\mu_2-\mu_1)-d\right],\qquad(1)$$
where $d$ is the dimensionality of the Gaussians $g_1$ and $g_2$, $\mu_1$, $\mu_2$ and $\Sigma_1$, $\Sigma_2$ are their mean vectors and covariance matrices, and $\mathrm{tr}(\cdot)$ is the trace function, i.e., the sum of the elements on the main diagonal. On the other hand, there is no closed-form solution for the KL divergence between two GMMs.
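As an illustration, a minimal NumPy sketch of the closed-form expression in Equation (1) could look as follows; the function name kl_gauss and its interface are illustrative choices, not part of the original method.

```python
import numpy as np

def kl_gauss(mu1, sigma1, mu2, sigma2):
    """Closed-form KL divergence KL( N(mu1, sigma1) || N(mu2, sigma2) ), cf. Equation (1)."""
    d = mu1.shape[0]
    sigma2_inv = np.linalg.inv(sigma2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(sigma1)
    _, logdet2 = np.linalg.slogdet(sigma2)
    return 0.5 * (logdet2 - logdet1
                  + np.trace(sigma2_inv @ sigma1)
                  + diff @ sigma2_inv @ diff
                  - d)
```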
The Monte Carlo (MC) sampling [14] is a straightforward and the most accurate, although computationally extremely expensive, solution for the KL divergence between two different GMMs. The idea is to sample the probability distribution $p$ using independent and identically distributed (i.i.d.) random samples $x_i\sim p(x)$, $i=1,\dots,N$, so that $\mathrm{E}_{p}\!\left[\log\frac{p(x)}{q(x)}\right]=\mathrm{KL}(p\|q)$. Using $N$ samples, we obtain $\mathrm{KL}_{\mathrm{MC}}(p\|q)$ as
$$\mathrm{KL}_{\mathrm{MC}}(p\|q)=\frac{1}{N}\sum_{i=1}^{N}\log\frac{p(x_i)}{q(x_i)}\;\xrightarrow{\;N\to\infty\;}\;\mathrm{KL}(p\|q).$$
The variance of the estimation error is now computed as $\frac{1}{N}\left(\mathrm{E}_{p}\!\left[\left(\log\frac{p(x)}{q(x)}\right)^{2}\right]-\mathrm{KL}(p\|q)^{2}\right)$. Unfortunately, the solution is unacceptably time-consuming and expensive for most real-world and big-data applications, which is why various approximations have been proposed for estimating the KL divergence between two GMMs accurately and efficiently.
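A possible NumPy/SciPy sketch of the Monte Carlo estimate is given below. GMMs are represented here as (weights, means, covariances) tuples; this representation and the function names are our assumptions for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_logpdf(x, weights, means, covs):
    """Log-density of a GMM evaluated at samples x of shape (n_samples, d)."""
    comp = np.stack([w * multivariate_normal.pdf(x, mean=m, cov=S)
                     for w, m, S in zip(weights, means, covs)])
    return np.log(comp.sum(axis=0))

def gmm_sample(n, weights, means, covs, rng):
    """Draw n i.i.d. samples from the GMM."""
    idx = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in idx])

def kl_mc(p, q, n=100_000, seed=0):
    """Monte Carlo estimate of KL(p||q); p and q are (weights, means, covs) tuples."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n, *p, rng)
    return float(np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q)))
```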
The roughest approximation is based on the convexity of the KL divergence. The upper bound of the KL divergence [34] between two GMMs $p=\sum_{i=1}^{m}\alpha_{i}p_{i}$ and $q=\sum_{j=1}^{n}\beta_{j}q_{j}$ is given by
$$\mathrm{KL}(p\|q)\le\sum_{i=1}^{m}\sum_{j=1}^{n}\alpha_{i}\beta_{j}\,\mathrm{KL}(p_{i}\|q_{j}),$$
where $p_i$ and $q_j$ are the Gaussian components of the corresponding mixtures, $\alpha_i$ and $\beta_j$ are the corresponding weights, satisfying $\sum_{i=1}^{m}\alpha_{i}=1$ and $\sum_{j=1}^{n}\beta_{j}=1$, and $\mathrm{KL}(p_{i}\|q_{j})$ can be computed using (1), yielding the Weighted Average (WA) approximation
$$\mathrm{KL}_{\mathrm{WA}}(p\|q)=\sum_{i=1}^{m}\sum_{j=1}^{n}\alpha_{i}\beta_{j}\,\mathrm{KL}(p_{i}\|q_{j}).\qquad(3)$$
The $\mathrm{KL}_{\mathrm{WA}}$ approximation is computationally much more efficient than the $\mathrm{KL}_{\mathrm{MC}}$ approximation. However, this approximation is too crude in cases when each mixture density is composed of unimodal distributions and the modes are far apart [34].
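Under the same hypothetical (weights, means, covariances) representation, and reusing the kl_gauss helper sketched earlier, the WA approximation reduces to a double sum:

```python
def kl_wa(p, q):
    """Weighted Average (WA) approximation of KL(p||q), cf. Equation (3)."""
    (aw, am, ac), (bw, bm, bc) = p, q
    return sum(ai * bj * kl_gauss(mi, Si, mj, Sj)
               for ai, mi, Si in zip(aw, am, ac)
               for bj, mj, Sj in zip(bw, bm, bc))
```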
Various other approximations of the KL divergence between two GMMs have been proposed and applied in several machine learning tasks, such as speech recognition, image retrieval, and speaker identification [34,35,36]. The Matching-Based (MB) approximation [34], given by
$$\mathrm{KL}_{\mathrm{MB}}(p\|q)=\sum_{i=1}^{m}\alpha_i\left(\mathrm{KL}\big(p_i\|q_{\pi(i)}\big)+\log\frac{\alpha_i}{\beta_{\pi(i)}}\right),\qquad \pi(i)=\arg\min_{j}\mathrm{KL}(p_i\|q_j),\qquad(4)$$
is based on the assumption that the component $q_{\pi(i)}$ that is most proximal to $p_i$ dominates the integral $\int p_i(x)\log q(x)\,dx$. A computationally more efficient matching-based variant has also been proposed. These approximations perform well when the Gaussian components of $p$ and $q$ are mostly far apart; however, they are inappropriate when there is significant overlapping among the Gaussians of $p$ and $q$. The Unscented Transform based approximation, which uses an unscented estimator similar to the Monte Carlo approximation except that the samples are chosen deterministically, provides a way to deal with overlapping GMMs.
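A sketch of the matching-based approximation, as reconstructed in Equation (4) above and again reusing the hypothetical kl_gauss helper, could look as follows:

```python
import numpy as np

def kl_mb(p, q):
    """Matching-Based (MB) approximation of KL(p||q), cf. Equation (4):
    each component of p is matched to its KL-closest component of q."""
    (aw, am, ac), (bw, bm, bc) = p, q
    total = 0.0
    for ai, mi, Si in zip(aw, am, ac):
        dists = [kl_gauss(mi, Si, mj, Sj) for mj, Sj in zip(bm, bc)]
        j = int(np.argmin(dists))
        total += ai * (dists[j] + np.log(ai / bw[j]))
    return total
```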
The Unscented Transform (UT) is a mathematical function used to estimate the result of applying a nonlinear transformation to a probability distribution characterized in terms of a finite set of statistics [35]. Assuming that $p_i=\mathcal{N}(\mu_i,\Sigma_i)$, the unscented transform approach approximates the integral $\int p_i(x)h(x)\,dx$ as
$$\int p_i(x)h(x)\,dx\approx\frac{1}{2d}\sum_{k=1}^{2d}h(x_{i,k}),\qquad x_{i,k}=\mu_i+\big(\sqrt{d\,\Sigma_i}\big)_k,\quad x_{i,d+k}=\mu_i-\big(\sqrt{d\,\Sigma_i}\big)_k,$$
where $\big(\sqrt{d\,\Sigma_i}\big)_k$ is the $k$th column of the matrix square root of $d\,\Sigma_i$. Approximating the integral $\int p(x)\log p(x)\,dx=\sum_{i=1}^{m}\alpha_i\int p_i(x)\log p(x)\,dx$ in this way, we obtain
$$\int p(x)\log p(x)\,dx\approx\frac{1}{2d}\sum_{i=1}^{m}\alpha_i\sum_{k=1}^{2d}\log p(x_{i,k}).$$
Approximating the second integral, $\int p(x)\log q(x)\,dx$, in a similar manner, $\mathrm{KL}_{\mathrm{UT}}(p\|q)$ is obtained as the difference of the two estimates.
The Variational Approximation (VA) [14,36] is given by
$$\mathrm{KL}_{\mathrm{VA}}(p\|q)=\sum_{i=1}^{m}\alpha_i\log\frac{\sum_{i'=1}^{m}\alpha_{i'}\,e^{-\mathrm{KL}(p_i\|p_{i'})}}{\sum_{j=1}^{n}\beta_{j}\,e^{-\mathrm{KL}(p_i\|q_{j})}}.\qquad(10)$$
This approximation utilizes the KL divergences between the individual Gaussian components in order to obtain an approximate KL divergence between the full GMMs $p$ and $q$, and it is a simple, closed-form expression.
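A corresponding sketch, under the same illustrative GMM representation and kl_gauss helper as above:

```python
import numpy as np

def kl_va(p, q):
    """Variational Approximation (VA) of KL(p||q), cf. Equation (10)."""
    (aw, am, ac), (bw, bm, bc) = p, q
    total = 0.0
    for ai, mi, Si in zip(aw, am, ac):
        num = sum(aj * np.exp(-kl_gauss(mi, Si, mj, Sj))
                  for aj, mj, Sj in zip(aw, am, ac))
        den = sum(bj * np.exp(-kl_gauss(mi, Si, mj, Sj))
                  for bj, mj, Sj in zip(bw, bm, bc))
        total += ai * np.log(num / den)
    return total
```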
2.2. DPLM-Based GMM Similarity Measure
To construct a novel DPLM-based GMM similarity measure and decrease the computational cost, we propose the following procedure:
The original set of Gaussian mixture components is embedded into the cone of SPD matrices;
A nonlinear geometry-aware dimensionality reduction algorithm for the manifold of SPD matrices [
29] is applied to obtain a projection matrix, providing a low dimensional representation of the manifold by preserving the distance to the local mean (DPLM);
The embeddings are projected into a lower dimensional space using the projection matrix, where baseline KL-based measures can now be used to measure how much one GMM differs from another in a cost-efficient way.
A Riemannian manifold is a real, smooth manifold equipped with a Riemannian metric, i.e., a positive-definite inner product on the tangent space at each point that varies smoothly from point to point, providing local notions of angle, curve length, surface area, and volume [37]. The set of $n\times n$ SPD matrices, denoted by $\mathrm{SPD}(n)$, is a differentiable manifold with a natural Riemannian structure. To obtain the projection matrix using the proposed nonlinear geometry-aware dimensionality reduction algorithm for the manifold of SPD matrices [29], we use the fact that the set of multivariate Gaussians is a Riemannian manifold and that any $d$-dimensional multivariate Gaussian $g=\mathcal{N}(\mu,\Sigma)$ can be embedded into the cone of SPD matrices $\mathrm{SPD}(d+1)$ in the following way
$$P=|\Sigma|^{-\frac{2}{d+1}}\begin{bmatrix}\Sigma+\mu\mu^{T}&\mu\\ \mu^{T}&1\end{bmatrix},\qquad(11)$$
where $|\Sigma|$ denotes the determinant of the covariance matrix of the Gaussian component $g$ [18]. All information regarding a particular Gaussian component $g$ of a given GMM is now contained in a single positive definite matrix $P$, an element of $\mathrm{SPD}(d+1)$.
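A minimal sketch of the embedding in Equation (11), with an illustrative function name:

```python
import numpy as np

def embed_gaussian(mu, sigma):
    """Embed a d-dimensional Gaussian N(mu, sigma) into the cone of
    (d + 1) x (d + 1) SPD matrices, cf. Equation (11)."""
    d = mu.shape[0]
    top = np.hstack([sigma + np.outer(mu, mu), mu[:, None]])
    bottom = np.hstack([mu[None, :], np.ones((1, 1))])
    return np.linalg.det(sigma) ** (-2.0 / (d + 1)) * np.vstack([top, bottom])
```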
The aim is to find a projection matrix $W\in\mathbb{R}^{(d+1)\times l}$, mapping the above-mentioned embeddings from $\mathrm{SPD}(d+1)$ into $\mathrm{SPD}(l)$, where $l<d+1$. The local structure is preserved in the lower dimensional representation by preserving the distance to the local mean, i.e., by calculating the Riemannian mean of the $K$ nearest neighbors of each embedding and finding a projection matrix that preserves the distances between the embeddings and these local means.
Let us assume that we have obtained a set of $N$ $(d+1)$-dimensional embeddings $\mathcal{X}=\{X_1,\dots,X_N\}$, where $X_i\in\mathrm{SPD}(d+1)$, $y_i$ is the corresponding class label, and
$$\mathcal{N}_i=\{X_{i_1},\dots,X_{i_K}\}\qquad(12)$$
is the set of $K$ nearest neighbors of $X_i$. In the case of the supervised version of the algorithm, the $K$ nearest neighbors are selected only among embeddings that have the same class label as the current embedding. The Riemannian mean of each set $\mathcal{N}_i$, denoted by $M_i$, is calculated using the equation
$$M_i=\arg\min_{M\in\mathrm{SPD}(d+1)}\sum_{k=1}^{K}\delta_{R}^{2}\big(X_{i_k},M\big),$$
where $\delta_R$ is the Affine Invariant Riemannian Metric (AIRM) [38], given by
$$\delta_{R}(X,Y)=\big\|\log\big(X^{-1/2}\,Y\,X^{-1/2}\big)\big\|_{F}.$$
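For completeness, a small NumPy sketch of the AIRM and of a fixed-point computation of the Riemannian (Karcher) mean is given below; the helper names are ours, and the simple iteration shown here is only one of several possible ways to compute the mean.

```python
import numpy as np

def _spd_func(X, f):
    """Apply a scalar function f to the eigenvalues of a symmetric matrix X."""
    w, V = np.linalg.eigh(X)
    return (V * f(w)) @ V.T

def airm(X, Y):
    """Affine Invariant Riemannian Metric between SPD matrices X and Y."""
    X_isqrt = _spd_func(X, lambda w: w ** -0.5)
    return np.linalg.norm(_spd_func(X_isqrt @ Y @ X_isqrt, np.log), 'fro')

def karcher_mean(mats, iters=20):
    """Riemannian (Karcher) mean of SPD matrices under the AIRM,
    computed by a simple fixed-point iteration in the tangent space."""
    M = sum(mats) / len(mats)                     # arithmetic-mean initialization
    for _ in range(iters):
        M_sqrt = _spd_func(M, np.sqrt)
        M_isqrt = _spd_func(M, lambda w: w ** -0.5)
        # average the matrix logarithms in the tangent space at the current estimate
        T = sum(_spd_func(M_isqrt @ X @ M_isqrt, np.log) for X in mats) / len(mats)
        M = M_sqrt @ _spd_func(T, np.exp) @ M_sqrt
    return M
```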
The projection matrix $W$ can now be calculated by solving the optimization problem given by
$$W^{*}=\arg\min_{W\in\mathbb{R}^{(d+1)\times l},\;W^{T}W=I_{l}}F(W),\qquad(14)$$
where $I_l$ is an $l\times l$ identity matrix, and
$$F(W)=\sum_{i=1}^{N}\Big|\,\delta_{J}\big(W^{T}X_{i}W,\,W^{T}M_{i}W\big)-\delta_{J}\big(X_{i},M_{i}\big)\Big|,$$
where $X_i$ and $M_i$ are the embeddings and their local means defined above, and $\delta_{J}$ is the Jensen–Bregman LogDet Divergence [39], given by
$$\delta_{J}(X,Y)=\log\Big|\frac{X+Y}{2}\Big|-\frac{1}{2}\log|XY|.$$
The gradient of $F$ with respect to $W$ can be computed as
$$\nabla_{W}F(W)=\sum_{i=1}^{N}\mathrm{sgn}\Big(\delta_{J}\big(W^{T}X_{i}W,W^{T}M_{i}W\big)-\delta_{J}\big(X_{i},M_{i}\big)\Big)\,\nabla_{W}\,\delta_{J}\big(W^{T}X_{i}W,W^{T}M_{i}W\big),$$
where $\mathrm{sgn}(\cdot)$ is the sign function, using the prior knowledge that
$$\nabla_{W}\log\big|W^{T}XW\big|=2\,XW\big(W^{T}XW\big)^{-1}.$$
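A sketch of the JBLD and of the DPLM cost as reconstructed in Equation (14) follows; the orthogonality constraint $W^{T}W=I_l$ and the gradient-based (manifold) optimizer itself are omitted here, and the function names are illustrative.

```python
import numpy as np

def jbld(X, Y):
    """Jensen-Bregman LogDet divergence between SPD matrices X and Y."""
    _, ld_mid = np.linalg.slogdet((X + Y) / 2)
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mid - 0.5 * (ld_x + ld_y)

def dplm_objective(W, embeddings, local_means):
    """DPLM cost of Equation (14): absolute change of the distance to the
    local mean caused by the projection P -> W.T @ P @ W."""
    return sum(abs(jbld(W.T @ X @ W, W.T @ M @ W) - jbld(X, M))
               for X, M in zip(embeddings, local_means))
```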
The lower dimensional projection of an embedding $P\in\mathrm{SPD}(d+1)$ can now be computed as
$$P^{\mathrm{low}}=W^{T}PW\qquad(20)$$
and used as the original GMM representative inside Equations (3), (4) and (10), providing the computationally efficient DPLM-based approximations $\mathrm{DPLM}^{s}_{\mathrm{WA}}$, $\mathrm{DPLM}^{s}_{\mathrm{MB}}$ and $\mathrm{DPLM}^{s}_{\mathrm{VA}}$ for the supervised, and $\mathrm{DPLM}^{u}_{\mathrm{WA}}$, $\mathrm{DPLM}^{u}_{\mathrm{MB}}$ and $\mathrm{DPLM}^{u}_{\mathrm{VA}}$ for the unsupervised version of the above-mentioned algorithm. Namely, the GMMs $p$ and $q$ in Formulas (3), (4) and (10) are represented by lower-dimensional projections of the original $(d+1)$-dimensional embeddings, i.e., of the symmetric positive definite matrices obtained by (11), using the projection matrix $W$ calculated by solving the optimization problem given by (14) and introduced into expression (20). As previously explained, for $\mathrm{DPLM}^{s}_{\mathrm{WA}}$, $\mathrm{DPLM}^{s}_{\mathrm{MB}}$ and $\mathrm{DPLM}^{s}_{\mathrm{VA}}$, the $K$ nearest neighbors (see expression (12)) are selected only among embeddings that have the same class label as the current embedding. In the case of $\mathrm{DPLM}^{u}_{\mathrm{WA}}$, $\mathrm{DPLM}^{u}_{\mathrm{MB}}$ and $\mathrm{DPLM}^{u}_{\mathrm{VA}}$, no such requirement is applied.
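Reusing the embed_gaussian sketch from above, the mapping of a GMM to its low-dimensional SPD representatives could be sketched as follows; how exactly the projected matrices enter Equations (3), (4) and (10) (e.g., as covariance matrices of zero-mean Gaussians in Equation (1)) is our assumption for illustration and is not prescribed by the text above.

```python
def project_gmm(weights, means, covs, W):
    """Map every Gaussian component of a GMM to its low-dimensional SPD
    representative via Equations (11) and (20); weights are kept unchanged."""
    projections = [W.T @ embed_gaussian(m, S) @ W for m, S in zip(means, covs)]
    return weights, projections
```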
2.3. Computational Complexity
In this subsection, we estimate the computational cost of the above-mentioned algorithms in the testing phase, bearing in mind that the computational cost in the training phase is not crucial for deployment.
Let us assume, without loss of generality, that GMMs p and q have the same number of components m, represented using full covariance matrices, and d is the dimension of the original feature space.
The computational complexity of the Monte Carlo approximation is estimated as $\mathcal{O}(Nmd^{3})$, where $N$ is the number of samples, since each sample requires evaluating all $m$ Gaussian components of both mixtures. However, to obtain an accurate KL divergence approximation, the number of i.i.d. samples $N$ must be very large.
The complexity of the KL-based measures $\mathrm{KL}_{\mathrm{WA}}$, $\mathrm{KL}_{\mathrm{MB}}$ and $\mathrm{KL}_{\mathrm{VA}}$ is roughly equivalent and estimated as $\mathcal{O}(m^{2}d^{3})$. The complexity of calculating the KL divergence between two $d$-variate Gaussians is of order $\mathcal{O}(d^{3})$, which is approximately equal to the complexity of calculating the inversion of a $d\times d$ matrix. Since there are $\mathcal{O}(m^{2})$ such inversions, we obtain the previous estimate.
According to [40], the complexity of the multiplications between the $(d+1)\times(d+1)$-dimensional embeddings and the $(d+1)\times l$-dimensional projection matrix $W$ is estimated as $\mathcal{O}\big(l(d+1)^{2}+l^{2}(d+1)\big)$, based on expression (20) (both the left and the right multiplication). There are $m$ such multiplications, so the complexity is calculated as $\mathcal{O}\big(ml(d+1)^{2}\big)$. The complexity of the KL-based measures applied over the $l\times l$-dimensional projections is roughly estimated as $\mathcal{O}(m^{2}l^{3})$, as previously explained. Therefore, the complexity of the proposed DPLM-based solutions, namely $\mathrm{DPLM}^{s}_{\mathrm{WA}}$, $\mathrm{DPLM}^{s}_{\mathrm{MB}}$, $\mathrm{DPLM}^{s}_{\mathrm{VA}}$, $\mathrm{DPLM}^{u}_{\mathrm{WA}}$, $\mathrm{DPLM}^{u}_{\mathrm{MB}}$ and $\mathrm{DPLM}^{u}_{\mathrm{VA}}$, is estimated as $\mathcal{O}\big(ml(d+1)^{2}+m^{2}l^{3}\big)$, where $l$ is the dimension of the transformed feature space.
3. Results and Discussion
In this section, we present the results obtained using the novel DPLM-based GMM similarity measures proposed in Section 2.2 and compare them with the baseline KL-based GMM similarity measures described in Section 2.1. The algorithms were evaluated in a texture recognition task. The system was trained using data extracted from three publicly available corpora: the UIUC texture database (named after the University of Illinois Urbana-Champaign), the KTH-TIPS image database (KTH is an abbreviation of the Royal Institute of Technology, while TIPS stands for Textures under varying Illumination, Pose and Scale), and the UMD texture dataset (named after the University of Maryland). From the UIUC database, 5 classes were extracted (wood, water, granite, marble, and floor), with images taken at a fixed pixel resolution. In the case of KTH-TIPS, we also took 5 classes (aluminum foil, brown bread, corduroy, cotton, and cracker), and the images were cropped to a fixed pixel size. Finally, for the third database, we used sample images from classes 2 (paint cans), 3 (stones), 8 (brick walls), 9 (apples), and 12 (textile patterns), sampled at a fixed pixel resolution. Selected samples from all three databases are shown in Figure 1.
For the purposes of the experiments, $\mathrm{KL}_{\mathrm{WA}}$, $\mathrm{KL}_{\mathrm{MB}}$ and $\mathrm{KL}_{\mathrm{VA}}$, defined by Equations (3), (4) and (10), were selected as baseline GMM similarity measures. Both the supervised ($\mathrm{DPLM}^{s}_{\mathrm{WA}}$, $\mathrm{DPLM}^{s}_{\mathrm{MB}}$, $\mathrm{DPLM}^{s}_{\mathrm{VA}}$) and the unsupervised ($\mathrm{DPLM}^{u}_{\mathrm{WA}}$, $\mathrm{DPLM}^{u}_{\mathrm{MB}}$, $\mathrm{DPLM}^{u}_{\mathrm{VA}}$) versions of the DPLM-based GMM similarity measures were applied. Compared to all baseline measures, a significantly better trade-off between computational cost and accuracy was obtained for the proposed DPLM-based GMM similarity measures.
Region covariance descriptors proposed in [41] were used as texture features, since they have already shown good performance in various texture recognition tasks. They were formed in the following way. For any given image, patches of a fixed size were extracted, with a step of 16 pixels for the UIUC, 5 pixels for the KTH-TIPS, and 32 pixels for the UMD database. For every pixel positioned at $(x,y)$, features comprising the illumination $I$, the first-order derivatives $I_x$ and $I_y$, and the second-order derivatives $I_{xx}$ and $I_{yy}$ were calculated. Covariance matrices were computed over these per-pixel features within each patch and additionally vectorized by aligning their upper triangular values into feature vectors. The parameters of the GMMs were then estimated using the Expectation–Maximization (EM) algorithm [42] applied over the pool of feature vectors obtained as previously described. Every sample image was uniformly divided into four sub-images represented by four GMMs, which were then used for training and testing purposes. Every GMM (or its low-dimensional projection in the case of the DPLM-based algorithms) was compared to all other GMMs in the training set, and its label was determined as a majority vote over its 5 nearest neighbors, using the K-Nearest Neighbors (KNN) rule; in the case of the DPLM-based algorithms, the GMMs were first embedded into the cone of SPD matrices.
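A minimal sketch of how such a region covariance descriptor could be computed for a single grayscale patch is given below; the exact feature set (e.g., whether pixel coordinates or absolute derivative values are included) and the derivative operators used in the paper may differ, so this is an illustrative assumption only.

```python
import numpy as np

def region_covariance(patch):
    """Covariance descriptor of a grayscale patch, using illumination and
    first-/second-order derivatives as per-pixel features."""
    I = patch.astype(float)
    Ix, Iy = np.gradient(I)
    Ixx = np.gradient(Ix, axis=0)
    Iyy = np.gradient(Iy, axis=1)
    # absolute derivative values are an assumption of this sketch
    F = np.stack([I, np.abs(Ix), np.abs(Iy), np.abs(Ixx), np.abs(Iyy)], axis=-1)
    F = F.reshape(-1, F.shape[-1])            # one feature vector per pixel
    return np.cov(F, rowvar=False)            # d x d covariance descriptor

def vectorize_upper(C):
    """Align the upper-triangular entries of a covariance descriptor into a vector."""
    return C[np.triu_indices(C.shape[0])]
```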
In Table 1, Table 2 and Table 3, the recognition accuracies are presented for the proposed DPLM-based measures as well as for the baseline KL-based measures. Each GMM was represented using one to ten Gaussian components ($m=1,\dots,10$). For the DPLM-based measures, the number of nearest neighbors $K$ of any current embedding was fixed, and the projection dimension $l$ was set to less than half of the original feature space dimension.
For each class, a fixed number of samples was randomly selected for training, keeping the rest for testing. In most cases, the accuracies obtained using the baseline KL-based measures and the full covariance matrices were only slightly better than the results obtained using the proposed DPLM-based measures and the reduced-size representatives. In some configurations the difference was somewhat more significant, but, on the other hand, the recognition time was reduced by more than a third in comparison to all baseline measures. No significant difference was observed between the supervised and the unsupervised versions of the proposed algorithm, probably because we used a relatively small value of $K$, so that most nearest neighbors belonged to the same class even in the unsupervised version; this will be examined further in future work. Recall that the computational complexities of the proposed DPLM-based algorithms were roughly estimated as $\mathcal{O}\big(ml(d+1)^{2}+m^{2}l^{3}\big)$, while the computational complexities of the baseline KL-based measures were estimated as $\mathcal{O}(m^{2}d^{3})$. The ratio between the computational complexity of the proposed DPLM-based measures and that of any of the mentioned baseline KL-based measures is therefore largely in favor of the proposed DPLM-based measures.
In Figure 2, Figure 3 and Figure 4, CPU processing times during the test phase are presented for the proposed DPLM-based and the baseline KL-based algorithms (each baseline KL-based measure vs. its supervised and unsupervised DPLM-based counterparts) for the UIUC database. Note that the CPU times include not only the evaluation of the given measures but the whole testing procedure, i.e., they also comprise the final voting, meaning that the results would otherwise be even more in favor of the proposed DPLM-based algorithms compared to the KL-based algorithms. In Figure 5, Figure 6 and Figure 7, the same results are presented for the experiments conducted using the KTH-TIPS database. The results of the experiments conducted using the UMD database are presented in Figure 8, Figure 9 and Figure 10. It can be concluded that the proposed measures provide significantly lower CPU processing times in comparison to all baseline measures, due to the significant reduction in the dimensionality of the original feature space.
The experiments on the UMD and UIUC databases were conducted on a workstation equipped with an AMD Ryzen™ 7 5800H processor (3.20 GHz, 8 cores, 16 threads, 16 MB cache) and 16 GB of DDR4 3200 MHz RAM. The experiments on the KTH-TIPS database were conducted on a workstation equipped with an Intel® Core™ i5-4690 processor (3.50 GHz, 4 cores, 4 threads, 6 MB cache) and 16 GB of DDR3 1600 MHz RAM. Differences in execution times among repeated experiments for the same configuration were statistically negligible, i.e., the measurements are reproducible and consistent. Bearing in mind the purpose of these experiments, the absolute values of the total execution times were not of interest, as long as all experiments for a single database were conducted on the same hardware. We were only interested in the relative differences, i.e., the ratios between the baseline and the proposed algorithms for the different configurations used in the paper.
The proposed methodology could also be applied in various other tasks and learning methodologies comprising models trained during a learning phase, such as the one presented in [43]. The transformation matrix is formed during training and, as a consequence, the process does not consume CPU time during the deployment phase. By reducing the dimension of the original feature space, significantly better results have been obtained concerning the trade-off between speed and accuracy in all our experiments.