Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems

Jalali Mojahed, Atena; Moattar, Mohammad Hossein; Ghaffari, Hamidreza

doi:10.3390/bdcc8090109

Open AccessArticle

Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems

by

Atena Jalali Mojahed

¹,

Mohammad Hossein Moattar

^2,*

and

Hamidreza Ghaffari

¹

Department of Computer Engineering, Ferdows Branch, Islamic Azad University, Ferdows, Iran

²

Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran

^*

Author to whom correspondence should be addressed.

Big Data Cogn. Comput. 2024, 8(9), 109; https://doi.org/10.3390/bdcc8090109

Submission received: 23 July 2024 / Revised: 21 August 2024 / Accepted: 3 September 2024 / Published: 4 September 2024

Download

Browse Figures

Versions Notes

Abstract

Learning distance metrics and distinguishing between samples from different classes are among the most important topics in machine learning. This article proposes a new distance metric learning approach tailored for highly imbalanced datasets. Imbalanced datasets suffer from a lack of data in the minority class, and the differences in class density strongly affect the efficiency of the classification algorithms. Therefore, the density of the classes is considered the main basis of learning the new distance metric. It is possible that the data of one class are composed of several densities, that is, the class is a combination of several normal distributions with different means and variances. In this paper, considering that classes may be multimodal, the distribution of each class is assumed in the form of a mixture of multivariate Gaussian densities. A density-based clustering algorithm is used for determining the number of components followed by the estimation of the parameters of the Gaussian components using maximum a posteriori density estimation. Then, the Bhattacharya distance between the Gaussian mixtures of the classes is maximized using an iterative scheme. To reach a large between-class margin, the distance between the external components is increased while decreasing the distance between the internal components. The proposed method is evaluated on 15 imbalanced datasets using the k-nearest neighbor (KNN) classifier. The results of the experiments show that using the proposed method significantly improves the efficiency of the classifier in imbalance classification problems. Also, when the imbalance ratio is very high and it is not possible to correctly identify minority class samples, the proposed method still provides acceptable performance.

Keywords:

imbalanced data classification; distance metric learning; Bhattacharya divergence; class density estimation

1. Introduction

Machine learning (ML) allows computers to learn without being explicitly programmed. Many machine learning and pattern recognition methods require calculating the distance between data points, often utilizing the Euclidean distance metric [1,2], as developing a new and suitable distance metric for the data is challenging. Currently, distance metric learning (DML) is a significant aspect of machine learning, where a machine learns a novel distance metric according to the input patterns’ characteristics. This new metric can enhance the effectiveness of classification algorithms that rely on the distance metric.

Supervised learning is a category of learning algorithms. In this type of learning, the algorithm has access to labeled data. The goal of supervised distance metric learning is to train a new distance metric that brings the data points with the same label closer together and separates the points with different labels. In the learning process, according to Figure 1, the most appropriate distance matrix is extracted for each dataset using the distance metric learning algorithm. Then, the data are mapped to a new space based on the new distance metrics, and after that, various classification algorithms can be applied to the mapped data.

One of the important issues in the field of data mining and machine learning is the problem of classifying imbalanced datasets. An imbalanced distribution of classes in datasets occurs when the number of observations in the minority class (positive class) is much less than in the majority class (negative class). The existence of rare or very expensive samples in the real world creates imbalanced datasets [3,4]. On the other hand, traditional classification algorithms such as the k-nearest neighbor (KNN) algorithm and support vector machine (SVM) do not work well on imbalanced datasets because they do not consider the quality of the data space and the imbalance ratio. Therefore, when classifying imbalanced datasets, these algorithms tend towards the majority class and consider the minority class samples as noise or outliers. Consequently, the probability of misclassifying the minority class compared to the majority class increases, and the accuracy of their performance on these samples is very low. Additionally, when the imbalance ratio is very high, it becomes challenging to identify the minority class [5,6].

The challenge of classifying imbalanced datasets is usually encountered in scenarios where anomaly detection is crucial, such as medical diagnosis (diagnosing rare diseases) [7], fraud detection in the banking system [8], prediction of natural disasters like earthquakes, face recognition, text classification [9], error detection [10], and anomaly detection [11]. Consequently, research in the field of imbalanced dataset classification has gained significant attention in recent years [12]. Given that in real-world applications, the primary objective is often to identify rare cases, it is imperative to develop a model capable of accurately classifying the minority class.

In this article, a new distance metric is proposed for imbalanced data classification. The difference in class density strongly affects the efficiency of classification algorithms. Therefore, the local density of classes is the primary basis for learning the new approach. In the proposed method, the density between classes is the main criterion for learning the distance metric. Imbalanced datasets suffer from a lack of data in the minority class, and the disparity between the density of the minority class and the majority class significantly impacts the efficiency of the k-nearest neighbor classification algorithm.

On the other hand, it is possible that the data of one class are composed of several densities. That is, the data are a combination of several normal distributions with different parameters, each having distinct means and variances. In the proposed method, for identifying the number of normal distributions, the distribution of classes is assumed to be a mixture of Gaussians. The number of components is determined using the DBSCAN density-based clustering algorithm. To accurately identify the Gaussian components of the Gaussian mixture model (GMM) probabilistic models, the parameters are estimated and updated separately using the maximum a posteriori (MAP) estimator. Subsequently, the distance between the densities of Gaussian components is maximized using the Bhattacharya distance and an iterative optimization algorithm to achieve a large between-class margin.

The rest of this article is organized as follows: In Section 2, works related to the topic of distance metric learning are reviewed. In Section 3, the proposed model is presented, which is based on training the appropriate distance metric for the imbalanced datasets to enhance the efficiency of classification algorithms. Section 4 reports the results of experiments on several imbalanced datasets. The conclusion and future works are outlined in Section 5.

2. Related Works

2.1. Background and Definitions

Distance metric learning emerged in 2003 in Mr. Xing’s article with the introduction of Mahalanobis distance [13]. Unlike Euclidean distance, Mahalanobis distance also considers the correlation between features and is represented by Equation (1).

d_{M} (x, x^{'}) = \sqrt{{(x - x^{'})}^{T} M (x - x^{'})}

(1)

where x and x’ are two random vectors from the same distribution with covariance matrix M. Matrix M is a symmetric positive semi-definite (PSD) matrix with d × d dimensions [14]. If M is an identity matrix, Equation (1) will represent the Euclidean distance. Since M is PSD, d_M has the following properties.

d_{M} (x, x^{'}) \geq 0

d_{M} (x, x) = 0

d_{M} (x, x^{'}) = d_{M} (x^{'}, x)

d_{M} (x, x^{″}) \leq d_{M} (x, x^{'}) + d_{M} (x^{'}, x^{″})

Since matrix M can be written as M = L^TL, the linear transformation of the data is performed with the transfer matrix L in the Mahalanobis distance.

d_{M} (x, x^{'}) = \sqrt{{(x - x^{'})}^{T} L^{T} L (x - x^{'})}

d_{M} (x, x^{'}) = \sqrt{{(L x - {x L}^{'})}^{T} (L x - {L x}^{'})}

In general, the research conducted in the field of distance metric learning can be divided into the following categories [15].

Pairwise cost-based approaches

Weinberger et al. [16,17] introduced the large margin nearest neighbor (LMNN) algorithm, a supervised distance metric designed to improve the k-nearest neighbor classifier. LMNN works by reducing the distance between k-nearest neighbors that share the same label (target neighbors), thus increasing their separation from samples of different classes and creating a larger margin. However, a notable limitation of this algorithm is its focus on optimizing intra-class distances, meaning the initially designated target neighbors do not change during training. As a result, the performance of LMNN is highly dependent on the initial selection of these target neighbors.

Zadeh et al. [18] presented the geometric mean metric learning (GMML) algorithm to improve the accuracy of k-nearest neighbor classification. In their approach, samples are divided into two sets: S for similar point pairs and D for dissimilar point pairs. The algorithm computes a similarity matrix for S and a dissimilarity matrix for D. It then finds the shortest line connecting the inverses of the second-moment matrix of similar points to the second-moment matrix of dissimilar points, with the midpoint of this line defined as the geometric mean. This method aims to achieve a global minimum in a strongly convex objective function, focusing on minimizing the total distance among all similar points.

Ying et al. [19] developed the distance metric learning algorithm with eigenvalue optimization (DML_eig) to improve the accuracy of k-nearest neighbor classification. Their method involves dividing samples into two sets: S, which contains pairs of similar points, and D, which includes pairs of dissimilar points. They refined the objective function from Xing et al. that seeks to maximize the distances between dissimilar pairs while keeping the sum of squared distances between similar pairs within an upper limit. The DML_eig algorithm optimizes this process by identifying the largest eigenvector in each iteration, which helps maximize the minimum squared distances between dissimilar pairs while adhering to the constraints for similar pairs. However, this approach requires eigen decomposition of a matrix at each iteration, resulting in significant time consumption during convergence.

Nguyen et al. [20] introduced distance metric learning through maximization of the Jeffrey divergence (DMLMJ) to enhance the accuracy of k-nearest neighbor classification. In their approach, they divided samples into two groups: S, comprising k similar neighbors, and D, consisting of k dissimilar neighbors. They calculated the covariance matrices for both sets and derived their eigenvectors and eigenvalues. The objective was to learn a linear transformation that maximizes the Jeffrey divergence between the two multivariate Gaussian distributions.

b.: Probabilistic approaches

In this group of methods, researchers aim to maximize the probability that a point is similar to its neighbors while minimizing classification error based on k-nearest neighbors in the image space. A significant disadvantage of these methods is that their equations are non-convex, leading to convergence at a local maximum and increased computational complexity. Davis et al. [21], in a method known as information theoretic metric learning (ITML), categorized all samples into two sets: classmates and non-classmates. They sought to train and optimize the Mahalanobis distance by minimizing the log-determinant divergence. This method employs concepts from Bergman’s information theory and optimization.

c.: Boost-Like methods

Boost-like methods try to train a new metric in each step with a linear combination of the sub-metrics of the previous steps, which have weak constraints in their formula. In other words, by adding weaker learners, strong learners are produced. Chang et al. [22] introduced a boosting algorithm for supervised learning of the Mahalanobis distance metric called BoostMDM. In this method, a cost function is defined, which is repeatedly reduced in each step. In each iteration, the metric matrix is combined with what was learned in the previous step. In this method, the entire sample space is used for training.

d.: Hybrid approaches

This category explores the integration of distance metric learning with other models, such as online learning, to enhance system efficiency. These methods are adaptable to large datasets and can continuously update metrics with current data.

Zhong et al. [23] introduced the scalable large margin online metric learning (SLMOML) method, which employs log-determinant divergence to maintain proximity between two trained Mahalanobis distance metrics. Using the Hinge loss function, SLMOML creates a large margin between different samples and connects passive learning with Bergman imaging, achieving global convergence. The initial Mahalanobis matrix remains positive semi-definite (PSD) throughout the process.

Liu et al. [24] proposed LM-KNN, addressing label prediction in multi-label problems where each example can have multiple related tags, such as in document classification. To reduce the cost and increase the speed of tag prediction for unseen samples, they utilized a distance metric to identify relationships between labels, enabling the separation of distinct tags. Rather than relying on costly optimization techniques, LM-KNN employs a KNN solution to predict labels in the transformed space.

e.: Deep metric learning approaches

In this category, distance metric learning is combined with deep learning to increase the efficiency of the system for managing non-linear and massive data [25,26,27]. In this type of learning, instead of Mahalanobis distance, neural networks are used to create a new feature space with high discrimination power. Cao et al. [28] used a deep neural network to reduce the intra-class distance, increase the extra-class distance, and improve the performance of classification methods such as KNN.

2.2. Learning Distance Metric in Imbalanced Applications

Wang et al. [29] introduced the iterative metric learning (IML) method to address imbalanced data in classification, consisting of three steps. First, the LMNN algorithm iteratively transforms training samples to better align with the test data space. Second, the distance between test samples and training samples is calculated to select the closest training samples, thereby reducing the training sample size. Finally, training samples from each iteration are compared to those from the previous iteration to retain the most similar ones, creating a more stable neighborhood space for test samples.

Feng et al. [30] proposed the distance metric by balanced KL-divergence (DMBK), which develops a new distance metric suitable for imbalanced datasets. They assumed a Gaussian density with uniform covariance across classes and used Kullback–Leibler divergence to measure the distance between class distributions. To effectively handle samples from both minority and majority classes, they utilized the logarithm of the geometric mean of the normalized Kullback–Leibler deviation, solving the optimization problem using ascending gradients and iterative operations.

Gautheron et al. [31] introduced imbalanced metric learning (IML), which creates a distance metric to improve performance on imbalanced datasets. Unlike the traditional Mahalanobis formula, which minimizes loss across all pairs without considering labels, IML employs two loss functions: one for same-label pairs, aiming to reduce their distance below one, and another for different-label pairs to increase their distance beyond one plus a margin.

Yan et al. [32] presented the deep metric framework with border-line-margin loss (DMFBML), combining a new distance metric with a neural network to minimize intra-class distance while maximizing extra-class distance in overlapping class regions, thereby improving classification accuracy for minority class samples in imbalanced datasets. Lastly, the authors of [33] proposed a method to extract desirable features and reduce undesirable ones in imbalanced datasets, which could enhance distance metric learning and facilitate the transfer of data to a new feature space with higher discrimination power. This paper proposes a method that simultaneously selects and extracts features through a cost-sensitive optimization problem. The feature extraction phase focuses on reducing error and maintaining geometric relationships between data using a manifold learning optimization problem. In the feature selection phase, a cost-sensitive optimization problem is used to minimize the upper limit of the generalization error. The combined optimization problem is solved by adding a cost-sensitive term to balance the classes without manipulating the data.

The authors of [34] introduced DFSVM, a novel method for imbalanced classification that combines deep learning with fuzzy support vector machines. DFSVM begins by utilizing a deep neural network to generate an embedding representation of the data, trained with triplet loss to strengthen the similarities within classes while maximizing differences between classes. To address the challenges posed by imbalanced data distribution, an oversampling technique is applied in the embedding space, focusing on feature and center distances to create diverse new samples and mitigate overfitting. Finally, a fuzzy support vector machine (FSVM) with cost-sensitive learning serves as the classifier, assigning higher misclassification costs to minority class samples to enhance overall classification performance.

3. The Proposed Method

Since most distance metric learning algorithms are designed for balanced datasets, it is essential to introduce additional algorithms that cater to imbalanced datasets. In this article, we present a novel distance metric called DMLdbIm, specifically developed for imbalanced datasets. The aim of DMLdbIm is to establish an appropriate data space that enhances the efficiency of distance-based classification algorithms.

In imbalanced datasets, the disparity between the densities of the minority and majority classes is extremely pronounced. Consequently, during classification, the minority class may be overlooked, leading to a significant reduction in the classifier’s effectiveness for that class. The proposed algorithm emphasizes the density of the minority class, employing a new distance metric designed to bring samples of the minority class closer together while maximizing their distance from samples of the majority class in the data space. This approach aims to enhance the classification algorithm’s performance on the minority class. The overall structure of the proposed DMLdbIm method is illustrated in Figure 2. The subsequent sections will discuss the various phases of the proposed model.

3.1. The Proposed Model Construction Method

The proposed approach includes 5 phases.

3.1.1. The First Phase: Estimating the Density of the Classes

In the proposed method, a new distance metric is developed based on the density differences between classes. This approach is particularly relevant for imbalanced datasets, where the disparity in density between the minority and majority classes is often significant. Additionally, it is possible for the data of a single class to consist of multiple populations; that is, the data may be represented as a combination of several normal distributions, each with distinct parameters, including different means and variances. The proposed method aims to identify the number of normal distributions present in the data. For example, in a medical dataset, the number of patients with a rare disease may be considerably lower than that of those who are unaffected. By employing the new distance metric, we can analyze the density differences between affected and unaffected patients. If the affected class comprises multiple subgroups with varying characteristics, we can model this combination using multivariate Gaussian models. The distribution of different classes is represented as a mixture of Gaussian distributions with varying covariances. Consequently, if the sample distribution of each class adheres to several Gaussian models, the relevant parameters for each Gaussian can be estimated. If X = {x₁,…,x_n} denotes the set of training data, then the multivariate Gaussian density function can be expressed as shown in Equation (2).

g (X; μ, \sum) = \frac{1}{{2 π}^{\frac{D}{2}} {|\sum|}^{\frac{1}{2}}} \exp (- \frac{1}{2} {(X - μ)}^{T} \sum^{- 1} (X - μ))

(2)

μ = \frac{1}{n} \sum_{i = 1}^{n} x_{i}

\sum = \frac{1}{n} \sum_{i = 1}^{n} (x_{i} - μ) {(x_{i} - μ)}^{T}

In Equation (2), D is the dimensionality (the number of features in the dataset), π is the weight (percentage of the data in one component), μ is the mean vector, and ∑ is the covariance matrix. The covariance matrix should be symmetric and positive semi-definite (PSD). This phase includes the following steps:

Estimating the initial values of GMM parameters

In the proposed method, the first step is to determine whether the distribution of samples from each class conforms to several Gaussian models, followed by making initial estimates of the weights, means, and covariance parameters. To achieve this, a common clustering method is employed. Given that the focus of the proposed method is on the density of imbalanced datasets, the DBSCAN algorithm is selected. This method clusters data based on the density of data points, enabling it to identify clusters with complex shapes and recognize data points that do not belong to any cluster (outliers). However, when the data are intertwined and lack clear boundaries, DBSCAN may struggle to detect clusters accurately. In such cases, GMMs, which are probabilistic models, can be utilized to identify these types of clusters. In this step, the parameters, including weights, means, and initial covariances of the components, are calculated for each cluster identified by DBSCAN.

b.: Maximum a posteriori (MAP) parameter re-estimation

In many applications, the maximum likelihood estimator is employed to estimate the covariance of class distributions. However, this method performs poorly with imbalanced datasets, where the number of samples in the minority class is significantly small. As a result, the estimate of the covariance matrix is typically elliptical and deviates from the true covariance matrix, which substantially diminishes classification accuracy. In the referenced article [26], a solution to this challenge is proposed by jointly estimating the covariance matrix for all classes, under the assumption that all classes share the same covariance. However, in real-world applications, the variances of class distributions are often unequal. Therefore, this research addresses the challenge using the maximum a posteriori (MAP) estimator, implemented in the following two steps.

First step: The probability of the training vectors to be in the components of the initial mixture is computed by the posterior probability in Equation (3).

\Pr (i| X_{t}, θ_{p r i o r}) = \frac{w_{i} g (X_{t} | μ_{i}, \sum_{i})}{\sum_{k = 1}^{K} w_{k} g (X_{t} | μ_{k}, \sum_{k})}

(3)

In the expression

\Pr (i| X_{t}, θ_{p r i o r}),

the probability that a data point

X_{t}

originates from the ith Gaussian component is denoted, and

θ_{p r i o r}

represents initial parameters. MAP involves using prior knowledge or “prior beliefs” about the model parameters. This approach allows leveraging historical information to improve parameter estimation. Using prior information can help prevent overfitting. A GMM without prior evidence may become overly sensitive to training data.

g (X_{t} | μ_{i}, \sum_{i})

is the Gaussian density function that is expressed in Equation (2). A Gaussian mixture model is a weighted sum of K Gaussian densities as given by the equation,

\sum_{k = 1}^{K} w_{k} g (X_{t} | μ_{k}, \sum_{k})

, where K is the number of clusters, X_t is a D-dimensional continuous-valued data vector, and w_k, k = 1, …, K, are the mixture weights. Utilizing MAP allows for more effective adjustments of cluster weights based on prior evidence, aiding in better identification and separation of clusters.

Then, the weight, mean, and variance parameters are calculated using Equations (4)–(6).

w e i g h t : w_{i} = \sum_{t = 1}^{T} \Pr (i| X_{t}, θ_{p r i o r})

(4)

m e a n : E_{i} (X) = \frac{1}{n_{i}} \sum_{t = 1}^{T} \Pr (i| X_{t}, θ_{p r i o r}) X_{t}

(5)

v a r i a n c e : E_{i} (X^{2}) = \frac{1}{n_{i}} \sum_{t = 1}^{T} \Pr (i| X_{t}, θ_{p r i o r}) X_{t}^{2}

(6)

Second step: The previous estimates must be formulated as Equations (7)–(9) to ensure compatibility with the ith component.

{\hat{w}}_{i} = [\frac{α_{i}^{w} n_{i}}{T} + (1 - α_{i}^{w}) w_{i}] γ

(7)

{\hat{μ}}_{i} = α_{i}^{m} E_{i} (X) + (1 - α_{i}^{m}) μ_{i}

(8)

{\hat{δ}}_{i}^{2} = α_{i}^{v} E_{i} (X^{2}) + (1 - α_{i}^{v}) (α_{i}^{2} - μ_{i}^{2}) - {\hat{μ}}_{i}^{2}

(9)

Each Gaussian component is composed of the following parameters: a mean μ, which defines its center; covariance Σ, which defines its width; and mixing probability π, which defines how big or small the Gaussian function will be. If we have a dataset composed of N = 1000 three-dimensional points (D = 3), then x will be a 1000 × 3 matrix. μ will be a 1 × 3 vector, and Σ will be a 3 × 3 matrix. Adjustment coefficients

{α_{i}^{w}, α_{i}^{m}, α_{i}^{v}}

balance the new and old estimates. The

γ

normalization factor is applied to the weight of all of the mixtures so that the sum of the weights becomes 1. For each mixture and each parameter, the coefficients of adaptation (i.e.,

α_{i}^{ρ})

are defined in the form of Equation (10). According to its value, the effect of new data in estimating parameters is changed.

α_{i}^{ρ} = \frac{n_{i}}{n_{i} + r^{ρ}}

(10)

The related factor

r^{ρ}

controls the number of new samples that must be observed in each mixture component before the new parameters replace the old ones. MAP can effectively estimate the means and covariances of the Gaussian distributions, especially when dealing with limited or noisy data.

3.1.2. The Second Phase: Calculating the Distance between the Gaussian Components Using the Bhattacharya Distance

The Bhattacharya distance [30] measures the symmetric difference between two probability distributions. For imbalanced data, the Bhattacharyya distance is a better choice than other divergences because it is symmetrical and its calculation method is less sensitive to the differences in the number of samples in each class and focuses more on the degree of overlap between the distributions. For

P_{i} = (X; μ_{i}, \sum_{i})

and

P_{j} = (X; μ_{j}, \sum_{j})

as two Gaussian distributions, the Bhattacharya distance between the two distributions is in the form of Equation (11).

B_{i j} = \frac{1}{8} {(μ_{i} - μ_{j})}^{T} \sum^{- 1} (μ_{i} - μ_{j}) + \frac{1}{2} \ln (\frac{\det \sum}{\sqrt{d e t \sum_{i} d e t \sum_{j}}})

(11)

In Equation (11), the variance is equal to Equation (12):

\sum = \frac{{\sum_{i} + \sum}_{j}}{2}

(12)

3.1.3. The Third Phase: The Process of Learning the Proposed Distance Metric

To train the new distance metric (matrix A), one needs to calculate and update the transfer matrix (linear transformation), such as W, so that

A = W W^{T}

. Therefore, W is initialized with the eigenvector of all training samples, then each sample x is replaced with W^Tx, and the Bhattacharya distance between the two distributions in the new space is calculated. The Bhattacharya distance is normalized between the Gaussian components according to Equation (13) to describe the difference between the components of different classes.

E_{i_{k} j_{l}}^{A} = \frac{q_{i_{k}} q_{j_{l}} B_{A} (p_{i_{k}} | | p_{j_{l}})}{\sum_{1 \leq m \leq c_{i}} \sum_{1 \leq n \leq c_{j}} q_{m} q_{n} B_{A} (p_{m} | | p_{n})}

(13)

where

q_{i_{k}}

is the number of samples of the k-th component of the i-th class,

p_{i_{k}}

represents the Gaussian pdf of the k-th component of the i-th class, and

c_{i}

denotes the number of components in the ith class. Then, according to Equation (14), the geometric mean is taken until the samples from different components are separated in a balanced way.

A^{*} = a r g m a x {(\prod_{1 \leq i_{k} < j_{l} \leq z} E_{i_{k} j_{l}}^{A})}^{\frac{1}{z}}

(14)

where z is the number of components. Equation (14) is maximized when the E^A values in both desired components are equal. This equation aims to diminish the influence of majority class components while enhancing the impact of minority class components. Consequently, the optimal distance metric matrix A can account for the imbalanced distribution of components. To further achieve this balance and reduce the effect of majority class components while increasing the influence of minority class components, the logarithm of the geometric mean is computed, as indicated in Equation (15).

A^{*} = a r g m a x {\log (\prod_{1 \leq i_{k} < j_{l} \leq z} E_{i_{k} j_{l}}^{A})}^{\frac{1}{z}}

(15)

To enhance the discrimination power of A and achieve a larger margin in Equation (15), the between-class distance must be increased, while the within-class distance should be decreased. To ensure that the proposed distance metric learning method is both appropriate and optimal, the following two constraints are added to the objective function in Equation (16).

\begin{matrix} \max L_{A} = a r g m a x {\log (\prod_{1 \leq i_{k} < j_{l} \leq z} E_{i_{k} j_{l}}^{A})}^{\frac{1}{z}} \\ + λ_{1} \sum_{1 \leq i_{k} < j_{l} \leq N} θ_{i_{k} j_{l}} d_{A} (x_{i_{k}}, x_{j_{l}}) - λ_{2} \sum_{1 \leq i < j \leq N} {(1 - θ}_{i j} {) d}_{A} (x_{i}, x_{j}) \end{matrix}

(16)

\begin{matrix} S . t . A \geq 0 \\ g (A) = \sum_{1 \leq i, j \leq N} (1 - θ_{i j}) d_{A}^{2} (x_{i}, x_{j}) \leq 1 \end{matrix}

In Equation (16),

L_{A}

is the objective function, and N is the total number of samples in the training dataset. When

θ_{i j} = 0

, it means both samples have the same label. λ₁ and λ₂ are regularization parameters that can be adjusted experimentally. Matrix A should be PSD so that it can be decomposed into A = WW^T. Therefore, the constraint A ≥ 0 is considered. So that the distance between samples of the same class does not fall below a certain value, the limit g(A) is set. The proposed distance metric learning method treats the classes in the imbalanced dataset equally and avoids the tendency of the learning algorithm towards the majority class.

3.1.4. The Fourth Phase: The Optimization Process of the Proposed Objective Function

Since it is difficult to find the distance metric matrix A directly, we try to find the transformation matrix W. First, the eigenvectors of all training samples are considered as the initial value of W. Then, according to Equation (17), to update W, an incremental gradient of L_A is calculated with respect to W to obtain the local maximum value compared to the current solution by stepping in the positive direction of the gradient.

W_{n e w} = W_{o l d} + α \nabla_{w} L_{A}, α = 0.1 (s t e p s i z e)

(17)

When matrix W is updated, A is set as WW^T. The above operation is repeated until the distance metric matrix A converges.

3.2. Evaluation

In the proposed DMLdbIm method, according to Figure 3, the datasets are mapped to the new space based on the obtained distance metrics, and then traditional classification algorithms are applied to determine the labels of the test samples.

4. Experiments

4.1. Specifications of the Datasets

In this research, we utilized 13 imbalanced datasets from KEEL (URL (accessed on 22 July 2024) https://sci2s.ugr.es/keel/imbalanced.php) [31] and 2 imbalanced datasets from UCI (URL (accessed on 22 July 2024) https://archive.ics.uci.edu/ml/index.php) [32], each exhibiting different imbalance ratios, for the evaluation process. The details of these datasets are presented in Table 1. In the experiments, all datasets were transformed into two-class problems using the One-Against-All (OAA) method. For each dataset, we specified the number of samples, the number of features, the number of classes, the number of minority class samples, the distribution of classes, and the imbalance rate (the ratio of the number of majority class samples to the number of minority class samples). Evaluation on these datasets was conducted using 5-fold cross-validation [35].

4.2. Evaluation Criteria

In the evaluation stage, the confusion matrix, as shown in Table 2, is utilized. In this matrix, TP (true positive) represents the number of samples accurately classified as belonging to the positive class. TN (true negative) indicates the number of samples correctly classified as belonging to the negative class. FP (false positive) refers to the number of samples incorrectly classified as belonging to the positive class. FN (false negative) denotes the number of samples mistakenly classified as belonging to the negative class.

The classification accuracy is obtained from Equation (18).

a c c = \frac{T P + T N}{T P + F P + T N + F N}

(18)

To evaluate the effectiveness of classification algorithms on imbalanced datasets, accuracy is not an appropriate criterion, as models tend to favor the majority class, resulting in inflated overall accuracy. Instead, it is more beneficial to focus on the number of positive samples identified from the minority class. For this purpose, recall (sensitivity) and precision criteria are used, as they account for the types of errors (false positives or false negatives) that the model makes. Recall, as defined in Equation (19), compares the number of positive samples to all truly positive samples, while precision, according to Equation (20), compares the number of positive samples to all samples diagnosed as positive. Finally, the F1 measure is utilized as the harmonic mean of these two criteria, as indicated in Equation (21), with the objective of maximizing the F1 value [37,38].

R e c a l l = \frac{T P}{T P + F N}

(19)

P r e c i s i o n = \frac{T P}{T P + F P}

(20)

F 1 = 2 * \frac{R e c a l l \times P r e c i s i o n}{R e c a l l + P r e c i s i o n}

(21)

4.3. Evaluation of the Proposed DMLdbIm Method

In this article, similar to many existing studies, the k-nearest neighbor classifier is used as the classification method. This classifier is applied to the primary dataset with the Euclidean distance metric and is compared against several methods, including DMBK [26], LMNN [16,17], ITML [21], IML [27], DMLMJ [20], GMML [18], DML_eig [19], and the proposed DMLdbIm method. Below is a brief introduction to each of the compared methods:

LMNN: this method ensures that, for each training example, its k-nearest neighbors of the same class (target neighbors) are closer than examples from other classes.
ITML: The objective of this method is to minimize the differential relative entropy between two multivariate Gaussian distributions. It utilizes LogDet regularization to minimize (or maximize) the distance between examples of the same (or different) classes.
GMML: the goal of this method is to minimize the total distances among similar points.
DML_eig: this method aims to maximize the minimum squared distance between dissimilar pairs while keeping an upper bound on the total squared distance for similar pairs.
DMLMJ: the objective here is to learn a linear transformation that maximizes the Jeffrey divergence between two distributions derived from local pairwise constraints.
IML: this method positions samples of the same class at distances of less than one and samples of different classes at distances greater than one, aiming to reduce the negative effects of class imbalance in the dataset.
DMBK: the goal of DMBK is to learn a linear transformation that maximizes the logarithm of the geometric mean of the normalized Kullback–Leibler divergence between distributions that share the same covariance Gaussian density.
DMLdbIm: This proposed method aims to identify components of multivariate Gaussian density with varying covariances for each class. It seeks to maximize the Bhattacharya distance between the Gaussian mixtures of different classes, increasing the distance between external components while decreasing the distance between internal components to create a wide margin between classes.

In the evaluation stage, the parameters for the objective function outlined in Equation (16) are set according to the values specified in Table 3.

In Table 4, the F1 measure is used for comparison. The F1 score is computed as the average of 10 runs of five-fold cross-validation on each dataset. Initially, the number of neighbors in the KNN classification is set to 1. The rank of each method on each dataset is indicated in parentheses. The Mean row presents the average efficiency of each method, while the Average Rank row indicates their performance ranking. In the evaluation stage, a method with a higher F1 measure and a lower rank is considered to perform better.

As shown in Table 4, the proposed approach outperforms most other methods across several datasets. However, the DMLMJ approach has outperformed the proposed method in some cases. DMLMJ is likely to excel over DMLdbIm when dealing with simpler, linearly separable data distributions, where computational efficiency is crucial and where a linear transformation is sufficient to achieve good separability. In contrast, DMLdbIm performs better in complex, multimodal distributions with varying covariances across classes, where capturing intricate internal structures and maximizing class separation in a non-linear feature space are essential. The choice between the two depends on the complexity of the data and the need for either linear or non-linear separability.

The results also indicate that when class imbalance is low, some other methods may perform better than the proposed approach, occasionally placing it in the second or third position. However, as the number of samples in the minority class decreases and imbalance increases, causing existing methods to lose efficiency, the proposed approach maintains its effectiveness. It continues to correctly identify minority class samples, showing an efficiency of about 40%, while some methods, such as Euclidean, LMNN, DMLMJ, and even the DMBK method, fail to function at all, indicated by NA (not available).

On the other hand, when the classifier has less complex decision boundaries (such as with KNN when k > 1), the proposed approach has outperformed the other methods by a wider margin. In Table 5, the F1 measure for 3NN is presented. As seen in this table, by increasing the number of neighbors to 3, the proposed DMLdbIm method shows better performance than other methods, and when the imbalance rate increases, although the efficiency of the algorithm decreases, it still shows better efficiency and more stability than other methods. As shown, the proposed approach surpasses the other methods in 14 cases, with an average rank of 1.4, whereas for the 1NN case (as reported in Table 4), it achieved 9 better results with an average rank of 1.6. These observations confirm the insights from the previous paragraph and provide valuable guidance for potential applications of the proposed approach.

One of the main reasons our method performs better than others is that it considers the data to be a combination of several different distributions with varying means and variances. This approach is particularly effective for classes composed of multiple subgroups with distinct characteristics. Additionally, we use MAP to estimate the parameters of the Gaussian model, which proves to be more effective even when the number of data points in the minority class is small.

Table 6 shows the F1 measure for the 5NN classifier. As Table 6 shows, the efficiency of the proposed method is reduced in comparison with Table 5, but it generally performs better than the other methods on the highly imbalanced datasets and has a better ranking. It also can be concluded that DMLdbIm, IML, and DML_eig are particularly effective in scenarios where class structure and distribution intricacies play a significant role. DMLdbIm excels in complex, multimodal datasets where class distributions involve varying covariances. By focusing on maximizing the Bhattacharya distance between Gaussian mixtures, it effectively separates classes with intricate internal structures. IML is specifically designed to handle class imbalance, outperforming other methods in datasets where class proportions are skewed. Its approach of positioning same-class samples closer and different-class samples farther apart makes it highly effective in maintaining classification performance, even when minority classes are underrepresented. DML_eig, on the other hand, shines in situations where ensuring a wide separation between dissimilar pairs is critical, especially when it is essential to prevent class overlap. By maximizing the minimum squared distance between dissimilar pairs, DML_eig is particularly useful in maintaining clear decision boundaries in datasets with closely spaced classes.

From the other approaches, which had lower performance, LMNN is highly effective in cases where the primary goal is to ensure that k-nearest neighbors of the same class are closer than those of different classes, making it ideal for nearest-neighbor classification tasks with well-defined class clusters. ITML, with its focus on minimizing differential relative entropy and employing LogDet regularization, excels in cases where balancing the distance between similar and dissimilar classes is essential, particularly when the data follow a Gaussian distribution. GMML is best suited for situations where the goal is to minimize the total distances among similar points, making it ideal for tasks that require clustering or grouping data points within the same class while ensuring that similar points remain close. These approaches tend to work well in more traditional classification tasks where the data are less complex and more homogeneous.

Figure 4 shows the average F1 measure of the proposed DMLdbIm method compared to other methods for overall performance comparison. In general, the proposed method outperforms all other methods. Additionally, as observed, the DML-eig approach ranks second on average for the reasons previously mentioned.

Figure 5 shows the accuracy, precision, recall, and F1 criteria of the evaluated approaches when the 3NN classifier is applied to the Heart dataset. As Figure 5 shows, the proposed DMLdbIm method has higher performance in terms of accuracy and precision compared to other methods, and because it has less FP error, negative samples are correctly identified in most cases. This situation arises from a strict classification threshold, which prioritizes accurate positive predictions while potentially missing some true positives, thus lowering recall. Additionally, it shows that the approach is designed to address class imbalance by focusing on precision, so it might effectively reduce false positives but at the cost of capturing fewer positive instances overall. This trade-off suggests that while the method is accurate in its positive classifications, it may not cover all positive cases, reflecting a balance between precision and recall.

Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19 illustrate the detailed performance of various algorithms on different datasets using the 3NN classifier. As shown in Figure 6, the proposed approach significantly outperforms the other methods on the WDBC dataset, indicating its superiority. We hypothesize that the multimodal representation of the classes contributes to this improved performance.

Similar observations are noted in Figure 7 and with the Pima dataset. The proposed approach again shows a significant improvement over other methods, though precision remains considerably higher than recall. This reinforces the previous findings. However, the lower recall could be a drawback in medical datasets, possibly due to a high imbalance ratio. To address this issue, oversampling techniques could be employed to reduce the imbalance and potentially improve recall.

The average superiority of the proposed approach is also evident in Figure 8 and Figure 9. Comparing these figures, the proposed approach surpasses the other methods by a wider margin in Figure 9, which is attributed to the higher imbalance ratio of the Ecoli1 dataset. This demonstrates the approach’s effectiveness in handling highly imbalanced problems. The same observations can be seen in Figure 10 and on the Ecoli2 dataset, which has an imbalance ratio of 5.1.

Figure 11 suggests that the proposed approach fails in comparison with other approaches. Besides the aforementioned reasons, this observation may be due to the prior assumptions on the parameters, especially the number of components.

Figure 12, Figure 13 and Figure 14 suggest the better performance of the proposed approach on the Glass6, Ecoli3, and Yeast-2_vs_4 datasets. However, the performance superiority of the proposed approach is more visible in Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19, which are categorized under highly imbalanced problems. Even for winequality-red-8_vs_6, winequality-red-8_vs_7, winequality-white-3-9_vs_5, and winequality-red-3_vs_5, which are very highly imbalanced with imbalance ratios of 35.44, 46.5, 58.28, and 68.1, respectively, most of the previous approaches have failed to provide results. In contrast, the proposed approach has achieved acceptable outcomes, demonstrating both its robustness against high imbalance and its applicability in such problems.

As seen in Figure 5, Figure 6, Figure 7, Figure 8, Figure 9, Figure 10, Figure 11, Figure 12, Figure 13, Figure 14, Figure 15, Figure 16, Figure 17, Figure 18 and Figure 19, when the imbalance ratio increases, methods that ignore the minority class still have higher accuracy. This is why the accuracy criterion is not suitable for evaluating the efficiency of classification algorithms on imbalanced datasets. However, the proposed DMLdbIm method is still able to identify minority class samples in cases of higher imbalance ratios. In Table 7, the mean and rank of accuracy, precision, recall, and F1 measures of different methods for 3NN classification are given. According to Table 7, the proposed DMLdbIm method generally has higher accuracy, precision, recall, and F1 and better ranking than other methods.

4.4. Comparison with Deep Learning

In the continuation of the experiments, we compared our proposed method with another method that uses a deep neural network on an unbalanced dataset. Wang et al. [34], discussed in the related works section, present an imbalance classification method based on deep learning and fuzzy support vector machine, named DFSVM. This method first uses a deep neural network to obtain an embedding representation of the data. The deep neural network is trained using triplet loss to enhance similarities within classes and differences between classes. In Table 8, the F1 measure for the 3NN is shown.

As can be seen in Table 8, the proposed DMLdbIm method performs better than the DFSVM method in most cases, demonstrating the effectiveness of the approach even when compared to a recent deep learning-based method. Table 8 shows that the deep learning-based approach is very sensitive to the amount of available data and has superior performance compared to the proposed approach only when a good amount of data are available. In other cases, the proposed approach has outperformed the DFSVM approach by a very wide margin.

4.5. Computational Complexity Analysis

The computational complexity of the proposed DMLdbIm method is analyzed in this section. The approach starts with DBSCAN clustering. With N datapoints, the time complexity of the DBSCAN algorithm is O(NlogN). Additionally, the time complexity of GMM construction is O(N.K.D³), where K represents the number of Gaussian components. Subsequently, the covariance matrices are updated with a time complexity of O(K.N.D²). The generalized eigenvalue decomposition of the covariance matrices can be performed in O(D³).

Following the previously described phases, the optimization phase of L_A commences. This phase involves calculating the Bhattacharya distance between two probability distributions with a computational complexity of O(K².C²), where C represents the number of classes. This is followed by the computation of L_A, which has a complexity of O(N²). If we assume there are I iterations in the optimization process, the overall computational complexity for this phase will be O(I.(K².C² + N²)). Given that N is significantly larger than both K and C, the computational complexity can be simplified to O(I.N²).

With the aforementioned steps in mind, the overall time complexity of the proposed method is O(NlogN + N.K.D³ + K.N.D² + D³ + I.N²). Since the first five steps are executed only once, their computational complexity can be considered negligible. Consequently, we can conclude that the computational complexity of the proposed method is O(I.N²), which is primarily influenced by the number of data points. This complexity is comparable to that of some earlier works, such as DMLMJ [20], which has a complexity of O(N².D + k.N³), where k denotes the number of k-nearest neighbors.

In Figure 20, the execution time of different methods on various datasets is given in seconds. All methods were implemented in MATLAB R2015b (64-bit) on a personal laptop with an Intel Core i5 processor (2.30 GHz) and 4 GB of main memory, running the Windows 7 operating system.

As Figure 20 suggests, although the proposed approach is expected to have a high computational cost compared to some other approaches, the computational speed of the proposed DMLdbIm method is several times faster than commonly used methods such as ITML and LMNN.

5. Conclusions

This article proposes a new distance metric learning approach called DMLdbIm, which is most efficient on highly imbalanced datasets. In the proposed method, the distribution of classes is assumed to be in the form of a mixture of Gaussians. This assumption is based on the possibility that the data are composed of several normal distributions with different parameters, each having a different mean and variance. The proposed approach aims to increase the discrimination power of the learned metric by increasing the between-class distance of the external Gaussian components and decreasing the distance between the internal ones. For this purpose, MAP estimation is used to calculate the parameters of the components, even when the number of samples in a class is very small.

In the experiments, the proposed DMLdbIm method shows better performance compared to other distance metric learning methods in increasing the efficiency of the k-nearest neighbor classifier, especially when the imbalance ratio increases. In these cases, when other methods are not effective at all, the proposed method provides acceptable performance. Therefore, in fields where accurately identifying exceptional cases is highly important, this capability is extremely valuable, for example, in detecting low-prevalence cancers, identifying real financial fraud (where mistakes can lead to huge financial losses), and detecting cybercrimes in the virtual space. Additionally, the approach has a higher speed than some commonly used methods. In the future, one may plan to use non-linear and kernel-based variations of the proposed distance metric learning model. Also, since deep metric learning is a relatively new field in the literature, it would be worthwhile to implement the proposed approach with a deep learning architecture. Deep neural networks can find complex patterns in data, which helps in identifying scarce samples. Deep models with retraining capabilities can adapt well to scarce data.

Author Contributions

A.J.M. implemented the approach, performed the analyses, and also prepared the initial draft. M.H.M. proposed the original idea, supervised the process, and proofread the manuscript. H.G. is the advisor and helped with the completion of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The source code and all necessary data for this article are shared via the link below: (accessed on 22 July 2024) https://github.com/Mohammad-H-Moattar/Supervised-density-based-metric-learning-.git.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Russell, S.J.; Norvig, P. Artificial Intelligence: A Modern Approach; Pearson Education Limited: Kuala Lumpur, Malaysia, 2016. [Google Scholar]
Duin, R.P.; Tax, D.M.J. Statistical pattern recognition. In Handbook of Pattern Recognition and Computer Vision; World Scientific Pub Co Inc: Hackensack, NJ, USA, 2005; pp. 3–24. [Google Scholar]
He, H.; Ma, Y. (Eds.) Imbalanced Learning: Foundations, Algorithms, and Applications; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2008, 21, 1263–1284. [Google Scholar]
Ali, A.; Shamsuddin, S.M.; Ralescu, A.L. Classification with class imbalance problem: A review. Int. J. Adv. Soft Comput. Its Appl. 2015, 7, 176–204. [Google Scholar]
Nguyen, G.H.; Bouzerdoum, A.; Phung, S.L. Learning pattern classification tasks with imbalanced datasets. In Pattern Recognition; InTech: Houston, TX, USA, 2009. [Google Scholar]
Mazurowski, M.A.; Habas, P.A.; Zurada, J.M.; Lo, J.Y.; Baker, J.A.; Tourassi, G.D. Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance. Neural Netw. 2008, 21, 427–436. [Google Scholar] [CrossRef] [PubMed]
Wei, W.; Li, J.; Cao, L.; Ou, Y.; Chen, J. Effective detection of sophisticated online banking fraud on extremely imbalanced data. World Wide Web 2013, 16, 449–475. [Google Scholar] [CrossRef]
Li, Y.; Sun, G.; Zhu, Y. Data imbalance problem in text classification. In Proceedings of the Information Processing (ISIP), 2010 Third International Symposium on Information Processing, Qingdao, China, 15–17 October 2010; IEEE: Piscataway, NJ, USA; pp. 301–305. [Google Scholar]
Zhu, Z.B.; Song, Z.H. Fault diagnosis based on imbalance modified kernel Fisher discriminant analysis. Chem. Eng. Res. Des. 2010, 88, 936–951. [Google Scholar] [CrossRef]
Tavallaee, M.; Stakhanova, N.; Ghorbani, A.A. Toward credible evaluation of anomaly-based intrusion-detection methods. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2010, 40, 516–524. [Google Scholar] [CrossRef]
Kotsiantis, S.; Kanellopoulos, D.; Pintelas, P. Handling imbalanced datasets: A review. GESTS Int. Trans. Comput. Sci. Eng. 2006, 30, 25–36. [Google Scholar]
Xing, E.P.; Jordan, M.I.; Russell, S.J.; Ng, A.Y. Distance metric learning with application to clustering with side-information. In Advances in Neural Information Processing Systems; Mit Pr: Cambridge, MA, USA, 2003; pp. 521–528. [Google Scholar]
Bellet, A.; Habrard, A.; Sebban, M. Metric Learning; Synthesis Lectures on Artificial Intelligence and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9, pp. 1–151. [Google Scholar]
Li, D.; Tian, Y. Survey and experimental study on metric learning methods. Neural Netw. 2018, 105, 447–462. [Google Scholar] [CrossRef]
Weinberger, K.Q.; Blitzer, J.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. In Advances in Neural Information Processing Systems; Mit Pr: Cambridge, MA, USA, 2006; pp. 1473–1480. [Google Scholar]
Weinberger, K.Q.; Saul, L.K. Distance metric learning for large margin nearest neighbor classification. J. Mach. Learn. Res. 2009, 10, 207–244. [Google Scholar]
Zadeh, P.; Hosseini, R.; Sra, S. Geometric mean metric learning. In Proceedings of the International Conference on Machine Learning, PMLR, New York City, NY, USA, 19–24 June 2016; pp. 2464–2471. [Google Scholar]
Ying, Y.; Li, P. Distance metric learning with eigenvalue optimization. J. Mach. Learn. Res. 2012, 13, 1–26. [Google Scholar]
Nguyen, B.; Morell, C.; De Baets, B. Supervised distance metric learning through maximization of the Jeffrey divergence. Pattern Recognit. 2017, 64, 215–225. [Google Scholar] [CrossRef]
Davis, J.V.; Kulis, B.; Jain, P.; Sra, S.; Dhillon, I.S. Information-theoretic metric learning. In Proceedings of the 24th international conference on Machine learning, Corvallis, OR, USA, 17–24 June 2007; ACM: New York, NY, USA; pp. 209–216. [Google Scholar]
Chang, C.C. A boosting approach for supervised Mahalanobis distance metric learning. Pattern Recognit. 2012, 45, 844–862. [Google Scholar] [CrossRef]
Zhong, G.; Zheng, Y.; Li, S.; Fu, Y. SLMOML: Online Metric Learning With Global Convergence. IEEE Trans. Circuits Syst. Video Technol. 2018, 28, 2460–2472. [Google Scholar] [CrossRef]
Liu, W.; Tsang, I.W. Large Margin Metric Learning for Multi-Label Prediction. In Proceedings of the AAAI, Austin, TX, USA, 25–30 January 2015; Volume 15, pp. 2800–2806. [Google Scholar]
Kaya, M.; Bilge, H.Ş. Deep metric learning: A survey. Symmetry 2019, 11, 1066. [Google Scholar] [CrossRef]
Suárez, J.L.; García, S.; Herrera, F. A tutorial on distance metric learning: Mathematical foundations, algorithms, experimental analysis, prospects and challenges. Neurocomputing 2021, 425, 300–322. [Google Scholar] [CrossRef]
Ghojogh, B.; Ghodsi, A.; Karray, F.; Crowley, M. Spectral, Probabilistic, and Deep Metric Learning: Tutorial and Survey. arXiv 2022, arXiv:2201.09267. [Google Scholar]
Cao, X.; Ge, Y.; Li, R.; Zhao, J.; Jiao, L. Hyperspectral imagery classification with deep metric learning. Neurocomputing 2019, 356, 217–227. [Google Scholar] [CrossRef]
Wang, N.; Zhao, X.; Jiang, Y.; Gao, Y. Iterative Metric Learning for Imbalance Data Classification. In Proceedings of the 2018 International Joint Conference on Artificial Intelligence IJCAI, Stockholm, Sweden, 13–19 July 2018; pp. 2805–2811. [Google Scholar]
Feng, L.; Wang, H.; Jin, B.; Li, H.; Xue, M.; Wang, L. Learning a Distance Metric by Balancing KL-Divergence for Imbalanced Datasets. IEEE Trans. Syst. Man Cybern. Syst. 2018, 49, 2384–2395. [Google Scholar] [CrossRef]
Gautheron, L.; Habrard, A.; Morvant, E.; Sebban, M. Metric learning from imbalanced data with generalization guarantees. Pattern Recognit. Lett. 2020, 133, 298–304. [Google Scholar] [CrossRef]
Yan, M.; Li, N. Borderline-margin loss based deep metric learning framework for imbalanced data. Appl. Intell. 2022, 53, 1487–1504. [Google Scholar] [CrossRef]
Fattahi, M.; Moattar, M.H.; Forghani, Y. Improved cost-sensitive representation of data for solving the imbalanced big data classification problem. J. Big Data 2022, 9, 1–24. [Google Scholar] [CrossRef]
Wang, K.F.; An, J.; Wei, Z.; Cui, C.; Ma, X.H.; Ma, C.; Bao, H.Q. Deep learning-based imbalanced classification with fuzzy support vector machine. Front. Bioeng. Biotechnol. 2022, 9, 802712. [Google Scholar] [CrossRef] [PubMed]
UCI Machine Learning Repository. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 22 July 2024).
Navarro, J.R.D.; Noche, J.R. Classification of Mixtures of Student Grade Distributions Based on The Gaussian Mixture Model Using The Expectation-Maximization Algorithm. 2003. Available online: https://www.researchgate.net/publication/2922541_Classification_of_Mixtures_of_Student_Grade_Distributions_Based_on_the_Gaussian_Mixture_Model_Using_the_Expectation-Maximization_Algorithm (accessed on 22 July 2024).
Ester, M.; Kriegel, H.P.; Sander, J.; Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; Volume 96, pp. 226–231.
Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]

Figure 1. The general process of supervised distance metric learning.

Figure 2. The general structure of the proposed DMLdbIm model.

Figure 3. Evaluation of the proposed DMLdbIm model.

Figure 4. Comparison of the mean F1 measure for different neighborhood sizes in the KNN algorithm. k represents the number of nearest neighbors used. Blue denotes k = 1, while red and gray denote k = 3 and k = 5, respectively.

Figure 5. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Heart dataset.

Figure 6. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on WDBC dataset.

Figure 7. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Pima dataset.

Figure 8. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Glass0 dataset.

Figure 9. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Ecoli1 dataset.

Figure 10. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Ecoli2 dataset.

Figure 11. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Newthyroid1 dataset.

Figure 12. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Glass6 dataset.

Figure 13. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Ecoli3 dataset.

Figure 14. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Yeast-2_vs_4 dataset.

Figure 15. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on Yeast-1_vs_7 dataset.

Figure 16. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on winequality-red-8_vs_6 dataset. As seen, the other approaches actually fail in recognizing the minority class.

Figure 17. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on winequality-red-8_vs_7 dataset. As seen, the other approaches actually fail in recognizing the minority class.

Figure 18. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on winequality-white-3-9_vs_5 dataset. As seen, the other approaches actually fail in recognizing the minority class.

Figure 19. Comparison of the accuracy, precision, recall, and F1 of the evaluated approaches with 3NN classifier on winequality-red-3_vs_5 dataset. As seen, the other approaches actually fail in recognizing the minority class.

Figure 20. The average execution time of different distance metric learning methods (in seconds) with 3NN as the classifier.

Table 1. The specifications of 15 evaluation datasets including the size of the datasets, class distributions, and imbalance ratios.

	Dataset	No. of Samples	No. of Features	No. of Minority Sample	Distribution of Classes	Imbalance Ratio
1	Heart (uci)	270	13	120	(55.56,44.44)	1.25
2	WDBC (uci)	569	30	212	(62.74,37.26)	1.68
3	Pima	768	8	268	(65.16,34.84)	1.87
4	Glass0	214	9	70	(67.32,32.68)	2.06
5	Ecoli1	336	7	81	(75.9,24.1)	3.14
6	Ecoli2	336	7	55	(83.63,16.37)	5.1
7	newthyroid1	215	5	35	(83.72,16.28)	5.14
8	Glass6	214	9	29	(86.45,13.55)	6.37
9	Ecoli3	336	7	35	(89.58,10.42)	8.6
10	yeast-2_vs_4	514	8	51	(90.08,9.92)	9.08
11	yeast-1_vs_7	459	7	30	(93.46,6.54)	14.3
12	winequality-red-8_vs_6	656	11	18	(97.26,2.74)	35.44
13	winequality-red-8_vs_6-7	855	11	18	(97.89,2.11)	46.5
14	winequality-white-3-9_vs_5	1482	11	25	(98.31,1.69)	58.28
15	winequality-red-3_vs_5	691	11	10	(98.55,1.45)	68.1

Table 2. Confusion matrix [36] as the basis for performance comparisons.

		Predicted
		Positive	Negative
Actual	Positive	TP	FN
Actual	Negative	FP	TN

Table 3. Initial parameters of the proposed approach.

Name	Value
$λ_{1}$	0.00001
$λ_{2}$	0.00001
$α$	0.1
Tol	1 × 10⁻³

Table 4. Comparison of F1 measures for the 1NN classifier using the proposed DMLdbIm method and other methods on 15 standard datasets. The numbers in parentheses indicate the rank of each approach on the respective dataset. The last two rows show the average performance and average rank of the approaches, providing a useful basis for comparison.

Dataset	Euclidean	ITML	LMNN	DML-eig	GMML	DMLMJ	DMBK	IML	DMLdbIm
Heart	75.15 (6)	77.33 (2)	75.09 (7)	75.06 (8)	73.14 (9)	76.65 (3)	76.08 (4)	75.83 (5)	77.41 (1)
Wdbc	91.51 (8)	90.21 (9)	91.94 (7)	93.17 (5)	92.28 (6)	95.56 (1)	94.15 (3)	93.75 (4)	94.81 (2)
Pima	56.34 (4)	54.25 (9)	55.55 (7)	54.75 (8)	57.77 (2)	58.07 (1)	55.98 (6)	56.3 (5)	56.55 (3)
Glass0	67.61 (6)	63.5 (9)	64.63 (8)	74.1 (4)	68.62 (5)	75.92 (1)	75.17 (2)	67.15 (7)	74.30 (3)
Ecoli1	73.56 (6)	73.7 (4)	67.9 (9)	69.68 (8)	72.33 (7)	78.32 (1)	73.58 (5)	74.61 (2)	74.51 (3)
Ecoli2	71.43 (5)	70.28 (8)	71.35 (6)	77.1 (4)	68.65 (9)	87.12 (1)	78.26 (3)	70.37 (7)	83.93 (2)
Newthyroid1	87.07 (9)	90.25 (8)	96.77 (1)	93.79 (5)	93.34 (6)	92.52 (7)	94.61 (3)	93.82 (4)	95.96 (2)
Glass6	76.06 (8)	78.53 (5)	78.16 (7)	81.99 (3)	78.33 (6)	70.73 (9)	84.45 (2)	80.32 (4)	88.72 (1)
Ecoli3	50 (8)	52.98 (6)	54.8 (3)	52.99 (5)	50.4 (7)	58.73 (2)	54.32 (4)	45.66 (9)	60.74 (1)
Yeast-2_vs_4	69.39 (9)	73.79 (5)	74.45 (4)	70.77 (8)	71.17 (7)	75.54 (2)	75.08 (3)	71.66 (6)	78.27 (1)
Yeast-1_vs_7	26.09 (7)	23.04 (8)	33.15 (6)	36.51 (4)	37.07 (3)	NAN (9)	39.85 (2)	34.15 (5)	47.79 (1)
winequality-red-8_vs_6	NA	14.81 (5)	NA	32.12 (3)	25.11 (4)	NA	38.99 (2)	NA	44.44 (1)
winequality-red-8_vs_6-7	NA	17.25 (3)	NA	25.00 (2)	12.45 (5)	NA	NA	16.19 (4)	40.00 (1)
winequality-white-3-9_vs_5	NA	22.13 (2)	NA	20.00 (3)	11.57 (5)	NA	NA	17.07 (4)	33.33 (1)
winequality-red-3_vs_5	NA	NA	NA	25.00 (2)	10.78 (3)	NA	NA	NA	40.00 (1)
Mean	49.61	53.47	50.91	58.80	54.86	51.27	56.03	53.12	66.05
AverageRank	6.53	5.8	5.8	4.8	5.6	3.93	3.66	5.06	1.6

Table 5. Comparison of F1 measures for the 3NN classifier using the proposed DMLdbIm method and other methods on 15 standard datasets. The numbers in parentheses indicate the rank of each approach on the respective dataset. The last two rows show the average performance and average rank of the approaches, providing a useful basis for comparison.

Dataset	Euclidean	ITML	LMNN	DML-eig	GMML	DMLMJ	DMBK	IML	DMLdbIm
Heart	72.07 (9)	73.53 (8)	74.31 (7)	77.14 (3)	75.30 (5)	75.80 (4)	78.23 (2)	74.99 (6)	79.06 (1)
Wdbc	92.61 (7)	92.27 (9)	92.64 (6)	94.19 (5)	92.58 (8)	95.37 (2)	95.31 (3)	94.99 (4)	97.67 (1)
Pima	58.47 (7)	58.48 (6)	59.57 (4)	58.94 (5)	58.38 (8)	57.04 (9)	60.42 (3)	60.93 (2)	63.27 (1)
Glass0	70.8 (8)	70.77 (9)	73.02 (5)	76.12 (1)	71.55 (7)	74.73 (3)	73.01 (6)	73.25 (4)	75.67 (2)
Ecoli1	73.68 (6)	74.33 (5)	72.28 (8)	74.62 (4)	73.11 (7)	80.38 (3)	81.71 (2)	71.31 (9)	82.76 (1)
Ecoli2	85.60 (4)	81.68 (6)	80.41 (7)	84.95 (5)	78.28 (9)	86.27 (3)	86.68 (2)	80.24 (89)	89.88 (1)
Newthyroid1	90.51 (9)	97.15 (1)	95.95 (4)	94.56 (5)	96.37 (3)	93.80 (7)	92.73 (8)	97.14 (2)	94.42 (6)
Glass6	74.88 (8)	75.92 (7)	76.99 (6)	81.57 (2)	74.04 (9)	79.51 (4)	79.87 (3)	77.25 (5)	83.33 (1)
Ecoli3	53.12 (7)	49.32 (9)	$55.22$ (5)	51.19 (8)	53.82 (6)	59.27 (2)	58.77 (3)	58.08 (4)	60.97 (1)
Yeast-2_vs_4	75.56 (8)	77.85 (5)	78.99 (2)	74.76 (9)	78.12 (4)	76.35 (7)	78.38 (3)	77.32 (6)	80.56 (1)
Yeast-1_vs_7	22.22 (8)	31.79 (5)	28.49 (6)	36.61 (4)	21.76 (9)	39.21 (2)	39.04 (3)	26.87 (7)	46.38 (1)
winequality-red-8_vs_6	NA	5.88 (4)	NA	22.22 (2)	8.00 (3)	NA	NA	NA	39.39 (1)
winequality-red-8_vs_6-7	NA	NA	NA	NA	NA	NA	NA	NA	28.58 (1)
winequality-white-3-9_vs_5	NA	NA	NA	22.00 (2)	15.78 (3)	NA	NA	3.21 (4)	28.57 (1)
winequality-red-3_vs_5	NA	NA	NA	NA	NA	NA	NA	NA	36.71 (1)
Mean	51.30	52.59	52.52	56.59	53.13	54.51	54.94	53.03	65.80
Average Rank	6.33	5.53	4.93	3.93	5.66	4	3.46	4.66	1.4

Table 6. Comparison of F1 measures for the 5NN classifier using the proposed DMLdbIm method and other methods on 15 standard datasets. The numbers in parentheses indicate the rank of each approach on the respective dataset. The last two rows show the average performance and average rank of the approaches, providing a useful basis for comparison.

Dataset	Euclidean	ITML	LMNN	DML-eig	GMML	DMLMJ	DMBK	IML	DMLdbIm
Heart	75.44 (8)	75.72 (7)	79.1 (3)	79.94 (1)	77.60 (6)	73.48 (9)	78.14 (5)	79.63 (2)	78.62 (4)
Wdbc	95.16 (4)	92.12 (9)	94.32 (8)	94.98 (5)	95.32 (3)	94.43 (7)	94.58 (6)	95.67 (2)	96.58 (1)
Pima	57.52 (8)	61.05 (3)	57.32 (9)	60.48 (5)	57.65 (6)	57.53 (7)	60.75 (4)	61.78 (1)	61.08 (2)
Glass0	71.28 (7)	66.24 (9)	71.75 (4)	75.85 (1)	70.45 (8)	71.54 (5)	73.83 (3)	71.46 (6)	74.10 (2)
Ecoli1	69.44 (7)	69.69 (6)	65 (8)	83.12 (1)	70.52 (5)	82.27 (2)	78.72 (4)	63.62 (9)	79.99 (3)
Ecoli2	85.96 (7)	83.01 (9)	91.34 (1)	86.30 (6)	85.21 (8)	88.45 (3)	86.36 (5)	87.02 (4)	88.74 (2)
Newthyroid1	84.73 (9)	89.28 (6)	97.24 (2)	88.03 (8)	92.16 (4)	88.96 (7)	90.92 (5)	98.92 (1)	92.27 (3)
Glass6	74.1 (9)	82.9 (5)	86.37 (3)	85.28 (4)	89.39 (2)	79.32 (7)	78.06 (8)	91.2 (1)	79.87 (6)
Ecoli3	55.69 (9)	57.82 (8)	60.78 (7)	63.25 (3)	62.56 (4)	60.94 (6)	62.17 (5)	64.65 (2)	65.43 (1)
Yeast-2_vs_4	75 (8)	69.33 (9)	76.34 (5)	78.01 (2)	75.9 (6)	75.45 (7)	77.25 (4)	78.24 (1)	77.38 (3)
Yeast-1_vs_7	NA	14.42 (6)	3.83 (8)	29.39 (3)	23.38 (5)	30.12 (2)	27.14 (4)	10.77 (7)	35.39 (1)
winequality-red-8_vs_6	NA	NA	NA	NA	NA	NA	NA	NA	28.57 (1)
winequality-red-8_vs_6-7	NA	NA	NA	NA	NA	NA	NA	NA	25 (1)
winequality-white-3-9_vs_5	NA	NA	NA	25 (2)	7.4 (4)	NA	NA	10.29 (3)	27.14 (1)
winequality-red-3_vs_5	NA	1.19 (3)	NA	NA	NA	NA	NA	9.36 (2)	31.08 (1)
Mean	49.62	50.85	52.22	56.64	53.83	53.49	53.86	54.84	62.74
Average Rank	6.53	5.93	4.73	3.26	4.6	5	4.4	3	2.13

Table 7. Comparison of average accuracy, precision, recall, and F1 of 3NN classifier with the proposed DMLdbIm method and other methods on 15 standard datasets.

Evaluation Criteria	Euclidean	ITML	LMNN	DML-eig	GMML	DMLMJ	DMBK	IML	DMLdbIm
Average Accuracy	91.40	91.28	91.31	92.04	90.98	92.37	92.31	91.51	92.83
Average Precision	57.57	58.87	55.67	61.56	58.57	60.44	59.23	55.53	73.25
Average Recall	48.7	50.18	51.36	54.33	50.96	52.19	53.45	55.8	62.3
Average F1 Measure	51.30	52.59	52.52	56.59	53.13	54.51	54.94	53.03	65.80
Accuracy Average Rank	5.53	6.13	6.26	4.6	6.46	3.33	3.4	5.73	3.13
Precision Average Rank	4.4	4.8	5.8	4.53	5.26	3.01	3.53	5.53	3
Recall Average Rank	6.26	5.66	4.66	4.4	5.4	4.53	3.6	3.46	2.13
F1 Average Rank	6.33	5.53	4.93	3.93	5.66	4	3.46	4.66	1.4

Table 8. Comparison of F1 measure of 3NN classifier for the proposed DMLdbIm method and DFSVM method on evaluation datasets.

#	Dataset	No. of Samples	No. of Feature	Imbalance Ratio	DFSVM	DMLdbIm
1	Glass1	214	9	1.82	62.9	64.52
2	Glass6	214	9	6.38	83.1	90.91
3	Yeast1vs7	459	8	14.3	47.6	47.79
4	Yeast3	1484	8	8.1	78.1	67.74
5	Yeast6	1484	8	41.4	50.1	83.33
6	Ecoli0147vs2356	336	7	10.56	71.6	86.15
7	Ecoli01vs235	244	7	9.17	80.9	90.91
8	Ecoli0267vs35	224	7	9.18	76.4	80.23
9	Vehcle3	846	18	2.99	73.7	81.53
10	Pageblocks0	5472	10	8.79	80.5	79.21
Average					63.29	77.23

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Jalali Mojahed, A.; Moattar, M.H.; Ghaffari, H. Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems. Big Data Cogn. Comput. 2024, 8, 109. https://doi.org/10.3390/bdcc8090109

AMA Style

Jalali Mojahed A, Moattar MH, Ghaffari H. Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems. Big Data and Cognitive Computing. 2024; 8(9):109. https://doi.org/10.3390/bdcc8090109

Chicago/Turabian Style

Jalali Mojahed, Atena, Mohammad Hossein Moattar, and Hamidreza Ghaffari. 2024. "Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems" Big Data and Cognitive Computing 8, no. 9: 109. https://doi.org/10.3390/bdcc8090109

APA Style

Jalali Mojahed, A., Moattar, M. H., & Ghaffari, H. (2024). Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems. Big Data and Cognitive Computing, 8(9), 109. https://doi.org/10.3390/bdcc8090109

Article Menu

Supervised Density-Based Metric Learning Based on Bhattacharya Distance for Imbalanced Data Classification Problems

Abstract

1. Introduction

2. Related Works

2.1. Background and Definitions

2.2. Learning Distance Metric in Imbalanced Applications

3. The Proposed Method

3.1. The Proposed Model Construction Method

3.1.1. The First Phase: Estimating the Density of the Classes

3.1.2. The Second Phase: Calculating the Distance between the Gaussian Components Using the Bhattacharya Distance

3.1.3. The Third Phase: The Process of Learning the Proposed Distance Metric

3.1.4. The Fourth Phase: The Optimization Process of the Proposed Objective Function

3.2. Evaluation

4. Experiments

4.1. Specifications of the Datasets

4.2. Evaluation Criteria

4.3. Evaluation of the Proposed DMLdbIm Method

4.4. Comparison with Deep Learning

4.5. Computational Complexity Analysis

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI