1. Introduction
Linear discriminant analysis (LDA) [1] is a classical supervised linear learning method stemming from the Fisher criterion [2]; it can be used not only for supervised classification but also for feature dimensionality reduction. Its idea is straightforward: for a set of given training samples, LDA seeks the optimal projection matrix that simultaneously maximizes the between-class scatter and minimizes the within-class scatter. As a supervised feature extraction method, LDA has demonstrated its feasibility and efficiency in pattern classification [3], and it has been widely applied in hyperspectral image classification [4], EEG signal analysis [5], re-identification [6], etc. LDA is also an important approach in visual classification, such as handwriting recognition, face recognition and object categorization [7].
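To make the Fisher criterion concrete, the following is a minimal NumPy sketch of classical LDA (not the exact formulation used later in this paper): it builds the within- and between-class scatter matrices and keeps the leading eigenvectors of the resulting eigen-problem as the projection; the ridge term added to the within-class scatter is our own safeguard.

```python
import numpy as np

def lda_projection(X, y, h):
    """Classical LDA sketch: X is (n_samples, n_features), y holds class labels.
    Returns an (n_features, h) projection maximizing the Fisher criterion."""
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # solve S_w^{-1} S_b w = lambda w; a small ridge keeps S_w invertible
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw + 1e-6 * np.eye(d), Sb))
    order = np.argsort(-evals.real)
    return evecs[:, order[:h]].real
```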
LDA assumes that the samples of all training data follow a multivariate Gaussian distribution with the same covariance but different mean values for different classes [8]; once this assumption is violated, for example, when the samples of one class form several distinct clusters, LDA gives undesirable results [9]. To solve this problem, a two-step subclass division method was proposed [10]; the first step applies k-means for clustering and the second step uses an EM-like framework to optimize the subclasses. Recently, the reverse nearest neighbor approach has been combined with LDA to form neighborhood linear discriminant analysis (nLDA) [7]. As an unsupervised outlier detection tool, the reverse nearest neighbor (RNN) can eliminate the “isolated points” in the training set. nLDA expects a sample and its RNNs to be as close as possible, while the RNNs of two samples belonging to different classes should be as far apart as possible in the projected space. Because the scatter matrices are defined on RNNs rather than directly on whole classes, nLDA can cope with datasets containing multimodal classes. In other words, nLDA is a localized discriminator whose scatter matrices are defined directly on neighborhoods; the smallest subclass can be regarded as a neighborhood, so discriminative learning can be performed without obeying the independently and identically distributed (i.i.d.) assumption on the sample data.
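As an illustration of the reverse-nearest-neighbor idea (a sketch under our own simplifying assumptions, not the authors' implementation), the reverse k-nearest neighbors of a sample are simply the samples that count it among their own k nearest neighbors; a point that appears in nobody's neighbor list behaves like an isolated point.

```python
import numpy as np

def reverse_knn(X, k):
    """Return, for each sample i, the indices j such that i is among the
    k nearest neighbors of j (reverse k-nearest neighbors, RkNN)."""
    n = X.shape[0]
    # pairwise squared Euclidean distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)          # exclude self-matches
    knn = np.argsort(d2, axis=1)[:, :k]   # k nearest neighbors of each sample
    rknn = [[] for _ in range(n)]
    for j in range(n):
        for i in knn[j]:
            rknn[i].append(j)             # j "votes" for its neighbor i
    return rknn                           # rknn[i] empty -> isolated point
```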
However, since nLDA is extended from LDA, a linear parametric method, it may fail to handle nonlinear data. It is well known that the complex appearance variations caused by deformation, illumination and viewing angle often introduce non-linearity [11,12]. If the non-linearity of the data is ignored, linear discriminators often give lower performance. One practical solution is to map the original input data to a higher-dimensional feature space and then learn a linear classifier in that space. However, handling the non-linearity in the higher-dimensional feature space is non-trivial, especially when the dimension of the mapped space is infinite. Many methods have been proposed that use the kernel trick to handle this problem [11,12,13], and it is also exploited in our proposed approach.
To handle nonlinear data, circumvent the i.i.d. assumption on the sample data and solve the multimodal classes problem simultaneously, in this paper we extend nLDA to its kernel version, called kernel reverse neighborhood discriminant analysis (KRNDA). The Gaussian kernel is employed to map the input data into a higher-dimensional feature space where the non-linearity can be alleviated. The derivation of KRNDA with the kernel trick is presented in detail. We give extensive evaluations on UCI datasets and on the visual classification tasks of handwriting recognition, face recognition and object categorization to verify the effectiveness of the proposed approach.
The rest of this paper is organized as follows.
Section 2 reviews the related work.
Section 3 briefly introduces the basic principles of LDA and nLDA methods. Subsequently, we propose our KRNDA method; the deduction with the kernel trick is presented in
Section 4. Experiments are conducted in
Section 5.
Section 6 concludes this paper.
2. Related Work
LDA has been studied for decades [14,15,16] and plentiful approaches have been proposed to address its different problems. For pattern recognition, the small sample size (SSS) problem is an essential issue for LDA. The SSS problem generally occurs when the feature dimension of each sample is very high while there are insufficient training samples for each class. This often leads to a singular or ill-conditioned within-class scatter matrix, so LDA cannot be solved by eigen-decomposition. To address this problem, Fisherfaces [1] were proposed, which first perform principal component analysis (PCA) to reduce the feature dimension of each sample and then apply the LDA algorithm. The singularity or ill-conditioning of the within-class scatter matrix caused by the SSS problem can also be alleviated by adding a small value to the diagonal of the within-class scatter matrix, which is known as regularized LDA [17]. Pang et al. [18] combined clustering and regularization terms to alleviate the deviation of the scatter matrix caused by the SSS problem. Many other approaches have been proposed to circumvent the singularity of the within-class scatter matrix and the instability of its inverse caused by the SSS problem. For example, direct LDA (DLDA) [19] and null space LDA (NLDA) [20] select different subspaces of the scatter matrix to avoid the singularity. Eigenfeature regularization and extraction (ERE) [21] retains and reweights the entire eigenspace of the within-class scatter matrix to keep it non-singular; this approach achieved excellent performance on face recognition.
The multimodal classes problem of LDA has also attracted considerable attention. Marginal Fisher analysis (MFA) [22] employs graph embedding to characterize the nearest neighbors and thus avoids the i.i.d. assumption on the sample data. Exploiting manifolds to represent local structures is another solution. For instance, local Fisher discriminant analysis (LFDA) [23] effectively combines the ideas of locality preserving projections (LPP) [24] and LDA; it attains inter-class separation and intra-class local structure preservation by defining a local intra-class and a local inter-class scatter matrix to handle the multimodal classes problem. Locality sensitive discriminant analysis (LSDA) [25] projects the dataset onto a lower-dimensional subspace that preserves the local manifold structure and the discriminant information. Both LFDA and LSDA aim to keep neighbors with the same label as close as possible. Nonparametric discriminant analysis (NDA) [9] solves the multimodal problem of LDA by using k-nearest neighbors to construct a nonparametric inter-class divergence in a local area. Nonparametric discriminant analysis for face recognition [26] extends NDA to multi-class situations. Recently, nLDA [7] directly defined the scatter matrices on the neighborhood to solve the multimodal classes problem and achieved remarkable results.
To handle the non-linearity problem, many LDA-based approaches have been extended to their kernel versions, learning nonlinear structures with the powerful tool of the kernel trick [27]; for example, generalized discriminant analysis (GDA) [11] detailed the derivation of kernel LDA. Kernel Fisherface (KLDA) [28], kernel direct-LDA (KDDA) [13], null space kernel LDA (NKDA) [12] and complete discriminant evaluation and feature extraction (CDEFE) [29] are the kernel versions of PCA+LDA, DLDA, NLDA and ERE, respectively. In previous studies, researchers worked out an alternative formulation of kernel LPP (KLPP) to develop a framework of KPCA+LPP algorithms [30], which achieved good results in recognition tasks such as face recognition and radar target recognition. For the nonlinear problems encountered in image classification, the KAHISD/KCHISD methods of Cevikalp and Triggs [31] extend affine/convex hull-based image set classification to its kernel version. Zhu et al. [32] proposed KCH-ISCRC with the kernel trick, which addresses collaborative image set-based representation and classification (ISCRC) well. These works show that Gaussian kernels can successfully solve nonlinear problems in various applications [28,29,31,32].
LDA has also seen new developments with recent technologies. Dorfer et al. [33] proposed deep linear discriminant analysis built on deep neural networks. Alarcón and Destercke [34] proposed a Gaussian discriminator that fuses near-ignorance priors and robust Bayesian analysis. Belous et al. [35] proposed a framework called dual spatial discriminative projection learning and successfully applied it to image classification tasks. Hu et al. [36] proposed a cross-modal discriminative network for cross-modal learning.
4. Kernel Reverse Neighborhood Discriminant Analysis
In this section, we detail the proposed kernel reverse neighborhood discriminant analysis (KRNDA). We use the kernel trick to deal with the non-linearity underlying the input data of nLDA. First, the kernel method based on nLDA requires the definition of a mapping that sends each observed sample $x$ from the original feature space to a higher-dimensional feature space $\mathcal{F}$:
$$\phi:\; x \;\mapsto\; \phi(x) \in \mathcal{F}.$$
Subsequently, the mapped training dataset of $X = \{x_1, \ldots, x_N\}$ can be represented as $\Phi = [\phi(x_1), \ldots, \phi(x_N)]$. Usually, the mapping function is not explicitly specified [38], especially when the mapped feature space is infinite-dimensional. Hence, kernel functions are defined to implicitly calculate the inner product of two image vectors in the mapped space $\mathcal{F}$. For two samples $x_i$ and $x_j$, the inner product of their images in the mapped space can be expressed as
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle, \qquad (17)$$
where $\langle \cdot, \cdot \rangle$ is the inner product. How to select a suitable kernel function to implicitly calculate the inner product of two vectors in the higher-dimensional feature space is a critical factor for kernel methods [39]. The typical and effective Gaussian kernel function is used in this paper:
$$k(x_i, x_j) = \exp\!\left(-\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2}\right). \qquad (18)$$
One of the advantages of the Gaussian kernel is that it has only one free parameter, $\sigma$, which is easily tuned.
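As a minimal illustration (the helper name and usage are ours, not the paper's), the Gaussian kernel matrix over a set of samples can be computed as follows; $\sigma$ is the single free parameter mentioned above.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix K[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

# usage sketch: N x N Gram matrix of the training data
# X_train = np.random.randn(100, 16); K = gaussian_kernel(X_train, X_train, sigma=4.0)
```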
The proposed KRNDA aims to find an optimal projection matrix $W = [w_1, \ldots, w_h]$, where $h$ is the reserved dimension of the final extracted feature, such that the mapped data in space $\mathcal{F}$ can be projected onto a more discriminative and lower-dimensional feature space by $y = W^{\top}\phi(x)$. The optimal $W$ can be obtained by maximizing the following criterion:
$$J(W) = \frac{\bigl|W^{\top} S_b^{\phi} W\bigr|}{\bigl|W^{\top} S_w^{\phi} W\bigr|}, \qquad (19)$$
where $S_b^{\phi}$ is the between-neighborhood scatter matrix and $S_w^{\phi}$ is the within-neighborhood scatter matrix in space $\mathcal{F}$.

According to [11], the projection matrix $W$ can be linearly represented by all the training data in the mapped space $\mathcal{F}$. Therefore, we can obtain
$$W = \Phi A, \qquad (20)$$
where $A = [\alpha_1, \ldots, \alpha_h]$ denotes the coefficients of the linear combination. For a sample $x$ mapped into the higher-dimensional space $\mathcal{F}$, the projected vector can be expressed as
$$y = W^{\top}\phi(x) = A^{\top}\Phi^{\top}\phi(x). \qquad (21)$$
Obviously, the final extracted feature $y$ cannot be computed directly from Equation (21), since both $w_j$ and $\phi(x)$ contain the implicit mapping $\phi$. However, the kernel trick can circumvent this problem. According to Equations (17) and (20), each element $w_j^{\top}\phi(x)$ in Equation (21) can be calculated as
$$w_j^{\top}\phi(x) = \sum_{m=1}^{N} \alpha_{jm}\,\langle \phi(x_m), \phi(x) \rangle = \sum_{m=1}^{N} \alpha_{jm}\, k(x_m, x), \qquad (22)$$
where $\alpha_{jm}$ denotes the $m$-th element of the coefficient vector $\alpha_j$. According to Equation (17), we have $\Phi^{\top}\phi(x_m) = \kappa_m = [k(x_1, x_m), \ldots, k(x_N, x_m)]^{\top}$; $\kappa_m$ denotes the vector of inner products between all training samples and the $m$-th training sample in the mapped space $\mathcal{F}$. Then the extracted feature of the $m$-th training sample can be explicitly expressed as
$$y_m = A^{\top}\kappa_m, \qquad (23)$$
where $\kappa_m$ is the $m$-th column of the kernel (Gram) matrix $K = \Phi^{\top}\Phi$.
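To make the kernel-trick projection concrete, the following is a minimal NumPy sketch of Equations (21)–(23); the coefficient matrix A is a placeholder here (in KRNDA it is obtained from the eigen-problem derived later), and the helper names are ours rather than the paper's.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def project(X_train, A, x, sigma):
    """Extracted feature y = A^T kappa(x), where kappa(x) holds the kernel
    values between every training sample and x (Equation (23))."""
    kappa = gaussian_kernel(X_train, x[None, :], sigma).ravel()  # shape (N,)
    return A.T @ kappa                                           # shape (h,)
```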
We define the between-neighborhood scatter matrix of our KRNDA in Equation (19) and formulate $S_b^{\phi}$ as given in Equation (24), with its constituent terms defined in Equation (25). In Equation (24), $\bar{\phi}_i$ denotes the mean vector of the local neighborhood computed by the reverse k-nearest neighbors ($R_kNN$) of $x_i$ among the training samples with the same label $c_i$. Certainly, the number of $R_kNN$ members should be equal to or larger than the threshold $t$. Likewise, $\bar{\phi}_j$ denotes the mean vector of the local neighborhood computed by the $R_kNN$ of $x_j$ among the training samples with the same label $c_j$. It should be noted that the labels of these two $R_kNN$ neighborhoods are different ($c_i \neq c_j$), which captures the between-class information. The terms of Equation (25) can be computed through the corresponding kernel vectors. Using the relationship of Equation (23), the projection of a neighborhood mean can be deduced as, for instance,
$$W^{\top}\bar{\phi}_i = A^{\top}\Phi^{\top}\bar{\phi}_i = A^{\top}\bar{\kappa}_i, \qquad \bar{\kappa}_i = \frac{1}{|R_i|}\sum_{x_m \in R_i} \kappa_m,$$
where $R_i$ denotes the $R_kNN$ neighborhood of $x_i$. Hence, the projected neighborhood mean reduces to $A^{\top}$ applied to the mean of the kernel columns over the neighborhood; the term for $x_j$ is obtained similarly.
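Since this step only requires averaging kernel columns, it can be sketched in a few lines of NumPy (the function name and arguments are illustrative, not the authors' code):

```python
import numpy as np

def projected_neighborhood_mean(K, A, neighborhood_idx):
    """K is the N x N Gram matrix of the training data, A the N x h coefficient
    matrix, neighborhood_idx the indices of an RkNN neighborhood.
    Returns A^T applied to the mean of the corresponding kernel columns."""
    kappa_bar = K[:, neighborhood_idx].mean(axis=1)  # mean kernel vector, shape (N,)
    return A.T @ kappa_bar                           # shape (h,)
```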
We employ the definition of the approximate between-neighborhood scatter in [7] to construct $S_b^{\phi}$; hence, for each sample $x_i$, the number of between-class reverse nearest neighborhoods is restricted to $k$. This reduces the amount of computation when the dataset is very large [7]. The selection of these neighborhoods follows the definition of the corresponding set in Equation (8), where the distances between nearest neighbors are calculated in the mapped higher-dimensional feature space with the kernel trick, i.e., $\lVert \phi(x_i) - \phi(x_j) \rVert^2 = k(x_i, x_i) + k(x_j, x_j) - 2k(x_i, x_j)$.
Likewise, the within-neighborhood scatter can also be expressed explicitly by the kernel trick; the matrix $S_w^{\phi}$ in Equation (19) can be formulated as given in Equation (27), with its constituent terms defined in Equation (28), where the involved mean vector is the mean of the higher-dimensional features in the $R_kNN$ neighborhood of $x_i$ with the same label $c_i$. According to Equations (24), (25), (27) and (28), the objective function of Equation (19) can be rewritten in terms of the coefficient matrix $A$ and the kernel matrix as
$$J(A) = \frac{\bigl|A^{\top} \tilde{S}_b A\bigr|}{\bigl|A^{\top} \tilde{S}_w A\bigr|}, \qquad (29)$$
where $\tilde{S}_b$ and $\tilde{S}_w$ denote the between- and within-neighborhood scatter matrices expressed through kernel vectors.
Hence, maximizing Equation (29) reduces to the generalized eigen-decomposition problem
$$\tilde{S}_b\,\alpha = \lambda\,\tilde{S}_w\,\alpha. \qquad (30)$$
The optimal projection can be obtained by applying the eigen-decomposition to $\tilde{S}_w^{-1}\tilde{S}_b$. We reserve the $h$ eigenvectors associated with the largest eigenvalues to form the final coefficient matrix $A$.
At the testing stage, for a test sample $x_t$, we first map it to the higher-dimensional space $\mathcal{F}$ by $\phi$; the final extracted feature can then be expressed as $y_t = A^{\top}\Phi^{\top}\phi(x_t)$, which, according to the kernel trick, can be rewritten as $y_t = A^{\top}\kappa(x_t)$. Here, $\kappa(x_t) = [k(x_1, x_t), \ldots, k(x_N, x_t)]^{\top}$ denotes the kernel vector of inner products between all gallery (or training) samples and the test sample $x_t$ in the mapped space $\mathcal{F}$. We compare the Euclidean distance between $y_t$ and each projected mean vector $W^{\top}\bar{\phi}_i$, which can be explicitly rewritten as $A^{\top}\bar{\kappa}_i$. Here, $\bar{\phi}_i$ is the mean vector of the $i$-th gallery RNN neighborhood with the same label, and $\bar{\kappa}_i$ denotes the kernel vector of inner products between all gallery samples and the mean vector $\bar{\phi}_i$ in the mapped space $\mathcal{F}$. Therefore, the label of the test sample $x_t$ can be determined by the nearest Euclidean distance:
$$c(x_t) = \arg\min_{i}\,\bigl\lVert A^{\top}\kappa(x_t) - A^{\top}\bar{\kappa}_i \bigr\rVert_2 .$$
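Putting the test stage together, the following sketch (helper names and data layout are ours) classifies a test sample by its nearest projected neighborhood mean, in the spirit of the distance rule above:

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def classify(x_test, X_train, A, sigma, mean_kernel_vectors, mean_labels):
    """mean_kernel_vectors: list of kappa_bar vectors (one per gallery RkNN
    neighborhood); mean_labels: the class label attached to each neighborhood."""
    kappa_t = gaussian_kernel(X_train, x_test[None, :], sigma).ravel()
    y_t = A.T @ kappa_t                               # projected test feature
    dists = [np.linalg.norm(y_t - A.T @ kb) for kb in mean_kernel_vectors]
    return mean_labels[int(np.argmin(dists))]         # nearest projected mean
```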
5. Experiments
Experiments are conducted on two handwritten digit datasets and the University of California at Irvine (UCI) benchmark database. In addition, the COIL-20 (Columbia Object Image Library) object recognition dataset [40] and the ORL face recognition dataset [41] are adopted for evaluation.
To verify the effectiveness of our method, we compare the performance of our KRNDA with the original nLDA. Several LDA-based discriminators are added for comparison, including LDA, LFDA, ccLDA and a norm-based LDA variant [43]. LFDA is a traditional local discriminator which has been successfully applied to pedestrian re-identification [42]. We implemented the code of ccLDA [18], which uses clustering to solve the SSS problem. The norm-based LDA variant [43] is a discriminator with an excellent discriminant effect, whose adopted norm performs better than the conventional norm. The codes of LFDA, the norm-based variant and nLDA are provided by their authors.
The parameters of the above discriminators are set as follows. For LDA, the final extracted features are preserved according to 95% of the energy of the eigenvalues. For LFDA, we set the parameter K = 2 to achieve the optimal effect in our experiments (K is a parameter used in the local scaling method). To avoid the singularity caused by the SSS problem, regularization is applied to the within-class scatter matrix, $S_w \leftarrow S_w + \epsilon I$, where $I$ is the identity matrix and $\epsilon$ is a small value. For ccLDA, we follow the original experimental settings: K < C (here, K is the cluster number and C is the class number) and 98% of the energy of the eigenvalues is reserved for the experiments. For the norm-based LDA variant, we use the original parameter settings provided by the authors.
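As a minimal illustration of the scatter-matrix regularization mentioned above (the default value of epsilon here is ours for illustration; the exact magnitude used in the experiments is not restated):

```python
import numpy as np

def regularize(Sw, eps=1e-4):
    """Add a small ridge to the within-class scatter to avoid singularity (SSS problem)."""
    return Sw + eps * np.eye(Sw.shape[0])
```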
There are four main parameters related to the proposed KRNDA. The first one is the parameter $t$, which controls the isolated points in Equations (24) and (27); we set it according to the article [7], which showed that it does not affect the performance much, and we apply this setting to our KRNDA in all experiments. The second parameter is the value of $k$ in $R_kNN$; [7] showed that $k$ performs better within a certain range, so we adopted a similar setting to nLDA in our KRNDA. The third parameter is the $\sigma$ of the Gaussian kernel, see Equation (18); we evaluate this parameter in Section 5.4 and apply the best value in our experiments. The last parameter is the dimension of the final extracted feature; similar to nLDA, we preserve 95% of the energy of the eigenvalues to construct the final features. All of the experiments were implemented on an Intel(R) Core(TM) i7-11700K (3.60 GHz) PC, using Matlab R2020b.
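The 95% energy rule simply keeps the smallest number of leading eigenvectors whose eigenvalues account for at least 95% of the total eigenvalue sum; a minimal sketch (our own helper, not the authors' Matlab code) is given below.

```python
import numpy as np

def energy_dim(eigenvalues, energy=0.95):
    """Return how many leading eigenvalues are needed to retain the given
    fraction of the total eigenvalue energy."""
    vals = np.sort(np.abs(eigenvalues))[::-1]
    cumulative = np.cumsum(vals) / vals.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

# e.g. energy_dim(np.array([5.0, 3.0, 1.0, 0.5, 0.5])) -> 4
```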
5.1. Experiments on Handwritten Digit Recognition
Two handwritten digit datasets, Mnist and USPS, are used in our evaluation. The purpose of KRNDA is to solve the nonlinear problem; at the same time, it is an extension of nLDA, which performs well on the multimodal classes problem. Therefore, we construct handwritten digit recognition as a binary classification task to verify the performance of KRNDA in solving the multimodal classes problem. For binary digit recognition, the task is to determine whether a digit is odd or even. The odd class contains the digits 1, 3, 5, 7 and 9, and the even class contains the digits 0, 2, 4, 6 and 8. Some digit images of the two classes are shown in Figure 1. As the five digits within each of the odd and even classes are quite different, each class can be considered to contain five subclasses (clusters).
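A trivial sketch of how such a two-class, multimodal task can be derived from the original digit labels (labels assumed to be 0–9):

```python
import numpy as np

def to_parity_labels(digit_labels):
    """Map digit labels 0-9 to a binary odd/even task; each binary class then
    contains five natural subclasses (the five underlying digits)."""
    digit_labels = np.asarray(digit_labels)
    return digit_labels % 2   # 1 = odd class, 0 = even class
```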
The Mnist dataset has a total of 60,000 samples in the training set; the details of the training and testing samples are shown in
Table 1. Adopting the whole set directly for training is time-consuming and may lead to running out of memory. Hence, we constructed subsets for our experiments. We randomly selected different numbers of samples and created four subsets: Mnist600, Mnist1000, Mnist6000 and Mnist10000, meaning that 600, 1000, 6000 and 10,000 samples are selected from the total training set for training, respectively. For the corresponding testing sets, we randomly selected 100, 166, 1000 and 1666 samples from the original 10,000 testing samples to form four testing subsets for evaluation. The USPS dataset has a total of 9298 handwritten digit images. We conducted our experiments according to the protocol in [
7], where the training set contains 7291 images and the testing set contains 2007 images, and the detailed training and testing numbers of each class are shown in
Table 2.
Firstly, we compare the proposed method with the Gaussian kernel (KRNDA) against its variants with the sigmoid kernel (SRNDA) and the polynomial kernel (PRNDA) on the USPS and Mnist datasets. We chose the parameters corresponding to the highest recognition rates in our experiments. For the Mnist datasets, we performed 10-fold cross-validation to obtain the average recognition rates and standard deviations.
From
Table 3, we can see that the Gaussian kernel function outperforms the polynomial and sigmoid kernel functions. The poor performance of the sigmoid kernel function may be caused by the multimodal classes in the data. The Gaussian kernel demonstrates a better ability to handle the non-linearity, and it has only one adjustable parameter, which makes it easy to use in evaluations. Consequently, in the following experiments of this paper, we employ the Gaussian kernel function with KRNDA for all evaluations.
Here, we present the recognition rates on the USPS handwritten digit dataset, and the average recognition rates and standard deviations of 10-fold cross-validation on the Mnist handwritten digit subsets. As shown in Table 4, compared with the original nLDA, our KRNDA consistently achieves better recognition rates, especially on the Mnist600 and Mnist1000 subsets, since a small training subset may introduce more non-linearity. These results demonstrate the effectiveness of the proposed KRNDA in handling nonlinear data. The proposed KRNDA also outperforms the other LDA-based methods, namely LDA, LFDA, ccLDA and the norm-based LDA variant. This result illustrates that our KRNDA inherits nLDA's advantage in solving the multimodal classes problem.
5.2. Experiments on COIL-20 and ORL Datasets
In this subsection, we use the COIL-20 object dataset and the ORL face dataset to evaluate more complicated image classification tasks. COIL-20 consists of 20 different objects, such as a car, a cat, a cup and an eraser; for each object, there are 72 images taken from different viewpoints [44]. The images are normalized and resized to a fixed resolution. We randomly selected 50% of the images per class for training and the remaining images for testing. Some sample images are shown in
Figure 2.
The ORL face dataset contains 400 images of 40 individuals. The images were captured at different times and involve variations in expression (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses) [45]. The face images of the ORL database are normalized and resized to a fixed resolution. We randomly selected 50% of the images per class for training and the remaining images for testing. Some sample images are shown in
Figure 3.
For the COIL-20 object dataset and the ORL face dataset, we also performed 10-fold cross-validations to obtain the average recognition rates and standard deviations. As shown in
Table 5, the proposed KRNDA outperforms the original nLDA, and the recognition rates reach the highest values of 98.62% and 99.90% on ORL and COIL-20, respectively. As can be seen in Figure 2 and Figure 3, the COIL-20 dataset, with its multi-view samples, exhibits much more distortion than the ORL dataset. Hence, the accuracy improvement of KRNDA over nLDA on the COIL-20 dataset is larger than that on the ORL dataset. Our KRNDA also performs well on face recognition, which involves non-linearity caused by lighting and pose variations. These results show the effectiveness of the proposed kernel-based solver. Owing to the benefits of the RNN and the non-linearity handling, the proposed KRNDA can cope with different pattern recognition applications.
5.3. Experiments on UCI Benchmark Datasets
We selected 40 benchmark datasets from the University of California at Irvine (UCI) machine learning repository [
46] for our experiments. The details of the class number, sample number and feature dimension of each dataset are listed in
Table 6. The number of classes ranges from 2 to 10, and the number of features ranges from 2 to 256, which is lower than in the visual classification tasks. For each dataset, we randomly selected approximately 50% of the samples for training and the remaining samples for testing. We also employed the previously introduced LDA-based discriminators for comparison.
The classification results of the UCI datasets are reported in
Table 7. In each row, the best classification accuracy is highlighted. As can be seen, the proposed KRNDA outperforms the other discriminators in most cases, and its average classification accuracy over all datasets reaches 81.63%, which is better than that of all the other discriminators. The degradation of KRNDA on some datasets, such as Bupa, Sonar and breast-tissue, may be caused by the lower non-linearity (due to the lower-dimensional features of the UCI datasets) or by a non-optimal value of the parameter $\sigma$.
5.4. Parameter Evaluation
The parameter $\sigma$ of the Gaussian kernel plays an important role in the performance of the proposed KRNDA; it can be seen as the controller of how the non-linearity is handled. Therefore, we conducted experiments to tune the value of $\sigma$. The experiments were conducted on a subset of the Mnist database; we randomly selected 600 samples for training and 100 samples for testing, and 10 cross-validation runs were conducted to obtain the mean accuracy.
Experimental results are shown in
Figure 4. The Y-axis is the recognition accuracy and the X-axis is the value of $\sigma$. As can be seen, the best value of $\sigma$ reaches the highest accuracy of 95.1%; hence, in most of our experiments, $\sigma$ is set to this best value. Certainly, this value is not optimal for every application and dataset; however, it achieves considerable performance in most cases of our experiments.
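A schematic of such a σ sweep (the `krnda_accuracy` callable is a stand-in for training and evaluating KRNDA on one split; the fold construction and candidate grid are illustrative, not the exact protocol above):

```python
import numpy as np

def tune_sigma(X, y, candidate_sigmas, krnda_accuracy, n_folds=10, seed=0):
    """Pick the Gaussian-kernel width giving the best mean accuracy over folds.
    krnda_accuracy(X_tr, y_tr, X_te, y_te, sigma) is assumed to return a float."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, n_folds)
    scores = {}
    for sigma in candidate_sigmas:
        accs = []
        for f in range(n_folds):
            test_idx = folds[f]
            train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
            accs.append(krnda_accuracy(X[train_idx], y[train_idx],
                                       X[test_idx], y[test_idx], sigma))
        scores[sigma] = float(np.mean(accs))
    return max(scores, key=scores.get), scores
```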