1. Introduction
The development of data acquisition techniques and the diversification of data-processing methods have made multi-view data a typical form of big data. Concretely, multi-view data is a general term for data with multiple feature representations of the same object. For example, cancers can be studied in terms of RNA gene expression, mitochondrial DNA, microRNA expression, and reverse phase protein array expression [1], where each kind of expression corresponds to a view. As another example, different visual descriptors, such as LBP [2], HOG [3], and SIFT [4], can be extracted from images, and each descriptor forms a view. Since the views describe the same object from different aspects, each view contains information that is partly consistent with, and partly complementary to, the other views. How to make full use of this consistent and complementary information to improve data analysis performance has given rise to the flourishing of multi-view learning.
Studies on multi-view learning have been conducted from various aspects, such as subspace learning [5], feature selection [6], classification [7], clustering [8], and multi-label learning [9]; more details can be found in the surveys [10,11]. Among these studies, multi-view feature selection, as an important preprocessing step for handling multi-view high-dimensional data, has attracted extensive attention. Multi-view feature selection aims to select the features or variables most relevant to the subsequent learning tasks. Compared with multi-view subspace learning, another typical way to alleviate the curse of dimensionality, multi-view feature selection keeps the original semantics of the variables and thus has an advantage in interpretability. In this paper, we focus on multi-view feature selection.
Conventionally, according to the labelling condition of the training data, multi-view feature selection methods can be grouped into three categories: if the training samples are all labelled, partly labelled, or all unlabelled, the corresponding methods are categorized as supervised [12], semi-supervised [13], or unsupervised [14], respectively. Since labelling samples is considerably expensive, unsupervised multi-view feature selection is preferable in real applications. Existing unsupervised multi-view feature selection works can be divided into two groups, i.e., serial models and parallel models [15]. In the early period, serial models first concatenated all views into a single one and then applied single-view feature selection methods to the concatenated features, such as spectral feature selection (SPEC) [16], minimum redundancy spectral feature selection (MRSF) [17], and URAFS [18]. However, approaches of this kind cannot make full use of the consistency and complementarity between views. To overcome this deficiency, adaptive multi-view feature selection (AMFS) combines the graph Laplacians from all views [19], and Feng et al. added a multi-view manifold regularization term into the subspace learning scheme with sparsity induced by the $\ell_{2,1}$-norm [20]. Along this line of research, Dong et al. proposed to learn a consensus similarity from initial view-specific graphs, together with a sparse projection [21]. By contrast, parallel models conduct sparse projection learning on each view and merge the selected features from all views. Representative methods include MVUFS [22], RMVFS [23], ASVW [24], and JMVFG [25]. MVUFS learns pseudo-labels with local manifold regularization and simultaneously selects discriminative features from each view via sparse projection onto the pseudo-label-indicating matrix [22]. RMVFS realizes view-specific sparse feature selection based on the multi-view k-means model [23]. ASVW encodes sparse-norm-regularized projection learning on each view into the consensus similarity learning scheme [24]. In view of the effectiveness of these methods, JMVFG gathers $\ell_{2,1}$-norm-regularized multi-view k-means, cross-view manifold regularization, and consensus graph learning into a unified model [25].
Though previous works have promoted the development of multi-view feature selection, they still retain certain limitations. For serial models, the complementary information between multiple views might not be well handled, especially when the views are concatenated into a single one. For parallel models, it is not easy to decide how many features are required from each view, which poses an obstacle to their practical application. Moreover, both serial and parallel models employ $\ell_1$- or $\ell_{2,1}$-norm regularization to impose feature-wise sparsity. This might affect the accuracy of feature selection, since the $\ell_1$- and $\ell_{2,1}$-norms merely fulfill approximate sparsity. More importantly, these methods are all designed in a strictly unsupervised setting and overlook available background prior information. Background prior information, such as label correlation, label proportion, and pairwise constraints, usually plays an integral role in boosting learning performance.
In this paper, we focus on utilizing pairwise constraints to guide the multi-view feature selection process. Specifically, pairwise constraints consist of must-link (ML) and cannot-link (CL) constraints. A must-link between two samples indicates that they belong to the same cluster; conversely, a cannot-link between two samples indicates that they belong to different clusters. Pairwise constraints can be obtained naturally in many applications [26,27,28]. For example, in image segmentation, must-links can be added between groups of pixels from the same object, while cannot-links can be inferred from two different image segments [27].
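As a minimal illustration (not part of the proposed model), pairwise constraints can be stored as index pairs and folded into an initial similarity matrix, e.g., by forcing must-linked pairs to share a positive edge weight and removing the edges of cannot-linked pairs; the function name and the weight `ml_weight` below are hypothetical choices:

```python
import numpy as np

def apply_pairwise_constraints(S, must_links, cannot_links, ml_weight=1.0):
    """Return a copy of the similarity matrix S with pairwise constraints imposed.

    S            : (n, n) nonnegative similarity matrix
    must_links   : iterable of (i, j) pairs known to be in the same cluster
    cannot_links : iterable of (i, j) pairs known to be in different clusters
    ml_weight    : edge weight assigned to must-linked pairs (hypothetical choice)
    """
    S = S.copy()
    for i, j in must_links:          # same cluster: enforce a strong edge
        S[i, j] = S[j, i] = ml_weight
    for i, j in cannot_links:        # different clusters: remove the edge
        S[i, j] = S[j, i] = 0.0
    return S
```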
To overcome the abovementioned issues, we propose a new weakly supervised multi-view feature selection method that utilizes pairwise constraints. Note that pairwise constraints are easy to incorporate into graph-based clustering methods. Thus, we formulate the multi-view feature selection model by simultaneously performing pairwise-constrained similarity learning and view-specific sparse projection. In this manner, both the pairwise constraint information and the local structure information can be fully explored to facilitate discriminative projection learning. To select an exact number of features, an $\ell_{2,0}$-norm-based row sparsity constraint is adopted instead of the $\ell_{2,1}$-norm used in most existing works. That is, the $\ell_{2,0}$-norm-based row sparsity constraint is imposed on the concatenated projection of all views, and the features corresponding to the non-zero rows of the concatenated projection matrix are selected. The resultant model is referred to as the pairwise constrained multi-view feature selection method, named PCFS for short hereinafter. The workflow of PCFS is shown in Figure 1.
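The selection rule implied by an $\ell_{2,0}$ row sparsity constraint reduces, after optimization, to keeping the features whose rows in the concatenated projection matrix are non-zero. A minimal sketch of this final step is given below, assuming the learned projection `W` stacks the view-specific projections by rows, one row per original feature; `W` and `n_selected` are illustrative names:

```python
import numpy as np

def select_features(W, n_selected):
    """Pick the features whose projection rows carry the most energy.

    W          : (d, c) concatenated projection matrix, where d is the total
                 number of features across all views; under an exact l2,0
                 constraint, only n_selected rows are non-zero after optimization
    n_selected : number of features to keep
    """
    row_norms = np.linalg.norm(W, axis=1)            # l2-norm of each row
    return np.argsort(row_norms)[::-1][:n_selected]  # indices of top rows

# Usage: idx = select_features(W, 100); X_reduced = X[:, idx]
```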
In summary, the major contribution of this paper is a new weakly supervised multi-view feature selection method. The proposed PCFS method makes full use of pairwise constraints, which are overlooked by existing multi-view feature selection methods, to boost the feature selection performance. More specifically, the contributions of this paper are as follows:
A common pairwise constrained similarity matrix for all views is adaptively learned, which exploits both the pairwise constraint information and local structure information to help select discriminative features.
View-specific projections are optimized under the $\ell_{2,0}$-norm-based cross-view sparsity constraint on the concatenated projection. In this way, both the consistent and complementary information of multiple views can be well captured.
An iterative algorithm was designed to optimize the objective function of the proposed PCFS method. Systematic experiments were conducted to verify its effectiveness.
The remainder of this paper is organized as follows. Related works and preliminary knowledge are reviewed in Section 2. The PCFS model, including its formulation and optimization, is described in Section 3. Section 4 presents the experimental studies conducted to evaluate the effectiveness of PCFS. In Section 5, PCFS is applied to two cancer datasets. Section 6 concludes this paper, and Section 7 puts forward future work.
4. Experiments
4.1. Experimental Datasets
In this part, we describe the experiments performed on real-world datasets. Specifically, six real datasets related to images or videos were selected to evaluate PCFS:
Caltech101-7 comes from the Caltech 101 image database and consists of 441 images from seven classes, associated with six views, namely, Gabor, WM, CENT, HOG, GIST, and LBP.
Caltech101-20 comprises 2386 images belonging to 20 classes. Gabor, WM, CENT, HOG, GIST, and LBP features were extracted as six views.
AR10P includes 130 images divided into 10 categories, each containing 13 instances. SIFT, HOG, LBP, wavelet texture, and GIST features were extracted as five views for the experiments.
Yale contains a total of 165 images from 15 volunteers, each with 11 images. Each person corresponds to one class, and SIFT, HOG, LBP, wavelet texture, and GIST features were used as five views.
YaleB is a dataset containing 2414 images from 38 classes. SIFT, HOG, LBP, wavelet texture, and GIST formed five views for the experiments.
Kodak is a consumer-domain video dataset whose samples all come from actual users. It contains 2172 samples divided into 21 categories. The edge orientation histogram, Gabor texture, and grid color moment features served as three views.
More details of the datasets are given in Table 2.
4.2. Experimental Setting
Baselines: We compared the clustering performance of PCFS with that of several representative multi-view feature selection methods. When determining the baseline methods, we mainly considered their representativeness, diversity, and novelty. First, to show whether feature selection boosted the clustering performance, k-means with all features (AllFea) was compared. Then, considering that the proposed PCFS method is a graph-based method, we compared it with the representative and prevalent graph-based methods ACSL [21], ASVW [24], and AMFS [19]. Moreover, to increase the type diversity of the baselines, RMVFS [23] was selected because it is a classical multi-view feature selection method based on pseudo-label learning in a k-means-like style. Finally, considering novelty and timeliness, we compared PCFS with JMVFG [25], a recent multi-view feature selection method that combines pseudo-label learning and graph learning to facilitate feature selection. Overall, AllFea, ACSL, ASVW, AMFS, RMVFS, and JMVFG were selected as baselines.
Evaluation metrics: We employed the standard metrics clustering accuracy (ACC) and normalized mutual information (NMI) for performance comparison. ACC uses a permutation mapping function to measure the proportion of correctly clustered samples among all samples and is widely used to evaluate clustering performance [38]. NMI is a normalized form of mutual information (MI) using a geometric mean; it provides a credible indication of the information shared between two clustering partitions without being influenced by the arrangement of cluster labels [39]. Another related criterion is the variation of information (VI), which, like MI, is derived from information-theoretic principles and equals the sum of the conditional entropies between a pair of clusterings [40]. The value of NMI ranges from 0 to 1, whereas VI is bounded by an upper bound that depends on the number of clusters; therefore, NMI is more suitable than VI as a comparison metric in our setting. ACC and NMI are the most commonly used metrics for evaluating clustering results; for example, in [41,42,43,44], only NMI and ACC were used as evaluation metrics. For both metrics, larger values indicate better performance.
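For reference, both metrics can be computed as follows; this is a standard sketch (using SciPy's Hungarian algorithm for the permutation mapping and scikit-learn's geometric-mean NMI), not code from the paper:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of samples correctly clustered under the best
    one-to-one mapping between predicted and true cluster labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1   # labels assumed 0-indexed ints
    cost = np.zeros((k, k), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)  # maximize total agreement
    return cost[rows, cols].sum() / y_true.size

def nmi(y_true, y_pred):
    """NMI with geometric-mean normalization, as described above."""
    return normalized_mutual_info_score(y_true, y_pred,
                                        average_method="geometric")
```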
Parameter setting: In our work, the number of neighbors was set to five whenever a kNN graph was involved in a method. For PCFS, the regularization parameters were chosen from candidate grids. For each experiment, the same set of key pairwise constraints, consisting of 80 cannot-link and 240 must-link constraints, was used in each outer loop. In PCFS, the total reduced dimensionality was selected empirically. For a fair comparison, all the vital parameters of the baselines were set as suggested in their original papers. The number of selected features was set to $\lfloor r \cdot d \rfloor$, where d is the total number of features, r is the percentage of selected features, traversed from 0.1 to 0.3 with 0.02 as the interval, and $\lfloor x \rfloor$ is the floor function, which returns the largest integer not greater than x.
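As a small worked example of this setting (the dimensionality d below is illustrative, not taken from Table 2):

```python
import math

d = 3766                                                  # hypothetical total number of features
ratios = [round(0.1 + 0.02 * i, 2) for i in range(11)]    # 0.10, 0.12, ..., 0.30
n_selected = [math.floor(r * d) for r in ratios]          # floor(r * d) per ratio
# e.g., r = 0.10 -> 376 features; r = 0.30 -> 1129 features
```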
Key pairwise constraint selection: Since labelling excessive constraints is labor-intensive, many constraint selection methods have been proposed [45,46]. We instead adopted an efficient scheme [31] designed for semi-supervised graph clustering. Considering that S is sparse and each sample is only connected to its k nearest neighbors in the feature space, it is easy to obtain intuitive but rough "neighbors". On this basis, we (1) obtained must-link constraints by querying the "non-neighbors" of each sample and (2) obtained cannot-link constraints by querying the "neighbors" of each sample. In this way, we could find the key pairwise samples in the original feature space that are easily mislinked or left unlinked, as sketched below.
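The following is a minimal sketch of this querying scheme. It assumes an oracle (here, ground-truth labels held out for supervision) answers each query; the function name, the query budget, and the use of scikit-learn's NearestNeighbors are our own illustrative choices, not the exact procedure of [31]:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_key_constraints(X, labels, k=5, n_ml=240, n_cl=80, seed=0):
    """Query pairs that the kNN graph in feature space is likely to get wrong.

    Must-links are drawn from same-class "non-neighbors" (pairs easily left
    unlinked); cannot-links from different-class "neighbors" (pairs easily
    mislinked). `labels` stands in for the constraint oracle.
    """
    rng = np.random.default_rng(seed)
    knn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    nbrs = knn.kneighbors(X, return_distance=False)[:, 1:]  # drop self-match

    must_links, cannot_links = [], []
    n = X.shape[0]
    for _ in range(200_000):  # query budget guard
        if len(must_links) >= n_ml and len(cannot_links) >= n_cl:
            break
        i, j = rng.integers(n), rng.integers(n)
        if i == j:
            continue
        is_neighbor = j in nbrs[i] or i in nbrs[j]
        same_class = labels[i] == labels[j]
        if same_class and not is_neighbor and len(must_links) < n_ml:
            must_links.append((i, j))        # easily unlinked pair
        elif not same_class and is_neighbor and len(cannot_links) < n_cl:
            cannot_links.append((i, j))      # easily mislinked pair
    return must_links, cannot_links
```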
4.3. Evaluations on Real Datasets
To alleviate the effect of the random initialization of k-means centers, we predefined the cluster centers by collecting the best features selected each time, and each experiment was repeated 20 times. The mean ACC and NMI scores on the real datasets are reported in Figures 2 and 3. It can be seen that the ACC and NMI scores grew, with fluctuations, as the number of selected features increased; after reaching their peak, they decreased slowly as the number of selected features continued to increase. This is consistent with intuition. Comparing the scores of the different methods shows that PCFS achieved a better performance than the baseline methods in most cases.
Furthermore, Table 3 lists the evaluation scores of the different multi-view feature selection methods on the six real datasets at a fixed percentage of selected features. The proposed method obtained the highest ACC and NMI scores in most cases. Moreover, we utilized Student's t-test to verify the statistical significance of the results: the symbols "*" and "†" indicate a significant improvement and reduction, respectively, of PCFS over the corresponding baseline method at the 5% significance level. As shown in Table 3, on Kodak, Caltech101-20, and Caltech101-7, PCFS had slightly lower evaluation scores than ACSL at this fixed percentage. However, as Figures 2 and 3 show, with the increase in r, the evaluation scores of PCFS exceeded those of ACSL and were more robust on the whole. The results of the Student's t-test also show that the improvement of PCFS over ACSL was significant as r varied from 0.1 to 0.3. Overall, the robust and high performance of PCFS is fully demonstrated in Table 3 and Figures 2 and 3.
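The significance test can be reproduced along these lines; this sketch assumes the 20 repeated-run scores of two methods are paired by run, and the 0.05 threshold mirrors the common convention (the paper's exact test configuration, e.g., paired vs. unpaired, is not specified here):

```python
from scipy import stats

def significance_marker(scores_pcfs, scores_baseline, alpha=0.05):
    """Return '*' if PCFS is significantly better than the baseline,
    '†' if significantly worse, and '' otherwise (paired Student's t-test)."""
    t, p = stats.ttest_rel(scores_pcfs, scores_baseline)
    if p < alpha:
        return "*" if t > 0 else "†"
    return ""
```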
4.4. Convergence Analysis
In this section, we describe an empirical convergence analysis on the real datasets. According to the theoretical analysis in the Methodology section, the monotonicity of each step of our optimization algorithm is guaranteed. The convergence curves of PCFS are illustrated in Figure 4. The objective value decreased rapidly and monotonically as the number of iterations grew, and the convergence curves became stable within about 10 iterations. This fast convergence ensures the optimization efficiency of PCFS.
4.5. Parameter Analysis
First, let us discuss the effects of the parameters theoretically. The must-link weight parameter is a constant that sets the weights of the edges between instances that must be linked. As the connectivity of a graph is independent of its edge weights [31], the value of this parameter has no direct impact on our model's performance. With regard to the graph-rank parameter, it has been established that a sufficiently large value guarantees that S is an ideal similarity matrix; how large a value is large enough needs to be examined experimentally. As for the cannot-link parameter, as it tends to infinity, the pairwise constraint regularization term approaches zero and all cannot-link constraints will be satisfied. However, overly strict cannot-link regularization leads to very meticulous spectral clustering results, making it more difficult to select the important and vital features, which in turn affects the accuracy of the subsequent k-means clustering with the selected features. Therefore, the cannot-link parameter should be appropriate, i.e., neither too small nor too large.
Then, as emphasized above, the experiments designed to analyze the influence of the graph-rank and cannot-link parameters show that their variability has an important impact on the performance of our model. Specifically, the performance of our method when varying these two parameters with the must-link weight fixed is illustrated in Figure 5. As shown there, the performance of the proposed method fluctuated with some hidden regularity, which suggests that the cannot-link parameter needs to be selected carefully and delicately and that the graph-rank parameter should not be too large. Small changes in the cannot-link parameter cause the pairwise constraints to have a drastic impact on the results. Within a certain range, increasing the graph-rank parameter raises the evaluation scores, but doing so becomes counterproductive beyond this range. The selection requirements of these two parameters actually reflect the requirements of two different constraints on the graph, and our method needs to find a balance point between them to obtain high-quality performance on most of the datasets.
4.6. Ablation Study
This section describes the ablation analysis of the proposed PCFS method. There are two regularization terms in the objective function of PCFS: the cannot-link regularization, which involves the cannot-link indicator matrix, and the Laplacian rank regularization, which involves the graph Laplacian of the learned similarity matrix. We compared PCFS with its variants obtained by removing the cannot-link regularization or the Laplacian rank regularization. The results are shown in Table 4. The complete PCFS method, using all regularization terms, achieved average ACC and NMI scores (over the six datasets) of 62.86% and 64.63%, respectively.

On average, when the cannot-link regularization term was removed, the corresponding ACC and NMI scores were the lowest, which demonstrates from the opposite side that pairwise constraints are beneficial for multi-view feature selection. From the per-dataset results, it can be seen that using only the Laplacian rank regularization yielded lower scores than the model without any regularization terms. A possible reason is that excessive attention to the graph Laplacian rank led to inaccurate similarity learning. The utilization of pairwise constraints helped to prevent such deterioration, and thus PCFS achieved a better performance.
4.7. Discussion
We analyzed the performance of PCFS from two aspects. First, we compared PCFS with six baseline methods on six widely used multi-view datasets; the comparison results demonstrate the effectiveness of the proposed PCFS method. Second, we conducted an ablation experiment to analyze the function of each part of the objective of PCFS; the results show that the proper integration of pairwise constraints and graph-rank regularization helps to improve the multi-view feature selection performance.
In other published papers, Refs. [31,47] integrated pairwise constraints into single-view clustering methods and concluded that including pairwise constraint information considerably improves the clustering performance. Ref. [48] also introduced pairwise constraints into multi-view data and obtained consistent results. In our work, pairwise constraints were introduced into similarity graph learning and combined with $\ell_{2,0}$-norm-based row-sparse projections to fulfill the multi-view feature selection task. Our work is thus a fusion and extension of existing work and, theoretically, has the ability to improve the clustering performance. The final experimental results demonstrate that our findings are in line with those of previous related works: pairwise constraints helped the multi-view feature selection and improved its clustering performance.
5. Application on Cancer Datasets
Two multi-omics cancer datasets obtained from The Cancer Genome Atlas (TCGA) were used for the evaluation.
BRCA (breast invasive carcinoma) [49] consists of 398 samples belonging to four categories. It has four different views, namely, gene expression (RNA), mitochondrial DNA (mDNA), microRNA expression (miRNA), and reverse phase protein array expression (protein), which are measured on different platforms and contain different biological information.
CESC (cervical carcinoma) [50] consists of 128 samples from three different categories. As with BRCA, gene expression (RNA), mitochondrial DNA (mDNA), microRNA expression (miRNA), and reverse phase protein array expression (protein) are measured on different platforms and contain different information, naturally forming four views for the experiments.
The experimental setting was kept consistent with Section 4. Using the selected features to predefine the cluster centers, we obtained the performance of each method on the multi-omics cancer datasets. According to the average ACC and NMI scores shown in Figures 6 and 7, our proposed method achieved excellent performance on these two datasets. It should be noted that the BRCA and CESC datasets were preprocessed with different norm-based normalizations; with either preprocessing, the proposed method showed robust and excellent performance on both datasets. More details are given in Table 5.