Article

Data Analysis for Information Discovery

Department of Electrical and Information Engineering, Politecnico di Bari, 70125 Bari, Italy
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(6), 3481; https://doi.org/10.3390/app13063481
Submission received: 19 January 2023 / Revised: 26 February 2023 / Accepted: 6 March 2023 / Published: 9 March 2023
(This article belongs to the Topic Data Science and Knowledge Discovery)

Abstract

Artificial intelligence applications are becoming increasingly popular and are producing better results in many areas of research. The quality of the results depends on the quantity of data and on its information content. In recent years, the amount of available data has increased significantly, but more data does not always mean more information and therefore better results. The aim of this work is to evaluate the effects of a new data preprocessing method for machine learning. This method was designed for sparse matrix approximation, and it is called semi-pivoted QR approximation (SPQR). To the best of our knowledge, it has never been applied to data preprocessing in machine learning algorithms. This method works as a feature selection algorithm, and in this work, an evaluation of its effects on the performance of an unsupervised clustering algorithm is proposed. The obtained results are compared to those obtained using principal component analysis (PCA) as the preprocessing algorithm. Both methods were applied to various publicly available datasets. The results show that the SPQR algorithm can achieve results comparable to those obtained using PCA without introducing any transformation of the original dataset.
Keywords:
data analysis; PCA; SPQR; FCM

1. Introduction

Considering generic systems (natural or artificial phenomena, processes, etc.) and the relationships existing among their inputs and outputs, such systems can be represented physically, mathematically, or logically. In any case, they establish relationships among data (inputs and outputs).
In many phenomena, the inputs to the system can be modelled as random variables. A mathematical model is a quantitative representation of a natural phenomenon. Like all other models used in science, its purpose is to represent, as incisively as possible, a given object, real phenomenon, or set of phenomena (a mathematical model of a physical, chemical, or biological system).
Often, the model is an approximate representation of reality, but it is nonetheless meaningful for conducting analysis or prognosis.
Mathematical models are widely used in all areas of science. There are various mathematical tools in use from combinatorics to infinitesimal calculus. For example, for many phenomena, a very concise and intuitive description can be formulated immediately through differential equations.
For the sake of simplicity, we will speak of data input and data output from the model, regardless of the nature of the characteristic transfer function (e.g., linear dynamic systems, etc.).
In the present case, we consider the input and output data as discrete random variables, i.e.,
p(x) = P(X = x)
A random phenomenon (a phenomenon that is characterizable by a random variable) can be described in terms of the probability distribution and its parameters, such as the expected value and variance.
Variance plays a central role in the study of models; consider, for example, the KLT (Karhunen–Loève transform) [1], PCA, and its variants.
PCA is often used for dimensionality reduction. This process aims to reduce a large number of variables describing a dataset to a smaller number of latent variables, limiting the loss of information as much as possible [2]. In [3], the authors used PCA in a computational chain to identify and remove noisy data from a dataset before addressing the classification problem of an artificial neural network (ANN). In [4], the authors used PCA in a method to improve image classification performance, exploring the use of collective class characteristics to establish a statistically weighted algorithm and combining this weight with PCA to enhance the discrimination ability.
From this point of view, it is possible to expect that feeding a model either the original dataset or a dataset reduced using PCA will produce almost the same results.
To test this hypothesis, in this work, a widely used algorithm (PCA) and an algorithm under our evaluation (semi-pivoted QR approximation, SPQR [5,6]) have been used to reduce the dimensionality of the original data of various publicly available databases [7]. On the other hand, manipulating input data is one of the factors that can introduce uncertainty into numerical models. This is an interesting problem that is studied in the field of uncertainty analysis, which aims at quantifying the variability of the output that is due to the variability of the input. Some interesting approaches to this problem can be found in [8,9].
The SPQR method was presented in [5] and, to the best of our knowledge, it has never been applied to dataset dimensionality reduction for machine learning algorithms.
Fuzzy clustering and silhouette analysis were used to compare the results. The obtained performances were compared with one another, showing strong variations among the methods and highlighting the strong impact that preprocessing techniques can have on the performance of machine learning algorithms.
As a further development, the information loss incurred when using the well-known PCA is evaluated against that of the SPQR algorithm. This analysis was carried out on 10 public databases from the UCI repository [7], and for each database, the following procedure was run:
  • Fuzzy clustering on the raw database and performance evaluation with silhouette analysis;
  • Fuzzy clustering on the preprocessed database with PCA and performance evaluation with silhouette analysis;
  • Fuzzy clustering on the preprocessed database with SPQR and performance evaluation with silhouette analysis;
  • All the previous tests performed on normalized data.
The remaining part of this paper is organized as follows: Section 2 reports a brief overview of related works, Section 3 describes the ten databases used for the tests, and Section 4 and Section 5 describe, respectively, the clustering and silhouette algorithms. The SPQR and PCA algorithms are described in Section 6 and Section 7, respectively. Experiments and results are reported in Section 8, and conclusions and final remarks are in Section 9.

2. Related Works

Modern applications generate large amounts of data that are not always relevant. The transition from data to information is therefore a topic of great interest. Many machine learning algorithms try to increase the density of the knowledge extracted from these data: the goal is to end up with less data than in the original dataset but with the same information content, or with minimal information loss. To improve the computational efficiency of these algorithms, data preprocessing techniques are used to filter the data. Typically, statistics-based algorithms such as PCA are used.
Formally, PCA is a statistical technique for reducing the dimensionality of a dataset, linearly transforming the data into a new coordinate system, and preserving the maximum amount of information, often used to enable visualization of multidimensional data [10].
In our opinion, these techniques have a strong impact on the final performance of these algorithms; therefore, we believe that push-button solutions should give way to more accurate analyses. The remaining part of this article will also focus on the limitations and problems encountered in these applications.
Typically, PCA is used assuming that the relationships between variables are linear and comparable in terms of magnitude, i.e., they are scaled numerically. Several dimension reduction techniques have been introduced to handle nonlinear relationships, such as Isomap [11], locally linear embedding (LLE) [12], Hessian LLE [13], Laplacian eigenmaps [14], and its variants [15], including kernel PCA [16].
These methodologies discover the inherent geometric structure of high-dimensional data. In [17], the authors showed that high-dimensional spaces are sparse and suffer from distance concentration.
This challenge makes the discovery of the intrinsic geometric structure nontrivial. In nonlinear dimension reduction (NDR), distance covariance is used for nonlinear relations, while the sparsity of high-dimensionality space is addressed by evaluating group dependencies.
To solve the problem of computational disadvantages and imperfect estimations in large-scale scenarios, divide-and-conquer mechanisms have been proposed for dimensionality reduction in [18,19,20,21]. In these works, the authors have shown that, in many cases, well-known application-specific relations and natural groupings can be exploited for efficient size reduction.
However, such approaches [18,19,20,21], which likewise follow a divide-and-conquer scheme, require prior knowledge of these relationships to enable the grouping, which may not be available in big data scenarios.
Moreover, in traditional dimensionality reduction approaches [11,12,13,14,15,16,17,18,19,20,21,22,23,24], the basic principle is to perform a one-step mapping from a higher-dimensional space to a lower-dimensional space. There is no mechanism for incorporating new information into the analysis structure when new data become available. In these scenarios, the dimension reduction would have to be recalculated [22,23,24], which is inefficient.
Although parameter aggregation methods facilitate updating the Pearson correlation and/or covariance [23] together with the corresponding singular value decomposition (SVD) [25], they focus on computational efficiency when all data are available.
However, unlike in [18,19,20,21], NDR allows for SVD results to be used to perform group organization. Moreover, if information on such relationships is available, the generic organizational structure proposed here allows for this information to be incorporated at the first step.
Moreover, such a one-step approach is computationally expensive when the amount of original data is very large. It is also susceptible to improper estimation due to the presence of large amounts of noise and redundancy [26].
Dimensionality reduction approaches require the practitioner to specify the number of dimensions to be extracted from the data. In some algorithms, the user simply specifies the percentage of information to be retained, and the algorithm estimates the number of dimensions to extract. Unfortunately, in big data scenarios these methods can be computationally inefficient.
The algorithm studied and proposed in this paper, called "semi-pivoted QR approximation" (SPQR), is an efficient deterministic method to reduce a given matrix to its most important columns by estimating their significance and specific information content. It was introduced by Stewart [5,27].
As the name highlights, the approach is based on the QR decomposition, which expresses a matrix A as the product of an orthogonal matrix Q and an upper triangular matrix R. These factors are obtained by the Gram–Schmidt algorithm, which orthonormalizes the columns of A one at a time, from first to last. In practice, pivoted QR is preferred: it differs in that, at the beginning of each new step, the procedure exchanges columns so as to take the largest remaining column [6]. In this way, a permutation matrix P is built such that:
A·P = Q·R

3. Database Description

All the experiments carried out in this work used public databases downloaded from [7]. This repository contains a large number of datasets. In this work, all the experiments were carried out on ten datasets. These datasets differ greatly from one another in many aspects: number of features, number of instances, semantic content, etc. This variety improves the generalizability of the obtained results.
In particular, the used databases are:
  • Gender Gap in Spanish WP Dataset [28]: Dataset used to estimate the number of women editors and their editing practices in the Spanish Wikipedia. It is composed of 21 attributes and 4746 instances;
  • TUANDROMD (Tezpur University Android Malware Dataset) Dataset [29]: This dataset contains 4465 instances and 241 attributes. The target attribute for classification is a category (malware vs. goodware);
  • Room Occupancy Estimation Dataset [30]: This dataset contains 10,129 instances and 16 attributes, and it is used to estimate the occupancy level of a room. The setup consisted of seven sensor nodes and one edge node in a star configuration, with the sensor nodes transmitting data to the edge node every 30 s using wireless transceivers. Each sensor node contained various sensors such as temperature, light, sound, CO2, and digital passive infrared (PIR);
  • Myocardial Infarction Complications Dataset [31]: This dataset contains 1700 instances and 124 attributes. The main application of this database is to predict complications of myocardial infarction based on information about the patient at the time of admission and on the third day of the hospital period;
  • Wine Dataset [7]: This dataset is used to recognize a specific wine using the concentration of 13 chemical parameters. It is composed of 13 features and 178 instances;
  • Dry Bean Dataset [32]: This dataset contains 16 visual features extracted from 13,611 images of grains of 7 different dry beans. The features comprise 12 dimension measures and 4 shape forms;
  • APS Failure at Scania Trucks Dataset [7]: This dataset contains 60,000 instances and 171 features. The dataset consists of data collected from heavy Scania trucks in everyday usage. The system in focus is the air pressure system (APS), which generates pressurized air that is utilized in various functions in the truck, such as braking and gear changes;
  • Multiple Features Dataset [7]: This dataset consists of the features of handwritten numerals (‘0′–‘9′) extracted from a collection of Dutch utility maps. It is composed of 649 features and 2000 instances;
  • Relative Location of CT Slices on Axial Axis Dataset [7]: This dataset consists of 384 features extracted from 53,500 CT images. The class variable is numeric and denotes the relative location of the CT slice on the axial axis of the human body;
  • Mice Protein Expression Dataset [33]: This dataset contains 1080 instances and 82 features. The dataset consists of the expression levels of 77 proteins/protein modifications that produced detectable signals in the nuclear fraction of the cortex.
In all the experiments, for each database, only the numerical features have been considered.

4. Clustering

The term “clustering” refers to a wide family of learning algorithms that comprise unsupervised, supervised [34], and semi-supervised [35] learning techniques. These algorithms implement different approaches (hierarchical, partitional, grid, density, and model based [34]) to solve the same classification problem, namely, grouping objects according to a given similarity criterion.
In this work, the authors used the fuzzy C-means (FCM) algorithm [36,37]. This clustering algorithm generates fuzzy partitions and prototypes for any set of numerical data. It works by optimizing a generalized least squares objective function:
Q = \sum_{i=1}^{c} \sum_{k=1}^{N} u_{ik}^{m} \, \| x_k - v_i \|^2
where
  • ||·|| is a distance function, such as the Euclidean or Mahalanobis distance;
  • v1, v2, …, vc are the centroids of the clusters, also called prototypes;
  • X = {x1, x2, …, xN} is the set of points to be clustered;
  • U = [uik] is the partition matrix;
  • c is the number of clusters;
  • N is the number of points to be clustered;
  • i is an index that varies from 1 to c;
  • k is an index that varies from 1 to N;
  • “m” is a coefficient called the “fuzzification coefficient”. It is greater than 1, and it is responsible for the level of “fuzziness” of the partition matrix. In other words, it controls the level of fuzziness with which each point belongs to the various clusters.
This algorithm allows for the discovery of the inner structure of a given dataset X, minimizing the objective function Q according to the given prototypes and the partition matrix U. The minimization process is iterative: it works by updating the values of the partition matrix and the prototypes until a given stopping criterion is reached.
For example, considering two partition matrices, U and U’, obtained in two consecutive iterations, the procedure may stop when the quantity:
||U − U′|| = max_{i,k} |u_{ik} − u′_{ik}|
gets smaller than some predefined positive threshold ε.
It should be noted that the nature of the optimization function leads to a solution that somehow reflects the geometry of the original dataset.
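To make the procedure concrete, the following minimal NumPy sketch implements the textbook FCM updates with the Euclidean distance and the stopping criterion above; the function fcm and its default parameters are our own illustrative choices, not the exact code used in this work.

import numpy as np

def fcm(X, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Minimal fuzzy C-means sketch: X has shape (N, d); returns the
    prototypes V (c, d) and the partition matrix U (c, N)."""
    rng = np.random.default_rng(seed)
    U = rng.random((c, X.shape[0]))
    U /= U.sum(axis=0)                                # memberships of each point sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # prototype (centroid) update
        # squared distances ||x_k - v_i||^2, shape (c, N), floored to avoid division by zero
        D = np.fmax(((X[None] - V[:, None]) ** 2).sum(-1), 1e-12)
        U_new = 1.0 / ((D[:, None] / D[None]) ** (1.0 / (m - 1))).sum(axis=1)
        if np.max(np.abs(U_new - U)) < eps:           # the ||U - U'|| stopping criterion
            return V, U_new
        U = U_new
    return V, U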
On the other hand, the hidden structure of the dataset, discovered by the clustering algorithm, can be used to classify other elements added to the set after the initial clustering. This task can be accomplished by applying the following rules:
  • The “anchor points” of the classifier are the prototypes of the clusters;
  • Each cluster defines a class;
  • A point x belongs to a class defined by the cluster with prototype vj if:
j = \arg\min_i \| x - v_i \|^2
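A minimal sketch of this nearest-prototype rule, reusing the prototypes V returned by the fcm sketch above (classify is our own illustrative helper):

import numpy as np

def classify(x, V):
    """Assign a point x (shape (d,)) to the class of the nearest prototype,
    i.e., j = argmin_i ||x - v_i||^2."""
    return int(np.argmin(((V - x) ** 2).sum(axis=1)))

In this way, new points can be labelled without re-running the clustering, e.g., label = classify(x_new, V).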

5. Silhouette Method for Clustering Evaluation

The performance evaluation of a clustering algorithm is not a trivial task. Indeed, when using a clustering algorithm, it is possible to be in one of the following situations:
  • The correct solution is known: in this case, the classification of each point is known a priori, and the classification performance of the clustering algorithm can be computed by counting the number of misclassified patterns (error rate);
  • The correct solution is subjective: in this case, there is no ground truth against which to evaluate the results of the clustering algorithm. Since the classification is subjective, there is no universally acceptable solution, so the classification task falls into the problem of the semantic gap [38];
  • The correct solution is unknown: also in this case, there is no ground truth against which to evaluate the results of the clustering algorithm. To face this problem, there are various approaches, as reported in [39]. On the other hand, in recent years, the silhouette parameter has become one of the most commonly used methods for assessing the quality of clusters [40].
In this work, the silhouette parameter has been used to evaluate the clustering performance. This method is based on a comparative evaluation of the similarity of each object to its own cluster (tightness) and to the other clusters (separation). The parameter is defined as follows: if y is a point belonging to cluster A, then t(y) is the mean distance between y and all other points of A. Let us now consider any cluster B different from A and compute the average distance d(y, B) between y and all points of B. Once d(y, B) has been computed for each cluster B ≠ A, we select the smallest of these values and denote it by:
v(y) = \min_{B \neq A} d(y, B)
Starting from these considerations, the silhouette for the point y is defined as shown in the following formula:
s(y) = \begin{cases} 1 - \dfrac{t(y)}{v(y)} & \text{if } t(y) < v(y) \\ 0 & \text{if } t(y) = v(y) \\ \dfrac{v(y)}{t(y)} - 1 & \text{if } t(y) > v(y) \end{cases}
From this definition, it follows that for each point y in the dataset:
−1 ≤ s(y) ≤ +1
An in-depth analysis of this parameter is reported in [40].
On the other hand, this method makes it possible to present the obtained results in a graphical representation highlighting, for each point, how well it has been classified. Furthermore, the method is flexible: any distance metric can be used, such as the Mahalanobis, Euclidean, or Manhattan distance.
All these features make the silhouette parameter strongly appealing when the performance of a clustering algorithm must be evaluated on datasets for which there is no a priori knowledge (as is the case in this paper).
Another interesting application of the silhouette method is as a reference guide in discovering the optimal number of clusters for a given dataset. This task can be carried out by implementing an iterative process composed of the following steps:
  • Run the clustering algorithm using a certain number of clusters “C”;
  • Compute the silhouette for the obtained clusters.
If there is a satisfactory number of points with a good silhouette level, then "C" can be considered a good number of clusters. Otherwise, change the value of "C" and return to Step 1.
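A minimal sketch of the evaluation step, using scikit-learn's silhouette_samples and assuming that hard labels are obtained as the argmax of the fuzzy membership matrix U; silhouette_report is our own illustrative helper, and its thresholds mirror the tables reported later in this paper:

import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_report(X, U, thresholds=(0.7, 0.8, 0.9)):
    """Percentage of points whose silhouette exceeds each threshold."""
    labels = np.argmax(U, axis=0)          # hard assignment from fuzzy memberships
    s = silhouette_samples(X, labels)      # one s(y) per point, in [-1, +1]
    return {t: 100.0 * float(np.mean(s > t)) for t in thresholds}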

6. SPQR Algorithm

Many data applications deal with the necessity to represent m objects characterized by n features. One of the most often used representations of this kind of data is a matrix A composed of m rows and n columns. In modern applications of data analysis, such as image analysis and environmental datasets, these matrices often have high dimensionalities. This fact leads to a series of difficulties in the processes of data mining, representation, communication, and storage.
In recent years, many works in the field of feature selection [41] have demonstrated that, by analyzing a dataset, it is possible to identify and eliminate redundant and/or irrelevant features. Applying feature selection methods brings a series of advantages to data analysis processes: a decrease in the amount of data, an improvement in prediction accuracy, the extraction of important features, an easier understanding of the attributes or variables, and, finally, a reduction in execution time [41]. An interesting overview of feature selection methods can be found in [41].
Many of these methods aim to approximate the matrix A by means of a "smaller" matrix obtained by combining its columns and rows. The drawback of these methods is that they usually yield dense factorizations and, more seriously, terms that are often much harder to interpret than the original ones. For example, truncating the SVD at k terms is one of the most common ways to obtain the "best" rank-k approximation of A with respect to any unitarily invariant matrix norm. The drawback of this method is that it produces a representation of the dataset that is difficult to relate to the original dataset and to the processes that generated it. The same drawbacks are present in another widely used method for feature selection: principal component analysis (PCA).
Starting from these considerations, many methods to solve the column subset selection problem have been developed [42]. These methods try to find a subset of k actual columns of A, with k less than n, which “captures” most of the information of A with respect to the spectral or the Frobenius norm.
Essentially two classes of methods may be defined:
  • Randomized methods: these methods select the most representative columns in a matrix using probability distributions;
  • Deterministic methods: these methods select the columns in a deterministic way.
An effective deterministic method to reduce the matrix A to its most important columns is the "semi-pivoted QR approximation" (SPQR) by Stewart [5,27,43]. As its name suggests, the key approach is the QR decomposition of A into an orthogonal matrix Q and an upper triangular matrix R. These factors are computed by the Gram–Schmidt algorithm, which orthonormalizes the columns of A one at a time, from first to last. In many situations, pivoted QR is preferred: in practice, columns are exchanged at the start of each new stage so as to take the largest remaining column. In this way, a permutation matrix P is built such that:
A·P = Q·R
When A is rank deficient, the column pivoting A·P is applied to improve the numerical accuracy. Moreover, the choice of P guarantees that the magnitudes of the diagonal entries of R are nonincreasing, a specific feature that is useful in the following. In more detail, we may partition the expression above as:
A \cdot P = \begin{bmatrix} B_1 & B_2 \end{bmatrix} = \begin{bmatrix} Q_1 & Q_2 \end{bmatrix} \begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}
The following properties hold:
  • B1 = Q1·R11
  • ||B2 − Q1·R12|| = ||R22||
The SPQR algorithm exploits these results and uses the approximation:
A·P ≈ Q1·[R11 R12]
which, thanks to Property 1 above, reproduces the first k columns of A·P exactly, while introducing a quantifiable error (Property 2) on the remaining ones. An additional strength of this method is that the explicit computation of the nonsparse orthogonal matrix Q1 is not required.
In practice, given a rank parameter k, the SPQR algorithm returns k columns of A whose span approximates the column space of A; they form the matrix B1 of dimension m × k, while the factor R11 contains the coefficients of the column orthogonalization.
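As a sketch of this column selection, SciPy's pivoted QR can stand in for Stewart's quasi-Gram–Schmidt procedure (the true SPQR algorithm additionally avoids forming the nonsparse factor Q explicitly); spqr_select is our own illustrative helper:

from scipy.linalg import qr

def spqr_select(A, k):
    """Pick the k most significant columns of A via pivoted QR: the
    permutation ranks the columns by the size of their residual norms."""
    _, _, piv = qr(A, mode='economic', pivoting=True)
    cols = piv[:k]                  # indices of the k selected columns of A
    return A[:, cols], cols

The selected columns form the matrix B1 discussed above and, unlike the components produced by PCA, they are actual features of the original dataset.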

7. PCA

Principal component analysis (PCA) is a well-known method often used to reduce the dimensionality of large databases. Reducing the dimensions of a dataset introduces a loss of accuracy but, on the other hand, improves the efficiency of data analysis algorithms for tasks such as data exploration, data visualization, and machine learning. From this perspective, when using PCA, a trade-off between performance improvement and accuracy loss must be found.
In [22], there is an in-depth analysis of the PCA method, while in this section, a brief operative description is proposed.
Essentially, PCA can be seen as a five-step process (a minimal code sketch is given after the list):
  • Standardization: In this step, the range of each initial variable (each column of the dataset is a variable) is standardized so that each variable contributes equally to the overall analysis;
  • Covariance matrix computation: The goal of this step is to compute the degree of the relationship among the variables of the dataset;
  • Identification of the principal components: These are new variables obtained as linear combinations or mixtures of the initial variables. These new variables have the following properties: they are uncorrelated, and most of the information contained in the initial variables is compressed into the first components. This allows for dimensionality reduction without losing too much information, discarding the components with low information content and keeping the remaining components as the new variables. To find these principal components, the eigenvectors and eigenvalues of the covariance matrix are computed, because the eigenvectors are the directions of the axes along which there is the most variance (the most information);
  • Feature vector selection: In this step, the eigenvectors are sorted in descending order of their eigenvalues, which ranks the principal components in order of significance. In this way, it is possible to choose the number of components to keep for further analysis, discarding those of lesser significance (with low eigenvalues);
  • Recast the data along the principal component axes: Once the principal components have been chosen, the original dataset is reprojected onto them.
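A minimal sketch of these five steps with scikit-learn (pca_reduce is our own illustrative helper; passing a float n_components in (0, 1) makes the library choose the number of components needed to retain that fraction of the variance):

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_reduce(X, retained=0.98):
    """Standardize X, then keep enough principal components to retain the
    requested fraction of the total variance."""
    Xs = StandardScaler().fit_transform(X)   # step 1: standardization
    pca = PCA(n_components=retained)         # steps 2-4: covariance, eigendecomposition, ranking
    return pca.fit_transform(Xs)             # step 5: recast onto the principal axes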

8. Experiments and Results

In this section, a brief description of the experiments carried out is reported. For each database described in Section 3, a clustering analysis was conducted using the FCM algorithm, varying the number of clusters from 10 to 50. The results of each clustering were evaluated using the silhouette parameter. For each database, the percentages of points with a silhouette level greater than 0.7, 0.8, and 0.9 are reported. These analyses were carried out six times on each database (a sketch of the whole pipeline is given after the list) using:
  • The raw data (namely, the original data stored in each database);
  • The dataset reduced using the PCA algorithm, losing less than 2% of the total variation in the dataset;
  • The dataset reduced using the SPQR algorithm, setting the same number of features used with PCA;
  • The raw data normalized between 0 and 1;
  • The normalized dataset reduced using the PCA algorithm, losing less than 2% of the total variation in the dataset;
  • The normalized dataset reduced using the SPQR algorithm, setting the same number of features used with PCA.
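A sketch of the whole experimental loop, built on the illustrative helpers fcm, silhouette_report, pca_reduce, and spqr_select defined in the previous sections (min_max is a further illustrative helper for the [0, 1] normalization):

import numpy as np

def min_max(X):
    """Normalize each feature of X to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

def run_all(X, cluster_counts=(10, 20, 30, 40, 50)):
    """Run the six configurations on one dataset; collect silhouette percentages."""
    results = {}
    for name, data in (("raw", X), ("normalized", min_max(X))):
        Xp = pca_reduce(data)                      # PCA, <2% of total variation lost
        Xs, _ = spqr_select(data, k=Xp.shape[1])   # SPQR with the same feature count
        for variant, D in ((name, data), (name + "+PCA", Xp), (name + "+SPQR", Xs)):
            for c in cluster_counts:
                V, U = fcm(D, c)
                results[(variant, c)] = silhouette_report(D, U)
    return results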
In the following, the obtained results for each database are reported.

8.1. Gender Gap in Spanish WP Dataset

8.1.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 20 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with a single dimension.
The following Table 1, Table 2 and Table 3 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.

8.1.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with six dimensions.
The following Table 4, Table 5 and Table 6 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.
In these experiments, the results obtained using normalized data seem to be worse than those obtained with the raw data.

8.2. Room Occupancy Estimation Dataset

8.2.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 20 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with two dimensions.
The following Table 7, Table 8 and Table 9 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.

8.2.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with five dimensions.
The following Table 10, Table 11 and Table 12 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.
In these experiments, the results obtained using normalized data seem to be slightly worse than those obtained with the raw data.

8.3. Myocardial Infarction Complications Dataset

8.3.1. Clustering and Silhouettes Using Raw Data

This dataset can be considered a sort of sparse matrix: it contains many 0s, and some features contain not-a-number (NaN) values. The proposed results were obtained by selecting only the features without NaN values, yielding a dataset with fourteen dimensions. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with one dimension.
The following Table 13, Table 14 and Table 15 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.

8.3.2. Clustering and Silhouettes Using Normalized Data

The original dataset (without the features containing NaN values) was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with nine dimensions.
The following Table 16, Table 17 and Table 18 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.
In these experiments, the results obtained using normalized data seem to be worse than those obtained with the raw data.

8.4. TUANDROMD (Tezpur University Android Malware Dataset) Dataset

8.4.1. Clustering and Silhouettes Using Raw Data

This dataset can be considered a sort of sparse matrix: it contains many 0s, but no features contain not-a-number (NaN) values. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with seven dimensions.
The following Table 19, Table 20 and Table 21 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.

8.4.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with seven dimensions.
The following Table 22, Table 23 and Table 24 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50 (see columns). Each row represents the percentage of points with a silhouette greater than a given threshold, that is 0.7, 0.8, and 0.9.

8.5. Wine Dataset

8.5.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 13 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with a single dimension. The following Table 25, Table 26 and Table 27 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.5.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with three dimensions. The following Table 28, Table 29 and Table 30 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.6. Multiple Features Dataset

8.6.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 649 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with two dimensions. The following Table 31, Table 32 and Table 33 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.6.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with four dimensions. The following Table 34, Table 35 and Table 36 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.7. Dry Beans Dataset

8.7.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 16 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with one dimension. The following Table 37, Table 38 and Table 39 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.7.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with two dimensions. The following Table 40, Table 41 and Table 42 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.8. APS Failure at Scania Trucks Dataset

8.8.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 168 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with one dimension. The following Table 43, Table 44 and Table 45 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.8.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with five dimensions. The following Table 46, Table 47 and Table 48 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.9. Relative Location of CT Slices on Axial Axis Dataset

8.9.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 384 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with six dimensions. The following Table 49, Table 50 and Table 51 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.9.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with five dimensions. The following Table 52, Table 53 and Table 54 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.10. Mice Protein Expression Dataset

8.10.1. Clustering and Silhouettes Using Raw Data

The original dataset is composed of 81 numerical features. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with one dimension. The following Table 55, Table 56 and Table 57 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.10.2. Clustering and Silhouettes Using Normalized Data

The original dataset was normalized between 0 and 1. Applying PCA to this dataset, losing less than 2% of the total variation in the data, produced a dataset with one dimension. The following Table 58, Table 59 and Table 60 show the obtained results in terms of silhouettes varying the number of clusters from 10 to 50.

8.11. Synthesis

The mean values of all the results shown in the previous sections are reported in the following figures. Figure 1, Figure 2 and Figure 3 report the mean results obtained by applying the analysis to the raw datasets, while Figure 4, Figure 5 and Figure 6 report those obtained by applying the analysis to the normalized datasets. Each figure shows the mean results in terms of silhouettes obtained varying the number of clusters from 10 to 50.
In Figure 1, each line represents the mean percentage of points with a silhouette greater than 0.7. The blue line represents the results obtained using the raw data, the orange line represents those obtained after preprocessing with PCA, and the gray line represents those obtained after preprocessing with SPQR.
Figure 1. Mean results in terms of silhouettes >0.7 obtained by varying the number of clusters from 10 to 50.
In Figure 2, each line represents the mean percentage of points with a silhouette greater than 0.8.
Figure 2. Mean results in terms of silhouettes >0.8 obtained by varying the number of clusters from 10 to 50.
In Figure 3, each line represents the mean percentage of points with a silhouette greater than 0.9.
Figure 3. Mean results in terms of silhouettes >0.9 obtained by varying the number of clusters from 10 to 50.
In Figure 4, each line represents the mean percentage of points with a silhouette greater than 0.7 obtained by applying the analysis to the normalized datasets.
Figure 4. Mean results in terms of silhouettes >0.7 obtained by varying the number of clusters from 10 to 50 using normalized data.
In Figure 5, each line represents the mean percentage of points with a silhouette greater than 0.8 obtained by applying the analysis to the normalized datasets.
Figure 5. Mean results in terms of silhouettes >0.8 obtained by varying the number of clusters from 10 to 50 using normalized data.
In Figure 6, each line represents the mean percentage of points with a silhouette greater than 0.9 obtained by applying the analysis to the normalized datasets.
Figure 6. Mean results in terms of silhouettes >0.9 obtained by varying the number of clusters from 10 to 50 using normalized data.
To perform a comparative evaluation of various methods, statistical hypothesis tests are commonly adopted over the experimental results obtained on a number of datasets. In this work, the obtained results were analyzed using the Friedman test [44]. It is a nonparametric statistical method for evaluating the validity of the null hypothesis (no difference or relationship exists between the sets of data or variables being analyzed). The analysis was carried out using Matlab®. Since the obtained value of χ² is 28.13, which is greater than the critical value of 5.991 (derived from the table of the χ² distribution choosing a significance level α equal to 0.05), we can reject the null hypothesis. Once the null hypothesis was rejected by the Friedman test, the authors adopted a post hoc procedure for the pairwise comparison using the "multcompare" Matlab function. Figure 7 shows a graphical representation of the obtained results. In this figure, the intervals are computed so that two estimates being compared are significantly different if their intervals are disjoint and are not significantly different if their intervals overlap. In Figure 7, all the intervals are disjoint, indicating that the three observations are significantly different from one another.
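For reference, the same test is also available outside Matlab; the following SciPy sketch uses random stand-in scores (placeholders, not the measurements of this paper) only to show the call:

import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(0)
# Stand-in measurements: one array per treatment (raw, PCA, SPQR),
# one entry per dataset; replace with the actual silhouette percentages.
raw, pca, spqr = rng.random((3, 10))

stat, p = friedmanchisquare(raw, pca, spqr)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")   # reject the null hypothesis when p < 0.05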

9. Discussion

In this work, the authors proposed using the SPQR algorithm as a data preprocessing method for machine learning algorithms. It is a technique for sparse matrix approximation and, to the best of our knowledge, it had never been applied in this context. The obtained results were compared to those obtained using the well-known PCA under the same conditions. In particular, an unsupervised learning method (FCM) was applied to ten publicly available databases using different preprocessing techniques:
  • No preprocessing: in these experiments, the FCM was used to cluster raw data;
  • Data normalization: in these experiments, each feature of the datasets was normalized;
  • PCA: the dimensions of the datasets were reduced using PCA, retaining at least 98% of the total variation in the data;
  • SPQR: the dimensions of the datasets were reduced by selecting the same number of features used with PCA.
The obtained results are shown in the previous section, and they allow us to outline some considerations:
  • Data normalization has a significant impact on the performance of the clustering algorithms. This is due to the “equalization effect” that data normalization has on the morphology of the feature space defined by a dataset. Indeed, when normalizing a dataset, each dimension of the feature space has the same extension (1); hence, all the dimensions of the feature space have the same weight in the distance function used in the clustering algorithm. From a semantic point of view, this situation could cause trouble when analyzing a dataset where there are features more important than others;
  • Reducing the dimensions of the dataset could improve the classification performance of the algorithms. In the proposed experiments, there were improvements both in the computational and classification performances;
  • As shown in Section 7, reducing dimensions using PCA means reprojecting the original feature space onto a new one defined by the eigenvectors of the covariance matrix. This makes it quite difficult to estimate the contribution of each feature of the original dataset to the classification process. Furthermore, it is not possible to classify new points later added to the original dataset without recasting them into the new feature space;
  • As shown in Section 6, the SPQR algorithm does not introduce any changes into the original dataset (it changes only the position of some features in the original dataset). This overcomes the drawbacks of PCA discussed above. Furthermore, the proposed results show that the performance obtained by preprocessing data with this algorithm often exceeds that obtained using PCA.

Author Contributions

Investigation, V.D.L. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://archive.ics.uci.edu/ml/index.php.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Gerbrands, J.J. On the relationships between SVD, KLT and PCA. Pattern Recognit. 1981, 14, 375–381.
  2. Tufféry, S. Data Mining and Statistics for Decision Making; Wiley: Hoboken, NJ, USA, 2011.
  3. Adolfo, C.M.S.; Chizari, H.; Win, T.Y.; Al-Majeed, S. Sample Reduction for Physiological Data Analysis Using Principal Component Analysis in Artificial Neural Network. Appl. Sci. 2021, 11, 8240.
  4. Buatoom, U.; Jamil, M.U. Improving Classification Performance with Statistically Weighted Dimensions and Dimensionality Reduction. Appl. Sci. 2023, 13, 2005.
  5. Stewart, G. Four algorithms for the efficient computation of truncated pivoted QR approximations to a sparse matrix. Numer. Math. 1999, 83, 313–323.
  6. Popolizio, M.; Amato, A.; Piuri, V.; Di Lecce, V. Improving Classification Performance Using the Semi-pivoted QR Approximation Algorithm. In Rising Threats in Expert Applications and Solutions: Proceedings of FICR-TEAS, Jaipur, India, 4 July 2022; Rathore, V.S., Sharma, S.C., Tavares, J.M.R., Moreira, C., Surendiran, B., Eds.; Springer: Singapore, 2022; Volume 434, pp. 263–271.
  7. Dua, D.; Graff, C. UCI Machine Learning Repository; University of California, School of Information and Computer Science: Irvine, CA, USA, 2019; Available online: http://archive.ics.uci.edu/ml (accessed on 28 February 2023).
  8. Meng, Z.; Zhang, Z.; Zhang, D.; Yang, D. An active learning method combining Kriging and accelerated chaotic single loop approach (AK-ACSLA) for reliability-based design optimization. Comput. Methods Appl. Mech. Eng. 2019, 357, 112570.
  9. Meng, Z.; Li, G.; Yang, D.; Zhan, L. A new directional stability transformation method of chaos control for first order reliability analysis. Struct. Multidiscip. Optim. 2016, 55, 601–612.
  10. de Velasco, M.; Justo, R.; Zorrilla, A.L.; Torres, M.I. Analysis of Deep Learning-Based Decision-Making in an Emotional Spontaneous Speech Task. Appl. Sci. 2023, 13, 980.
  11. Balasubramanian, M.; Schwartz, E.L. The Isomap Algorithm and Topological Stability. Science 2002, 295, 7.
  12. Roweis, S.T.; Saul, L.K. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 2000, 290, 2323–2326.
  13. Donoho, D.L.; Grimes, C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA 2003, 100, 5591–5596.
  14. Belkin, M.; Niyogi, P. Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Comput. 2003, 15, 1373–1396.
  15. Huang, H.; Feng, H. Gene Classification Using Parameter-Free Semi-Supervised Manifold Learning. IEEE/ACM Trans. Comput. Biol. Bioinform. 2011, 9, 818–827.
  16. Shawe-Taylor, J.; Cristianini, N. Kernel Methods for Pattern Analysis; Cambridge University Press: Cambridge, UK, 2004.
  17. Giraud, C. Introduction to High-Dimensional Statistics; CRC Press: Boca Raton, FL, USA, 2014; Volume 138.
  18. Adragni, K.P.; Al-Najjar, E.; Martin, S.; Popuri, S.K.; Raim, A.M. Group-wise sufficient dimension reduction with principal fitted components. Comput. Statist. 2016, 31, 923–941.
  19. Guo, Z.; Li, L.; Lu, W.; Li, B. Groupwise Dimension Reduction via Envelope Method. J. Am. Stat. Assoc. 2015, 110, 1515–1527.
  20. Ward, A.D.; Hamarneh, G. The Groupwise Medial Axis Transform for Fuzzy Skeletonization and Pruning. IEEE Trans. Pattern Anal. Mach. Intell. 2009, 32, 1084–1096.
  21. Zhou, J.; Wu, J.; Zhu, L. Overlapped groupwise dimension reduction. Sci. China Math. 2016, 59, 2543–2560.
  22. Jolliffe, I. Principal Component Analysis; Wiley: Hoboken, NJ, USA, 2002.
  23. Johnson, R.A.; Wichern, D.W. Applied Multivariate Statistical Analysis; Prentice Hall: Englewood Cliffs, NJ, USA, 1992; Volume 4.
  24. Fodor, I.K. A Survey of Dimension Reduction Techniques. arXiv 2002, arXiv:1403.2877.
  25. Brand, M. Fast low-rank modifications of the thin singular value decomposition. Linear Algebra Its Appl. 2005, 415, 20–30.
  26. Gui, J.; Wang, S.-L.; Lei, Y.-K. Multi-step dimensionality reduction and semi-supervised graph-based tumor classification using gene expression data. Artif. Intell. Med. 2010, 50, 181–191.
  27. Berry, M.; Pulatova, S.; Stewart, G. Computing sparse reduced-rank approximations to sparse matrices. ACM Trans. Math. Softw. 2005, 31, 252–269.
  28. Minguillón, J.; Meneses, J.; Aibar, E.; Ferran-Ferrer, N.; Fàbregues, S. Exploring the gender gap in the Spanish Wikipedia: Differences in engagement and editing practices. PLoS ONE 2021, 16, e0246702.
  29. Borah, P.; Bhattacharyya, D.K.; Kalita, J.K. Malware Dataset Generation and Evaluation. In Proceedings of the 2020 IEEE 4th Conference on Information & Communication Technology (CICT), Chennai, India, 3–5 December 2020; IEEE: Piscataway, NJ, USA, 2020.
  30. Singh, A.P.; Jain, V.; Chaudhari, S.; Kraemer, F.A.; Werner, S.; Garg, V. Machine Learning-Based Occupancy Estimation Using Multivariate Sensor Nodes. In Proceedings of the 2018 IEEE Globecom Workshops (GC Wkshps), Abu Dhabi, United Arab Emirates, 9–13 December 2018.
  31. Golovenkin, S.E.; Bac, J.; Chervov, A.; Mirkes, E.M.; Orlova, Y.V.; Barillot, E.; Gorban, A.N.; Zinovyev, A. Trajectories, bifurcations, and pseudo-time in large clinical datasets: Applications to myocardial infarction and diabetes data. Gigascience 2020, 9, giaa128.
  32. Koklu, M.; Ozkan, I.A. Multiclass Classification of Dry Beans Using Computer Vision and Machine Learning Techniques. Comput. Electron. Agric. 2020, 174, 105507.
  33. Higuera, C.; Gardiner, K.J.; Cios, K.J. Self-Organizing Feature Maps Identify Proteins Critical to Learning in a Mouse Model of Down Syndrome. PLoS ONE 2015, 10, e0129126.
  34. Saxena, A.; Prasad, M.; Gupta, A.; Bharill, N.; Patel, O.P.; Tiwari, A.; Er, M.J.; Ding, W.; Lin, C.-T. A review of clustering techniques and developments. Neurocomputing 2017, 267, 664–681.
  35. Pedrycz, W. Algorithms of fuzzy clustering with partial supervision. Pattern Recognit. Lett. 1985, 3, 13–20.
  36. Dunn, J.C. A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J. Cybern. 1973, 3, 32–57.
  37. Bezdek, J.C.; Ehrlich, R.; Full, W. FCM: The fuzzy c-means clustering algorithm. Comput. Geosci. 1984, 10, 191–203.
  38. Smeulders, A.; Worring, M.; Santini, S.; Gupta, A.; Jain, R. Content-based image retrieval at the end of the early years. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 1349–1380.
  39. Rand, W.M. Objective Criteria for the Evaluation of Clustering Methods. J. Am. Stat. Assoc. 1971, 66, 846.
  40. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
  41. Visalakshi, S.; Radha, V. A literature review of feature selection techniques and applications: Review of feature selection in data mining. In Proceedings of the 2014 IEEE International Conference on Computational Intelligence and Computing Research, Coimbatore, India, 18–20 December 2014; pp. 1–6.
  42. Krömer, P.; Platoš, J.; Snasel, V. Genetic Algorithm for the Column Subset Selection Problem. In Proceedings of the 2014 Eighth International Conference on Complex, Intelligent and Software Intensive Systems (CISIS), Birmingham, UK, 2–4 July 2014; pp. 16–22.
  43. Stewart, G.W. Error Analysis of the Quasi-Gram–Schmidt Algorithm. SIAM J. Matrix Anal. Appl. 2005, 27, 493–506.
  44. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701.
Figure 7. Results of the "multcompare" procedure.
Table 1. Clustering performance in terms of silhouettes using original data.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 43.95 | 37.44 | 43.28 | 50.46 | 60.81
0.8 | 21.60 | 16.29 | 29.31 | 39.40 | 51.45
0.9 | 0 | 0.40 | 16.75 | 35.92 | 37.34
Table 2. Clustering performance in terms of silhouettes using the reduced dataset with PCA.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 43.95 | 37.44 | 43.28 | 50.46 | 60.81
0.8 | 21.60 | 16.29 | 29.31 | 39.40 | 51.45
0.9 | 0 | 0.40 | 16.75 | 35.92 | 37.34
Table 3. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 92.54 | 76.06 | 78.93 | 82.49 | 87.80
0.8 | 89.55 | 69.64 | 69.59 | 75.41 | 74.40
0.9 | 78.59 | 47.83 | 57.75 | 46.10 | 48.46
Table 4. Clustering performance in terms of silhouettes using normalized data.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 0.06 | 0.08 | 0.17 | 0.08 | 0.23
0.8 | 0.06 | 0.08 | 0.10 | 0.08 | 0.15
0.9 | 0.06 | 0.08 | 0.06 | 0.06 | 0.15
Table 5. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 0 | 1.92 | 1.96 | 0.02 | 0.17
0.8 | 0 | 0.04 | 0 | 0.02 | 0.17
0.9 | 0 | 0.04 | 0 | 0.02 | 0.17
Table 6. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 8.81 | 3.37 | 2.61 | 2.40 | 1.85
0.8 | 0 | 1.60 | 0 | 2.21 | 1.01
0.9 | 0 | 0 | 0 | 0.46 | 0.63
Table 7. Clustering performance in terms of silhouettes using original data.
Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50
0.7 | 61.85 | 68.70 | 59.29 | 67.41 | 56.31
0.8 | 55.32 | 49.05 | 31.20 | 39.98 | 46.01
0.9 | 50.29 | 34.87 | 5.56 | 9.22 | 22.68
Table 8. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 67.64 | 68.19 | 61.77 | 61.23 | 56.06 |
| 0.8 | 58.49 | 47.50 | 43.93 | 44.03 | 45.42 |
| 0.9 | 48.98 | 34.78 | 7.918 | 6.741 | 6.00 |
Table 9. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 63.34 | 54.01 | 62.71 | 62.11 | 57.87 |
| 0.8 | 46.83 | 35.16 | 46.13 | 26.06 | 40.93 |
| 0.9 | 4.68 | 13.31 | 2.59 | 3.29 | 19.58 |
Table 10. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 15.03 | 20.39 | 13.74 | 19.47 | 23.40 |
| 0.8 | 6.17 | 5.67 | 4.04 | 8.10 | 8.65 |
| 0.9 | 0 | 0.01 | 0.02 | 1.03 | 0.03 |
Table 11. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 51.43 | 41.16 | 38.56 | 37.08 | 32.20 |
| 0.8 | 32.08 | 21.46 | 21.27 | 17.01 | 14.96 |
| 0.9 | 4.34 | 0.01 | 1.58 | 0.83 | 3.14 |
Table 12. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 53.34 | 56.47 | 55.40 | 61.69 | 62.98 |
| 0.8 | 37.88 | 35.16 | 34.88 | 40.51 | 40.34 |
| 0.9 | 13.51 | 19.07 | 10.68 | 9.81 | 17.37 |
Table 13. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 48.91 | 25.48 | 21.60 | 16.59 | 15.48 |
| 0.8 | 27.31 | 14.24 | 8.83 | 8.06 | 7.06 |
| 0.9 | 0 | 5.59 | 3.35 | 2.24 | 0.82 |
Table 14. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.45 | 64.04 | 66.27 | 64.92 | 67.10 |
| 0.8 | 52.44 | 50.85 | 52.62 | 51.68 | 54.21 |
| 0.9 | 23.19 | 24.48 | 25.37 | 23.90 | 27.78 |
Table 15. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 63.39 | 61.86 | 59.98 | 59.86 | 59.74 |
| 0.8 | 50.68 | 47.38 | 46.44 | 46.08 | 47.14 |
| 0.9 | 22.31 | 19.36 | 18.07 | 17.83 | 17.83 |
Table 16. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 27.49 | 11.42 | 0.12 | 2.77 | 3.12 |
| 0.8 | 17.66 | 0 | 0.12 | 0 | 0 |
| 0.9 | 0 | 0 | 0.12 | 0 | 0 |
Table 17. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 25.60 | 17.89 | 10.24 | 12.48 | 5.59 |
| 0.8 | 11.48 | 14.12 | 9.71 | 4.00 | 2.77 |
| 0.9 | 0 | 0 | 0 | 0 | 0 |
Table 18. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 27.37 | 15.30 | 9.24 | 0.06 | 3.06 |
| 0.8 | 14.42 | 10.12 | 0.18 | 0.06 | 0.06 |
| 0.9 | 0 | 0 | 0 | 0.06 | 0.06 |
Table 19. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 38.64 | 36.29 | 36.33 | 12.09 | 36.51 |
| 0.8 | 38.17 | 21.03 | 21.16 | 11.91 | 21.34 |
| 0.9 | 17.74 | 17.74 | 17.78 | 0.17 | 17.96 |
Table 20. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.27 | 52.95 | 52.01 | 63.44 | 61.53 |
| 0.8 | 45.16 | 48.29 | 46.93 | 58.51 | 58.19 |
| 0.9 | 38.64 | 43.27 | 31.09 | 55.51 | 53.22 |
Table 21. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 48.47 | 48.81 | 48.76 | 48.56 | 52.93 |
| 0.8 | 37.09 | 40.36 | 40.32 | 40.12 | 48.31 |
| 0.9 | 21.84 | 29.05 | 29.00 | 40.12 | 39.8 |
Table 22. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 36.82 | 30.39 | 36.37 | 30.10 | 30.04 |
| 0.8 | 21.07 | 29.92 | 21.21 | 29.63 | 29.56 |
| 0.9 | 0.04 | 18.18 | 17.83 | 17.89 | 17.83 |
Table 23. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.27 | 58.06 | 58.71 | 58.75 | 58.51 |
| 0.8 | 45.16 | 52.86 | 53.67 | 54.45 | 54.83 |
| 0.9 | 38.64 | 34.83 | 45.43 | 49.84 | 50.33 |
Table 24. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 48.16 | 48.90 | 48.76 | 49.66 | 52.91 |
| 0.8 | 28.33 | 40.45 | 40.32 | 41.21 | 44.46 |
| 0.9 | 28.33 | 29.14 | 29.01 | 29.90 | 39.85 |
Table 25. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.92 | 57.86 | 42.13 | 40.45 | 42.13 |
| 0.8 | 45.50 | 38.20 | 25.28 | 24.15 | 28.65 |
| 0.9 | 10.67 | 7.86 | 6.18 | 10.11 | 14.61 |
Table 26. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 70.22 | 70.79 | 73.59 | 74.16 | 68.54 |
| 0.8 | 58.43 | 59.55 | 58.99 | 58.99 | 57.30 |
| 0.9 | 28.65 | 26.40 | 31.46 | 31.46 | 39.89 |
Table 27. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 65.73 | 71.35 | 75.28 | 74.16 | 74.16 |
| 0.8 | 52.81 | 60.11 | 61.80 | 57.30 | 60.11 |
| 0.9 | 21.91 | 31.46 | 34.83 | 29.77 | 41.57 |
Table 28. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0.56 | 1.12 | 1.68 | 1.68 | 1.123 |
| 0.8 | 0.56 | 0.56 | 1.12 | 1.68 | 0.56 |
| 0.9 | 0.56 | 0.56 | 1.12 | 1.68 | 0.56 |
Table 29. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 5.62 | 8.99 | 8.99 | 10.11 | 12.92 |
| 0.8 | 0 | 1.68 | 4.49 | 3.93 | 8.42 |
| 0.9 | 0 | 1.123 | 0 | 0 | 5.62 |
Table 30. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 14.61 | 14.61 | 10.67 | 19.10 | 14.61 |
| 0.8 | 3.93 | 4.49 | 3.93 | 8.99 | 8.43 |
| 0.9 | 0 | 0 | 3.37 | 2.81 | 2.81 |
Table 31. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 22.40 | 2.40 | 0.70 | 0.35 | 0.70 |
| 0.8 | 15.95 | 0 | 0 | 0 | 0 |
| 0.9 | 0 | 0 | 0 | 0 | 0 |
Table 32. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 56.55 | 32.30 | 24.95 | 24.10 | 23.45 |
| 0.8 | 38.65 | 10.90 | 7.10 | 8.50 | 6.40 |
| 0.9 | 15.65 | 0 | 0.35 | 0.35 | 0 |
Table 33. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 65.35 | 55.10 | 48.60 | 44.85 | 44.95 |
| 0.8 | 51.60 | 35.80 | 28.75 | 25.85 | 25.50 |
| 0.9 | 22.45 | 8.85 | 3.65 | 3.55 | 2.10 |
Table 34. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0 | 0.20 | 0.10 | 0.15 | 0 |
| 0.8 | 0 | 0.20 | 0.10 | 0.15 | 0 |
| 0.9 | 0 | 0.20 | 0.10 | 0.15 | 0 |
Table 35. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 21.05 | 8.70 | 6.90 | 21.4 | 0 |
| 0.8 | 4.15 | 0.40 | 0.65 | 0 | 0 |
| 0.9 | 0 | 0 | 0 | 0 | 0 |
Table 36. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 73.70 | 74.85 | 69.25 | 67.75 | 72.35 |
| 0.8 | 67.70 | 64.60 | 57.20 | 58.50 | 61.75 |
| 0.9 | 46.45 | 45.45 | 29.80 | 23.35 | 40.60 |
Table 37. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.15 | 59.83 | 60.83 | 60.52 | 59.70 |
| 0.8 | 48.18 | 45.64 | 46.20 | 45.95 | 45.34 |
| 0.9 | 17.24 | 16.09 | 16.72 | 15.99 | 14.38 |
Table 38. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.20 | 61.43 | 61.27 | 60.45 | 61.16 |
| 0.8 | 48.29 | 46.52 | 46.95 | 46.86 | 46.85 |
| 0.9 | 17.46 | 17.48 | 18.41 | 18.37 | 19.23 |
Table 39. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 62.26 | 61.13 | 61.31 | 61.02 | 60.66 |
| 0.8 | 48.14 | 47.12 | 47.61 | 46.80 | 46.85 |
| 0.9 | 17.40 | 17.43 | 18.61 | 18.40 | 18.11 |
Table 40. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 13.16 | 4.39 | 3.22 | 0.55 | 0.53 |
| 0.8 | 3.11 | 2.36 | 2.31 | 0 | 0 |
| 0.9 | 0.15 | 0.037 | 0.01 | 0 | 0 |
Table 41. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 30.07 | 22.94 | 19.43 | 17.23 | 17.13 |
| 0.8 | 10.10 | 6.74 | 5.55 | 3.91 | 4.74 |
| 0.9 | 0 | 0 | 0 | 0 | 0 |
Table 42. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 30.55 | 21.61 | 20.99 | 19.95 | 21.64 |
| 0.8 | 11.39 | 4.94 | 4.33 | 4.63 | 5.99 |
| 0.9 | 0 | 0 | 0 | 0 | 0 |
Table 43. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 11.94 | 37.07 | 29.13 | 45.43 | 45.31 |
| 0.8 | 11.31 | 32.07 | 20.36 | 39.45 | 39.34 |
| 0.9 | 9.70 | 27.02 | 5.54 | 35.07 | 34.99 |
Table 44. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 68.40 | 74.05 | 78.86 | 75.34 | 77.41 |
| 0.8 | 58.14 | 65.12 | 70.83 | 66.66 | 68.81 |
| 0.9 | 41.40 | 41.38 | 51.89 | 46.52 | 50.42 |
Table 45. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 90.66 | 85.51 | 81.28 | 84.09 | 79.63 |
| 0.8 | 85.30 | 80.79 | 76.01 | 78.15 | 72.96 |
| 0.9 | 81.17 | 65.34 | 59.48 | 63.08 | 56.15 |
Table 46. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 34.28 | 20.97 | 18.43 | 9.97 | 11.9 |
| 0.8 | 9.65 | 13.52 | 15.20 | 2.99 | 8.56 |
| 0.9 | 7.08 | 5.12 | 5.42 | 1.97 | 0.40 |
Table 47. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 48.17 | 25.93 | 21.18 | 16.80 | 12.46 |
| 0.8 | 18.63 | 15.13 | 14.07 | 8.93 | 4.05 |
| 0.9 | 8.01 | 2.48 | 1.94 | 0 | 1.16 |
Table 48. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 28.40 | 27.88 | 14.55 | 12.74 | 12.79 |
| 0.8 | 15.10 | 11.13 | 9.59 | 6.58 | 7.49 |
| 0.9 | 6.88 | 3.37 | 4.38 | 2.75 | 3.95 |
Table 49. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0.002 | 0.002 | 0.004 | 0.013 | 0.019 |
| 0.8 | 0.002 | 0.002 | 0.004 | 0.013 | 0.019 |
| 0.9 | 0.002 | 0.002 | 0.004 | 0.013 | 0.019 |
Table 50. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 10.06 | 7.16 | 4.99 | 6.56 | 3.79 |
| 0.8 | 0.82 | 2.91 | 2.88 | 2.858 | 0.68 |
| 0.9 | 0 | 1.54 | 1.42 | 1.19 | 0 |
Table 51. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 23.61 | 25.91 | 15.64 | 20.30 | 37.95 |
| 0.8 | 8.39 | 5.75 | 0.01 | 15.531 | 21.03 |
| 0.9 | 5.38 | 0.01 | 0.01 | 5.47 | 8.31 |
Table 52. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0.006 | 0.007 | 0.011 | 0.015 | 0.020 |
| 0.8 | 0.006 | 0.007 | 0.011 | 0.015 | 0.020 |
| 0.9 | 0.006 | 0.007 | 0.011 | 0.015 | 0.020 |
Table 53. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 13.45 | 10.93 | 5.20 | 2.15 | 1.46 |
| 0.8 | 1.46 | 4.32 | 3.18 | 0 | 0 |
| 0.9 | 0 | 1.40 | 1.46 | 0 | 0 |
Table 54. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 53.43 | 82.86 | 90.65 | 88.22 | 85.94 |
| 0.8 | 25.01 | 79.76 | 86.64 | 83.62 | 81.28 |
| 0.9 | 0 | 57.74 | 79.67 | 70.08 | 67.80 |
Table 55. Clustering performance in terms of silhouettes using original data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0 | 0.18 | 0.09 | 0.28 | 0.37 |
| 0.8 | 0 | 0.18 | 0.09 | 0.28 | 0.37 |
| 0.9 | 0 | 0.18 | 0.09 | 0.28 | 0.37 |
Table 56. Clustering performance in terms of silhouettes using the reduced dataset with PCA.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 63.42 | 62.41 | 63.70 | 63.70 | 64.63 |
| 0.8 | 50.09 | 49.07 | 48.61 | 50.92 | 50.65 |
| 0.9 | 23.05 | 19.35 | 21.85 | 21.76 | 19.07 |
Table 57. Clustering performance in terms of silhouettes using the reduced dataset with SPQR.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 58.15 | 61.85 | 64.35 | 63.15 | 63.15 |
| 0.8 | 47.59 | 50.92 | 49.44 | 49.91 | 49.72 |
| 0.9 | 17.87 | 21.30 | 22.68 | 21.39 | 22.41 |
Table 58. Clustering performance in terms of silhouettes using normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 0 | 0 | 0 | 0.09 | 0.18 |
| 0.8 | 0 | 0 | 0 | 0.09 | 0.18 |
| 0.9 | 0 | 0 | 0 | 0.09 | 0.18 |
Table 59. Clustering performance in terms of silhouettes using the reduced dataset with PCA applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 60.09 | 61.85 | 62.41 | 62.68 | 64.44 |
| 0.8 | 46.30 | 49.44 | 48.70 | 48.98 | 50.65 |
| 0.9 | 19.26 | 20.74 | 20.46 | 20.37 | 19.81 |
Table 60. Clustering performance in terms of silhouettes using the reduced dataset with SPQR applied to normalized data.

| Silhouette \ N. Clusters | 10 | 20 | 30 | 40 | 50 |
|---|---|---|---|---|---|
| 0.7 | 61.11 | 62.78 | 59.91 | 61.94 | 61.85 |
| 0.8 | 48.98 | 50.18 | 45.55 | 48.05 | 48.42 |
| 0.9 | 19.26 | 20.83 | 17.78 | 18.80 | 20.37 |