Article

Partition Quantitative Assessment (PQA): A Quantitative Methodology to Assess the Embedded Noise in Clustered Omics and Systems Biology Data

by Diego A. Camacho-Hernández 1,2,†, Victor E. Nieto-Caballero 1,2,†, José E. León-Burguete 1,2 and Julio A. Freyre-González 1,*

1 Regulatory Systems Biology Research Group, Center for Genomic Sciences, Laboratory of Systems and Synthetic Biology, Universidad Nacional Autónoma de México (UNAM), Morelos 62210, Mexico
2 Undergraduate Program in Genomic Sciences, Center for Genomic Sciences, Universidad Nacional Autónoma de México (UNAM), Morelos 62210, Mexico
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2021, 11(13), 5999; https://doi.org/10.3390/app11135999
Submission received: 28 December 2020 / Revised: 8 January 2021 / Accepted: 10 January 2021 / Published: 28 June 2021
(This article belongs to the Special Issue Towards a Systems Biology Approach)

Featured Application

A method to quantify statistically the intrinsic noise of clustered data.

Abstract

Identifying groups that share common features among datasets through clustering analysis is a typical problem in many fields of science, particularly in post-omics and systems biology research. In this respect, quantifying how well a measure can cluster or organize intrinsic groups is important, since there is currently no statistical evaluation of how ordered the resulting clustered vector is or how much noise is embedded in it. Much of the literature focuses on how well the clustering algorithm orders the data, with several measures for external and internal statistical validation, but no score has been developed to quantify statistically the noise in an arranged vector after a clustering algorithm has been applied, i.e., how much of the clustering is due to randomness. Here, we present a quantitative methodology, based on autocorrelation, to assess this problem.

1. Introduction

A common task in today’s research is the identification of specific markers as predictors of a classification yielded by clustering analysis of the data. For instance, this approach is particularly useful after high-throughput experiments to compare gene expression or methylation profiles among different cell lines [1]. This task is also used in the nascent field of single-cell sequencing, where clustering cells is an important step for further classification or serves as a qualifying metric of the sequencing process [2]. In the widely used gene expression assays, the vector of profiles for each marker across different cell lines is recorded using hierarchical clustering algorithms. These algorithms yield a dendrogram and a heat map representing the vector of marker profiles, illustrating the arrangement of the clusters. To assess how well the clustering segregates different cell lines, a class stating the desired partitioning of each cell line is provided a posteriori. Then, a simple visual inspection of the vector of classes is used to estimate whether the clustering provides a good partition. Such a partition vector is colored according to the classification that each item is associated with, and similar items are expected to be contiguous, so the groups formed are assessed qualitatively against the biological background of each item.
This procedure should not be confused with “supervised clustering”, which provides a vector of classes stating the desired partitioning a priori. That vector is then used to guide the clustering algorithm by allowing it to learn the distance metric that optimizes the partitioning [3]. The procedure may also be confused with the metric assessment of clustering algorithms, especially with external cluster evaluation. Various metrics have been developed to qualify the clustering algorithm itself, such as internal and external measures. External validation compares the clustering to a goal to say whether it is a good clustering or not. Internal validation compares the elements within the cluster and their differences [4]. Partition quantitative assessment (PQA) involves characteristics of both kinds of validation, since it uses both the crafted gold standard and the yielded signal itself (the clustered vector). However, PQA gathers these elements not to qualify the clustering algorithm itself but to quantify the noise embedded in the cluster. This noise may be due to the intrinsic metric or marker used to order the data set.
A possible caveat of the qualitative assessment discussed above is that humans tend to perceive meaningful patterns within random data, a cognitive bias known as apophenia [5]. While interpreting the partitions obtained from unsupervised clustering analysis, researchers attempt to visually assess how close the classifications are to each other, and may find patterns that are not well supported by the data. Such an effect arises because the adjacency between items may give a notion of the dissimilarity distance between the dendrogram leaves. Unfortunately, as far as we know, there is no method to quantitatively assess the quality of the groups of classifications from the clustering or, at least, there has been no attempt to quantify whether a certain configuration or order of the items may be due to randomness. This is a serious caveat, since the insertion of noise can lead to false conclusions or misleading results. Furthermore, purging this noise can lead to more efficient descriptions of markers and their phenomena, accelerating the advance in many fields.
In statistics, serial correlation (SC) is a term used to describe the relationship between observations of the same variable over specific periods. It was originally used in engineering to determine how a signal, for instance, a radio wave, varies with itself over time. Later, SC was adapted to econometrics to analyze economic data over time, principally to predict stock prices and, in other fields, to model independent random variables [6]. We applied the SC to quantify how good the grouping of a posterior classification is, using only the results of an unsupervised clustering analysis. Thus, we propose a novel relative score, PQA, to address the subjectivity of visual inspection and to quantify statistically how much noise is embedded in the results of clustering analysis.

2. Methodology

2.1. Assigning Numeric Labels to Classifications

A vector denoting the putative similarities among the variables in a study is usually obtained after a clustering analysis. Each variable is classified to generate a vector of profiles (VP). Such a vector of classifications is usually translated into a color vector, in which each color represents a classification. It is common to inspect this vector to find groups that make sense according to the analyzed data. For the method presented in this work, the VP may be as simple as a vector of strings or numbers that represent the input.
Whatever the representation of the classifications may be, it is necessary to transform the classifications into a vector of numeric labels, in which a number represents a classification, to be able to calculate the SC. To accomplish this, we assign the first numeric label (number 1) to the first item in the vector, which usually lies at one of the vector’s extremes. Then, if the classification of the next item is different from the previous one, the next number in the sequence is assigned, and so on. This way of labeling ensures that changes in the SC values are due to the order of the numbers, that is to say, the grouping of the classifications resulting from the clustering, and are not an artifact of the labeling itself (Figure 1).
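The labeling step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `assign_numeric_labels` is hypothetical, and it assumes that a classification that reappears later in the vector keeps the number it received at its first appearance, which the text implies but does not state explicitly.

```python
def assign_numeric_labels(classes):
    """Map each classification to an integer in order of first appearance.

    Assumption: a classification seen again later reuses its original number.
    """
    label_of = {}
    labels = []
    for c in classes:
        if c not in label_of:
            # Next unused number in the sequence 1, 2, 3, ...
            label_of[c] = len(label_of) + 1
        labels.append(label_of[c])
    return labels


# Hypothetical VP read from the leaves of a dendrogram, left to right.
vp = ["tumor", "tumor", "normal", "tumor", "normal", "control"]
print(assign_numeric_labels(vp))  # [1, 1, 2, 1, 2, 3]
```

Because the numbers follow the order of first appearance, a perfectly grouped VP becomes a non-decreasing sequence regardless of which classification happens to come first.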

2.2. Partition Quantitative Assessment (PQA) Score

Because the order of the VP can be interpreted as the grouping of the classifications, we measure how well the same classifications are held together in the VP through an SC shifted by one position. This type of correlation is defined as the Pearson product-moment correlation between the VP without its first item and the VP without its last item (Equation (1), where $x_i$ is the i-th position of the ordered vector, $n$ is the length of $x$, and $\rho_i$ is the resulting SC).
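$$\rho_i = \frac{\sum_{i=2}^{n}\left(x_i - \frac{1}{n-1}\sum_{j=2}^{n} x_j\right)\left(x_{i-1} - \frac{1}{n-1}\sum_{j=1}^{n-1} x_j\right)}{\sqrt{\sum_{i=2}^{n}\left(x_i - \frac{1}{n-1}\sum_{j=2}^{n} x_j\right)^{2}\,\sum_{i=1}^{n-1}\left(x_i - \frac{1}{n-1}\sum_{j=1}^{n-1} x_j\right)^{2}}} \quad (1)$$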
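As an illustration of the shifted correlation, the following sketch computes the Pearson correlation between the VP without its first item and the VP without its last item using only the Python standard library. The function name `serial_correlation` and the choice to return 0.0 for a constant vector (where the correlation is undefined) are assumptions made for this example.

```python
import math


def serial_correlation(x):
    """Lag-1 serial correlation: Pearson r between x[1:] and x[:-1]."""
    a, b = list(x[1:]), list(x[:-1])
    m = len(a)
    if m < 2:
        raise ValueError("need at least three items")
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    denom = math.sqrt(var_a * var_b)
    # Assumption: a constant vector has no defined correlation; return 0.0.
    return cov / denom if denom > 0 else 0.0


grouped = [1, 1, 1, 2, 2, 2, 3, 3, 3]      # same labels held together
alternating = [1, 2, 1, 2, 1, 2]           # labels never adjacent
print(serial_correlation(grouped))          # close to +1
print(serial_correlation(alternating))      # -1 up to rounding
```

A well-grouped vector yields a value near +1, while a vector whose labels alternate yields a strongly negative value.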
We then define the PQA as the SC of the VP after removing background noise, normalized by the SC of the perfect grouping of the partitions (defined as the vector sorted in ascending order). Thus, the more similar the VP is to its sorted vector, the higher the score (Equation (2), where $\rho_x$ is the SC of the VP, $\overline{\rho_{rand_x}}$ is the mean of the SC of 1000 randomizations, and $\rho_{perfect_x}$ is the SC of the vector sorted in ascending order).
$$PQA_x = \frac{\rho_x - \overline{\rho_{rand_x}}}{\rho_{perfect_x}} \quad (2)$$
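The score can be sketched as follows. This is an illustrative reading of the definition rather than the authors' code: the names `pqa_score`, `n_rand`, and `seed` are hypothetical, and the background term is estimated here by fully shuffling the VP, which is one plausible interpretation of the randomization described in Section 2.3.

```python
import math
import random


def serial_correlation(x):
    """Lag-1 serial correlation: Pearson r between x[1:] and x[:-1]."""
    a, b = list(x[1:]), list(x[:-1])
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def pqa_score(vp, n_rand=1000, seed=0):
    """PQA = (SC of VP - mean SC of shuffled VPs) / SC of sorted VP."""
    rng = random.Random(seed)
    work = list(vp)
    rand_rhos = []
    for _ in range(n_rand):
        rng.shuffle(work)  # assumption: background = full shuffle of the VP
        rand_rhos.append(serial_correlation(work))
    rho_rand_mean = sum(rand_rhos) / n_rand
    rho_perfect = serial_correlation(sorted(vp))
    if rho_perfect == 0:
        raise ValueError("the sorted VP has no defined SC (single class?)")
    return (serial_correlation(vp) - rho_rand_mean) / rho_perfect


ordered = [c for c in (1, 2, 3) for _ in range(10)]
mixed = list(ordered)
random.Random(1).shuffle(mixed)
print(pqa_score(ordered), pqa_score(mixed))
```

With these assumptions, a perfectly sorted VP scores close to 1 and a shuffled VP scores close to 0.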

2.3. Background-Noise Correlation Factor in the PQA Score

To compute the background-noise correlation factor in the PQA score definition, we sample indexes of the VP and swap the corresponding items. This background correction aims to remove inherent noise in the data, even though the score may still be subject to noise from the chosen clustering algorithm or to discrepancies in the posterior classification.

2.4. Statistical Significance of the PQA Score

To quantify the statistical significance of the PQA score, we calculate a Z-score (Equation (3)),
$$z_x = \frac{PQA_x - \overline{PQA_{rand}}}{SD_{PQA_{rand}}} \quad (3)$$
where $PQA_x$ is the PQA score of the VP, $\overline{PQA_{rand}}$ is the mean of the PQA scores of 1000 randomizations of the VP, and $SD_{PQA_{rand}}$ is their standard deviation. These randomizations have the purpose of generating a solid random background against which to compare the real signal. The number of randomizations does not depend on the size of the VP. It is worth noting that there are two randomization processes: one generates the input population of random vectors whose PQA scores are used to calculate the Z-score, and the other represents the noise in Equation (2).
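A minimal sketch of the Z-score computation is given below. It is not the authors' implementation: the name `pqa_z_score` is hypothetical, the sample standard deviation is assumed, and, as a shortcut, the background mean and the perfect-grouping SC are computed once and reused for every randomized vector, since each shuffle contains the same items as the VP.

```python
import math
import random
import statistics


def serial_correlation(x):
    """Lag-1 serial correlation: Pearson r between x[1:] and x[:-1]."""
    a, b = list(x[1:]), list(x[:-1])
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def pqa_z_score(vp, n_rand=1000, seed=0):
    """Z-score of the PQA of `vp` against PQA scores of shuffled vectors."""
    rng = random.Random(seed)
    work = list(vp)

    def shuffled_rhos():
        rhos = []
        for _ in range(n_rand):
            rng.shuffle(work)
            rhos.append(serial_correlation(work))
        return rhos

    # Randomization 1: background-noise term of Equation (2).
    rho_rand_mean = statistics.fmean(shuffled_rhos())
    rho_perfect = serial_correlation(sorted(vp))

    def pqa(rho):
        return (rho - rho_rand_mean) / rho_perfect

    # Randomization 2: population of random vectors for Equation (3).
    pqa_rand = [pqa(r) for r in shuffled_rhos()]
    pqa_x = pqa(serial_correlation(vp))
    # Assumption: sample standard deviation (the text does not specify).
    return (pqa_x - statistics.fmean(pqa_rand)) / statistics.stdev(pqa_rand)


ordered = [c for c in (1, 2, 3) for _ in range(10)]
print(pqa_z_score(ordered, n_rand=300))
```

The two calls to `shuffled_rhos` mirror the two randomization processes mentioned in the text, kept separate so that the background estimate and the comparison population are drawn independently.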

2.5. Defining Noise Proportions

To quantify the noise embedded in the VP, we calculate Z-scores from the distribution of PQA values of randomized vectors, which are obtained by scrambling the vector. The Z-score of the VP is then interpolated on this distribution to retrieve the estimated proportion of noise in the clustered VP.
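One possible reading of this interpolation step is sketched below; the text leaves several details open, so everything here is an assumption for illustration. The sketch scrambles a given percentage of the items of the sorted vector, records the mean Z-score at each noise level, and linearly interpolates the observed Z-score on that curve. The names `noise_curve`, `scramble_fraction`, and `estimate_noise` are hypothetical. Because all vectors share the same items, the background and perfect-grouping terms of Equation (2) are constants, so the Z-score of the PQA equals the Z-score of the SC itself.

```python
import math
import random
import statistics


def serial_correlation(x):
    """Lag-1 serial correlation: Pearson r between x[1:] and x[:-1]."""
    a, b = list(x[1:]), list(x[:-1])
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def scramble_fraction(vec, fraction, rng):
    """Shuffle the values held at a random subset of positions."""
    out = list(vec)
    idx = rng.sample(range(len(out)), round(fraction * len(out)))
    vals = [out[i] for i in idx]
    rng.shuffle(vals)
    for i, v in zip(idx, vals):
        out[i] = v
    return out


def noise_curve(vp, percents, n_rep=50, n_rand=1000, seed=0):
    """Mean Z-score of the sorted VP after scrambling each percentage."""
    rng = random.Random(seed)
    work, rhos = list(vp), []
    for _ in range(n_rand):  # null distribution from full shuffles
        rng.shuffle(work)
        rhos.append(serial_correlation(work))
    mu, sd = statistics.fmean(rhos), statistics.stdev(rhos)
    perfect = sorted(vp)
    return [
        statistics.fmean(
            (serial_correlation(scramble_fraction(perfect, p / 100, rng)) - mu) / sd
            for _ in range(n_rep)
        )
        for p in percents
    ]


def estimate_noise(z_obs, percents, z_curve):
    """Linearly interpolate the noise percentage matching `z_obs`."""
    if z_obs >= z_curve[0]:
        return float(percents[0])
    points = list(zip(percents, z_curve))
    for (p0, z0), (p1, z1) in zip(points, points[1:]):
        if z1 <= z_obs <= z0:
            return float(p0) if z0 == z1 else p0 + (z0 - z_obs) * (p1 - p0) / (z0 - z1)
    return float(percents[-1])


vp = [c for c in (1, 2, 3, 4) for _ in range(10)]
percents = [0, 25, 50, 75, 100]
curve = noise_curve(vp, percents, n_rep=20, n_rand=200)
print(estimate_noise(curve[0], percents, curve))  # 0.0
```

A Z-score above the 0% point is clamped to 0% noise and one below the whole curve is clamped to the last percentage; a finer grid of percentages would give a smoother estimate.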

2.6. Effect of the Length and Number of Partitions of the Vector in the Z-Score Distributions

Since we want to compare the PQA with the noise, we randomized the VP 1000 times. We opted to describe the dynamics of the Z-score for different percentages of noise and numbers of partitions. For this, we synthetically crafted vectors in which both the number of elements and the number of classifications ranged from 0 to 100. The Z-scores were retrieved from the crafted vectors using the formulas described above.

3. Results and Discussion

3.1. Effects of Permuted Numeric Labels on the Partition

We wondered whether the correct assignment of numeric labels would alter the SC calculations as little as possible, so we analyzed how the SC changes over synthetic partitions with permuted labels. We began by generating synthetic partitions in ascending and descending order, increasing both the number of classifications and the number of items, up to 100. It is important to highlight that the number of items belonging to each classification was kept constant. Because trying all possible permutations for each vector would be infeasible, we created a subset of 1000 permutations of each vector and then calculated the mean SC (Figure 1, see Section 2). We observed that the mean SC became high when the number of items in the VP was greater than or equal to 2 times the number of classifications. Nevertheless, we obtained the highest SC when the numeric labels were assigned in sequential order, either ascending or descending (Figure 2).
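The comparison between sequential and permuted labels can be sketched on a small synthetic partition. This is an illustrative reconstruction under assumptions: the function name `mean_sc_permuted_labels` and the sizes used (5 classifications with 4 items each) are hypothetical and are not taken from the study.

```python
import math
import random


def serial_correlation(x):
    """Lag-1 serial correlation: Pearson r between x[1:] and x[:-1]."""
    a, b = list(x[1:]), list(x[:-1])
    m = len(a)
    mean_a, mean_b = sum(a) / m, sum(b) / m
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    denom = math.sqrt(var_a * var_b)
    return cov / denom if denom > 0 else 0.0


def mean_sc_permuted_labels(n_classes, items_per_class, n_perm=1000, seed=0):
    """Mean SC of a grouped partition whose block labels are permuted."""
    rng = random.Random(seed)
    labels = list(range(1, n_classes + 1))
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(labels)  # items stay grouped; only the numbers change
        vp = [lab for lab in labels for _item in range(items_per_class)]
        total += serial_correlation(vp)
    return total / n_perm


ascending = [lab for lab in range(1, 6) for _item in range(4)]
descending = ascending[::-1]
print(serial_correlation(ascending))
print(serial_correlation(descending))
print(mean_sc_permuted_labels(5, 4, n_perm=200))
```

Ascending and descending labels give the same SC because the Pearson correlation is symmetric in its two arguments, which is consistent with the observation that either sequential order yields the highest value.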

3.2. Length of Partitions as a Proxy of the Number of Classifications

We wondered whether the number of classifications and the length of the VP may change the statistical significance of the PQA score, because the fewer the items in the VP, the greater the chance of obtaining any given grouping at random. We tested this effect by calculating a Z-score from ordered synthetic partitions, increasing both the number of classifications and the number of items up to 100. We also kept constant the number of classifications for the sake of this analysis. We noticed that only the length of the partition has a true effect on the Z-score, whereas the number of classifications does not. With a Z-score cutoff greater than 3 (p-value of 0.002), every partition shorter than 13 items could be considered pure noise. We also observed Z-score values still greater than 2 with lengths of 12, 11, and 10, but lower values with lengths between 2 and 9 (Figure 2). If we were more flexible, we could have set the length cutoff at those values without losing statistical significance, since a Z-score of 2 corresponds roughly to a p-value of 0.05. The results of this analysis were intuitively expected, because the probability of an item occupying a given position in the VP depends on the number of items competing for that position.
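The correspondence between Z-score cutoffs and p-values can be checked with a short sketch. It assumes a standard normal reference distribution; the text does not state whether its p-values are one-sided or two-sided, so both are computed, and the function names are hypothetical.

```python
import math


def p_one_sided(z):
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def p_two_sided(z):
    """Two-tailed probability P(|Z| >= |z|) for a standard normal variable."""
    return math.erfc(abs(z) / math.sqrt(2))


for z in (2, 3):
    print(z, p_one_sided(z), p_two_sided(z))
# z = 2: one-sided ≈ 0.0228, two-sided ≈ 0.0455
# z = 3: one-sided ≈ 0.00135, two-sided ≈ 0.0027
```

Under this assumption, a Z-score of 2 matches the conventional 0.05 level in the two-sided case, while a cutoff of 3 is considerably stricter in either case.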

3.3. Proof of Concept: Quantifying Real Noise

After a literature review, we noticed that some datasets were subject to visual inspection in their respective papers, so we applied our method to quantify the proportion of noise embedded in those datasets and to test whether they may lead to apophenia. We chose two datasets from the literature for two main reasons. First, the data should have a number of items well above the length threshold derived from our Z-score significance analysis (>13). Second, we wanted contrasting orderings of the partitions, with one dataset that looks very disordered and another that looks somewhat ordered, to compare the noise proportions. Last, we assessed the behavior of the score in highly ordered data, which also satisfies the threshold mentioned above.

3.3.1. Cancer Methylation Signatures

The first dataset consists of methylation profiles of 242 different cancerous and non-cancerous samples [7] (Figure 3). The classifications look very sparse, and the groups are torn apart into many subgroups distributed along the data’s VP. We detected 25.1% of noise and a PQA score of 0.53 (Figure 4, with a Z-score of 8.2 and a p-value of 9.6 × 10⁻¹⁷). Both numbers imply that even though there may be disorder in the VP, there is neither a very high noise proportion nor a high PQA score. These results suggest that, as in any other statistical test, the greater the number of items in the partition, the more diluted the effect of disorder in the VP becomes, and the greater the statistical significance, as shown in the analysis of the number of items and classifications. Moreover, the authors concluded that their clustering analysis results made sense given the molecular and biological background, as well as the perspectives about the analyzed profiles; they assessed the grouping only by visual inspection and concluded that it was done well. However, understanding the noise in the cluster can help in pursuing better markers, since it could narrow the search space in these kinds of studies.

3.3.2. Distribution of microRNAs in Cancer

The second dataset consists of 103 expression profiles of microRNAs from three classes of samples: invasive breast cancer, ductal carcinoma in situ (DCIS), and healthy (Figure 3) [8]. The authors visually identified three clusters, although selecting the right cutting height threshold is difficult. Besides, one of the clusters is a mix of classes in different proportions, leading the authors to conclude that the DCIS and control sample profiles are not different. On this matter, the PQA score and the proportion of noise are 0.62 and 30.2%, respectively (Figure 4, with a Z-score of 6.2 and a p-value of 3.9 × 10⁻¹⁰), providing a quantitative assessment to support the grouping that the authors claimed. Furthermore, in comparison with the methylation profiles discussed above, a partition that appears less fuzzy has an even higher noise ratio, supporting the idea that visual inspection can lead to misleading results.

3.3.3. Comparison of Genetic Regulatory Networks with Theoretical Models

Finally, to assess the PQA methodology using systems biology data, we clustered 210 networks according to their pairwise dissimilarity [9]. First, 42 curated biological networks were retrieved from Abasy Atlas (v2.2) [10]. For each biological network, we then constructed four networks, each according to a theoretical model (Barabási–Albert, Erdős–Rényi, scale-free, and hierarchical-modular). We estimated the parameters of each theoretical model from the properties of the corresponding biological network. The models used reproduce one or more intrinsic characteristics of the biological networks, such as power-law degree distributions, hubs, scale-free degrees, and hierarchical modular structure [11]. Visual inspection suggested that the classification yielded a highly ordered VP, distinguishing the networks according to their nature (Figure 5). The PQA score for this VP is 0.92 (p-value = 2.5 × 10⁻⁴⁰, Z-score = 13.2) and the proportion of noise is 5.8% (Figure 6). In contrast to the previous examples, here we obtained a highly ordered clustering and a very low proportion of noise, which suggests that although the models recapitulate some of the properties of genetic regulatory networks, none of them alone is sufficient to capture their structural properties.

4. Conclusions

In this work, we presented a novel method to quantify the proportion of noise embedded in the grouping of associated classes of the elements in hierarchical clustering. We proposed a relative score derived from an SC of the VP from the dendrogram of any clustering analysis and calculated Z-statistics as well as an interpolation to deliver an estimation of noise in the VP. We explained how the method is formulated and showed the tests we made to systematically refine it.
Additionally, we provided a proof of concept using clustering data from two works that we think clearly represent overfitting by apophenia. We also added an example from network biology where clustered networks are separated by intrinsic characteristics. Although in this work we focused on examples where hierarchical clustering is performed, this framework can be applied to any partition algorithm in which the elements are identified and a VP can be acquired.
We concluded that the clustered sets of biological data have a high measure of noise, despite looking well grouped. We showed that a minimum number of classifications should be considered in this sort of clustering analysis to achieve a significant reduction of noise. On the other hand, we permuted the labels of the associated classes and concluded that the effect is negligible. We showed that randomness still plays an important role by biasing the results, although it may not be evident through visual inspection.
The PQA could be used as a benchmark to test which clustering algorithm is appropriate for the analyzed dataset by minimizing the noise proportion, and to guide omics experimental designs. Nevertheless, a word of caution: the PQA score alone can be subject to subjectivity if not used properly, since it depends on the characteristics of the analyzed data. Thus, the PQA score is intended as a quantification of noise in clustered data and should be used with discretion.

Author Contributions

Conceptualization, J.A.F.-G.; methodology, J.A.F.-G.; software, D.A.C.-H., V.E.N.-C. and J.A.F.-G.; validation, D.A.C.-H., V.E.N.-C. and J.A.F.-G.; formal analysis, D.A.C.-H., V.E.N.-C. and J.A.F.-G.; investigation, D.A.C.-H., V.E.N.-C., J.E.L.-B. and J.A.F.-G.; resources, J.A.F.-G.; data curation, D.A.C.-H., V.E.N.-C. and J.E.L.-B.; writing—original draft preparation, D.A.C.-H., V.E.N.-C., J.E.L.-B. and J.A.F.-G.; writing—review and editing, D.A.C.-H., V.E.N.-C. and J.A.F.-G.; visualization, D.A.C.-H., V.E.N.-C., J.E.L.-B. and J.A.F.-G.; supervision, J.A.F.-G.; project administration, J.A.F.-G.; funding acquisition, J.A.F.-G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Programa de Apoyo a Proyectos de Investigación e Innovación Tecnológica (PAPIIT-UNAM) [IN205918 and IN202421 to J.A.F.-G.].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data analyzed in this study are available in the corresponding cited sources including data openly available in Abasy Atlas at https://abasy.ccg.unam.mx.

Acknowledgments

We thank one anonymous reviewer for his/her detailed comments and helpful suggestions.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Kang, S.; Kim, B.; Park, S.B.; Jeong, G.; Kang, H.-S.; Liu, R.; Kim, S.J. Stage-specific methylome screen identifies that NEFL is downregulated by promoter hypermethylation in breast cancer. Int. J. Oncol. 2013, 43, 1659–1665.
  2. Kiselev, V.Y.; Andrews, T.S.; Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 2019, 20, 273–282.
  3. Al-Harbi, S.H.; Rayward-Smith, V.J. Adapting k-means for supervised clustering. Appl. Intell. 2006, 24, 219–226.
  4. Hassani, M.; Seidl, T. Using internal evaluation measures to validate the quality of diverse stream clustering algorithms. Vietnam J. Comput. Sci. 2017, 4, 171–183.
  5. Fyfe, S.; Williams, C.; Mason, O.J.; Pickup, G. Apophenia, theory of mind and schizotypy: Perceiving meaning and intentionality in randomness. Cortex 2008, 44, 1316–1325.
  6. Getmansky, M.; Lo, A.W.; Makarov, I. An econometric model of serial correlation and illiquidity in hedge fund returns. J. Financial Econ. 2004, 74, 529–609.
  7. Shen, J.; Hu, Q.; Schrauder, M.; Yan, L.; Wang, D.; Medico, L.; Guo, Y.; Yao, S.; Zhu, Q.; Liu, B.; et al. Circulating miR-148b and miR-133a as biomarkers for breast cancer detection. Oncotarget 2014, 5, 5284–5294.
  8. Toyooka, S.; Toyooka, K.O.; Maruyama, R.; Virmani, A.K.; Girard, L.; Miyajima, K.; Brambilla, E. DNA Methylation Profiles of Lung Tumors. Mol. Cancer Ther. 2001, 1, 61–67.
  9. Schieber, T.A.; Carpi, L.; Díaz-Guilera, A.; Pardalos, P.M.; Masoller, C.; Ravetti, M.G. Quantification of network structural dissimilarities. Nat. Commun. 2017, 8, 13928.
  10. Escorcia-Rodríguez, J.M.; Tauch, A.; Freyre-González, J.A. Abasy Atlas v2.2: The most comprehensive and up-to-date inventory of meta-curated, historical, bacterial regulatory networks, their completeness and system-level characterization. Comput. Struct. Biotechnol. J. 2020, 18, 1228–1237.
  11. Barabási, A.-L.; Oltvai, Z.N. Network biology: Understanding the cell’s functional organization. Nat. Rev. Genet. 2004, 5, 101–113.
Figure 1. The pipeline of the partition quantitative assessment (PQA) methodology.
Figure 2. Z-scores of the PQA scores from partitions varying in the number of classifications and the length of the partition.
Figure 3. Visual representation of clustered data used to assess the method. (a) Dataset from Shen et al. (b) Dataset from Toyooka et al.
Figure 4. Z-score distribution by percentage of randomized items. (a) Dataset from Shen et al. (b) Dataset from Toyooka et al. The red dots represent the Z-score interpolation of the corresponding data sets.
Figure 5. Cluster analysis of distance among gene regulatory networks and theoretical network models. The abbreviations and colors used in the posterior classification are as follows: Barabási–Albert (BA, red), Erdős–Rényi (ER, blue), scale-free (SF, green), hierarchical modularity (HM, purple), and biological networks (Bi, orange).
Figure 6. Z-score distribution by percentage of randomized items of vector of profiles (VP) from genetic regulatory networks. The red dot represents the Z-score interpolation of the actual data set.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
