A Short Review on Minimum Description Length: An Application to Dimension Reduction in PCA
Abstract
1. Introduction
2. A Short Review about MDL
From Crude MDL to Refined MDL: NML
3. MDL Applications: A Review
NML for Dimension Reduction in PCA
4. Experimental Results
- TEST 1
- The first test is carried out on the hyperspectral dataset and follows the numerical experiments presented in [27]. A set composed of N signals randomly picked from N different classes, plus P random linear combinations of them corrupted by Gaussian noise, has been considered; the weights of the linear combinations are drawn from a normal distribution restricted to non-negative values, while the noise is zero-mean Gaussian. The goal is to recover the number of original signals, N. Each column of the resulting matrix is a signal, so that the matrix has dimension n × m, with n being the number of spectral bands and m = N + P being the total number of signals. In agreement with [27], two configurations of N and P have been considered; in both cases, the number of independent components N is correctly identified. Figure 1 depicts the behaviour of the estimated stochastic complexity with respect to k: as can be observed, it clearly presents a minimum at k = N. It is worth noting that the local relative minimum shown by the two curves is caused by the term a, which depends on the singular values. The quantization step has been set to a fixed value; however, its choice is not crucial in this case, since the contribution of the term e to the general trend of the cost is negligible when compared with that of the term b. The computing time required for the test has been less than s. A minimal sketch of this kind of experiment is given below.
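The following Python sketch reproduces the spirit of this experiment under assumed parameters (n, N, P, and the noise level are illustrative), with a crude two-part description length standing in for the refined NML criterion of Equation (5), which is not reproduced here:

```python
# Sketch of a Test-1-style experiment. The sizes and noise level below are
# illustrative assumptions, and crude_mdl() is a generic two-part MDL score,
# NOT the paper's NML formula of Equation (5).
import numpy as np

rng = np.random.default_rng(0)

n, N, P = 200, 5, 95                        # bands, base signals, mixtures (assumed)
S = rng.standard_normal((n, N))             # N "class" signals, one per column
W = np.abs(rng.standard_normal((N, P)))     # non-negative mixing weights
X = np.hstack([S, S @ W + 0.1 * rng.standard_normal((n, P))])  # n x m, m = N + P

m = X.shape[1]
sv = np.linalg.svd(X - X.mean(axis=1, keepdims=True), compute_uv=False)

def crude_mdl(k):
    # Data cost: residual energy of the rank-k approximation, coded as Gaussian noise.
    resid = np.sum(sv[k:] ** 2) / (n * m)
    data_cost = 0.5 * n * m * np.log(max(resid, 1e-12))
    # Model cost: BIC-like penalty on the free parameters of a rank-k factorization.
    model_cost = 0.5 * k * (n + m - k) * np.log(n * m)
    return data_cost + model_cost

k_star = min(range(1, min(n, m)), key=crude_mdl)
print("estimated number of components:", k_star)   # expected to be close to N
```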
- TEST 2
- The second test refers to ECG data. Here, the same number of signals is randomly selected from each of the three classes, and the aim is to identify the number of classes. It is worth observing that, in this case, the dimension of the data matrix is such that the signal length n far exceeds the number of signals m. As a consequence of this imbalance, the combined effect of the terms a and d for non-normalized data, and of the term d in the case of data normalized with respect to the (Euclidean) norm of the signal with maximum norm, leads to a trivial absolute minimum at k = m, independently of the choice of the quantization step, and hence to an inconsistent estimate of the cost of the model. This leads to the conclusion that the formula in Equation (5) generally fails whenever the length of the signals n far outweighs their number m. In Figure 2, the shape of the estimated stochastic complexity is depicted for both non-normalized and normalized data with m = 90 (30 recordings from each class); similar plots are obtained for the other tested configurations. The computing time required for the test has been less than s. More consistent results are obtained by decimating the analyzed signals; however, decimation may discard some features that distinguish the signals belonging to the different classes, resulting in the estimation of a smaller number of independent classes, as shown in Figure 3. In this case, an NML criterion depending on both the number of components and the sampling step would be preferable. A minimal decimation sketch follows.
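As a complement, a minimal sketch of the decimation work-around (the matrix sizes and the decimation step below are assumptions; the paper's exact sampling step is not reported here):

```python
# Decimating long signals reduces n relative to m before the analysis,
# at the risk of discarding class-distinctive features (cf. Figure 3).
import numpy as np

def decimate_columns(X, step):
    """Keep every `step`-th sample of each column (signal) of X."""
    return X[::step, :]

# Placeholder data standing in for the PhysioNet ECG matrix (one recording
# per column, n >> m); sizes and step are illustrative assumptions.
rng = np.random.default_rng(1)
X = rng.standard_normal((60000, 90))
X_dec = decimate_columns(X, step=16)
print(X.shape, "->", X_dec.shape)   # (60000, 90) -> (3750, 90)
```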
- TEST 3
- The third test applies the proposed NML-based feature reduction method to a more interesting (real) case concerning hyperspectral image classification. For classification purposes, the conventional approach consists of first reducing the dimensionality of the data by applying PCA, and then feeding the transformed data to an SVM (support vector machine), which classifies them. Clearly, the selection of the right number of new components is a core problem, and often several trials are needed to find the best classification score, resulting in a time-consuming and computationally expensive process. Our intent is to determine whether minimizing the estimated stochastic complexity simplifies this process, i.e., whether it is a good choice to simply select the first k* components, where k* is the minimizer. For the numerical experiment, the procedure adopted in [78] is taken as a reference, and the results concerning the Indian Pines dataset are compared with the ones presented there. Accordingly, the training set for the SVM is composed of a fixed percentage of the samples in each class, randomly selected and normalized; these samples are the columns of the matrix that is analyzed. As depicted in Figure 4, the value k* which minimizes the estimated stochastic complexity is 22, which is consistent with the best classification result for PCA+SVM obtained in [78], as shown in Figure 5. In this case, the quantization-dependent term e plays a key role in determining the trend of the cost, for two reasons: first, the arguments of the logarithms in the terms b and e have the same magnitude; second, the dimensions n and m of the matrix are such that the term e overwhelms the term b as k grows. It turns out that, in this case, the selection of the quantization step is crucial. As in the first test, a fixed quantization step has been used, and the required computing time has been about s. A sketch of the PCA+SVM pipeline is given below.
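A hedged sketch of the resulting pipeline with scikit-learn (the split fraction, scaling, and SVM settings are assumptions rather than the exact protocol of [78]; k_star = 22 is the value returned by the NML criterion):

```python
# Minimal PCA + SVM pipeline for hyperspectral classification. The data
# loading, test_size, scaler, and kernel are assumptions for illustration.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def classify_with_k_components(X, y, k_star, test_size=0.9, seed=0):
    """Project spectra onto the first k_star principal components, then train an SVM."""
    Xtr, Xte, ytr, yte = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed)
    clf = make_pipeline(StandardScaler(), PCA(n_components=k_star), SVC(kernel="rbf"))
    clf.fit(Xtr, ytr)
    return clf.score(Xte, yte)

# X: (num_pixels, num_bands) Indian Pines spectra; y: ground-truth labels.
# acc = classify_with_k_components(X, y, k_star=22)
```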
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Chandrashekar, G.; Sahin, F. A survey on feature selection methods. Comput. Electr. Eng. 2014, 40, 16–28. [Google Scholar] [CrossRef]
- Ferreira, A.J.; Figueiredo, M.A.T. Efficient feature selection filters for high-dimensional data. Pattern Recognit. Lett. 2012, 33, 1794–1804. [Google Scholar] [CrossRef] [Green Version]
- Jolliffe, I.; Cadima, J. Principal component analysis: A review and recent developments. Philos. Trans. R. Soc. A 2016, 374, 20150202. [Google Scholar] [CrossRef]
- Lee, D.; Seung, H. Learning the parts of objects by non-negative matrix factorization. Nature 1999, 401, 788–791. [Google Scholar] [CrossRef] [PubMed]
- Tenenbaum, J.B.; De Silva, V.; Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 2000, 290, 2319–2323. [Google Scholar] [CrossRef] [PubMed]
- Van der Maaten, L.; Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
- McInnes, L.; Healy, J. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
- Vincent, P.; Larochelle, H.; Bengio, Y.; Manzagol, P.A. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning (ICML '08), Helsinki, Finland, 5–9 July 2008; pp. 1096–1103. [Google Scholar]
- Cox, M.; Cox, T. Multidimensional Scaling. In Handbook of Data Visualization; Springer Handbooks Comp. Statistics; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
- Rissanen, J. Modeling by the shortest data description. Automatica 1978, 14, 465–471. [Google Scholar] [CrossRef]
- Rissanen, J. A universal prior for integers and estimation by minimum description length. Ann. Stat. 1983, 11, 416–431. [Google Scholar] [CrossRef]
- Cover, T.; Thomas, J. Elements of Information Theory; Wiley Interscience: New York, NY, USA, 1991. [Google Scholar]
- Myung, J.I.; Navarro, D.J.; Pitt, M.A. Model selection by normalized maximum likelihood. J. Math. Psychol. 2006, 50, 167–179. [Google Scholar] [CrossRef] [Green Version]
- Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007. [Google Scholar]
- Hu, B.; Rakthanmanon, T.; Hao, Y.; Evans, S.; Lonardi, S.; Keogh, E. Using the minimum description length to discover the intrinsic cardinality and dimensionality of time series. Data Min. Knowl. Discov. 2015, 29, 358–399. [Google Scholar] [CrossRef] [Green Version]
- Cubero, R.J.; Marsili, M.; Roudi, Y. Minimum Description Length Codes Are Critical. Entropy 2018, 20, 755. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Makalic, E.; Schmidt, D.F. Minimum Message Length Inference of the Exponential Distribution with Type I Censoring. Entropy 2021, 23, 1439. [Google Scholar] [CrossRef] [PubMed]
- Adriaans, P.; Vitanyi, P.M.B. Approximation of the Two-Part MDL Code. IEEE Trans. Inf. Theory 2009, 55, 444–457. [Google Scholar] [CrossRef] [Green Version]
- Murena, P.A.; Cornuéjols, A. Minimum Description Length Principle applied to structure adaptation for classification under concept drift. In Proceedings of the International Joint Conference on Neural Networks, Vancouver, BC, Canada, 24–29 July 2016; pp. 2842–2849. [Google Scholar]
- Barron, A.; Rissanen, J.; Yu, B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760. [Google Scholar] [CrossRef] [Green Version]
- Grünwald, P. Minimum description length tutorial. In Advances in Minimum Description Length: Theory and Applications; Grünwald, P., Myung, I.J., Pitt, M.A., Eds.; MIT Press: Cambridge, MA, USA, 2005; pp. 23–80. [Google Scholar]
- Hansen, M.H.; Yu, B. Minimum description length model selection criteria for generalized linear models. In Lecture Notes–Monograph Series; Institute of Mathematical Statistics: Beachwood, OH, USA, 2003; Volume 40, pp. 145–163. [Google Scholar]
- Rissanen, J. Strong optimality of the normalized ML models as universal codes. IEEE Trans. Inf. Theory 2001, 47, 1712–1717. [Google Scholar] [CrossRef] [Green Version]
- Bokde, D.; Girase, S.; Mukhopadhyay, D. Matrix factorization model in collaborative filtering algorithms: A survey. In Proceedings of the 4th International Conference on Advances in Computing, Communication and Control, Mumbai, India, 1–2 April 2015; pp. 136–146. [Google Scholar]
- Eckart, C.; Young, G. The approximation of one matrix by another of lower rank. Psychometrika 1936, 1, 211–218. [Google Scholar] [CrossRef]
- Udell, M.; Horn, C.; Zadeh, R.; Boyd, S. Generalized low rank models. Found. Trends Mach. Learn. 2016, 9, 1–118. [Google Scholar] [CrossRef]
- Tavory, A. Determining Principal Component Cardinality Through the Principle of Minimum Description Length. In Machine Learning, Optimization, and Data Science (LOD 2019), LNCS 11943; Nicosia, G., Ojha, V., Malfa, E.L., Jansen, G., Sciacca, V., Pardalos, P., Giuffrida, G., Umeton, R., Eds.; Springer: Berlin/Heidelberg, Germany, 2019; pp. 655–666. [Google Scholar]
- Grünwald, P.; Roos, T. Minimum description length revisited. Int. J. Math. Ind. 2019, 11, 1930001. [Google Scholar] [CrossRef] [Green Version]
- Navarro, D.J.; Lee, M.D. Common and distinctive features in stimulus representation: A modified version of the contrast model. Psychon. Bull. Rev. 2004, 11, 961–974. [Google Scholar] [CrossRef] [Green Version]
- Bruni, V.; Vitulano, D. An entropy based approach for SSIM speed up. Signal Process. 2017, 135, 198–209. [Google Scholar] [CrossRef]
- Bruni, V.; Tartaglione, M.; Vitulano, D. A signal complexity-based approach for am–fm signal modes counting. Mathematics 2020, 8, 2170. [Google Scholar] [CrossRef]
- Rissanen, J. Stochastic Complexity in Statistical Inquiry; World Scientific Publishing: Singapore, 1989. [Google Scholar]
- Rissanen, J. Strong optimality of the normalized ML models as universal codes and information in data. IEEE Trans. Inf. Theory 2001, 47, 1712–1717. [Google Scholar] [CrossRef] [Green Version]
- Myung, I.J.; Pitt, M.A. Applying Occam’s razor in modeling cognition: A Bayesian approach. Psychon. Bull. Rev. 1997, 4, 79–95. [Google Scholar] [CrossRef] [Green Version]
- Rissanen, J. MDL denoising. IEEE Trans. Inf. Theory 2000, 46, 2537–2542. [Google Scholar] [CrossRef]
- Kontkanen, P.; Myllymaki, P.; Buntine, W.; Rissanen, J.; Tirri, H. An MDL Framework for Data Clustering; Helsinki Institute for Information Technology HIIT Technical Report; MIT Press: Cambridge, MA, USA, 2003. [Google Scholar]
- Blier, L.; Ollivier, Y. The description length of deep learning models. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 2220–2230. [Google Scholar]
- Begum, N.; Hu, B.; Rakthanmanon, T.; Keogh, E. Towards a minimum description length based stopping criterion for semi-supervised time series classification. In Proceedings of the IEEE 14th International Conference on Information Reuse & Integration (2013), San Francisco, CA, USA, 14–16 August 2013; pp. 333–340. [Google Scholar]
- Yamanishi, K.; Fukushima, S. Model Change Detection With the MDL Principle. IEEE Trans. Inf. Theory 2018, 64, 6115–6126. [Google Scholar] [CrossRef]
- Yamanishi, K. Descriptive Dimensionality and Its Characterization of MDL-based Learning and Change Detection. arXiv 2019, arXiv:1910.11540. [Google Scholar]
- Hinton, G.E.; van Camp, D. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the 6th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 26–28 July 1993; pp. 5–13. [Google Scholar]
- Lin, B. Regularity Normalization: Neuroscience-Inspired Unsupervised Attention across Neural Network Layers. Entropy 2022, 24, 59. [Google Scholar] [CrossRef]
- Saxe, A.M.; Bansal, Y.; Dapello, J.; Advani, M.; Kolchinsky, A.; Tracey, B.D.; Cox, D.D. On the information bottleneck theory of deep learning. J. Stat. Mech. Theory Exp. 2019, 2019, 124020. [Google Scholar] [CrossRef]
- Tishby, N.; Zaslavsky, N. Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop, Jerusalem, Israel, 11–15 October 2015; pp. 1–5. [Google Scholar]
- Fang, J.; Ouyang, H.; Shen, V.; Dougherty, V.; Liu, W. Using the minimum description length principle to reduce the rate of false positives of best-fit algorithms. EURASIP J. Bioinform. Syst. Biol. 2014, 13, 13. [Google Scholar] [CrossRef] [Green Version]
- Chaitankar, V.; Zhang, C.; Ghosh, P.; Gong, P.; Perkins, E.J.; Deng, Y. Predictive minimum description length principle approach to inferring gene regulatory networks. Adv. Exp. Med. Biol. 2011, 696, 37–43. [Google Scholar] [PubMed]
- Fade, J.; Lefebvre, S.; Cézard, N. Minimum description length approach for unsupervised spectral unmixing of multiple interfering gas species. Opt. Express 2011, 19, 13862–13872. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wallace, R.S.; Kanade, T. Finding natural clusters having minimum description length. In Proceedings of the 10th International Conference on Pattern Recognition, Atlantic City, NJ, USA, 16–21 June 1990. [Google Scholar]
- Hirai, S.; Yamanishi, K. Detecting Changes of Clustering Structures Using Normalized Maximum Likelihood Coding. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, Beijing, China, 12–16 August 2012; pp. 343–351. [Google Scholar]
- Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
- Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464. [Google Scholar] [CrossRef]
- Al-Qurabat, A.K.M.; Abou Jaoude, C.; Idrees, A.K. Two Tier Data Reduction Technique for Reducing Data Transmission in IoT Sensors. In Proceedings of the 15th International Wireless Communications & Mobile Computing Conference, Tangier, Morocco, 24–28 June 2019. [Google Scholar]
- Squires, S.; Prügel-Bennett, A.; Niranjan, M. Minimum description length as an objective function for non-negative matrix factorization. arXiv 2019, arXiv:1902.01632. [Google Scholar]
- Pandey, G.; Dukkipati, A. Minimum description length principle for maximum entropy model selection. In Proceedings of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 1521–1525. [Google Scholar]
- Shamir, G.I. Minimum description length (MDL) regularization for online learning. In Proceedings of the 1st International Workshop on Feature Extraction: Modern Questions and Challenges at NIPS 2015, PMLR 44:260-276, Montreal, QC, Canada, 11 December 2015. [Google Scholar]
- Thodberg, H.H. Minimum Description Length Shape and Appearance Models. In Proceedings of the Biennial International Conference on Information Processing in Medical Imaging IPMI, Ambleside, UK, 20–25 July 2003; pp. 51–62. [Google Scholar]
- Bariatti, F.; Cellier, P.; Ferré, S.; Berthold, M.R.; Feelders, A.; Krempl, G. GraphMDL: Graph Pattern Selection Based on Minimum Description Length. In Advances in Intelligent Data Analysis XVIII; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 54–66. [Google Scholar]
- Jang, K.E.; Tak, S.; Jung, J.; Jang, J.; Jeong, Y.; Ye, J.C. Wavelet minimum description length detrending for near-infrared spectroscopy. J. Biomed. Opt. 2009, 14, 034004. [Google Scholar] [CrossRef]
- Hamid, E.Y.; Kawasaki, Z.I. Wavelet-based data compression of power system disturbances using the minimum description length criterion. IEEE Trans. Power Deliv. 2002, 17, 460–466. [Google Scholar] [CrossRef]
- Ojanen, J.; Heikkonen, J. A soft thresholding approach for MDL denoising. In Proceedings of the 15th European Signal Processing Conference, Poznan, Poland, 3–7 September 2007; pp. 1083–1087. [Google Scholar]
- Kumar, V.; Heikkonen, J.; Rissanen, J.; Kaski, K. Minimum description length denoising with histogram models. IEEE Trans. Signal Process. 2006, 54, 2922–2928. [Google Scholar] [CrossRef]
- Wettig, H.; Kontkanen, P.; Myllymaki, P. Calculating the Normalized Maximum Likelihood Distribution for Bayesian Forests. In Proceedings of the IADIS International Conference Intelligent Systems and Agents, Lisbon, Portugal, 5–8 October 2007. [Google Scholar]
- Jackson, D.A. Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches. Ecology 1993, 74, 2204–2214. [Google Scholar] [CrossRef]
- Jolliffe, I. Principal Component Analysis; Wiley Online Library: Hoboken, NJ, USA, 2005. [Google Scholar]
- Okamoto, M. Optimality of principal components. In Multivariate Analysis II; Krishnaiah, P.R., Ed.; Academic Press: New York, NY, USA, 1969; pp. 673–685. [Google Scholar]
- McCabe, G.P. Principal variables. Technometrics 1984, 26, 137–144. [Google Scholar] [CrossRef]
- Cadima, J.; Cerdeira, J.O.; Minhoto, M. Computational aspects of algorithms for variable selection in the context of principal components. Comp. Stat. Data Anal. 2004, 47, 225–236. [Google Scholar] [CrossRef]
- R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2015. [Google Scholar]
- Saccenti, E.; Camacho, J. Determining the number of components in principal components analysis: A comparison of statistical, cross-validation and approximated methods. Chemom. Intell. Lab. Syst. 2015, 149, 99–116. [Google Scholar] [CrossRef]
- Gabriel, K.R. The biplot graphical display of matrices with application to principal component analysis. Biometrika 1971, 58, 453–467. [Google Scholar] [CrossRef]
- Cadima, J.; Jolliffe, I.T. On relationships between uncentred and column-centred principal component analysis. Pak. J. Stat. 2009, 25, 473–503. [Google Scholar]
- Demmel, J.W. Applied Numerical Linear Algebra; SIAM: Philadelphia, PA, USA, 1997. [Google Scholar]
- Mirsky, L. Symmetric gauge functions and unitarily invariant norms. Q. J. Math. 1960, 11, 50–59. [Google Scholar] [CrossRef]
- Baumgardner, M.F.; Biehl, L.L.; Landgrebe, D.A. 220 Band AVIRIS Hyperspectral Image Data Set: June 12, 1992 Indian Pine Test Site 3; Purdue University Research Repository; Purdue University: West Lafayette, IN, USA, 2015. [Google Scholar]
- Goldberger, A.L.; Amaral, L.; Glass, L.; Hausdorff, J.M.; Ivanov, P.; Mark, R.G.; Mietus, J.E.; Moody, G.B.; Peng, C.K.; Stanley, H.E. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation 2000, 101, 215–220. [Google Scholar] [CrossRef] [Green Version]
- Mallat, S. A Wavelet Tour of Signal Processing, 2nd ed.; Academic Press: Cambridge, MA, USA, 1999. [Google Scholar]
- Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression; Springer Science & Business Media: Berlin/Heidelberg, Germany, 1991. [Google Scholar]
- Shambulinga, M.; Sadashivappa, G. Hyperspectral Image Classification using Support Vector Machine with Guided Image Filter. Int. J. Adv. Comput. Sci. Appl. 2019, 10, 271–276. [Google Scholar]
Test | 90% | 95% | 99% | Bartlett's Test () | Bartlett's Test () | MDL | True Value
---|---|---|---|---|---|---|---
Test 1 | 1 | 1 | 1 | 30 | 30 | 5 | 5
Test 2 | 37 | 52 | 75 | 90 | 90 | 90 | 3
Test 2 (decimated data) | 2 | 6 | 24 | 63 | 63 | 2 | 3
Test 3 | 2 | 6 | 27 | 161 | 159 | 22 | 22
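For completeness, the baseline in the first three columns of the table can be computed as below, assuming those columns follow the standard cumulative-explained-variance stopping rule (the paper's exact preprocessing is not reproduced):

```python
# Smallest k whose leading principal components explain at least the given
# fraction of the total variance (standard percentage-of-variance rule).
import numpy as np

def k_for_variance_fraction(X, fraction):
    Xc = X - X.mean(axis=1, keepdims=True)           # center the signals
    sv = np.linalg.svd(Xc, compute_uv=False)
    explained = np.cumsum(sv ** 2) / np.sum(sv ** 2)
    return int(np.searchsorted(explained, fraction) + 1)

# for f in (0.90, 0.95, 0.99):
#     print(f, k_for_variance_fraction(X, f))
```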