Study on the Influence of Diversity and Quality in Entropy Based Collaborative Clustering
Abstract
1. Introduction
- An optimization-based method whose aim is to find the best weighting between algorithms in order to decide which ones to collaborate with. This is achieved using Karush-Kuhn-Tucker (KKT) optimization, and the result gives insight into the importance of diversity in achieving positive collaborations. This first contribution is an extension of two previous conference papers [14,15].
- An empirical simulation, in which we run multiple instances of the EBCC algorithm with solutions of various quality and diversity in order to assess the influence of these two parameters. From the results, we use linear regression analysis to propose a predictive model for the outcome of collaborative clustering.
2. Related Works
- What we propose in this paper is not limited to a single algorithm or a single family of algorithms. As we explain in Section 4, our optimization model for the weights is generic and can be applied to most collaborative clustering algorithms.
- The optimization model for the weights in this paper does not rely on differentiation and gradient descent but on KKT optimization. It is therefore very generic.
- We provide an interpretation of the optimal weights found by our proposed method and compare them with the empirical results to see if theory and practice agree or not.
3. Entropy Based Collaborative Clustering
3.1. Notations
3.2. Algorithm
- Local step: Initialization of all local partitions and model parameters using the local algorithms, so as to maximize each local criterion.
- Collaborative step: Repeat the following alternate maximization steps until the global system entropy from Equation (1) stops improving:
- Find S that maximizes the global entropy with the parameters fixed, then update S.
- Find the parameters that maximize the global entropy for S fixed, then update them.
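The two-phase procedure above can be sketched as a generic alternate-maximization loop. This is a hypothetical skeleton, not the paper's implementation: `local_fit`, `global_entropy`, `update_partitions`, and `update_parameters` are placeholders standing in for the EBCC operators of Section 3.

```python
def ebcc(views, local_fit, global_entropy, update_partitions, update_parameters,
         tol=1e-9, max_iter=100):
    # Local step: each view is clustered independently by its local algorithm.
    partitions = [local_fit(v) for v in views]
    params = [1.0] * len(views)          # e.g., uniform confidence weights

    # Collaborative step: alternate maximization until the global
    # entropy stops improving (Equation (1) in the paper).
    previous = global_entropy(partitions, params)
    for _ in range(max_iter):
        partitions = update_partitions(partitions, params)   # S step
        params = update_parameters(partitions, params)       # parameter step
        current = global_entropy(partitions, params)
        if abs(current - previous) < tol:
            break
        previous = current
    return partitions, params
```

The stopping rule on the entropy difference mirrors the "stops improving" criterion in the collaborative step above.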
4. Mathematical Study on the Influence of Diversity
4.1. KKT Optimization Model
4.2. Results Interpretation
- Optimal collaborative clustering aims at reducing the divergences between similar partitions.
- Optimal collaborative clustering does not encourage diversity and reduces the exchanges between partitions or models that are too dissimilar.
4.3. Datasets and Indexes
- The Iris Dataset (UCI): This data set has 150 instances of iris flowers described by 4 integer attributes. The flowers can be classified in 3 categories: Iris Setosa, Iris Versicolour and Iris Virginica. Class structures are well behaved and the class instances are balanced (50/50/50). For the purpose of this experiment, we create artificial views each containing only 3 of the 4 attributes.
- The Wine Dataset (UCI): This data set contains 178 instances of Italian wines from 3 different cultivars. All wines are described by 13 numerical attributes and the classes to be found are the 3 cultivars of origin. Class structures are well behaved in this data set, but the class instances are unbalanced (59/71/48).
- The EColi Dataset (UCI): This data set contains 336 instances describing cells measures of Escherichia Coli bacteria. The original data set contains 7 numerical attributes (we removed the first attribute containing the sequence name). The goal of this data set is to predict the localization site of proteins by employing some measures about the cells. There are 4 main site locations that can be divided into 8 hierarchical classes.
- The Wisconsin Diagnostic Breast Cancer (WDBC) Dataset (UCI): This dataset has 569 instances with 30 variables from 3 different cells that can easily be split into 3 natural views of 10 attributes. Each data observation is labeled as benign (357) or malignant (212).
- The Image Segmentation dataset (UCI): The 2310 instances of this data set were drawn randomly from a database of 7 outdoor images. The images were hand segmented to create a classification for every pixel. Each instance is a 3 × 3 region represented by 19 attributes and there are 7 classes to be found. The attributes can be split into views based on the colors (RGB).
- The VHR Strasbourg dataset ([34,35]): It contains the description of 187,058 segments extracted from a very high resolution satellite image of the French city of Strasbourg. Each segment is described by 27 attributes that can be split between radiometric attributes, shape attributes, and texture attributes. The data set is provided with a partial hybrid ground-truth containing 15 expert classes. Due to its large size, this dataset is not used in all our experiments.
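As an illustration of the view construction described for the Iris data set (each artificial view keeps only 3 of the 4 attributes), here is a minimal sketch; the random matrix is only a synthetic stand-in for the real 150 × 4 Iris data.

```python
import numpy as np
from itertools import combinations

# Synthetic stand-in for the 150 x 4 Iris attribute matrix.
X = np.random.default_rng(0).normal(size=(150, 4))

# One artificial view per 3-attribute subset of the 4 attributes.
views = [X[:, list(cols)] for cols in combinations(range(4), 3)]
```

Each of the 4 resulting views has shape (150, 3); the same slicing idea applies to the attribute-based views of the other data sets (e.g., RGB views for Image Segmentation).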
4.4. Results with the Optimized Weights under KKT Conditions
5. Empirical Study on the Influence of Diversity and Quality
5.1. Experimental Protocol
- A collaboration outcome where the local result is improved so that it becomes better than the average result of the other collaborators before collaboration (grey diagonal line) will be called a good collaboration.
- If the result improves but remains lower than the average quality of the other collaborators before collaboration, it is a fair collaboration.
- If the result gets worse while remaining above the average quality before collaboration, then it is a negative collaboration.
- Finally, if the result gets worse and falls below the average quality before collaboration, such a collaboration is not just a negative collaboration, it is a bad collaboration. These terms are illustrated in the diagram shown in Figure 1.
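The four outcome labels above can be encoded directly. This helper is an illustrative reformulation, not code from the paper, and assumes a quality index where higher is better (such as the Silhouette index); ties are counted as non-improvement here.

```python
def collaboration_outcome(before, after, peers_avg):
    """Classify a collaboration outcome for one local algorithm.

    before, after: local quality before/after collaboration;
    peers_avg: average quality of the other collaborators before
    collaboration (the grey diagonal line of Figure 1).
    """
    if after > before:                        # the local result improved
        return "good" if after > peers_avg else "fair"
    return "negative" if after >= peers_avg else "bad"
```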
5.2. Point Clouds Visualization and Interpretation
- The raw Silhouette index difference between the two collaborators depending on the initial diversity. These are not experimental results and are just here to show that our simulations covered most possible cases of differences in quality and diversity prior to the collaborative process.
- The Silhouette index raw improvement depending on the initial entropy between the collaborators.
- The Silhouette index raw improvement depending on the initial Silhouette index raw difference between the two collaborators.
- The Davies-Bouldin index raw improvement depending on the initial Davies-Bouldin index raw difference between the two collaborators.
5.3. Regression Model from the Point Clouds
- The quality of the local clustering result before collaboration, measured with the Silhouette index. It is computed in the local feature space.
- The quality of the collaborator's clustering result before collaboration. It is computed using the Silhouette index in the local feature space of the algorithm receiving the collaboration.
- The quality difference between the receiving algorithm and the collaborating algorithm.
- H, the diversity with the collaborator before collaboration. We use our oriented entropy to get the potential information gain that the collaborator can provide.
- The quality improvement during the collaboration, i.e., the raw improvement of the Silhouette index for the local algorithm. It is positive if the Silhouette index after collaboration is better than the one before, and negative in the case of a negative collaboration.
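With these variables, the regression of this section amounts to an ordinary least-squares fit of the quality gain against the quality difference and the diversity. The sketch below fits such a model on synthetic observations: the generating coefficients are arbitrary illustration values, not the paper's estimates.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3000                                  # one point per simulated collaboration
dq = rng.uniform(-1.0, 1.0, n)            # quality difference before collaboration
div = rng.uniform(0.0, 1.0, n)            # diversity (oriented entropy)
# Synthetic gains: coefficients chosen only for illustration.
gain = 0.34 * dq + 0.49 * div - 0.03 + rng.normal(0.0, 0.01, n)

# Ordinary least squares for  gain ~ a * dq + b * div + c
A = np.column_stack([dq, div, np.ones(n)])
(a, b, c), *_ = np.linalg.lstsq(A, gain, rcond=None)
```

With 3000 points per data set, as in the experimental protocol, a low-noise linear relation is recovered almost exactly by the fit.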
5.4. Weighting Proposition Based on the Regression Results
- For the lowest values of the threshold, all collaborators participate in the process with a relatively significant weight. In this case the collaboration results are poor (Wine data set) or average (the other data sets).
- When the threshold increases and gets closer to zero, only the best collaborators are given an important weight in the collaborative process and work together. For all the data sets, this range of values results in a sharp increase of the average quality gain during the collaboration.
- Finally, when the threshold becomes too high, no collaborator is left that is considered good enough by the collaborative framework and the collaboration stops, hence a zero average gain. For most data sets, this occurs in the same range of threshold values.
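The threshold behaviour described above can be mimicked with a simple cut-off on the predicted gains. This is a sketch of the selection effect only, not the paper's exact weighting scheme:

```python
def collaboration_weights(predicted_gains, threshold):
    """Keep only collaborators whose predicted gain exceeds `threshold`.

    Returns normalized weights over the collaborators, or all zeros when
    no collaborator qualifies and the collaboration stops (zero gain).
    """
    kept = [g if g > threshold else 0.0 for g in predicted_gains]
    total = sum(kept)
    if total == 0.0:
        return [0.0] * len(kept)      # collaboration stops
    return [g / total for g in kept]
```

Raising the threshold first concentrates the weights on the best collaborators, then eventually excludes everyone, reproducing the three regimes listed above.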
6. Conclusions and Perspectives
6.1. Discussion on the Influence of Quality and Diversity
6.2. Limitations of This Work
6.3. Perspectives and Extensions to Other Algorithms
Author Contributions
Funding
Conflicts of Interest
References
- Dang, T.H.; Ngo, L.T.; Pedrycz, W. Multiple Kernel Based Collaborative Fuzzy Clustering Algorithm. In Proceedings of the 2016 Intelligent Information and Database Systems-8th Asian Conference ACIIDS, Da Nang, Vietnam, 14–16 March 2016; pp. 585–594.
- Filali, A.; Jlassi, C.; Arous, N. SOM variants for topological horizontal collaboration. In Proceedings of the 2016 2nd International Conference on Advanced Technologies for Signal and Image Processing (ATSIP), Monastir, Tunisia, 21–23 March 2016; pp. 459–464.
- Shen, Y.; Pedrycz, W. Collaborative fuzzy clustering algorithm: Some refinements. Int. J. Approx. Reason. 2017, 86, 41–61.
- Vanhaesebrouck, P.; Bellet, A.; Tommasi, M. Decentralized Collaborative Learning of Personalized Models over Networks. AISTATS 2017. Available online: https://hal.inria.fr/hal-01533182/ (accessed on 27 September 2019).
- Cornuéjols, A.; Wemmert, C.; Gançarski, P.; Bennani, Y. Collaborative clustering: Why, when, what and how. Inf. Fus. 2018, 39, 81–95.
- Murena, P.; Sublime, J.; Matei, B.; Cornuéjols, A. An Information Theory based Approach to Multisource Clustering. In Proceedings of the IJCAI-ECAI-18, Stockholm, Sweden, 13–19 July 2018; pp. 2581–2587.
- Ngo, L.T.; Dang, T.H.; Pedrycz, W. Towards interval-valued fuzzy set-based collaborative fuzzy clustering algorithms. Pattern Recognit. 2018, 81, 404–416.
- Qiao, Y.; Li, S.; Denoeux, T. Collaborative Evidential Clustering. In Fuzzy Techniques: Theory and Applications, Proceedings of the 2019 Joint World Congress of the International Fuzzy Systems Association and the Annual Conference of the North American Fuzzy Information Processing Society IFSA/NAFIPS'2019, Lafayette, LA, USA, 18–21 June 2019; Advances in Intelligent Systems and Computing; Kearfott, R.B., Batyrshin, I.Z., Reformat, M., Ceberio, M., Kreinovich, V., Eds.; Springer: Berlin, Germany, 2019; Volume 1000, pp. 518–530.
- Breiman, L. Bagging Predictors. Mach. Learn. 1996, 24, 123–140.
- Chawla, N.V.; Eschrich, S.; Hall, L.O. Creating Ensembles of Classifiers. In Proceedings of the 2001 IEEE International Conference on Data Mining, San Jose, CA, USA, 29 November–2 December 2001; pp. 580–581.
- Bachman, P.; Alsharif, O.; Precup, D. Learning with Pseudo-Ensembles. In Advances in Neural Information Processing Systems 27; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K., Eds.; Curran Associates, Inc.: New York, NY, USA, 2014; pp. 3365–3373.
- Iam-on, N.; Boongoen, T. Comparative study of matrix refinement approaches for ensemble clustering. Mach. Learn. 2015, 98, 269–300.
- Sublime, J.; Matei, B.; Grozavu, N.; Bennani, Y.; Cornuéjols, A. Entropy Based Probabilistic Collaborative Clustering. Pattern Recognit. 2017, 72, 144–157.
- Sublime, J.; Matei, B.; Murena, P. Analysis of the influence of diversity in collaborative and multi-view clustering. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 4126–4133.
- Sublime, J.; Maurel, D.; Grozavu, N.; Matei, B.; Bennani, Y. Optimizing exchange confidence during collaborative clustering. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), Rio de Janeiro, Brazil, 8–13 July 2018; pp. 1–8.
- Kuncheva, L.I.; Whitaker, C.J. Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy. Mach. Learn. 2003, 51, 181–207.
- Strehl, A.; Ghosh, J. Cluster Ensembles-A Knowledge Reuse Framework for Combining Multiple Partitions. J. Mach. Learn. Res. 2002, 3, 583–617.
- Lourenço, A.; Rota Bulò, S.; Rebagliati, N.; Fred, A.; Figueiredo, M.; Pelillo, M. Probabilistic consensus clustering using evidence accumulation. Mach. Learn. 2015, 98, 331–357.
- Zimek, A.; Vreeken, J. The blind men and the elephant: On meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach. Learn. 2015, 98, 121–155.
- Loia, V.; Pedrycz, W.; Senatore, S. Semantic Web Content Analysis: A Study in Proximity-Based Collaborative Clustering. IEEE Trans. Fuzzy Syst. 2007, 15, 1294–1312.
- Grozavu, N.; Ghassany, M.; Bennani, Y. Learning confidence exchange in Collaborative Clustering. In Proceedings of the 2011 International Joint Conference on Neural Networks, San Jose, CA, USA, 31 July–5 August 2011; pp. 872–879.
- Grozavu, N.; Cabanes, G.; Bennani, Y. Diversity analysis in collaborative clustering. In Proceedings of the 2014 International Joint Conference on Neural Networks (IJCNN), Beijing, China, 6–11 July 2014; pp. 1754–1761.
- Rastin, P.; Cabanes, G.; Grozavu, N.; Bennani, Y. Collaborative Clustering: How to Select the Optimal Collaborators? In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence SSCI, Cape Town, South Africa, 7–10 December 2015; pp. 787–794.
- Grozavu, N.; Bennani, Y. Topological Collaborative Clustering. Aust. J. Intell. Inf. Process. Syst. 2010, 12, 13–18.
- Ghassany, M.; Grozavu, N.; Bennani, Y. Collaborative Generative Topographic Mapping. In International Conference on Neural Information Processing; Springer: Berlin/Heidelberg, Germany, 2012; pp. 591–598.
- Rastin, P.; Matei, B.; Cabanes, G.; Grozavu, N.; Bennani, Y. Impact of Learners' Quality and Diversity in Collaborative Clustering. J. Artif. Intell. Soft Comput. Res. 2019, 9, 149–165.
- Wang, X.N.; Wei, J.M.; Jin, H.; Yu, G.; Zhang, H.W. Probabilistic Confusion Entropy for Evaluating Classifiers. Entropy 2013, 15, 4969–4992.
- Pedrycz, W. Collaborative fuzzy clustering. Pattern Recognit. Lett. 2002, 23, 1675–1686.
- Kuhn, H.W.; Tucker, A.W. Nonlinear programming. In Proceedings of the Second Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 31 July–12 August 1951; pp. 481–492.
- de Carvalho, F.; de Melo, F.M.; Lechevallier, Y. A multi-view relational fuzzy c-medoid vectors clustering algorithm. Neurocomputing 2015, 163, 115–123.
- Hanson, M.A. Invexity and the Kuhn–Tucker Theorem. J. Math. Anal. Appl. 1999, 236, 594–604.
- Ben-David, S.; von Luxburg, U.; Pal, D. A Sober Look at Clustering Stability. In Learning Theory; Lecture Notes in Computer Science; Lugosi, G., Simon, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 5–19.
- von Luxburg, U. Clustering Stability: An Overview. Found. Trends Mach. Learn. 2010, 2, 235–274.
- Rougier, S.; Puissant, A. Improvements of urban vegetation segmentation and classification using multi-temporal Pleiades images. In Proceedings of the 5th International Conference on Geographic Object-Based Image Analysis, Thessaloniki, Greece, 21–24 May 2014; p. 6.
- Sublime, J.; Troya-Galvis, A.; Puissant, A. Multi-Scale Analysis of Very High Resolution Satellite Images Using Unsupervised Techniques. Remote Sens. 2017, 9, 495.
- Davies, D.L.; Bouldin, D.W. A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, 1, 224–227.
- Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 1987, 20, 53–65.
Data Set | Silhouette Index | DB-Index |
---|---|---|
Wine | +1% | +1% |
WDBC | +1% | +1% |
Waveform | +4% | +1% |
EColi | +9% | +4% |
Image Segmentation | +4% | +15% |
Data Set | Number of Collaborations |
---|---|
Iris | 3000 |
Wine | 3000 |
WDBC | 3000 |
EColi | 3000 |
ImgSeg | 3000 |
VHR Strasbourg | 500 |
Data Set | ΔQ Coefficient | D Coefficient | Constant c | R² | Abs. Mean Error
---|---|---|---|---|---
Iris | 0.5612 | 0.5405 | −0.0723 | 0.9683 | 0.0499 |
Wine | 0.3262 | 0.4763 | −0.0309 | 0.9718 | 0.0231 |
WDBC | 0.3410 | 0.5337 | −0.0032 | 0.9499 | 0.0302 |
Ecoli | 0.3352 | 0.4518 | 0.0419 | 0.9510 | 0.0292 |
ImgSeg | 0.2907 | 0.4438 | 0.0032 | 0.9221 | 0.0302 |
VHR Strasbourg | 0.2456 | 0.4858 | 0.0212 | 0.9519 | 0.0100 |
Average | 0.3363 | 0.4911 | −0.0295 | 0.9513 | 0.03374 |
Data Set | Mean Error | R²
---|---|---
Iris | 0.0999 | 0.9547 |
Wine | 0.0332 | 0.9717 |
WDBC | 0.0526 | 0.9499 |
EColi | 0.0361 | 0.9499 |
ImgSeg | 0.0376 | 0.9223 |
VHR Strasbourg | 0.0385 | 0.9090 |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Sublime, J.; Cabanes, G.; Matei, B. Study on the Influence of Diversity and Quality in Entropy Based Collaborative Clustering. Entropy 2019, 21, 951. https://doi.org/10.3390/e21100951