Entropy-based Data Mining

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (31 January 2018) | Viewed by 68310

Special Issue Editors


Guest Editor
Instituto de Física Interdisciplinar y Sistemas Complejos (IFISC), E-07122 Palma, Spain
Interests: complex systems; complex networks; network science; data mining

Guest Editor
Department of Computer Systems Languages and Software Engineering, Faculty of Computer Science, Universidad Politecnica de Madrid, Madrid, Spain
Interests: data mining; data science project development; medical data analysis; NLP in medical domain

Special Issue Information

Dear Colleagues,

Entropy and data mining are not as distant as concepts as they may initially appear. Both share a common idea: the information contained in data presents regularities, or structures, which we ought to understand in order to better understand the system under study. If entropy aims at assessing the presence of these structures, data mining goes one step further by extracting them and making them explicit for further use; clearly, however, the former is a first and necessary step for the latter.

Not surprisingly, entropy and data mining have had an intermingled history. Specifically, entropy has been used extensively to define and support data mining algorithms. Examples include the use of entropy metrics as splitting and pruning criteria in decision trees; as a means to weight distances in high-dimensional k-means clustering algorithms; to select feature subsets in classification ensembles; and as a criterion to combine multiple classifiers. Entropy has also buttressed the creation of data mining models, as in maximum entropy classifiers, implementations of the multinomial logistic regression concept, and in outlier detection. On the other hand, entropy has also been used to create new features from data, which are then fed to standard data mining algorithms. For instance, different types of entropies have been used to describe time series, e.g., to distinguish between normal and ictal brain dynamics, or to assess heart rate complexity; to describe and then compare symbolic sequences, as in DNA and in the identification of protein coding and non-coding sequences; or to assess the complexity of graphs and networks, in order to then distinguish and classify them.
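
As a rough illustration of entropy as a splitting criterion in decision trees, the sketch below computes the Shannon entropy of a set of class labels and the information gain of a candidate split. The function names and the toy labels are assumptions made for this example; they are not taken from any paper in this Special Issue.

```python
from collections import Counter
from math import log2


def shannon_entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(labels, groups):
    """Entropy reduction obtained by splitting `labels` into `groups`."""
    total = len(labels)
    # Entropy of each child node, weighted by the share of samples it holds.
    weighted = sum(len(g) / total * shannon_entropy(g) for g in groups)
    return shannon_entropy(labels) - weighted


# Hypothetical toy data: four samples, two classes.
labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]  # each child is pure
useless_split = [["yes", "no"], ["yes", "no"]]  # children mirror the parent

print(information_gain(labels, perfect_split))
print(information_gain(labels, useless_split))
```

A tree learner using this criterion would prefer the split with the larger gain, here the one that separates the two classes completely.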

This Special Issue seeks contributions clarifying and strengthening the relationship between these two research fields, with a special focus on, but not limited to, the improvement of data-mining algorithms through the entropy concept, and on the application of entropy in real-world data-mining tasks. We welcome theoretical as well as experimental works, in the form of both original research and review papers.

Dr. Massimiliano Zanin
Dr. Ernestina Menasalvas
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Data mining algorithms
  • Classification
  • Clustering
  • Feature selection
  • Time series analysis
  • Network entropy

Benefits of Publishing in a Special Issue

  • Ease of navigation: Grouping papers by topic helps scholars navigate broad scope journals more efficiently.
  • Greater discoverability: Special Issues support the reach and impact of scientific research. Articles in Special Issues are more discoverable and cited more frequently.
  • Expansion of research network: Special Issues facilitate connections among authors, fostering scientific collaborations.
  • External promotion: Articles in Special Issues are often promoted through the journal's social media, increasing their visibility.
  • e-Book format: Special Issues with more than 10 articles can be published as dedicated e-books, ensuring wide and rapid dissemination.

Further information on MDPI's Special Issue policies can be found here.

Published Papers (10 papers)


Research

22 pages, 2548 KiB  
Article
Comparison of Compression-Based Measures with Application to the Evolution of Primate Genomes
by Diogo Pratas, Raquel M. Silva and Armando J. Pinho
Entropy 2018, 20(6), 393; https://doi.org/10.3390/e20060393 - 23 May 2018
Cited by 7 | Viewed by 5153
Abstract
An efficient DNA compressor furnishes an approximation to measure and compare information quantities present in, between and across DNA sequences, regardless of the characteristics of the sources. In this paper, we compare directly two information measures, the Normalized Compression Distance (NCD) and the Normalized Relative Compression (NRC). These measures answer different questions; the NCD measures how similar both strings are (in terms of information content) and the NRC (which, in general, is nonsymmetric) indicates the fraction of one of them that cannot be constructed using information from the other one. This leads to the problem of finding out which measure (or question) is more suitable for the answer we need. For computing both, we use a state of the art DNA sequence compressor that we benchmark with some top compressors in different compression modes. Then, we apply the compressor on DNA sequences with different scales and natures, first using synthetic sequences and then on real DNA sequences. The last include mitochondrial DNA (mtDNA), messenger RNA (mRNA) and genomic DNA (gDNA) of seven primates. We provide several insights into evolutionary acceleration rates at different scales, namely, the observation and confirmation across the whole genomes of a higher variation rate of the mtDNA relative to the gDNA. We also show the importance of relative compression for localizing similar information regions using mtDNA. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)
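
A minimal sketch of the Normalized Compression Distance mentioned in the abstract above, using the common definition NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is a compressed size. It assumes the general-purpose `zlib` compressor as a stand-in; the paper uses a specialised DNA sequence compressor, which this example does not reproduce, and the sequences below are synthetic.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic sequences (assumptions for illustration only).
rng = random.Random(0)
repetitive = b"ACGT" * 200
unrelated = bytes(rng.choice(b"ACGT") for _ in range(800))

print(ncd(repetitive, repetitive))  # small: the second copy adds little information
print(ncd(repetitive, unrelated))   # larger: the sequences share little structure
```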

9 pages, 3839 KiB  
Article
Remote Sensing Extraction Method of Tailings Ponds in Ultra-Low-Grade Iron Mining Area Based on Spectral Characteristics and Texture Entropy
by Baodong Ma, Yuteng Chen, Song Zhang and Xuexin Li
Entropy 2018, 20(5), 345; https://doi.org/10.3390/e20050345 - 6 May 2018
Cited by 15 | Viewed by 4504
Abstract
With the rapid development of the steel and iron industry, ultra-low-grade iron ore has been developed extensively since the beginning of this century in China. Due to the high concentration ratio of the iron ore, a large amount of tailings was produced and many tailings ponds were established in the mining area. This poses a great threat to regional safety and the environment because of dam breaks and metal pollution. The spatial distribution is the basic information for monitoring the status of tailings ponds. Taking Changhe Mining Area as an example, tailings ponds were extracted by using Landsat 8 OLI images based on both spectral and texture characteristics. Firstly, ultra-low-grade iron-related objects (i.e., tailings and iron ore) were extracted by the Ultra-low-grade Iron-related Objects Index (ULIOI) with a threshold. Secondly, the tailings pond was distinguished from the stope due to their entropy difference in the panchromatic image at a 7 × 7 window size. This remote sensing method could be beneficial to safety and environmental management in the mining area. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)
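
To make the idea of texture entropy in a 7 × 7 window concrete, here is a small sketch that computes the Shannon entropy of the pixel values in a window centred on a given pixel. The function name, the pure-Python image representation (a list of rows) and the two toy images are assumptions for this example; the paper's actual processing of Landsat 8 panchromatic data is not reproduced here.

```python
from collections import Counter
from math import log2


def window_entropy(image, row, col, size=7):
    """Shannon entropy (bits) of pixel values in a size x size window."""
    half = size // 2
    # Collect the in-bounds pixels of the window centred on (row, col).
    values = [
        image[r][c]
        for r in range(row - half, row + half + 1)
        for c in range(col - half, col + half + 1)
        if 0 <= r < len(image) and 0 <= c < len(image[0])
    ]
    total = len(values)
    counts = Counter(values)
    return -sum(n / total * log2(n / total) for n in counts.values())


# Hypothetical 9 x 9 images: a uniform surface and a checkerboard texture.
smooth = [[5] * 9 for _ in range(9)]
rough = [[(r + c) % 2 for c in range(9)] for r in range(9)]

print(window_entropy(smooth, 4, 4))  # uniform window, zero entropy
print(window_entropy(rough, 4, 4))   # two nearly equally frequent values
```

Thresholding such a per-pixel entropy map is one plausible way to separate surfaces with different textures.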

17 pages, 1967 KiB  
Article
KL Divergence-Based Fuzzy Cluster Ensemble for Image Segmentation
by Huiqin Wei, Long Chen and Li Guo
Entropy 2018, 20(4), 273; https://doi.org/10.3390/e20040273 - 12 Apr 2018
Cited by 25 | Viewed by 5703
Abstract
Ensemble clustering combines different basic partitions of a dataset into a more stable and robust one. Thus, cluster ensemble plays a significant role in applications like image segmentation. However, existing ensemble methods have a few demerits, including the lack of diversity of basic partitions and the low accuracy caused by data noise. In this paper, to get over these difficulties, we propose an efficient fuzzy cluster ensemble method based on Kullback–Leibler divergence or simply, the KL divergence. The data are first classified with distinct fuzzy clustering methods. Then, the soft clustering results are aggregated by a fuzzy KL divergence-based objective function. Moreover, for image segmentation problems, we utilize the local spatial information in the cluster ensemble algorithm to suppress the effect of noise. Experiment results reveal that the proposed methods outperform many other methods in synthetic and real image-segmentation problems. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)
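
For readers unfamiliar with the Kullback–Leibler divergence used in the abstract above, the following sketch computes it between two fuzzy membership vectors, i.e., two discrete distributions over the same clusters. The membership values and the small `eps` guard are assumptions for illustration; the paper's ensemble objective function is not reproduced here.

```python
from math import log


def kl_divergence(p, q, eps=1e-12):
    """KL divergence D(p || q) in nats between two discrete distributions."""
    # Terms with p_i == 0 contribute nothing; eps guards against q_i == 0.
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical memberships of one pixel in three clusters,
# as produced by two different fuzzy clustering runs.
run_a = [0.7, 0.2, 0.1]
run_b = [0.4, 0.4, 0.2]

print(kl_divergence(run_a, run_a))  # identical memberships
print(kl_divergence(run_a, run_b))  # disagreement between the two runs
print(kl_divergence(run_b, run_a))  # differs from the line above: KL is not symmetric
```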

19 pages, 24604 KiB  
Article
Multiple Sclerosis Identification Based on Fractional Fourier Entropy and a Modified Jaya Algorithm
by Shui-Hua Wang, Hong Cheng, Preetha Phillips and Yu-Dong Zhang
Entropy 2018, 20(4), 254; https://doi.org/10.3390/e20040254 - 5 Apr 2018
Cited by 36 | Viewed by 6420
Abstract
Aim: Currently, identifying multiple sclerosis (MS) by human experts may come across the problem of “normal-appearing white matter”, which causes a low sensitivity. Methods: In this study, we presented a computer vision based approach to identify MS in an automatic way. This proposed method first extracted the fractional Fourier entropy map from a specified brain image. Afterwards, it sent the features to a multilayer perceptron trained by a proposed improved parameter-free Jaya algorithm. We used cost-sensitivity learning to handle the imbalanced data problem. Results: The 10 × 10-fold cross validation showed our method yielded a sensitivity of 97.40 ± 0.60%, a specificity of 97.39 ± 0.65%, and an accuracy of 97.39 ± 0.59%. Conclusions: We validated by experiments that the proposed improved Jaya performs better than plain Jaya algorithm and other latest bioinspired algorithms in terms of classification performance and training speed. In addition, our method is superior to four state-of-the-art MS identification approaches. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)

14 pages, 2997 KiB  
Article
Multi-Graph Multi-Label Learning Based on Entropy
by Zixuan Zhu and Yuhai Zhao
Entropy 2018, 20(4), 245; https://doi.org/10.3390/e20040245 - 2 Apr 2018
Cited by 8 | Viewed by 4805
Abstract
Recently, Multi-Graph Learning was proposed as the extension of Multi-Instance Learning and has achieved some successes. However, to the best of our knowledge, currently, there is no study working on Multi-Graph Multi-Label Learning, where each object is represented as a bag containing a number of graphs and each bag is marked with multiple class labels. It is an interesting problem existing in many applications, such as image classification, medicinal analysis and so on. In this paper, we propose an innovative algorithm to address the problem. Firstly, it uses more precise structures, multiple Graphs, instead of Instances to represent an image so that the classification accuracy could be improved. Then, it uses multiple labels as the output to eliminate the semantic ambiguity of the image. Furthermore, it calculates the entropy to mine the informative subgraphs instead of just mining the frequent subgraphs, which enables selecting the more accurate features for the classification. Lastly, since the current algorithms cannot directly deal with graph-structures, we degenerate the Multi-Graph Multi-Label Learning into the Multi-Instance Multi-Label Learning in order to solve it by MIML-ELM (Improving Multi-Instance Multi-Label Learning by Extreme Learning Machine). The performance study shows that our algorithm outperforms the competitors in terms of both effectiveness and efficiency. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)

20 pages, 1789 KiB  
Article
Deconstructing Cross-Entropy for Probabilistic Binary Classifiers
by Daniel Ramos, Javier Franco-Pedroso, Alicia Lozano-Diez and Joaquin Gonzalez-Rodriguez
Entropy 2018, 20(3), 208; https://doi.org/10.3390/e20030208 - 20 Mar 2018
Cited by 82 | Viewed by 11396
Abstract
In this work, we analyze the cross-entropy function, widely used in classifiers both as a performance measure and as an optimization objective. We contextualize cross-entropy in the light of Bayesian decision theory, the formal probabilistic framework for making decisions, and we thoroughly analyze its motivation, meaning and interpretation from an information-theoretical point of view. In this sense, this article presents several contributions: First, we explicitly analyze the contribution to cross-entropy of (i) prior knowledge; and (ii) the value of the features in the form of a likelihood ratio. Second, we introduce a decomposition of cross-entropy into two components: discrimination and calibration. This decomposition enables the measurement of different performance aspects of a classifier in a more precise way; and justifies previously reported strategies to obtain reliable probabilities by means of the calibration of the output of a discriminating classifier. Third, we give different information-theoretical interpretations of cross-entropy, which can be useful in different application scenarios, and which are related to the concept of reference probabilities. Fourth, we present an analysis tool, the Empirical Cross-Entropy (ECE) plot, a compact representation of cross-entropy and its aforementioned decomposition. We show the power of ECE plots, as compared to other classical performance representations, in two diverse experimental examples: a speaker verification system, and a forensic case where some glass findings are present. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)
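
As a simple illustration of cross-entropy as a performance measure for a probabilistic binary classifier, the sketch below computes the mean cross-entropy in bits and compares it with a reference classifier that always outputs the class prior. The function name and the toy probabilities are assumptions for this example; the discrimination/calibration decomposition and the ECE plot described in the abstract are not implemented here.

```python
from math import log2


def cross_entropy(probs, labels):
    """Mean binary cross-entropy in bits; `probs` are estimates of P(label == 1)."""
    eps = 1e-12
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log2(0)
        total -= log2(p) if y == 1 else log2(1 - p)
    return total / len(labels)


# Hypothetical ground truth and classifier outputs.
labels = [1, 1, 0, 0]
prior = sum(labels) / len(labels)

baseline = cross_entropy([prior] * len(labels), labels)         # prior-only reference
confident_right = cross_entropy([0.9, 0.8, 0.2, 0.1], labels)   # informative outputs
confident_wrong = cross_entropy([0.1, 0.2, 0.8, 0.9], labels)   # misleading outputs

print(baseline, confident_right, confident_wrong)
```

A classifier whose cross-entropy exceeds that of the prior-only reference is, by this measure, worse than not looking at the features at all.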

18 pages, 3944 KiB  
Article
Applying Time-Dependent Attributes to Represent Demand in Road Mass Transit Systems
by Teresa Cristóbal, Gabino Padrón, Javier Lorenzo-Navarro, Alexis Quesada-Arencibia and Carmelo R. García
Entropy 2018, 20(2), 133; https://doi.org/10.3390/e20020133 - 20 Feb 2018
Cited by 2 | Viewed by 4182
Abstract
The development of efficient mass transit systems that provide quality of service is a major challenge for modern societies. To meet this challenge, it is essential to understand user demand. This article proposes using new time-dependent attributes to represent demand, attributes that differ from those that have traditionally been used in the design and planning of this type of transit system. Data mining was used to obtain these new attributes; they were created using clustering techniques, and their quality evaluated with the Shannon entropy function and with neural networks. The methodology was implemented on an intercity public transport company and the results demonstrate that the attributes obtained offer a more precise understanding of demand and enable predictions to be made with acceptable precision. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)

19 pages, 1713 KiB  
Article
Patent Keyword Extraction Algorithm Based on Distributed Representation for Patent Classification
by Jie Hu, Shaobo Li, Yong Yao, Liya Yu, Guanci Yang and Jianjun Hu
Entropy 2018, 20(2), 104; https://doi.org/10.3390/e20020104 - 2 Feb 2018
Cited by 81 | Viewed by 13969
Abstract
Many text mining tasks such as text retrieval, text summarization, and text comparisons depend on the extraction of representative keywords from the main text. Most existing keyword extraction algorithms are based on discrete bag-of-words type of word representation of the text. In this paper, we propose a patent keyword extraction algorithm (PKEA) based on the distributed Skip-gram model for patent classification. We also develop a set of quantitative performance measures for keyword extraction evaluation based on information gain and cross-validation, based on Support Vector Machine (SVM) classification, which are valuable when human-annotated keywords are not available. We used a standard benchmark dataset and a homemade patent dataset to evaluate the performance of PKEA. Our patent dataset includes 2500 patents from five distinct technological fields related to autonomous cars (GPS systems, lidar systems, object recognition systems, radar systems, and vehicle control systems). We compared our method with Frequency, Term Frequency-Inverse Document Frequency (TF-IDF), TextRank and Rapid Automatic Keyword Extraction (RAKE). The experimental results show that our proposed algorithm provides a promising way to extract keywords from patent texts for patent classification. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)

15 pages, 3113 KiB  
Article
Using Entropy in Web Usage Data Preprocessing
by Michal Munk and Lubomir Benko
Entropy 2018, 20(1), 67; https://doi.org/10.3390/e20010067 - 22 Jan 2018
Cited by 5 | Viewed by 5255
Abstract
The paper is focused on an examination of the use of entropy in the field of web usage mining. Entropy creates an alternative possibility of determining the ratio of auxiliary pages in the session identification using the Reference Length method. The experiment was conducted on two different web portals. The first log file was obtained from a course of virtual learning environment web portal. The second log file was received from the web portal with anonymous access. A comparison of the results of entropy estimation of the ratio of auxiliary pages and a sitemap estimation of the ratio of auxiliary pages showed that in the case of sitemap abundance, entropy could be a full-valued substitution for the estimate of the ratio of auxiliary pages. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)

793 KiB  
Article
Cross Entropy Method Based Hybridization of Dynamic Group Optimization Algorithm
by Rui Tang, Simon Fong, Nilanjan Dey, Raymond K. Wong and Sabah Mohammed
Entropy 2017, 19(10), 533; https://doi.org/10.3390/e19100533 - 9 Oct 2017
Cited by 12 | Viewed by 4703
Abstract
Recently, a new algorithm named dynamic group optimization (DGO) has been proposed, which lends itself strongly to exploration and exploitation. Although DGO has demonstrated its efficacy in comparison to other classical optimization algorithms, DGO has two computational drawbacks. The first one is related to the two mutation operators of DGO, where they may decrease the diversity of the population, limiting the search ability. The second one is the homogeneity of the updated population information which is selected only from the companions in the same group. It may result in premature convergence and deteriorate the mutation operators. In order to deal with these two problems in this paper, a new hybridized algorithm is proposed, which combines the dynamic group optimization algorithm with the cross entropy method. The cross entropy method takes advantage of sampling the problem space by generating candidate solutions using the distribution, then it updates the distribution based on the better candidate solution discovered. The cross entropy operator does not only enlarge the promising search area, but it also guarantees that the new solution is taken from all the surrounding useful information into consideration. The proposed algorithm is tested on 23 up-to-date benchmark functions; the experimental results verify that the proposed algorithm over the other contemporary population-based swarming algorithms is more effective and efficient. Full article
(This article belongs to the Special Issue Entropy-based Data Mining)
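
The sample-then-update loop of the cross entropy method described in the abstract above can be sketched as follows for a one-dimensional minimisation problem. This is a generic, simplified version with a Gaussian sampling distribution; all parameter names and default values are assumptions, and it is not the hybrid DGO algorithm proposed in the paper.

```python
import random
import statistics


def cross_entropy_method(objective, mean=0.0, std=5.0, samples=50,
                         elites=10, iterations=30, seed=0):
    """Minimise `objective` over the reals with a basic cross entropy method."""
    rng = random.Random(seed)
    for _ in range(iterations):
        # 1. Sample candidate solutions from the current distribution.
        candidates = [rng.gauss(mean, std) for _ in range(samples)]
        # 2. Keep the best-scoring candidates (the elite set).
        candidates.sort(key=objective)
        best = candidates[:elites]
        # 3. Refit the distribution to the elite set.
        mean = statistics.mean(best)
        std = statistics.pstdev(best) + 1e-6  # small floor keeps sampling valid
    return mean


# Hypothetical objective with its minimum at x = 3.
estimate = cross_entropy_method(lambda x: (x - 3.0) ** 2)
print(estimate)
```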