
Big Data

A special issue of Entropy (ISSN 1099-4300).

Deadline for manuscript submissions: closed (30 October 2013) | Viewed by 34591

Special Issue Editor


Dr. Nikunj C. Oza
Guest Editor
NASA Ames Research Center, NASA, Moffett Field, CA 94035, USA
Interests: data mining; machine learning; ensemble learning methods; online learning; anomaly detection; applications of machine learning and data mining

Special Issue Information

Dear Colleagues,

"Big data" refers to datasets that are so large that conventional database management and data analysis tools are insufficient to work with them. Big data has become a bigger-than-ever problem for many reasons. Data storage is rapidly becoming cheaper in terms of cost per unit of storage, thereby making appealing the prospect of saving all collected data. Computer processing is becoming more powerful and cheaper, and computer memory is also becoming cheaper, thereby making processing such data increasingly practical. The number of deployed sensors is growing rapidly. For example, there are a greater number of Earth-Observing Satellites than ever before, collecting many terabytes of data per day. Engineered systems have increasing sensing of their environment as well as of the systems themselves for integrated vehicle health management. The internet has greatly added to the volume and heterogeneity of data available---the world-wide web contains an enormous volume of text, images, videos, and connections between these. Many complex processes that we desire to understand generate these data. We desire methods that go in the reverse direction---from big data to an understanding of these complex processes---how they work, when and how they display anomalous behavior, and other insights. Data mining is a field---brought about through the combination of machine learning, statistics, and database management---that seeks to develop such methods. This special issue seeks comprehensive reviews or research articles in the area of entropy and information theory methods for big data. Research articles may describe theoretical and/or algorithmic developments.

Dr. Nikunj C. Oza
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.


Keywords

  • big data
  • analytics
  • data mining
  • predictive analytics
  • knowledge discovery
  • classification
  • regression
  • anomaly detection
  • clustering

Published Papers (5 papers)


Research

Article
Fast Feature Selection in a GPU Cluster Using the Delta Test
by Alberto Guillén, M. Isabel García Arenas, Mark Van Heeswijk, Dusan Sovilj, Amaury Lendasse, Luis Javier Herrera, Héctor Pomares and Ignacio Rojas
Entropy 2014, 16(2), 854-869; https://doi.org/10.3390/e16020854 - 13 Feb 2014
Cited by 11 | Viewed by 7557
Abstract
Feature or variable selection remains an unsolved problem, because evaluating the entire solution space is infeasible. Several heuristic algorithms have been proposed so far with successful results. However, these algorithms were not designed for very large datasets, where memory and time limitations make their execution impossible. This paper presents an implementation of a genetic algorithm (GA) that is parallelized using the classical island approach and that also uses graphics processing units to speed up the computation of the fitness function. Special attention is paid to the population evaluation and to the migration operator of the parallel GA, which is not usually considered significant although, as the experiments show, it is crucial for obtaining robust results. Full article
(This article belongs to the Special Issue Big Data)
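The fitness function optimized by such a GA is the Delta Test, a nearest-neighbor estimate of the noise variance for a candidate feature subset. The sketch below illustrates that fitness computation only, assuming NumPy and scikit-learn; the variable names and the toy data are illustrative and are not taken from the paper's GPU implementation.

```python
# Minimal sketch of the Delta Test as a feature-selection fitness.
# Assumes NumPy and scikit-learn; `X`, `y` and `mask` are illustrative names.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def delta_test(X, y, mask):
    """Noise-variance estimate for the feature subset selected by `mask`."""
    Xs = X[:, mask]                          # keep only the selected features
    nn = NearestNeighbors(n_neighbors=2).fit(Xs)
    _, idx = nn.kneighbors(Xs)               # idx[:, 0] is each point itself
    neighbor = idx[:, 1]                     # nearest distinct neighbor
    return 0.5 * np.mean((y[neighbor] - y) ** 2)

# A GA would evaluate many binary masks; a lower Delta Test means a better subset.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)   # only feature 0 is relevant
print(delta_test(X, y, np.array([True] + [False] * 9)))
print(delta_test(X, y, np.ones(10, dtype=bool)))
```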

Article
Information-Theoretic Data Discarding for Dynamic Trees on Data Streams
by Christoforos Anagnostopoulos and Robert B. Gramacy
Entropy 2013, 15(12), 5510-5535; https://doi.org/10.3390/e15125510 - 13 Dec 2013
Cited by 7 | Viewed by 5387
Abstract
Ubiquitous automated data collection at an unprecedented scale is making available streaming, real-time information flows in a wide variety of settings, transforming both science and industry. Learning algorithms deployed in such contexts often rely on single-pass inference, where the data history is never revisited. Learning may also need to be temporally adaptive to remain up-to-date against unforeseen changes in the data generating mechanism. Online Bayesian inference remains challenged by such transient, evolving data streams. Nonparametric modeling techniques can prove particularly ill-suited, as the complexity of the model is allowed to increase with the sample size. In this work, we take steps to overcome these challenges by porting information theoretic heuristics, such as exponential forgetting and active learning, into a fully Bayesian framework. We showcase our methods by augmenting a modern non-parametric modeling framework, dynamic trees, and illustrate its performance on a number of practical examples. The end product is a powerful streaming regression and classification tool, whose performance compares favorably to the state-of-the-art. Full article
(This article belongs to the Special Issue Big Data)
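One of the information-theoretic heuristics mentioned above, exponential forgetting, downweights old observations so that a single-pass learner can remain adaptive to changes in the data-generating mechanism. The sketch below shows the idea on a deliberately simple running-mean model rather than on dynamic trees; the forgetting factor `lam` and the simulated change point are assumptions made purely for illustration.

```python
# Minimal sketch of exponential forgetting on a data stream (not the paper's
# dynamic-tree model). Assumes NumPy; `lam` is an illustrative forgetting factor.
import numpy as np

def streaming_mean_with_forgetting(stream, lam=0.97):
    """Single-pass estimate that discounts old observations by `lam` per step."""
    weight, weighted_sum = 0.0, 0.0
    estimates = []
    for x in stream:
        weight = lam * weight + 1.0          # effective sample size, bounded by 1/(1 - lam)
        weighted_sum = lam * weighted_sum + x
        estimates.append(weighted_sum / weight)
    return np.array(estimates)

rng = np.random.default_rng(1)
# The data-generating mean jumps at t = 500; forgetting lets the estimate track it.
stream = np.r_[rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500)]
est = streaming_mean_with_forgetting(stream)
print(est[495:500], est[995:1000])
```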

Article
Stochasticity: A Feature for the Structuring of Large and Heterogeneous Image Databases
by Abdourrahmane M. Atto, Yannick Berthoumieu and Rémi Mégret
Entropy 2013, 15(11), 4782-4801; https://doi.org/10.3390/e15114782 - 04 Nov 2013
Cited by 5 | Viewed by 5409
Abstract
The paper addresses image feature characterization and the structuring of large and heterogeneous image databases through the appearance of stochasticity, or randomness. Measuring stochasticity involves finding suitable representations that can significantly reduce statistical dependencies of any order. Wavelet packet representations provide such a framework for a large class of stochastic processes through an appropriate dictionary of parametric models. From this dictionary and the Kolmogorov stochasticity index, the paper proposes semantic stochasticity templates over wavelet packet sub-bands in order to provide high-level classification and content-based image retrieval. The approach is shown to be relevant for texture images. Full article
(This article belongs to the Special Issue Big Data)
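The Kolmogorov stochasticity index measures the scaled maximum discrepancy between the empirical distribution of a sample and a reference model, here evaluated per wavelet sub-band. The sketch below is a rough illustration assuming PyWavelets and SciPy: it uses a plain wavelet decomposition (not the paper's wavelet packet dictionary) and a fitted Gaussian as the reference model, both of which are assumptions made for the example.

```python
# Rough sketch of a Kolmogorov-style stochasticity index on wavelet sub-bands.
# Assumes PyWavelets and SciPy; the wavelet 'db2' and the Gaussian reference
# model are illustrative choices, not the paper's parametric dictionary.
import numpy as np
import pywt
from scipy.stats import norm, kstest

def stochasticity_index(image, wavelet="db2", level=2):
    """Return sqrt(n) * sup |F_n - F| for each detail sub-band."""
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    indices = {}
    for lvl, details in enumerate(coeffs[1:], start=1):
        for name, band in zip(("H", "V", "D"), details):
            c = band.ravel()
            # Compare the empirical CDF of the coefficients with a fitted Gaussian.
            stat, _ = kstest(c, norm(loc=c.mean(), scale=c.std()).cdf)
            indices[f"level{lvl}_{name}"] = np.sqrt(c.size) * stat
    return indices

rng = np.random.default_rng(2)
print(stochasticity_index(rng.normal(size=(64, 64))))
```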

Article
Kernel Spectral Clustering for Big Data Networks
by Raghvendra Mall, Rocco Langone and Johan A.K. Suykens
Entropy 2013, 15(5), 1567-1586; https://doi.org/10.3390/e15051567 - 03 May 2013
Cited by 64 | Viewed by 7554
Abstract
This paper shows the feasibility of utilizing the Kernel Spectral Clustering (KSC) method for community detection in big data networks. KSC employs a primal-dual framework to construct a model, which yields the powerful property of effectively inferring community affiliations for out-of-sample extensions. The original large kernel matrix cannot fit into memory. Therefore, we select a smaller subgraph that preserves the overall community structure to construct the model, and we use the out-of-sample extension property to obtain community memberships for the unseen nodes. We provide a novel memory- and computationally efficient model selection procedure based on angular similarity in the eigenspace. We demonstrate the effectiveness of KSC on large-scale synthetic networks and real-world networks such as the YouTube network, a road network of California and the LiveJournal network. These networks contain millions of nodes and several million edges. Full article
(This article belongs to the Special Issue Big Data)
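The core idea, building the clustering model on a small representative subset and extending cluster assignments to unseen points, can be illustrated with a generic subsample-then-extend spectral clustering sketch. The code below assumes scikit-learn and SciPy, an RBF kernel, and a Nystroem-style projection with cosine (angular) similarity for the out-of-sample step; it is only an illustration of that idea, not the authors' primal-dual KSC formulation.

```python
# Sketch of subsample-then-extend spectral clustering with angular assignment.
# Assumes NumPy, SciPy and scikit-learn; `gamma` and all names are illustrative.
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def fit_subsample(X_sub, k, gamma=0.5):
    """Cluster a small representative subset in the kernel eigenspace."""
    K = rbf_kernel(X_sub, X_sub, gamma=gamma)
    vals, vecs = eigh(K)                        # eigenvalues in ascending order
    embed = vecs[:, -k:]                        # top-k eigenvectors as embedding
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(embed)
    # Cluster prototypes in the eigenspace, used later for angular assignment.
    centroids = np.vstack([embed[labels == c].mean(axis=0) for c in range(k)])
    return vals[-k:], vecs[:, -k:], centroids

def extend(X_new, X_sub, eigvals, eigvecs, centroids, gamma=0.5):
    """Assign unseen points by projecting them into the same eigenspace."""
    K_new = rbf_kernel(X_new, X_sub, gamma=gamma)
    embed_new = K_new @ eigvecs / eigvals       # Nystroem-style projection
    sims = (embed_new @ centroids.T) / (
        np.linalg.norm(embed_new, axis=1, keepdims=True)
        * np.linalg.norm(centroids, axis=1) + 1e-12)
    return sims.argmax(axis=1)                  # most similar prototype wins
```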

Article
Discretization Based on Entropy and Multiple Scanning
by Jerzy W. Grzymala-Busse
Entropy 2013, 15(5), 1486-1502; https://doi.org/10.3390/e15051486 - 25 Apr 2013
Cited by 31 | Viewed by 7946
Abstract
In this paper, we present an entropy-driven methodology for discretization. Recently, the original entropy-based discretization was enhanced with two options for selecting the best numerical attribute. In the first option, Dominant Attribute, the attribute with the smallest conditional entropy of the concept given the attribute is selected for discretization, and then the best cut point is determined. In the second option, Multiple Scanning, all attributes are scanned a number of times, and the best cut points are selected for all attributes at the same time. We present the results of experiments on 17 benchmark data sets, including large data sets with 175 attributes or 25,931 cases. For comparison, results on the same data sets using the global versions of the well-known Equal Interval Width and Equal Frequency per Interval discretization methods are also included; the entropy-driven technique enhanced both methods by converting them into globalized methods. Our experiments show that the Multiple Scanning methodology is significantly better than both Dominant Attribute and the better of the globalized Equal Interval Width and Equal Frequency per Interval methods (two-tailed test, 0.01 significance level). Full article
(This article belongs to the Special Issue Big Data)
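Both options rest on the same elementary step: choosing, for a numerical attribute, the cut point that minimizes the conditional entropy of the class given the resulting split. A minimal sketch of that step is shown below, assuming NumPy; the toy attribute and class arrays are illustrative, and the paper's stopping criterion and multiple-scanning loop are omitted.

```python
# Minimal sketch of entropy-driven cut-point selection for one attribute.
# Assumes NumPy; `values` and `labels` are illustrative toy data.
import numpy as np

def entropy(labels):
    """Shannon entropy (bits) of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_cut_point(values, labels):
    """Return the cut point with the smallest class entropy after the split."""
    order = np.argsort(values)
    v, y = values[order], labels[order]
    candidates = (v[:-1] + v[1:]) / 2           # midpoints between sorted values
    best, best_h = None, np.inf
    for c in np.unique(candidates):
        left, right = y[v <= c], y[v > c]
        # Conditional entropy of the class given the binary split at c.
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h < best_h:
            best, best_h = c, h
    return best, best_h

values = np.array([1.0, 1.2, 2.9, 3.1, 3.3, 5.0])
labels = np.array(["a", "a", "b", "b", "b", "a"])
print(best_cut_point(values, labels))
```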