Data, Volume 5, Issue 1 (March 2020) – 27 articles

Cover Story: This is a data descriptor paper for a dataset of GNSS signals collected via roof antennas and a Spectracom simulator. The provided dataset is multipurpose and can serve in studies such as GNSS time-frequency characterization, signal acquisition/tracking, and radio frequency fingerprinting (RFF) for transmitter-type identification. Several transmitter types are present in the collected data, each having its own specific features or fingerprints due to power amplifier nonlinearities, phase noise, or I/Q imbalances. Examples are given in this paper of achievable RFF classification accuracy for up to six of the collected signal classes. RFF can be a promising method to identify GNSS transmitters and can find future applicability in anti-spoofing and anti-jamming solutions.
5 pages, 223 KiB  
Data Descriptor
Data on Orientation to Happiness in Higher Education Institutions from Mexico and El Salvador
by Domingo Villavicencio-Aguilar, Edgardo René Chacón-Andrade and Maria Fernanda Durón-Ramos
Data 2020, 5(1), 27; https://doi.org/10.3390/data5010027 - 24 Mar 2020
Cited by 1 | Viewed by 2475
Abstract
Happiness-oriented people are vital in every society. Orientation to happiness is a construct formed by three different types of happiness (pleasure, meaning, and engagement), and it is considered an indicator of mental health. This study aims to provide data on the levels of orientation to happiness in higher-education teachers and students. The present paper contains data about the perception of this positive aspect in two Latin American countries, Mexico and El Salvador. Structured instruments to measure the orientation to happiness were administered to 397 teachers and 260 students. This data descriptor presents descriptive statistics (mean, standard deviation), internal consistency (Cronbach’s alpha), and differences (Student’s t-test) by country, population (teacher/student), and gender for orientation to happiness and its three dimensions: meaning, pleasure, and engagement. Stepwise multiple regression analysis results are also presented. Results indicated that participants from both countries reported medium–high levels of meaning and engagement happiness; teachers reported higher levels than students in these two dimensions. Happiness resulting from pleasure activities was the least reported in general. Males and females presented very similar levels of orientation to happiness. Only the population (teacher/student) showed a predictive relationship with orientation to happiness; however, the model explained a small portion of variance in this variable, which indicated that other factors are more critical when promoting orientation to happiness in higher-education institutions.
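The internal-consistency statistic named in the abstract above, Cronbach’s alpha, can be illustrated with a minimal sketch. The function name `cronbach_alpha` and the small response matrix are assumptions made here for illustration; they are not taken from the paper or its dataset. The sketch uses the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of total scores), with sample variances.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores.

    Hypothetical helper for illustration; not the authors' code.
    """
    k = len(responses[0])  # number of questionnaire items
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_variances = [variance(column) for column in zip(*responses)]
    # Sample variance of each respondent's total score.
    # Note: a zero total variance would raise ZeroDivisionError below.
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Made-up scores for four respondents on three items (not real survey data).
sample = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]]
alpha = cronbach_alpha(sample)
```

Values closer to 1 indicate that the items vary together; when every item gives the same score for each respondent, the formula yields exactly 1.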
18 pages, 501 KiB  
Article
Trend Analysis on Adoption of Virtual and Augmented Reality in the Architecture, Engineering, and Construction Industry
by Mojtaba Noghabaei, Arsalan Heydarian, Vahid Balali and Kevin Han
Data 2020, 5(1), 26; https://doi.org/10.3390/data5010026 - 13 Mar 2020
Cited by 115 | Viewed by 19813
Abstract
With advances in Building Information Modeling (BIM), Virtual Reality (VR) and Augmented Reality (AR) technologies have many potential applications in the Architecture, Engineering, and Construction (AEC) industry. However, the AEC industry, relative to other industries, has been slow in adopting AR/VR technologies, partly due to a lack of feasibility studies examining the actual cost of implementation versus the increase in profit. The main objectives of this paper are to understand the industry trends in adopting AR/VR technologies and to identify gaps within the industry. The identified gaps can lead to opportunities for developing new tools and finding new use cases. To achieve these goals, two rounds of a survey were conducted a year apart. Responses from 158 industry experts and researchers were analyzed to assess the current state, growth, and saving opportunities for AR/VR technologies in the AEC industry. The findings demonstrate that older generations are significantly more confident about the future of AR/VR technologies and see more benefits in AR/VR utilization. Furthermore, the results indicate that the residential and commercial sectors have adopted these tools the most compared to other sectors, and that the institutional and transportation sectors had the highest growth from 2017 to 2018. Industry experts anticipated solid growth in the use of AR/VR technologies in 5 to 10 years, with the highest expectations towards healthcare. Ultimately, the findings show a significant increase in AR/VR utilization in the AEC industry from 2017 to 2018.
16 pages, 3464 KiB  
Article
Big Data Usage in European Countries: Cluster Analysis Approach
by Mirjana Pejić Bach, Tine Bertoncel, Maja Meško, Dalia Suša Vugec and Lucija Ivančić
Data 2020, 5(1), 25; https://doi.org/10.3390/data5010025 - 12 Mar 2020
Cited by 11 | Viewed by 4367
Abstract
The goal of this research was to investigate the level of digital divide among selected European countries according to the big data usage among their enterprises. For that purpose, we applied K-means clustering to Eurostat data on big data usage in European enterprises. The results indicate that there is a significant difference between the selected European countries in the overall usage of big data in their enterprises. Moreover, the enterprises that use internal experts also used diverse big data sources. Since the usage of diverse big data sources allows enterprises to gather more relevant information about their customers and competitors, this indicates that enterprises with stronger internal big data expertise also have a better chance of building strong competitiveness based on big data utilization. Finally, substantial differences among industries were found in the level of big data usage.
(This article belongs to the Special Issue Challenges in Business Intelligence)
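The K-means clustering mentioned in the abstract above can be sketched in a few lines. This is a plain Lloyd-style iteration written for illustration only: the function names, the starting centroids, and the four two-dimensional points are assumptions, not the Eurostat indicators or the authors' implementation.

```python
import math


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
        for p in points
    ]


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        updated = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            # Keep the old centroid if a cluster ends up empty.
            updated.append(
                tuple(sum(col) / len(members) for col in zip(*members))
                if members else old
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(points, centroids)


# Made-up two-indicator profiles for four hypothetical countries.
points = [(1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers, labels = kmeans(points, centroids=[(1.0, 2.0), (9.0, 8.0)])
```

Starting centroids are passed in explicitly so the run is deterministic; in practice the result of K-means depends on the initialization and on the chosen number of clusters.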
8 pages, 1648 KiB  
Data Descriptor
Data on Creative Industries Ventures’ Performance Influenced by Four Networking Types: Designing Strategies for a Sample of Female Entrepreneurs with the Use of Multiple Criteria Analysis
by Naoum Mylonas, Panagiotis Manolitzas and E. Grigoroudis
Data 2020, 5(1), 24; https://doi.org/10.3390/data5010024 - 11 Mar 2020
Cited by 1 | Viewed by 2215
Abstract
This paper presents data on the performance of creative industries ventures as affected by four different types of networking: social, professional, family, and networking with public sector organizations. Three hundred and seventy-one questionnaires were collected to assess the impact of networking on venture performance. To examine the venture performance levels of female entrepreneurs or the self-employed in the creative industries of Greece, we use a multiple criteria method. Based on the data analysis, the most important criterion for female entrepreneurs in the creative industries to perform highly is professional networking, while the least important is family networking.
6 pages, 469 KiB  
Data Descriptor
Introducing the Facility List Coder: A New Dataset/Method to Evaluate Community Food Environments
by Ana María Arcila-Agudelo, Juan Carlos Muñoz-Mora and Andreu Farran-Codina
Data 2020, 5(1), 23; https://doi.org/10.3390/data5010023 - 10 Mar 2020
Viewed by 2134
Abstract
Community food environments have been shown to be important determinants of dietary patterns. This data descriptor describes a typical dataset obtained after applying the Facility List Coder (FLC), a new tool to assess community food environments that was validated and presented. The FLC was developed in Python 3.7, combining GIS analysis with standard data techniques. It offers a low-cost, scalable, efficient, and user-friendly way to indirectly identify community nutritional environments in any context. The FLC uses the most open access information to identify the facilities (e.g., convenience food store, bar, bakery, etc.) present around a location of interest (e.g., school, hospital, or university). As a result, researchers will have a comprehensive list of facilities around any location of interest, allowing the assessment of key research questions on the influence of the community food environment on different health outcomes (e.g., obesity, physical inactivity, or diet quality). The FLC can be used either as a main source of information or to complement traditional methods such as store census and official commercial lists, among others.
6 pages, 554 KiB  
Data Descriptor
Dataset of Targeted Metabolite Analysis for Five Taxanes of Hellenic Taxus baccata L. Populations
by Eleftheria Dalmaris, Evangelia V. Avramidou, Aliki Xanthopoulou and Filippos A. Aravanopoulos
Data 2020, 5(1), 22; https://doi.org/10.3390/data5010022 - 06 Mar 2020
Cited by 6 | Viewed by 2089
Abstract
Novel primary sources of one of the world’s leading anticancer agents, paclitaxel, as well as of other antineoplastic taxanes such as 10-deacetylbaccatin III, are needed to meet an increasing demand. Among the Taxus species, the promise of Taxus baccata L. (European or English yew) has been documented. In this study, the metabolite profiles of two marginal T. baccata populations in Greece (Mt. Cholomon and Mt. Olympus), located at the southeastern edge of the species’ natural distribution, are explored. A targeted liquid chromatography-mass spectrometry (LC-MS/MS) analysis was used to determine the content of 10-deacetylbaccatin III, baccatin III, 10-deacetyltaxol, paclitaxel, and cephalomannine in the needles of each population across three sampling periods (spring, summer, and winter). This is the first survey to generate a taxane-targeted metabolite data set from Hellenic natural populations, which have not been explored before. Furthermore, it used an extensive sampling design in order to evaluate chemodiversity at the population level. The analysis revealed significant levels of chemodiversity within and among the investigated populations, as well as significant seasonal variation, which could be exploited to select superior germplasm native to Greece for yew plantations and for the further exploitation needed to produce important taxanes.
22 pages, 7495 KiB  
Article
Processing on Structural Data Faultage in Data Fusion
by Fan Chen, Ruoqi Hu, Jiaoxiong Xia and Jie Tao
Data 2020, 5(1), 21; https://doi.org/10.3390/data5010021 - 06 Mar 2020
Cited by 1 | Viewed by 1981
Abstract
With the rapid development of information technology, the growth of information management systems has led to the generation of heterogeneous data. The process of data fusion inevitably leads to problems such as missing data, data conflicts, and data inconsistency. We offer a new perspective that borrows from geological theory and classifies such data errors as structural data faultage. Structural data faultages after data integration often lead to inconsistent data resources and inaccurate data information. To solve such problems, this article starts from the attributes of data and proposes a new solution for processing structural data faultages based on attribute similarity. We use the similarity relation to define three new operations: Attribute cementation, Attribute addition, and Isomorphous homonuclear. Isomorphous homonuclear uses a digraph to combine attributes. These three operations are mainly used to handle multiple data errors caused by data faultages, so that data redundancy is reduced and the consistency of data after integration is ensured. In this way, the structural data faultage in data fusion can be eliminated. The experiment uses doctoral dissertation data from Shanghai University. Three types of dissertation data tables are fused, and the structural data faultages after fusion are processed with the proposed method. Through statistical analysis of the experimental results and comparison with the existing algorithm, we verify the validity and accuracy of this method for processing structural data faultages.
25 pages, 5365 KiB  
Article
VARTTA: A Visual Analytics System for Making Sense of Real-Time Twitter Data
by Amir Haghighati and Kamran Sedig
Data 2020, 5(1), 20; https://doi.org/10.3390/data5010020 - 19 Feb 2020
Cited by 3 | Viewed by 4111
Abstract
Through social media platforms, massive amounts of data are being produced. As a microblogging social media platform, Twitter enables its users to post short updates as “tweets” on an unprecedented scale. Once analyzed using machine learning (ML) techniques and in aggregate, Twitter data can be an invaluable resource for gaining insight into different domains of discussion and public opinion. However, when applied to real-time data streams, existing ML approaches result in different types of biases and provide uncertain outputs because of covariate shifts in the data (i.e., changes in the distributions of the inputs of ML algorithms). In this paper, we describe VARTTA (Visual Analytics for Real-Time Twitter datA), a visual analytics system that combines data visualizations, human-data interaction, and ML algorithms to help users monitor, analyze, and make sense of streams of tweets in real time. As a case study, we demonstrate the use of VARTTA in political discussions. VARTTA not only provides users with powerful analytical tools, but also enables them to diagnose and to heuristically suggest fixes for the errors in the outcome, resulting in a more detailed understanding of the tweets. Finally, we outline several issues to be considered while designing other similar visual analytics systems.
14 pages, 5473 KiB  
Data Descriptor
A Trillion Coral Reef Colors: Deeply Annotated Underwater Hyperspectral Images for Automated Classification and Habitat Mapping
by Ahmad Rafiuddin Rashid and Arjun Chennu
Data 2020, 5(1), 19; https://doi.org/10.3390/data5010019 - 18 Feb 2020
Cited by 11 | Viewed by 4773
Abstract
This paper describes a large dataset of underwater hyperspectral imagery that can be used by researchers in the domains of computer vision, machine learning, remote sensing, and coral reef ecology. We present the details of underwater data acquisition, processing, and curation to create this large dataset of coral reef imagery annotated for habitat mapping. A diver-operated hyperspectral imaging system (HyperDiver) was used to survey 147 transects at 8 coral reef sites around the Caribbean island of Curaçao. The underwater proximal sensing approach produced fine-scale images of the seafloor, with more than 2.2 billion points of detailed optical spectra. Of these, more than 10 million data points have been annotated for habitat descriptors or taxonomic identity, with a total of 47 class labels down to genus and species level. In addition to HyperDiver survey data, we also include images and annotations from traditional (color photo) quadrat surveys conducted along 23 of the 147 transects, which enables comparative reef description between two types of reef survey methods. This dataset promises benefits for efforts in classification algorithms, hyperspectral image segmentation, and automated habitat mapping.
13 pages, 9194 KiB  
Data Descriptor
Identifying GNSS Signals Based on Their Radio Frequency (RF) Features—A Dataset with GNSS Raw Signals Based on Roof Antennas and Spectracom Generator
by Ruben Morales-Ferre, Wenbo Wang, Alejandro Sanz-Abia and Elena-Simona Lohan
Data 2020, 5(1), 18; https://doi.org/10.3390/data5010018 - 17 Feb 2020
Cited by 11 | Viewed by 4834
Abstract
This is a data descriptor paper for a set of raw GNSS signals collected via roof antennas and a Spectracom simulator for general-purpose uses. We give one example of possible data use in the context of Radio Frequency Fingerprinting (RFF) studies for signal-type identification based on front-end hardware characteristics at the transmitter or receiver side. Examples are given in this paper of achievable classification accuracy for six of the collected signal classes. RFF is one of the promising state-of-the-art methods to identify GNSS transmitters and receivers, and can find future applicability in anti-spoofing and anti-jamming solutions, for example. The uses of the provided raw data are not limited to RFF studies, but can extend to uses such as testing GNSS acquisition and tracking, antenna array experiments, and so forth.
(This article belongs to the Special Issue Data from Smartphones and Wearables)
14 pages, 1165 KiB  
Data Descriptor
Residential Power Traces for Five Houses: The iHomeLab RAPT Dataset
by Patrick Huber, Melvin Ott, Martin Friedli, Andreas Rumsch and Andrew Paice
Data 2020, 5(1), 17; https://doi.org/10.3390/data5010017 - 05 Feb 2020
Cited by 10 | Viewed by 3658
Abstract
Datasets with measurements of both solar electricity production and domestic electricity consumption separated into the major loads are interesting for research focussing on (i) local optimization of solar energy consumption and (ii) non-intrusive load monitoring. To this end, we publish the iHomeLab RAPT dataset consisting of electrical power traces from five houses in the greater Lucerne region in Switzerland, spanning a period from 1.5 up to 3.5 years with a sampling interval of five minutes. For each house, the electrical energy consumption of the aggregated household and of specific appliances such as dishwasher, washing machine, tumble dryer, hot water boiler, or heating pump was metered. Additionally, the data includes electric production data from PV panels for all five houses, and battery power flow measurement data from two houses. Thermal metadata is also provided for the three houses with a heating pump.
12 pages, 328 KiB  
Article
The Business Process Model and Notation Used for the Representation of Alzheimer’s Disease Patients Care Process
by Martin Kopecky and Hana Tomaskova
Data 2020, 5(1), 16; https://doi.org/10.3390/data5010016 - 04 Feb 2020
Cited by 5 | Viewed by 3521
Abstract
Currently, the number of patients with neurological diseases is increasing, especially those older than 65 suffering from Alzheimer’s disease. This development increases the emphasis on understanding and mapping treatment and care processes, not only for the elderly. Service providers (of both treatment and care) are under general pressure to decrease charges and maintain or improve existing levels of care. This situation is significantly influenced by a comprehensive knowledge of the whole process and its values. This publication therefore aims to describe the fundamental procedural aspects of caring for patients with Alzheimer’s disease, using Business Process Model and Notation (BPMN). It also aims to show the possibilities of using BPMN in the description of treatment and care. Business process modeling is increasingly applied not only by businesses but also by scientists working with process models. Only approximately 10% of the related publications model medical topics, and most of these deal only with clinical pathways, not with overall treatment and care processes. However, the BPMN model allows the whole process of medical and nonmedical care for patients with Alzheimer’s disease to be described, including the decomposition of partial activities into individual threads and sub-processes or atomic tasks. This paper presents the BPMN modeling and mapping of the specific care path for neurodegenerative patients. The text provides a new perspective on the BPMN modeling of Alzheimer’s disease. The presented model offers the option of expanding treatment cost calculation to simulate the process using graphical tools and languages. The overall view of this system creates a much more complex concept of the system and its surroundings.
5 pages, 389 KiB  
Data Descriptor
A Collection of 13 Archaeal and 46 Bacterial Genomes Reconstructed from Marine Metagenomes Derived from the North Sea
by Bernd Wemheuer
Data 2020, 5(1), 15; https://doi.org/10.3390/data5010015 - 04 Feb 2020
Viewed by 2478
Abstract
Marine bacteria are key drivers of ocean biogeochemistry. Despite the increasing number of studies, the complex interaction of marine bacterioplankton communities with their environment is still not fully understood. Additionally, our knowledge about prominent marine lineages is mostly based on genomic information retrieved from single isolates, which do not necessarily represent these groups. Consequently, deciphering the ecological contributions of single bacterioplankton community members is one major challenge in marine microbiology. In the present study, we reconstructed 13 archaeal and 46 bacterial metagenome-assembled genomes (MAGs) from four metagenomic data sets derived from the North Sea. Archaeal MAGs were affiliated with Marine Group II within the Euryarchaeota. Bacterial MAGs mainly belonged to marine groups within the Bacteroidetes as well as the Alpha- and Gammaproteobacteria. In addition, two bacterial MAGs were classified as members of the Actinobacteria and Verrucomicrobiota, respectively. The reconstructed genomes contribute to our understanding of important marine lineages and may serve as a basis for further research on functional traits of these groups.
18 pages, 918 KiB  
Data Descriptor
Intracranial Hemorrhage Segmentation Using a Deep Convolutional Model
by Murtadha D. Hssayeni, Muayad S. Croock, Aymen D. Salman, Hassan Falah Al-khafaji, Zakaria A. Yahya and Behnaz Ghoraani
Data 2020, 5(1), 14; https://doi.org/10.3390/data5010014 - 01 Feb 2020
Cited by 104 | Viewed by 14067
Abstract
Traumatic brain injuries may cause intracranial hemorrhages (ICH). ICH could lead to disability or death if it is not accurately diagnosed and treated in a time-sensitive manner. The current clinical protocol to diagnose ICH is for radiologists to examine Computerized Tomography (CT) scans to detect ICH and localize its regions. However, this process relies heavily on the availability of an experienced radiologist. In this paper, we designed a study protocol to collect a dataset of 82 CT scans of subjects with a traumatic brain injury. Next, the ICH regions were manually delineated in each slice by a consensus decision of two radiologists. The dataset is publicly available online at the PhysioNet repository for future analysis and comparisons. In addition to publishing the dataset, which is the main purpose of this manuscript, we implemented a deep fully convolutional network (FCN), known as U-Net, to segment the ICH regions from the CT scans in a fully automated manner. As a proof of concept, the method achieved a Dice coefficient of 0.31 for the ICH segmentation based on 5-fold cross-validation.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)
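The Dice coefficient reported in the abstract above measures the overlap between a predicted segmentation and a reference one: twice the number of pixels marked in both, divided by the total number of pixels marked in each. A minimal sketch follows, assuming binary masks flattened to lists of 0/1 values; the function name, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions made here and are not taken from the paper.

```python
def dice_coefficient(predicted, reference):
    """Dice overlap between two binary masks given as flat sequences of 0/1."""
    if len(predicted) != len(reference):
        raise ValueError("masks must have the same number of pixels")
    # Pixels marked as hemorrhage in both masks.
    overlap = sum(1 for p, r in zip(predicted, reference) if p and r)
    total = sum(1 for p in predicted if p) + sum(1 for r in reference if r)
    if total == 0:
        # Assumed convention: two empty masks agree perfectly.
        return 1.0
    return 2 * overlap / total


# Toy four-pixel masks (not CT data): one shared pixel, two marked in each.
score = dice_coefficient([1, 1, 0, 0], [1, 0, 1, 0])
```

A score of 1 means the two masks coincide and 0 means they share no marked pixel, which gives a scale for reading the reported value of 0.31.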
9 pages, 2695 KiB  
Data Descriptor
The Fundamental Clustering and Projection Suite (FCPS): A Dataset Collection to Test the Performance of Clustering and Data Projection Algorithms
by Alfred Ultsch and Jörn Lötsch
Data 2020, 5(1), 13; https://doi.org/10.3390/data5010013 - 30 Jan 2020
Cited by 11 | Viewed by 3741
Abstract
In the context of data science, data projection and clustering are common procedures. The chosen analysis method is crucial to avoid faulty pattern recognition. It is therefore necessary to know the properties and especially the limitations of projection and clustering algorithms. This report describes a collection of datasets that are grouped together in the Fundamental Clustering and Projection Suite (FCPS). The FCPS contains 10 datasets with the names “Atom”, “Chainlink”, “EngyTime”, “Golfball”, “Hepta”, “Lsun”, “Target”, “Tetra”, “TwoDiamonds”, and “WingNut”. Common clustering methods occasionally identified non-existent clusters or assigned data points to the wrong clusters in the FCPS suite. Likewise, common data projection methods could only partially reproduce the data structure correctly on a two-dimensional plane. In conclusion, the FCPS dataset collection addresses general challenges for clustering and projection algorithms such as lack of linear separability, different or small inner class spacing, classes defined by data density rather than data spacing, no cluster structure at all, outliers, or classes that are in contact. The FCPS is designed to address specific problems of structure discovery in high-dimensional spaces.
(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics)
5 pages, 165 KiB  
Editorial
Acknowledgement to Reviewers of Data in 2019
by Data Editorial Office
Data 2020, 5(1), 12; https://doi.org/10.3390/data5010012 - 29 Jan 2020
Viewed by 1579
Abstract
The editorial team greatly appreciates the reviewers who have dedicated their considerable time and expertise to the journal’s rigorous editorial process over the past 12 months, regardless of whether the papers are finally published or not [...]
11 pages, 2714 KiB  
Data Descriptor
Carbon Sequestration Rate Estimates in Delaware Bay and Barnegat Bay Tidal Wetlands Using Interpolation Mapping
by Lena Champlin, David Velinsky, Kaitlin Tucker, Christopher Sommerfield, Kari St. Laurent and Elizabeth Watson
Data 2020, 5(1), 11; https://doi.org/10.3390/data5010011 - 25 Jan 2020
Cited by 6 | Viewed by 3061
Abstract
Quantifying carbon sequestration by tidal wetlands is important for the management of carbon stocks as part of climate change mitigation. This data publication includes a spatial analysis of carbon accumulation rates in Barnegat Bay and Delaware Bay tidal wetlands. One method calculated long-term organic carbon accumulation rates from radioisotope-dated (Cs-137) sediment cores. The second method measured the organic carbon density of sediment accumulated above feldspar marker beds. Carbon accumulation rates generated by these two methods were interpolated across emergent wetland areas using kriging, with uncertainty estimated by leave-one-out cross-validation. This spatial analysis revealed greater carbon sequestration within Delaware Bay compared to Barnegat Bay. Sequestration rates were found to be more variable within Delaware Bay, and rates were greatest in the tidal freshwater area of the upper bay.
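Leave-one-out cross-validation, named in the abstract above as the uncertainty estimate, withholds each sampled location in turn, predicts it from the remaining ones, and summarizes the prediction errors. The sketch below shows that loop only. It deliberately swaps kriging for simple inverse-distance weighting to stay self-contained, so it is not the authors' interpolator; the function names, coordinates, and rate values are all hypothetical.

```python
import math


def idw_predict(known, target, power=2.0):
    """Inverse-distance-weighted estimate at `target` (a stand-in for kriging)."""
    weighted_sum = 0.0
    weight_total = 0.0
    for (x, y), value in known:
        distance = math.hypot(x - target[0], y - target[1])
        if distance == 0.0:
            return value  # exact hit on a sampled location
        weight = 1.0 / distance ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


def leave_one_out_rmse(samples, power=2.0):
    """Withhold each sample once, predict it from the rest, and return the RMSE."""
    squared_errors = []
    for i, (location, observed) in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        predicted = idw_predict(others, location, power)
        squared_errors.append((predicted - observed) ** 2)
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical core locations (x, y) with made-up accumulation rates.
cores = [((0.0, 0.0), 120.0), ((1.0, 0.0), 150.0),
         ((0.0, 1.0), 135.0), ((1.0, 1.0), 160.0)]
rmse = leave_one_out_rmse(cores)
```

The same loop applies to any interpolator: replacing `idw_predict` with a kriging predictor would give a cross-validated error in the units of the interpolated rates.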
5 pages, 179 KiB  
Data Descriptor
An Open-Access Dataset of Thorough QT Studies Results
by Barbara Wiśniowska, Zofia Tylutki and Sebastian Polak
Data 2020, 5(1), 10; https://doi.org/10.3390/data5010010 - 25 Jan 2020
Cited by 3 | Viewed by 2590
Abstract
Along with the current interest in changes of cardiovascular risk assessment strategy and inclusion of in silico modelling into the applicable paradigm, the need for data has increased, both for model generation and testing. Data collection is often a time-consuming but inevitable step in the modelling process, requiring extensive literature searches and the identification of alternative resources providing complementary results. The next step, namely data extraction, can also be challenging. Here we present a collection of thorough QT/QTc (TQT) study results with detailed descriptions of study design, pharmacokinetics, and pharmacodynamic endpoints. The presented dataset provides information that can be further utilized to assess the predictive performance of different preclinical biomarkers for QT prolongation effects with the use of various modelling approaches. As the exposure levels and population description are included, the study design and characteristics of the study population can be recovered precisely in the simulation. Another possible application of the TQT dataset is the analysis of the drug characteristic/QT prolongation/TdP (torsade de pointes) relationship after the integration of the provided information with other databases and tools. These include drug cardiac safety classifications (e.g., CredibleMeds), the Comprehensive in vitro Proarrhythmia Assay (CiPA) compounds classification, as well as those containing information on physico-chemical properties or absorption, distribution, metabolism, excretion (ADME) data like PubChem or DrugBank. Full article
18 pages, 4378 KiB  
Article
Does Land Use and Landscape Contribute to Self-Harm? A Sustainability Cities Framework
by Eric Vaz, Richard Ross Shaker, Michael D. Cusimano, Luis Loures and Jamal Jokar Arsanjani
Data 2020, 5(1), 9; https://doi.org/10.3390/data5010009 - 21 Jan 2020
Cited by 8 | Viewed by 2866
Abstract
Self-harm has become one of the leading causes of mortality in developed countries. The overall rate for suicide in Canada was 11.3 per 100,000 in 2015, according to Statistics Canada. Between 2000 and 2007 the lowest rates of suicide in Canada were in Ontario, one of the most urbanized regions in Canada. However, the interaction between land use, landscape and self-harm has not been significantly studied for urban cores. It is thus of relevance to understand the impacts of land use and landscape on suicidal behavior. This paper takes a spatial analytical approach to assess the occurrence of self-harm in one of the densest urban cores in the country: Toronto. Individual self-harm data was gathered from the National Ambulatory Care Reporting System (NACRS) and geocoded into census tract divisions. Toronto’s urban landscape is quantified spatially through the calculation of its land use at different levels: (i) land use type, (ii) sprawl metrics relating to (a) dispersion and (b) sprawl/mix incidence; (iii) fragmentation metrics of (a) urban fragmentation and (b) density and (iv) demographics of (a) income and (b) age. A stepwise regression is built to understand the most influential factors leading to self-harm from this selection, generating an explanatory model. Full article
(This article belongs to the Special Issue Big Data for Sustainable Development)
14 pages, 3071 KiB  
Technical Note
A Python Algorithm for Shortest-Path River Network Distance Calculations Considering River Flow Direction
by Nicolas Cadieux, Margaret Kalacska, Oliver T. Coomes, Mari Tanaka and Yoshito Takasaki
Data 2020, 5(1), 8; https://doi.org/10.3390/data5010008 - 16 Jan 2020
Cited by 3 | Viewed by 11632
Abstract
Vector-based shortest path analysis in geographic information systems (GIS) is well established for road networks. Even though these network algorithms can be applied to river layers, they do not generally consider the direction of flow. This paper presents a Python 3.7 program (upstream_downstream_shortests_path_dijkstra.py) that was specifically developed for river networks. It implements multiple single-source (one-to-one) weighted Dijkstra shortest path calculations on a list of provided source and target nodes, and returns the route geometry, the total distance between each source and target node, and the total upstream and downstream distances for each shortest path. The end result is similar to what would be obtained by an “all-pairs” weighted Dijkstra shortest path algorithm. Contrary to an “all-pairs” Dijkstra, the algorithm only operates on the source and target nodes that were specified by the user and not on all of the nodes contained within the graph. For efficiency, only the upper distance matrix is returned (e.g., distance from node A to node B), while the lower distance matrix (e.g., distance from node B to node A) is not. The program is intended to be used in a multiprocessor environment and relies on Python’s multiprocessing package. Full article
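The idea described in this abstract can be illustrated with a minimal sketch. This is not the authors' program: the function name `river_shortest_path`, the edge format, and the toy network are all hypothetical, and only the Python standard library is used. Each river segment is stored with its flow direction, Dijkstra runs on the undirected network, and the length of the resulting route is then split into the parts travelled with and against the flow.

```python
import heapq


def river_shortest_path(edges, source, target):
    """One-to-one weighted Dijkstra on a river network (illustrative sketch).

    edges: iterable of (from_node, to_node, length), where water flows
    from from_node to to_node. This input format is an assumption.
    """
    # Build an undirected adjacency list, remembering whether each
    # traversal goes with the flow (downstream) or against it (upstream).
    graph = {}
    for u, v, length in edges:
        graph.setdefault(u, []).append((v, length, "downstream"))
        graph.setdefault(v, []).append((u, length, "upstream"))

    best = {source: 0.0}   # shortest known distance to each node
    prev = {}              # node -> (previous node, segment length, direction)
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            break
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, length, direction in graph.get(node, []):
            cand = dist + length
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = (node, length, direction)
                heapq.heappush(heap, (cand, nxt))

    if target not in best:
        return None  # target is not connected to source

    # Walk back from target to source, summing upstream and downstream parts.
    totals = {"upstream": 0.0, "downstream": 0.0}
    route = [target]
    node = target
    while node != source:
        node, length, direction = prev[node]
        totals[direction] += length
        route.append(node)
    route.reverse()
    return {
        "route": route,
        "total": best[target],
        "upstream": totals["upstream"],
        "downstream": totals["downstream"],
    }


# Toy network (made up): tributaries A and B join at C, which drains to D.
toy_edges = [("A", "C", 2.0), ("B", "C", 3.0), ("C", "D", 4.0)]
result = river_shortest_path(toy_edges, "A", "B")
print(result)
```

In the toy network, travelling from A to B means going 2.0 units downstream to the confluence C and then 3.0 units upstream to B. Running the function once per requested source and target pair, rather than over every node, mirrors the one-to-one design the abstract describes.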
10 pages, 743 KiB  
Data Descriptor
SocNav1: A Dataset to Benchmark and Learn Social Navigation Conventions
by Luis J. Manso, Pedro Nuñez, Luis V. Calderita, Diego R. Faria and Pilar Bachiller
Data 2020, 5(1), 7; https://doi.org/10.3390/data5010007 - 14 Jan 2020
Cited by 15 | Viewed by 3766
Abstract
Datasets are essential to the development and evaluation of machine learning and artificial intelligence algorithms. As new tasks are addressed, new datasets are required. Training algorithms for human-aware navigation is an example of this need. Different factors make designing and gathering data for human-aware navigation datasets challenging. Firstly, the problem itself is subjective: different dataset contributors will very frequently disagree to some extent on their labels. Secondly, the number of variables to consider is undetermined and culture-dependent. This paper presents SocNav1, a dataset for social navigation conventions. SocNav1 aims at evaluating the robots’ ability to assess the level of discomfort that their presence might generate among humans. The 9280 samples in SocNav1 seem to be enough for machine learning purposes given the relatively small size of the data structures describing the scenarios. Furthermore, SocNav1 is particularly well-suited to be used to benchmark non-Euclidean machine learning algorithms such as graph neural networks. This paper describes the proposed dataset and the method employed to gather the data. To provide a further understanding of the nature of the dataset, an analysis and validation of the collected data are also presented. Full article
(This article belongs to the Special Issue Data from Smartphones and Wearables)
42 pages, 3117 KiB  
Review
Basic Features of the Analysis of Germination Data with Generalized Linear Mixed Models
by Alberto Gianinetti
Data 2020, 5(1), 6; https://doi.org/10.3390/data5010006 - 08 Jan 2020
Cited by 16 | Viewed by 6460
Abstract
Germination data are discrete and binomial. Although analysis of variance (ANOVA) has long been used for the statistical analysis of these data, generalized linear mixed models (GzLMMs) provide a more consistent theoretical framework. GzLMMs are suitable for final germination percentages (FGP) as well as longitudinal studies of germination time-courses. Germination indices (i.e., single-value parameters summarizing the results of a germination assay by combining the level and rapidity of germination) and other data with a Gaussian error distribution can be analyzed too. There are, however, different kinds of GzLMMs: Conditional (i.e., random effects are modeled as deviations from the general intercept with a specific covariance structure), marginal (i.e., random effects are modeled solely as a variance/covariance structure of the error terms), and quasi-marginal (some random effects are modeled as deviations from the intercept and some are modeled as a covariance structure of the error terms) models can be applied to the same data. 
It is shown that: (a) For germination data, conditional, marginal, and quasi-marginal GzLMMs tend to converge to a similar inference; (b) conditional models are the first choice for FGP; (c) marginal or quasi-marginal models are more suited for longitudinal studies, although conditional models lead to a congruent inference; (d) in general, common random factors are better dealt with as random intercepts, whereas serial correlation is easier to model in terms of the covariance structure of the error terms; (e) germination indices are not binomial and can be easier to analyze with a marginal model; (f) in boundary conditions (when some means approach 0% or 100%), conditional models with an integral approximation of true likelihood are more appropriate; in non-boundary conditions, (g) germination data can be fitted with default pseudo-likelihood estimation techniques, on the basis of the SAS-based code templates provided here; (h) GzLMMs are remarkably good for the analysis of germination data except if some means are 0% or 100%. In this case, alternative statistical approaches may be used, such as survival analysis or linear mixed models (LMMs) with transformed data, unless an ad hoc data adjustment in estimates of limit means is considered, either experimentally or computationally. This review is intended as a basic tutorial for the application of GzLMMs, and is, therefore, of interest primarily to researchers in the agricultural sciences. Full article
5 pages, 189 KiB  
Editorial
Overcoming Data Scarcity in Earth Science
by Angela Gorgoglione, Alberto Castro, Christian Chreties and Lorena Etcheverry
Data 2020, 5(1), 5; https://doi.org/10.3390/data5010005 - 01 Jan 2020
Cited by 10 | Viewed by 2789
Abstract
The data scarcity problem is repeatedly encountered in environmental research. It may lead to an inadequate representation of the complexity of an environmental system's response to any input or change (natural or human-induced). In such a case, before engaging in new, expensive studies to gather and analyze additional data, it is reasonable first to understand what enhancement in estimates of system performance would result if all the available data could be well exploited. The purpose of this Special Issue, “Overcoming Data Scarcity in Earth Science” in the Data journal, is to draw attention to the body of knowledge that aims at improving the capacity to exploit the available data to better represent, understand, predict, and manage the behavior of environmental systems at meaningful space-time scales. This Special Issue contains six publications (three research articles, one review, and two data descriptors) covering a wide range of environmental fields: geophysics, meteorology/climatology, ecology, water quality, and hydrology. Full article
(This article belongs to the Special Issue Overcoming Data Scarcity in Earth Science)
8 pages, 702 KiB  
Data Descriptor
Landslide Inventory (2001–2017) of Chittagong Hilly Areas, Bangladesh
by Yasin Wahid Rabby and Yingkui Li
Data 2020, 5(1), 4; https://doi.org/10.3390/data5010004 - 25 Dec 2019
Cited by 17 | Viewed by 4898
Abstract
Landslides are a frequent natural hazard in the Chittagong Hilly Areas (CHA), Bangladesh, causing loss of life and damage to the economy. Despite this, an official landslide inventory is still lacking in this area. In this paper, we present a landslide inventory of this area prepared using the visual interpretation of Google Earth images (Google Earth Mapping), field mapping, and a literature search. We mapped 730 landslides that occurred from January 2001 to March 2017. Different landslide attributes including type, size, distribution, state, water content, and triggers are presented in the dataset. In this area, slide and flow were the two dominant types of landslides. Out of the five districts (Bandarban, Chittagong, Cox’s Bazar, Khagrachari, and Rangamati), most (55%) of the landslides occurred in the Chittagong and Rangamati districts. About 45% of the landslides were small (<100 m²) in size, while the maximum size of the detected landslides was 85,202 m². This dataset will help to understand the characteristics of landslides in CHA and provide useful guidance for policy implementation. Full article
15 pages, 8009 KiB  
Data Descriptor
Multi-Attribute Ecological and Socioeconomic Geodatabase for the Gulf of Mexico Coastal Region of the United States
by Andrew Shamaskin, Sathishkumar Samiappan, Jiangdong Liu, Jennifer Roberts, Anna Linhoss and Kristine Evans
Data 2020, 5(1), 3; https://doi.org/10.3390/data5010003 - 20 Dec 2019
Cited by 6 | Viewed by 3506
Abstract
Strategic, data-driven conservation approaches are increasing in popularity as conservation communities gain access to better science, more computing power, and more data. High-resolution geospatial data, indicating ecosystem functions and economic activity, can be very useful for any conservation expert or funding agency. A framework was developed for a data-driven conservation prioritization tool and a data visualization tool. The developed tools were then implemented and tested for the U.S. Gulf of Mexico coastal region defined by the Gulf Coast Ecosystem Restoration Council. As a part of this tool development, priority attributes and data measures were developed for the region through 13 stakeholder charrettes with local, state, federal, and other non-profit organizations involved in land conservation. This paper presents the measures that were developed to reflect stakeholder priorities. These measures were derived from openly available geospatial and non-geospatial data sources. This database contains 19 measures, aggregated into a 1 km² hexagonal grid and grouped by the overarching goals of habitat, water quality and quantity, living coastal and marine resources, community resilience, and economy. The developed measures provide useful data for a conservation planning framework in the U.S. Gulf of Mexico coastal region. Full article
14 pages, 1355 KiB  
Article
Classification of Soils into Hydrologic Groups Using Machine Learning
by Shiny Abraham, Chau Huynh and Huy Vu
Data 2020, 5(1), 2; https://doi.org/10.3390/data5010002 - 19 Dec 2019
Cited by 47 | Viewed by 7780
Abstract
Hydrologic soil groups play an important role in the determination of surface runoff, which, in turn, is crucial for soil and water conservation efforts. Traditionally, placement of soil into appropriate hydrologic groups is based on the judgement of soil scientists, primarily relying on their interpretation of guidelines published by regional or national agencies. As a result, large-scale mapping of hydrologic soil groups results in widespread inconsistencies and inaccuracies. This paper presents an application of machine learning for classification of soil into hydrologic groups. Based on features such as percentages of sand, silt and clay, and the value of saturated hydraulic conductivity, machine learning models were trained to classify soil into four hydrologic groups. The results of the classification obtained using algorithms such as k-Nearest Neighbors, Support Vector Machine with Gaussian Kernel, Decision Trees, Classification Bagged Ensembles and TreeBagger (Random Forest) were compared to those obtained using estimation based on soil texture. The performance of these models was compared and evaluated using per-class metrics and micro- and macro-averages. Overall, performance metrics related to kNN, Decision Tree and TreeBagger exceeded those for SVM-Gaussian Kernel and Classification Bagged Ensemble. Among the four hydrologic groups, it was noticed that group B had the highest rate of false positives. Full article
(This article belongs to the Special Issue Overcoming Data Scarcity in Earth Science)
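The abstract above evaluates classifiers with per-class metrics and micro- and macro-averages. The following is a minimal sketch of how those averages differ, using precision as the example metric. The function name `precision_summary` and the label lists are made up for illustration; they are not taken from the paper or its data.

```python
def precision_summary(y_true, y_pred, labels):
    """Return per-class, macro-averaged, and micro-averaged precision."""
    # Count true positives and false positives for every class.
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[pred] += 1
        else:
            fp[pred] += 1

    # Per-class precision; defined here as 0.0 when a class is never predicted.
    per_class = {
        c: (tp[c] / (tp[c] + fp[c]) if (tp[c] + fp[c]) else 0.0)
        for c in labels
    }

    # Macro-average: unweighted mean of the per-class values.
    macro = sum(per_class.values()) / len(labels)

    # Micro-average: pool the counts over all classes, then compute once.
    tp_all = sum(tp.values())
    fp_all = sum(fp.values())
    micro = tp_all / (tp_all + fp_all) if (tp_all + fp_all) else 0.0
    return per_class, macro, micro


# Made-up labels for four hydrologic groups A to D (not real data).
y_true = ["A", "A", "B", "B", "C", "D"]
y_pred = ["A", "B", "B", "B", "B", "D"]
per_class, macro, micro = precision_summary(y_true, y_pred, ["A", "B", "C", "D"])
print(per_class, macro, micro)
```

In this toy case group B collects both misclassified samples, so its precision drops to 0.5 while A and D stay at 1.0. The macro-average weights every group equally, whereas the micro-average is dominated by the groups with the most predictions, which is why reporting both can be informative when classes are imbalanced.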
11 pages, 2759 KiB  
Data Descriptor
Daily MODIS Snow Cover Maps for the European Alps from 2002 onwards at 250 m Horizontal Resolution Along with a Nearly Cloud-Free Version
by Michael Matiu, Alexander Jacob and Claudia Notarnicola
Data 2020, 5(1), 1; https://doi.org/10.3390/data5010001 - 18 Dec 2019
Cited by 13 | Viewed by 3110
Abstract
Snow cover dynamics impact a whole range of systems in mountain regions, from society to economy to ecology, and they also affect downstream regions. Monitoring and analyzing snow cover dynamics has been facilitated by remote sensing products. Here, we present two high-resolution daily snow cover data sets for the entire European Alps covering the years 2002 to 2019, with automatic updates. The first is based on the Moderate Resolution Imaging Spectroradiometer (MODIS), and its implementation is specifically tailored to the complex terrain, exploiting the highest available resolution of 250 m. The second is a nearly cloud-free product derived from the first using temporal and spatial filters, which reduce average cloud cover from 41.9% to less than 0.1%. Validation has been performed using an extensive network of 312 ground stations, and for the cloud filtering also with cross-validation. Average overall accuracies were 93% for the initial and 91.5% for the cloud-filtered product using the ground stations, and 95.3% for the cross-validation of the cloud filter. The data can be accessed online and via the R and Python programming languages. Possible applications of the data include but are not limited to hydrology, cryosphere and climate. Full article