Data | July 2021 - Browse Articles

10 pages, 2600 KiB

Open Access Data Descriptor

Multi-Layout Invoice Document Dataset (MIDD): A Dataset for Named Entity Recognition

by Dipali Baviskar, Swati Ahirrao and Ketan Kotecha

Data 2021, 6(7), 78; https://doi.org/10.3390/data6070078 - 20 Jul 2021

Cited by 5 | Viewed by 12559

The day-to-day working of an organization produces a massive volume of unstructured data in the form of invoices, legal contracts, mortgage processing forms, and many more. Organizations can utilize the insights concealed in such unstructured documents for their operational benefit. However, analyzing and extracting insights from such numerous and complex unstructured documents is a tedious task. Hence, research in this area is encouraging the development of novel frameworks and tools that can automate key information extraction from unstructured documents. However, the limited availability of standard, high-quality, annotated unstructured document datasets is a serious obstacle to extracting key information from unstructured documents. This work expedites researchers' tasks by providing a high-quality, highly diverse, multi-layout, and annotated invoice document dataset for extracting key information from unstructured documents. Researchers can use the proposed dataset for layout-independent unstructured invoice document processing and to develop an artificial intelligence (AI)-based tool to identify and extract named entities in invoice documents. Our dataset includes 630 invoice document PDFs with four different layouts collected from diverse suppliers. As far as we know, our invoice dataset is the only openly available dataset comprising high-quality, highly diverse, multi-layout, and annotated invoice documents.

(This article belongs to the Section Information Systems and Data Management)

19 pages, 1739 KiB

Open Access Feature Paper Article

Dealing with Randomness and Concept Drift in Large Datasets

by Kassim S. Mwitondi and Raed A. Said

Data 2021, 6(7), 77; https://doi.org/10.3390/data6070077 - 19 Jul 2021

Cited by 1 | Viewed by 3763

Abstract

Data-driven solutions to societal challenges continue to bring new dimensions to our daily lives. For example, while good-quality education is a well-acknowledged foundation of sustainable development, innovation and creativity, variations in student attainment and general performance remain commonplace. Developing data-driven solutions hinges on two fronts: technical and application. The former relates to the modelling perspective, where two of the major challenges are the impact of data randomness and general variations in definitions, typically referred to as concept drift in machine learning. The latter relates to devising data-driven solutions to address real-life challenges such as identifying potential triggers of pedagogical performance, which aligns with Sustainable Development Goal (SDG) #4, Quality Education. A total of 3145 pedagogical data points were obtained from the central data collection platform for the United Arab Emirates (UAE) Ministry of Education (MoE). Using simple data visualisation and machine learning techniques via a generic algorithm for sampling, measuring and assessing, the paper highlights research pathways for educationists and data scientists to attain unified goals in an interdisciplinary context. Its novelty derives from an embedded capacity to address data randomness and concept drift by minimising modelling variations and yielding consistent results across samples. Results show that intricate relationships among data attributes describe the invariant conditions that practitioners in the two overlapping fields of data science and education must identify.

(This article belongs to the Special Issue Education Data Mining)

20 pages, 1103 KiB

Open Access Data Descriptor

Impact of COVID-19 on Electricity Demand: Deriving Minimum States of System Health for Studies on Resilience

by Smruti Manjunath, Madhura Yeligeti, Maria Fyta, Jannik Haas and Hans-Christian Gils

Data 2021, 6(7), 76; https://doi.org/10.3390/data6070076 - 16 Jul 2021

Viewed by 2584

Abstract

To assess the resilience of energy systems, i.e., the ability to recover after an unexpected shock, the system’s minimum state of service is a key input. Quantitative descriptions of such states are inherently elusive. The measures adopted by governments to contain COVID-19 have provided empirical data, which may serve as a proxy for such states of minimum service. Here, we systematize the impact of the adopted COVID-19 measures on electricity demand. We classify the measures into three phases of increasing stringency, ranging from working from home to soft and full lockdowns, for four major electricity-consuming countries of Europe. We use readily accessible data from the European Network of Transmission System Operators for Electricity as a basis. For each country and phase, we derive representative daily load profiles with hourly resolution obtained by k-medoids clustering. The analysis reveals the influence of the different measures on energy consumption and the differences among the four countries. It is observed that the daily peak load is considerably flattened and the total electricity consumption decreases by up to 30% under the circumstances brought about by the COVID-19 restrictions. These demand profiles are useful for the energy planning community, especially when designing future electricity systems with a focus on system resilience and a more digitalised society in terms of working from home.

10 pages, 252 KiB

Open Access Data Descriptor

Preprocessing of Public RNA-Sequencing Datasets to Facilitate Downstream Analyses of Human Diseases

by Naomi Rapier-Sharman, John Krapohl, Ethan J. Beausoleil, Kennedy T. L. Gifford, Benjamin R. Hinatsu, Curtis S. Hoffmann, Makayla Komer, Tiana M. Scott and Brett E. Pickett

Data 2021, 6(7), 75; https://doi.org/10.3390/data6070075 - 15 Jul 2021

Viewed by 3328

Abstract

Publicly available RNA-sequencing (RNA-seq) data are a rich resource for elucidating the mechanisms of human disease; however, preprocessing these data requires considerable bioinformatic expertise and computational infrastructure. Analyzing multiple datasets with a consistent computational workflow increases the accuracy of downstream meta-analyses. This collection of datasets represents the human intracellular transcriptional response to disorders and diseases such as acute lymphoblastic leukemia (ALL), B-cell lymphomas, chronic obstructive pulmonary disease (COPD), colorectal cancer, and lupus erythematosus, as well as infection with pathogens including Borrelia burgdorferi, hantavirus, influenza A virus, Middle East respiratory syndrome coronavirus (MERS-CoV), Streptococcus pneumoniae, respiratory syncytial virus (RSV), severe acute respiratory syndrome coronavirus (SARS-CoV), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We calculated the statistically significant differentially expressed genes and Gene Ontology terms for all datasets. In addition, a subset of the datasets includes results from splice variant analyses and intracellular signaling pathway enrichments, as well as read mapping and quantification. All analyses were performed using well-established algorithms and are provided to facilitate future data mining activities and wet lab studies and to accelerate collaboration and discovery.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

31 pages, 1021 KiB

Open Access Article

Performing Learning Analytics via Generalised Mixed-Effects Trees

by Luca Fontana, Chiara Masci, Francesca Ieva and Anna Maria Paganoni

Data 2021, 6(7), 74; https://doi.org/10.3390/data6070074 - 9 Jul 2021

Cited by 6 | Viewed by 3088

Abstract

The importance of educational data mining and learning analytics in higher education institutions is increasingly recognised. The analysis of university careers and the prediction of student dropout are among the most studied topics in the area of learning analytics. To estimate the likelihood of a student dropping out, we propose an innovative statistical method that is a generalisation of mixed-effects trees for a response variable in the exponential family: generalised mixed-effects trees (GMET). We performed a simulation study in order to validate the performance of our proposed method and to compare GMET to classical models. In the case study, we applied GMET to model undergraduate student dropout in different courses at Politecnico di Milano. The model was able to identify discriminating student characteristics and estimate the effect of each degree-based course on the probability of student dropout.

(This article belongs to the Special Issue Education Data Mining)

23 pages, 640 KiB

Open Access Article

A Robust Distributed Clustering of Large Data Sets on a Grid of Commodity Machines

by Salah Taamneh, Mo’taz Al-Hami, Hani Bani-Salameh and Alaa E. Abdallah

Data 2021, 6(7), 73; https://doi.org/10.3390/data6070073 - 7 Jul 2021

Cited by 1 | Viewed by 2392

Abstract

Distributed clustering algorithms have proven to be effective in dramatically reducing execution time. However, distributed environments are characterized by a high rate of failure. Nodes can easily become unreachable. Furthermore, it is not guaranteed that messages are delivered to their destination. As a result, fault tolerance mechanisms are of paramount importance to achieve resiliency and guarantee continuous progress. In this paper, a fault-tolerant distributed k-means algorithm is proposed on a grid of commodity machines. Machines in such an environment are connected in a peer-to-peer fashion and managed by a gossip protocol, with the actor model used as the concurrency model. The fact that no synchronization is needed makes it a good fit for parallel processing. Using the passive replication technique for the leader node and the active replication technique for the workers, the system exhibited robustness against failures. The results showed that the distributed k-means algorithm with no fault tolerance mechanisms achieved up to a 34% improvement over the Hadoop-based k-means algorithm, while the robust one achieved up to a 12% improvement. The experiments also showed that the overhead of using such techniques was negligible. Moreover, the results indicated that losing up to 10% of the messages had no real impact on the overall performance.

20 pages, 1360 KiB

Open Access Article

BROAD—A Benchmark for Robust Inertial Orientation Estimation

by Daniel Laidig, Marco Caruso, Andrea Cereatti and Thomas Seel

Data 2021, 6(7), 72; https://doi.org/10.3390/data6070072 - 27 Jun 2021

Cited by 21 | Viewed by 5043

Abstract

Inertial measurement units (IMUs) enable orientation, velocity, and position estimation in several application domains ranging from robotics and autonomous vehicles to human motion capture and rehabilitation engineering. Errors in orientation estimation greatly affect any of those motion parameters. The present work explains the main challenges in inertial orientation estimation (IOE) and presents an extensive benchmark dataset that includes 3D inertial and magnetic data with synchronized optical marker-based ground truth measurements, the Berlin Robust Orientation Estimation Assessment Dataset (BROAD). The BROAD dataset consists of 39 trials that are conducted at different speeds and include various types of movement. Of these, 23 trials are performed in an undisturbed indoor environment, and 16 trials are recorded with deliberate magnetometer and accelerometer disturbances. We furthermore propose error metrics that allow for IOE accuracy evaluation while separating the heading and inclination portions of the error, and we introduce well-defined benchmark metrics. Based on the proposed benchmark, we perform an exemplary case study on two widely used openly available IOE algorithms. Due to the broad range of motion and disturbance scenarios, the proposed benchmark is expected to provide valuable insight and useful tools for the assessment, selection, and further development of inertial sensor fusion methods and IMU-based application systems.

(This article belongs to the Section Information Systems and Data Management)

11 pages, 349 KiB

Open Access Data Descriptor

An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

by Gonçalo Carnaz, Mário Antunes and Vitor Beires Nogueira

Data 2021, 6(7), 71; https://doi.org/10.3390/data6070071 - 26 Jun 2021

Cited by 3 | Viewed by 2848

Abstract

Criminal investigations collect and analyze the facts related to a crime, from which the investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary and applied science, which includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other methods and techniques of investigation. These techniques produce both digital and paper documents that have to be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities that are mentioned in the investigation. The computerized processing of these documents is a helping hand to the criminal investigation, as it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. There exists a wide set of dedicated tools, but they have a major limitation: they are unable to process criminal reports in the Portuguese language, as an annotated corpus for that purpose does not exist. This paper presents an annotated corpus, composed of a collection of anonymized crime-related documents, which were extracted from official and open sources. The dataset was produced as the result of an exploratory initiative to collect crime-related data from websites and conditioned-access police reports. The dataset was evaluated, and a mean precision of 0.808, recall of 0.722, and F1-score of 0.733 were obtained with the classification of the annotated named entities present in the crime-related documents. This corpus can be employed to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools to detect and correlate entities in the documents. Some examples are sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
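
A note on the metrics in the abstract above: the reported mean F1-score (0.733) is lower than the harmonic mean of the reported mean precision and mean recall. One plausible reason, assuming the figures are averages over entity classes or runs (the abstract does not say how they were averaged), is that the average of per-class F1-scores is never larger than the F1-score of the averaged precision and recall. The sketch below illustrates this; the per-class values are hypothetical and are not taken from the paper.

```python
def f1(precision: float, recall: float) -> float:
    """F1-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Mean precision and recall as reported in the abstract.
f1_of_reported_means = f1(0.808, 0.722)
print(round(f1_of_reported_means, 3))  # 0.763, above the reported mean F1 of 0.733

# Hypothetical per-class (precision, recall) pairs, for illustration only.
per_class = [(0.95, 0.90), (0.80, 0.50), (0.60, 0.75)]

# Macro-averaged F1: average the per-class F1-scores.
macro_f1 = sum(f1(p, r) for p, r in per_class) / len(per_class)

# F1 computed from the averaged precision and averaged recall.
avg_p = sum(p for p, _ in per_class) / len(per_class)
avg_r = sum(r for _, r in per_class) / len(per_class)
f1_of_class_means = f1(avg_p, avg_r)

# The harmonic mean is concave, so the macro average cannot exceed
# the F1 of the averaged precision and recall.
print(macro_f1 <= f1_of_class_means)  # True
```

Under this assumption, the three reported numbers are mutually consistent even though 0.733 is not the harmonic mean of 0.808 and 0.722.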

(This article belongs to the Section Information Systems and Data Management)

14 pages, 5084 KiB

Open Access Data Descriptor

An AI-Enabled Approach in Analyzing Media Data: An Example from Data on COVID-19 News Coverage in Vietnam

by Quan-Hoang Vuong, Viet-Phuong La, Thanh-Huyen T. Nguyen, Minh-Hoang Nguyen, Tam-Tri Le and Manh-Toan Ho

Data 2021, 6(7), 70; https://doi.org/10.3390/data6070070 - 25 Jun 2021

Cited by 6 | Viewed by 3758

Abstract

This method article presents the nuts and bolts of an AI-enabled approach to extracting and analyzing social media data. The method is based on our previous rapidly cited COVID-19 research publication, which worked on a dataset of more than 14,000 news articles from Vietnamese newspapers to provide a comprehensive picture of how Vietnam has been responding to this unprecedented pandemic. The same method is behind our IUCN-supported research regarding the social aspects of environmental protection missions, now appearing in print in Wiley’s Corporate Social Responsibility and Environmental Management. Homemade AI-enabled software was the backbone of the study. The software provides a fast and automatic approach to collecting and analyzing social data. Moreover, the tool allows manual sorting of the data, AI-generated word tokenizing in the Vietnamese language, and powerful visualization. The method aims to give social scientists an effective but low-cost way to gather a massive amount of data and analyze it in a short amount of time.

(This article belongs to the Special Issue Web Usage Mining)

18 pages, 5803 KiB

Open Access Article

Transitioning to Society 5.0 in Africa: Tools to Support ICT Infrastructure Sharing

by Kennedy Nomamidobo Amadasun, Michael Short, Rajesh Shankar-Priya and Tracey Crosbie

Data 2021, 6(7), 69; https://doi.org/10.3390/data6070069 - 25 Jun 2021

Cited by 5 | Viewed by 2593

Abstract

Society 5.0 represents an opportunity to transform the economy and create a digital society with the goal of long-term sustainable development and economic growth. Boosting ICT is increasingly important as an effective and efficient means of achieving this transformation, and Target 9c of the UN Sustainable Development Goals is to ‘Significantly increase access to information and communications technology and strive to provide universal and affordable access to the Internet in least developed countries’. Mobile telecommunication systems have become the most effective and convenient means of communicating in the world, and as such, they are revolutionizing business operations. Nigeria is the fastest growing telecommunication market in Africa, with approximately 298 million subscribers accommodated by over 53,000 base transceiver stations (BTSs), which are largely concentrated in urban areas. As a result of increasing subscribers, all mobile network service providers in Nigeria are building new BTSs, often without considering existing infrastructure. This has led to a proliferation of masts, defacing the environment and causing unnecessary environmental pollution, as BTSs are largely powered by diesel generators. It is therefore becoming paramount for the telecommunication regulatory body in Nigeria to enforce principles of infrastructure sharing and the colocation of sites for all mobile network service provider BTSs to improve network availability, reliability, scalability, customer satisfaction and sustainability. This paper argues, through the development of ICT tools and their application to a case study, that infrastructure sharing and colocation of sites are not only feasible if supported correctly but also offer the potential to reduce operational and capital expenditure, reduce the number of BTSs required for the rapidly growing mobile telecoms industry in Nigeria and, in doing so, reduce environmental pollution.

(This article belongs to the Special Issue Development of a Smart Future under Society 5.0)

Data, Volume 6, Issue 7 (July 2021) – 10 articles
