Enhanced Data Mining and Visualization of Sensory-Graph-Modeled Datasets through Summarization

Hashmi, Syed Jalaluddin; Alabdullah, Bayan; Al Mudawi, Naif; Algarni, Asaad; Jalal, Ahmad; Liu, Hui

doi:10.3390/s24144554

by Syed Jalaluddin Hashmi ¹, Bayan Alabdullah ², Naif Al Mudawi ³, Asaad Algarni ⁴, Ahmad Jalal ⁵,* and Hui Liu ⁶,*

¹ School of Computing, National University of Computer and Emerging Science, Islamabad 44000, Pakistan
² Department of Information Systems, College of Computer and Information Sciences, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
³ Department of Computer Science, College of Computer Science and Information System, Najran University, Najran 55461, Saudi Arabia
⁴ Department of Computer Sciences, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia
⁵ Faculty of Computing and AI, Air University, E-9, Islamabad 44000, Pakistan
⁶ Cognitive Systems Lab, University of Bremen, 28359 Bremen, Germany
* Authors to whom correspondence should be addressed.

Sensors 2024, 24(14), 4554; https://doi.org/10.3390/s24144554

Submission received: 13 May 2024 / Revised: 11 June 2024 / Accepted: 9 July 2024 / Published: 14 July 2024

(This article belongs to the Section Biosensors)

Abstract:

The acquisition, processing, mining, and visualization of sensory data for knowledge discovery and decision support has recently been a popular area of research and exploration. Its usefulness is paramount because of its relationship to the continuous involvement in the improvement of healthcare and other related disciplines. As a result of this, a huge amount of data have been collected and analyzed. These data are made available for the research community in various shapes and formats; their representation and study in the form of graphs or networks is also an area of research which many scholars are focused on. However, the large size of such graph datasets poses challenges in data mining and visualization. For example, knowledge discovery from the Bio–Mouse–Gene dataset, which has over 43 thousand nodes and 14.5 million edges, is a non-trivial job. In this regard, summarizing the large graphs provided is a useful alternative. Graph summarization aims to provide the efficient analysis of such complex and large-sized data; hence, it is a beneficial approach. During summarization, all the nodes that have similar structural properties are merged together. In doing so, traditional methods often overlook the importance of personalizing the summary, which would be helpful in highlighting certain targeted nodes. Personalized or context-specific scenarios require a more tailored approach for accurately capturing distinct patterns and trends. Hence, the concept of personalized graph summarization aims to acquire a concise depiction of the graph, emphasizing connections that are closer in proximity to a specific set of given target nodes. In this paper, we present a faster algorithm for the personalized graph summarization (PGS) problem, named IPGS; this has been designed to facilitate enhanced and effective data mining and visualization of datasets from various domains, including biosensors. 
Our objective is to obtain a compression ratio similar to that of the state-of-the-art PGS algorithm, but in less time. To achieve this, we reduce the execution time of the current state-of-the-art approach by using weighted locality-sensitive hashing. Experiments on eight large, publicly available datasets demonstrate the effectiveness and scalability of IPGS, which provides a compression ratio similar to that of the state-of-the-art approach. In this way, our research contributes to the study and analysis of sensory datasets from the perspective of graph summarization. We also present a detailed study on the Bio–Mouse–Gene dataset, conducted to investigate the effectiveness of graph summarization in the domain of biosensors.

Keywords:

sensors datasets; Bio–Mouse–Gene; data visualization; big data; data mining; graph summarization; weighted LSH; correction sets

1. Introduction

A graph, consisting of vertices and edges, depicts the insights of biological networks [1]; this can be useful in various applications, such as in the online health community (for summarizing data on certain diseases, like diabetes [2]), in hyperlink networks [3], in social networks [4], in cooperation networks [5], in citation networks [6], in road networks [7], in shared purchasing networks [8], in producing dependency graphs for biomedical relation extraction [9], and in the internet of medical things [10], among others. Growth in the use of such networks has led to increased research interest in the underlying network science and its analytics. The use of graphs and other relevant datasets is also of great interest to researchers in the fields of graph mining [11,12], neural networks [13,14,15,16,17], graph neural networks [18,19,20,21,22], deep learning [23,24,25,26,27], data mining [28,29], and machine learning [30,31]. Researchers around the globe are using these datasets to gain insights into complex systems, allowing them to make informed decisions [32]. Biosensing is a state-of-the-art area of research that involves studies on the text mining of health documents [33], chemical sensing [34], medical imaging [35,36], brain age prediction [23], food safety [37], and biosensors [38], among others. Exploring these domains by modeling their datasets as graphs is also a valuable research avenue [23,39,40,41,42]. A graph or set of graphs can be used to represent some of the aforementioned aspects; these graphs comprise large numbers of nodes and edges, and they continuously expand at a remarkable pace. These real-world graphs are too large to fit in main memory; yet, answering complex queries on them in real time requires them to be readily available [43]. Therefore, it is vital to represent them in a concise manner which is efficient and scalable [44,45,46].

Compressing graphs into a compact form is useful when working with them in various scenarios such as storing, processing, querying, and visualizing. In this situation, a potential solution is graph summarization, which generates a more compact representation, called a summary graph, of the input graph. A summary graph reduces the footprint of the original graph and facilitates efficient query answering and insightful data visualization. There have been a number of research studies for graph summarization [47]. The group-based approach is one of the most popular methods for graph summarization [45,48]. This group-based approach takes a simple, undirected graph as input and produces a summary graph and a correction set. The objective of summarization in this line of work is to minimize the size of the summary graph and correction sets while preserving all information from the original graph. The summary contains the member nodes of the input graph, which are merged into super nodes based on certain criteria, and correction sets are used to reconstruct the original graph. SWeG is a correction-set-based graph summarization algorithm that has been presented recently [44]. It is fast, yields high compression, and can run in a distributed setting as well. It adds a dividing step that splits nodes into smaller groups before merging them, making the algorithm more efficient and parallelizable, as shown in Figure 1a. It introduces an approximation metric for identifying nodes to merge. Overall, SWeG aims to improve the performance and accuracy of the graph summarization process. However, there are performance bottlenecks in the merging and encoding algorithms that affect their efficiency for larger graphs. The merging phase is an issue because some groups can be extremely large, leading to longer running times. Additionally, the approximation method used for selecting similar nodes for grouping can result in lower compression rates. 
Some of the inefficiencies of SWeG [44] are addressed by a state-of-the-art algorithm named LDME (Locality-sensitive hashing Divide Merge and Encode) [45]. It proposes weighted locality-sensitive hashing, can handle large datasets on a single machine, and can balance compression against running time. It achieves a significant speedup with compression similar to or better than that of SWeG, and it can reach up to two orders of magnitude of acceleration at the cost of some reduction in compression.

An important point to note about the above-mentioned approaches is that they are general graph summarization methods, i.e., they merge nodes without considering the importance of certain highlighted nodes in the network [5,46]. For instance, certain member nodes of a graph have distinct levels of engagement with, or inclination towards, particular elements or features of the graph. This underscores the significance of designing visual representations that are customized to the needs of the intended viewers and that successfully communicate the intended message. Consider the following scenarios: social media users are more interested in the connections of their close acquaintances than in those of strangers; travelers prioritize the roads in their vicinity over those further away; researchers are more interested in the papers related to their field than in those from other fields. This highlights the importance of tailoring network visualizations to suit the specific interests and needs of a given target audience. In this regard, personalized graph summarization (PGS) is an effective approach [43] that takes individual preferences into account during summarization. In PGS, given a large graph and a target node, the objective is to obtain a summary that merges the rest of the nodes while taking the target nodes into account. This algorithm ensures that the resulting summary accurately reflects the preferences and requirements of the target node, as shown in Figure 1b. However, one drawback of PGS is that it is not efficient when applied to large graph datasets. Our motivation is that the graph data from the aforementioned domains contain useful relationship patterns but are so large that mining them for knowledge discovery and visualizing them is difficult. This issue is non-trivial and of high value when graph data represent interactions among living beings, as in data from the fields of healthcare, disease research, and bioengineering, among others.

In this regard, we present a new algorithm that produces a summary graph with a compression ratio similar to that of the state-of-the-art algorithm (PGS [43]), but which is more efficient when applied to large graph datasets. In this way, we aim to provide faster data mining and analysis techniques for experts in this field to support them in studying the problems at hand from a different perspective. We name our algorithm IPGS and propose the concept of weighted locality-sensitive hashing (LSH) for the personalized summarization of an input graph. The use of weighted LSH enhances the efficiency of the algorithm, particularly for handling high-dimensional data, where LSH has proven to be more effective. Our algorithm can handle large graph datasets effectively on a single machine. Finally, the proposed approach is lossless, since we maintain a list of correction sets which are also computed in an optimized manner. We present the effectiveness analysis and performance comparisons of our approach on eight real-world datasets, including the Bio–Mouse–Gene dataset, which has 43.1 K nodes and 14.5 M edges, and obtain better execution times than PGS [43]. We also perform experiments to evaluate the compression achieved in comparison with the current state-of-the-art algorithm for non-personalized graph summarization, LDME [45], which is highly scalable but was found to provide less compression. To make our research widely usable, we release the implementation of our proposed approach, IPGS, along with the implementations of PGS [43] and LDME [45], at https://github.com/jalal-gilgiti/IPGS (accessed on 9 July 2024). We summarize the contributions of our paper below.

Given the large sizes of graph datasets, we have proposed an efficient algorithm named IPGS for graph summarization. The algorithm models locality-sensitive hashing to locate similar nodes for compression. The proposed algorithm produces a similar compression ratio as that of a state-of-the-art algorithm but is less time-consuming.
The proposed algorithm provides a lossless summary graph through the concept of a correction set. This is beneficial since we can always reconstruct the original one or use the correction set for querying the result with 100% accuracy.
We performed detailed experimental evaluation on eight real-world and publicly available datasets and provided insightful results by comparing with two state-of-the-art approaches. We also present a detailed study on the Bio–Mouse–Gene dataset to demonstrate the usefulness of our approach and the concept of graph summarization in the domain of biosensors.

2. Literature Review

In this section, we review the existing studies addressing the different topics of biosensors and graph summarization.

2.1. Review of Knowledge Discovery Techniques in Biosensors and Multidisciplinary Domains

In this section, we review various studies in the disciplines of biosensors, bioengineering, and other relevant fields; this is because data mining and knowledge discovery in the fields of health informatics, biosensors, and cross-domain research is presently one of the most active areas of research. In particular, researchers from numerous areas of computer science and artificial intelligence have investigated these areas from the perspective of their own expertise. In this regard, we witness researchers in data mining [49,50,51], machine learning [52,53], pattern mining [54], data compression [55,56], decision support [57,58,59,60], and visualization [61,62,63] producing insightful knowledge and actionable information.

The contribution of decision-support systems to the field of bioengineering [57,59,60] is of massive value. These systems serve as the backbone of a one-window platform for various types of data storage, information retrieval, knowledge discovery, inference, prediction, and analytic purposes. In this regard, the authors of [60] presented a feature selection-based prediction model for dental care. They used an ensemble of decision trees as the core machine learning model for the task and obtained significantly higher classification performance. Similarly, the authors of [59] present a decision-support system for glaucoma treatment. The dataset included details of demographics, a history of systemic conditions, medication history, ophthalmic measurements, 24-2 VF results, and thickness measurements from OCT imaging, involving around 900 patients. They applied several machine learning algorithms to the data obtained from independent and geographically separated populations and obtained very promising results. On a similar note, the authors of [57] developed a decision-support system to facilitate the prediction of COVID-19 diagnosis; this used the clinical, demographic, and blood marker data of a given patient. They collected the dataset from a hospital in India and applied machine learning and deep learning algorithms for classification purposes. One of the notable contributions of their work is their focus on explainable AI; this means that the end users of the system are able to understand the type of results they obtain and why they should believe the results. Estimation of crop yield [64], prediction of sick leave [65], image compression [66], and rice plant disease classification [67] are further examples of the versatile applications of machine learning techniques.

Graph neural networks (GNNs) comprise another useful approach in advanced machine learning. Researchers have utilized them for various applications, like urban region planning [18], summarizing vast amounts of text data [19], point-of-interest recommendation [20], human activity recognition [21], and music recommendation [22]. In [18], the authors made use of random forests with CNNs for the purpose of urban planning by modeling the data in the form of a graph. The authors in [19] used GNNs for text summarization. Using the concept of graphs, they are able to model the relationship between words, accurately extract feature information, and eliminate redundant information. Similarly, the authors in [20] used the same technique to find suitable points of interest for people; this can help in providing appropriate customer matches for merchants. Music recommendation [22] and human action recognition in healthcare [21] are further areas of research focus.

Studying the behavior of large-scale biological networks for pattern discovery is of key importance. In [54], the authors introduce an innovative model for the degree distribution of nodes in the network. Normally, the degree distribution of numerous real-world networks/graphs follows a power law. In this regard, the contribution of this research is significant: the authors provide a versatile distribution model that offers new insights. The authors of [37] review various research studies on the topic of food safety in the context of biosensing. They studied various analytes, like glucose, gluten, gliadin, atrazine, domoic acid, arsenic, and various others, to address the control of food quality and safety; the aforementioned types of data are studied in detail. Analyzing them by modeling them as ontologies is also worthwhile. An ontology, for instance for gene data [39], represents a comprehensive view of the underlying data whose inference provides useful insights. Similarly, using wearable sensors in the research and development of biosensing information systems is of value [38]. Such a system provides a multidimensional view of the collected data for the betterment of healthcare.

Considering the aforementioned brief review of the various studies from the biosensors and bioengineering domain, we find that a number of researchers have explored machine learning and AI techniques for problem solving purposes in remote sensing [38,68,69,70,71,72] and image processing [73,74,75,76,77]—among other versatile areas [78,79,80,81,82] and multidisciplinary fields [83,84,85,86,87]—and have contributed significantly.

2.2. Review of Research on Graph Summarization

Graph summarization is a widely explored research domain encompassing diverse methodologies for effectively summarizing graph datasets [47,88,89,90,91,92]. The purpose of all of these studies is to reduce the size of the input graph so that it can be effectively and efficiently mined in the pursuit of knowledge discovery and visualization. To perform summarization, the existing methodologies include both group-based and non-group-based approaches. Among them, the group-based approach is particularly prominent. This can be further classified into cohesive correction-set-based and non-correction-set-based approaches. Notably, correction-set-based approaches have garnered greater attention, owing to their remarkable compression and summarization outcomes. Consequently, we have chosen to employ a correction-set-based approach as the foundation for our study.

The correction set approach has led to the development of various algorithms for graph summarization, including VOG [93], Mosso [94,95], DGPS, SSumM [32], SWeG [44], PGS [43], SAGS [48], and LDME [45], among others. These algorithms use different methodologies in various domains to summarize graphs. VOG [93] is a lossless graph summarization technique that determines whether large graphs consist of various sub-graphs, such as cliques, stars, and chains. Each sub-graph type contains distinct information and has a significant impact on the entire graph. It is crucial to understand the information contained within sub-graphs and measure them based on their importance for decision making. The above-mentioned study addressed the following vital question: how can we measure the significance of sub-graphs within large graphs? SSumM is a similar lossless summarization algorithm that produces a sparse summary graph [32]; it uses the minimum description length (MDL) principle, as does [96], which is a pioneering work in this field. SSumM identifies important structures within large graphs and develops efficient methods for their summarization and visualization. The authors of [94] present lossless incremental summarization to preserve the information of the dynamic changes that have been made to the graph, such as the addition or deletion of edges. SAGS [48] is a similar correction-set-based approach to the summarization of large graphs. It models LSH [97] to locate sets of similar nodes for compression. The non-mergeable nodes in a given iteration in a located set are pruned out based on their dissimilarity from the rest of the nodes. The algorithm proposed in [95] also makes use of the degree of the nodes during the summarization process. It aims to preserve the degree of each node in the summarized graph for better graph processing, storage, and analytics.

SWeG [44] is a useful correction-set-based algorithm for graph summarization that consists of three steps: merging nodes to super nodes, encoding edges to super edges, and dropping edges for compact graph representation. It provides better compression than previous algorithms and improves the existing frameworks by adding a dividing step before merging the nodes; this divides them into disjoint groups for parallel processing. Additionally, it introduces an approximation metric to achieve the best match for merging. However, the performance of SWeG is impacted by certain steps. For instance, the merging algorithm has a quadratic running time due to the identification of disjoint groups, which can affect its speed. The authors of [98] leverage the MDL principle to provide intuitive, coarse-level summaries of input graphs while effectively managing the errors. Additionally, there have been efforts to refine existing techniques to enhance their performance, particularly concerning densification procedures [45]. Improvements in densification aim to address the issues that are related to randomness and accuracy, particularly in sparse datasets, which are common on the web. Through theoretical analysis and experimental evaluations, these enhancements demonstrate their superiority over previous schemes, particularly for very sparse datasets.

On the other hand, it may be noted that the aforementioned correction-set-based approaches primarily perform non-personalized graph summarization. With the escalating size of data, people are presently displaying greater interest in extracting relevant information from big data. Taking user preferences into account, researchers have developed personalized graph summarization algorithms which aim to summarize a large graph from the point of view of the input/target nodes [5,43,46]. VEGAS [5] stands out as one of the pioneering algorithms for personalized graph summarization; it is specifically designed for, and solely focuses on, citation networks. Another state-of-the-art algorithm in this domain is personalized graph summarization (PGS) [43], which employs greedy search techniques. Finally, the algorithm in [46] also proposes an efficient, weighted LSH-based algorithm for personalized graph summarization; thus, it is unlike VEGAS [5], which is quite effective but is very slow when applied to large graphs.

3. Problem Statement

In this age of advanced technology, large datasets from various disciplines, such as data on brain signals, medical topics, vital signs, medical text, biomedical signals, sensors, and social networks, are available for research and innovation purposes. In this context, our goal in this research is to efficiently generate a summary of a large input graph, so that meaningful analysis can be performed in memory and more effectively. Formally, we take a dataset modeled in the form of a graph, $G$, having vertices, $V$, representing entities from a corresponding domain, and edges, $E$, showing interactions among the entities. Taking this, we aim to develop a scalable algorithm to summarize $G$ into a compact representation, a summary graph $G'$, where those vertices that have similar properties are merged into super nodes, $V'$, and their corresponding edges are merged into super edges, $E'$. In particular, we want an efficient algorithm where $G'$ is obtained from the point of view of the user-provided target node(s), in order to obtain a personalized summary graph. Our $G'$ is lossless since we maintain a correction set, $C^{+}$, $C^{-}$, where $C^{+}$ is a list of edges which are removed while merging certain vertices and $C^{-}$ is a list of the edges that are added during aggregation. In this way, we aim to generate a compact $G'$, with minimized $V' + E' + C^{+} + C^{-}$, based on the MDL principle [96].
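The lossless property described above can be illustrated with a minimal Python sketch that rebuilds the original edge set from a summary graph and its correction sets. The function name `reconstruct` and the data layout (a dictionary of super nodes, super edges as pairs of super node ids) are assumptions made for illustration, not the layout used in the released implementation.

```python
def reconstruct(super_nodes, super_edges, c_plus, c_minus):
    """Rebuild the original undirected edge set.

    super_nodes: dict mapping a super node id to the set of its member vertices
    super_edges: set of (A, B) pairs of super node ids; A == B denotes a self-loop
    c_plus:      edges lost during merging, which must be added back
    c_minus:     edges introduced by aggregation, which must be removed
    Edges are represented as frozensets of two vertices.
    """
    edges = set()
    for a, b in super_edges:
        # A super edge stands for every possible edge between the two member sets.
        for u in super_nodes[a]:
            for v in super_nodes[b]:
                if u != v:
                    edges.add(frozenset((u, v)))
    return (edges | c_plus) - c_minus


if __name__ == "__main__":
    nodes = {"S": {"a", "b"}, "T": {"c", "d"}}
    restored = reconstruct(
        nodes,
        {("S", "T")},
        c_plus={frozenset(("a", "b"))},
        c_minus={frozenset(("b", "d"))},
    )
    print(sorted(sorted(e) for e in restored))
```

In this toy case one super edge and two corrections stand in for four original edges; the saving grows as more vertices share the same neighborhood.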

In this regard, PGS [43] provides an algorithm for personalized graph summarization which produces a highly compressed summary; however, it is not scalable when it is applied to large graph datasets. Our aim in this research is to improve its execution time while providing a similar compression ratio. By incorporating this improvement, our new algorithm, IPGS, is highly scalable, and it provides an accurate and comprehensive summary; thus, it meets the diverse needs and requirements of users.

4. The Proposed Algorithm, IPGS

In this section, we present our proposed approach in detail. Our approach is a lossless summarization due to the correction set attached to the summary graph; so, we first present the steps for merging the nodes of a graph G while maintaining a list of corrections. We then explain the internal details of IPGS, followed by the formal algorithm for the summary generation.

4.1. Correction-Set-Based Approach for Grouping-Oriented Summarization

We take an undirected graph $G$ as input, having vertices $V$ and edges $E$, as shown in Figure 2a. In this illustration, the algorithm iterates four times. As a first step, each individual node is treated as a super node. The algorithm then updates the super nodes by merging nodes in each iteration based on the maximum saving produced by the merger. The merging process reduces the sum of the super edges $E'$ and the positive $C^{+}$ and negative $C^{-}$ edge corrections, denoted as $E' + C^{+} + C^{-}$. The formula to calculate the saving obtained from merging two nodes, 1 and 2, i.e., A and B in Figure 2a, is shown in Equation (1).

$$\mathrm{Saving} = 1 - \frac{\mathrm{Cost}(1 \cup 2)}{\mathrm{Cost}(1, S) + \mathrm{Cost}(2, S)}$$

(1)

where Cost(1, S) and Cost(2, S) are the contributions of nodes 1 and 2 to $E' + C^{+} + C^{-}$. In this way, the merging is repeated for a certain number of iterations by randomly selecting a super node and then finding the node whose merger with it provides the highest saving according to Equation (1). This process is repeated until all of the super nodes have been processed.
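As a minimal sketch of Equation (1), the Python code below computes the saving for a candidate merger. The text does not spell out Cost, so the sketch assumes that the cost of a pair of super nodes is the cheaper of two encodings: listing the original edges individually, or using one super edge plus a correction for every missing edge. The helper names (`count_edges`, `possible_edges`, `pair_cost`, `cost`, `saving`) are hypothetical.

```python
def count_edges(adj, a, b):
    """Number of original edges between member sets a and b (a == b: inside a)."""
    if a == b:
        return sum(1 for u in a for v in adj[u] if v in a) // 2
    return sum(1 for u in a for v in adj[u] if v in b)


def possible_edges(a, b):
    """Number of vertex pairs that an edge between a and b could connect."""
    if a == b:
        return len(a) * (len(a) - 1) // 2
    return len(a) * len(b)


def pair_cost(adj, a, b):
    # Assumed cost model: list the e edges, or one super edge plus the missing ones.
    e = count_edges(adj, a, b)
    return min(e, possible_edges(a, b) - e + 1)


def cost(adj, a, super_nodes):
    """Cost(A, S): contribution of super node A, including its internal edges."""
    return sum(pair_cost(adj, a, b) for b in super_nodes)


def saving(adj, a, b, super_nodes):
    """Equation (1): relative cost reduction obtained by merging a and b."""
    merged = a | b
    others = [s for s in super_nodes if s != a and s != b]
    merged_cost = sum(pair_cost(adj, merged, s) for s in others)
    merged_cost += pair_cost(adj, merged, merged)
    denom = cost(adj, a, super_nodes) + cost(adj, b, super_nodes)
    return 1 - merged_cost / denom if denom else 0.0


if __name__ == "__main__":
    # Nodes 1 and 2 share the same neighbors, so merging them pays off most.
    adj = {1: {3, 4}, 2: {3, 4}, 3: {1, 2}, 4: {1, 2}}
    S = [frozenset({v}) for v in adj]
    print(saving(adj, S[0], S[1], S))  # nodes 1 and 2
    print(saving(adj, S[0], S[2], S))  # nodes 1 and 3
```

Under this cost model, merging two nodes with identical neighborhoods yields a larger saving than merging two adjacent nodes with different neighborhoods, which matches the intuition behind grouping structurally similar nodes.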

The original edges $E$ from $G$ are then encoded into the super edges and the correction sets. During the encoding of the edges, we encounter two different sets of edges, i.e., the original edges between the super nodes and the total number of possible edges between them. The number of original edges is represented by $E_{AB}$ and the number of possible edges is represented by $F_{AB}$. To perform edge encoding for a pair of super nodes, A and B, if the number of original edges is less than or equal to half of the possible edges between them, i.e., $E_{AB} \le F_{AB}/2$, then one does not encode a super edge and instead adds the original edges to $C^{+}$. On the other hand, if the number of original edges is greater than half of the possible edges between them, then one should encode the super edge and add the extraneous edges (those implied by the super edge but absent from the original graph) to $C^{-}$. Additionally, the two super nodes, A and B, are further merged as a new super node.
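The encoding rule for one pair of super nodes can be sketched in Python as follows. The function name `encode_pair` and the representation of edges as frozensets are assumptions for illustration; only the half-of-possible-edges threshold comes from the description above.

```python
def encode_pair(edges_ab, members_a, members_b):
    """Encode the edges between two distinct super nodes A and B.

    edges_ab: set of original edges (frozensets {u, v}) with u in A and v in B
    Returns (has_super_edge, c_plus, c_minus) for this pair.
    """
    possible = {frozenset((u, v)) for u in members_a for v in members_b}
    if len(edges_ab) <= len(possible) / 2:
        # Sparse pair: no super edge, keep each original edge as a positive correction.
        return False, set(edges_ab), set()
    # Dense pair: one super edge, and each missing edge becomes a negative correction.
    return True, set(), possible - set(edges_ab)


if __name__ == "__main__":
    A, B = {"a", "b"}, {"c", "d"}
    dense = {frozenset(p) for p in [("a", "c"), ("a", "d"), ("b", "c")]}
    print(encode_pair(dense, A, B))
```

With three of the four possible edges present, the pair is stored as one super edge plus a single negative correction for the missing edge, instead of three separate edges.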

4.2. Weighted LSH for IPGS

The previous section explains the steps involved in merging nodes to produce a general-purpose summary. One bottleneck is how to efficiently identify the nodes providing the highest savings in each iteration while preserving the personalization aspect during the summarization process. To solve this problem, we model LSH to ensure that similar nodes are grouped together. LSH speeds up node identification by approximating the Jaccard similarity among the nodes and hashing similar nodes into groups. It does so by applying a hash function, or a set of hash functions, to the neighborhood structure of each node, which places similar nodes in the same buckets.

We now illustrate the concept of weighted Jaccard similarity, denoted as $J_{w}(A, B)$, applied to the neighborhoods of two nodes A and B. Their neighborhoods are represented as vectors of equal length comprising integer weights, as demonstrated in Figure 3. The weighted Jaccard similarity is defined in Equation (2):

$$J_{w}(A, B) = \frac{\sum_{v} \min(A_{v}, B_{v})}{\sum_{v} \max(A_{v}, B_{v})}$$

(2)
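Equation (2) translates directly into a few lines of Python. The function name `weighted_jaccard` and the example vectors are illustrative assumptions; they are not taken from Figure 3.

```python
def weighted_jaccard(a, b):
    """Weighted Jaccard similarity of two equal-length, non-negative weight vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    # Two all-zero vectors have no neighbors in common; treat them as dissimilar.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    # Element-wise minima sum to 4 and maxima sum to 7, so the similarity is 4/7.
    print(weighted_jaccard([2, 0, 1, 3], [1, 1, 0, 3]))
```

When all weights are 0 or 1, this reduces to the ordinary Jaccard similarity of the two neighbor sets.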

To model LSH for weighted Jaccard similarity, we use the concept of densified one permutation hashing (DOPH) [99]. To generate a hash signature for a node in DOPH, we start with a binarized vector, denoted as I. We then shuffle the elements of vector I using a chosen hash function to create a permuted vector. The next step is to determine the desired length for the hash signature and divide the permuted vector into equal bins based on this length. From each bin, the first non-zero value is selected. If a bin contains no non-zero values, a value is taken from either the left or the right neighboring bin. Finally, the resulting hash signature is returned. We illustrate this process in Figures 4–9.
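The signature steps listed above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: a seeded shuffle stands in for the chosen hash function, the signature entry for a bin is the offset of its first non-zero element, and an empty bin copies the value of the nearest non-empty bin to its right (wrapping around) without any offset. The function name `doph_signature` is hypothetical.

```python
import random


def doph_signature(binary_vec, k, seed=0):
    """Return a length-k signature for a 0/1 vector whose length is a multiple of k."""
    n = len(binary_vec)
    if k <= 0 or n % k != 0:
        raise ValueError("vector length must be a positive multiple of k")

    # Step 1: permute the vector (a seeded shuffle plays the role of the hash function).
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    permuted = [binary_vec[p] for p in perm]

    # Step 2: split into k equal bins and record the offset of the first non-zero entry.
    bin_size = n // k
    sig = [None] * k
    for b in range(k):
        chunk = permuted[b * bin_size:(b + 1) * bin_size]
        for offset, bit in enumerate(chunk):
            if bit:
                sig[b] = offset
                break

    # An all-zero vector has nothing to borrow from; return the empty signature.
    if all(s is None for s in sig):
        return sig

    # Step 3 (densification): an empty bin borrows from the next non-empty bin.
    dense = list(sig)
    for b in range(k):
        j = b
        while sig[j] is None:
            j = (j + 1) % k
        dense[b] = sig[j]
    return dense


if __name__ == "__main__":
    v = [1, 0, 1, 0, 0, 1, 0, 0]
    print(doph_signature(v, 4, seed=7))
```

Nodes whose signatures agree, fully or in a chosen band of positions, fall into the same bucket and become merge candidates. All nodes must be hashed with the same seed for their signatures to be comparable.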

To utilize the concept of personalization, we aim to preserve the influential impact and flow patterns within the summarized graph, ensuring that the essential dynamics of aggregation with respect to the target node T are retained. This ensures that the resulting summary effectively captures and preserves the influential dynamics of the original data. For illustration purposes, let us consider the weighted variant of our input sample graph to generate a summarization comprising k clusters and l flows, while maximizing an objective function that incorporates cumulative flow rates, as shown in Equation (3).

$$\max \left( \sum_{s=1}^{l} r(\xi_{s}) \right)$$

(3)

In this example, we have nodes with weights assigned to their edges. To perform personalized graph summarization based on flow rate maximization, we take the weighted edges into account, while identifying the influential flow patterns within the graph. We consider both the strength of the connections and the flow of influence or information between nodes. In the given weighted graph in Figure 10, let us assess the flow rates considering node A as the target node. The various flow rates are as follows: 5 from A to E, 4 from A to G, 9 from A to B through G, and 12 from A to F through G and B. These flow rates indicate the influence of node A on the flow of information or interactions within the graph. Based on the flow rates and connections, we can summarize the graph to represent the essential pathways of influence from node A. The summarized graph includes node A, along with the merged nodes that receive the highest flow rates directly from node A and through other nodes as the immediate neighbors.

In the summarized graph in Figure 10, we capture the most influential pathways from node A to nodes E and G, as well as to B and F, with moderate flow. This representation preserves the most significant connections and flow rates originating from node A while providing an overview of the impact of node A on the graph. Note that the summarized graph for the impact of node A may not include all nodes and edges of the original graph; it specifically highlights the connections that contribute most to the impact of node A on the overall graph dynamics.
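The flow rates quoted for target node A are consistent with accumulating edge weights along each path that starts at A. The Python sketch below illustrates that reading only. Figure 10 is not reproduced in the text, so the edge weights for G–B (5) and B–F (3) are inferred from the quoted cumulative values, and the helper names `path_flow_rates` and `top_l_objective` are hypothetical. Picking the l largest flows is just one simple way to instantiate the objective in Equation (3), not necessarily the selection rule used by IPGS.

```python
def path_flow_rates(edges, target):
    """Cumulative edge weight from `target` to each reachable node.

    Assumes the sample graph is tree-shaped, so every node has a single
    path from the target (hypothetical helper, not the paper's code).
    """
    adj = {}
    for (u, v), w in edges.items():
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    rates = {}
    seen = {target}
    stack = [(target, 0)]
    while stack:
        node, acc = stack.pop()
        for nxt, w in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                rates[nxt] = acc + w
                stack.append((nxt, acc + w))
    return rates


def top_l_objective(rates, l):
    """Sum of the l largest flow rates, one reading of Equation (3)."""
    return sum(sorted(rates.values(), reverse=True)[:l])


# Edge weights: A-E and A-G are quoted in the text; G-B and B-F are inferred.
sample_edges = {("A", "E"): 5, ("A", "G"): 4, ("G", "B"): 5, ("B", "F"): 3}
sample_rates = path_flow_rates(sample_edges, "A")
```

With these assumed weights, `sample_rates` reproduces the values in the text: 5 for E, 4 for G, 9 for B, and 12 for F.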

4.3. Formal Algorithms of the Proposed IPGS

Our algorithm takes as input a graph, a target node or set of target nodes, and the maximum number of iterations to perform. The goal is to generate a summary graph that preserves the influence or impact of the target node(s) while summarizing the original graph. The summary graph consists of super nodes V′ and super edges E′. The algorithm begins by initializing each node in the input graph as a super node. Then, in each iteration, LSH signatures are generated for the super nodes, which are grouped into candidate groups based on the similarity of their signatures. Within each group, merges are performed to combine nodes, and edges are encoded as super edges. Additionally, personalized error or correction sets are calculated to assess the impact of the summarization on the target nodes. After the specified number of iterations, the algorithm returns the summary graph together with the correction sets, which represent the summary error or any corrections made during the summarization process. The flow rate maximization step is integrated within the main summarization loop: it calculates flow rates, identifies influential flow patterns, and updates the summary graph based on the identified patterns. We present both variants of the proposed approach, i.e., the pseudocode for summarization without and with a focus on preserving the personalization aspects, in Algorithms 1 and 2, respectively. The overall architecture of the algorithms is shown as a flow diagram in Figure 11.

Algorithm 1: IPGS without considering the personalization aspect.

Algorithm 2: IPGS while considering the personalization aspect.
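Since the algorithm listings are not reproduced in the text, the following Python sketch illustrates the general shape of the loop described above: every node starts as its own super node, candidate groups are formed from signatures, similar super nodes are merged, and edges are encoded as super edges plus correction sets. It is a simplified stand-in and not the authors' Java implementation. A min-hash over one random permutation replaces DOPH, the Jaccard threshold and the majority rule for creating super edges are assumptions, and the names `summarize`, `expand`, and `jaccard` are hypothetical. Personalization with respect to target nodes and the flow rate maximization step are not modeled.

```python
import random
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def expand(key, members):
    """All node pairs that a super edge (frozenset of one or two ids) stands for."""
    ids = sorted(key)
    if len(ids) == 1:
        return {frozenset(p) for p in combinations(members[ids[0]], 2)}
    return {frozenset((x, y)) for x in members[ids[0]] for y in members[ids[1]]}


def summarize(adj, iterations=5, threshold=0.5, seed=0):
    """adj maps each node to the set of its neighbours (undirected, no self-loops)."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    members = {v: {v} for v in nodes}  # super node id -> original nodes it contains

    def neighbourhood(sid):
        return set().union(*(adj[v] for v in members[sid]))

    for _ in range(iterations):
        # Stand-in for the LSH step: a min-hash under one random permutation.
        rank = {v: r for r, v in enumerate(rng.sample(nodes, len(nodes)))}
        groups = {}
        for sid in members:
            sig = min((rank[u] for u in neighbourhood(sid)), default=-1)
            groups.setdefault(sig, []).append(sid)
        # Merge sufficiently similar super nodes inside each candidate group.
        for group in groups.values():
            for a, b in combinations(sorted(group), 2):
                if a not in members or b not in members:
                    continue  # one of the two was already merged away
                if jaccard(neighbourhood(a), neighbourhood(b)) >= threshold:
                    members[a] |= members.pop(b)

    # Encode edges: a super edge is kept when most of the pairs it implies exist.
    owner = {v: sid for sid, vs in members.items() for v in vs}
    original = {frozenset((u, v)) for u in adj for v in adj[u]}
    actual = {}
    for edge in original:
        actual.setdefault(frozenset(owner[v] for v in edge), set()).add(edge)

    super_edges, c_plus, c_minus = set(), set(), set()
    for key, present in actual.items():
        possible = expand(key, members)
        if len(present) > len(possible) / 2:
            super_edges.add(key)
            c_minus |= possible - present  # implied pairs that do not exist
        else:
            c_plus |= present              # edges stored explicitly
    return members, super_edges, c_plus, c_minus
```

In this sketch the original edge set can be recovered by expanding every super edge, removing the pairs in `c_minus`, and adding the edges in `c_plus`, which is why the correction sets are returned together with the summary.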

5. Experimental Evaluation

In this section, we present an experimental evaluation of our proposed algorithm, IPGS. We implemented the algorithms in Java, and the experiments were performed on a PC with 16 GB RAM, a 250 GB SSD, and a 2.20 GHz processor. The experiments compare the execution time and compression ratio of the algorithms. We also present a detailed visualization of the Bio–Mouse–Gene dataset as a case study.

5.1. Data Availability

The experiments are performed on eight publicly available datasets, as listed below:

Bio–Mouse–Gene. Nodes: 43.1 K; edges: 14.5 M. https://networkrepository.com/bio-mouse-gene.php (accessed on 9 July 2024).
Cnr2000. Nodes: 325,557; edges: 5,565,380. https://networkrepository.com/cnr-2000.php (accessed on 9 July 2024).
LastFM-Asia. Nodes: 7624; edges: 27,806. https://snap.stanford.edu/data/feather-lastfm-social.html (accessed on 9 July 2024).
Caida. Nodes: 26,475; edges: 53,381. https://snap.stanford.edu/data/as-Caida.html (accessed on 9 July 2024).
DBLP. Nodes: 317,080; edges: 1,049,866. https://snap.stanford.edu/data/com-DBLP.html (accessed on 9 July 2024).
Skitter. Nodes: 1,694,616; edges: 11,094,209. https://snap.stanford.edu/data/as-Skitter.html (accessed on 9 July 2024).
Amazon. Nodes: 403,394; edges: 103,310,688. https://snap.stanford.edu/data/amazon0601.html (accessed on 9 July 2024).
Citation-Patent. Nodes: 4 M; edges: 17 M. https://snap.stanford.edu/data/cit-Patents.html (accessed on 9 July 2024).

5.2. Exploring Mouse Gene Dataset through Visualization

In this section, we show the effectiveness of our proposed algorithm, IPGS, for the visualization of summarized graphs, using the Bio–Mouse–Gene dataset [40,41]. This dataset is very large, with 43.1 K nodes and 14.5 M edges; hence, a visualization of the entire input graph is highly cluttered, as can be seen in Figure 12a. We have taken this visualization from the main source of the dataset, i.e., https://networkrepository.com/bio-mouse-gene.php (accessed on 9 July 2024), for demonstration purposes. Because the full graph is too large, we took a chunk of it comprising 223 nodes and 997 edges, as shown in Figure 12b, and generated its summary graph for a target node in Figure 12c. The target node is highlighted by a red circle. In this kind of visualization, we can inspect the target node’s impact on and relationship to its neighborhood and to the rest of the summary graph, i.e., the tightly bonded sets of nodes that are merged with each other for a given target node. The summary graphs in the aforementioned figures are still dense and show visual clutter, so we took a smaller chunk of the dataset and visualized it in Figure 12d. This smaller chunk has 108 nodes and 110 edges; its summary graph is shown in Figure 12e. We took an even smaller chunk, comprising 12 nodes and 29 edges and shown in Figure 12f, to demonstrate a summary graph of a dataset of this smaller size. Figure 12g shows how the target node is connected to the others in each visualization.

We use the Bio–Mouse–Gene dataset, which indicates which genes are connected to each other and how they relate to the study of human diseases. For a target node depicting a certain type of gene, the connected super nodes in the summary graph therefore provide useful insights. This knowledge is of particular interest for exploring new types of group-based interactions for the discovery of certain human diseases. We understand that a reader cannot perform interactive analytics of the summary graphs presented in these figures because of the static nature of the images. However, using the code and implementation shared by us, readers can run the algorithm to generate the summary graphs themselves and analyze them interactively in a recent version of any standard graph visualization software, such as Gephi [63] or Cytoscape [1].

5.3. Comparison of Execution Time

We performed all the evaluations on a single-threaded machine; this demonstrates that neither the existing approaches nor the proposed approach requires much memory. We do not include LDME in the execution time comparison because it is significantly faster than both our proposed IPGS algorithm and the state-of-the-art PGS algorithm. Moreover, LDME is used for general-purpose grouping-based summarization, whereas PGS and IPGS perform personalized graph summarization. This is one of the reasons that LDME is much faster: it is free from the complexity involved in identifying and arranging the nodes for personalized compression based upon the influence of the target node.

Figure 13 presents the execution times obtained by running PGS and IPGS on complete batch sizes. IPGS runs faster than PGS in all cases. The difference becomes clearer on the Citation network dataset, which is significantly larger and hence serves the purpose of our proposal.

Figure 14 shows the execution time comparison for PGS and IPGS. In all of the experiments, IPGS achieves better performance than PGS. The Bio–Mouse–Gene, Skitter, Amazon, and Citation network datasets are much larger in size; yet, the execution time of both algorithms remains reasonable. The Citation network dataset is the largest dataset used in these comparisons, with 4 million nodes and 17 million edges. On it, PGS took 17 min to run, while IPGS completed the same task in 13 min. This trend persisted across all the other datasets as well.

For the execution time comparison, we analyzed PGS [43] and IPGS; in contrast, for the assessment of the compression ratio, we compare LDME [45], PGS [43], and IPGS.

5.4. Comparison of Compression Ratio

The experimental results for the comparison of the compression ratio are presented in Figure 15 and Figure 16. The compression ratio is obtained using Equation (4). The size of the original input graph is computed using Equation (5), based on the total number of edges (O_Edges) and the total number of nodes (O_Nodes) in the original graph: the number of edges is multiplied by 2 and by the base-2 logarithm of the number of nodes. Equation (6) gives the size of the graph after summarization, based on the number of super edges (S_Edges) and super nodes (S_Nodes) in the summary graph, as well as the number of nodes (O_Nodes) in the original graph. The number of super edges is multiplied by 2 and by the base-2 logarithm of the number of super nodes. Additionally, the contribution of the original nodes is accounted for by multiplying their count by the base-2 logarithm of the number of super nodes.

\text{Compression Ratio} = \frac{\text{Size of Summary Graph}}{\text{Size of Original Graph}}

(4)

\text{Size of Original Graph} = \text{O\_Edges} \times 2 \times \log_{2}(\text{O\_Nodes})

(5)

\text{Size of Summary Graph} = \text{S\_Edges} \times 2 \times \log_{2}(\text{S\_Nodes}) + \text{O\_Nodes} \times \log_{2}(\text{S\_Nodes})

(6)
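Equations (4)–(6) translate directly into a few lines of Python. The sketch below is only an illustration of the formulas: the function names are hypothetical, and the node and edge counts in the usage example are made-up values chosen so that the logarithms are whole numbers, not measurements from the experiments.

```python
from math import log2


def original_size(o_edges, o_nodes):
    """Equation (5): size of the original graph in bits."""
    return o_edges * 2 * log2(o_nodes)


def summary_size(s_edges, s_nodes, o_nodes):
    """Equation (6): super edges plus the node-to-super-node mapping."""
    return s_edges * 2 * log2(s_nodes) + o_nodes * log2(s_nodes)


def compression_ratio(s_edges, s_nodes, o_edges, o_nodes):
    """Equation (4): summary size divided by original size."""
    return summary_size(s_edges, s_nodes, o_nodes) / original_size(o_edges, o_nodes)


# Hypothetical example: 1024 nodes and 10,000 edges summarized into
# 256 super nodes and 3000 super edges.
example_ratio = compression_ratio(s_edges=3000, s_nodes=256, o_edges=10_000, o_nodes=1024)
```

For these illustrative inputs the original size is 10,000 × 2 × 10 = 200,000 and the summary size is 3000 × 2 × 8 + 1024 × 8 = 56,192, so the ratio is about 0.28; a smaller ratio means stronger compression.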

The results of our evaluation of the compression ratio are highly promising. For the largest network used for comparisons, i.e., the Citation network dataset, IPGS achieved a compression ratio of 0.4, the same as that achieved by PGS; in other words, both PGS and IPGS reduced the graph size by 60 percent. LDME, on the other hand, achieved a compression ratio of 0.7, i.e., a reduction of 30 percent, which is 30 percentage points less than PGS and IPGS. This substantial improvement over LDME demonstrates the efficacy of our proposed solution, which yields a more compact representation of the input graph data while preserving the aspect of personalization. This advancement has significant implications for applications that rely on the efficient storage and processing of graph data. The compression ratios of IPGS and PGS are almost the same because both methods follow a similar approach for node identification in the merging process.

6. Conclusions

Research into biosensors that aids the improvement of healthcare systems is highly important. We reviewed a number of studies exploring biosensor and bioengineering datasets from a variety of angles. One research direction in this field is the investigation of this wealth of data through graph summarization, a process that compresses a large input graph for efficient data mining and visualization. There are a number of general-purpose graph summarization techniques that produce a summary graph for an entire input graph without focusing on the impact or existence of certain influential nodes in a given dataset. In this research, however, we present a personalized graph summarization approach that extracts pertinent information from graph data and can thus be tailored to individual preferences. This method allows users to customize their analyses, leading to more focused and insightful outcomes. Our research introduces IPGS, a new algorithm that improves on the execution time of an existing state-of-the-art approach (PGS) while achieving a similar compression ratio. IPGS is particularly useful in bioengineering because the network structure of a particular entity (such as a gene or phenotype) can be analyzed efficiently. To ensure that our study is applicable in various domains, we treated scalability and efficiency as key considerations during the algorithm’s development; this allows it to handle various types of graph data on a single machine. Motivated by the high compression ratio achieved by PGS, which comes at the cost of a slower execution time, IPGS provides a robust and efficient solution for personalized graph summarization, catering to the needs of diverse applications and datasets.
Further research in this field may focus on exploring additional optimizations and extensions to enhance the algorithm’s capabilities and broaden its applicability across different domains.

Author Contributions

Conceptualization, S.J.H., B.A. and A.J.; methodology, S.J.H.; data curation, S.J.H. and H.L.; implementation, S.J.H. and A.J.; experiment design, A.A. and N.A.M.; evaluation and validation, S.J.H.; writing—original draft preparation, S.J.H.; writing—review and editing, A.J., H.L. and B.A.; supervision, A.J. and H.L.; funding acquisition, B.A., H.L. and N.A.M. All authors have read and agreed for the submission of the manuscript.

Funding

The APC was funded by the Open Access Initiative of the University of Bremen and the DFG via SuUB Bremen. This research was supported by the Deanship of Scientific Research at Najran University, under the Research Group Funding program grant code (NU/RG/SERC/13/30), and by the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R440), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Institutional Review Board Statement

Not applicable for this study considering it used published, peer-reviewed, and publicly available datasets.

Informed Consent Statement

Not applicable.

Acknowledgments

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2024R440), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Data Availability Statement

The datasets used in this work are publicly available and their details are provided in Section 5.1.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Smoot, M.; Ono, K.; Ruscheinski, J.; Wang, P.; Ideker, T. Cytoscape 2.8: New features for data integration and network visualization. Bioinformatics 2011, 27, 431–432.
Qian, X.; Zhou, Y.; Liao, B.; Xin, Z.; Xie, W.; Hu, C.; Luo, A. Named Entity Recognition of Diabetes Online Health Community Data Using Multiple Machine Learning Models. Bioengineering 2023, 10, 659.
Francesco, L.; Musciotto, F.; Montresor, A.; Battiston, F. Hyperlink communities in higher-order networks. J. Complex Netw. 2024, 12, cnae013.
Borgatti, S.; Everett, M.; Johnson, J.; Agneessens, F. Analyzing Social Networks; SAGE Publications Limited: New York, NY, USA, 2024.
Shi, L.; Tong, H.; Tang, J.; Lin, C. Vegas: Visual influence graph summarization on citation networks. IEEE Trans. Knowl. Data Eng. 2015, 27, 3417–3431.
Sui, P.; Yang, X. A privacy-preserving compression storage method for large trajectory data in road networks. J. Grid Comput. 2018, 16, 229–245.
Uddin, S.; Khan, A.; Lu, H.; Zhou, F.; Karim, S.; Hajati, F.; Moni, M. Road networks and socio-demographic factors to explore COVID-19 infection during its different waves. Sci. Rep. 2024, 14, 1551.
Ma, L.; Li, X.; Kong, X.; Yang, C.; Chen, L. Optimal participation and cost allocation of shared energy storage considering customer directrix load demand response. J. Energy Storage 2024, 81, 110404.
Kim, S.; Yoon, J.; Kwon, O. Biomedical Relation Extraction Using Dependency Graph and Decoder-Enhanced Transformer Model. Bioengineering 2023, 10, 586.
Hussain, A.; Sabu, C.; Balasubramanian, K.; Manyam, R.; Kidambi, R.; Sadiq, A.; Farhan, A. Optimization system based on convolutional neural network and internet of medical things for early diagnosis of lung cancer. Bioengineering 2023, 10, 320.
Xing, J.; Yuan, H.; Hamzaoui, R.; Liu, H.; Hou, J. GQE-Net: A Graph-Based Quality Enhancement Network for Point Cloud Color Attribute. IEEE Trans. Image Process. 2023, 32, 6303–6317.
Hu, F.; Qiu, L.; Wei, S.; Zhou, H.; Bathuure, I.; Hu, H. The spatiotemporal evolution of global innovation networks and the changing position of China: A social network analysis based on cooperative patents. R&D Manag. 2024, 54, 574–589.
Li, X.; Sun, Y. Application of RBF neural network optimal segmentation algorithm in credit rating. Neural Comput. Appl. 2021, 33, 8227–8235.
Wang, K.; Boonpratatong, A.; Chen, W.; Ren, L.; Wei, G.; Qian, Z.; Lu, X.; Zhao, D. The Fundamental Property of Human Leg During Walking: Linearity and Nonlinearity. IEEE Trans. Neural Syst. Rehabil. Eng. 2023, 31, 4871–4881.
Xu, Y.; Wang, E.; Yang, Y.; Chang, Y. A Unified Collaborative Representation Learning for Neural-Network Based Recommender Systems. IEEE Trans. Knowl. Data Eng. 2022, 34, 5126–5139.
Li, H.; Xia, C.; Wang, T.; Wang, Z.; Cui, P.; Li, X. GRASS: Learning Spatial–Temporal Properties From Chainlike Cascade Data for Microscopic Diffusion Prediction. IEEE Trans. Neural Netw. Learn. Syst. 2023, 15, 1–15.
Zhu, H.; Xu, D.; Huang, Y.; Jin, Z.; Ding, W.; Tong, J.; Chong, G. Graph Structure Enhanced Pre-Training Language Model for Knowledge Graph Completion. IEEE Trans. Emerging Top. Comput. Intell. 2024, 1, 1–12.
Sideris, N.; Bardis, G.; Voulodimos, A.; Miaoulis, G.; Ghazanfarpour, D. Enhancing Urban Data Analysis: Leveraging Graph-Based Convolutional Neural Networks for a Visual Semantic Decision Support System. Sensors 2024, 24, 1335.
Huang, J.; Wu, W.; Li, J.; Wang, S. Text summarization method based on gated attention graph neural network. Sensors 2023, 23, 1654.
Wang, X.; Wang, D.; Yu, D.; Wu, R.; Yang, Q.; Deng, S.; Xu, G. Intent-aware Graph Neural Network for Point-of-Interest embedding and recommendation. Neurocomputing 2023, 557, 126734.
Zhang, H.; Zhang, X.; Yu, D.; Guan, L.; Wang, D.; Zhou, F.; Zhang, W. Multi-Modality Adaptive Feature Fusion Graph Convolutional Network for Skeleton-Based Action Recognition. Sensors 2023, 23, 5414.
Wang, D.; Zhang, X.; Yin, Y.; Yu, D.; Xu, G.; Deng, S. Multi-view enhanced graph attention network for session-based music recommendation. ACM Trans. Inf. Syst. 2023, 42, 16.
Lim, H.; Joo, Y.; Ha, E.; Song, Y.; Yoon, S.; Shin, T. Brain Age Prediction Using Multi-Hop Graph Attention Combined with Convolutional Neural Network. Bioengineering 2024, 11, 265.
Yin, Y.; Guo, Y.; Su, Q.; Wang, Z. Task Allocation of Multiple Unmanned Aerial Vehicles Based on Deep Transfer Reinforcement Learning. Drones 2022, 6, 215.
Tian, J.; Wang, B.; Guo, R.; Wang, Z.; Cao, K.; Wang, X. Adversarial Attacks and Defenses for Deep-Learning-Based Unmanned Aerial Vehicles. IEEE Internet Things J. 2022, 9, 22399–22409.
Zheng, W.; Lu, S.; Cai, Z.; Wang, R.; Wang, L.; Yin, L. PAL-BERT: An Improved Question Answering Model. Comput. Model. Eng. Sci. 2024, 139, 2729–2745.
Dang, W.; Cai, L.; Liu, M.; Li, X.; Yin, Z.; Liu, X.; Zheng, W. Increasing Text Filtering Accuracy with Improved LSTM. Comput. Inform. 2024, 42, 1491–1517.
Wu, Y.; Wu, M. Biomedical Data Mining and Machine Learning for Disease Diagnosis and Health Informatics. Bioengineering 2024, 11, 364.
Hu, X.; Tang, T.; Tan, L.; Zhang, H. Fault Detection for Point Machines: A Review, Challenges, and Perspectives. Actuators 2023, 12, 391.
Usategui, I.; Arroyo, Y.; Torres, A.; Barbado, J.; Mateo, J. Systemic Lupus Erythematosus: How Machine Learning Can Help Distinguish between Infections and Flares. Bioengineering 2024, 11, 90.
Zhou, T.; Cai, Z.; Liu, F.; Su, J. In Pursuit of Beauty: Aesthetic-Aware and Context-Adaptive Photo Selection in Crowdsensing. IEEE Trans. Knowl. Data Eng. 2023, 35, 9364–9377.
Lee, K.; Jo, H.; Ko, J.; Lim, S.; Shin, K. Ssumm: Sparse summarization of massive graphs. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020.
Chintalapudi, N.; Angeloni, U.; Battineni, G.; Di Canio, M.; Marotta, C.; Rezza, G.; Amenta, F. LASSO regression modeling on prediction of medical terms among seafarers’ health documents using tidy text mining. Bioengineering 2022, 9, 124.
Christoph, F.; Hirsch, T.; Wolfbeis, O. Photonic crystals for chemical sensing and biosensing. Angew. Chem. Int. Ed. 2014, 53, 3318–3335.
Li, T.; Xu, Y.; Wu, T.; Charlton, J.; Bennett, K.; Al-Hindawi, F. BlobCUT: A Contrastive Learning Method to Support Small Blob Detection in Medical Imaging. Bioengineering 2023, 10, 1372.
Kareem, A.; Liu, H.; Velisavljevic, V. A Privacy-preserving Approach to Effectively Utilize Distributed Data for Malaria Image Detection. Bioengineering 2024, 11, 340.
Scognamiglio, V.; Arduini, F.; Palleschi, G.; Rea, G. Biosensing technology for sustainable food safety. TrAC Trends Anal. Chem. 2014, 62, 1–10.
Tae-Gyu, L.; Ko, M.; Lee, S. Wearable multiple biosensing process architecture in human healthcare environments. Int. J.-Bio-Sci.-Bio-Technol. 2014, 6, 177–184.
Mikhail, P.; Ha, B.; Peters, B. GOnet: A tool for interactive Gene Ontology analysis. BMC Bioinform. 2018, 19, 470.
Rossi, R.; Ahmed, N. The network data repository with interactive graph analytics and visualization. In Proceedings of the AAAI Conference on Artificial Intelligence, Austin, TX, USA, 25–30 January 2015; Volume 29.
Bansal, M.; Belcastro, V.; Ambesi-Impiombato, A.; Di Bernardo, D. How to infer gene networks from expression profiles. Mol. Syst. Biol. 2007, 3, 78.
Yu, G.; Ye, Q.; Ruan, T. Enhancing Error Detection on Medical Knowledge Graphs via Intrinsic Label. Bioengineering 2024, 11, 225.
Kang, S.; Lee, K.; Shin, K. Personalized graph summarization: Formulation, scalable algorithms, and applications. In Proceedings of the 2022 IEEE 38th International Conference on Data Engineering (ICDE), Kuala Lumpur, Malaysia, 9–12 May 2022.
Shin, K.; Ghoting, A.; Kim, M.; Raghavan, H. Sweg: Lossless and lossy summarization of web-scale graphs. In Proceedings of the World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019; pp. 1679–1690.
Yong, Q.; Hajiabadi, M.; Srinivasan, V.; Thomo, A. Efficient graph summarization using weighted lsh at billion-scale. In Proceedings of the 2021 International Conference on Management of Data, Auckland, New Zealand, 7–10 December 2021; pp. 2357–2365.
Khan, K.; Dolgorsuren, B.; Anh, T.; Nawaz, W.; Lee, Y. Faster compression methods for a weighted graph using locality sensitive hashing. Inf. Sci. 2017, 421, 237–253.
Liu, Y.; Dighe, A.; Safavi, T.; Koutra, D. A graph summarization: A survey. CoRR 2016, arXiv:1612.04883.
Khan, K.; Nawaz, W.; Lee, Y. Set-based approximate approach for lossless graph summarization. Computing 2015, 97, 1185–1207.
Zhao, S.; Liang, W.; Wang, K.; Ren, L.; Qian, Z.; Chen, G.; Lu, X.; Zhao, D.; Wang, X.; Ren, L. A Multiaxial Bionic Ankle Based on Series Elastic Actuation With a Parallel Spring. IEEE Trans. Ind. Electron. 2024, 71, 7498–7510.
Huang, F.; Wang, Z.; Huang, X.; Qian, Y.; Li, Z.; Chen, H. Aligning Distillation For Cold-Start Item Recommendation. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; pp. 23–27.
Zou, W.; Sun, Y.; Zhou, Y.; Lu, Q.; Nie, Y.; Sun, T.; Peng, L. Limited Sensing and Deep Data Mining: A New Exploration of Developing City-Wide Parking Guidance Systems. IEEE Intell. Transp. Syst. Mag. 2022, 14, 198–215.
Xie, X.; Xie, B.; Cheng, J.; Chu, Q.; Dooling, T. A simple Monte Carlo method for estimating the chance of a cyclone impact. Nat. Hazards 2021, 107, 2573–2582.
Qi, H.; Zhou, Z.; Irizarry, J.; Lin, D.; Zhang, H.; Li, N.; Cui, J. Automatic Identification of Causal Factors from Fall-Related Accident Investigation Reports Using Machine Learning and Ensemble Learning Approaches. J. Manag. Eng. 2024, 40, 04023050.
Chakraborty, T.; Naik, S.; Chattopadhyay, S.; Das, S. Learning Patterns from Biological Networks: A Compounded Burr Probability Model. Res. Sq. 2023, 1, 1–21.
Xia, W.; Pu, L.; Zou, X.; Shilane, P.; Li, S.; Zhang, H.; Wang, X. The Design of Fast and Lightweight Resemblance Detection for Efficient Post-Deduplication Delta Compression. ACM Trans. Storage 2023, 19, 22.
Tian, G.; Hui, Y.; Lu, W.; Tingting, W. Rate-distortion optimized quantization for geometry-based point cloud compression. J. Electron. Imaging 2023, 32, 13047.
Chadaga, K.; Prabhu, S.; Bhat, V.; Sampathila, N.; Umakanth, S.; Chadaga, R. A decision support system for diagnosis of COVID-19 from non-COVID-19 influenza-like illness using explainable artificial intelligence. Bioengineering 2023, 10, 439.
Xu, X.; Liu, W.; Yu, L. Trajectory prediction for heterogeneous traffic-agents using knowledge correction data-driven model. Inf. Sci. 2022, 608, 375–391.
Christopher, M.; Gonzalez, R.; Huynh, J.; Walker, E.; Radha, S.; Bowd, C.; Belghith, A.; Goldbaum, M.; Fazio, M.; Girkin, C. Proactive Decision Support for Glaucoma Treatment: Predicting Surgical Interventions with Clinically Available Data. Bioengineering 2024, 11, 140.
Kang, I.; Njimbouom, S.; Kim, J. Optimal feature selection-based dental caries prediction model using machine learning for decision support system. Bioengineering 2023, 10, 245.
Bergauer, L.; Akbas, S.; Braun, J.; Ganter, M.; Meybohm, P.; Hottenrott, S.; Zacharowski, K.; Raimann, F.; Rivas, E.; López-Baamonde, M. Visual blood, visualisation of blood gas analysis in virtual reality, leads to more correct diagnoses: A computer-based, multicentre, simulation study. Bioengineering 2023, 10, 340.
Elgendi, M. Eventogram: A visual representation of main events in biomedical signals. Bioengineering 2016, 3, 22.
Bastian, M.; Heymann, S.; Jacomy, M. Gephi: An open source software for exploring and manipulating networks. In Proceedings of the International AAAI Conference on Web and Social Media, San Jose, CA, USA, 17–20 May 2009; Volume 3, pp. 361–362.
Ilyas, Q.; Ahmad, M.; Mehmood, A. Automated estimation of crop yield using artificial intelligence and remote sensing technologies. Bioengineering 2023, 10, 125.
Ng, P.; Chen, P.; Sin, Z.; Lai, S.; Cheng, A. Smart Work Injury Management (SWIM) system: A machine learning approach for the prediction of sick leave and rehabilitation plan. Bioengineering 2023, 10, 172.
Xue, X.; Marappan, R.; Raju, S.; Raghavan, R.; Rajan, R.; Khalaf, O.; Abdulsahib, G. Modeling and analysis of hybrid transformation for lossless big medical image compression. Bioengineering 2023, 10, 333.
Sengupta, S.; Dutta, A.; Abdelmohsen, S.; Alyousef, H.; Rahimi-Gorji, M. Development of a rice plant disease classification model in big data environment. Bioengineering 2022, 9, 758.
Huang, J.; Gómez-Dans, J.; Huang, H.; Ma, H.; Wu, Q.; Lewis, P.; Liang, S.; Chen, Z.; Xue, J.; Wu, Y. Assimilation of remote sensing into crop growth models: Current status and perspectives. Agric. For. Meteorol. 2019, 276–277, 107609.
Zhou, G.; Tang, Y.; Zhang, W.; Liu, W.; Jiang, Y.; Gao, E.; Zhu, Q.; Bai, Y. Shadow Detection on High-Resolution Digital Orthophoto Map Using Semantic Matching. IEEE Trans. Geosci. Remote Sens. 2023, 61, 4504420.
Zhou, G.; Liu, X. Orthorectification Model for Extra-Length Linear Array Imagery. IEEE Trans. Geosci. Remote Sens. 2022, 60, 4709710.
Tian, W.; Zhao, Y.; Hou, R.; Dong, M.; Ota, K.; Zeng, D.; Zhang, J. A Centralized Control-Based Clustering Scheme for Energy Efficiency in Underwater Acoustic Sensor Networks. IEEE Trans. Green Commun. Netw. 2023, 7, 668–679.
Hou, X.; Xin, L.; Fu, Y.; Na, Z.; Gao, G.; Liu, Y.; Chen, T. A self-powered biomimetic mouse whisker sensor (BMWS) aiming at terrestrial and space objects perception. Nano Energy 2023, 118, 109034.
Liu, H.; Yuan, H.; Hou, J.; Hamzaoui, R.; Gao, W. PUFA-GAN: A Frequency-Aware Generative Adversarial Network for 3D Point Cloud Upsampling. IEEE Trans. Image Process. 2022, 31, 7389–7402.
Zhou, G.; Li, H.; Song, R.; Wang, Q.; Xu, J.; Song, B. Orthorectification of Fisheye Image under Equidistant Projection Model. Remote Sens. 2022, 14, 4175.
Peng, J.; Chen, X.; Wang, X.; Wang, J.; Long, Q.; Yin, L. Picture fuzzy decision-making theories and methodologies: A systematic review. Int. J. Syst. Sci. 2023, 54, 2663–2675.
Cai, D.; Li, R.; Hu, Z.; Lu, J.; Li, S.; Zhao, Y. A comprehensive overview of core modules in visual SLAM framework. Neurocomputing 2024, 590, 127760.
Yang, D.; Cui, Z.; Sheng, H.; Chen, R.; Cong, R.; Wang, S.; Xiong, Z. An Occlusion and Noise-aware Stereo Framework Based on Light Field Imaging for Robust Disparity Estimation. IEEE Trans. Comput. 2023, 73, 764–777.
Cao, Q.; Wang, R.; Zhang, T.; Wang, Y.; Wang, S. Hydrodynamic Modeling and Parameter Identification of a Bionic Underwater Vehicle: RobDact. Cyborg Bionic Syst. 2022, 2022.
Zou, X.; Yuan, J.; Shilane, P.; Xia, W.; Zhang, H.; Wang, X. From Hyper-Dimensional Structures to Linear Structures: Maintaining Deduplicated Data’s Locality. ACM Trans. Storage 2022, 18, 25.
Liu, H.; Jiang, K.; Gamboa, H.; Xue, T.; Schultz, T. Bell Shape Embodying Zhongyong: The Pitch Histogram of Traditional Chinese Anhemitonic Pentatonic Folk Songs. Appl. Sci. 2022, 12, 8343.
Zhu, G.; Yong, L.; Zhao, X.; Liu, Y.; Zhang, Z.; Xu, Y.; Sun, Z.; Sang, L.; Wang, L. Evaporation, infiltration and storage of soil water in different vegetation zones in the Qilian Mountains: A stable isotope perspective. Hydrol. Earth Syst. Sci. 2022, 26, 3771–3784.
Huang, J.; Ma, H.; Sedano, F.; Lewis, P.; Liang, S.; Wu, Q.; Su, W.; Zhang, X.; Zhu, D. Evaluation of regional estimates of winter wheat yield by assimilating three remotely sensed reflectance datasets into the coupled WOFOST—PROSAIL model. Eur. J. Agron. 2019, 102, 1–13.
Wu, W.; Zhu, H.; Yu, S.; Shi, J. Stereo Matching With Fusing Adaptive Support Weights. IEEE Access 2019, 7, 61960–61974.
Wu, Z.; Zhu, H.; He, L.; Zhao, Q.; Shi, J.; Wu, W. Real-time stereo matching with high accuracy via Spatial Attention-Guided Upsampling. Appl. Intell. 2023, 53, 24253–24274.
Gu, Y.; Hu, Z.; Zhao, Y.; Liao, J.; Zhang, W. MFGTN: A multi-modal fast gated transformer for identifying single trawl marine fishing vessel. Ocean. Eng. 2024, 303, 117711.
Yang, M.; Han, W.; Song, Y.; Wang, Y.; Yang, S. Data-model fusion driven intelligent rapid response design of underwater gliders. Adv. Eng. Inform. 2024, 61, 102569.
Qi, F.; Tan, X.; Zhang, Z.; Chen, M.; Xie, Y.; Ma, L. Glass Makes Blurs: Learning the Visual Blurriness for Glass Surface Detection. IEEE Trans. Ind. Inform. 2024, 20, 6631–6641.
Fan, W.; Li, J.; Wang, X.; Wu, Y. Query preserving graph compression. In Proceedings of the 38th ACM SIGMOD International Conference on Management of Data, Scottsdale, AZ, USA, 20–24 May 2012; pp. 157–168.
Khan, K. Set-based approach for lossless graph summarization using locality sensitive hashing. In Proceedings of the 31st IEEE International Conference on Data Engineering Workshops, Seoul, Republic of Korea, 3–17 April 2015; pp. 255–259. [Google Scholar]
Wu, Y.; Jin, R.; Zhang, X. Fast and unified local search for random walk-based k-nearest-neighbor query in large graphs. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, Snowbird, UT, USA, 22–27 June 2014. [Google Scholar]
Kumar, K.; Efstathopoulos, P. Utility-driven graph summarization. Proc. VLDB Endow. 2018, 12, 335–347. [Google Scholar] [CrossRef]
Mishra, P.; Kumar, S.; Chaube, M. Graph interpretation, summarization and visualization techniques: A review and open research issues. Multimed. Tools Appl. 2023, 82, 8729–8771. [Google Scholar] [CrossRef]
Koutra, D.; Kang, U.; Vreeken, J.; Faloutsos, C. Vog: Summarizing and understanding large graphs. In Proceedings of the 2014 SIAM International Conference on Data Mining, Philadelphia, PA, USA, 24–26 April 2014; pp. 91–99. [Google Scholar]
Ko, J.; Kook, Y.; Shin, K. Incremental lossless graph summarization. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Virtual, 6–10 July 2020; pp. 317–327. [Google Scholar]
Zhou, H.; Liu, S.; Lee, K.; Shin, K.; Shen, H.; Cheng, X. Dpgs: Degree-preserving graph summarization. In Proceedings of the 2021 SIAM International Conference on Data Mining (SDM). SIAM, Virtual, 29 April–1 May 2021; pp. 280–288. [Google Scholar]
Navlakha, S.; Rastogi, R.; Shrivastava, N. Graph summarization with bounded error. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Vancouver, BC, Canada, 10–12 June 2008; pp. 419–432. [Google Scholar]
Omid, J.; Maurya, P.; Nagarkar, P.; Islam, K.; Crushev, C. A survey on locality sensitive hashing algorithms and their applications. arXiv 2021, arXiv:2102.08942. [Google Scholar]
Koutra, D.; Kang, U.; Vreeken, J.; Faloutsos, C. Summarizing and understanding large graphs. Stat. Anal. Data Mining Asa Data Sci. J. 2015, 8, 183–202. [Google Scholar] [CrossRef]
Shrivastava, A.; Li, P. Improved densification of one permutation hashing. In Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence (UAI), Quebec City, QC, Canada, 23–27 July 2014; pp. 732–741. [Google Scholar]

Figure 1. Illustration of grouping-based and personalized graph summarization. (a) Correction-set-based summarization. (b) Personalized graph summarization.

Figure 2. Illustration of correction-set-based merging: (a) input graph G; (b) initialization phase; (c) merging iteration 1; (d) merging iteration 2; (e) merging iteration 3; (f) merging iteration 4; (g) representation of C+ (yellow) and C- (green) edges; (h) final summary graph G^{/}.
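
The correction-set idea named in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the common formulation in which a super edge between two super nodes is kept only when that is cheaper than listing the individual edges, with C+ holding edges to add back and C- holding edges to remove. The function names, the cost rule, and the toy graph are illustrative assumptions and are not taken from the paper. The sketch also assumes a simple undirected graph without self-loops and a grouping that covers every node.

```python
from itertools import combinations


def _pairs(groups, a, b):
    """All node pairs that a super edge between super nodes a and b implies."""
    if a == b:
        return {frozenset(p) for p in combinations(sorted(groups[a]), 2)}
    return {frozenset((u, v)) for u in groups[a] for v in groups[b]}


def encode(edges, groups):
    """Return (super_edges, c_plus, c_minus) for a fixed grouping of nodes."""
    edge_set = {frozenset(e) for e in edges}
    super_edges, c_plus, c_minus = set(), set(), set()
    names = sorted(groups)
    for i, a in enumerate(names):
        for b in names[i:]:
            pairs = _pairs(groups, a, b)
            present = pairs & edge_set
            # Assumed cost rule: one super edge plus the missing pairs,
            # versus listing every present edge on its own.
            cost_with = 1 + len(pairs - present)
            cost_without = len(present)
            if cost_with < cost_without:
                super_edges.add((a, b))
                c_minus |= pairs - present   # implied but absent edges
            else:
                c_plus |= present            # edges stored individually
    return super_edges, c_plus, c_minus


def decode(groups, super_edges, c_plus, c_minus):
    """Rebuild the original edge set from the summary and its corrections."""
    edges = set()
    for a, b in super_edges:
        edges |= _pairs(groups, a, b)
    return (edges | c_plus) - c_minus


# Hypothetical toy graph and grouping, for illustration only.
edges = [("a", "c"), ("a", "d"), ("a", "e"), ("b", "c"), ("b", "d")]
groups = {"S1": {"a", "b"}, "S2": {"c", "d", "e"}}
summary = encode(edges, groups)
restored = decode(groups, *summary)
```

Because every edge is covered either by a super edge or by a correction, decoding the summary returns the original edge set, which is what makes this kind of representation lossless.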

Figure 3. LSH and WLSH applied to nodes A, B, and G of the input graph G.

Figure 4. Adjacency matrix of the input graph in Figure 2.

Figure 5. Applying DOPH on Node A.

Figure 6. Permute the vector using the random-permutation-based hash function H.

Figure 7. Divide the permuted vector into K equal bins; here, K is the signature length chosen by the user and is assumed to be 4.

Figure 8. If bin bi has a non-zero entry, set Hbi to the index of its first non-zero entry; otherwise, let bj be the first bin to the left or right with a non-zero entry, and set Hbi to Hbj.

Figure 9. Hash signature for node A.
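
The steps captioned in Figures 6–9 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `doph_signature`, the seeded permutation, the random choice between the left and right direction, the circular wrap-around when searching for a non-empty bin, and the example row are all hypothetical.

```python
import random


def doph_signature(row, k, seed=0):
    """Return a length-k densified one-permutation hash signature (sketch)."""
    n = len(row)
    if k <= 0 or n % k != 0:
        raise ValueError("len(row) must be a positive multiple of k")
    if not any(row):
        raise ValueError("row needs at least one non-zero entry")

    rng = random.Random(seed)

    # Step 1 (Figure 6): permute the vector with one random permutation.
    perm = list(range(n))
    rng.shuffle(perm)
    permuted = [row[p] for p in perm]

    # Step 2 (Figure 7): split the permuted vector into k equal bins.
    width = n // k
    bins = [permuted[b * width:(b + 1) * width] for b in range(k)]

    # Step 3a (Figure 8): offset of the first non-zero entry in each bin,
    # or None when the bin is empty.
    first = [next((i for i, v in enumerate(chunk) if v), None) for chunk in bins]

    # Step 3b (Figure 8): an empty bin borrows the value of the nearest
    # non-empty bin in a randomly chosen direction (assumed to wrap around).
    signature = []
    for b in range(k):
        j = b
        step = rng.choice((-1, 1))
        while first[j] is None:
            j = (j + step) % k
        signature.append(first[j])
    return signature


# Hypothetical adjacency row of one node in an 8-node graph, with K = 4.
row_a = [0, 1, 1, 0, 0, 0, 1, 0]
print(doph_signature(row_a, k=4, seed=42))
```

Two nodes hashed with the same seed share a permutation, so nodes with similar adjacency rows tend to agree on many signature positions, which is what lets the signatures act as a cheap grouping key.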

Figure 10. Illustration of IPGS based on flow rate maximization: (a) weighted input graph; (b) output summary graph maintaining the influence of node A.

Figure 11. Flow diagram for proposed algorithms.

Figure 12. Visualization of the original and summary graphs of the Bio–Mouse–Gene dataset. The results are provided by varying the size of the input graph for detailed analytics and visualization. (a) Original Bio–Mouse–Gene dataset. (b) Chunk of the original dataset, having 223 nodes and 997 edges. (c) Summary graph, with 119 super nodes and 439 super edges, from the chunk of the dataset shown in Subfigure (b). (d) Bio–Mouse–Gene dataset with 108 nodes. (e) Summary graph of the graph with 108 nodes in Subfigure (d) for target node 2. (f) Smaller chunk from the original graph, with 12 nodes and 29 edges. (g) Summary graph, with 4 super nodes and 12 super edges, of the chunk of the dataset that is shown in Subfigure (f).

Figure 13. Execution time results on the complete size of each dataset.

Figure 14. Execution time (in s) comparison of PGS and IPGS for different data sizes. The sub-figures (a–h) demonstrate the execution time on different datasets. The name of the dataset is shown at the top of each sub-figure.

Figure 15. Compression ratio results on the complete size of each dataset.

Figure 16. Comparison between LDME, PGS, and IPGS for compression achieved for different dataset sizes. The sub-figures (a–h) demonstrate the compression achieved on different datasets. The name of the dataset is shown at the top of each sub-figure.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
