Article

Deduplication-Aware Healthcare Data Distribution in IoMT

by
Saleh M. Altowaijri
Department of Information Systems, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia
Mathematics 2024, 12(16), 2482; https://doi.org/10.3390/math12162482
Submission received: 31 May 2024 / Revised: 4 August 2024 / Accepted: 5 August 2024 / Published: 11 August 2024

Abstract: As medical sensors undergo expeditious advancements, there is rising interest in healthcare applications within the Internet of Medical Things (IoMT) because of its broad applicability in monitoring the health of patients. IoMT proves beneficial in monitoring, disease diagnosis, and better treatment recommendations. This emerging technology aggregates real-time patient health data from sensors deployed on their bodies. This data collection mechanism consumes excessive power due to the transmission of data of similar types. It necessitates a deduplication mechanism, but this is complicated by the variable sizes of the data chunks, which may be very small or very large. This reduces the likelihood of efficient chunking and, hence, deduplication. In this study, a deduplication-based data aggregation scheme is presented. It includes a Delimiter-Based Incremental Chunking Algorithm (DICA), which recognizes the breakpoint between two frames. The scheme includes static as well as variable-length windows. The proposed algorithm identifies a variable-length chunk using a terminator that optimizes the variable-sized windows, with a threshold limit for the window size. To validate the scheme, a simulation was performed using NS-2.35 with the C language on the Ubuntu operating system. The TCL language was employed to set up networks, as well as for messaging purposes. The results demonstrate that the rise in the number of variable-sized windows amounts to 62%, 66.7%, 68%, and 72.1% for DSW, RAM, CWCA, and DICA, respectively. The proposed scheme exhibits superior performance in terms of the probability of the false recognition of breakpoints, the static and dynamic sizes of chunks, the average sizes of chunks, the total attained chunks, and energy utilization.

1. Introduction

The Internet of Things (IoT) encompasses a vast array of smart sensors used to check the health conditions of patients. In addition, IoT senses crucial data and sends them to the main storage repository. Incorporating IoT into healthcare facilitates the frequent and real-time monitoring of patients, resulting in timely interventions and enhanced patient outcomes [1,2]. A wireless body area network (WBAN) contains intelligent sensing devices affixed to the human body to observe and share health data, such as the heart rate, oxygen saturation level, body temperature, and blood pressure. These devices have limited resources for the detection and transmission of crucial health information [3]. WBANs use sensors in wearable devices such as garments, watches, and shoes for accurate reading and frequent patient monitoring. By utilizing WBANs, healthcare providers can ensure that patients, especially those with chronic conditions, receive continuous supervision without the need for frequent hospital visits. All deployed sensors are connected with an aggregator node located at the core of the body of the patient [4,5].
Sensors transmit real-time data to collector devices (CDs), which then exchange the data with sink or fog servers. Finally, the data are sent to cloud repositories for long-term healthcare analysis and estimation for the timely identification of critical issues. The use of cloud repositories will ensure that healthcare data are securely stored and easily accessible in the future. Fog servers offload data processing to reduce the burden on cloud systems and enhance privacy and security. This approach mitigates bandwidth limitations and network congestion. A fog server helps to reduce delays in processes, such as exchanging data from the sensing devices to the cloud [6]. This reduced latency is crucial in emergency scenarios where quick access to patient data can save lives. Data collection and deduplication can occur at the aggregator node. This is beneficial for elders, workers, patients, etc., whose information is sent frequently to a central storage repository.
Energy management is challenging with regard to sustaining and optimizing the functionality of IoT sensors. This emphasizes the need for efficient energy harvesting and management to ensure uninterrupted IoT network performance. Diverse energy-harvesting methods and their integration with communication technologies have been described. By addressing the challenges associated with energy resource management, such research aims to improve energy conservation, extend battery lives, and ultimately enhance the overall reliability and efficiency of IoT sensor networks [7].
Data aggregation involves the collection of data from a particular area of interest and the transmission of a single message by aggregating the values of different sensors. It helps to reduce the number of messages, the communication cost, and the energy consumption. A detailed analysis of in-network aggregation for IoT is presented in [8]. It categorizes schemes based on the technology utilized in data centers to improve aggregation by offloading tasks from servers to network switches. This method reduces the communication and traffic pressure as data collection is performed directly within the network. The effectiveness of these schemes is also explored in addressing issues such as interference, fault tolerance, flexibility, and security. In [9], efficient data aggregation for heterogeneous networks is discussed, where hybrid approaches combine data-centric and address-centric routing. Such methods improve latency, energy efficiency, data accuracy, and temporal correctness. Fog servers are utilized to receive the aggregated data and process them before sharing them with the cloud. By processing the data at the edge of the network, fog servers help to decrease the amount of data transmitted to the cloud, thereby saving bandwidth and reducing the response times.
Deduplication identifies redundancies in aggregated data to maintain a single instance of each record. Deduplication is considered a subcategory of data compression in which long, identical data patterns are replaced with small data values. In the case of cyclic and redundant healthcare data distributions, a sensing bottleneck can arise, which needs to be addressed by utilizing a caching approach [10]. Deduplication can be categorized as (a) post-processing and (b) in-line. In the former case, the latest data are first saved on a device and later examined for duplicated values. This ensures that the performance of storage operations is not reduced; the drawback is that identical values are stored, if only for a short time. In the latter case, hash calculations are performed in real time to remove redundant blocks. This requires minimal storage space as the data are not replicated. However, the disadvantage is that the hash and lookup calculations need extensive processing time [11]. This means that saving the data on the device can be time-consuming, consequently affecting the backup.
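For concreteness, the following is a minimal C# sketch of the in-line approach described above: each incoming block is fingerprinted in real time and stored only when its fingerprint has not been seen before. The block contents, the choice of SHA-256, and the in-memory index are illustrative assumptions, not details taken from the cited works.

using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Minimal sketch of in-line (real-time) deduplication: each incoming block is
// hashed and stored only if its fingerprint has not been seen before.
public static class InlineDedup
{
    private static readonly HashSet<string> Index = new HashSet<string>();
    private static readonly List<byte[]> Store = new List<byte[]>();

    public static bool StoreBlock(byte[] block)
    {
        using (var sha = SHA256.Create())
        {
            string fingerprint = Convert.ToBase64String(sha.ComputeHash(block));
            if (Index.Contains(fingerprint))
                return false;              // duplicate: only the index entry is kept
            Index.Add(fingerprint);
            Store.Add(block);              // unique block is written once
            return true;
        }
    }

    public static void Main()
    {
        byte[] b1 = Encoding.UTF8.GetBytes("temp=98.6,bp=120/80");
        byte[] b2 = Encoding.UTF8.GetBytes("temp=98.6,bp=120/80"); // identical reading
        Console.WriteLine(StoreBlock(b1)); // True  (stored)
        Console.WriteLine(StoreBlock(b2)); // False (deduplicated)
    }
}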
The main problem is that sensor devices produce a huge volume of data constantly, so the aggregator node is faced with identical data. From the aggregator device, the collected information is then sent to the main server. The transmission of large volumes of information utilizes a large amount of energy from the battery and decreases the network’s lifespan. Furthermore, the storage of identical data recurrently on the server results in the wastage of storage. The limited power and dynamic nature of sensors additionally exacerbate the issue of effective data dissemination. To decrease congestion in the network, as well as reduce data latency, it is vital to eliminate similar information values via a data deduplication method. The huge volumes of similar data result in per-bit energy losses on the server side, as well as upsurges in energy consumption.
This study includes an extensive examination of various data collection techniques that facilitate the sending of patient data to servers located in hospitals. The deployed sensors collect huge amounts of information, which utilizes high energy due to the frequent transmission of identical data. By recognizing the challenges associated with energy-efficient schemes, the present research contributes in the following ways.
(1)
We examined schemes that underscore the significance of data deduplication to overcome the duplication of the collected data for WBAN cases;
(2)
Data deduplication of level 1 was performed at the collector node before transmission to the sink. Next, a second level of deduplication was performed after receiving data from different collectors at the sink or fog server. The study further identified redundancies to improve deduplication. This helps to decrease both the storage cost and the transmission cost;
(3)
Next, we presented a new Delimiter-Based Incremental Chunking Algorithm (DICA), which considers breakpoint identification between two windows. The main target is to obtain bigger chunks to improve the rate of deduplication;
(4)
We performed extensive simulations to compare DICA with base schemes for data deduplication. The results indicate the supremacy of the DICA over its counterparts.
The subsequent sections of the manuscript are structured as follows. The existing studies are examined in Section 2. Section 3 presents the system model and the research problem. Section 4 elaborates on the proposed DICA scheme used for chunking. The results and analysis are presented in Section 5. In Section 6, the conclusion of the work is stated.

2. Literature Review

The existing literature was examined to identify mechanisms for gathering patient data from various healthcare services. The emphasis of the study was on healthcare data collection, as well as the dissemination of such data to the data collection center (DCC), serving diverse uses. Within this context, we investigated the secure aggregation of data and examined the aspect of deduplication. The transmission of data collected using sensors can be facilitated through a cluster-based model [12]. This scheme incorporates an effective method for improved energy consumption and data transmission. The proposed scheme includes four clusters, each comprising Ad hoc Relay Stations (ARSs), Relay Nodes (RNs), and sensor nodes attached to the body for health checking. Layer 1 highlights consistent and effective relay-dependent routing, attaining a 99.9% packet delivery ratio (PDR) under challenging conditions. Layer 2 incorporates IoT-based smart homes to assist aging care, retaining relay routing for effective data transfer. Layer 3 uses the base station (BS), guaranteeing quality of service (QoS), energy efficiency, nominal end-to-end delay, and strong support for healthcare environments [13]. For better link quality along the calculated shortest path, a composite Route Cost Function integrating a Link Reliability Factor is utilized [14]. This can prove useful for the efficient transmission of aggregated healthcare data between sensing devices and the BS, improving storage overhead and data reliability.
The aggregation of healthcare-based data requires a broadband setup, including 3G and LTE. The sensed information is sent to cloud servers at a rate of 360 data packets per hour. Initially, Bluetooth is employed for data aggregation, after which the data are transmitted to the sink node (SN) and subsequently to smartphones and personal digital assistants (PDAs) via the Internet. Finally, the cloud serves as the ultimate destination for the further analysis of the data. The AODV protocol is employed with a priority queue to distinguish the traffic. The results obtained in this way show the impact of both high-priority and low-priority queues. However, AODV falls short due to the vulnerabilities associated with the FIFO approach, particularly when addressing emergency cases. Preemptive and non-preemptive cases are added to reduce the shortcomings associated with AODV [15]. To check healthcare-related data, different types of communication schemes are employed in [16].
A comparison between RF and human body communication (HBC) was conducted on the basis of usage. The WBAN hub facilitates communication between various devices, such as relays or bridges. The model comprises three main modules, namely the BAN, the node, and the organization of the data. The data and control channels are employed independently. Due to its distinctive features, the smart body area network (BAN) finds application in specialized scenarios or for the observation of unique processes. To minimize packet losses and maintain low transmission delays, the AODV routing protocol is introduced [17]. The data are categorized based on priority (urgent or routine), ensuring efficient handling. A three-tier architecture is employed, incorporating a diverse range of sensors, such as ECG, movement, acoustic, and BP devices. All devices are attached to the hub node (HN). In the third stage, the HN further sends the aggregated data to either a smartphone or a nursing station. Finally, the gathered information is forwarded to the central database to save these records. In an emergency, an alarm is activated. Each sensor is assigned a threshold value. The aggregated data are transferred and shared with the server. This scheme transmits the data only when a value increases beyond its threshold. It forwards the data to the SN in a short time to reduce delays and maintains a good data packet transfer ratio along with high throughput.
In [18], Elasticity-Aware Deduplication (EAD) is introduced. It enables users to define a migration trigger value, denoted as T ∈ (0, 1), indicating the deduplication level accepted by this specific user. A certain portion of RAM is allocated for indexing prior to the deduplication procedure. The experimental results confirm that EAD enhances the system efficiency fourfold. This scheme identifies 98% of the redundancy in the data while utilizing only 5% of the memory. The primary objective behind implementing backup storage is to reduce storage costs, lessen the transmission overhead, and optimize space utilization, mainly in the context of wireless multimedia networks. A frame separated by commas, denoted as V = {v_1, v_2, …, v_n}, is generated using the chunking algorithm derived from the sensory data. F′, a sub-frame of F spanning indices k to l, represents the newly modeled data; if it is similar or equal to F, this shows that the majority of the components of the newly sampled information are steady. Assume that we have a collector node holding data, denoted as R, and the multimedia sensor identifies a chunk, denoted as F′, that bears similarity to a chunk present in R. The replicated content does not need to be exchanged; however, the indexes of the information may be disseminated. Additionally, the stable sequences ultimately lead to enhanced outcomes in terms of lessening the storage space and addressing the bandwidth needed in a network [19]. The secure deduplication scheme assesses the data to identify attributes, and if duplicate content is detected, then the ciphertext policy is applied. This approach helps to overcome unnecessary steps or redundancies, permitting smooth and resource-effective storage on the private cloud server. It ensures zero-interaction key management using ElGamal. It includes a strong verification process that employs attribute keys. These keys ensure that only authorized entities obtain access to the data, which ensures access control [20]. Another scheme makes use of prime chunking (OPC), which is considered effective for data deduplication within a cloud environment. OPC works by defining boundaries with the use of prime numbers, producing an exclusive and effective chunking strategy. The dynamic strategy of this scheme makes use of these prime chunks until the data chunk achieves an appropriate length. The average data chunk size is calculated using Equation (1), where T represents the selected prime digit and L_k represents the data chunk length.
$$\mathrm{Chunk}_{average} = \frac{\sum_{i=0}^{T} L_k}{T} \qquad (1)$$
This scheme maintains a better chunk size along with reduced complexity, increasing the effectiveness with less storage. However, the scheme depends mainly on prime numbers for chunking, meaning that it is not suitable for all types of data and can result in inappropriate chunk sizes [21].
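As a purely illustrative reading of Equation (1), summing the lengths L_k of the T chunks and dividing by T, take T = 4 prime-delimited chunks with hypothetical lengths of 512, 640, 720, and 600 bytes:

$$\mathrm{Chunk}_{average} = \frac{512 + 640 + 720 + 600}{4} = 618 \ \text{bytes}$$

so the scheme would report an average chunk size of 618 bytes for this stream.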

2.1. Sliding Window-Based Chunking Techniques

Content-defined chunking (CDC) schemes are considered highly flexible in terms of defining chunk boundaries, as they rely on the content instead of the size, providing a highly dynamic method for data organization. Identifying valid breakpoints forms the foundation of the CDC technique and is considered a critical component of the deduplication procedure. This vital element has a significant effect on the data deduplication ratio and on the performance of redundancy detection. Sliding window-based techniques have been employed for the past 15 years; however, their efficiency is compromised as they require byte-by-byte processing. This work provides the basis for the construction of an innovative and effective solution. Leap-based CDC is presented, which is designed to increase the data deduplication ratio. Data deduplication usually consists of four main stages: chunking, the fingerprinting of chunks (fingerprint calculation), fingerprint indexing and querying, and the storage and handling of data [22]. This procedure involves instances where the chunks surpass the specified size, in which case subordinate solutions are utilized. If the maximum chunk limit is reached without a valid boundary, an imposed breakpoint can still occur. The leap-based algorithm also makes use of the same idea in such secondary situations, which helps to reduce the proportion of imposed breakpoints for the data chunks. A rolling hash cannot be applied, so, as an alternative, the concept of pseudo-random transformation is used. This study contrasts sliding window-based and leap-based CDC in terms of CPU and resource utilization and the data deduplication ratio. The main problem in this approach is that it sets some predefined criteria, which ultimately lead to imposed breakpoints that are unsuitable for large data chunks. The Double Sliding Window (DSW) algorithm is an improvement over the CDC mechanism, intended to increase the performance of CDC in the context of deduplication. The DSW focuses on the main criteria in deduplication, comprising the RT, the number of chunks, and the re-deletion effectiveness. The DSW increases the threshold, permitting more flexible and improved data processing. This scheme makes use of two windows with variable sizes. These windows are placed at the initial part of the data stream, and the data are then processed forward. Hash values identify breakpoints. A Markov model is used to find cut points in the data, and resources are well utilized within the deduplication mechanism [23].
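To make the sliding window idea concrete, the following is a simplified, single-window C# sketch of content-defined chunking: an accumulating hash over the current chunk is tested against a residual condition, so boundaries follow the content rather than fixed offsets. The window size, divisor, and residual are illustrative values only and do not reproduce the DSW double-window construction or its Rabin fingerprints.

using System;
using System.Collections.Generic;
using System.Text;

// Simplified content-defined chunking: a breakpoint is declared when the hash
// over the current chunk satisfies a residual condition.
public static class ContentDefinedChunking
{
    public static List<int> FindBreakpoints(byte[] data, int minWindow = 16,
                                            uint divisor = 64, uint residual = 13)
    {
        var breakpoints = new List<int>();
        uint hash = 0;
        int sinceLast = 0;
        for (int i = 0; i < data.Length; i++)
        {
            hash = hash * 31 + data[i];   // extend the hash with the incoming byte
            sinceLast++;
            if (sinceLast >= minWindow && hash % divisor == residual)
            {
                breakpoints.Add(i + 1);    // content-defined boundary after this byte
                hash = 0;                  // restart for the next chunk
                sinceLast = 0;
            }
        }
        return breakpoints;
    }

    public static void Main()
    {
        byte[] stream = Encoding.UTF8.GetBytes("98.6,120/80,97,72,98.6,120/80,96,73");
        Console.WriteLine(string.Join(", ", FindBreakpoints(stream)));
    }
}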

2.2. Fast and Effective CDC Techniques

After evaluating the drawbacks of CDC techniques, Fast and Efficient CDC (FastCDC) [24] employs gear hash-based CDC, improving the hash judgment while simultaneously simplifying the process. To improve its effectiveness, candidate chunk points below the minimum size are skipped. A normalized distribution function is used to keep the chunk sizes concentrated around a particular region. To compensate for the smallest data chunks, the normal (expected) chunk size is kept large. In the literature, two main categories are highlighted: algorithmic and hardware-based content-defined chunking methods. The two most important phases of the CDC technique are declared to be hashing and hash judgment. Hashing involves allocating a hash value to a data chunk, while hash judgment denotes the comparison of candidate chunk points. The choice to leverage gear hashing is based on a contrast with the Rabin and Adler methods: the gear hash approach needs fewer computations than the Rabin method. This choice is supported by experimental data and a detailed comparative analysis presented in a previous paper, whose results highlight the key advantages of gear hash. Firstly, gear hash uses fewer calculation operations compared with the Rabin and Adler methods, which enhances its performance. Gear hash relies on simpler operations, including a left shift, an addition, and an array lookup, to generate hash values. In contrast, the Rabin and Adler methods need more complex and time-consuming operations. This simplicity of gear hash results in faster computation times, making it more suitable for content-defined chunking.
The experimental data include a detailed comparison of the hashing stages of the Rabin, Adler, and gear methods, highlighting the fundamental operations involved. The analysis demonstrates that gear hash is faster, providing a pseudocode for each method. It advocates for the use of gear hash for improved performance.
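A minimal C# sketch of gear-style hashing and hash judgment is given below; it follows the shift, addition, and array lookup update described above and declares a boundary when the selected hash bits are zero. The random gear table seed and the 13-bit mask are assumptions made for illustration, not FastCDC's published parameters.

using System;
using System.Text;

// Sketch of gear-style hashing: the hash is updated with a left shift, an
// addition, and a table lookup per byte; a boundary is declared when the
// masked hash bits are zero (the hash judgment step).
public static class GearChunking
{
    private static readonly ulong[] Gear = BuildGearTable(seed: 2024);

    private static ulong[] BuildGearTable(int seed)
    {
        var rng = new Random(seed);
        var table = new ulong[256];
        for (int i = 0; i < 256; i++)
        {
            var buf = new byte[8];
            rng.NextBytes(buf);
            table[i] = BitConverter.ToUInt64(buf, 0);
        }
        return table;
    }

    // Returns the index just past the first content-defined boundary, or data.Length.
    public static int NextBoundary(byte[] data, int start, ulong mask = 0x1FFF)
    {
        ulong hash = 0;
        for (int i = start; i < data.Length; i++)
        {
            hash = (hash << 1) + Gear[data[i]];   // shift + add + array lookup
            if ((hash & mask) == 0)
                return i + 1;                      // hash judgment: boundary found
        }
        return data.Length;
    }

    public static void Main()
    {
        byte[] stream = Encoding.UTF8.GetBytes("oxygen=97,temp=98.6,bp=120/80");
        Console.WriteLine(NextBoundary(stream, 0));
    }
}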

2.3. Sub-Chunk Deduplication-Based Schemes

The content-defined chunking algorithm faces criticism due to its calculation overhead, as explored in [25]. The CDC hashing algorithm consumes extra processing time throughout the data deduplication mechanism. An anchor-based deduplication scheme [26] introduces dynamic sub-chunk data deduplication. This scheme makes use of a multi-step mechanism to calculate fingerprints for the data chunks. Afterward, these fingerprints are encoded using a feature model. Then, the data chunks are checked for identical data. Optimization approaches are employed to ensure a high rate of precision, along with similarity checking to avoid duplication [27].
This scheme provides an effective technique for deduplication and the subsequent storage of the data on the cloud. In this method, deduplication improves the storage capacity by holding only a single instance of a file. For sharing health data, advanced deduplication mechanisms are involved to increase storage efficiency and maintain data confidentiality. At first, sensitive information is masked, and deduplication is performed in the cloud by creating unique tags for each encoded data block. When a new block is uploaded, the cloud server (CS) checks whether the tag is already present in the system. If it is, the block is not uploaded again, thus preventing redundant storage. Additionally, the cloud server checks the duplication ratios of diagnostic data, classifying data into high, intermediate, or low duplicate ratios. This method effectively manages efficient storage and retains low duplication [28]. ESDedup utilizes an efficient deduplication strategy for patients' data that also checks for duplication. The data are divided into chunks, and idle CPU resources are used to check all chunks concurrently. The Simiprim algorithm is used for similarity detection. Afterward, a rewriting algorithm is used to ensure that only unique data blocks are stored. The redundant copies and the cost of storing data are minimized. To avoid illegal access to the data, a blockchain strategy is employed [29].
The FastDedup scheme comprises three main layers, each of which refines the data. The first layer serves the function of deduplication and manages both already stored and new data. The next layer comprises various FastDedup nodes that provide data storage functionalities; these nodes perform separate deduplication operations on the data provided by the first layer. It allows merging operations, too, which provides the ability to update the files to attain global deduplication. This scheme provides a better deduplication ratio but has a high processing overhead [30]. The presented technique attains better bit error and data rates simultaneously, along with security, as compared to RF. Ghamari et al. [31] introduced a technique that aims to reduce the power needed for communication; this approach involves using an energy-harvesting mechanism in conjunction with low-energy MAC protocols. Besides this, the application layer implements an improved selection and data broadcasting policy that complies with the application. Wireless channels assist in measuring the frequency between the hub and four predominant nodes located on the upper body, posterior side, and upper arms. The simulation outcomes demonstrate that when the hub is positioned on the temple, it achieves better results, while the least favorable overall performance is attained when the hub is located on the waist.
Another technique uses a static-size sliding window to recognize local maximum data bytes. After a predefined interval, a unique pattern can be detected between two windows taken from the input string. However, this requires a longer processing time and increases the computation costs. The major drawback of this scheme is that it is not suitable for use in deduplication mechanisms for healthcare data, as the lives of patients rely on the effectiveness of the adopted technique [32]. In [33], the chunking algorithm incorporates a cut point positioned to the right of a fixed-size window, as in LMC. This technique is named the Asymmetric Extremum (AE) or Rapid Asymmetric Maximum (RAM). In place of hashing, byte values are considered to determine cut points. To manage this, the RAM utilizes both fixed and variable-sized windows to find the byte with the highest value, which then functions as the cut point for the chunk. Ultimately, this results in fewer comparisons, which assists in lessening computational energy consumption. This method places the higher or larger value at a crucial point between two consecutive windows. The fundamental difference is that, contrary to former techniques, it uses a window with a variable size on one side and a fixed size on the other side. Once the cut points are determined, the hashing procedure is omitted [34].
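The following is a simplified C# sketch of this RAM/AE-style cut-point search: the maximum byte inside a fixed-size window at the start of a chunk is recorded, and the first later byte exceeding it becomes the breakpoint, with no hashing involved. The 250-byte fixed window mirrors the value reported in Section 5; everything else is an illustrative simplification of the cited schemes.

using System;
using System.Text;

// Sketch of a RAM/AE-style cut point: record the maximum byte value of the
// fixed-size window, then take the first subsequent byte that exceeds it.
public static class RamStyleCutPoint
{
    public static int FindCutPoint(byte[] data, int start, int fixedWindow = 250)
    {
        int fixedEnd = Math.Min(start + fixedWindow, data.Length);
        byte localMax = 0;
        for (int i = start; i < fixedEnd; i++)          // scan the fixed-size window
            if (data[i] > localMax) localMax = data[i];

        for (int i = fixedEnd; i < data.Length; i++)    // variable-size region
            if (data[i] > localMax)
                return i;                                // first larger byte is the cut point

        return data.Length;                              // no larger byte: chunk runs to the end
    }

    public static void Main()
    {
        byte[] stream = Encoding.UTF8.GetBytes(new string('a', 300) + "z" + new string('b', 50));
        Console.WriteLine(FindCutPoint(stream, 0));      // cut point at the 'z' byte (index 300)
    }
}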
In [35], to obtain an appropriate window size, the CWCA scheme is introduced in the domain of IoT systems. Patient data aggregation is performed through sensing devices. Afterward, the data are forwarded to a fog server, where they are checked for duplicate values. Healthcare data, including body temperature, glucose, BP, sugar level, and cholesterol data, exist in string format. The CWCA dynamically controls the size of the chunks by splitting the data into segments based on a delimiter and window size w. Three core conditions are imposed for the selection of the chunk size. If the array size exceeds w, then the last item is eliminated, ensuring that the chunks fit within the window. If the size of the selected array is equal to w, the chunks are returned without any adjustment. Finally, if the array size is equal to or greater than 75% of w and less than w, the array is returned without deleting any items. This technique ensures that the chunking process adjusts to the dynamically controlled window size, enabling the recognition of duplicates and enhancing the effectiveness in processing duplicate-free healthcare data. G. Neelamegam et al. presented a window size chunking algorithm with a biased sampling-based Bloom filter using Advanced Signature-Based Encryption (WCA-BF + ASE) [36]. The Bloom filter is applied to its output to further detect duplication. The data elements are added, and membership verification is performed using a hash function. The bit positions in the Bloom filter corresponding to the hashed values are set to 1, indicating the insertion of the new element. This procedure guarantees that if the same data element is found in the data stream again, the resultant bit positions are already set to 1, indicating duplicates. To enhance privacy, ASE is used so that only legal authorities can obtain access to the data. This scheme bears a high computational cost [36].
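As an illustration of the three CWCA window rules listed above, the following C# sketch splits a reading string on a delimiter and applies the w-based checks; interpreting w as a count of delimiter-separated readings (rather than bytes) is an assumption made here for clarity.

using System;
using System.Collections.Generic;
using System.Linq;

// Sketch of CWCA-style window rules applied to an array of readings.
public static class CwcaWindowRules
{
    public static List<string> SelectChunk(string data, char delimiter, int w)
    {
        var items = data.Split(delimiter).ToList();

        // Rule 1: if the array exceeds w, drop trailing items so the chunk fits the window.
        while (items.Count > w)
            items.RemoveAt(items.Count - 1);

        // Rules 2 and 3: arrays of exactly w items, or at least 75% of w, are returned as-is.
        if (items.Count == w || items.Count >= 0.75 * w)
            return items;

        // Arrays below 75% of w are also returned unchanged in this sketch; the cited
        // scheme would keep accumulating readings until a window criterion is met.
        return items;
    }

    public static void Main()
    {
        var chunk = SelectChunk("98.6,120,80,97,72,36.9,110,75", ',', 6);
        Console.WriteLine(string.Join("|", chunk));   // first six readings
    }
}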

3. System Model and Problem Statement

The system model shows patients and employees with sensor nodes on their garments and bodies to obtain healthcare data values for further transmission. The collector node, having access to onboard sensors, can gather the sensed data from the connected devices. The data readings contain redundant values, such as those reflecting the body temperature of a patient, which do not fluctuate quickly. The same applies to the BP, ECG, oxygen saturation level, etc. These duplicated values need to be recognized before transmitting the data to the server so that the communication costs can be reduced. The first layer performs the function of data aggregation from the attached body sensors. The second one includes a fog layer situated at the cloud’s edge. Fog servers possess the capability to evaluate a patient’s health condition based on the obtained data readings and communicate with the collector node to take immediate action if needed. The third and last one is the cloud layer, serving as the foundation for the processing of the data and their storage, as illustrated in Figure 1.
The core issue in the basic techniques is that the variable-sized windows can be much smaller in size, as their length is dynamically calculated on the basis of a breakpoint. This may affect the average chunk size, which should be large so as to identify more duplicate values and obtain better deduplication ratios. There could be a scenario in which the local maximum byte (LMB) is not found under schemes such as RAM and CWCA because the criteria used to find the breakpoint are not met. This ultimately results in an uncertain situation or a failure to start the subsequent window. In the current study, we implemented the same functionality as in RAM by including windows of both static and variable size. To recognize breakpoints in variable-sized windows, the scheme traverses the incremental side of the window next to a static-sized window. This results in successful cut-point identification by varying the chunk size until a valid breakpoint is identified, as explored in detail in the next section.

4. Proposed Solution

We propose a novel Delimiter-Based Incremental Chunking Algorithm (DICA) for intelligent healthcare IoT, including a fog server that is deployed at the network's edge. It reduces the transmission delays for healthcare data sent through the fog server to cloud repositories. We introduce an adaptive chunking algorithm at the data sink for secure message transmission from the sensor nodes to the CDs. It forwards the information to the fog for processing, storage, and analysis. Wearable health sensors, such as sensor-equipped wristwatches, are common in daily life. The timely identification of lethal viruses, such as the Chikungunya virus, by sensor devices has the potential to save human lives. These serious viruses tend to damage multiple organs. In areas with scarce medical facilities, the risk of disease outbreaks is high. The proposed model, as in [37], uses a three-layer fog and cloud system. The CD conducts the initial deduplication of the aggregated data prior to sending them to the fog server. The fog entity, after executing the deduplication process on the basis of CDC, forwards the data to the cloud server. This chunking mechanism eliminates redundant data, optimizing the storage space at the cloud server. It not only reduces the required bandwidth but also minimizes delays. The cloud server includes data repositories, ensuring that the data are securely stored and remain available on demand.
Duplicated values must be identified before transmission to reduce the communication cost. It is also necessary to identify significant changes in readings based on health parameter thresholds. This scheme is applicable for collaboration between relevant medical staff for the analysis of patients’ histories. Furthermore, sensing devices can continuously transmit health reports to the CD. This functionality ensures that practitioners are aware of a patient’s current condition and enables them to take immediate action if an emergency occurs. The local storage on the fog server-side functions similarly to cache memory, as it keeps recent data available. This setup assures that critical data readings are accessible, enabling fast decisions as per the current condition of a patient. The symbols used in the algorithm are provided in Table 1.
The collector device is essential for deduplication, gathering data from sensors, and transferring them to the SN or fog servers. The CDs must implement data deduplication at the primary level. For this purpose, replicated data values are replaced with a Boolean digit. The Boolean representation of a value means that an identical reading exists in the data stream that was transferred previously. Until any significant change is found in the readings, the data remain in a 1-bit Boolean format. The CDs encrypt the obtained data by utilizing the secret key, which is shared with the sink node. Next, the sink performs deduplication on large volumes of data using the DICA. This method assists in identifying data chunks of large sizes, as extensive data are presented for sharing with the cloud. These data are stored in a repository to be used later for analysis. The transmission of information from the sink to the cloud server is protected via encryption to avoid security threats. The CDs furthermore compute hash values of the healthcare information to protect it from bit-tampering attacks, as well as to ensure data integrity. Timestamps are also included to avoid replay attacks. These checks do not merely minimize communication costs and energy consumption; they also lessen communication delays between the sensor nodes and cloud storage. In the present work, a comprehensive description of the data security measures is not given, as it is beyond the scope, and the main emphasis is on data deduplication.
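The first-level deduplication at the collector can be pictured with the short C# sketch below: a reading identical to the previously transmitted value for the same sensor is replaced with a single Boolean digit, and only changed readings are sent in full. The payload format and the "1" marker are illustrative assumptions rather than the exact on-air encoding.

using System;
using System.Collections.Generic;

// Sketch of first-level deduplication at the collector device: duplicate
// readings are replaced by a 1-bit Boolean marker before transmission.
public sealed class CollectorDedup
{
    private readonly Dictionary<string, string> _lastSent = new Dictionary<string, string>();

    public string PrepareForTransmission(string sensorId, string reading)
    {
        if (_lastSent.TryGetValue(sensorId, out var previous) && previous == reading)
            return "1";                    // duplicate reading -> Boolean marker

        _lastSent[sensorId] = reading;     // remember the value actually transmitted
        return reading;                    // changed reading -> send in full
    }

    public static void Main()
    {
        var cd = new CollectorDedup();
        Console.WriteLine(cd.PrepareForTransmission("temp", "98.6")); // 98.6 (new)
        Console.WriteLine(cd.PrepareForTransmission("temp", "98.6")); // 1    (duplicate)
        Console.WriteLine(cd.PrepareForTransmission("temp", "99.1")); // 99.1 (changed)
    }
}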
Healthcare sensor readings are gathered and stored in repositories. The identification of the gathered data enables the recognition of the source device, and then the values are stored according to the patients or users for further evaluation. This procedure also aids in maintaining records and in the appropriate generation of alarms in the event of an emergency. Doctors and nurses can utilize this history to evaluate patients' health. In this case, a cost-effective method is employed at the sink node, utilizing the CDC procedure for data deduplication. The procedure starts by discovering a breakpoint or pivot to establish a window of a fixed size. Subsequently, a window of variable size is created by adding another breakpoint with a threshold-sized window; ultimately, this results in a large data chunk that increases the deduplication performance. The usefulness of this procedure relies on maintaining a satisfactory window size win_s and on incremental chunking (IC). We consider IC to limit the chunk size when the IC value is greater than win_s. Conversely, if IC1 is lower than win_s, the system checks a second-level delimiter RB to capture extra bits and generate a larger IC2, thereby expanding the total length of the variable-sized window.
The Delimiter-Based Incremental Chunking Algorithm (DICA) presents a chunking mechanism that involves fixed-size and variable-sized chunks, as elaborated in Algorithm 1. It provides the step-by-step details of the cut-point procedure. It initiates with an iteration starting from index k and ends when the string length is reached. The fixed-length chunk is denoted as ListChunk_i[0] = k + 1, where ListChunk denotes the list of both types of chunks indexed by i; index 0 indicates the starting index of the chunk, whereas index 1 represents the ending index, which is calculated by adding the fixed-length chunk size FLC_size. The starting index of IC is only one index after the fixed chunk. The ending index of IC is obtained through the function Data_Chunking_BreakPoint(k, sLen), which takes index k and string length sLen as input. The function returns the ending index of the IC. At the start, the function extracts a substring(k, sLen, readings), where k is the starting index and sLen is the ending index, selected from the main string readings. The s_readings string is split into two tokens, which are saved into an array named readings_br_sLen1[]. Afterward, the algorithm checks the size of the initial token to determine whether it is less than or equal to max_sized_chunk. If this condition is met, it further checks whether the data size is larger than or equal to win_s. If this holds true, the algorithm returns the size of the chunk. If the chunk is smaller than win_s, a larger data chunk is created by finding the next index of RA and assigning this index to k. In this way, it continues to increase the chunk size until it exceeds win_s.
Algorithm 1: Delimiter-Based Incremental Chunking Algorithm (DICA)
Input: Input string: readings, data length: sLen
Output: List of chunks ListChunk as per appropriate breakpoint k
Presumed information: size of window win_s, max_sized_chunk
1.   Set k as −1
2.   While k < sLen do
3.          ListChunk_i[0] = k + 1
4.          ListChunk_i[1] = k = k + FLC_size
5.          ListChunk_{i+1}[0] = k + 1
6.          k = Function Data_Chunking_BreakPoint(k, sLen)
7.          ListChunk_{i+1}[1] = k
8.   End While
9.   Function Data_Chunking_BreakPoint(index k, sLen)
10.     Set s_readings = substring(k, sLen, readings)
11.     readings_br_sLen1[] = Split(s_readings, RA)   // Two parts
12.     Set p as 0   // for 1st part of the split readings
13.     If size(readings_br_sLen1[p]) <= max_sized_chunk then
14.            If size(readings_br_sLen1[p]) > win_s then
15.                   return k = k + size(readings_br_sLen1[p])
16.            Else
17.                   While size(readings_br_sLen1[p]) < win_s do
18.                          k = k + IndexOf(readings_br_sLen1[p], RA)
19.                   End While
20.                   return k
21.            End If
22.     Else If size(readings_br_sLen1[p]) > max_sized_chunk then
23.            Set q as 0   // for 1st part of the split readings_br
24.            readings_br_sLen2[q] = Split(readings_br_sLen1[p], RB)
25.            If size(readings_br_sLen2[q]) >= win_s then
26.                   return k = k + size(readings_br_sLen2[q])
27.            Else
28.                   q = q + 1
29.                   return k = k + IndexOf(readings_br_sLen2[q], RB)
30.            End If
31.     End If
32. End Function
In situations where the size of the chunk is larger than max_sized_chunk, the first data string is divided to obtain another string. Here, p is the index into readings_br_sLen1[], and the data chunk is found based on this index. In the same way, q is the index into readings_br_sLen2[]. In the proposed DICA, the size of the variable-sized window is handled based on the delimiter that is used in the collected data to distinguish between the different values associated with the healthcare parameters. In this algorithm, we employ the delimiter RA to segment the data into chunks, particularly during the procedure of finding a window of variable size. Each sensing device appends a delimiter when aggregating the data; these delimiters help in distinguishing the healthcare parameters from each other. Furthermore, the aggregator nodes employ a distinct delimiter to aggregate the data received from various sensors. This approach lessens the cost of identifying the next suitable breakpoint. The algorithm's complexity is linear, represented as O(n), due to the iterative operation used to find the index values. A second iteration is also required to maintain the list of chunks. The storage of data in the cloud repository is illustrated in Algorithm 2. The aggregated data received from the fog server are transmitted to the cloud. Before these data are saved in the repository, they are extracted and checked for duplicated values. If a data item's value is the same as the previously received value, it is replaced with a Boolean digit and stored in the cloud repository; otherwise, it is stored in its full form.
Algorithm 2: Receiving and Extracting Data at Cloud
Input: Received Aggregated data
Output: Stored Deduplicated Data At Cloud
Function save_data (DataItem)
1. Extract DataItem
2.     If DataItemValue equals previousDataValue then
3.             save as Boolean digit in cloud repository
4.     Else
5.             save data in their original form in cloud repository
6.     End if
7. End function
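To summarize the chunking flow of Algorithm 1 in executable form, the following is a simplified C# interpretation: each iteration emits a fixed-length chunk and then a variable-length chunk that grows at RA delimiter boundaries until it reaches win_s, bounded above by max_sized_chunk. It is a sketch of the idea rather than the exact algorithm (the RB fallback and the index bookkeeping of ListChunk are omitted), and the parameter values in Main are illustrative; the Boolean replacement of Algorithm 2 follows the same pattern as the collector-level sketch shown earlier.

using System;
using System.Collections.Generic;

// Simplified interpretation of the DICA chunking loop (fixed-length chunk
// followed by a delimiter-driven, incrementally grown variable-length chunk).
public static class DicaSketch
{
    public static List<string> Chunk(string readings, int flcSize, int winS,
                                     int maxSizedChunk, char ra)
    {
        var chunks = new List<string>();
        int k = 0;
        while (k < readings.Length)
        {
            // Fixed-length chunk.
            int fixedLen = Math.Min(flcSize, readings.Length - k);
            chunks.Add(readings.Substring(k, fixedLen));
            k += fixedLen;
            if (k >= readings.Length) break;

            // Variable-length (incremental) chunk: extend to successive RA
            // delimiters until the chunk is at least winS characters long.
            int end = k;
            while (end - k < winS && end < readings.Length)
            {
                int next = readings.IndexOf(ra, end);
                end = (next < 0) ? readings.Length : next + 1;
            }
            end = Math.Min(end, Math.Min(k + maxSizedChunk, readings.Length));
            chunks.Add(readings.Substring(k, end - k));
            k = end;
        }
        return chunks;
    }

    public static void Main()
    {
        string readings = string.Join(",", new[]
            { "98.6", "120/80", "97", "72", "98.6", "120/80", "96", "73", "98.7" });
        foreach (var c in Chunk(readings, flcSize: 8, winS: 10, maxSizedChunk: 20, ra: ','))
            Console.WriteLine($"[{c}]");
    }
}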

5. Results and Analysis

To check the performance of the proposed scheme, extensive simulations were performed using Visual Studio 2019 with C# as the programming language on .NET Framework 4.7, where the healthcare values were taken from a SQL Server 2015 database. A comma (,) was used as a delimiter for the separation of health readings and their addition to the database. At the server level, C# was used to create the setup, using ASP.NET and WPF services for deployment on the Azure cloud to execute the deduplication functionality of the DICA and the base schemes. The proposed scheme, the DICA, was compared with several other robust techniques, including DSW [23], RAM [34], CWCA [35], and WCA-BF + ASE [36]. Afterward, the healthcare-related data readings were used to examine the number of chunks and the average chunk sizes for different numbers of chunks.
The simulation in this study utilized real-time values within the typical ranges for various health parameters, ensuring a realistic evaluation of the proposed technique. The data included body temperature and blood pressure readings reflecting normal and abnormal conditions, electrocardiogram (ECG) data capturing the heart’s electrical activity, and oxygen saturation levels within the physiological range. By employing these real-time values, the simulation accurately mirrored real-world scenarios, enabling a reliable evaluation of the technique’s performance in a healthcare IoT environment. The simulation parameters and their corresponding values are presented in Table 2.
Besides the server side, an Android application was developed at the client-side level. The mobile app facilitates the addition of further parameters based on readings taken from patients. It was ensured that the healthcare data do not fall outside the minimum and maximum data ranges; for example, the temperature must not lie beyond 106 °F. To improve the accuracy, we used the simulation tool NS-2.35 on the Ubuntu operating system. It was employed to perform low-level calculations for the exchange of messages between sensor devices, as well as to examine energy utilization and residual energy levels. For the sensor nodes, collector devices, and sink nodes, separate classes were created to handle the node arrangement and corresponding functions. TCL files were utilized to set up the healthcare parameters and transmit messages; this was managed using the C code, which performs the message transmission and receiving functions, along with setting the data packet parameters. Besides this, in the TCL files, the setdest(·) function was specified to observe patients' movements. To examine the residual energy after the data collection, data transmission, and deduplication operations, an energy model was employed. Trace files record the residual energy of all types of nodes, including the sensor nodes and collector devices, which were further analyzed using an AWK script. Several evaluation metrics were employed to analyze the proposed scheme, DICA, and the former techniques.

5.1. Number of Chunks

Minimal data chunks with an optimal size are advantageous as they reduce the computation overhead. Furthermore, a balanced number of chunks leads to better storage consumption at the fog server and cloud. Balanced data chunks improve bandwidth consumption and throughput, which is essential, especially in the context of the healthcare sector. In the case of the proposed technique, it dynamically adjusts the size of the chunks on the basis of the healthcare data. Furthermore, the usage of appropriate delimiters to segment the data and the identification of appropriate breakpoints ensure that the chunks are stable in number and optimal in size. The RAM has a slightly larger number of data chunks because it depends on extreme data values existing in the data to find a breakpoint. The RAM is sensitive to data variations, analyzing fixed and variable windows to determine valid breakpoints. In cases where the data readings exhibit frequent variations, the scheme generates smaller chunks than required. The CWCA and WCA-BF + ASE schemes maintain a better number of chunks by dynamically setting the chunk size and focusing on the minimum and maximum threshold values for the window size. Figure 2 illustrates the generated number of chunks throughout the deduplication mechanism. The input data string consists of different sizes, ranging from 10,000 to 50,000 bytes. DSW utilizes two sliding windows with variable intervals to segment the data. This segmentation generates many smaller-sized chunks, making the overall number of chunks slightly higher. By ensuring that the chunks are neither too small nor too large, the DICA enhances the efficiency of data storage and transmission, minimizing the number of chunks and the associated processing costs. To check the efficiency of the proposed scheme, the average number of chunks was considered. The simulation results demonstrate that for a string of 20,000 bytes, the RAM generated 33.9, CWCA produced 31.3, WCA-BF + ASE produced 32.8, DSW attained 32.1, and the proposed DICA generated 30.5 chunks on average.

5.2. Average Chunk Size

Throughout the chunking mechanism within the context of deduplication, it is essential to find an appropriate breakpoint. This has a substantial effect on the size of the chunks. If the breakpoint is close to a fixed window, this results in a smaller average chunk size. In the DICA, optimal-sized chunks are attained by setting a variable-sized window and using an appropriate threshold. If the size of the window is smaller than the defined size, the breakpoint is reset dynamically. This avoids the generation of numerous small-sized data chunks that degrade the performance of the scheme. Additionally, an optimal average chunk size allows the DICA to maintain low latency and reduces its memory utilization, thereby ensuring smooth operation in resource-constrained IoMT environments. The CWCA and WCA-BF + ASE schemes also attain optimum data chunks because the minimum value is set to prevent the generation of excessively small chunks. Similarly, the maximum value stops the production of very large data chunks. The RAM attains a considerable number of smaller chunks, and it creates increasingly smaller chunks when the input data strings exhibit more fluctuations. DSW determines chunk boundaries based on content rather than fixed intervals. By using hash values and specific conditions to find breakpoints, the algorithm creates chunks that can vary in size but are often smaller due to the frequent changes in the data patterns. In Figure 3, the average chunk size is presented. The results demonstrate that the RAM attained an average chunk size of 664 bytes, while CWCA and WCA-BF + ASE maintained 670 bytes, DSW attained 640 bytes, and the DICA achieved an average of 680 bytes.
Figure 4 highlights the difference between IC and fixed-sized windows. Mostly, the dynamic window is greater than the fixed-sized window. In the case of a fixed-sized window, all schemes maintained fixed windows of 250 bytes. In the DSW technique, the dynamic data blocks are smaller due to the dual-window mechanism, which frequently identifies breakpoints based on two scenarios. When the Rabin fingerprint in W1 meets a specific residual condition, a breakpoint is established. When the hash value of W1 equals the initial hash value of W2, another breakpoint is established. This increased sensitivity to data variations results in more frequent chunking. The DICA has better IC due to its adaptive chunking approach, efficient use of delimiters, and effective data deduplication. The DICA dynamically adjusts the chunk sizes by identifying breakpoints using the RA and RB delimiters, which segment the data efficiently. In the context of the RAM, the chunk size can become very small because there is no explicit limit applied to the minimum window size. The absence of a limit means that the RAM will continue to create small chunks. CWCA and WCA-BF + ASE have better IC as they use the window size chunking mechanism to dynamically adjust the data chunks based on a predefined window size, ensuring efficient data processing. The proposed scheme attained 680 bytes, whereas CWCA, RAM, DSW, and WCA-BF + ASE maintained 670 bytes, 664 bytes, 640 bytes, and 670 bytes, respectively. The results show that IC increased by up to 66.7%, 68%, 62%, and 72.1% for the RAM, CWCA, DSW, and DICA schemes, respectively.

5.3. Cut-Point Identification Failure

Cut-point identification is of great importance in chunking. The determination of appropriate points to divide a data stream into manageable segments increases the efficiency of the scheme. If an appropriate breakpoint is not achieved, this results in the imprecise evaluation of patients and also increases the processing time and computational cost. To check the performance of the DICA, a simulation was performed regarding this metric. Figure 5 shows the likelihood of failure in finding the most appropriate breakpoint within chunking. Sometimes, the optimum breakpoint is missed due to incorrect identification measures. A multiplier (ψ) is employed to determine the computational cost. The DICA maintains the minimum cut-point identification failures as it splits the data string into tokens, evaluates the sizes of chunks, and determines the breakpoint by dynamically adjusting the chunk size based on the window size criteria. By refining the breakpoint accuracy, the overall data reduction and transmission efficiency are enhanced in the proposed technique. In a situation in which the chunk size is larger than the defined threshold, the scheme navigates through the data stream to ensure appropriate cut-point identification.
The RAM scheme has fewer cut-point identification failures; it employs a window of fixed size situated at the start of the data chunk. The RAM focuses on finding greater data values within the window. It iteratively finds a byte greater than the existing maximum value, which is determined as the breakpoint for a specific chunk. There is a high probability that the preceding byte will be smaller. The CWCA and WCA-BF + ASE schemes achieve better cut-point identification. In the DSW approach, if a breakpoint based on a delimiter is not found, the algorithm might cut at an arbitrary position. This could result in splitting a data reading into two parts, potentially rendering the value incomplete and unreadable. In the healthcare sector, when data values are split, the resulting chunks might contain incomplete or corrupted information, making it unreadable or unusable. For instance, blood pressure measurement data could be divided in such a way that neither chunk contains the full value, making the data meaningless. The x-axis shows the multiplier (ψ), whereas the vertical axis illustrates the likelihood of finding no breakpoints in the string. When ψ = 4, the DICA has the lowest probability of 0.38, while the RAM, CWCA, WCA-BF + ASE, and DSW attain probabilities of 0.55, 0.42, 0.45, and 0.54, respectively.

5.4. Throughput

Throughput refers to the rate at which the transmission of deduplicated healthcare data toward a cloud repository takes place. It can be defined as in Equation (2), where α represents the transaction process, f_s is the size of the data, Avg_f is the average transaction size, and N_t is the time for nodes in seconds.

$$\alpha = \frac{f_s}{Avg_f} \cdot \frac{1}{N_t} \qquad (2)$$
Figure 6 shows the throughput of the different schemes for different data sizes. The proposed DICA scheme’s better chunking procedure contributes to smooth healthcare data processing, which ultimately results in high throughput during deduplicated data transfer to the cloud repository. Along with this, the high throughput is credited to the algorithm’s efficient deduplication process, which reduces the data volume and accelerates transmission. Thus, the increased throughput guarantees the timely delivery of critical health data, improving real-time monitoring and decision-making in IoMT systems. The WCA-BF + ASE scheme maintains high throughput to alleviate the redundant data volume by employing a Bloom filter, which eventually enhances the throughput. The RAM has comparatively low throughput because of its byte-by-byte comparison. The scheme scans every byte, which lowers its transmission speed. The DSW technique has low throughput due to managing two sliding windows simultaneously. Each window requires continuous hash value calculations and comparisons to determine the breakpoints, which increases the processing time. Moreover, managing a larger number of smaller chunks impacts the throughput, as more resources are needed to handle the chunks. For a data size of 400 KB, the RAM has a throughput of 466, while the CWCA, WCA-BF + ASE, DSW, and DICA schemes have values of 490, 950, 935, and 954 bits/sec, respectively.

5.5. Energy Efficiency

Energy efficiency refers to the energy saved during computational operations related to dynamic window creation, byte comparisons, and the verification of each data string. The low energy efficiency of the RAM scheme is linked to its high complexity in computation, its dynamic breakpoint adjustments, and the probabilistic nature of the byte comparisons. In CWCA, by controlling the chunk size within specified thresholds, the algorithm minimizes unnecessary computational operations and redundant data handling. Efficient string processing using delimiters and early termination conditions reduces energy consumption. The WCA-BF + ASE scheme has relatively low energy efficacy, as the Bloom filter’s probabilistic approach involves multiple hash operations and bit manipulations for each data string, increasing the energy needed for verification and duplicate detection. This constant processing and data handling contribute to the overall higher energy usage. In the DSW technique, the energy consumption is higher due to the continuous computational operations required to manage two sliding windows. It involves constant byte comparisons and frequent hash calculations for each window position to determine the breakpoints. This demands significant processing power and memory usage, leading to increased energy consumption during these operations. The DICA achieves higher energy efficiency, as it has a lower computational cost and does not utilize excessive energy; it achieves this by identifying the dynamic chunk sizes based on delimiters and achieving better chunk sizes as well. Figure 7 shows that for a data size of 800 KB, the RAM, WCA-BF + ASE, DSW, CWCA, and DICA schemes maintain energy efficiency of 420, 530, 550, 635, and 860 operations per unit time, respectively.

5.6. Computational Overhead at Fog Server

The CDs transmit the collected information to the fog server. This fog server performs some processing to refine the data before transmitting them to the cloud repository. The computational overhead is high in the WCA-BF + ASE scheme due to the extensive operation of encryption, involving mathematical operations, hashing, and digital signatures, placing a high computation load on the fog server. In the RAM, the number of comparisons and unconditional branches increases the overhead. In DSW, the reason for the high computational overhead is the frequent re-evaluation of the breakpoints using complex conditions and the processing of additional mechanisms for deduplication, which further increase the computational complexity and resources required. The DICA and CWCA schemes dynamically set the chunk size based on the data characteristics, ensuring that only essential data are processed without needless computations that increase the overhead. The proposed algorithm ensures that the fog servers can handle large-scale data processing without compromising performance. Figure 8 shows the computational overhead at the fog node for all schemes under different numbers of data chunks. The RAM scheme consumes the most computational time of 740 ms; the CWCA, DSW, and WCA-BF + ASE schemes consume 530 ms, 600 ms, and 720 ms; and the proposed scheme, DICA, consumes only 500 ms.

5.7. Energy Consumption

In IoMT, collector devices play a critical role in gathering data from the wearable devices fixed to the human body. These CDs expend energy on data processing, such as deduplication, and on transmitting the data to the fog server; their energy consumption is therefore an important metric to examine. Figure 9 illustrates the energy consumption of the collector devices. At time t = 0.6, CD1, CD2, and CD3 consume 0.0046 µJ, 0.0043 µJ, and 0.0051 µJ, respectively. The results indicate that collector devices such as CD1 and CD2 consume 65% more energy than the sensing devices S1, S8, and S12. Furthermore, during deduplication and data transmission, CD3 consumes 68% more energy.
The sensing devices are responsible for frequently taking data readings from the patient’s body, and this continuous monitoring consumes energy. To examine the energy consumption at the sensor level, an extensive simulation was performed. The NS-2 tool was used to develop an energy model, and the remaining power of each node was recorded in a trace file. The power utilization was then determined from the difference in energy at a particular time t, using the trace files produced during the simulation. Figure 10 presents the energy consumption of the sensing devices; the x-axis shows the elapsed simulation time, whereas the y-axis represents the energy consumption in micro-Joules. The results show that sensing devices S1, S6, S8, and S12 consumed 0.00296 µJ, 0.00297 µJ, 0.00299 µJ, and 0.00297 µJ, respectively, at t = 0.4. The energy utilization values were first obtained for the sensing devices and were then extracted for the CDs.
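The following minimal C sketch shows one way the per-node energy consumption could be derived from such trace data. It assumes a hypothetical pre-filtered trace file (energy.tr) whose lines contain only the time, node identifier, and remaining energy; the actual NS-2 trace format carries additional fields, so this is a simplification rather than the exact post-processing used in the experiments.

/* Illustrative sketch only: compute each node's consumed energy as the
 * difference between its first and latest recorded remaining-energy values,
 * read from a hypothetical pre-filtered trace "energy.tr". */
#include <stdio.h>

#define MAX_NODES 64

int main(void)
{
    FILE *fp = fopen("energy.tr", "r");    /* assumed pre-filtered trace file */
    if (!fp) { perror("energy.tr"); return 1; }

    double first[MAX_NODES], last[MAX_NODES];
    int seen[MAX_NODES] = {0};
    double t, e;
    int id;

    /* Each assumed line: <time> <node_id> <remaining_energy> */
    while (fscanf(fp, "%lf %d %lf", &t, &id, &e) == 3) {
        if (id < 0 || id >= MAX_NODES)
            continue;
        if (!seen[id]) { first[id] = e; seen[id] = 1; }
        last[id] = e;                      /* latest remaining energy so far */
    }
    fclose(fp);

    for (int i = 0; i < MAX_NODES; i++)
        if (seen[i])
            printf("node %d consumed %.5f J\n", i, first[i] - last[i]);
    return 0;
}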

5.8. Discussion

The proposed DICA was compared with the RAM, CWCA, DSW, and WCA-BF + ASE schemes under several important evaluation metrics, focusing on data chunking, the average chunk size, the energy consumption, and the cut-point identification factor. The DICA dynamically adjusts the chunk size based on the healthcare data, resulting in a better number and size of chunks. The RAM scheme shows some sensitivity to the data readings as the fluctuations rise, resulting in slightly smaller chunk sizes and numbers of chunks. The CWCA and WCA-BF + ASE schemes also achieve good chunk numbers and sizes. In terms of cut-point identification, most of the schemes perform well. The DICA minimizes the number of cut-point identification failures by dynamically adjusting the chunk size based on the window size criteria, while the RAM uses a fixed-size window to find breakpoints based on extreme data values. The CWCA and WCA-BF + ASE schemes show good cut-point identification results, further contributing to their efficiency. Overall, the DICA is suitable when the data readings exhibit many fluctuations and low energy and computational costs are a priority, whereas DSW is ideal for scenarios in which maximizing the deduplication efficiency is crucial.
The RAM scheme is a better choice when the data exhibit little variation; as the number of fluctuations increases, its performance degrades, so it is not an optimal choice for data with frequent changes. Although WCA-BF + ASE maintains good performance, it has relatively high energy and computational overheads. This scheme uses a Bloom filter for deduplication, minimizing the likelihood of false positives and false negatives, together with a biased sampling approach; ASE encryption is then performed, which, despite enhancing security, also increases the energy utilization and computational overhead. Thus, this scheme is not appropriate where resources are scarce. The CWCA scheme preserves a larger number of chunks and is suitable when a stable chunking approach is required, since it considers both minimum and maximum thresholds for better data processing.

5.9. Impact of the Proposed Technique on the Healthcare Sector

The integration of the DICA into the IoMT can significantly enhance the efficiency and usefulness of the healthcare sector. By improving data deduplication and reducing redundant transmissions, the proposed technique lowers the bandwidth and storage requirements. This improvement can lead to cost savings for healthcare facilities, permitting resources to be allocated toward more critical areas, such as patient care and medical research. Furthermore, the reduced need for extensive data storage and transmission lessens the load on the cloud and fog servers, increasing their performance and reliability. This efficiency can lead to more responsive healthcare systems capable of handling larger volumes of patient data with minimal delays, thus supporting real-time decision-making. The ability to process and transmit patient data more efficiently has direct implications for patient outcomes. By guaranteeing that critical health data are transmitted with minimal delay and without redundancy, healthcare providers can monitor patients’ conditions more accurately and respond to emergencies more swiftly. This is particularly valuable for patients with chronic conditions or those requiring frequent monitoring, as it decreases the need for frequent hospital visits and allows for timely medical intervention.
Moreover, the reliable and efficient aggregation of health data supports the development of predictive analytics and personalized treatment plans. By comprehensively and accurately evaluating the conditions of patients, healthcare providers can identify patterns and trends to inform their clinical decisions, eventually improving patients’ care and outcomes.

6. Conclusions

The proposed adaptive chunking scheme for smart healthcare IoT offers a robust solution to lessen energy consumption and enhance data deduplication efficiency in the healthcare sector. The sink node obtains healthcare parameters through concatenated strings transmitted by collector devices, and these data strings use delimiters to separate the data of the patients or individuals equipped with wearable sensors. Duplication is controlled in the proposed technique at two levels. At the collector device (CD) level, data deduplication is ensured by replacing identical data values with Boolean digits. At the second level, the sink node further refines the process using the Delimiter-Based Incremental Chunking Algorithm (DICA), employing delimiters and variable-sized windows for optimal chunking. The precise transmission of healthcare data readings permits doctors to make the best decisions at any time, especially in emergencies. To validate the proposed scheme, an extensive simulation was performed. The NS-2.35 simulation tool was used to model the different nodes and their functionalities using the C language and TCL scripts. The experimental results showed that the number of variable-sized windows increased by up to 66.7%, 68%, and 72.1% for the RAM, CWCA, and DICA schemes, respectively. In addition, a better deduplication level is achieved while using minimal energy. The proposed DICA attains improved performance in terms of power consumption, static and dynamic data chunks, energy efficiency, and throughput.
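As an illustration of the CD-level step summarized above, the following C sketch replaces a reading that matches the previously transmitted value with a single Boolean digit before the readings are concatenated with delimiters. The delimiter character, flag encoding, buffer size, and function names are assumptions for demonstration only and do not reproduce the exact scheme.

/* Illustrative sketch only: collector-device-level deduplication in which a
 * repeated reading is replaced by the Boolean digit '1'. A real scheme would
 * need an unambiguous flag encoding if a reading itself could equal 1. */
#include <stdio.h>
#include <string.h>

#define OUT_MAX 256

/* Builds the outgoing string from 'readings'; '1' marks "same as previous". */
void build_payload(const int *readings, size_t n, char *out)
{
    size_t pos = 0;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (pos + 16 >= OUT_MAX)                   /* truncate in this sketch */
            break;
        if (i > 0 && readings[i] == readings[i - 1])
            pos += snprintf(out + pos, OUT_MAX - pos, "1#");       /* duplicate flag */
        else
            pos += snprintf(out + pos, OUT_MAX - pos, "%d#", readings[i]);
    }
}

int main(void)
{
    int heart_rate[] = { 72, 72, 73, 73, 73, 80 };   /* hypothetical readings */
    char payload[OUT_MAX];
    build_payload(heart_rate, 6, payload);
    printf("%s\n", payload);                         /* prints "72#1#73#1#1#80#" */
    return 0;
}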
While the DICA demonstrates significant potential, it has several limitations. The average chunk size is maintained to ensure efficient data-level deduplication, but the scheme is not suitable for file-level deduplication, which requires larger chunk sizes according to the available resources. The dynamic chunk size is kept larger than that of the fixed-size chunks, and the total number of chunks is reduced to lower the computational cost and avoid processing delays; nevertheless, the proposed DICA could be improved further by incorporating reinforcement learning-based chunking, which would enable it to respond dynamically to a variety of situations with different types of data streams. The proposed DICA targets the healthcare scenario and is not fully applicable to deduplication in other scenarios, such as vehicular networks, flat wireless sensor networks, industrial data-sharing networks, and underwater sensor networks. This highlights the need for deduplication schemes that adapt dynamically to the nature of the data shared on the network. Moreover, the sensitive information of patients must be protected against unauthorized access or breaches by applying security and privacy schemes for healthcare data.
In future work, file-level deduplication scenarios will be considered, in which reinforcement learning-based chunking mechanisms may be analyzed for their deduplication performance. Moreover, the impact of deduplication and breakpoint recognition on the cloud will be examined to reduce the storage overhead for healthcare-related data. This will help to address some of the above-mentioned limitations.

Funding

The author extends his appreciation to the Deanship of Scientific Research at Northern Border University, Arar, KSA for funding this research work through the project number “NBU-FPEJ-2024-2429-01”.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The author declares no conflicts of interest.

Abbreviations

IoMT: Internet of Medical Things
TCL: Tool Command Language
WBAN: Wireless Body Area Network
IoT: Internet of Things
CD: Collector Device
CDC: Content-Defined Chunking
DCC: Data Collection Center
QoS: Quality of Service
PDAs: Personal Digital Assistants
LTE: Long-Term Evolution
AODV: Ad hoc On-Demand Distance Vector Routing
BS: Base Station
RF: Radio Frequency
HBC: Human Body Communication
ECG: Electrocardiogram
HN: Head Node
EAD: Elasticity-Aware Deduplication
OPC: Optimus Prime Chunking
DSW: Double Sliding Window

Figure 1. System model.
Figure 2. Average number of chunks.
Figure 3. Average chunk size.
Figure 4. Performances of all schemes under IC and fixed-sized windows.
Figure 5. Likelihood of breakpoint failure.
Figure 6. Throughput.
Figure 7. Energy efficiency.
Figure 8. Computational overhead.
Figure 9. Energy consumption at collector devices (CDs).
Figure 10. Energy consumption at sensing devices.
Table 1. List of symbols used.
win_s: Length of window
K: Breakpoint
Readings: Input string
sLen: Length of string
p, q: Data values at instance of loop
ON HH: Incremental chunking
RA: Delimiters used for integration of information received by sensors
RB: Delimiters used between joined data of many patients
readings_brs_len1[]: Break string by RA
readings_brs_len2[]: Break string by RB
Table 2. Simulation parameters with corresponding values.
Duration of Simulation: 400 s
Size of Data Packet: 300 B
Transmission Power at Node: 0.928 µJ
Receiving Power: 0.052 µJ
Nature of Channel: Radio
Type of Channel: Wireless Physical
Category of MAC: 802.11n
Nature of Queue: Priority Queue
Link Layer Type: Link Layer
Type of Antenna Used: Omnidirectional
Maximum Data Packets in Line: 55
Agent Trace: ON
Router Trace: ON
Trace of MAC: OFF
Number of Multipliers (ψ): 1–5
Time Span: 0.1–1.0 s
Input String (Bytes): 10,000–50,000
