Article

Lossless Compression with Trie-Based Shared Dictionary for Omics Data in Edge–Cloud Frameworks

1 School of Computer, Data and Mathematical Sciences, Western Sydney University, Parramatta, NSW 2170, Australia
2 The Tumour Bank, Children’s Cancer Research Unit, Kids Research, The Children’s Hospital at Westmead, Westmead, NSW 2145, Australia
3 The Discipline of Paediatrics and Child Health, The Faculty of Medicine, The University of Sydney, Sydney, NSW 3006, Australia
4 Faculty of Information Technology, University of Technology Sydney, Sydney, NSW 2007, Australia
5 Australian Artificial Intelligence Institute, Faculty of Engineering and Information Technology, University of Technology Sydney, Sydney, NSW 2007, Australia
* Authors to whom correspondence should be addressed.
J. Sens. Actuator Netw. 2025, 14(2), 41; https://doi.org/10.3390/jsan14020041
Submission received: 3 March 2025 / Revised: 4 April 2025 / Accepted: 6 April 2025 / Published: 9 April 2025

Abstract

The growing complexity and volume of genomic and omics data present critical challenges for storage, transfer, and analysis in edge–cloud platforms. Existing compression techniques often involve trade-offs between efficiency and speed, requiring innovative approaches that ensure scalability and cost-effectiveness. This paper introduces a lossless compression method that integrates Trie-based shared dictionaries within an edge–cloud architecture. It presents a software-centric scientific research process of the design and evaluation of the proposed compression method. By enabling localized preprocessing at the edge, our approach reduces data redundancy before cloud transmission, thereby optimizing both storage and network efficiency. A global shared dictionary is constructed using N-gram analysis to identify and prioritize repeated sequences across multiple files. A lightweight index derived from this dictionary is then pushed to edge nodes, where Trie-based sequence replacement is applied to eliminate redundancy locally. The preprocessed data are subsequently transmitted to the cloud, where advanced compression algorithms, such as Zstd, GZIP, Snappy, and LZ4, further compress them. Evaluation on real patient omics datasets from B-cell Acute Lymphoblastic Leukemia (B-ALL) and Chronic Lymphocytic Leukemia (CLL) demonstrates that edge preprocessing significantly improves compression ratios, reduces upload times, and enhances scalability in hybrid cloud frameworks.

1. Introduction

The landscape of health data management is experiencing a transformative shift, primarily driven by rapid advancements in technology and significant reductions in the costs of genomic sequencing [1,2]. This brisk pace of progress has led to an explosive increase in the volume of data being accumulated, with stored data growing from 120 petabytes in 2016 to 160 petabytes in 2018 alone [3], a growth that outpaces Moore’s law and follows an exponential rather than a linear trend. Concurrently, despite a 500-fold decrease in the cost per gigabyte of disk storage, there has been no substantial drop in pricing by cloud providers over the past two years [4]. Traditional cloud-based storage models therefore face challenges in terms of cost and bandwidth. In response, edge computing has emerged as a viable solution, enabling local data processing before cloud storage and reducing transmission overhead. Many complex omics data types would benefit from new compression strategies that enable edge-to-cloud management and analysis.
Various data compression methods have emerged to address these issues, aiming to reduce storage costs, optimize data transfer, and enhance the efficiency of data handling. Specialized genomic compression tools, such as SPRING [5] and Genozip [6], have set new standards for compressing FASTQ files and other genomic data formats. SPRING focuses on leveraging domain-specific features like reordering reads and utilizing entropy coding to achieve high compression ratios, making it highly effective for DNA sequencing data. Similarly, Genozip supports a wide range of formats, including FASTQ, SAM/BAM, and VCF, employing multi-threaded processing to accelerate compression while maintaining compatibility with various bioinformatics pipelines.
In parallel to these specialized tools, recent studies have introduced hybrid compression frameworks that combine traditional dictionary-based methods with parallel and distributed processing techniques, further enhancing performance on large-scale datasets [7,8]. These approaches exploit modern hardware and cloud architectures to manage the increasing volume and complexity of omics data, bridging the gap between domain-specific and general-purpose solutions.
On the other hand, general-purpose methods, such as Zstandard (Zstd), Snappy, LZ4, and GZIP, continue to provide robust solutions across diverse applications due to their high-speed performance and adaptability [9]. These algorithms rely on dictionary-based techniques, like LZ77, which detect and replace repeated sequences with compact references, thereby reducing redundancy [10,11]. These methods are especially effective in capturing statistical patterns in generic datasets; however, they often fail to optimize compression for genomic and omics data, which contain unique repetitive structures that require specialized handling.
Moreover, researchers have sought to enhance traditional methods by incorporating indexing and machine learning techniques to better capture complex patterns in omics data, achieving improved compression ratios without substantially increasing computational costs [12]. For instance, some methods adapt neural network-based compression ideas—originally developed for images [13]—to the genomic domain.
In addition to these dictionary-based approaches, newer neural-network-driven methods have emerged, exploring adaptive compression by learning contextual patterns in data [5]. However, these methods are still computationally demanding and less practical for real-time or large-scale applications. The trade-off between compression ratio, speed, and scalability remains a significant challenge, particularly for the rapidly expanding datasets in genomics and other omics fields. This challenge is further amplified in portable head-mounted devices, where limited computational power, storage, and network resources necessitate highly efficient compression solutions [14,15].
Recent advances suggest that combining neural-network-driven methods with traditional algorithms can offer a balanced solution that mitigates these issues, delivering high compression efficiency while keeping computational requirements manageable [5,16]. This balance is especially critical for portable devices facing resource constraints.
To address these challenges, this paper proposes a two-stage hybrid compression approach that leverages localized Trie-based sequence replacement at the edge and advanced compression algorithms in the cloud. Our method first constructs a global shared dictionary by applying N-gram analysis to detect frequently recurring sequences across datasets. This dictionary is initially built and maintained in the cloud, but a lightweight index is distributed to edge nodes for local compression and redundancy elimination before data transmission. Unlike traditional sliding window compression techniques [17], which are constrained by memory and window size, our approach leverages a persistent dictionary structure, enabling global sequence matching across multiple datasets. By storing and indexing high-frequency patterns, the system enhances compression efficiency while reducing redundant data transfers between edge and cloud environments.
We hypothesize that a shared dictionary model will progressively improve compression ratios over time while minimizing storage and computational overhead. As more sequences are analyzed and stored, compression efficiency should increase, leading to significant storage savings. The specific paper contributions are as follows:
  • We developed a new method combining a global shared dictionary at the edge with general-purpose compression algorithms, improving sequence matching and redundancy reduction.
  • We validated the method on real-world omics datasets, achieving compression ratios of up to 68.2% and outperforming traditional methods.

2. Related Work

The exponential increase in genomic and omics data poses significant challenges for storage, transmission, and real-time analysis. As noted in the introduction, traditional cloud-based storage models and compression techniques face inherent trade-offs between efficiency and speed, while the rapid growth in data volume demands innovative, scalable, and cost-effective solutions. In response, various methods have been proposed, each with strengths and limitations as outlined in Table 1.

3. Compression Method

This study builds on a framework for genomic data management [18] by extending its methodologies to incorporate hybrid edge–cloud compression techniques. The proposed compression model integrates N-gram sequence identification with dictionary-based compression and leverages a global shared dictionary to handle large-scale omics datasets. By extending traditional methods like LZ77 and LZW [19] with global sequence recognition, our approach improves both redundancy elimination and storage efficiency. This study adheres to the STROBE reporting guidelines [20] as outlined by the EQUATOR Network [21], ensuring transparency and methodological rigor.

3.1. System Model

The proposed method targets environments that generate large-scale omics data, such as clinical institutions, hospitals, and research laboratories. In these settings, the growing volume of data necessitates efficient storage solutions and rapid transmission. Our approach employs a hybrid edge–cloud architecture to efficiently compress large-scale omics data by dividing the processing workload between resource-limited edge devices—located at the data generation sites—and a robust cloud infrastructure.
Edge devices are constrained by limited computational capabilities and storage capacity. Consequently, lightweight preprocessing methods, such as our Trie-based shared dictionary, are essential to minimize the computational load and reduce data redundancy locally before transmission. Moreover, data transmission from edge devices to the cloud is often restricted by limited bandwidth and associated costs. Minimizing the size of transmitted data is therefore crucial, which underscores the importance of our two-stage compression approach: initial redundancy elimination at the edge significantly reduces the subsequent network load.
This operational context and its challenges justify our hybrid method, highlighting the practical significance and applicability of the proposed hybrid edge–cloud compression workflow.

3.2. Global Dictionary Construction

The global dictionary is constructed through N-gram extraction and refinement, resulting in a shared dictionary hosted on Microsoft SQL Server 2022, as illustrated in Figure 1. This approach is designed to efficiently manage large-scale datasets and optimize global sequence matching.
  • N-Gram Extraction (Figure 1, Step 1): The process begins by extracting contiguous subsequences (N-grams) of length 15 from the dataset using a sliding window approach (a minimal extraction sketch follows this list). Our prior analysis indicated that no repeated patterns occurred above this length, making 15-grams an optimal starting point for capturing meaningful patterns. The sequence length is then progressively reduced (e.g., 14-grams, 13-grams) down to a minimum of 1.
  • N-Gram Storage (Figure 1, Step 2): All extracted N-grams are stored in a centralized repository, ensuring that potential patterns are retained for future reference. Offloading N-gram storage and processing to SQL Server addresses memory limitations associated with in-memory sliding window methods, allowing the system to handle large datasets without compromising performance.
  • Dictionary Refinement (Figure 1, Step 3): Once extracted, the next step is to refine these N-grams into a functional dictionary optimized for compression. From our experiments, only high-frequency N-grams, such as those with over 100 occurrences and longer than eight characters, are selected. Each selected N-gram is then assigned a unique code for use in sequence replacement during compression. This refinement process creates a compact and scalable dictionary that prioritizes high-impact patterns while excluding less frequent ones, thereby facilitating efficient global sequence matching and redundancy elimination.
  • Shared Dictionary Construction (Figure 1, Step 4): Finally, a lightweight index derived from the refined dictionary is pushed to edge nodes. This enables local sequence matching and redundancy elimination before data transmission, reducing overall network bandwidth consumption and improving compression performance.
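To make the extraction step concrete, the following C# sketch counts every contiguous N-gram with a sliding window, scanning from the maximum length of 15 down to 1. It is a minimal illustration under simplifying assumptions: the class and method names are invented here, the input is treated as an in-memory string, and in practice the counts would be bulk-loaded into the SQL Server repository rather than held in a dictionary.

```csharp
using System.Collections.Generic;

// Minimal sketch of sliding-window N-gram extraction (Step 1). Names such as
// NGramExtractor and ExtractNGrams are illustrative, not the paper's implementation.
public static class NGramExtractor
{
    // Counts every contiguous substring of length n in the input.
    public static Dictionary<string, long> ExtractNGrams(string data, int n)
    {
        var counts = new Dictionary<string, long>();
        for (int i = 0; i + n <= data.Length; i++)
        {
            string gram = data.Substring(i, n);
            counts[gram] = counts.TryGetValue(gram, out long c) ? c + 1 : 1;
        }
        return counts;
    }

    // Runs the extraction from the maximum length (15) down to 1, mirroring the
    // progressive reduction described above; each result set would then be
    // persisted to the centralized repository (Step 2).
    public static IEnumerable<(int Length, Dictionary<string, long> Counts)> ExtractAll(string data)
    {
        for (int n = 15; n >= 1; n--)
            yield return (n, ExtractNGrams(data, n));
    }
}
```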

3.3. Hybrid Edge–Cloud Compression Workflow

To leverage both edge and cloud computing resources, we propose an edge–cloud compression workflow where global Trie-based redundancy elimination is performed at the edge (Figure 2, Step 1) before further compression is applied in the cloud (Figure 2, Step 2). Adapted from the hybrid edge–cloud framework architecture in [18], this approach aims to reduce data size prior to transmission, optimize bandwidth usage, and improve overall network efficiency.
A Trie (or prefix tree) is a hierarchical data structure commonly used for efficient string searching and pattern matching [22]. By storing common prefixes only once, it enables fast look-ups, insertions, and deletions [23] and is particularly effective for managing large-scale genomic sequences [24]. At the edge layer, an initial copy of the Trie-based shared dictionary—derived from the high-frequency N-grams refined in Section 3.2—is constructed. The Trie efficiently identifies and matches the longest possible sequences, ensuring optimal redundancy elimination before data are transmitted to the cloud [25].
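The sketch below illustrates how such a Trie can drive the edge-side replacement: dictionary entries are inserted character by character, and the encoder greedily substitutes the longest match at each position with its assigned code. The node layout, class names, and delimiter-based code format are illustrative assumptions rather than the authors’ implementation.

```csharp
using System.Collections.Generic;
using System.Text;

// Sketch of Trie-based longest-match replacement at the edge. The class names and
// the delimiter-wrapped code format are assumptions made for illustration only.
public sealed class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new();
    public string Code { get; set; } // non-null when a dictionary entry ends here
}

public sealed class SharedDictionaryTrie
{
    private readonly TrieNode _root = new();

    // Inserts one refined dictionary entry and its assigned code.
    public void Add(string sequence, string code)
    {
        var node = _root;
        foreach (char ch in sequence)
        {
            if (!node.Children.TryGetValue(ch, out var next))
                node.Children[ch] = next = new TrieNode();
            node = next;
        }
        node.Code = code;
    }

    // Scans the input once, replacing the longest dictionary match at each position
    // with its code; unmatched characters pass through unchanged.
    public string Replace(string input)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < input.Length)
        {
            var node = _root;
            int bestLength = 0;
            string bestCode = null;
            for (int j = i; j < input.Length; j++)
            {
                if (!node.Children.TryGetValue(input[j], out node)) break;
                if (node.Code != null) { bestLength = j - i + 1; bestCode = node.Code; }
            }
            if (bestCode != null)
            {
                output.Append('\u0001').Append(bestCode).Append('\u0001'); // emit code token
                i += bestLength;
            }
            else
            {
                output.Append(input[i]); // emit literal character
                i++;
            }
        }
        return output.ToString();
    }
}
```

A dictionary entry would be added with, for example, Add("ACGTACGTACGT", "D0001"), after which Replace(data) emits literals and codes in a single pass over the edge-resident data.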
Once the preprocessed data reach the cloud, additional compression is applied using established codecs such as Zstd, GZIP, Snappy, and LZ4. These codecs were chosen for their proven efficiency and widespread application in data compression [26,27,28,29]. This two-stage hybrid compression pipeline is designed to improve scalability, reduce storage costs, and minimize computational overhead by performing redundancy elimination at the edge before cloud-based compression.

3.4. Evaluation

The effectiveness of the proposed compression approach will be evaluated based on four key metrics:
  • Compression Ratio: The reduction in data size achieved by the hybrid compression model, comparing the compressed size against the original dataset.
  • Effect Size Analysis: Cohen’s d will measure the magnitude of compression improvements compared to baseline methods.
  • Compression Speed: The time required for the compression process at both the edge and cloud layers, with a focus on optimizing performance trade-offs.
  • Cost-effectiveness: The reduction in storage and data transfer costs resulting from compression, particularly in cloud environments where data storage and bandwidth usage are cost drivers.
To validate these metrics, controlled experiments will be conducted using real genomic datasets. The evaluation process will compare the hybrid edge–cloud compression workflow against traditional compression methods (e.g., standalone LZ77, LZW, and other dictionary-based approaches) to quantify performance improvements. Additionally, we will analyze the impact of edge-based Trie redundancy elimination on reducing transmission load, assessing how much preprocessing at the edge contributes to overall compression efficiency.
This evaluation will ensure that our methodology effectively balances compression performance with computational feasibility, making it a practical solution for large-scale genomic data management.

3.5. Decompression Process

Decompression restores compressed data to their original form using the shared global dictionary. The process follows a structured approach:
  • The edge or cloud decompression module retrieves the unique codes from the compressed dataset.
  • These codes are mapped back to their corresponding N-gram sequences using the global dictionary.
  • The Trie-based structure ensures efficient look-up and sequence restoration, minimizing errors and ensuring data integrity.
While the primary focus of this study is on compression, decompression is equally important to validate the integrity of the original data. Our approach ensures consistency by utilizing the same shared dictionary for encoding and decoding, reducing the risk of errors or data loss.
By integrating a robust decompression pipeline, this methodology guarantees that compressed data can be reliably restored, making it suitable for real-world genomic data applications, including bioinformatics pipelines, research repositories, and healthcare data storage systems.
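As a rough illustration of the decoding side, the sketch below reverses the delimiter-based encoding assumed in the earlier Trie example by mapping each embedded code back to its N-gram through the shared dictionary; it is a simplified sketch, not the production decompression module.

```csharp
using System.Collections.Generic;
using System.Text;

// Minimal decompression sketch: codes embedded between delimiter characters are
// mapped back to their original N-grams via the shared global dictionary. The
// delimiter format matches the illustrative encoder above and is an assumption.
public static class SharedDictionaryDecoder
{
    public static string Restore(string encoded, IReadOnlyDictionary<string, string> codeToSequence)
    {
        var output = new StringBuilder();
        int i = 0;
        while (i < encoded.Length)
        {
            if (encoded[i] == '\u0001')
            {
                int end = encoded.IndexOf('\u0001', i + 1);      // closing delimiter
                string code = encoded.Substring(i + 1, end - i - 1);
                output.Append(codeToSequence[code]);             // restore original N-gram
                i = end + 1;
            }
            else
            {
                output.Append(encoded[i]);                       // literal character
                i++;
            }
        }
        return output.ToString();
    }
}
```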

4. Experimental Evaluation

Real patient omics data from B-cell Acute Lymphoblastic Leukemia (B-ALL) and Chronic Lymphocytic Leukemia (CLL) are utilized to assess the effectiveness and computational efficiency of the proposed approach:
  • Dataset 1 (B-ALL): contains 178 patients (approximately 7.92 GB) and includes a diverse set of eleven cellular markers, offering a robust environment for performance testing.
  • Dataset 2 (CLL): contains 162 patients (approximately 3.98 GB) and twelve markers.
All experiments were performed on the same hardware and software environment to ensure fair and accurate comparisons across compression methods. The hardware configuration included a high-performance Windows 11 Pro PC with an Intel Xeon multi-core processor, 64 GB of RAM, and an Internet speed of 50 Mbps.
The compression process was performed in two stages. First, an edge-based compression step was implemented in which a Trie-based shared dictionary was used to perform sequence replacement. This stage was developed in C# on the .NET 8 framework, with the shared dictionary managed in Microsoft SQL Server 2022. In the subsequent cloud-based stage, preprocessed data were further compressed using established general-purpose algorithms. Specifically, we used ZstdNet (v1.4.g) for Zstandard compression [26], Snappy.NET (v1.1.1.8) for Snappy compression [28], lz4net (v1.0.15.93) for LZ4 compression [30], and System.IO.Compression (part of the .NET framework) for GZIP [31]. These libraries were chosen for their proven efficiency and widespread adoption in the field, as further supported by [5,32,33]. All implementations were optimized for consistent execution, and results were logged for subsequent analysis.
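Of the libraries listed above, System.IO.Compression ships with .NET, so the cloud-stage GZIP pass can be sketched directly against its public API; the class name and file paths below are placeholders, and the other codecs follow the same stream-to-stream pattern through their respective bindings.

```csharp
using System.IO;
using System.IO.Compression;

// Sketch of the cloud-side GZIP stage using the built-in System.IO.Compression APIs.
// Paths are placeholders; the input is assumed to be the Trie-preprocessed file.
public static class CloudCompressor
{
    public static void GZipFile(string inputPath, string outputPath)
    {
        using FileStream input = File.OpenRead(inputPath);
        using FileStream output = File.Create(outputPath);
        using var gzip = new GZipStream(output, CompressionLevel.Optimal);
        input.CopyTo(gzip); // stream the preprocessed data through GZIP
    }
}
```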
Formal ethical approval was not required for this phase of the project as our focus was on developing the system architecture rather than analyzing personal or sensitive information. Furthermore, all data were deidentified, ensuring that there was no risk of reidentification.

5. Results

We evaluate the proposed method using the following metrics:
  • Compression Ratio: The degree of data size reduction, calculated as the ratio of the original dataset size to the compressed dataset size.
  • Effect Size Analysis: Cohen’s d measures the magnitude of compression improvements relative to baseline methods.
  • Computational Speed: The time required to compress the data.
  • Cost-Effectiveness: An estimation of potential cost savings in cloud storage and data transfer, based on the achieved compression efficiency.
Our experiments showed consistent performance across both datasets, validating the robustness of the hybrid compression approach. Notably, Dataset 1 (B-ALL) achieved better compression ratios compared to Dataset 2 (CLL), likely due to the larger number of patients and the increased probability of encountering repeated sequences. This supports our hypothesis that as more sequences are scanned and stored in the shared dictionary, the likelihood of repeated sequences increases, leading to progressively better compression ratios and reduced output sizes.

5.1. Compression Ratios

The two-stage approach, which combines sequence replacement using a Trie-based shared dictionary at the edge with advanced general-purpose compression methods (Zstd, GZIP, Snappy, and LZ4), demonstrated significant improvements in data reduction. Compression ratios were calculated based on the total file sizes for both datasets. The sequence replacement step alone, using the shared dictionary without additional compression, reduced the dataset size from 11,811 MB to 7889 MB, achieving a reduction of 33.2%, independent of any further encoding. This highlights the effectiveness of the shared dictionary sequence replacement in reducing redundancy.
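For reference, the 33.2% figure follows directly from the reported sizes:

```latex
\text{reduction} = 1 - \frac{\text{compressed size}}{\text{original size}}
                 = 1 - \frac{7889\ \text{MB}}{11{,}811\ \text{MB}} \approx 0.332 = 33.2\%
```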
When combined with advanced compression methods, the two-stage approach further enhanced compression efficiency. The results (as illustrated in Figure 3) are as follows:
  • Zstd achieved the highest efficiency, reducing the dataset size to 3760 MB (a 68.2% reduction), compared to 57.9% (4960 MB) with the traditional approach, representing a 10.3 percentage-point improvement.
  • GZIP reduced the dataset size to 3759 MB (a 68.2% reduction), compared to 58.6% (4895 MB) with the traditional approach, representing a 9.6 percentage-point improvement.
  • Snappy reduced the dataset size to 6064 MB (a 48.7% reduction), compared to 32.8% (7939 MB) with the traditional approach, representing a 15.9 percentage-point improvement.
  • LZ4 reduced the dataset size to 6102 MB (a 48.3% reduction), compared to 28.1% (8502 MB) with the traditional approach, representing a 20.2 percentage-point improvement.

5.2. Effect Size Analysis

The effect size (Cohen’s d) [34] was calculated to quantify the performance improvement achieved by the hybrid approach for each compression algorithm. Table 2 summarizes the effect sizes, which indicate substantial to exceptionally large improvements. The effect size was calculated as
d = \frac{M_1 - M_2}{SD_{pooled}}
where $M_1$ and $M_2$ are the mean compression rates for Zstd and the comparison codecs (GZIP, Snappy, and LZ4), and $SD_{pooled}$ is the pooled standard deviation of the two groups, calculated as
SD_{pooled} = \sqrt{\frac{(n_1 - 1)SD_1^2 + (n_2 - 1)SD_2^2}{n_1 + n_2 - 2}}
where $n_1$ and $n_2$ are the sample sizes of the two groups, and $SD_1$ and $SD_2$ are their standard deviations.
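For completeness, a direct C# implementation of this formula is sketched below; it is a generic helper mirroring the equation, not code from the evaluated system.

```csharp
using System;
using System.Linq;

// Straightforward implementation of the Cohen's d formula above; variable names
// mirror the equation. Generic helper for illustration only.
public static class EffectSize
{
    public static double CohensD(double[] group1, double[] group2)
    {
        double m1 = group1.Average();
        double m2 = group2.Average();
        double var1 = group1.Sum(x => Math.Pow(x - m1, 2)) / (group1.Length - 1); // SD1^2
        double var2 = group2.Sum(x => Math.Pow(x - m2, 2)) / (group2.Length - 1); // SD2^2
        double sdPooled = Math.Sqrt(((group1.Length - 1) * var1 + (group2.Length - 1) * var2)
                                    / (group1.Length + group2.Length - 2));
        return (m1 - m2) / sdPooled;
    }
}
```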
Table 2 shows the effect size (Cohen’s d) for four compression algorithms. A comparison of these values reveals that while both Zstd and GZIP show substantial improvements, Snappy and LZ4 exhibit even greater performance gains, with LZ4 demonstrating an exceptionally large enhancement. These differences suggest that the hybrid approach is particularly effective in boosting the performance of methods that initially had lower baseline compression efficiency. Overall, the results underscore the significant benefits of integrating sequence replacement via a Trie-based shared dictionary with advanced general-purpose compression methods for large-scale genomic datasets.

5.3. Computational Speeds

The Trie-based sequence replacement step proved instrumental in reducing data complexity before applying the final compression methods, significantly decreasing overall compression times across all tested methods. Table 3 presents the total compression times for the traditional approach and the improvements achieved by the hybrid approach incorporating Trie-based sequence replacement. Figure 4 visually represents these compression times, illustrating the reductions achieved by the hybrid approach across different algorithms.
It is important to note that while the Trie-based replacement step introduces an additional overhead of 865 s on the edge, this extra processing effectively reduces the volume of data that needs to be processed in the cloud. Consequently, the overall cloud compression times are significantly reduced, resulting in net performance gains. This trade-off highlights the potential benefits of further optimizing the Trie structure and database processes to minimize edge overhead while maximizing cloud time savings.

5.4. Cost Implications

The hybrid compression approach demonstrated substantial cost savings for cloud storage. Dataset sizes were reduced from 11,811 MB to 3760 MB, representing a 68.2% reduction, compared to 57.9% (4960 MB) achieved by traditional Zstd compression. This additional 10.3 percentage-point reduction translates into greater cost savings. For example, using Azure Locally Redundant Storage (LRS) hot pricing at USD 1.70 to USD 2.08 per GB per month [35], Zstd’s traditional approach resulted in minimum monthly savings of USD 12.85 for both datasets, while the hybrid approach increased savings to USD 14.66.
While these savings may seem modest for the cytometry dataset used in our experiments, they scale significantly for larger datasets. For example, large-scale projects such as whole-genome sequencing projects typical of the 1000 Genomes Project [36] or The Cancer Genome Atlas [37] can have a total of approximately 40 TB of data (e.g., 100 million reads per sample for 1000 samples). Using traditional Zstd compression, storage requirements were reduced to 16 TB, achieving minimum monthly savings of USD 41,779. The hybrid approach further reduced storage to 12.76 TB, increasing minimum monthly savings to USD 47,419. This represents additional minimum savings of USD 5640 per month. Annually, this equates to USD 67,680, underscoring the hybrid approach’s transformative cost-efficiency.
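At the lower bound of USD 1.70 per GB per month and taking 1 TB as 1024 GB, the quoted figures follow from the reported size reductions:

```latex
(40 - 16)\ \text{TB} \times 1024\ \tfrac{\text{GB}}{\text{TB}} \times 1.70\ \tfrac{\text{USD}}{\text{GB}} \approx \text{USD}\ 41{,}779 \quad \text{(traditional Zstd)}
(40 - 12.76)\ \text{TB} \times 1024 \times 1.70 \approx \text{USD}\ 47{,}419 \quad \text{(hybrid)}
47{,}419 - 41{,}779 = \text{USD}\ 5640\ \text{per month} \approx \text{USD}\ 67{,}680\ \text{per year}
```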
As shown in Figure 5, these results emphasize the scalability and economic benefits of the hybrid approach, making it an ideal solution for institutions managing omics and genomic datasets. Beyond storage, the reduction in data size also decreases associated costs for data transfer, further enhancing its cost-effectiveness for large-scale genomic research.

5.5. Trie-Based Dictionary Statistics

The Trie-based shared dictionary was instrumental in optimizing data compression by capturing high-frequency N-gram sequences from the cytometry data. For the CLL dataset, 322,497 unique sequences were processed, with repetition rates ranging from 3,343,158 occurrences for the most frequent sequence to a threshold of 101 occurrences. For the B-ALL dataset, 439,933 unique sequences were processed, with repetition rates ranging from 56,272 occurrences to a threshold of 101 occurrences.
The resulting dictionary sizes were 22.19 MB for B-ALL and 16.93 MB for CLL, demonstrating the compactness and scalability of the Trie-based approach. These statistics underscore how the mechanism enabled substantial data reduction while maintaining manageable dictionary sizes, thereby ensuring scalability and high-performance workflows.

6. Discussion

The findings of this study underscore the transformative potential of the proposed edge compression approach. By performing sequence elimination at the edge before cloud transmission, we reduce bandwidth usage, improve compression ratios, and lower cloud storage costs. This two-stage compression approach integrates Trie-based shared dictionaries with advanced general-purpose compression methods, addressing a critical challenge in omics data management: the need for scalable and cost-effective storage solutions that maintain high compression ratios and computational efficiency.
Our work builds upon previous research in genomic data compression. The SPRING method [5] demonstrated substantial improvements by leveraging domain-specific features such as read reordering and entropy coding. In contrast, our hybrid approach utilizes a global Trie-based dictionary to perform redundancy elimination at the edge, thereby not only reducing the data volume prior to cloud transmission but also improving overall compression efficiency. Similarly, while Genozip [6] has been shown to be an effective universal compressor for omics data, our approach further enhances performance by integrating edge preprocessing, which directly reduces both storage and transmission costs.
The integration of a Trie-based dictionary for global sequence matching marks a significant advancement in omics and genomics data compression methodologies. Unlike traditional sliding window techniques, which are limited by local sequence contexts, this approach facilitates global redundancy reduction, improving compression efficiency. By achieving a compression ratio of up to 68.2%, the hybrid compression outperformed conventional methods, including Zstd alone, which achieved a 57.9% reduction. This additional 10.3 percentage-point improvement directly translates into substantial cost savings for institutions managing large-scale genomic datasets, such as those involved in whole-genome sequencing or large cohort studies.
This study further demonstrates that the two-stage compression approach is not only computationally effective but also scalable for handling datasets of varying sizes and complexities. Its successful application to omics datasets (B-ALL and CLL) highlights its adaptability, providing a foundation for broader exploration into other types of genomic and omics data. The cost-effectiveness analysis reinforces the economic benefits of this approach, showing significant potential to reduce monthly storage costs when applied at scale. This makes it a practical solution for institutions grappling with the exponential growth of health data.
Furthermore, the hybrid approach not only demonstrates superior compression ratios but also provides significant benefits in terms of computational speed and cost-effectiveness. Comparing our findings with those of previous studies, it is evident that methods such as SPRING and Genozip, while effective, do not incorporate the edge–cloud architecture that we propose. This novel aspect of our methodology allows for distributed processing, thereby optimizing both the compression process and network efficiency. By comparing our hybrid approach with established methods such as Zstd, GZIP, Snappy, and LZ4, our study not only validates the enhanced performance of our method but also demonstrates its potential to significantly reduce storage and transmission costs in real-world applications.
Beyond its primary application in omics data compression, our hybrid edge–cloud approach can also significantly contribute to the field of actuator and sensor networks. In many sensor network deployments, devices are resource-constrained and operate on limited power and processing capabilities. By integrating a Trie-based shared dictionary for localized redundancy elimination, our method reduces the volume of transmitted data, thereby conserving bandwidth and energy. Overall, the principles demonstrated in our work have the potential to be extended to actuator and sensor networks, offering a pathway toward more robust, efficient, and cost-effective network architectures.
Despite its promising results, the proposed hybrid compression approach has several limitations. The reliance on cytometry datasets may not fully represent the diversity and complexity of all genomic and omics data types. Additionally, the initial setup and maintenance of the shared dictionary require technical expertise and resources, posing potential challenges for widespread adoption. While the Trie-based sequence replacement step is instrumental in achieving high compression ratios, it introduces notable computational overhead that necessitates optimization.
Implementing this hybrid compression approach in clinical and research environments must consider regulatory and privacy constraints, such as those outlined by the General Data Protection Regulation (GDPR) and HIPAA. Ensuring compliance while maintaining efficiency and scalability is critical for successful adoption. A phased deployment targeting specific genomic or omics datasets could facilitate process refinement and ensure secure and efficient operations within healthcare environments.
Future research should aim to test the hybrid compression approach across a broader range of genomic and omics datasets to evaluate its generalizability and effectiveness in handling diverse data types and scales. Equally important is the need to optimize the computational efficiency of the Trie-based replacement step by investigating alternative data structures or indexing methods that minimize overhead while preserving the advantages of global sequence matching.

7. Conclusions

The hybrid compression approach introduced in this study represents a significant advancement in omics data management. By combining dynamic N-gram analysis, Trie-based shared dictionaries, and advanced compression techniques, our method achieves substantial improvements in data reduction, scalability, and cost-efficiency.
Particularly, our method achieves compression ratios of up to 68.2%, outperforming popular methods like Zstd, which achieved a 57.9% reduction. This improvement translates into considerable cost savings for institutions handling large-scale genomic datasets, addressing critical challenges in storage and data transfer. The global sequence-matching capability of the Trie-based dictionary enables efficient redundancy reduction, overcoming the limitations of local-context approaches and making it particularly well suited for the repetitive nature of genomic and omics data.
We demonstrated the effectiveness of the hybrid and lossless method with two cytometry datasets, and the associated cost–benefit analysis highlighted its potential for large and broader data. Challenges in computational overhead during sequence replacement, the setup of the shared dictionary infrastructure, and the need for broader testing across diverse data types were also identified in the study. Addressing these areas in future research will enhance efficiency and expand applicability.
In summary, this approach offers a promising solution to the escalating demands of large omics data management. It not only reduces storage and transfer costs but also ensures the integrity and scalability required to support the growing scope of research and data. Further refinements and real-world deployment will strengthen its utility and impact, paving the way for advancements in data-driven health research.

Author Contributions

Conceptualization, R.A., Q.V.N., D.R.C., S.J.S., Z.Q. and P.J.K.; methodology, R.A., Q.V.N. and D.R.C.; software, R.A.; formal analysis, R.A.; writing—original draft preparation, R.A.; writing—review and editing, Q.V.N., R.A., D.R.C., S.J.S., Z.Q. and P.J.K.; supervision, Q.V.N., D.R.C., S.J.S., Z.Q. and P.J.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by ARC LIEF LE240100131.

Data Availability Statement

Data are unavailable due to privacy restrictions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hernaez, M.; Pavlichin, D.; Weissman, T.; Ochoa, I. Genomic data compression. Annu. Rev. Biomed. Data Sci. 2019, 2, 19–37.
  2. Stoumpos, A.I.; Kitsios, F.; Talias, M.A. Digital transformation in healthcare: Technology acceptance and its applications. Int. J. Environ. Res. Public Health 2023, 20, 3407.
  3. Becker, M.; Worlikar, U.; Agrawal, S.; Schultze, H.; Ulas, T.; Singhal, S.; Schultze, J.L. Scaling genomics data processing with memory-driven computing to accelerate computational biology. In High Performance Computing, Proceedings of the 35th International Conference, ISC High Performance, Frankfurt/Main, Germany, 22–25 June 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 328–344.
  4. Greenfield, D.; Wittorff, V.; Hultner, M. The importance of data compression in the field of genomics. IEEE Pulse 2019, 10, 20–23.
  5. Chandak, S.; Tatwawadi, K.; Ochoa, I.; Hernaez, M.; Weissman, T. SPRING: A next-generation compressor for FASTQ data. Bioinformatics 2019, 35, 2674–2676.
  6. Lan, D.; Tobler, R.; Souilmi, Y.; Llamas, B. Genozip: A universal extensible genomic data compressor. Bioinformatics 2021, 37, 2225–2230.
  7. Fritz, M.H.Y.; Leinonen, R.; Cochrane, G.; Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 2011, 21, 734–740.
  8. Bonfield, J.K.; Mahoney, M.V. Compression of FASTQ and SAM format sequencing data. PLoS ONE 2013, 8, e59190.
  9. Promberger, L.; Schwemmer, R.; Fröning, H. Characterization of data compression across CPU platforms and accelerators. Concurr. Comput. Pract. Exp. 2023, 35, e6465.
  10. Cánovas, R.; Moffat, A.; Turpin, A. Lossy compression of quality scores in genomic data. Bioinformatics 2014, 30, 2130–2136.
  11. Kempa, D.; Prezza, N. At the roots of dictionary compression: String attractors. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, Los Angeles, CA, USA, 25–29 June 2018; pp. 827–840.
  12. Kozanitis, C.; Saunders, C.; Kruglyak, S.; Bafna, V.; Varghese, G. Compressing genomic sequence fragments using SlimGene. In Research in Computational Molecular Biology, Proceedings of the 14th Annual International Conference, RECOMB 2010, Lisbon, Portugal, 25–28 April 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 310–324.
  13. Toderici, G.; Vincent, D.; Johnston, N.; Jin Hwang, S.; Minnen, D.; Shor, J.; Covell, M. Full resolution image compression with recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 5306–5314.
  14. Chen, M.; Ma, Y.; Li, Y.; Wu, D.; Zhang, Y.; Youn, C.H. Wearable 2.0: Enabling human-cloud integration in next generation healthcare systems. IEEE Commun. Mag. 2017, 55, 54–61.
  15. Pantelopoulos, A.; Bourbakis, N.G. A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Trans. Syst. Man Cybern. Part C (Appl. Rev.) 2009, 40, 1–12.
  16. Ballé, J.; Laparra, V.; Simoncelli, E.P. End-to-end optimized image compression. arXiv 2016, arXiv:1611.01704.
  17. Roguski, Ł.; Deorowicz, S. DSRC 2—Industry-oriented compression of FASTQ files. Bioinformatics 2014, 30, 2213–2215.
  18. Adam, R.; Catchpoole, D.R.; Simoff, S.S.; Kennedy, P.J.; Nguyen, Q.V. Novel Hybrid Edge-Cloud Framework for Efficient and Sustainable Omics Data Management. Innov. Digit. Health Diagn. Biomark. 2024, 4, 81–88.
  19. Ferragina, P. Dictionary-Based Compressors. In Pearls of Algorithm Engineering; Cambridge University Press: Cambridge, UK, 2023; pp. 240–251.
  20. Malta, M.; Cardoso, L.O.; Bastos, F.I.; Magnanini, M.M.F.; Silva, C.M.F.P.D. STROBE initiative: Guidelines on reporting observational studies. Rev. De Saude Publica 2010, 44, 559–565.
  21. Catala-Lopez, F.; Alonso-Arroyo, A.; Page, M.J.; Hutton, B.; Ridao, M.; Tabarés-Seisdedos, R.; Moher, D. Reporting guidelines for health research: Protocol for a cross-sectional analysis of the EQUATOR Network Library. BMJ Open 2019, 9, e022769.
  22. Erdem, O. Tree-based string pattern matching on FPGAs. Comput. Electr. Eng. 2016, 49, 117–133.
  23. Grossi, R.; Ottaviano, G. The wavelet trie: Maintaining an indexed sequence of strings in compressed space. In Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, Scottsdale, AZ, USA, 21–23 May 2012.
  24. Grossi, R.; Vitter, J.S. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proceedings of the Thirty-Second Annual ACM Symposium on Theory of Computing, Portland, OR, USA, 21–23 May 2000; pp. 397–406.
  25. Pibiri, G.E.; Venturini, R. Efficient Data Structures for Massive N-Gram Datasets. Memory 2017, 10, 21.
  26. Collet, Y. Zstandard-Fast Real-Time Compression Algorithm. GitHub Repository, 2018. Available online: https://github.com/facebook/zstd (accessed on 5 April 2025).
  27. Deutsch, P. RFC1952: Gzip File Format Specification Version 4.3. RFC Editor, 1996. Available online: https://www.rfc-editor.org/rfc/rfc1952.html (accessed on 5 April 2025).
  28. Gunderson, S.H. Snappy: A Fast Compressor/Decompressor. 2015. Available online: https://google.github.io/snappy/ (accessed on 5 April 2025).
  29. Bartík, M.; Ubik, S.; Kubalik, P. LZ4 compression algorithm on FPGA. In Proceedings of the 2015 IEEE International Conference on Electronics, Circuits, and Systems (ICECS), Cairo, Egypt, 6–9 December 2015; pp. 179–182.
  30. LZ4. LZ4–Extremely Fast Compression Algorithm. 2016. Available online: https://github.com/lz4/lz4 (accessed on 5 April 2025).
  31. Microsoft. System.IO.Compression Namespace. Available online: https://docs.microsoft.com/en-us/dotnet/api/system.io.compression (accessed on 5 April 2025).
  32. Microsoft. SQL Server 2022. Available online: https://www.microsoft.com/en-us/sql-server/sql-server-2022 (accessed on 5 April 2025).
  33. Microsoft. .NET 8. 2023. Available online: https://dotnet.microsoft.com/en-us/download/dotnet/8.0 (accessed on 8 April 2025).
  34. Cohen, J. Statistical Power Analysis for the Behavioral Sciences; Routledge: Oxfordshire, UK, 2013.
  35. Krumm, N.; Hoffman, N. Practical estimation of cloud storage costs for clinical genomic data. Pract. Lab. Med. 2020, 21, e00168.
  36. 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 2015, 526, 68.
  37. Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Stuart, J.M. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
Figure 1. Shared dictionary construction: N-gram extraction and refinement process.
Figure 2. Edge–cloud compression: from shared dictionary to existing codecs.
Figure 3. Comparison of compression performance: traditional versus hybrid.
Figure 4. Compression times for the traditional versus hybrid approach.
Figure 5. Comparison of dataset sizes and monthly costs by compression method.
Table 1. Overview of compression methods.

Method | Key Techniques | Compression Ratio/Performance | Limitations
SPRING [5] | Read reordering, entropy coding | High for FASTQ files | Limited to specific data formats; lacks edge processing
Genozip [6] | Multi-threaded processing, format versatility | Good across various formats | Does not leverage edge preprocessing; may miss global redundancy
Traditional Methods (Zstd, GZIP, Snappy, LZ4) | Dictionary-based techniques (e.g., LZ77) | Moderate to high | Rely on local-context matching; not optimized for global redundancy
Hybrid Compression Frameworks [7,8] | Combination of dictionary-based methods with parallel/distributed processing | Varies by implementation; improved scalability | Typically do not incorporate localized edge preprocessing, leading to increased network overhead
Neural-Network-Based Methods [13] | Adaptive compression using neural networks (e.g., RNNs, autoencoders) | Promising but variable results | High computational cost; less practical for real-time or large-scale applications
Proposed Method | Global Trie-based shared dictionary, hybrid edge preprocessing + cloud postprocessing | Improved compression ratio and efficiency in edge–cloud frameworks | Additional edge processing overhead (offset by overall gains)
While each of these methods addresses aspects of the omics data compression challenge, they tend to fall short in balancing compression efficiency with processing speed and scalability. In contrast, our proposed method uniquely integrates edge preprocessing—via a global Trie-based shared dictionary—with cloud-based compression. This dual-stage approach not only achieves up to 68.2% data reduction but also significantly reduces network overhead, addressing the limitations observed in existing methods.
Table 2. Effect size (Cohen’s d) for different compression algorithms.

Algorithm | Effect Size (d) | Interpretation
Zstd | 12.61 | Substantial improvement
GZIP | 11.94 | Substantial improvement
Snappy | 19.71 | Very large improvement
LZ4 | 25.23 | Exceptionally large enhancement
Table 3. Compression times before and after Trie-based sequence replacement.

Algorithm | Traditional Approach Time (s) | Hybrid Approach Time (s) | Reduction (%)
GZIP | 399.00 | 249.63 | 37.4%
Snappy | 28.96 | 25.69 | 11.3%
Zstd | 63.73 | 46.58 | 26.9%
LZ4 | 34.93 | 28.78 | 17.6%
