Compression of Text in Selected Languages—Efficiency, Volume, and Time Comparison
Abstract
:1. Introduction
- Study of Esperanto grammar and vocabulary.
- Development of the theoretical background of the solution.
- Selection of research tools.
- Finding the text in Esperanto, Polish, and English as input data for the compression process.
- Implementation of a program for text compression.
- Process of text compression.
- Analysis of the results.
- Development and update of the program in the future.
2. Related Work
3. Materials and Methods
3.1. Selection of the Planned Language Esperanto
- Striving to introduce a neutral human language around the world;
- Popularization of Esperanto through practice, including increasing the library of sources (both original and translations);
- Waiver of rights by the author of the Esperanto language—Esperanto belongs to everyone;
- No one can introduce new rules into the language. The only source is the Esperanto Fundament;
- Every user of Esperanto is an Esperantist.
3.2. Input Data
- In Polish—wolnelektury.pl (accessed on 5 February 2022) [63];
- In English—gutenberg.org (accessed on 5 February 2022) [64];
- In Esperanto—tekstaro.com (accessed on 15 January 2022) [65].
- Zeta Library (zlib);
- Lempel–Ziv–Markov chain algorithm (lzma);
- Bz2;
- Lz4.
- Examination of the compressed text and measurement of the compression time;
- Data presentation in text form was provided through a specially created class, which stores the collected data and writes them to the console;
- Entering data into the Microsoft Excel spreadsheet table and presenting the results in the form of tables and graphs.
3.3. Research Tools
3.4. Developed Program for Text Compression
- pl—stands for Polish language;
- en—stands for English language;
- eo—stands for Esperanto language;
- eox—stands for Esperanto language in notation x.
- The length of the uncompressed text;
- The name of the language for which the compression is performed;
- The name of the compression algorithm;
- The compression time;
- The percentage of space taken in relation to uncompressed text;
- Bytes space taken up by compressed text.
- Adding a text version with x notation;
- Collecting the length of compressed texts in bytes;
- Collecting the length of uncompressed texts in bytes;
- Adding more algorithms in order to verify the set hypothesis and obtain more reliable results.
4. Results
4.1. Text Volume before and after Compression
4.2. Time of Compression
4.3. Additional Comparison of Compression of Text Translated in Google Translate
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Sayood, K. Introduction to Data Compression; Morgan Kaufmann: Burlington, MA, USA, 2017. [Google Scholar]
- Rahman, M.; Hamada, M. Burrows-wheeler transform based lossless text compression using keys and Huffman coding. Symmetry 2020, 12, 1654. [Google Scholar] [CrossRef]
- Linhares Pontes, E.; Huet, S.; Torres-Moreno, J.-M.; Linhares, A.C. Cross-language text summarization using sentence and multi-sentence compression. In Proceedings of the International Conference on Applications of Natural Language to Information Systems, Paris, France, 13–15 June 2018; Springer: Cham, Switzerland, 2018; pp. 467–479. [Google Scholar]
- Kalajdzic, K.; Ali, S.H.; Patel, A. Rapid lossless compression of short text messages. Comput. Stand. Interfaces 2015, 37, 53–59. [Google Scholar] [CrossRef]
- Chubaryan, A.; Sargsyan, L. The Text-Organizing Function of Compression in English Scientific Discourse. Armen. Folia Angl. 2016, 12, 15–26. [Google Scholar] [CrossRef]
- Inoue, K.; Miyazaki, T.; Sugaya, Y.; Omachi, S. Study on Compression of Images Including Text by Sparse Coding. IEICE Tech. Rep. 2016, 116, 5–10. [Google Scholar]
- Teahan, W.J. A compression-based toolkit for modelling and processing natural language text. Information 2018, 9, 294. [Google Scholar] [CrossRef]
- Zamenhof, L. Fundamento de Esperanto. Available online: https://www.akademio-de-esperanto.org/fundamento/ (accessed on 8 April 2022).
- Rani, M.; Singh, V. A Survey on Lossless Text Data Compression Techniques. Int. J. Adv. Res. Comput. Eng. Technol. 2016, 5, 1741–1744. [Google Scholar]
- Mentzer, F.; Gool, L.V.; Tschannen, M. Learning better lossless compression using lossy compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 6638–6647. [Google Scholar]
- Oswald, C.; Ghosh, A.I.; Sivaselvan, B. An efficient text compression algorithm-data mining perspective. In Proceedings of the International Conference on Mining Intelligence and Knowledge Exploration, Hyderabad, India, 9–11 December 2015; pp. 563–575. [Google Scholar]
- Rahman, M.A.; Hamada, M. Lossless Image Compression Techniques: A State-of-the-Art Survey. Symmetry 2019, 11, 1274. [Google Scholar] [CrossRef]
- Gupta, A.; Bansal, A.; Khanduja, V. Modern lossless compression techniques: Review, comparison and analysis. In Proceedings of the 2017 Second International Conference on Electrical, Computer and Communication Technologies (ICECCT), Coimbatore, India, 22–24 February 2017; pp. 1–8. [Google Scholar]
- Popescu, C.; Grama, L.; Rusu, C. A Highly Scalable Method for Extractive Text Summarization Using Convex Optimization. Symmetry 2021, 13, 1824. [Google Scholar] [CrossRef]
- Jalilian, E.; Hofbauer, H.; Uhl, A. Iris Image Compression Using Deep Convolutional Neural Networks. Sensors 2022, 22, 2689. [Google Scholar] [CrossRef]
- Hu, W.; Zhu, M.; Zhang, H. Application of Block Sparse Bayesian Learning in Power Quality Steady-State Data Compression. Energies 2022, 15, 2479. [Google Scholar] [CrossRef]
- Nonaka, K.; Yamanouchi, K.; Tomohiro, I.; Okita, T.; Shimada, K.; Sakamoto, H. A Compression-Based Multiple Subword Segmentation for Neural Machine Translation. Electronics 2022, 11, 1014. [Google Scholar] [CrossRef]
- Oswald, C.; Sivaselvan, B. An optimal text compression algorithm based on frequent pattern mining. J. Ambient Intell. Humaniz. Comput. 2018, 9, 803–822. [Google Scholar] [CrossRef]
- Bedruz, R.A.; Quiros, A.R.F. Comparison of Huffman Algorithm and Lempel-Ziv Algorithm for audio, image and text compression. In Proceedings of the 2015 International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment and Management (HNICEM), Cebu, Philippines, 9–12 December 2015; pp. 1–6. [Google Scholar]
- Oswald, C.; Ghosh, A.I.; Sivaselvan, B. Knowledge engineering perspective of text compression. In Proceedings of the 2015 Annual IEEE India Conference (INDICON), New Delhi, India, 17–20 December 2015; pp. 1–6. [Google Scholar]
- Blalock, D.; Madden, S.; Guttag, J. Sprintz: Time series compression for the internet of things. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2018, 2, 1–23. [Google Scholar] [CrossRef]
- Qiu, H.; Zheng, Q.; Memmi, G.; Lu, J.; Qiu, M.; Thuraisingham, B. Deep residual learning-based enhanced JPEG compression in the Internet of Things. IEEE Trans. Ind. Inform. 2020, 17, 2124–2133. [Google Scholar] [CrossRef]
- Chowdhury, M.R.; Tripathi, S.; De, S. Adaptive multivariate data compression in smart metering Internet of Things. IEEE Trans. Ind. Inform. 2020, 17, 1287–1297. [Google Scholar] [CrossRef]
- Sujitha, B.; Parvathy, V.S.; Lydia, E.L.; Rani, P.; Polkowski, Z.; Shankar, K. Optimal deep learning based image compression technique for data transmission on industrial Internet of things applications. Trans. Emerg. Telecommun. Technol. 2021, 32, e3976. [Google Scholar] [CrossRef]
- Kagita, M.K.; Thilakarathne, N.; Bojja, G.R.; Kaosar, M. A lossless compression technique for Huffman-based differential encoding in IoT for smart agriculture. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 2021, 29, 317–332. [Google Scholar] [CrossRef]
- Campobello, G.; Segreto, A.; Zanafi, S.; Serrano, S. RAKE: A simple and efficient lossless compression algorithm for the internet of things. In Proceedings of the 2017 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, 28 August–2 September 2017; pp. 2581–2585. [Google Scholar]
- Hadiatna, F.; Hindersah, H.; Yolanda, D.; Triawan, M.A. Design and implementation of data logger using lossless data compression method for Internet of Things. In Proceedings of the 2016 6th International Conference on System Engineering and Technology (ICSET), Bandung, Indonesia, 3–4 October 2016; pp. 105–108. [Google Scholar]
- Perez, R.; Leithardt, V.R.Q.; Correia, S.D. Lossless compression scheme for efficient gnss data transmission on iot devices. In Proceedings of the 2021 International Conference on Electrical, Computer and Energy Technologies (ICECET), Cape Town, South Africa, 9–10 December 2021; pp. 1–6. [Google Scholar]
- Gu, J.; Choi, H.; Jeong, J. English Text Compression using Huffman Coding. In Proceedings of the Korean Society of Broadcast Engineers Conference, The Korean Institute of Broadcast and Media Engineers, Seoul, Korea, 4 November 2016; pp. 69–71. [Google Scholar]
- Cherkunova, M. V Means of Semantic Compression in Modern English Scientific Discourse (Based on Abstracts to the Articles From International Scientific Citation Databases). Prof. Discourse Commun. 2021, 3, 28–38. [Google Scholar] [CrossRef]
- Bekali, O. Semantic-Stylistic Tools in English. Kresna Soc. Sci. Humanit. Res. 2022, 8, 34–37. [Google Scholar]
- Vijayalakshmi, B.; Sasirekha, N. Lossless text compression for unicode tamil documents. ICTACT J. Soft Comput. 2018, 8, 1635–1640. [Google Scholar]
- Gilliver, P. The making of the Oxford English dictionary. Lexikos 2016, 26, 436–445. [Google Scholar]
- Indurani, M.P.; Deepika, M.P.; Padma, M.P. A survey on big data compression. In Proceedings of the National Conference on “Future Research Perspectives in Computer Science and Information Technology”, Madurai, India, 21–22 February 2017. [Google Scholar]
- Sarker, P.; Rahman, M.L. Introduction to Adjacent Distance Array with Huffman Principle: A New Encoding and Decoding Technique for Transliteration Based Bengali Text Compression. In Progress in Advanced Computing and Intelligent Engineering; Springer: Berlin/Heidelberg, Germany, 2021; pp. 543–555. [Google Scholar]
- Gonzales, A.R.; Spring, N.; Kew, T.; Kostrzewa, M.; Säuberli, A.; Müller, M.; Ebling, S. A New Dataset and Efficient Baselines for Document-level Text Simplification in German. In Proceedings of the Third Workshop on New Frontiers in Summarization, Online, Dominican Republic, 7–11 November 2021; pp. 152–161. [Google Scholar]
- Dissemond, J.; Assenheimer, B.; Bültemann, A.; Gerber, V.; Gretener, S.; Kohler-von Siebenthal, E.; Koller, S.; Kröger, K.; Kurz, P.; Läuchli, S. Compression therapy in patients with venous leg ulcers. JDDG J. Dtsch. Dermatol. Ges. 2016, 14, 1072–1087. [Google Scholar] [CrossRef] [PubMed]
- Hilal, T.A.; Hilal, H.A. Arabic text lossless compression by characters encoding. Procedia Comput. Sci. 2019, 155, 618–623. [Google Scholar] [CrossRef]
- Awajan, A.; Jrai, E.A. Hybrid Technique for Arabic Text Compression. Glob. J. Comput. Sci. Technol. 2015, 15, 1–7. [Google Scholar]
- Xu, R.; Yang, Y. Cross-lingual distillation for text classification. arXiv 2017, arXiv:1705.02073. [Google Scholar]
- Ignatoski, M.; Lerga, J.; Stanković, L.; Daković, M. Comparison of entropy and dictionary based text compression in English, German, French, Italian, Czech, Hungarian, Finnish, and Croatian. Mathematics 2020, 8, 1059. [Google Scholar] [CrossRef]
- Marasek, K.; Brocki, Ł.; Korzinek, D.; Wołk, K.; Gubrynowicz, R. Spoken language translation for polish. arXiv 2015, arXiv:1511.07788. [Google Scholar]
- Wołk, K.; Marasek, K. Polish-English statistical machine translation of medical texts. In New Research in Multimedia and Internet Systems; Springer: Berlin/Heidelberg, Germany, 2015; pp. 169–179. [Google Scholar]
- Grzybowski, P.; Juralewicz, E.; Piasecki, M. Sparse coding in authorship attribution for Polish tweets. In Proceedings of the Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2–4 September 2019; pp. 409–417. [Google Scholar]
- Łabuzek, M.; Piasecki, M. English Translator ± A Bi directional Polish English Translation System. Available online: https://www.fi.muni.cz/tsd2002/papers/108_Marek_Labuzek.ps (accessed on 12 April 2022).
- Byram, M.; Golubeva, I. Conceptualising intercultural (communicative) competence and intercultural citizenship. In The Routledge Handbook of Language and Intercultural Communication; Routledge: England, UK, 2020; pp. 70–85. [Google Scholar]
- Wagner, M.; Byram, M. Intercultural citizenship. Int. Encycl. Intercult. Commun. 2017, 13, 1–6. [Google Scholar]
- The Nobel Foundation The Nobel Prize. Available online: https://www.nobelprize.org/ (accessed on 9 April 2022).
- Universala Esperanto-Asocio Universala Esperanto-Asocio. Available online: https://uea.org/info (accessed on 11 April 2022).
- PEJ—Pola Esperanto-Junularo Podstawy języka Esperanto. Available online: http://pej.pl/pl/o-esperanto/podstawy-jezyka/ (accessed on 8 April 2022).
- Israel, N. Esperantic Modernism: Joyce, Universal Language, and Political Gesture. Modernism/Modernity 2017, 24, 1–21. [Google Scholar] [CrossRef]
- Martín Camacho, J.C. La morfología de las lenguas artificiales. El caso del “volapuk” y de la “langue bleue”. Anu. Estud. Filol. 2019, 42, 189–213. [Google Scholar] [CrossRef]
- LaFarge, P. The Village Voice. Available online: http://www.villagevoice.com (accessed on 12 April 2022).
- Garvía, R. Esperanto and Its Rivals; University of Pennsylvania Press: Philadelphia, PA, USA, 2015. [Google Scholar]
- Guinard, T. An Algorithm for Morphological Segmentation of Esperanto Words. Prague Bull. Math. Linguist. 2016, 105, 63–67. [Google Scholar] [CrossRef]
- Omarov, D.; Tran, K.; Zhexembay, L.; Santana, M.; Hildebrand, A.J. Zipf’s Law: A Universal Law for Empirical Data from Word Frequencies to Olympic Records. Available online: https://faculty.math.illinois.edu/~hildebr/ugresearch/posters/urs2017-zipf-law.pdf (accessed on 12 April 2022).
- Stecuła, B. Budowanie Modelu Kontekstu Świata na Podstawie Tekstu w Języku Esperanto. Master’s Thesis, Silesian University of Technology, Gliwice, Poland, 2020. [Google Scholar]
- Gobbo, F. Machine translation as a complex system: The role of Esperanto. Interdiscip. Descr. Complex Syst. INDECS 2015, 13, 264–274. [Google Scholar] [CrossRef]
- Gobbo, F. Coolification and Language Vitality: The Case of Esperanto. Languages 2021, 6, 93. [Google Scholar] [CrossRef]
- Hernández-Gómez, C.; Basurto-Flores, R.; Obregón-Quintana, B.; Guzmán-Vargas, L. Evaluating the Irregularity of Natural Languages. Entropy 2017, 19, 521. [Google Scholar] [CrossRef]
- Nobliści.pl Laureaci Nagrody Nobla. Available online: http://www.noblisci.pl/1905-henryk-sienkiewicz/ (accessed on 8 April 2022).
- Instytut Książki 115 Lat Temu Henryk Sienkiewicz Odebrał Nagrodę Nobla W Dziedzinie Literatury. Available online: https://instytutksiazki.pl/ (accessed on 8 April 2022).
- Wolnelektury.pl Wolne Lektury. Available online: www.wolnelektury.pl (accessed on 11 February 2022).
- Project Gutenberg Gutenberg. Available online: www.gutenberg.org (accessed on 11 February 2022).
- Tekstaro de Esperanto. Available online: www.tekstaro.com (accessed on 11 February 2022).
- Severance, C. Guido van rossum: The early years of python. Computer 2015, 48, 7–9. [Google Scholar] [CrossRef]
- Kumar, C. Python Advantages and Disadvantages—Step in the Right Direction. Available online: https://techvidvan.com/tutorials/%0Apython-advantages-and-disadvantages/ (accessed on 11 February 2022).
- JetBrains, s.r.o. PyCharm—The Python IDE for Professional Developers. Available online: https://www.jetbrains.com/company/ (accessed on 15 February 2022).
- Yergeau, F. UTF-8, A Transformation Format of ISO 10646. Available online: https://tools.ietf.org/html/rfc3629 (accessed on 13 February 2022).
- W3Techs Usage of Character Encodings Broken Down by Ranking. Available online: https://w3techs.com/technologies/cross/character_encoding/ranking (accessed on 13 February 2022).
Parameter | pl | en | eo | eox |
---|---|---|---|---|
The volume of the uncompressed text (bytes). | 1,187,923 | 1,198,403 | 1,174,480 | 1,174,480 |
Quotation | Number of Characters |
---|---|
“I tak minął Nero, jak mija wicher, burza, pożar, wojna lub mór, a bazylika Piotra panuje dotąd z wyżyn watykańskich miastu i światu”. | 131 |
“Therefore, Nero passed, as a whirlwind, as a storm, as a fire, as war or death passes; but the basilica of Peter rules till now, from the Vatican heights, the city, and the world”. | 174 |
“Tiel pasis Nero, kiel pasas uragano, fulmotondro, brulo, milito aŭ pesto, dum la baziliko de Petro regas ĝis nun de la Vatikana altaĵo la urbon kaj la mondon”. | 157 |
“Tiel pasis Nero, kiel pasas uragano, fulmotondro, brulo, milito aux pesto, dum la baziliko de Petro regas gxis nun de la Vatikana altajxo la urbon kaj la mondon”. | 160 |
Algorithm | pl | en | eo | eox |
---|---|---|---|---|
Compression time [s] | ||||
zlib | 0.0683 | 0.0683 | 0.0722 | 0.0723 |
lzma | 0.4449 | 0.4552 | 0.4473 | 0.4464 |
bz2 | 0.0821 | 0.0813 | 0.0813 | 0.0821 |
lz4 | 0.0818 | 0.0934 | 0.0927 | 0.0927 |
Space used [%] | ||||
zlib | 38.38 | 37.20 | 35.80 | 35.71 |
lzma | 30.66 | 29.45 | 28.58 | 28.56 |
bz2 | 27.93 | 26.86 | 26.03 | 25.98 |
lz4 | 43.47 | 41.87 | 40.52 | 40.48 |
Space used [bytes] | ||||
zlib | 455,947 | 445,863 | 420,472 | 419,452 |
lzma | 364,232 | 352,948 | 335,724 | 335,400 |
bz2 | 331,739 | 321,942 | 305,694 | 305,111 |
lz4 | 516,426 | 501,722 | 475,912 | 475,389 |
Algorithm | pl | en | eo | eox |
---|---|---|---|---|
zlib | 38.38 | 37.2 | 35.8 | 35.71 |
lzma | 30.66 | 29.45 | 28.58 | 28.56 |
bz2 | 27.93 | 26.86 | 26.03 | 25.98 |
lz4 | 43.47 | 41.87 | 40.52 | 40.48 |
Algorithm | pl | en | eo | eox |
---|---|---|---|---|
zlib | 38.38 | 37.53 | 35.40 | 35.31 |
lzma | 30.66 | 29.71 | 28.26 | 28.23 |
bz2 | 27.93 | 27.10 | 25.73 | 25.68 |
lz4 | 43.47 | 42.24 | 40.06 | 40.02 |
Algorithm | pl | en | eo | eox |
---|---|---|---|---|
Zlib | 68 | 68 | 72 | 72 |
Lzma | 444 | 455 | 447 | 446 |
bz2 | 82 | 81 | 81 | 82 |
lz4 | 81 | 93 | 93 | 93 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Stecuła, B.; Stecuła, K.; Kapczyński, A. Compression of Text in Selected Languages—Efficiency, Volume, and Time Comparison. Sensors 2022, 22, 6393. https://doi.org/10.3390/s22176393
Stecuła B, Stecuła K, Kapczyński A. Compression of Text in Selected Languages—Efficiency, Volume, and Time Comparison. Sensors. 2022; 22(17):6393. https://doi.org/10.3390/s22176393
Chicago/Turabian StyleStecuła, Beniamin, Kinga Stecuła, and Adrian Kapczyński. 2022. "Compression of Text in Selected Languages—Efficiency, Volume, and Time Comparison" Sensors 22, no. 17: 6393. https://doi.org/10.3390/s22176393
APA StyleStecuła, B., Stecuła, K., & Kapczyński, A. (2022). Compression of Text in Selected Languages—Efficiency, Volume, and Time Comparison. Sensors, 22(17), 6393. https://doi.org/10.3390/s22176393