3.1. Selection of the Planned Language Esperanto
In 1887, Ludwik Zamenhof published the first version of the international language he had created, known as Esperanto. He created the language in order to reconcile the nations living in the territory he inhabited. These nations communicated in three or more languages, which resulted in frequent misunderstandings, divisions, and prejudices [46,47]. For his language, Zamenhof was nominated eight times for the Nobel Peace Prize [48]. In 1908, the Universala Esperanto Asocio (UEA), i.e., the World Esperantists’ Union, respected by such organizations as the United Nations, UNICEF, UNESCO, the Council of Europe, the International Organization for Standardization (ISO), and the Organization of American States, was founded [49].
Esperanto is an artificial language. Its main characteristic is its regularity, which follows from the immutability of its morphemes. Owing to its unambiguous and analytical nature, Esperanto is very precise. Parts of speech are easily distinguished by their specific grammatical endings, and word order within a sentence is largely free. The grammar of the language is described in Zamenhof’s Foundation of Esperanto (Esperanto: Fundamento de Esperanto) [8].
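The role of grammatical endings can be illustrated with a minimal sketch. It assumes the standard Esperanto endings (-o for nouns, -a for adjectives, -e for derived adverbs, -i for infinitives, -as/-is/-os/-us/-u for finite verb forms, -j for the plural, and -n for the accusative). The function name guess_part_of_speech is hypothetical and is not part of the program described in this paper. The sketch applies to content words only; short function words such as "la" or "kaj" would be labeled incorrectly.

```python
# Hypothetical illustration: guess the part of speech of an Esperanto
# content word from its grammatical ending alone.
# Two-letter verb endings are listed first so they are matched before
# the one-letter endings.
ENDINGS = [
    ("as", "verb (present)"),
    ("is", "verb (past)"),
    ("os", "verb (future)"),
    ("us", "verb (conditional)"),
    ("o", "noun"),
    ("a", "adjective"),
    ("e", "adverb"),
    ("i", "verb (infinitive)"),
    ("u", "verb (imperative)"),
]


def guess_part_of_speech(word: str) -> str:
    """Return a part-of-speech label based only on the word ending."""
    stem = word.lower()
    # Remove the accusative marker -n, then the plural marker -j.
    if stem.endswith("n"):
        stem = stem[:-1]
    if stem.endswith("j"):
        stem = stem[:-1]
    for ending, label in ENDINGS:
        if stem.endswith(ending):
            return label
    return "unknown"


for word in ("hundo", "hundojn", "bela", "rapide", "kuri", "kuras"):
    print(word, "->", guess_part_of_speech(word))
```

A lookup over a short table of endings is enough here, whereas a comparable classifier for Polish or English would typically need a dictionary.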
The vocabulary of Esperanto has its roots in European languages, mainly Romance (around 75%) and Germanic (around 20%) [50]. The Foundation of Esperanto contains several hundred words, and nowadays their number exceeds a thousand. These words form a complete language, as they also serve as the basis for more complex expressions that are formed by combining existing words.
Thus, the Esperanto vocabulary forms a system that yields a range of meanings many times greater than in languages without extensive prefixing and suffixing. Consequently, the existing stock of words makes it possible to create concepts that do not yet have equivalents in natural languages.
Esperanto was not the first artificial (planned) language. It is worth mentioning the Volapük language, which directly preceded Esperanto [51]. The creator of this language, Johann Schleyer, did not agree to any of the reforms proposed by the academy of this language. His resistance led to a schism, and then to a decline in the language’s popularity. Volapük was a regular and schematic language, but its extensive vocabulary was so difficult to remember and pronounce that even the creator himself was not able to use it fluently [52]. Nowadays, the language is still used on the Internet [53]. However, it appears almost exclusively in written form; it is spoken only by hobbyists. After Esperanto, many other artificial languages emerged, such as Occidental (Edgar von Wahl, 1922), Novial (Otto Jespersen, 1928), Interlingua (1951), and Romanid (Zoltan Magyar, 1956) [54]. Moreover, some languages were directly based on Esperanto, including Ido (1907), Reform Esperanto (1910), and Latin Esperanto (1911) [54]. However, it should be emphasized that all these popular languages had the disadvantage of being “naturalistic”. This means that they imitated natural languages to such an extent that they lost their original feature, that is, regularity.
Based on the idea accompanying the Esperanto language, a movement called “Esperantism” was founded. Its main assumptions were the following [8]:
Striving to introduce a neutral human language around the world;
Popularization of Esperanto through practice, including increasing the library of sources (both original and translations);
Waiver of rights by the author of the Esperanto language—Esperanto belongs to everyone;
No one can introduce new rules into the language; the only authoritative source is the Fundamento de Esperanto;
Every user of Esperanto is an Esperantist.
These goals are very similar to the idea behind many current programming languages and software projects, namely creating and sharing open source code. Due to the simplicity, regularity, and specific grammar and vocabulary of Esperanto, the authors decided to undertake research on the use of this language in computer science. The authors studied the grammar, vocabulary, and principles of the language. Based on this analysis, the authors concluded that Esperanto is well suited for use as a human language that can be easily analyzed and processed by computer software. Thanks to the popularity of the language, numerous source materials have also been created that can serve as valuable research sources.
Moreover, due to its regularity, Esperanto has been applied in many projects and has been the subject of many studies in the field of computer science, for example, an algorithm for the morphological segmentation of Esperanto words [55], a study on Zipf’s law [56], a model of the world context [57], and work on enhancing human–machine communication [58]. The authors of [59] discussed the digital presence of Esperanto and assessed its digital vitality on the basis of its language ideology and other sociolinguistic data. Other researchers [60] quantified the irregularity of different European languages belonging to four linguistic families and of an artificial language (Esperanto). They modified a well-known method to calculate the approximate and sample entropy of written texts. Their method searches for regularities in a sequence of symbols and consistently distinguishes between natural and synthetic randomized texts. These research examples, together with the authors’ own experience, inspired the authors of this article to conduct their research on text compression in Esperanto.
3.2. Input Data
For a text compression program to produce meaningful results, it had to operate on a long and varied text as input research material. The authors had to find a text available in each of the three selected languages. Therefore, the study uses the entirety of the novel Quo vadis by the Polish writer Henryk Sienkiewicz, who in 1905 became the first Polish writer to receive the Nobel Prize for Literature [61].
Quo vadis, written in Polish and published in 1896, was recognized as a worldwide bestseller at that time [62]. The novel has been translated into over 30 languages. It is in the public domain, with readily available translations, including English and Esperanto. Selecting this novel provided input texts (input data) with very similar characteristics and length, minimizing the differences that could affect the result of the experiment in an unexpected way.
The authors searched the database of novels in the public domain and found the novel Quo vadis by Henryk Sienkiewicz. The sources of the downloaded novel and its translations were the following:
The data were saved in the .txt format using 8-bit Unicode Transformation Format (UTF-8) encoding.
The most popular compression algorithms were used to compress the text. Four algorithms selected for the research include the following:
The processing and presentation of the collected data included the following:
Examination of the compressed text and measurement of the compression time;
Presentation of the data in text form through a specially created class, which stores the collected data and writes them to the console;
Entering data into the Microsoft Excel spreadsheet table and presenting the results in the form of tables and graphs.
3.3. Research Tools
The compression program in this study was written in Python 3.9.5, a general-purpose high-level language created by Guido van Rossum [66]. The greatest advantages of Python include the readability and transparency of the code, which facilitate code management, development, and debugging. Thanks to these advantages, working in this language is efficient, and the programmer can pay more attention to the logic of the implemented algorithm. Because Python is interpreted, errors can be easily spotted: when execution stops, the program indicates the place that requires attention. Python is a dynamically typed language, so there is no need to specify the type of a variable, and a variable can be used without having been declared previously [67]. Python is widely valued, which is confirmed by the fact that Alphabet Inc., Mountain View, CA, USA (a conglomerate holding company created through a restructuring of Google) chose it as one of its main programming languages [67].
PyCharm is an Integrated Development Environment (IDE) for professional Python developers. It is produced by JetBrains, a company known for other projects such as IntelliJ IDEA, RubyMine, and the Kotlin language [68]. It was chosen as a research tool in this study due to its comprehensive approach to programming, which includes automated code refactoring, autocompletion, convenient keyboard shortcuts, and suggestions for improving readability based on the PEP 8 standard.
UTF-8 was selected as the text encoding. It is a Unicode encoding that is backward compatible with the American Standard Code for Information Interchange (ASCII) [69]. UTF-8 is used by over 98% of the largest websites in the world and by over 97% of all types of websites [70]. It was the optimal choice for the research due to its compatibility with the Polish, English, and Esperanto alphabets.
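The relationship between UTF-8 and ASCII can be checked with a short sketch. The sample strings and the helper name utf8_byte_lengths are illustrative assumptions and are not taken from the authors' program.

```python
# Illustrative check of how UTF-8 encodes the alphabets used in the study.
# The sample strings below are assumptions chosen for demonstration only.
POLISH_LETTERS = "ąćęłńóśźż"
ESPERANTO_LETTERS = "ĉĝĥĵŝŭ"
ASCII_SAMPLE = "Quo vadis"


def utf8_byte_lengths(text: str) -> dict:
    """Map each character to the number of bytes of its UTF-8 encoding."""
    return {ch: len(ch.encode("utf-8")) for ch in text}


# Pure ASCII text has identical bytes under ASCII and UTF-8.
print(ASCII_SAMPLE.encode("utf-8") == ASCII_SAMPLE.encode("ascii"))

# Letters outside ASCII need more than one byte each.
print(utf8_byte_lengths(POLISH_LETTERS))
print(utf8_byte_lengths(ESPERANTO_LETTERS))

# Encoding and decoding is lossless for all three alphabets.
for sample in (POLISH_LETTERS, ESPERANTO_LETTERS, ASCII_SAMPLE):
    assert sample.encode("utf-8").decode("utf-8") == sample
```

English text therefore occupies one byte per letter, while the Polish and Esperanto letters with diacritics occupy two bytes each.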
3.4. Developed Program for Text Compression
The developed text compression program first loads the input data in each of the languages that are the subject of the research. It then compresses the data and saves the outcome. Finally, the results obtained are presented.
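The three stages (loading, compressing, and presenting) can be outlined with a minimal sketch. It is not the authors' program. The compressors used here (zlib, bz2, and lzma from the Python standard library) are assumptions made for illustration, and the function name run_experiment is hypothetical.

```python
import bz2
import lzma
import zlib

# Assumed stand-ins for the algorithms used in the study; all three
# accept a bytes object and return the compressed bytes.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def run_experiment(texts: dict) -> list:
    """Compress every text with every compressor.

    `texts` maps a language code to the raw bytes of the text.
    Returns tuples of (language, algorithm, original size, compressed size).
    """
    results = []
    for language, raw in texts.items():
        for name, compress in COMPRESSORS.items():
            compressed = compress(raw)
            results.append((language, name, len(raw), len(compressed)))
    return results


def present(results: list) -> None:
    """Write one line per measurement to the console."""
    for language, name, original, compressed in results:
        share = 100.0 * compressed / original
        print(f"{language:<4} {name:<5} {original:>8} B -> "
              f"{compressed:>8} B ({share:.2f}%)")
```

For example, present(run_experiment({"eo": data})) would print one line per compressor for the bytes stored in data.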
Figure 1 presents pseudo code for the developed program. The program reads the text of the book in binary form and stores it in the text variable. Each language has a separate text file named after the language/notation abbreviation adopted by the authors. The names are the following:
pl—stands for Polish language;
en—stands for English language;
eo—stands for Esperanto language;
eox—stands for Esperanto language in notation x.
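The loading step can be sketched as follows. The abbreviations come from the list above, while the .txt extension in the file names, the directory argument, and the function name load_texts are assumptions made for illustration.

```python
from pathlib import Path

# Language/notation abbreviations adopted in the study.
LANGUAGE_CODES = ("pl", "en", "eo", "eox")


def load_texts(directory: Path) -> dict:
    """Read each text in binary form, keyed by its abbreviation.

    Assumes files named "<code>.txt" (for example "eo.txt") in `directory`.
    """
    return {
        code: (directory / f"{code}.txt").read_bytes()
        for code in LANGUAGE_CODES
    }
```

Reading the files as bytes means that the measured length is the size of the UTF-8 data on disk and not the number of characters.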
Figure 1. Pseudo code for the developed program.
Each language has its own unique features, presented in Figure 2 and described below, based on the study and the experience of the authors.
Text in Polish—the Polish language is characterized by having characters in its alphabet that are not in ASCII encoding, so each such letter takes up more space.
Text in English—the English language is characterized by the fact that the basic words are short, but the names of more complex concepts often require the use of a few words, so that the volume of the translated text is larger than the original.
Text in Esperanto—Esperanto is characterized by longer base words than the English language, but shorter words for more complicated concepts. Therefore, long and complicated texts are smaller than those in Polish and English.
Text in Esperanto in x notation—in order to exclude the influence of encoding on compression, text written with x notation was used.
Letters with diacritics (for example, “ŝ” or “ŭ”) are replaced by their ASCII base letters followed by the letter x (e.g., “ŝaŭmo” becomes “sxauxmo”). These letters appear in a minority of words, so most of the text remained unchanged.
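The conversion to x notation can be sketched as a simple character substitution. The function name to_x_notation is hypothetical, and the mapping assumes the six Esperanto letters with diacritics in lower and upper case. The input is first normalized to composed Unicode form so that each accented letter is a single character.

```python
import unicodedata

# Esperanto letters with diacritics and their x-notation equivalents.
X_NOTATION = str.maketrans({
    "ĉ": "cx", "ĝ": "gx", "ĥ": "hx", "ĵ": "jx", "ŝ": "sx", "ŭ": "ux",
    "Ĉ": "Cx", "Ĝ": "Gx", "Ĥ": "Hx", "Ĵ": "Jx", "Ŝ": "Sx", "Ŭ": "Ux",
})


def to_x_notation(text: str) -> str:
    """Replace each accented Esperanto letter by its base letter plus x."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(X_NOTATION)


print(to_x_notation("ŝaŭmo"))
```

Characters outside the mapping are left unchanged, which matches the observation that most of the text is not affected.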
Table 1 shows the data on the uncompressed text. The text length in bytes is the same for the eox and eo versions. A letter with a diacritic takes two bytes in UTF-8, and converting it to two ASCII characters yields two single-byte letters, which ultimately does not change the size of the text.
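This equality can be verified on the example word used above. The sketch below is an illustration only; the accented letters are written as Unicode escapes so that the code points are unambiguous.

```python
# "ŝaŭmo" written with explicit code points (U+015D is ŝ, U+016D is ŭ).
eo_word = "\u015da\u016dmo"
eox_word = "sxauxmo"

eo_bytes = eo_word.encode("utf-8")
eox_bytes = eox_word.encode("utf-8")

# The number of characters differs, but the number of bytes does not.
print(len(eo_word), len(eo_bytes))    # 5 characters, 7 bytes
print(len(eox_word), len(eox_bytes))  # 7 characters, 7 bytes
```

Each accented letter contributes two bytes in the eo version and each replacement pair contributes two bytes in the eox version, so the byte totals match.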
Table 2 shows the sample text from Quo vadis in different languages and the corresponding number of characters.
The data listed in the console included, inter alia, the following:
The length of the uncompressed text;
The name of the language for which the compression is performed;
The name of the compression algorithm;
The compression time;
The percentage of space taken in relation to uncompressed text;
The space, in bytes, taken up by the compressed text.
Table 3 shows the data contained in the console.
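A record holding the values listed above might look like the following sketch. The class name CompressionResult, its field names, and the output layout are assumptions. The class created by the authors is not reproduced in this section.

```python
from dataclasses import dataclass


@dataclass
class CompressionResult:
    """One measurement: a language compressed with one algorithm."""
    language: str
    algorithm: str
    original_size: int    # length of the uncompressed text in bytes
    compressed_size: int  # length of the compressed text in bytes
    seconds: float        # mean compression time of a single run

    @property
    def percent_of_original(self) -> float:
        """Space taken in relation to the uncompressed text, in percent."""
        return 100.0 * self.compressed_size / self.original_size

    def report(self) -> str:
        """Format the measurement as a single console line."""
        return (
            f"{self.language:<4} {self.algorithm:<6} "
            f"{self.original_size:>9} B -> {self.compressed_size:>9} B "
            f"({self.percent_of_original:6.2f}%) in {self.seconds:.6f} s"
        )


# The numbers below are made up solely to show the output format.
print(CompressionResult("eo", "zlib", 1000, 250, 0.001).report())
```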
The timeit library was used to measure the compression time. It allows easy measurement of the time taken to execute a function repeatedly. It returns the total time, in seconds, of all completed loops, and this value is then divided by the number of loops to obtain the time of a single run. In the research, each measurement used five hundred repetitions.
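The measurement procedure can be sketched with timeit.timeit, which returns the total time of all loops in seconds. The helper name mean_compression_time and the use of zlib.compress as the measured function are assumptions made for illustration.

```python
import timeit
import zlib

REPETITIONS = 500  # number of loops per measurement, as in the study


def mean_compression_time(compress, data: bytes,
                          repetitions: int = REPETITIONS) -> float:
    """Return the mean time in seconds of one call to compress(data)."""
    # timeit.timeit runs the callable `repetitions` times and returns
    # the total elapsed time of all loops.
    total = timeit.timeit(lambda: compress(data), number=repetitions)
    return total / repetitions


sample = b"Quo vadis, Domine? " * 1000
print(f"{mean_compression_time(zlib.compress, sample):.6f} s per run")
```

Averaging over many loops reduces the influence of short-lived system activity on a single measurement.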
During the research, the authors improved the text compression program. Examples of changes are the following:
Adding a text version with x notation;
Collecting the length of compressed texts in bytes;
Collecting the length of uncompressed texts in bytes;
Adding more algorithms in order to verify the set hypothesis and obtain more reliable results.