Article

Bridging Linguistic Gaps: Developing a Greek Text Simplification Dataset

Department of Informatics, Ionian University, 49100 Corfu, Greece
* Author to whom correspondence should be addressed.
Information 2024, 15(8), 500; https://doi.org/10.3390/info15080500
Submission received: 1 August 2024 / Revised: 14 August 2024 / Accepted: 19 August 2024 / Published: 20 August 2024
(This article belongs to the Special Issue Information Extraction and Language Discourse Processing)

Abstract
Text simplification is crucial in bridging the comprehension gap in today’s information-rich environment. Despite advancements in English text simplification, languages with intricate grammatical structures, such as Greek, often remain under-explored. The complexity of Greek grammar, characterized by its flexible syntactic ordering, presents unique challenges that hinder comprehension for native speakers, learners, tourists, and international students. This paper introduces a comprehensive dataset for Greek text simplification, containing over 7500 sentences across diverse topics such as history, science, and culture, tailored to address these challenges. We outline the methodology for compiling this dataset, including a collection of texts from Greek Wikipedia, their annotation with simplified versions, and the establishment of robust evaluation metrics. Additionally, the paper details the implementation of quality control measures and the application of machine learning techniques to analyze text complexity. Our experimental results demonstrate the dataset’s initial effectiveness and potential in reducing linguistic barriers and enhancing communication, with initial machine learning models showing promising directions for future improvements in classifying text complexity. The development of this dataset marks a significant step toward improving accessibility and comprehension for a broad audience of Greek speakers and learners, fostering a more inclusive society.

1. Introduction

In the current era of information proliferation, effective communication is paramount for the comprehension of and interaction with diverse, often intricate, content. Text simplification, defined as the process of converting complex texts into versions that are more accessible while preserving the original meaning and nuances, is crucial for improving readability across various demographics. This includes individuals with cognitive disabilities, non-native speakers, and people with lower literacy levels.
Text simplification encompasses multiple sub-tasks: identifying text complexity through labels, assessing readability via numerical or categorical levels, and generating simplified text from complex input [1]. Recent advancements in machine learning, particularly with deep learning- and transformer-based models, have substantially enhanced these sub-tasks. These advancements rely heavily on specialized datasets designed for specific languages, domains, and applications [2].
Despite significant progress in English text simplification, other languages, notably Greek, have seen limited development. Greek poses distinct challenges due to its extensive vocabulary, intricate grammatical structures, and flexible syntactic ordering. Addressing these challenges necessitates not only the development of tailored simplification techniques but also the creation of a comprehensive dataset that captures the unique characteristics of the language.
This paper presents the development of a Greek text simplification dataset, detailing its significance and the methodology utilized in its assembly. Our dataset aims to broaden accessibility for diverse audiences, including native speakers across various literacy levels, non-native speakers, individuals with cognitive impairments, and those requiring efficient information processing. Texts sourced from Greek Wikipedia—chosen for its broad subject coverage and structural parallels to Simple English Wikipedia—serve as the foundation of our dataset. This choice mirrors foundational efforts in English text simplification, leveraging the structural and content diversity of Wikipedia.
In constructing this dataset, meticulous attention was devoted to ethical and cultural considerations, ensuring that the simplified texts faithfully preserve original meanings and respect linguistic and cultural nuances. This initiative extends beyond enhancing accessibility for Greek speakers; it also supports the creation of text simplification algorithms specifically designed for Greek linguistic features and grammatical rules. Moreover, the dataset acts as a vital resource for researchers and practitioners to evaluate existing models and to innovate new techniques tailored for Greek texts.
The primary goal of this research is to establish the inaugural comprehensive text simplification dataset for Greek, catalyzing further research and innovation in this domain. By advancing Greek-specific text simplification techniques, this work contributes to a more inclusive society by diminishing language barriers, making information universally accessible, and enhancing effective communication. This endeavor not only facilitates the development of advanced language technologies for Greek but also sets a benchmark for similar initiatives in other under-represented languages. Furthermore, this paper discusses the broader implications of text simplification in an age of ubiquitous information. Simplifying text can expedite and enhance knowledge dissemination and bridge the digital divide, ensuring equitable information access for all, irrespective of linguistic or cognitive abilities. Therefore, the creation of a Greek text simplification dataset represents not just an academic venture but a step toward a more inclusive and well-informed global community.
The remainder of this paper is organized as follows: Section 2 reviews related work, surveying prior initiatives in text simplification across languages and specifically highlighting resource gaps for the Greek language. Section 3 details the methodology used to create the Greek text simplification dataset, including the sources of our texts, selection criteria, and annotation processes. Section 4 describes the technical implementation of the dataset, outlining the software tools employed and the data processing techniques applied. Section 5 presents the experimental evaluation of the dataset, demonstrating the application of machine learning techniques to validate the effectiveness of the text simplification process. Finally, Section 6 concludes the paper, summarizing our findings and outlining future directions for research and development in Greek text simplification.

2. Related Work

Text simplification is a vital aspect of natural language processing (NLP) that seeks to make text more accessible while preserving its original intent and meaning. As an interdisciplinary field, it intersects with other areas of NLP such as text summarization, machine translation, and information extraction [3,4]. These intersections have fostered diverse approaches to simplification, ranging from rule-based to data-driven methodologies, each leveraging different technological advances and linguistic theories.
Substantial developments have been observed in the field of text simplification across natural languages, dataset development processes, learning schemas, and domains. Text simplification has historically been linked to other natural language processing tasks, such as text summarization [5,6], machine translation [7,8,9,10,11] (from which it has adopted training processes and evaluation metrics), and information extraction [12].
Regarding corpus development, text simplification has utilized both supervised and unsupervised learning techniques. Supervised approaches typically involve creating parallel corpora that include manually simplified versions of complex sentences [13]. In contrast, unsupervised methods leverage existing high-resource bilingual translation corpora to generate large-scale pseudo-parallel data for training models. This blend of methods underscores the dynamic and evolving nature of corpus development in simplification, aimed at refining model performance across diverse linguistic settings.
Prominent English datasets include the Newsela corpus [14], the Wikipedia Simple English corpus [15], and the One Stop English corpus [16]. The Newsela corpus, for instance, offers over half a million complex–simple sentence pairs and marks a significant milestone for professional applications. The Wikipedia dataset consists of 140k aligned complex–simple English sentence pairs initially evaluated for improving translation efficiency. The One Stop English corpus targets ESL learners, providing texts at three distinct reading levels. The D-Wikipedia dataset emphasizes document-level text simplification and demonstrates the effectiveness of large-scale datasets in simplifying texts [17].
For other languages, a comprehensive German corpus containing about 211,000 sentences was introduced by [18], expanding upon the work by [19] with more parallel and monolingual data, thus facilitating deeper analysis of text simplification and readability. Recent developments include a German news parallel corpus [20]. The PorSimples project [21] in Brazilian Portuguese and the Simplext Project [22] for Spanish are noteworthy efforts, the former including 4500 sentences from general news and popular science articles and the latter containing 1000 sentences. The first Italian corpus for text simplification was designed and annotated by [23], focusing on children’s literature and educational texts. The Alector corpus [24] includes manually simplified versions of French primary school texts, while advancements in the Swedish language have been achieved through the construction of a pseudo-parallel monolingual corpus for automatic text simplification by [25].
Recent research has expanded into multilingual code comment classification, moving beyond English to include languages like Serbian. Ref. [26] introduced a novel taxonomy and the first multilingual dataset of code comments from languages including C, Java, and Python, annotated for diverse classification tasks. It evaluated the effectiveness of monolingual and multilingual neural models, finding that language-specific models performed best for Serbian, while multilingual models were optimal for English. This approach highlights the potential of advanced language models in multilingual settings and underscores the importance of developing tailored classification tools for software documentation across different languages.
The domain-specific applications of text simplification are varied. Ref. [22] aimed to assist individuals with intellectual disabilities, whereas Ref. [27] developed datasets specifically for simplifying medical texts, indicating the expanding scope of text simplification into specialized fields.
Approaches to text simplification range from lexical-based methods [28], where simpler synonyms replace complex words considering the context [29], to rule-based approaches that utilize syntactic information to identify structural changes [30,31,32,33], and data-driven methodologies. Hybrid approaches, combining data-driven and rule-based methods, have also been proposed [34].
Data-driven methodologies in text simplification vary widely, spanning from knowledge-rich approaches that use syntactically parsed alignments between simple and complex sentence pairs [35], to knowledge-poor methods primarily relying on the availability of appropriate parallel data [36]. The most recent advancements involve neural simplification methods, which utilize encoder–decoder architectures, often augmented with long short-term memory (LSTM) layers [37], and employ word embeddings as input [11]. These embeddings can be pre-trained on large datasets or fine-tuned locally to better capture linguistic nuances [38]. Additionally, these neural models have been expanded to include higher-level semantic information through cognitive conceptual annotations [39], enhancing the ability to maintain semantic integrity during simplification. A reproducibility study provides further insights into the effectiveness and replicability of these sophisticated models, indicating a robust future for neural text simplification [40].
Our research introduces a novel Greek text simplification dataset, encompassing over 7500 sentences, both complex and simplified, derived from Greek Wikipedia [41]. This choice reflects a strategic approach to capturing a broad spectrum of topics and discourse styles, essential for a comprehensive simplification tool. The dataset was developed with the collaboration of a diverse group of annotators from Ionian University, which contributed to a rich understanding of linguistic simplification across different demographics.
In conclusion, this section has not only highlighted the diverse and evolving landscape of text simplification research but also underscored our significant contribution through the development of a unique Greek text simplification dataset. By integrating insights from both historical and contemporary studies, our work addresses the notable under-representation of Greek in text simplification research and sets the stage for future advancements in creating more inclusive and accessible linguistic technologies. This endeavor not only enriches the academic field but also holds promise for real-world applications, potentially improving accessibility for Greek speakers worldwide.

3. Methodology

The methodology employed in the development of the Greek Wikipedia Simplification Dataset [42] is comprehensive, designed to ensure the creation of a robust and reliable resource for text simplification tasks. This section outlines the systematic approach taken from initial data collection through to the final stages of dataset refinement and annotation. Our processes are grounded in rigorous data science practices, combining advanced computational techniques with meticulous manual reviews to produce a dataset of high quality and broad applicability.
We begin by detailing the data collection process, utilizing sophisticated programming tools and APIs to extract a diverse array of text from Greek Wikipedia. Following this, we describe our quality control measures, which are essential to maintaining the integrity and usability of the dataset. The subsequent sections cover the technical implementation and the specific challenges encountered during the project, providing insight into the solutions devised to address these issues. The development of the dataset is then explained, highlighting the collaborative efforts and the strategic expansion of the dataset to include both original and simplified texts. Finally, the annotation guidelines are discussed, which were carefully crafted to ensure consistency and accuracy in the simplifications provided by various contributors.
Through this multi-faceted approach, we aim to deliver a dataset that not only supports current research in natural language processing but also sets a precedent for future work in the field, particularly in enhancing accessibility and comprehension of text in the Greek language.

3.1. Dataset Collection

The data collection for our Greek text simplification dataset was meticulously structured to ensure robustness, scalability, and broad library support, crucial for effectively handling large-scale data extraction tasks. Python, celebrated for its versatile ecosystem and extensive library support, was chosen as the primary programming language for this project. We specifically employed the ‘wikipedia’ and ‘wikipediaapi’ libraries, which offer superior handling of API requests and exceptional flexibility in accessing and parsing large volumes of data. These libraries are particularly well suited for interfacing with complex web resources, making them ideal for systematically extracting structured content from Greek Wikipedia.
The data collection was automated through a custom script, detailed in Algorithm 1. This script was engineered to efficiently fetch data while ensuring a diverse and representative dataset by accessing multiple Wikipedia pages across a variety of subjects. The automation process involved initializing API settings, fetching and processing text, and storing the results in a structured CSV file, which facilitates ease of further processing and analysis.
Algorithm 1 Data Collection from Greek Wikipedia
 1: Import libraries for Wikipedia access and CSV file manipulation.
 2: Set language for Wikipedia access to Greek (el).
 3: Initialize Wikipedia API with language settings.
 4: Set the number of pages to fetch (3000).
 5: Set the number of sentences per page (5).
 6: Open a new CSV file ‘wikiSentences.csv’ in UTF-8 encoding.
 7: Write the header to the CSV file with columns “Page Title” and “Summary”.
 8: Initialize a page counter to zero.
 9: while page counter < number of pages do
10:     Fetch a random page title from Wikipedia.
11:     Retrieve the page object for the fetched title using the API.
12:     if page exists then
13:         Initialize summary variable and sentence counter.
14:         for each section in the page do
15:             Trim the section text and split into sentences.
16:             for each sentence in the section do
17:                 if sentence counter < number of sentences then
18:                     Append sentence to summary.
19:                     Increment sentence counter.
20:                 end if
21:                 if sentence counter == number of sentences then
22:                     Break from the loop.
23:                 end if
24:             end for
25:         end for
26:         if sentence counter == number of sentences then
27:             Write page title and summary to CSV file.
28:             Increment page counter.
29:         end if
30:     end if
31: end while
32: Close the CSV file.
33: Handle exceptions to ensure script stability.
This algorithmic representation not only provides a clear and structured description of the data collection process but also underscores our systematic approach to maintaining the quality and diversity of the dataset. To ensure the integrity and reliability of the data collection process, we implemented robust error handling measures. These included exception handling mechanisms within our data extraction scripts to manage issues such as network interruptions, API limits, and data format errors. Our scripts featured retry logic to attempt data fetching multiple times before logging an error, thus enhancing resilience against transient network or API-related issues.
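To make Algorithm 1 and the error-handling measures above concrete, the following is a minimal Python sketch of the collection loop using the ‘wikipedia’ and ‘wikipediaapi’ libraries named earlier. The user-agent string, the naive sentence splitter, and the use of the full page text in place of per-section iteration are our own simplifying assumptions, and the library constructor signatures vary between versions:

```python
import csv
import re

import wikipedia       # random page title sampling
import wikipediaapi    # structured page access

wikipedia.set_lang("el")
# Recent wikipedia-api releases require a user agent; older ones accepted only the language.
wiki = wikipediaapi.Wikipedia(user_agent="GreekSimplificationDataset/0.1 (research)", language="el")

N_PAGES, N_SENTENCES = 3000, 5

with open("wikiSentences.csv", "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Page Title", "Summary"])
    pages_written = 0
    while pages_written < N_PAGES:
        try:
            title = wikipedia.random()          # fetch one random page title
            page = wiki.page(title)
            if not page.exists():
                continue
            # Naive split on terminal punctuation; the paper's exact splitter is unspecified.
            sentences = re.split(r"(?<=[.!;])\s+", page.text.strip())[:N_SENTENCES]
            if len(sentences) == N_SENTENCES:   # enforce the five-sentence criterion
                writer.writerow([title, " ".join(sentences)])
                pages_written += 1
        except Exception:
            continue                            # skip transient API or network failures
```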
In addition to error handling, we established rigorous data validation measures to uphold data quality. Automated scripts were employed to verify the correctness of data formats and consistency checks were routinely performed to ensure all retrieved data adhered to our specified criteria. This included validating data against predefined schemas, performing checksums to detect data corruption, and manually cross-referencing entries with secondary sources for accuracy and consistency. Our preprocessing steps further involved cleaning the data by removing duplicates, standardizing metadata, and correcting syntactic inconsistencies.
This meticulous attention to detail in the data collection phase is crucial for minimizing errors and inconsistencies, thereby enhancing the overall quality of the dataset. By integrating both robust error management and comprehensive data validation, we significantly improved the reliability and usability of our Greek text simplification dataset for further research and application development.
The strategy of random page selection and sentence extraction, implemented through a programmatically randomized algorithm, was specifically chosen to maximize the representativeness of the dataset across various topics and styles found in Greek Wikipedia. This method ensures that the dataset encapsulates a broad spectrum of the Greek language as used in diverse contexts, which is essential for developing a robust text simplification model that is effective across different domains and text types. Furthermore, this approach is pivotal in capturing the varied linguistic nuances and cultural contexts inherent in the Greek language, thereby contributing significantly to the creation of a comprehensive and effective text simplification tool.

3.2. Quality Control

Ensuring high quality and reliability in our Greek text simplification dataset was paramount, necessitating a comprehensive and methodical quality control (QC) process. This was especially critical given the inherent challenges of manually curating data from a dynamic, expansive open-source platform like Wikipedia.

3.2.1. Quality Control Process

The initial dataset comprised 3000 paragraphs, each randomly extracted from Greek Wikipedia to ensure topic diversity. Distribution of these data to a team of postgraduate students at the Ionian University—all native Greek speakers with academic backgrounds in linguistics, information science, and computer science—bolstered the robustness of the review process.
Each team member was assigned approximately 600 paragraphs, and the review process was broken down into precise steps to ensure thoroughness and accuracy:
1. Content evaluation: Every paragraph was scrutinized for its relevance to the project’s goals, factual accuracy, and information completeness. This evaluation was essential to maintaining the contextual integrity of the data post-simplification.
2. Sentence separation: Paragraphs were meticulously dissected into individual sentences to ensure that each could stand alone meaningfully—a fundamental requirement for effective text simplification.
3. Error identification: A detailed manual inspection was conducted for each sentence to identify and correct misspellings, grammatical errors, and structural inconsistencies. Custom scripts developed in Python supported this process by automating the detection of certain error types, enhancing the efficiency and precision of manual reviews.

3.2.2. Challenges and Resolutions

The QC team faced several notable challenges:
  • Misspellings and grammatical errors: These were the most common issues, accounting for about 80% of all deletions (560 out of 700). Each instance required meticulous manual correction, with support from standardized linguistic rules and automated tools.
  • Empty lemmas: These constituted approximately 20% of deletions (140 out of 700), involving lemmas that were either empty or contained irrelevant data, necessitating their removal to uphold content quality and relevance.
To ensure data uniformity, any paragraph failing to meet the five-sentence criterion was systematically excluded from the dataset.

3.2.3. Outcome of the Quality Control Process

Following rigorous QC, the dataset was refined to 2312 paragraphs, each meticulously vetted for quality and consistency. The final structured dataset, comprising columns for title, paragraph, and sentences 1 through 5, was designed to facilitate easy access and manipulation for both manual analysis and automated processing techniques. This format adheres to standard practices for linguistic datasets and ensures compatibility with various text processing tools.

3.2.4. Implications for Research and Analysis

The stringent QC process underscores our commitment to producing a high-quality research tool, ensuring the dataset serves as a reliable foundation for developing and testing text simplification algorithms tailored to the Greek language. By establishing a high standard for data quality, the dataset emerges as a vital asset for the research community, enabling more precise studies and fostering innovation in text simplification. The detailed documentation of this process also provides an invaluable blueprint for future research endeavors in multilingual dataset preparation, illustrating the challenges encountered and the methods used to overcome them.
This meticulous approach to quality control and its comprehensive documentation ensure that the dataset is not just a collection of texts but a well-curated resource that significantly advances computational linguistics and natural language processing.

3.3. Technical Implementation and Challenges

Our project harnessed advanced programming techniques and natural language processing (NLP) tools to ensure efficient data collection and processing, crucial for the development of the Greek text simplification dataset.
In developing our dataset, we primarily utilized Python for its robust libraries, pivotal for effective natural language processing. Specifically, we employed the Natural Language Toolkit (NLTK) and spaCy. NLTK is a comprehensive library for building Python programs to work with human language data, offering tools for tasks such as text processing, tokenization, and parsing [43,44]. SpaCy, on the other hand, is an open-source software library for advanced natural language processing in Python, designed specifically for production use. It provides efficient and accurate text processing capabilities, making it highly suitable for large-scale linguistic data analysis [45].
The integration of these technologies facilitated the automated simplification of complex sentences on a large scale, streamlining the workflow and enhancing productivity.

3.3.1. Technical Challenges

Despite the advanced tools employed, we encountered several technical challenges that required innovative solutions:
  • Data quality variability: The inherent variability in the quality of data sourced from Wikipedia, ranging from well-curated articles to those with inaccuracies or biases, posed significant challenges. This variability impacted the initial quality of our dataset and required rigorous preprocessing to standardize the input data for further processing.
  • Consistency in simplifications: Ensuring consistency in simplifications across different annotators was critical, especially given the subjective nature of what constitutes a ’simplified’ text. Differences in linguistic interpretation among annotators could lead to inconsistent outputs, affecting the overall quality of the simplified dataset.
  • Rate limiting of API calls: The rate limiting imposed by the Wikipedia API significantly slowed our data collection processes. Frequent API timeouts and restrictions required careful planning and management to optimize data fetching routines without violating usage policies.

3.3.2. Implemented Solutions

To overcome these challenges, we implemented several strategic solutions:
  • Error-handling mechanisms: We developed robust error-handling mechanisms to manage API timeouts effectively. This included setting up retry logic with exponential back-off strategies to handle request failures gracefully and ensure data collection could continue smoothly.
  • Modular annotation framework: A modular annotation framework was established to allow for incremental improvements based on feedback from annotators. This framework included tools for annotators to flag inconsistencies and suggest modifications, which were then reviewed by a supervisory team to ensure they met quality standards before being integrated into the dataset.
  • Machine learning models: To enhance the uniformity and quality of simplifications, we trained machine learning models to predict and automatically correct inconsistencies. These models were built using supervised learning techniques, with training data curated from a subset of manually verified simplifications to ensure high accuracy and relevance.
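As an illustration of the retry logic with exponential back-off mentioned in the first bullet, a minimal sketch follows (the function and parameter names are ours, not taken from the project code):

```python
import random
import time

def fetch_with_retry(fetch, max_retries=5, base_delay=1.0):
    """Call a zero-argument `fetch` callable, backing off exponentially on failure."""
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted; surface the error for logging
            # Delays grow as 1 s, 2 s, 4 s, ... with jitter to avoid synchronized retries.
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```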
This comprehensive approach not only addressed the immediate technical challenges but also laid a foundation for scalable and sustainable dataset development practices. The solutions implemented enhance the reliability and usability of the dataset, ensuring it can serve as a valuable resource for both academic research and practical applications in text simplification.

3.4. Developing the Greek Wikipedia Simplification Dataset

The development of our Greek Wikipedia Simplification Dataset was meticulously structured into strategic phases, each carefully designed to build upon the initial quality control measures to ensure data completeness and quality. This subsection outlines the key steps of dataset expansion, contributor engagement, and final quality assurance, each crafted to enhance the utility and reliability of the dataset.

3.4.1. Dataset Expansion

Originally, the dataset was composed primarily of original sentences extracted from Wikipedia. To extend its utility for text simplification tasks, we expanded the dataset to include simplified versions of each sentence. We introduced a dual-column format for each sentence: one column for the original text and one for its simplified version, from sentence 1 to sentence 5. This structure facilitates detailed linguistic analysis and serves as a practical tool for developing and testing text simplification algorithms. The expansion process involved not only manual simplification by our contributors but also the use of automated text simplification tools to provide initial simplification drafts, which were later refined by human annotators.

3.4.2. Dataset Sharing and Contributor Involvement

Central to our project was the strategy to foster broad access to and collaboration on the dataset. We chose to host the dataset on Google Sheets for its ease of access and collaborative features, which proved invaluable for a distributed project involving multiple contributors. This approach enabled real-time updates and feedback from contributors, enhancing the iterative development of the dataset.
To populate the dataset with high-quality simplified sentences, we engaged in diverse strategies:
  • We utilized social media channels and university networks to recruit contributors, focusing particularly on students from the Departments of Informatics and Foreign Languages at the Ionian University.
  • Approximately 40 students from the Department of Informatics, enrolled in an Artificial Intelligence course, contributed over 400 simplified sentence proposals within just one hour, highlighting the effectiveness of integrating academic coursework with research projects.
  • We also welcomed contributions from 5 students specializing in Foreign Languages, Translation, and Interpreting, adding a rich layer of linguistic diversity and expertise to the dataset.

3.4.3. Demographic Diversity of Contributors

The contributors, ranging in age from 18 to 60 years old and coming from various regions across Greece, brought a wealth of linguistic expressions and perspectives. This demographic diversity not only enriched the dataset but also ensured its applicability to a broad spectrum of linguistic styles and preferences, thereby enhancing its authenticity and applicational breadth.

3.4.4. Quality Control of Simplified Sentences

The quality control process for the simplified sentences was as rigorous as the initial data curation phase. Each simplified sentence underwent a meticulous review to ensure it met our standards of simplicity, accuracy, and linguistic integrity. Corrections were applied to address any inconsistencies or errors identified during the review, affirming our commitment to providing a dataset that researchers and practitioners can rely on for accurate and effective text simplification.

3.4.5. Implications for Research and Analysis

The comprehensive development process not only confirms the dataset’s quality but also underscores its potential as a pivotal resource for advancing text simplification research, particularly within the Greek language context. This dataset exemplifies the collaborative and interdisciplinary approach necessary for successful linguistic tool development, standing as a model for similar initiatives in other languages.
The meticulous and structured development of this dataset not only enhances the field of text simplification but also contributes significantly to the broader domain of computational linguistics and natural language processing, providing a robust tool for future academic and practical applications.

3.5. Annotation Guidelines

The formulation of detailed annotation guidelines was a cornerstone of the Greek Wikipedia Simplification Dataset development, aimed at enhancing natural language processing capabilities for the Greek language. These guidelines were meticulously crafted to instruct contributors on how to simplify complex linguistic structures while ensuring semantic accuracy and readability, making the dataset a vital resource for researchers and practitioners involved in text simplification and readability enhancement.

3.5.1. Guideline Overview

The guidelines were carefully structured to systematically address several key aspects of language simplification, ensuring a uniform approach across all annotations:
1. Identify complex elements: Contributors were trained to detect complex words, phrases, or sentence structures, highlighting elements like passive constructions, idiomatic expressions, jargon, and technical terms that may obscure understanding for average readers.
2. Simplify vocabulary:
  • This process involved replacing complex or advanced vocabulary with simpler, more commonly understood words, ensuring that the substitutes maintained the semantic integrity of the original text.
  • Employing widely recognized synonyms was encouraged to enhance comprehension without sacrificing content quality.
3. Shorten sentences: Long sentences were divided into shorter, more digestible segments, utilizing appropriate punctuation such as periods to break down complex sentence structures into manageable parts.
4. Simplify sentence structure: Simplification efforts included converting passive voice to active voice where feasible, eliminating unnecessary subordinate clauses, and adopting direct and straightforward sentence constructions to enhance clarity.
5. Clarify meaning: It was imperative that the simplified version retained the original sentence’s meaning and intent. Additional context was incorporated where necessary to clarify any ambiguous phrases or implicit content.
6. Check for coherence and cohesion: Ensuring a logical flow within and between sentences was critical. Transition words and phrases were used as needed to maintain coherence across the text.

3.5.2. Training and Consistency

To ensure the effective application of these guidelines, contributors underwent comprehensive training sessions that included practical exercises and feedback loops to refine their simplification skills. We utilized online workshops and provided detailed documentation and examples to facilitate understanding and adherence to the guidelines.

3.5.3. Collaborative Efforts and Impact

The collaborative efforts to develop the dataset involved students, faculty, and the broader community, covering a wide demographic range. Engagement strategies included targeted outreach through social media and academic networks, leveraging the diverse linguistic perspectives of participants from various regions across Greece.
This extensive collaboration led to the generation of 3520 simplified sentences from an initial pool of 14,578 sentences. The high conversion rate underscores the effectiveness and precision of our annotation guidelines, reflecting a significant enhancement of the dataset’s value and applicability for future research. The community-driven approach not only enriched the dataset but also fostered a sense of ownership and commitment among contributors, ensuring the guidelines’ continuous improvement and adaptation based on real-world usage and feedback.

4. Implementation

This section delves into the detailed implementation strategies utilized to analyze and refine the Greek Wikipedia Simplification Dataset. Employing a comprehensive suite of statistical tools and Python libraries, we meticulously examined various textual characteristics and their implications for text simplification. Each step, from preprocessing to detailed metric analysis, was executed with precision, ensuring the dataset not only meets the rigorous technical requirements for NLP tasks but also addresses the practical needs of users requiring simplified text. The insights gained here not only deepen our understanding of the dataset’s structure and complexities but also pave the way for its application in enhancing readability and accessibility across diverse user groups. This methodological foundation supports not just theoretical exploration but also practical applications, setting the stage for subsequent sections that build on these initial analyses.

4.1. Statistical Description of the Dataset from Greek Wikipedia

The Greek Wikipedia Simplification Dataset comprises 7545 sentences, split between 4025 complex sentences and 3520 simplified counterparts. These sentences were extracted from 779 diverse paragraphs, encompassing a wide array of topics such as history, geography, landmarks, famous personalities, and more. This diversity ensures the dataset’s comprehensive nature, making it a valuable resource for various NLP applications focused on text simplification. The data are meticulously organized into two columns; one column lists the sentences, while the adjacent column categorizes each sentence as “simple” or “complicated”. This dual-column format is instrumental in facilitating straightforward access and manipulation for subsequent analytical processes, thereby enhancing the efficiency of data handling and analysis.

4.1.1. Preprocessing and Metrics Calculation

The initial preprocessing steps were critical in standardizing the dataset to ensure uniformity and reliability in subsequent analyses. These steps included converting all text to lowercase to eliminate case sensitivity issues, removing punctuation and accent marks to simplify text analysis, and stripping common Greek articles to reduce noise in the data. Such preprocessing was executed using Python’s pandas library, leveraging its robust capabilities for string operations and regular expressions to efficiently process large volumes of text.
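A minimal pandas sketch of these preprocessing steps is shown below; the column name, the example sentence, and the article list are illustrative assumptions rather than the project’s actual code:

```python
import re
import unicodedata

import pandas as pd

df = pd.DataFrame({"sentence": ["Η Αθήνα είναι η πρωτεύουσα της Ελλάδας."]})

# A small, non-exhaustive set of common Greek articles to strip.
GREEK_ARTICLES = {"ο", "η", "το", "οι", "τα", "του", "της", "των",
                  "τον", "την", "τους", "τις"}

def strip_accents(text: str) -> str:
    # Decompose characters (NFD) and drop combining marks, removing the Greek tonos.
    return "".join(c for c in unicodedata.normalize("NFD", text)
                   if unicodedata.category(c) != "Mn")

def preprocess(text: str) -> str:
    text = strip_accents(text.lower())          # lowercase, then remove accent marks
    text = re.sub(r"[^\w\s]", "", text)         # remove punctuation
    tokens = [t for t in text.split() if t not in GREEK_ARTICLES]
    return " ".join(tokens)

df["clean"] = df["sentence"].apply(preprocess)
print(df["clean"].iloc[0])  # -> "αθηνα ειναι πρωτευουσα ελλαδας"
```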

4.1.2. Statistical Analysis, Findings, and Implications

Following the preprocessing, we conducted a detailed statistical analysis to assess the complexity and readability of the sentences in the dataset. The key metrics calculated included average sentence length (ASL), average word length (AWL), type–token ratio (TTR), and sentence length standard deviation (SD) [46]. Here are the findings:
  • Average sentence length (ASL): There was a notable reduction in sentence length, with ASL decreasing from an average of 20.42 words per sentence for complex sentences (totaling 82,215 words across 4025 sentences) to 13.65 words for simplified sentences (totaling 48,073 words across 3520 sentences). This reduction indicates that our simplification strategies effectively minimized sentence length, enhancing readability without compromising content delivery.
  • Average word length (AWL): The average word length saw a slight decrease from 5.62 to 5.56 characters. This minimal change suggests that the simplification process was achieved primarily through restructuring and condensing sentences rather than simplifying individual words, which is often crucial for maintaining the integrity of the information conveyed.
  • Type–token ratio (TTR): The TTR slightly increased from 0.93 to 0.98, indicating a richer vocabulary in the simplified sentences. This increase can be attributed to the careful selection of vocabulary that maintains or enhances the quality of the text while ensuring it is accessible to a wider audience.
  • Sentence length standard deviation (SD): The standard deviation of sentence lengths decreased from 11.45 to 7.43, reflecting a more uniform distribution of sentence lengths. This uniformity is significant for readability as it suggests a more consistent level of simplification across different sentences, facilitating easier understanding for readers.
Additionally, utilizing the Pearson correlation coefficient, we identified a strong positive correlation (approximately 0.81) between the length of complex sentences and their extent of simplification. This correlation strongly indicates that longer sentences tend to be simplified more extensively, which is aligned with our goals of reducing sentence complexity to enhance readability.
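A compact sketch of how these four metrics and the length correlation can be computed is given below. Note that the type–token ratio depends on the window over which types are counted (the high values reported here suggest a per-sentence computation), and `statistics.correlation` requires Python 3.10 or later:

```python
import statistics

def complexity_metrics(sentences):
    """ASL, AWL, TTR, and sentence-length SD for a list of whitespace-tokenized sentences."""
    token_lists = [s.split() for s in sentences]
    lengths = [len(toks) for toks in token_lists]
    words = [w for toks in token_lists for w in toks]
    return {
        "ASL": sum(lengths) / len(lengths),              # average sentence length (words)
        "AWL": sum(len(w) for w in words) / len(words),  # average word length (characters)
        "TTR": statistics.mean(len(set(t)) / len(t)      # per-sentence type-token ratio
                               for t in token_lists if t),
        "SD": statistics.pstdev(lengths),                # sentence-length standard deviation
    }

def length_reduction_correlation(pairs):
    """Pearson correlation between complex-sentence length and words removed (aligned pairs)."""
    complex_len = [len(c.split()) for c, _ in pairs]
    reduction = [len(c.split()) - len(s.split()) for c, s in pairs]
    return statistics.correlation(complex_len, reduction)
```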
These analytical insights are crucial as they not only validate the effectiveness of our text simplification processes but also highlight areas where further refinements could be made. The detailed analysis helps in understanding how different aspects of sentence structure contribute to overall text complexity and readability.
This statistical analysis provides profound insights into the text simplification process employed in the dataset’s development. The reduction in sentence complexity, minimal changes in word length, and the enriched vocabulary usage collectively fulfill the dual objectives of simplification: to enhance readability while maintaining or enhancing lexical richness. The increased uniformity in sentence structures particularly benefits diverse reader groups, such as individuals with learning challenges, non-native speakers, and younger audiences. These enhancements make the dataset an invaluable tool for advancing research in text simplification and improving linguistic accessibility.
Table 1 summarizes the metrics for complex and simplified sentences, highlighting the quantitative aspects of these improvements and providing a clear, visual representation of the effectiveness of our simplification strategies.
These detailed metrics collectively underscore the technical effectiveness of the text simplification process and highlight its practical benefits. By making the text more accessible and easier to understand, the simplification efforts enhance the dataset’s applicability not only to academic research but also to real-world applications in various fields that require simplified text. This reinforces the dataset’s value as a tool for improving linguistic accessibility across diverse user groups.

4.2. Metrics

Quantitative assessment of the complexity and diversity of the Greek Wikipedia Simplification Dataset was crucial for understanding the textual characteristics of the sentences and provided a foundational basis for further analyses, such as language modeling and simplification effectiveness.

4.2.1. Word Count and Sentence Length Metrics

To thoroughly examine the length and complexity of sentences within the dataset, we utilized several key metrics:
  • Word count: Each sentence was tokenized into individual words using the nltk library, which has specific capabilities for processing Greek language text. This word count provides an immediate measure of sentence length and complexity, offering insights into the quantitative aspects of the dataset’s content.
  • Average word count: Calculating the average word count across all sentences gives a general measure of sentence length and complexity within the dataset. This metric is instrumental in understanding typical sentence construction and identifying potential areas for simplification.
  • Average word count by labels: We analyzed the average word count for sentences classified under “simple” and “complicated” labels. This analysis highlights differences in sentence length that correlate with each label, offering deeper insights into the dynamics of text simplification and the effectiveness of the simplification processes applied.

4.2.2. Lexical Diversity Metrics

Lexical diversity is another critical dimension of text analysis, particularly in the context of text simplification:
  • Vocabulary size: We computed the total number of unique words used across the dataset to assess its lexical diversity and richness. A larger vocabulary size indicates a wider variety of language usage, crucial for comprehensive text simplification tasks and ensuring that the simplified text remains engaging and informative.
  • Mean sentence length by labels: Calculating the mean sentence length for each classification label (“simple” and “complicated”) provided insights into how sentence complexity varies with the type of content. This metric helps in understanding the changes in sentence structure that occur with simplification and is vital for evaluating the success of the simplification strategies employed.
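The sketch below computes the word-count and lexical-diversity metrics of Section 4.2.1 and Section 4.2.2 with nltk and pandas; the file name and column names are assumptions for illustration:

```python
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # one-time download of the tokenizer models

df = pd.read_csv("dataset.csv")     # assumed columns: "sentence", "label"

df["tokens"] = df["sentence"].apply(lambda s: word_tokenize(s, language="greek"))
df["word_count"] = df["tokens"].apply(len)

vocabulary = {w.lower() for toks in df["tokens"] for w in toks if w.isalpha()}
print("Vocabulary size:", len(vocabulary))

print(df.groupby("label")["word_count"].mean())   # average word count by label
print(df.groupby("label")["sentence"].apply(      # mean sentence length in characters
    lambda s: s.str.len().mean()))
```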

4.2.3. Implementation of Statistical Tools

Throughout the statistical analysis, we employed a suite of Python libraries to facilitate data manipulation, natural language processing, and visualization:
  • pandas was used for its powerful data manipulation capabilities, allowing us to organize and preprocess the dataset effectively.
  • nltk facilitated detailed natural language processing tasks, such as tokenization and vocabulary analysis, tailored to the Greek language.
  • matplotlib.pyplot was crucial for visualizing data trends, enabling us to graphically represent the variations in word count, sentence length, and lexical diversity.
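As a small example of the visualization step, the snippet below (continuing from the DataFrame sketched earlier) draws a word-frequency bar chart in the style of the figures in Section 5:

```python
from collections import Counter

import matplotlib.pyplot as plt

# Count alphabetic tokens across the dataset and plot the ten most common.
freq = Counter(w.lower() for toks in df["tokens"] for w in toks if w.isalpha())
words, counts = zip(*freq.most_common(10))

plt.bar(words, counts)
plt.xticks(rotation=45)
plt.ylabel("Frequency")
plt.title("Most common words in the dataset")
plt.tight_layout()
plt.show()
```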
The insights gained from these statistical analyses illuminate key characteristics of the dataset, such as variations in sentence length, word count, and lexical diversity across different sentence types. These findings are pivotal not only for understanding the structure of the dataset but also for guiding subsequent analyses and modeling efforts. By laying a strong quantitative foundation, this section prepares the ground for detailed exploration and discussion of text simplification processes applied to the Greek language, ultimately enhancing the dataset’s applicability to both academic research and practical language processing tasks.

5. Experimental Evaluation

In this section, we assess the effectiveness of various text simplification strategies applied to the Greek Wikipedia Simplification Dataset. Through a series of experiments, we evaluate the dataset’s structural and linguistic characteristics, employ advanced statistical analyses to explore word and sentence complexity, and implement machine learning models to classify sentences based on their complexity. This comprehensive evaluation not only demonstrates the practical applications of our methodologies but also highlights the challenges and achievements in automating text simplification processes. Each subsection is designed to provide a detailed insight into the dataset’s composition and our efforts to refine text simplification techniques, ensuring that the results are robust, interpretable, and actionable for future research and practical applications.

5.1. Descriptive Statistics

The Greek Wikipedia Simplification Dataset comprises a total of 7545 sentences, characterized by an average sentence length of 18.78 words. This key metric provides crucial insights into the typical sentence length and complexity within the dataset, highlighting the extent of information each sentence conveys. Such a measure is instrumental in understanding the baseline readability and accessibility of the text, serving as a benchmark for evaluating the effectiveness of various simplification techniques.
To further understand the dataset’s characteristics, we also computed several additional descriptive statistics:
  • Mean word length: This statistic measures the average number of characters per word across all sentences, providing an indication of lexical complexity. A higher mean word length could suggest the prevalence of complex vocabulary, which is an essential aspect in determining the necessity and approach for simplification.
  • Standard deviation of sentence length: Calculating the variability in sentence lengths across the dataset allows us to gauge the consistency in sentence structure. A high standard deviation might indicate a wide range of sentence lengths, posing challenges for standard simplification algorithms that perform best under uniform conditions.
  • Distribution of sentence types: The proportion of sentences classified as “simple” versus “complicated” is analyzed to assess the balance and representativeness of the dataset. This distribution is crucial for ensuring that simplification models are trained and tested on a balanced mix of sentence complexities, which helps in generalizing the model’s effectiveness across different types of text.

5.2. Word Frequency Analysis

The vocabulary size of the Greek Wikipedia Simplification Dataset is 25,636 unique words, reflecting the extensive lexical diversity and richness of the Greek Wikipedia text. This diversity is indicative of the broad and varied content covered by the dataset, making it an invaluable resource for developing robust text simplification models.

5.2.1. Analysis of Common Words across the Dataset

Table 2 below lists the top 10 most common words found in the dataset. This analysis helps to identify prevalent linguistic patterns and common terms used across various topics, providing insight into the thematic elements that dominate the Greek Wikipedia articles included in the dataset.
The presence of verbs like ’was/were’, ’have/has’, and ’was born’ at high frequencies indicates a common narrative style that involves descriptions of historical events, biographical data, and factual statements. These findings suggest that text simplification strategies might need to focus on simplifying historical and biographical content, which is prevalent across the dataset.
To further examine the overlap in vocabulary between ‘complicated’ and ‘simple’ sentences, we analyzed the shared and unique words in each category. The following Venn diagram in Figure 1 illustrates the intersection and distinct elements of these vocabularies.
In the Venn diagram, the dataset shows 11,853 unique words in complicated sentences and 4752 in simple sentences, with a substantial overlap of 8235 words shared between the two. This substantial overlap underscores the complexity in distinguishing between ‘complicated’ and ‘simple’ based purely on word usage. It suggests that while there is significant lexical overlap, the differentiation in complexity might instead arise from sentence structure, syntax, or the context in which these words are used.
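The counts behind such a Venn diagram reduce to simple set operations; a sketch over the DataFrame used in the earlier snippets follows (label values assumed to be ‘simple’ and ‘complicated’):

```python
# Unique and shared vocabulary between the two sentence categories.
simple_vocab = {w.lower() for toks in df.loc[df["label"] == "simple", "tokens"]
                for w in toks if w.isalpha()}
complex_vocab = {w.lower() for toks in df.loc[df["label"] == "complicated", "tokens"]
                 for w in toks if w.isalpha()}

shared = simple_vocab & complex_vocab
print("Complicated only:", len(complex_vocab - shared))
print("Simple only:", len(simple_vocab - shared))
print("Shared:", len(shared))
```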
This analysis underscores the importance of considering factors beyond individual word frequency when developing text simplification strategies, especially in languages as structurally rich as Greek. The insights drawn from this Venn diagram will guide future enhancements in our dataset and simplification algorithms, focusing more on syntactic transformations and contextual adaptation.

5.2.2. Most Common Words in Simple Sentences

Table 3 presents the top 25 most common words in sentences categorized as ‘simple’. This analysis aids in understanding the lexical choices that characterize simpler Greek sentences, which can inform the development of simplification strategies.
The common words in simple sentences are largely similar to those found throughout the dataset, indicating that simplification does not necessarily involve changing the words but rather modifying how they are used in sentences. The frequency of these words in simpler contexts underscores their importance in creating accessible content, suggesting that simplification efforts may benefit from focusing on these terms to enhance readability.
Figure 2 provides a visual representation of the frequencies of the most common words in simple sentences. This visualization helps to immediately grasp the relative prevalence of these words, offering an intuitive understanding of their dominance in the dataset’s simpler content.
The bar graph in the figure underscores the frequency with which certain terms appear, highlighting how often simpler linguistic constructions are employed in the dataset. For example, common verbs like ήταν (was/were) and έχει (have/has) appear more frequently, suggesting their pivotal role in constructing straightforward sentences. This visual analysis not only validates the findings from the table but also enriches our understanding by showing the stark differences in usage frequency among the top words. Such insights are instrumental for developing targeted simplification strategies, as they indicate which words and structures should be prioritized to make texts more accessible.
This figure, along with the detailed tabular analysis, effectively guides the development of simplification algorithms by illustrating the linguistic features that are most common in simpler sentences. By focusing on these high-frequency words, text simplifiers can ensure that their efforts are both efficient and impactful, making the content more accessible and engaging for a wider audience.

5.2.3. Most Common Words in Complicated Sentences

Table 4 provides a detailed look at the top 25 most common words found in sentences labeled as ‘complicated’. This analysis reveals the vocabulary frequently appearing in more complex Greek sentence structures, offering insights into the language usage that typically characterizes intricate textual expressions.
The vocabulary listed often pertains to biographical details, geographical locations, and historical contexts, reflecting the complex nature of the sentences from which they are drawn. The frequent use of words like ‘founded’, ‘located’, and ‘studied’ suggests that these sentences often describe detailed, specific events or narratives requiring a higher level of linguistic comprehension.
Figure 3 visually represents the frequency of these words, providing a clear depiction of their prominence within complicated sentences. This graphical illustration helps in quickly identifying which words contribute most to the complexity of texts, guiding the focus for potential simplification efforts.
The analysis shown in the table and figure allows us to pinpoint specific lexical elements that might pose comprehension challenges. These insights can inform the development of simplification strategies, such as replacing or explaining these high-frequency complex words, to make texts more accessible without diluting their informational value. Understanding these usage patterns is crucial for creating effective simplification models that maintain the informative nature of texts while enhancing their readability and accessibility.

5.3. Average Word Count by Categories

The categorization of sentences into ’simple’ and ’complicated’ provides a unique lens through which to view textual density and complexity. Understanding these differences is crucial for tailoring text simplification strategies that can effectively reduce complexity without sacrificing content richness or readability.
Table 5 presents the average word count for each category, illustrating clear distinctions in sentence construction between simple and complicated sentences. This metric is particularly telling of the syntactic and lexical adjustments needed when simplifying text.
The significant difference in average word count between complicated and simple sentences underscores the inherent complexity in structuring information within the dataset. With complicated sentences averaging 22.35 words and simple sentences significantly shorter at 14.70 words, the data illustrate a clear trend towards higher syntactic and informational density in more complex constructions. This disparity highlights key areas for simplification, suggesting a strategic reduction in word count could make complex sentences more accessible without diminishing their informational value. Such insights are crucial for developing effective text simplification algorithms that aim to enhance readability while maintaining content fidelity.

5.4. Sentence Length by Categories

The differentiation between ’simple’ and ’complicated’ categories extends beyond just the number of words used; it also encompasses the overall length of sentences, which can be measured in characters, letters, or tokens. This measure provides deeper insights into the structural complexities and readability challenges associated with each category.
Table 6 details the mean sentence length and the standard deviation for each category, highlighting differences that are crucial for understanding how to approach text simplification effectively.
The data in Table 6 show a considerable difference in sentence length between the complicated and simple categories, with complicated sentences averaging 132.92 characters and simple sentences significantly shorter at 87.21 characters. This substantial variance not only indicates the higher syntactic and informational density found in complicated sentences but also underscores the need for simplification strategies that reduce sentence length while preserving the essential content. By focusing on these metrics, text simplification efforts can be more precisely tailored to enhance the readability of complicated sentences, making the text accessible to a broader audience.
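The Table 6 figures follow from the same kind of computation, measuring length in characters rather than words; the sketch below reuses the toy df from the Section 5.3 snippet.
```python
# Sentence length in characters (df as in the Section 5.3 sketch).
df["char_len"] = df["sentence"].str.len()
print(df.groupby("category")["char_len"].agg(["mean", "std"]).round(2))
# Full-dataset values per Table 6: 132.92 ± 11.45 (complicated), 87.21 ± 7.43 (simple).
```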

5.5. Mean Word Length by Categories

The analysis of mean word length serves as a finer-grained measure of textual complexity and lexical sophistication, as noted in Section 4.1.2. It highlights subtle distinctions in linguistic construction that may not be immediately evident through broader metrics such as sentence or word count. The mean word length metric is particularly valuable for revealing the degree of lexical complexity typically associated with ‘complicated’ versus ‘simple’ sentences.
Table 7 presents the mean word length for each category, providing quantitative evidence of these nuanced differences.
The data in Table 7 show a slight variance in mean word length between complicated and simple sentences: 5.62 for complicated and 5.56 for simple. Although the difference is minimal, it suggests that more complex sentences tend to use slightly longer words, potentially indicating higher lexical richness or the use of more specialized terminology. This slight increase in word length can contribute to the overall cognitive load required to process complicated sentences, making them potentially more challenging for certain reader groups.
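Again reusing the toy df, the following sketch reproduces the Table 7 computation; the Unicode-aware regex is an implementation choice so that Greek letters tokenize correctly.
```python
# Mean word length per category (df as above); the regex keeps letter runs only.
tokens = df.assign(word=df["sentence"].str.findall(r"[^\W\d_]+")).explode("word")
tokens["length"] = tokens["word"].str.len()
print(tokens.groupby("category")["length"].mean().round(2))
# Full-dataset values per Table 7: 5.62 (complicated), 5.56 (simple).
```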

5.6. Punctuation Distribution by Categories

Punctuation marks play a crucial role in shaping the readability and syntactic structure of text. The distribution of punctuation such as commas, periods, and semicolons can significantly influence the flow and comprehension of sentences. In the context of text simplification, understanding how punctuation varies between ‘simple’ and ‘complicated’ sentences can provide insights into the structural complexities that need to be addressed.
Table 8 provides a breakdown of punctuation usage within each category, revealing patterns that are instrumental in understanding the structural distinctions between simple and complicated sentences.
The data from Table 8 show that complicated sentences tend to have a higher count of periods, which may suggest a preference for shorter, more definitive statements within more complex constructs. On the other hand, the prevalence of commas in simple sentences indicates a greater use of compound and complex sentence structures that require additional clarifications or pauses for easier reading.
This distinction in punctuation usage not only affects the rhythm and pace of reading but also impacts how information is processed by the reader. Periods typically signal the end of a thought or statement, contributing to a staccato effect in more complicated sentences, while commas facilitate a smoother flow, potentially making simple sentences easier to digest.
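A sketch of the Table 8 tally follows, reusing the same toy df; the set of punctuation marks counted is our own selection mirroring the table.
```python
# Tallying selected punctuation marks per category (df as above).
for category, group in df.groupby("category"):
    text = " ".join(group["sentence"])
    print(category, {mark: text.count(mark) for mark in ".,;"})
# Full-dataset counts per Table 8: 1394 periods (complicated), 1671 commas (simple).
```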

5.7. Using the Dataset for Identifying Complex Text

In an effort to automate the differentiation between ‘simple’ and ‘complicated’ Greek sentences, we utilized the RapidMiner software v9.0 to develop a model that classifies text based on its complexity. This endeavor is part of a broader initiative to apply machine learning techniques to enhance text simplification methodologies.

5.7.1. Machine Learning Models and Configuration

To tackle this classification task, we first implemented the k-nearest neighbors (KNN) algorithm, which is renowned for its simplicity and effectiveness in many classification contexts. KNN classifies examples by a plurality vote of its neighbors, with the example being assigned to the class most common among its k nearest neighbors.
We also tested the naive Bayes and support vector machine (SVM) algorithms to broaden our approach. Naive Bayes is known for its efficiency and has been successful in various text classification tasks due to its assumption of independence between predictors. SVM was chosen for its ability to handle high-dimensional spaces and its effectiveness in complex classification landscapes, particularly with its use of the radial basis function (RBF) kernel.
Each algorithm was evaluated using a 10-fold cross-validation method to ensure the robustness and generalizability of the results. This approach reduces variance and ensures that every observation from the original dataset has the chance of appearing in training and test sets.
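The experiments themselves were run in RapidMiner, but for readers working in code, the sketch below sets up a comparable configuration in scikit-learn. The TF-IDF features, neighbor count, and other hyperparameters are our assumptions, not a reproduction of the exact RapidMiner operators.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# One pipeline per algorithm; SVC defaults to the RBF kernel.
pipelines = {
    "KNN": make_pipeline(TfidfVectorizer(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "SVM (RBF)": make_pipeline(TfidfVectorizer(), SVC(kernel="rbf")),
}

# 10-fold cross-validation, stratified so both classes appear in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
```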

5.7.2. Evaluation Metrics

To thoroughly assess the effectiveness of each model, we relied on multiple metrics:
  • Accuracy: Measures the overall correctness of the model across both classes.
  • Precision: Indicates the accuracy of positive predictions, essential for determining the reliability of predictions for ‘complicated’ sentences.
  • Recall: Measures the model’s ability to identify all relevant instances, crucial for ensuring that all complex texts are correctly identified.
  • Area under the curve (AUC): Provides an aggregate measure of performance across all possible classification thresholds.
These metrics help elucidate the strengths and weaknesses of each model in handling the classification task.
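Continuing the scikit-learn sketch, all four metrics can be gathered in a single cross-validation pass; X and y denote the sentence texts and binary labels (1 = ‘complicated’), an encoding we assume for illustration.
```python
from sklearn.model_selection import cross_validate

# X: iterable of sentence strings; y: 0/1 labels with 1 = 'complicated'
# (assumed encoding). `pipelines` and `cv` come from the previous sketch.
scoring = ["accuracy", "precision", "recall", "roc_auc"]
for name, pipe in pipelines.items():
    scores = cross_validate(pipe, X, y, cv=cv, scoring=scoring)
    print(name, {m: round(scores[f"test_{m}"].mean(), 3) for m in scoring})
```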
The results of this evaluation are detailed in Table 9, which compares the performance of the models across these metrics.
The KNN model shows modest accuracy at 52.12%, with a precision of 55.57% and a recall of 50.77%. Its performance indicates a balanced approach to both false positives and false negatives, suggesting a moderate ability to generalize across the dataset without significant bias towards either class. However, the precision breakdown between simple and complicated sentences (48.84%/55.57%) suggests a slightly better handling of complicated sentences over simple ones.
In contrast, the naive Bayes model exhibited the lowest performance among the models, with an accuracy of just 43.87%, a precision of 44.20%, and notably low recall at 20.39%. The poor recall indicates a significant number of false negatives, where many complex sentences are likely misclassified as simple. This model’s AUC of 0.48 nearly mirrors random guessing, emphasizing its limited capability in this specific classification context.
The SVM model, utilizing an RBF kernel, demonstrated the highest recall at an impressive 97.49%, suggesting it is highly effective at identifying complicated sentences. However, this comes at the cost of precision, particularly for simple sentences (24.63%), indicating a high rate of false positives, where simple sentences are incorrectly labeled as complicated. Its overall accuracy stands at 52.41%, and the AUC of 0.90 indicates excellent model performance in distinguishing between classes under ideal conditions. Yet, the class precision disparity suggests overfitting, particularly in recognizing simple sentences, which is a critical area for further adjustment.
While the initial results of our models indicate only modest success in classifying text as ‘simple’ or ‘complicated’, these outcomes highlight several critical areas for future research. The performance limitations observed point to the need for more tailored approaches that consider the unique aspects of the Greek language. We are particularly interested in exploring more sophisticated machine learning techniques, such as deep learning models, that can potentially capture the nuances of Greek syntax and morphology more effectively. Additionally, expanding the dataset and incorporating more varied linguistic features are likely to improve the training process, allowing for more nuanced understanding and classification capabilities. These steps will form the basis of our ongoing efforts to enhance the models’ accuracy and reliability.
These outcomes highlight the complexities of applying machine learning to natural language processing, especially in distinguishing text complexities in a nuanced language like Greek. The variance in model success rates underscores the need for continued refinement of approaches, possibly integrating more sophisticated or tailored algorithms that can better handle the idiosyncrasies of language data. This evaluation not only directs future model improvements but also stresses the importance of choosing appropriate metrics to capture the true effectiveness of each model comprehensively.

5.7.3. Discussion

The SVM model exhibited the highest recall, indicating its effectiveness in identifying complicated sentences; however, its precision was notably lower for simple sentences, suggesting potential overfitting issues. The KNN model displayed more balanced but modest results in all metrics, reflecting its general applicability but limited ability to handle the linguistic nuances of the dataset.
These results underline the inherent challenges in applying machine learning to text simplification, especially in less-resourced languages like Greek. The models’ performance reflects the complexity of the task, compounded by the linguistic characteristics of Greek, which are not as thoroughly supported in natural language processing tools as more widely studied languages. This highlights a critical need for further development of language-specific tools and resources to improve the efficacy of such technologies in text simplification and other NLP applications.

6. Conclusions and Future Work

In conclusion, the creation of the Greek text simplification dataset marks a pivotal advancement in improving readability and accessibility for Greek-speaking populations. The dataset directly addresses linguistic intricacies inherent to Greek, such as its elaborate morphological structure and flexible syntactic ordering, and thereby provides the groundwork for customized text simplification solutions that are finely attuned to the nuances of the language.
The utility of this dataset extends beyond assisting individuals with limited literacy skills. It is equally beneficial for non-native speakers, people with cognitive disabilities, and anyone seeking to streamline their consumption of Greek language information. Consequently, the dataset promotes both accessibility and inclusivity, offering substantial resources for researchers and practitioners to assess and refine text simplification models, craft Greek-specific algorithms, and innovate new methods to enhance text accessibility.
Despite the strides made, this research has its limitations. The primary constraint lies in the dataset’s reliance on texts sourced exclusively from Greek Wikipedia, which may not fully represent the diversity of language used in various contexts like literature, legal texts, or informal communication. This reliance on a single source could affect the generalizability of the simplification models developed from this dataset. Furthermore, the current methodologies predominantly focus on syntactic simplification without extensively exploring semantic simplification, which is crucial for maintaining the meaning and context of more complex sentences. Lastly, the use of traditional machine learning models, rather than more advanced neural network architectures, might limit the potential accuracy and sophistication of the text simplification solutions [47]. Addressing these limitations in future work could significantly enhance the dataset’s utility and the effectiveness of the simplification tools derived from it.
Looking ahead, the field of Greek text simplification is ripe with opportunities for further research and development. Continued enhancement of simplification algorithms is crucial, leveraging user feedback and detailed performance metrics to improve the precision and efficacy of these tools. Such iterative refinements are essential to ensure that the simplification approaches remain effective and responsive to the specific needs of diverse user groups.
Broadening the scope of the dataset to encompass a wider array of text types and genres will significantly enhance the robustness and applicability of simplification techniques. A more comprehensive dataset facilitates the development of versatile tools that can operate effectively across various contexts, thereby broadening their potential impact.
Additionally, exploring advanced machine learning models, especially those based on the transformer architecture, will be a priority. Integrating models such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) could substantially advance our text simplification efforts by leveraging their superior contextual processing capabilities. This approach promises not only to enhance the accuracy of text classification but also to refine the modeling of complex linguistic structures in Greek, thereby improving the effectiveness of simplification algorithms.
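As one concrete starting point for this direction, the sketch below loads a publicly released Greek BERT checkpoint with a two-label classification head; the specific checkpoint is our suggestion, and the fine-tuning loop on the dataset is omitted.
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# 'nlpaueb/bert-base-greek-uncased-v1' is a published Greek BERT checkpoint;
# before fine-tuning, the two-label classification head is randomly initialized.
MODEL = "nlpaueb/bert-base-greek-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)

sentence = "Η πόλη ιδρύθηκε τον 8ο αιώνα π.Χ. από αποίκους της Ερέτριας."
inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print("complicated" if logits.argmax(-1).item() == 1 else "simple")
```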
In furtherance of this initiative, we aim to explore the capabilities of large language models such as ChatGPT (v3.5) to assess and demonstrate the practical applications of our dataset. We plan to conduct comprehensive evaluations of how well such models handle the intricacies of Greek text simplification. Given the significant computational resources required for such studies, we are considering collaborations with other research institutions to facilitate this advanced research. These future endeavors will help us harness the full potential of our dataset and demonstrate its applicability across various machine learning contexts.
Establishing robust mechanisms for user feedback and validation is vital for aligning simplification techniques with the actual requirements of end-users. A user-centric design approach ensures that the tools developed genuinely benefit those with reading challenges, enhancing practical usability and impact.
Fostering collaboration across disciplines—such as linguistics, computer science, and cognitive psychology—can lead to deeper insights and more innovative solutions. These collaborative efforts can address complex linguistic challenges and enhance our understanding of text processing and comprehension, paving the way for sophisticated applications in text simplification.
By embracing these initiatives, the domain of Greek text simplification can progress towards creating more advanced tools that cater to a broader spectrum of linguistic needs and promote inclusivity within the digital information landscape. Such advancements will further the utility of text simplification, making information more accessible and comprehensible for all, particularly within the Greek-speaking community.

Author Contributions

Conceptualization, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Methodology, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Software, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Validation, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Data curation, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Writing—original draft, L.A., A.A., X.K., A.M., I.T., D.M., K.L.K. and A.K.; Writing—review & editing, D.M., K.L.K. and A.K.; Supervision, D.M., K.L.K. and A.K.; Project administration, D.M., K.L.K. and A.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Santucci, V.; Santarelli, F.; Forti, L.; Spina, S. Automatic Classification of Text Complexity. Appl. Sci. 2020, 10, 7285. [Google Scholar] [CrossRef]
  2. Mouratidis, D.; Mathe, E.; Voutos, Y.; Stamou, K.; Kermanidis, K.L.; Mylonas, P.; Kanavos, A. Domain-Specific Term Extraction: A Case Study on Greek Maritime Legal Texts. In Proceedings of the 12th Hellenic Conference on Artificial Intelligence (SETN), Corfu, Greece, 7–9 September 2022; ACM: New York, NY, USA, 2022; pp. 1–6. [Google Scholar]
  3. Kanavos, A.; Theodoridis, E.; Tsakalidis, A.K. Extracting Knowledge from Web Search Engine Results. In Proceedings of the 24th International Conference on Tools with Artificial Intelligence (ICTAI), Athens, Greece, 7–9 November 2012; IEEE Computer Society: Washington, DC, USA, 2012; pp. 860–867. [Google Scholar]
  4. Vonitsanos, G.; Kanavos, A.; Mylonas, P. Decoding Gender on Social Networks: An In-depth Analysis of Language in Online Discussions Using Natural Language Processing and Machine Learning. In Proceedings of the IEEE International Conference on Big Data, Sorrento, Italy, 15–18 December 2023; pp. 4618–4625. [Google Scholar]
  5. Siddharthan, A.; Nenkova, A.; McKeown, K.R. Syntactic Simplification for Improving Content Selection in Multi-Document Summarization. In Proceedings of the 20th International Conference on Computational Linguistics (COLING), Geneva, Switzerland, 23–27 August 2004. [Google Scholar]
  6. Silveira, S.B.; Branco, A. Combining a double clustering approach with sentence simplification to produce highly informative multi-document summaries. In Proceedings of the 13th International Conference on Information Reuse & Integration (IRI), Las Vegas, NV, USA, 8–10 August 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 482–489. [Google Scholar]
  7. Narayan, S.; Gardent, C. Hybrid Simplification using Deep Semantics and Machine Translation. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL), Baltimore, MD, USA, 23–24 June 2014; The Association for Computer Linguistics: Cedarville, OH, USA, 2014; pp. 435–445. [Google Scholar]
  8. Qiang, J.; Zhang, F.; Li, Y.; Yuan, Y.; Zhu, Y.; Wu, X. Unsupervised Statistical Text Simplification using Pre-trained Language Modeling for Initialization. Front. Comput. Sci. 2023, 17, 171303. [Google Scholar] [CrossRef]
  9. Specia, L. Translating from Complex to Simplified Sentences. In Proceedings of the 9th International Conference on Computational Processing of the Portuguese Language (PROPOR), Porto Alegre, Brazil, 27–30 April 2010; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2010; Volume 6001, pp. 30–39. [Google Scholar]
  10. Wubben, S.; van den Bosch, A.; Krahmer, E. Sentence Simplification by Monolingual Machine Translation. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, Jeju Island, Republic of Korea, 8–14 July 2012; The Association for Computer Linguistics: Cedarville, OH, USA, 2012; pp. 1015–1024. [Google Scholar]
  11. Zhang, X.; Lapata, M. Sentence Simplification with Deep Reinforcement Learning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Copenhagen, Denmark, 9–11 September 2017; Association for Computational Linguistics: Cedarville, OH, USA, 2017; pp. 584–594. [Google Scholar]
  12. Evans, R.J. Comparing Methods for the Syntactic Simplification of Sentences in Information Extraction. Lit. Linguist. Comput. 2011, 26, 371–388. [Google Scholar] [CrossRef]
  13. Lu, X.; Qiang, J.; Li, Y.; Yuan, Y.; Zhu, Y. An Unsupervised Method for Building Sentence Simplification Corpora in Multiple Languages. arXiv 2021, arXiv:2109.00165. [Google Scholar]
  14. Newsela Data. Available online: https://newsela.com/data (accessed on 30 July 2024).
  15. Coster, W.; Kauchak, D. Simple English Wikipedia: A New Text Simplification Task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; The Association for Computer Linguistics: Cedarville, OH, USA, 2011; pp. 665–669. [Google Scholar]
  16. Vajjala, S.; Lucic, I. OneStopEnglish corpus: A new corpus for automatic readability assessment and text simplification. In Proceedings of the 13th Workshop on Innovative Use of NLP for Building Educational Applications@NAACL-HLT, New Orleans, LA, USA, 5 June 2018; Association for Computational Linguistics: Cedarville, OH, USA, 2018; pp. 297–304. [Google Scholar]
  17. Sun, R.; Jin, H.; Wan, X. Document-Level Text Simplification: Dataset, Criteria and Baseline. arXiv 2021, arXiv:2110.05071. [Google Scholar]
  18. Battisti, A.; Ebling, S. A Corpus for Automatic Readability Assessment and Text Simplification of German. arXiv 2019, arXiv:1909.09067. [Google Scholar]
  19. Klaper, D.; Ebling, S.; Volk, M. Building a German/Simple German Parallel Corpus for Automatic Text Simplification. In Proceedings of the 2nd Workshop on Predicting and Improving Text Readability for Target Reader Populations (PITR@ACL), Sofia, Bulgaria, 8 August 2013; Association for Computational Linguistics: Cedarville, OH, USA, 2013; pp. 11–19. [Google Scholar]
  20. Rios, A.; Spring, N.; Kew, T.; Kostrzewa, M.; Säuberli, A.; Müller, M.; Ebling, S. A New Dataset and Efficient Baselines for Document-level Text Simplification in German. In Proceedings of the 3rd Workshop on New Frontiers in Summarization, Hong Kong, China, 10 November 2021; Association for Computational Linguistics: Cedarville, OH, USA, 2021; pp. 152–161. [Google Scholar]
  21. Aluisio, S.; Specia, L.; Gasperin, C.; Scarton, C. Readability Assessment for Text Simplification. In Proceedings of the NAACL HLT 5th Workshop on Innovative Use of NLP for Building Educational Applications, Los Angeles, CA, USA, 5 June 2010; pp. 1–9. [Google Scholar]
  22. Saggion, H.; Stajner, S.; Bott, S.; Mille, S.; Rello, L.; Drndarevic, B. Making It Simplext: Implementation and Evaluation of a Text Simplification System for Spanish. ACM Trans. Access. Comput. 2015, 6, 1–36. [Google Scholar] [CrossRef]
  23. Brunato, D.; Dell’Orletta, F.; Venturi, G.; Montemagni, S. Design and Annotation of the First Italian Corpus for Text Simplification. In Proceedings of the 9th Linguistic Annotation Workshop (LAW@NAACL-HLT), Denver, CO, USA, 5 June 2015; The Association for Computer Linguistics: Cedarville, OH, USA, 2015; pp. 31–41. [Google Scholar]
  24. Gala, N.; Tack, A.; Javourey-Drevet, L.; François, T.; Ziegler, J.C. Alector: A Parallel Corpus of Simplified French Texts with Alignments of Misreadings by Poor and Dyslexic Readers. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC), Marseille, France, 11–16 May 2020; European Language Resources Association: Paris, France, 2020; pp. 1353–1361. [Google Scholar]
  25. Holmer, D.; Rennes, E. Constructing Pseudo-parallel Swedish Sentence Corpora for Automatic Text Simplification. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), Tórshavn, Faroe Islands, 22–24 May 2023; University of Tartu Library: Tartu, Estonia, 2023; pp. 113–123. [Google Scholar]
  26. Kostic, M.; Batanovic, V.; Nikolic, B. Monolingual, Multilingual and Cross-lingual Code Comment Classification. Eng. Appl. Artif. Intell. 2023, 124, 106485. [Google Scholar] [CrossRef]
  27. Den Bercken, L.V.; Sips, R.; Lofi, C. Evaluating Neural Text Simplification in the Medical Domain. In Proceedings of the World Wide Web Conference (WWW), San Francisco, CA, USA, 13–17 May 2019; ACM: New York, NY, USA, 2019; pp. 3286–3292. [Google Scholar]
  28. Shardlow, M. Out in the Open: Finding and Categorising Errors in the Lexical Simplification Pipeline. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC), Reykjavik, Iceland, 26–31 May 2014; European Language Resources Association (ELRA): Paris, France, 2014; pp. 1583–1590. [Google Scholar]
  29. Bott, S.; Rello, L.; Drndarevic, B.; Saggion, H. Can Spanish Be Simpler? LexSiS: Lexical Simplification for Spanish. In Proceedings of the 24th International Conference on Computational Linguistics (COLING), Mumbai, India, 8–15 December 2012; Indian Institute of Technology Bombay: Mumbai, India, 2012; pp. 357–374. [Google Scholar]
  30. Biran, O.; Brody, S.; Elhadad, N. Putting it Simply: A Context-Aware Approach to Lexical Simplification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; The Association for Computer Linguistics: Cedarville, OH, USA, 2011; pp. 496–501. [Google Scholar]
  31. Chandrasekar, R.; Srinivas, B. Automatic Induction of Rules for Text Simplification. Knowl. Based Syst. 1997, 10, 183–190. [Google Scholar] [CrossRef]
  32. Qiang, J.; Li, Y.; Zhu, Y.; Yuan, Y.; Shi, Y.; Wu, X. LSBert: Lexical Simplification Based on BERT. IEEE/ACM Trans. Audio Speech Lang. Process. 2021, 29, 3064–3076. [Google Scholar] [CrossRef]
  33. Siddharthan, A. Text Simplification using Typed Dependencies: A Comparison of the Robustness of Different Generation Strategies. In Proceedings of the 13th European Workshop on Natural Language Generation (ENLG), Nancy, France, 28–30 September 2011; The Association for Computer Linguistics: Cedarville, OH, USA, 2011; pp. 2–11. [Google Scholar]
  34. Siddharthan, A.; Mandya, A. Hybrid Text Simplification using Synchronous Dependency Grammars with Hand-written and Automatically Harvested Rules. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics (EACL), Gothenburg, Sweden, 26–30 April 2014; The Association for Computer Linguistics: Cedarville, OH, USA, 2014; pp. 722–731. [Google Scholar]
  35. Woodsend, K.; Lapata, M. Learning to Simplify Sentences with Quasi-Synchronous Grammar and Integer Programming. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Edinburgh, UK, 27–29 July 2011; ACL: Cedarville, OH, USA, 2011; pp. 409–420. [Google Scholar]
  36. Garbacea, C.; Guo, M.; Carton, S.; Mei, Q. An Empirical Study on Explainable Prediction of Text Complexity: Preliminaries for Text Simplification. arXiv 2020, arXiv:2007.15823v1. [Google Scholar]
  37. Wang, T.; Chen, P.; Amaral, K.M.; Qiang, J. An Experimental Study of LSTM Encoder-Decoder Model for Text Simplification. arXiv 2016, arXiv:1609.03663. [Google Scholar]
  38. Nisioi, S.; Stajner, S.; Ponzetto, S.P.; Dinu, L.P. Exploring Neural Text Simplification Models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, BC, Canada, 30 July–4 August 2017; Association for Computational Linguistics: Cedarville, OH, USA, 2017; pp. 85–91. [Google Scholar]
  39. Sulem, E.; Abend, O.; Rappoport, A. Simple and Effective Text Simplification Using Semantic and Neural Methods. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL), Melbourne, Australia, 15–20 July 2018; Association for Computational Linguistics: Cedarville, OH, USA, 2018; pp. 162–173. [Google Scholar]
  40. Arvan, M.; Pina, L.; Parde, N. Reproducibility of Exploring Neural Text Simplification Models: A Review. In Proceedings of the 15th International Conference on Natural Language Generation: Generation Challenges, virtual, 18–22 July 2022; Association for Computational Linguistics: Cedarville, OH, USA, 2022; pp. 62–70. [Google Scholar]
  41. Greek Wikipedia. Available online: https://en.wikipedia.org/wiki/Greek_Wikipedia (accessed on 30 July 2024).
  42. HiLab Greek Text Simplification Dataset. Available online: https://hilab.di.ionio.gr/wp-content/uploads/2024/07/HiLab_Greek_text_simplification_Wikipedia_Dataset.zip (accessed on 30 July 2024).
  43. Lee, R.S.T. Natural Language Processing—A Textbook with Python Implementation; Springer: Berlin/Heidelberg, Germany, 2024. [Google Scholar]
  44. Wagner, W. Steven Bird, Ewan Klein and Edward Loper: Natural Language Processing with Python, Analyzing Text with the Natural Language Toolkit—O’Reilly Media. Lang. Resour. Eval. 2010, 44, 421–424. [Google Scholar] [CrossRef]
  45. Honnibal, M.; Montani, I.; Landeghem, S.V.; Boyd, A. spaCy: Industrial-Strength Natural Language Processing in Python; Zenodo: Geneva, Switzerland, 2020. [Google Scholar]
  46. Al-Thanyyan, S.; Azmi, A.M. Automated Text Simplification: A Survey. ACM Comput. Surv. 2022, 54, 1–36. [Google Scholar] [CrossRef]
  47. Mouratidis, D.; Kermanidis, K.; Kanavos, A. Comparative Study of Recurrent and Dense Neural Networks for Classifying Maritime Terms. In Proceedings of the 14th International Conference on Information, Intelligence, Systems & Applications (IISA), Volos, Greece, 10–12 July 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 1–6. [Google Scholar]
Figure 1. Common words between complicated and simple sentences.
Figure 2. Top 25 most common words in ‘simple’ category.
Figure 3. Top 25 most common words in ‘complicated’ category.
Table 1. Comparison of metrics between complex and simplified sentences.

Metric                    Complex Sentences    Simplified Sentences
Average Sentence Length   20.42                13.65
Average Word Length       5.62                 5.56
Type–Token Ratio          0.93                 0.98
Sentence Length SD        11.45                7.43
Table 2. Top 10 most common words in the dataset.

Word (ENG)    Word (GR)    Frequency
was/were      ήταν         435
have/has      έχει         166
was born      γεννήθηκε    141
done          έγινε        139
had           είχε         130
years/year    χρόνια       74
name          όνομα        72
Table 3. Top 25 most common words in simple sentences.

Word (GR)       Word (ENG)
ήταν            was/were
έχει            have/has
γεννήθηκε       was born
έγινε           done
είχε            had
χρόνια          years/year
όνομα           name
ομάδα           team
πήρε            took
γιος            son
ξεκίνησε        started
έχουν           they have
βρίσκεται       is located
πόλη            city
περιοχή         location
έκανε           has done
χωριό           village
θέση            position
φορά            time
αιώνα           century
υπάρχουν        exist
εθνική          national
σπούδασε        studied
πανεπιστήμιο    university
μέρος           part
Table 4. Top 25 most common words in complicated sentences.

Word (GR)          Word (ENG)
ήταν, γεννήθηκε    was/were, was born
γιος               son
βρίσκεται          is located
είχε               had
έχει               have/has
όνομα              name
χρόνια             years/year
περιοχή            location
ξεκίνησε           started
έγινε              done
ιδρύθηκε           founded
σύμφωνα            according
ομάδα              team
χωριό              village
κόρης              daughter
αιώνα              century
πόλη               city
πανεπιστήμιο       university
οικογένεια         family
τμήμα              part/section
σπούδασε           studied
δούκα              duke
ηλικία             age
σχολή              school
Table 5. Average word count by category.

Category       Average Word Count
Complicated    22.35
Simple         14.70
Table 6. Mean sentence length by category and standard deviation.

Category       Mean Sentence Length    Standard Deviation
Complicated    132.92                  11.45
Simple         87.21                   7.43
Table 7. Mean word length by category.

Category       Mean Word Length
Complicated    5.62
Simple         5.56
Table 8. Punctuation distribution by category.

Category       Punctuation Type    Count
Complicated    Period              1394
Simple         Comma               1671
Table 9. Performance metrics for machine learning models.

Method         Accuracy    Precision    Recall    AUC     Class Precision (Simple/Complicated)
KNN            52.12%      55.57%       50.77%    0.50    48.84%/55.57%
Naive Bayes    43.87%      44.20%       20.39%    0.48    43.74%/44.26%
SVM            52.41%      52.91%       97.49%    0.90    24.63%/52.91%