Article

Exploring the Effectiveness of Shallow and L2 Learner-Suitable Textual Features for Supervised and Unsupervised Sentence-Based Readability Assessment

by Dimitris Kostadimas, Katia Lida Kermanidis *,† and Theodore Andronikos

Department of Informatics, Ionian University, 7 Tsirigoti Square, 49100 Corfu, Greece
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2024, 14(17), 7997; https://doi.org/10.3390/app14177997
Submission received: 29 June 2024 / Revised: 30 August 2024 / Accepted: 2 September 2024 / Published: 7 September 2024
(This article belongs to the Special Issue Knowledge and Data Engineering)

Abstract:
Simplicity in information found online is in demand from diverse user groups seeking better text comprehension and consumption of information in an easy and timely manner. Readability assessment, particularly at the sentence level, plays a vital role in aiding specific demographics, such as language learners. In this paper, we research model evaluation metrics, strategies for model creation, and the predictive capacity of features and feature sets in assessing readability based on sentence complexity. Our primary objective is to classify sentences as either simple or complex, shifting the focus from entire paragraphs or texts to individual sentences. We approach this challenge as both a classification and a clustering task. Additionally, we focus our tests on shallow features, which, despite their simple nature and ease of extraction, yield decent results. Leveraging the TextStat Python library and the WEKA toolkit, we employ a wide variety of shallow features and classifiers. By comparing the outcomes across different models, algorithms, and feature sets, we aim to offer valuable insights into optimizing the setup. We draw our data from sentences sourced from the corpus of Wikipedia, a widely accessed online encyclopedia catering to a broad audience. We strive to take a deeper look at what leads to better readability classification performance on datasets that appeal to audiences such as Wikipedia’s, assisting in the development of improved models and new features for future applications with low feature extraction/processing times.

1. Introduction

As globalization continues, an ever-increasing trend of immigration and cultural exchange is observed. Upon arriving in another country, immigrants have to overcome language barriers, along with many other challenges. The degree to which a reader finds a document difficult to understand is measured through readability assessment, a process supported by machine learning and linguistics. The automation of this process allows us to provide appropriate reading material that readers can understand regardless of their language proficiency while keeping the core information intact. When selecting reading material, the degree of difficulty in understanding is crucial. If the content is too sophisticated, the reader might give up on acquiring information or, even worse, be misinformed. This results in wasted valuable time and effort, especially in times of need.
Text readability assessment is an important part of text simplification. As such, machine learning (ML) techniques have been widely employed to offer solutions to the above problems, and they have been used in numerous Natural Language Processing (NLP) tasks, such as fake news detection [1], text categorization [2], translation evaluation [3], etc. These techniques are being used more and more in today’s technologies as we attempt to classify essays, paragraphs, or even sentences by levels of complexity. Through this classification, we can retrieve information at an appropriate complexity level for the target reader, ensuring that the text is neither too difficult to comprehend nor so easy that the reader loses focus. Given the profile of the user, content classified with a readability score can assist a plethora of technologies, such as web search engines, as well as other fields such as e-learning platforms, news reporting, online libraries, legal documentation, and many more.
In a wide variety of previous papers, the models tended to focus on specific target demographics. The current work utilizes the Wikipedia dataset, which consists of sentences from this online encyclopedia that are made for everyone and are freely accessible to a wide audience. However, as mentioned in the previous paragraph, static content cannot satisfy everyone’s needs. In this paper, the focus is on binary readability assessment of sentences rather than paragraphs or whole texts. We begin by reviewing the results, effective methods, and features used in influential papers in the literature. Subsequently, we introduce our own feature sets, features, methods, and approaches to readability assessment of the selected corpus.
Performing this task at the sentence level has numerous practical applications, including improving educational materials by ensuring that each sentence is accessible to students with varying reading levels; simplifying technical documentation, especially when the text volume is small; and refining legal documents to eliminate ambiguity. Other examples include customer support scripts for clear and effective troubleshooting, or even in healthcare, where clear and concise sentences in patient information leaflets and consent forms are essential. These applications demonstrate the importance of sentence-level readability in ensuring effective communication across diverse contexts. Compared to paragraph-level or whole-text assessments, sentence-level assessments allow us to acquire feedback on the overall organization, flow, and general complexity of paragraphs while obtaining specific results, which is essential for precise revisions and improvements.
L2 reader-related real-life applications might include language learning apps, customized educational content, and adaptive testing platforms. This paper evaluates the effectiveness of shallow features, particularly in texts designed to aid L2 readers’ comprehension. Due to computational constraints, tool limitations, and in line with our hypothesis, our feature set primarily comprises shallow features. Shallow characteristics refer to the more surface-level aspects of language, such as basic vocabulary, simple sentence structures, and fundamental grammar rules. These features are important because they form the foundation upon which more complex language skills are built and later acquired by L2 readers.
While these features are often considered basic, our goal is to explore their potential to produce promising results, possibly rivaling or surpassing the efficacy of more complex linguistic features while requiring less computational power. We contend that a combination of selected shallow features and formulas, preceded by a thorough dataset analysis to identify the most suitable features for categorization, can lead to high accuracy in our classification and clustering tasks.
Our work aims to highlight the synergistic combination of the classification and clustering algorithms with the proposed feature groups, as well as explore the feasibility and initial performance of the proposed approach, using a limited set of shallow features, algorithms, and a small dataset due to certain constraints. The main goal is exploratory, so we experimented with a large number of parameter settings, algorithms, and feature sets in order to identify valuable insights concerning the factors that most affect performance for the task at hand.

1.1. Contributions

The novelty of the present study, in comparison to previous works, lies primarily in its attempt to assess the readability of individual sentences instead of whole paragraphs or texts, which can prove very useful in practice. We approach the problem using classification and clustering algorithms and compare their performance, showcasing both supervised and unsupervised learning results. Our work not only involves the development of these algorithms but also focuses on how to effectively utilize diverse feature sets to enhance performance. Another novelty is that we make use of a wide variety of shallow features and even employ unusual ones like the McAlpine EFLAW and other L2 learner-suitable textual features that we think fit Simple Wikipedia’s nature and target audience. This effort is supported by a plethora of experimental tests with different feature sets and parameterized classifiers in search of an optimal setup. Additionally, we run extensive tests on the K-Nearest Neighbors and Random Forest classifiers. Finally, we provide insightful explanations of the obtained experimental results in relation to the nature of the dataset.

1.2. Organization

The structure of this paper is as follows. Section 1 provides an introduction to the subject along with relevant references. Section 2 presents a succinct but comprehensive review of the most relevant and recent works in the literature. Section 3 gives a brief overview of some of the most prominent readability assessment features. Section 4 is the most important section of this paper because it includes our experimental tests, the results obtained, and comments explaining their importance. It starts with an overview of the created dataset in Section 4.1, followed by a thorough explanation of our methodological approach in Section 4.2. Then, in Section 4.3 and Section 4.4, we provide the results from our tests using classification and clustering algorithms, respectively, and attempt to explain their performance. Finally, Section 5 summarizes the conclusions derived from the preceding analysis and outlines prospects for future work.

2. A Comprehensive Overview of the Literature

Most papers in the relevant literature study reader target groups and readability assessment on entire corpora or texts. The novelty of this work is twofold: first, we focus on sentences, which are used as building blocks for paragraphs or whole texts/essays; and, secondly, we target a much broader audience. By reviewing the existing literature, we aim to shed light on the key concepts, debates, and gaps in knowledge surrounding this subject in order to improve upon them. This review serves as a foundation for our research and offers a synthesis of the diverse perspectives and methodologies that have shaped our own approach. We also provide a quick overview of the most important reviewed articles in Table 1.
In their 2013 article [4], Zhang, Liu, and Ni examined readability assessment features for L2 readers. The term L2 (second language) readers refers to those learning a second language: for example, immigrants, refugees, or tourists who are attempting to fit into a new environment and understand a new language. L2 learners have much to gain by acquiring reading material that aligns with their language proficiency, as previously discussed. The dataset in the aforementioned paper consisted of 15 news texts taken from the Reuters news agency website, each containing four to seven sentences. Fifty-eight information technology freshmen with English as their second language rated the texts based on reading difficulty, and the mean rating was calculated for each text. The researchers used the Coh-Metrix NLP tool [10] to analyze the correlation coefficients of the reading difficulty features introduced by each text. Their results suggested that long and/or complex sentences, uncommon words, and complex syntactic structures lower readability. Pronoun density can cause coherence issues, while connectives usually do not affect L2 readers. Content word and stem overlap can confuse readers. The paper offered valuable insights, but the limited participant pool of IT freshmen means the findings may not generalize well to other L2 learner groups.
The overall conclusion was that some features usually relevant to native speakers are irrelevant to L2 and vice versa. Moreover, the number of words is a very important feature. Syntactic complexity also plays a significant role (e.g., number of modifiers per noun phrase). Adjective incidence and stem overlap for words with a common lemma are confusing. Third-person singular pronouns also confuse L2 readers, since they are ambiguous.
Another influential work is [5], which includes part of the data also analyzed more extensively in Feng Lijun’s thesis [11], offering significant insights into a variety of features and corpora. In [5], the dataset consisted of 1433 articles from the Weekly Reader magazine for elementary students. These articles were graded from 2 to 5 based on the minimum grade a student should have in order to understand the articles. The study used a wide variety of features from different families, categorizing them into entity density, lexical chain, coreference inference, and entity grid. They performed tests with 10-fold cross-validation and 10 iterations of the test set to identify the most efficient group and concluded that traditional measures are efficient but not always reliable. Entity-density features outperformed other feature sets, with sentence length being the most efficient shallow feature. Noun-based POS features have strong predictive power, while discourse features are not particularly effective for readability.
In 2010, Feng Lijun [11] used articles from LocalNews2007, LocalNews2008, NewYorkTimes100, Britannica, and LiteracyNet, along with the mentioned corpus. The study compared results across different datasets, feature sets, classifiers, and previous work. Useful results tables and graphs, as well as a rhetorical narrative on performance, were provided.
Many of these studies treat readability assessment as a classification task. We also tackle the problem with a classification approach, but instead of grade levels, we use two classes: simple and complex. In these works, traditional features prove effective, and machine learning packages such as LIBSVM and the WEKA machine learning toolkit [12] contribute to these strong results. We included WEKA in our research as well.
In [6] (written in 2014), the researchers highlighted the importance of text readability assessment in clinical applications and healthcare in general. In their paper, they focused on the topic of estimating text difficulty and identifying the most crucial features contributing to it, which is something we also attempt to do but for a different target audience. They also used the Wikipedia and Simple Wikipedia corpus datasets, which we utilize as well, and examined the performance of selected classifiers and their 16 proposed features. After feature extraction, it was clear that character and word counts were significantly higher in difficult texts, which indicated their significance for the task. Among the POS features, nouns were found to be more useful compared to others. Finally, vocabulary was considered generally insignificant. Their findings also support the conclusions of other reviewed papers and show that shallow features are quite effective.
The authors also presented results from tests using several machine learning algorithms that made use of the above feature collection. Their experimental setup used 10-fold cross-validation to randomly split the entire dataset, which consisted of 11,800 examples. The algorithm that achieved the highest classification accuracy was Random Forest, with an accuracy of over 74%, followed by Decision Trees and Linear Regression. The least effective algorithms were K-Nearest Neighbors and Naive Bayes, with accuracies below 64%. The tests showed that shallow and part-of-speech features yielded the best results for this kind of dataset and were on par with, or even better than, features from other groups.
In a 2024 paper, the authors of [7] developed a novel multilingual model to assess the readability of Wikipedia articles, extending the evaluation beyond English to 14 languages. Their model demonstrated high accuracy in ranking readability in zero-shot scenarios, with accuracy usually over 80%. An interesting aspect of this paper is that the authors used both a text-based model (TRank) and a sentence-based model (SRank).
Their solution is built upon a Neural Pairwise Ranking Model (NPRM) that uses a multilingual Masked Language Model (MLM), similar to mBERT. The model is trained with a Siamese network architecture, and then Margin Ranking Loss is applied. To evaluate their model, they set baselines representative of the most common approaches, such as the number of sentences, Flesch reading ease, linguistic features, language-agnostic features, and a classification-based approach instead of a ranking-based one. After extensive tests, TRank significantly outperformed SRank as well as all the baselines. The authors also highlighted that Flesch reading ease performs well in English but not in other languages. Their results also highlighted that the sentence-based approach makes classification harder.
On the matter of ranking compared to the classification approach, the authors of [8] developed a system using an SVM to assess text readability for both native speakers and L2 learners. They achieved an accuracy of 0.803, a Pearson Correlation Coefficient (PCC) of 0.900 for native data, and an accuracy of 0.785 with a PCC of 0.924 for L2 data. They compared the effectiveness of ranking and classification models, favoring ranking for its better performance with novel datasets. The study also explored methods like domain adaptation and self-training, which significantly improved accuracy in estimating Common European Framework of Reference (CEFR) levels and readability, especially when L2 data were limited. However, the focus on SVM models may overlook the potential of other machine learning algorithms.
In [9,13], the authors set a goal to construct a readability model that applies specifically to sentences instead of documents and that works well and offers portability across different genres of corpora. They also used the Simple Wikipedia dataset along with others while utilizing tools like Coh-Metrix. Specifically using their WeeBit-trained model on the OneStopEnglish corpus, they performed a classification of sentences that were manually aligned in order to classify them into three different groups; the accuracy of this model was 78%. They also used a dataset of 100 K sentence pairs from Wikipedia and Simple Wikipedia, from which they extracted 151 features and attempted binary classification. The best accuracy they achieved in this task, after multiple tests with different feature subsets and with varying training set sizes, was 66% using SMO as the classifier and 10-fold cross-validation. They also compared the performance of sentence-level models to that of document-level models and concluded that sentence-level models on the sentence task are significantly less effective than document-level models on the document task, which validates the complexity of this problem and our venture. Investigating the long-term applicability and scalability of such models would be beneficial.
In a study published in the MDPI Analytics journal in 2023 [14], regarding the creation of readability models and features, the researchers aimed to develop a universal readability index applicable to different alphabetical languages, which takes into account readers’ short-term memory processing capacity. The study observed that existing readability formulas did not consider the number of words between punctuation marks. A custom formula was proposed that could potentially provide a synthetic measure of human reading difficulty. This further supports the idea that studying the effectiveness of features and their nature could further hint at improvements and lead to the creation of new features, even based on older ones. On the other hand, regarding this specific formula, the focus on short-term memory processing capacity as a key factor might overshadow other important factors such as prior knowledge, motivation, and special reading strategies. Additionally, since there is a lack of empirical validation across diverse reading populations and languages, future iterations might consider testing this. Providing results and further testing the simplicity of use, as well as the accuracy across a wide dataset of different texts in various languages, is important, especially since this single formula claims to be applicable to any alphabetical language.
Other important papers offering useful information about readability measures, feature categories, models, and insights concerning the Wikipedia corpus, but not necessarily with sentence-level targets in mind, can be found in [15,16,17,18]. Different approaches to text simplification and readability assessment are thoroughly discussed in [19]. Concerning the clustering approach, the authors of [20], published in 2017, showcased a clustering-based language model using word embeddings to assess readability in the Simple Wikipedia set.
There are also researchers, such as the authors of [21] (2021), [22] (2019), and [23] (2012), who question whether Wikipedia and Simple Wikipedia articles are easily comprehensible by their audience. In [22], the authors raised questions concerning the authorship of Simple Wikipedia and later used a variety of features, including shallow ones, to compare the values extracted from Simple Wikipedia articles to those extracted from standard Wikipedia articles. Nevertheless, it was found that sentences from Simple Wikipedia were, on average, less complex than those in standard Wikipedia, with sentence and word lengths playing a significant role in the shallow feature category.
In [21], published in 2021, the authors combined traditional readability metrics with new comprehension parameters to classify articles into two groups: incomprehensible and comprehensible. They suggested that the reader’s motivation for reading, need for information, and prior knowledge of the article’s theme could also play a role in whether the article should be classified as comprehensible. They focused on the missing pieces of information in articles (knowledge gaps) that can make the reader lose traction, affecting readability. A thorough comparison of the average values of shallow and POS features extracted from Simple Wikipedia articles with those extracted from standard Wikipedia articles can be found in [23]; these were in turn compared with the average feature values extracted from the Britannica Encyclopedia to see whether there really is an improvement in readability between the Wikipedia versions and encyclopedias. Based on the features the researchers selected, Simple Wikipedia appears generally more readable than standard Wikipedia, although the authors stated that “Wikipedia seems to lag behind the other encyclopedias”.

3. Features for Readability Assessment

3.1. Shallow Features

To approach the problem of readability assessment, one must first introduce a number of features that are correlated with the difficulty of the text. Subsequently, using ML algorithms and tools, one can classify the text as complex or simple. There are several types of features and formulas that are of assistance in this matter; however, as we previously mentioned, this work focuses on shallow features.
These features employ traditional readability metrics, such as word count or the Flesch–Kincaid grade level, and, in general, require much less computational power for their extraction. They are limited to the text’s surface-level attributes, which is why they are referred to as shallow. There is limited knowledge about their predictive power in readability assessment [5,11]. Many shallow features are self-explanatory, like reading time, polysyllabic/monosyllabic count, word count, mini-word count, and more. Formulas like the Flesch reading ease, Flesch–Kincaid grade level, Coleman–Liau index, and Automated Readability Index also belong in this group since they essentially take lower-level shallow characteristics into account and combine them to obtain a better measure of the readability of a text. These formulas consider the total number of characters, words, sentences, and syllables in the text to generate a value representing readability. In this section, we discuss the nature of shallow features and showcase, in more detail, some unusual and more algorithmic ones used in our tests.
We use many features that take into account sentence and word lengths. These parameters are common base features employed by multiple formulas. The exact formulas for the features used can be found in [24], which provides further links and references. The rationale behind these formulas is that creating rules that take into account such variables helps achieve a more precise result in readability than plainly using a single variable at a time.
There are some other formulas/features that introduce a complexity variable, such as the SMOG Index and the Gunning FOG formula. In these cases, a word is considered complex or polysyllabic if it has three or more syllables; thus, when such words appear frequently in a text, these features output higher complexity. They were created with the rationale that longer, more sophisticated words tend to confuse L2 readers. Apart from this, there are features that base their results on the school grade a student should be at in order to understand the text. Spache’s formula is suggested for texts intended for students below the fourth U.S. school grade, while for older audiences, the Dale–Chall formula is more appropriate.
An example of a metric that returns a value based on the grade level is the Linsear Write Formula. The difference from the aforementioned ones is that it focuses on a more advanced audience. It was reportedly created for the U.S. Air Force to help calculate the readability of the army’s technical manuals. It considers simple variables like sentence length, easy words (fewer than three syllables), and difficult words (three or more syllables), but it is more algorithmic.
Linsear Write Metric:
(easy words × 1 + difficult words × 3) / (total number of sentences)
  • If the resulting value is greater than 20, divide it by 2 to obtain the final value;
  • If the resulting value is 20 or less, divide it by 2 and then subtract 1 to obtain the final value.
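For illustration, the following minimal Python sketch reproduces the computation described above; the syllable counter is a rough heuristic stand-in rather than the counter used by TextStat.

```python
import re

def naive_syllable_count(word: str) -> int:
    # Rough vowel-group heuristic; a stand-in for a proper syllable counter.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def linsear_write(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    easy = sum(1 for w in words if naive_syllable_count(w) < 3)   # easy words
    difficult = len(words) - easy                                 # difficult words
    provisional = (easy * 1 + difficult * 3) / sentences
    # Halve the provisional value; subtract 1 as well when it does not exceed 20.
    return provisional / 2 if provisional > 20 else provisional / 2 - 1
```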
Focusing on assisting L2 and international readers in their attempt to learn and understand a new language, the features must be specially modified to fit this audience’s requirements. In our paper, the term L2-suitable features refers to features that have been identified in the literature as suitable for defining the complexity of content targeted at L2 learners (e.g., number of words).
Rachel McAlpine introduced a special formula called McAlpine EFLAW that aims to make texts easy for EFL (English as a Foreign Language) learners to read and understand, eventually enabling successful communication with people all over the world (more information can be found in [25]). This is a feature that can be identified as L2-suitable. The formula takes into account mini-words (words with three or fewer characters, such as let, get, the, and do) because wordy cliches, colloquial expressions, and phrasal verbs confuse L2 learners. Through mini-words, the formula addresses many “flaws” that can make texts for EFL readers challenging. Typical examples include numbering (e.g., dates), idioms, and ambiguities (especially those caused by pronouns, as also seen in [4]). The McAlpine EFLAW formula suggests that reducing the number of mini-words can significantly improve readability for L2 learners.
McAlpine EFLAW:
(number of words + number of mini-words) / (total number of sentences).
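A corresponding sketch for the McAlpine EFLAW score is given below, again with simple tokenization; mini-words are counted as words of three or fewer characters, as defined above.

```python
import re

def mcalpine_eflaw(text: str) -> float:
    words = re.findall(r"[A-Za-z']+", text)
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    mini_words = sum(1 for w in words if len(w) <= 3)
    # Lower scores indicate text that is easier for EFL readers.
    return (len(words) + mini_words) / sentences
```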
Some of the previous formulas make use of the total number of sentences as a variable. We should clarify that in our case, since we use them on single sentences rather than texts, most of the time this variable will have a value of 1. This will generally affect the expected range of the final output, but it will still return relevant results that help in the classification process. Most of these formulas were created for general text readability assessment rather than single sentences. There are not yet any widely applied formulas for readability assessment of sentences. The majority of shallow features and formulas return values that are numerical rather than nominal. Even though their concept is quite simple, it is worth experimenting with the above features, especially at an academic level. Further explanation and a thorough review of traditional readability metrics can be found in [11], where it is also mentioned that, even though shallow features might actually work perfectly fine, there are certain cases in which sentences with small lengths can actually translate to higher text difficulty, as seen in [26]. The general consensus is that, even though shallow features might be good enough in certain scenarios, their performance is unreliable [11,27,28].

3.2. Linguistic Features

Apart from shallow features, there are other feature groups, such as discourse and linguistic features, with more complicated rules. The group of linguistic features contains sub-groups, such as syntactic, part-of-speech, grammatical, lexical, and many other feature sub-groups. Characteristics include noun phrases, the number of adjectives, adverbs, lexical chains, and entity density.
In this paper, we assess the effectiveness of shallow features, especially in texts designed to assist L2 readers in obtaining a better understanding. Due to computing power and tool limitations, and also because of our hypothesis, our feature set consists mostly of features that belong to the shallow set. Even though these types of features are often considered basic, we want to explore their effectiveness and determine whether they can offer promising results and approaches or even surpass the effectiveness of more sophisticated features (like linguistic ones) by comparing them to known results from the literature. We believe that combining certain shallow features and formulas, along with a thorough analysis of the dataset beforehand in order to select the features that best categorize our sentences into classification groups, enables us to achieve high accuracy.

4. Methodology and Experimental Results

This section is organized into several detailed subsections to clearly present our findings. It starts with an overview of the dataset we used, followed by a description of the methodology, including how the dataset was processed; what features, algorithms, and tools we used; and how the experiments were carried out. The two final subsections then present the results from the classification and clustering algorithms.

4.1. The Dataset

In our effort to classify sentences as either simple or complex (also referred to as sentence-level binary readability classification), we made use of the WikiSimpleWiki dataset. The dataset is available at [29], and the process through which it was created is described in [30]. The initial set (version 1.0) contained 137 K aligned sentence pairs from the Wikipedia (the free encyclopedia) corpus [31]. We chose sentences from articles that have both a simple and an original version and which, when aligned, had a similarity above 0.50 (a 50% similarity threshold). The updated version contained 167 K sentences. The alignment was achieved by pairing articles from the official English Wikipedia corpus with those from Simple Wikipedia.
To ensure that every paragraph of a simple article was correctly aligned with that of an original article, a TF-IDF cosine similarity measurement was used, and paragraphs with a similarity below 0.50 were considered unrelated. Finally, 75 K aligned paragraphs were retrieved from the 10 K article pairings, for a total of 137K aligned sentence pairings. So, there was a hierarchy from articles to paragraphs to sentences to words. The techniques used to bisect and divide the text are analyzed in [32]. The alignment results were also evaluated by humans to check the accuracy, and out of a sample of 100 random sentence pairs, the evaluators identified 91% of them as being correctly aligned [30].
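To illustrate the alignment criterion, the following sketch applies the same TF-IDF cosine similarity test with a 0.50 threshold using scikit-learn; it is a simplified reconstruction of the procedure in [30], and the function and variable names are ours.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def align_paragraphs(simple_paragraphs, original_paragraphs, threshold=0.50):
    """Return (simple_idx, original_idx) pairs whose TF-IDF cosine similarity
    reaches the threshold; pairs below it are treated as unrelated."""
    vectorizer = TfidfVectorizer()
    vectors = vectorizer.fit_transform(simple_paragraphs + original_paragraphs)
    simple_vecs = vectors[: len(simple_paragraphs)]
    original_vecs = vectors[len(simple_paragraphs):]
    similarities = cosine_similarity(simple_vecs, original_vecs)
    pairs = []
    for i, row in enumerate(similarities):
        j = row.argmax()
        if row[j] >= threshold:
            pairs.append((i, j))
    return pairs
```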
The datasets were split into two text files, with one containing the simple sentences and the other the original ones. In our case, we used a subset of the aforementioned version 1.0 dataset. Specifically, we selected 4 K sentences from the simple set and another 4 K sentences from the complex/original set, leading to a total of 8K sentences.

4.2. Methodology

Our goal was to separate, i.e., classify, complex sentences from simple ones. We approached the problem as both a classification and a clustering task. To do this, we extracted certain features from our dataset that could help classify the sentences into one of the two classes. Figure 1 clearly illustrates each step of our methodology and explains the connections between them.
As previously mentioned, this paper focuses on the effect of shallow characteristics on low-level readers. For this purpose, we used the TextStat Python library [33], which enables the calculation of statistics from text and assists in determining readability, complexity, and grade level. This tool extracts a variety of shallow features and formulas while being flexible in terms of its input by allowing multiple sentences at a time to be analyzed individually. We also experimented with the well-known Coh-Metrix web tool [10,34], but it appeared to be inappropriate for batch analysis of multiple individual sentences at the time of access. In addition to Coh-Metrix, we ran some tests using the TAACO [35] and TAASSC [36] feature extraction tools. These are two useful tools for POS feature extraction that use CoreNLP, but they did not provide a viable solution for batch analysis of individual sentences for the task at hand.
Making use of TextStat, we created two separate Python scripts. The first Python script performs feature extraction. It takes as input the desired dataset (either simple or complex) and exports a correctly formatted CSV file with the feature values of each sentence in the dataset. The second script takes the exported CSV files and merges them while shuffling the complex and simple sentences, creating a CSV with 8001 total rows, with the columns representing the features/characteristics of each sentence, and the final column representing the complexity.
Taking into account the reviewed literature on the most effective features for the target audience, along with the capabilities of the software used, we decided to employ the following list of features in our lab tests:
  • Flesch reading ease;
  • Flesch–Kincaid grade level;
  • Automated Readability Index;
  • Gunning Fog;
  • McAlpine EFLAW;
  • Linsear Write Formula;
  • Dale–Chall readability score;
  • Coleman–Liau index;
  • Spache readability;
  • Syllable count;
  • Monosyllable count;
  • Polysyllable count;
  • Word count;
  • Mini-word count;
  • Difficult words;
  • Reading time;
  • School grade.
We further discuss these features in Section 3. Many of them are self-explanatory since shallow features are simple and consider surface-level characteristics. Our tool’s unique feature, “School Grade”, estimates the grade level needed to understand a sentence based on various readability formulas (more information can be found in [24]). To clarify, since some of the above formulas take the number of sentences into account, their values for single sentences may differ significantly from those obtained when assessing whole texts. Even so, they still serve their purpose, since their values simply tend to fall into different ranges per classification group (either complex or simple). We also could not use the SMOG index for this reason: in TextStat, inputs of fewer than 30 sentences are statistically invalid because the SMOG formula was normed on 30-sentence samples.
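A condensed sketch of the first extraction script described above is shown below. The function names follow recent TextStat releases and may differ slightly between versions; since we are not aware of a dedicated TextStat helper for the mini-word count, it is computed manually here.

```python
import csv
import textstat

# Feature name -> TextStat call (names as exposed by recent TextStat versions).
FEATURES = {
    "flesch_reading_ease": textstat.flesch_reading_ease,
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade,
    "automated_readability_index": textstat.automated_readability_index,
    "gunning_fog": textstat.gunning_fog,
    "mcalpine_eflaw": textstat.mcalpine_eflaw,
    "linsear_write_formula": textstat.linsear_write_formula,
    "dale_chall_readability_score": textstat.dale_chall_readability_score,
    "coleman_liau_index": textstat.coleman_liau_index,
    "spache_readability": textstat.spache_readability,
    "syllable_count": textstat.syllable_count,
    "monosyllable_count": textstat.monosyllabcount,
    "polysyllable_count": textstat.polysyllabcount,
    "word_count": textstat.lexicon_count,
    "difficult_words": textstat.difficult_words,
    "reading_time": textstat.reading_time,
    "school_grade": textstat.text_standard,
}

def extract_to_csv(sentences, label, path):
    """Write one row of shallow feature values per sentence plus the class label."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(list(FEATURES) + ["mini_word_count", "complexity"])
        for sentence in sentences:
            row = [fn(sentence) for fn in FEATURES.values()]
            mini_words = sum(1 for w in sentence.split() if len(w.strip(".,;:!?")) <= 3)
            writer.writerow(row + [mini_words, label])
```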
To run our lab tests, we used the Waikato Environment for Knowledge Analysis (WEKA) [12]. WEKA is a collection of machine learning algorithms that also provides tools for data preparation, classification, regression, clustering, and visualization. In our tests, we ran multiple algorithms on our dataset using our extracted features, seeking those that would return the best percentage of correctly classified examples. In addition, we introduced several feature groups to examine whether certain combinations of features and their multitude offered better or worse results. This is supervised learning, and the ultimate task is to classify our dataset sentences as either complex or simple. We also used 10-fold cross-validation. As a baseline test, we ran the ZeroR classifier using all the features, which achieved 50% accuracy.
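Although our experiments were run in WEKA, the baseline setup can be approximated in Python as follows; DummyClassifier with the most-frequent strategy plays the role of ZeroR, and the file and column names are illustrative.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("merged_features.csv")                  # illustrative file name
X = data.drop(columns=["complexity", "school_grade"])      # drop label and the nominal feature
y = data["complexity"]

# ZeroR always predicts the majority class; with a balanced 4K/4K dataset,
# 10-fold cross-validation yields the 50% accuracy used as our baseline.
baseline = DummyClassifier(strategy="most_frequent")
scores = cross_val_score(baseline, X, y, cv=10, scoring="accuracy")
print(f"Baseline accuracy: {scores.mean():.3f}")
```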

4.3. Results for Classification Algorithms

In this subsection, we present our results for the search for the optimal configuration. We ran tests for different feature sets using the following algorithms:
  • Naive Bayes;
  • J48 Decision Tree (also known as C4.5);
  • K-Nearest Neighbors;
  • Sequential Minimal Optimization;
  • Random Forest;
  • Decision Table;
  • Simple Logistic.
We ran each of the above algorithms multiple times, modifying the parameters to attain optimal results, and then performed a comparative analysis to determine which was the most efficient. The first test used all the features previously introduced to distinguish between simple and complex sentences. Table 2 shows the results of our first test. The first column represents the algorithms/classifiers used and their parameters, where CCI stands for correctly classified instances. The top three test results are highlighted in three different shades of blue: the more vibrant shade represents the top result, and the lighter shades represent the next two best results. The worst result is highlighted in red.
The best algorithm in this case appeared to be Random Forest, with a bag size percentage equal to 1% and 10,000 iterations, achieving 60.5625% accuracy. We explain more about its parameterization later in this paper. It was followed by Naive Bayes with supervised discretization, achieving 59.9875%, and KNN (using the 3501 nearest neighbors, almost half the dataset’s size), which achieved the same result. In NB, each feature is assumed to be conditionally independent of the other features. We believe that the reason for the aforementioned accuracy value is the size of the dataset and the fact that the likelihood was evenly distributed across our feature set; our results throughout the tests may support this.
J48 also performed well and managed to increase accuracy slightly. By setting the confidence factor for pruning extremely low, to just 0.05, we obtained slightly better results (almost a 2% increase). By reducing the confidence factor, we also reduced the depth of the J48 tree through pruning. As expected, adding the unpruned option (-U) made things worse, as the wider and deeper tree caused an overfitting effect. Overfitting happens when the model is unable to generalize and instead fits too closely to the training dataset. We keep in mind that outliers in our 8 K dataset could also have played a role.
The worst-performing algorithm was KNN with exactly one neighbor, underperforming significantly compared to the others. It was common across all our tests that KNN with fewer neighbors performed worse on this dataset. We conducted further tests to determine the number of neighbors that would give us the highest accuracy. Other than that, it seems that all algorithms performed similarly, which is quite surprising.
Subsequently, we closely examined the obtained feature extraction results to identify the most prominent ones. This allowed us to run tests using only the five most useful features for the classification task to see if performance could be improved. Initially, we measured the average/mean values of each feature extracted from our dataset per classification group (complex and simple). For example, complex sentences have an average value of 2.88575 for the polysyllable count feature, while simple sentences have an average of 2.1705. Then, we measured the difference in the averages between the complex and simple sentences. Another example is the mean value of the Flesch–Kincaid grade level, which was 8.31635 for simple sentences and 10.195225 for complex sentences. The difference between the averages in this case was 1.9.
We then converted these difference values into percentages, as shown in Figure 2, in order to have a reading that can be interpreted across all features. Not all features returned values belonging within the same range, so converting to percentages helps make comparisons more meaningful. The figure shows the difference in the mean/average values of each feature between the complex and simple sentences. These values were sorted from top to bottom based on the percentage difference in the mean values between the two classification groups.
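The per-class comparison can be reproduced with pandas along the following lines; the file name, the class label values, and the normalization used for the percentages (relative to the simple-class mean) are illustrative choices.

```python
import pandas as pd

data = pd.read_csv("merged_features.csv")           # illustrative file name
numeric = data.drop(columns=["school_grade"])       # keep numeric features only

# Mean feature value per classification group (complex vs. simple).
means = numeric.groupby("complexity").mean()
diff = (means.loc["complex"] - means.loc["simple"]).abs()

# Express each difference as a percentage so features with different value
# ranges can be compared on a single scale.
percent_diff = diff / means.loc["simple"].abs() * 100
print(percent_diff.sort_values(ascending=False))
```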
The differences in the values per classification group for the features through the WEKA interface are shown in Figure 3. In this figure, the X-axis represents the values that each specific feature takes, while the Y-axis (colored) is the classification group. The five features we selected were the polysyllable count, difficult words, Linsear Write Formula, Automated Readability Index, and Flesch–Kincaid grade level. Based on the visualizations and averages, we can clearly observe that complex sentences introduce difficult words with many more syllables compared to simple ones. Since the Automated Readability Index, Flesch–Kincaid grade level, and Linsear Write Formula take into account the number of characters and/or syllables, it is not surprising that their average differences between the simple and complex sets were slightly higher, rendering them appropriate for this task.
We also examined the Linear Regression model considering all the features but excluding the school grade (which is a feature specific to TextStat that returns a nominal value; our only feature that returns a nominal output). For this, the complexity had to be converted to a numeric value, so we used 1 for complex and 0 for simple sentences. The regression was performed using the greedy method and unfortunately returned a 94.3797% relative absolute error.
Linear Regression model:
Complexity = 0.0038 × gunning fog + 0.0121 × coleman liau index − 0.0135 × automated readability index + 0.0246 × dale chall readability score + 0.012 × mcalpine eflaw + 0.1361 × reading time − 0.0161 × monosyllab count − 0.0117 × difficult words − 0.0126
We ran the same tests as before, this time using the handpicked feature set we discussed previously, obtaining the results presented in Table 3. The best algorithm in this case was KNN with 551 neighbors, while the variant with 3501 neighbors performed equally well. The lazy approach with narrow samples did not seem to work until now; widening the number of neighbors significantly increased the accuracy, even in this feature set.
The Naive Bayes variants, including NB-K and NB-D, achieved competitive accuracies of 59.675% and 59.7%, respectively. The worst-performing algorithm once again was KNN with a single neighbor. Naive Bayes generally performed well across tests when these specific parameters were applied.
Supervised discretization helped to create “bags” containing a range of values rather than multiple discrete ones. Generally, kernel density estimation and supervised discretization were expected to work better due to the distribution of our dataset, which benefited from such modification. Simple Naive Bayes assumed that the numerical data followed a normal distribution. In this case, supervised discretization achieved around 0.06% higher accuracy than kernel density estimation. Our tests suggest that either technique would work well.
Once again, we observed that all algorithms performed similarly when tweaking some simple parameters, with the only ones that significantly underperformed being Random Forest with no parameterization and with a bag size set to 25%, as well as KNN with a single neighbor.
In addition to the previously selected feature set, which consisted of five handpicked features through our custom selection process, we decided to create another feature set consisting of the top five features selected using the information gain (IG) algorithm through WEKA. IG allowed us to determine the importance or prominence of the features in our dataset based on their classification results. By evaluating our features through IG, we estimated their entropy, which is the amount of information (or surprise) inherent to the output variable’s possible outcomes. Lower-probability (surprising) events have more information, while higher-probability ones (less surprising) have less. Information gain measures the reduction in entropy. Based on this evaluation, we selected the top five features with the highest information gain.
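Outside WEKA, a comparable ranking can be sketched in Python; mutual information is used here as an approximation of WEKA’s information-gain attribute evaluation, so the resulting scores and ordering may differ from ours.

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

data = pd.read_csv("merged_features.csv")                  # illustrative file name
X = data.drop(columns=["complexity", "school_grade"])
y = data["complexity"].astype("category").cat.codes        # encode class labels as integers

# Estimate how much information each feature carries about the class label
# and keep the five highest-scoring features.
scores = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
print(scores.sort_values(ascending=False).head(5))
```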
A notable observation from the results of IG, as seen in Table 4, is that three out of the five features with the highest information gain were those that included less common variables in their formulas, rather than simply the number of words or characters. Spache readability includes unfamiliar words, Flesch–Kincaid considers syllables, and Gunning Fog takes into account complex words. This observation may be useful when creating features and formulas for similar datasets.
We ran the same tests on the top five features set based on the IG selection, obtaining the results shown in Table 5. Again, Naive Bayes with supervised discretization achieved the highest accuracy of 59.95%, but the difference from other algorithms was negligible. All the results were once again very consistent, regardless of the algorithm and feature set used, reaching nearly 60%. Naive Bayes with kernel density estimation and KNN with 3501 neighbors followed, with under 0.3% worse performance. The unpruned version of J48 was again less effective than others, while KNN with a single neighbor was once again the worst. The results seemed to follow the same pattern as the previous tests.
It is normal for a lazy algorithm like KNN to perform poorly, especially when multiple features and a spread mix of complex and simple sentences are taken into account. It was clear that a low number of neighbors would not suffice, since we had 8 K sentences split into two groups based on complexity. Using the -K flag, we increased the number of neighbors used for the final decision. We observed a consistent pattern with this algorithm across all our tests. We ran the KNN algorithm with different feature sets using 1, 13, 71, 551, and 3501 neighbors (all odd numbers, to avoid tied votes). Having a single neighbor usually did not perform well. In our case, increasing the number of neighbors significantly increased accuracy, since a limited number of neighbors could not cover the different example cases provided. We found that the sweet spot in most cases was setting the number of neighbors to about 1.8/4 (roughly 45%) of the total number of examples.
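The neighbor-count sweep can be reproduced along these lines (a scikit-learn analogue of our WEKA IBk runs, with an illustrative file name):

```python
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

data = pd.read_csv("merged_features.csv")                  # illustrative file name
X = data.drop(columns=["complexity", "school_grade"])
y = data["complexity"]

# Odd neighbor counts, as in our WEKA runs, to avoid tied votes in the binary decision.
for k in (1, 13, 71, 551, 3501):
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, X, y, cv=10, scoring="accuracy").mean()
    print(f"k={k}: accuracy={accuracy:.4f}")
```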
Examining our results, as shown in Figure 4, we observe that the lines representing test runs with sets including five features are almost overlapping and form a similar curve. They are also relatively close to the test runs using all the features. The peak performance of these lines lies in the middle to the end of Figure 4. On the other hand, tests with a single feature or pair of features tend to have a much smoother line, especially the line representing the results using McAlpine EFLAW as a single feature, which is almost straight regardless of the number of neighbors.
After considering the literature review, we decided to also run tests based on the Flesch–Kincaid grade level and word count as a feature set, since the literature suggests that this combination can offer great accuracy in certain cases [5]. Evaluating the obtained results, it appears that the combination of these two features is indeed relatively good at assessing readability. The results can be seen in Table 6.
Finally, we examined the predictability of a special feature we selected for this study that focuses specifically on L2 readers. We ran tests using McAlpine EFLAW as a single feature, which yielded interesting results, as shown in Table 7. It should be noted that even though shallow features are usually considered unstable or unreliable, McAlpine EFLAW was very consistent throughout the tests, regardless of the classifier and parameterization. Although the accuracy was not as high as in other tests, it was consistently around 57%. McAlpine EFLAW is likely a good feature for reliable readability assessment of texts aimed at L2 readers since its predictions in our tests, while not as high as those from our other feature sets, were at least very consistent.
In all of our results tables, it can be seen that the Random Forest (RF) algorithm achieved the best results when we increased the number of trees and significantly reduced the bag size. RF used all possible features in every tree. In RF, if the number of training observations greatly exceeds the number of trees, certain observations will be predicted just once or not at all, usually leading to poor results. Increasing the number of iterations (random trees) therefore increases the performance of the algorithm, as trees with good predictions will clearly outnumber those with bad predictions. The best results for Random Forest were obtained with a higher tree count and a lower bag size. We highlight the importance of the bag size percentage in the Random Forest algorithm: in all of our Random Forest test runs, decreasing the bag size yielded better accuracy. In most of our tests, when reducing the bag size to 1, meaning that each bag contained only 1% of the training set, RF performed the best. This is an extreme case with unconventional results but, similar to the extremely low confidence factor in J48, the same approach seems to work for RF with the bag size percentage. However, these results might arise due to overfitting despite the 10-fold cross-validation, so we ran further tests later on.
To ensure that this parametrization significantly impacts most cases and to attempt to learn what affects the performance of the classifier the most, we decided to run further tests, tuning it even more. In the tests presented in Table 8, we used the exact same dataset and altered the parameters concerning the bag size, the iterations, and the number of features. From the information in Table 8, we can see that increasing the number of iterations (to at least 1000) yielded significantly better results in most cases. Moreover, as seen in the previous tables, reducing the bag size percentage to as low as 1% yielded the best results in our experimental setup with this dataset. It appears that the best-performing setting is with the bag size percentage set to 1% while performing 10,000 iterations. So, we conclude with a well-known fact that supports our results: increasing iterations also increases performance, as does decreasing the bag size. The best run in these tests returned over 60% of correctly classified instances, while the worst, with bag size set to 100% and 10 iterations, returned a below-average 42.4%.
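For readers working outside WEKA, the best-performing configuration corresponds roughly to the following scikit-learn setup, where max_samples plays the role of the bag size percentage and n_estimators the number of iterations; this is an analogue of our WEKA runs, not the runs themselves, and the file name is illustrative.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("merged_features.csv")                  # illustrative file name
X = data.drop(columns=["complexity", "school_grade"])
y = data["complexity"]

# Roughly mirrors WEKA's "-P 1 -I 10000": each tree is grown on a 1% bootstrap
# sample, 10,000 trees are combined, and all features are available to every tree.
forest = RandomForestClassifier(n_estimators=10_000, max_samples=0.01,
                                max_features=None, bootstrap=True,
                                n_jobs=-1, random_state=0)
accuracy = cross_val_score(forest, X, y, cv=10, scoring="accuracy").mean()
print(f"Random Forest accuracy: {accuracy:.4f}")
```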
Drawing definitive conclusions is challenging since, based on the above, the fact that increasing the bag size yields worse results (even by about 13% to 17% in some cases) might suggest an overfitting issue. On the other hand, we cannot say this definitively since when setting the bag size to 100% (which is a technique prone to overfitting), we unexpectedly obtain even lower correctly classified instances than when using 50% or 1%.
Through the graphical visualization of the results (Figure 5 and Figure 6), we observe that both lines representing tests using 5 and 17 features have a similar curve, despite the accuracy being lower in the case of the 5-feature set. We also observe a decrease in accuracy when increasing the number of iterations during the test runs with a bag size of 50%, but considering all of our previous tests, we can say that, in general, increasing the number of iterations slightly increases the accuracy, as does decreasing the bag size percentage, at least in our setup.
We should also clarify that we could not run tests for 10,000 iterations with bag sizes of 50% and 100% due to hardware and tool limitations. The model building either took too long or WEKA ran out of memory and constantly crashed.
Concluding this part of the research, we present the graph in Figure 7, which shows the best and worst percentages of correctly classified instances per feature group in our tests, regardless of the classifier used. Surprisingly, through our classifier tuning, we achieved almost identical performance across all sets and algorithms. There were certain exceptions like KNN with a single neighbor and Random Forest without parameterization. We also observed that when using all of our available features, we generated the best results, while the lowest accuracy was recorded with McAlpine EFLAW as a single feature. However, the difference was so small (≲3%) that we cannot say that EFLAW underperformed. In addition, McAlpine EFLAW showed high tolerance across all of our tests, with the best and worst performance differing by only 1.97%, while the rest of the feature sets exhibited some fluctuations in performance depending on the algorithm used. The same can be said for the Flesch–Kincaid set with word count to some extent. These fluctuations can be attributed to the number of features used.

4.4. Results for Clustering Algorithms

After completing our main goal of approaching the problem as a classification task, we decided to explore the clustering approach as well. To this end, we again utilized the WEKA environment and algorithms like Expectation Maximization, K-means, and DBSCAN. We used single features and feature sets previously showcased to compare the results of clustering with those from classification.
In these tests, we made use of WEKA’s class-to-cluster evaluation. In this mode, WEKA initially ignores the class attribute (in our case, complexity) and generates clusters. Then, during the test phase, it assigns classes to the clusters based on the majority value of the class attribute within each cluster. Finally, it computes the classification error based on this assignment and creates the corresponding confusion matrix. Performance was measured using the true positives, true negatives, false positives, and false negatives reported by WEKA, from which we calculated the accuracy, precision, recall, and F-measure for the clusters. In the following tables, “Log L.” stands for log likelihood, “S.S.E.” denotes the sum of squared errors, and “CCI” denotes the correctly classified instances. All the tests were run with the number of clusters set to two, except for DBSCAN, where this parametrization was not possible.
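The class-to-cluster procedure itself can be sketched in Python as follows: clusters are formed without the class attribute, each cluster is mapped to its majority class, and accuracy is computed from that assignment. This mirrors the evaluation described above rather than WEKA’s implementation, and the file and label names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("merged_features.csv")                  # illustrative file name
X = StandardScaler().fit_transform(data.drop(columns=["complexity", "school_grade"]))
y = data["complexity"].to_numpy()

# Cluster without looking at the class attribute, then assign each cluster
# the majority class of the examples it contains.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
majority = {c: pd.Series(y[clusters == c]).mode()[0] for c in np.unique(clusters)}
predictions = np.array([majority[c] for c in clusters])

accuracy = (predictions == y).mean()
print(f"Class-to-cluster accuracy: {accuracy:.4f}")
```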
Our first tests used the Expectation-Maximization (EM) algorithm. The recall and F1 values were acceptable compared to the results from the tests using classification algorithms, and the highest accuracy was equally satisfying, reaching 59.6%, which is comparable to the classification approach. The analytical results can be seen in Table 9 and Table 10.
K-means is one of the most popular and commonly used clustering algorithms. It is quite simple since it works by assigning data points to clusters based on the shortest distance to the chosen centroids. In our case, we used two centroids since our parametrization of the algorithm through WEKA allowed us to set the desired number of clusters—one for complex and one for simple sentences. By continually assigning points/examples closer to the centroids, the clusters were updated with each iteration. The results were quite encouraging, reaching an accuracy of 58.98% and an F1 score of 0.61 when using the information gain feature set. The analytic results regarding the K-means algorithm can be seen in Table 11 and Table 12.
We also ran tests on the DBSCAN algorithm, but we did not include a results table because, as expected, it was unable to cluster the examples correctly since they were extremely close to each other. Due to the density of the examples, DBSCAN created only a single cluster and achieved 50% accuracy. This is because our dataset consisted of only two classes—one complex and one simple—so placing everything in a single cluster was bound to result in 50% accuracy by chance.
In addition to the above tables, we visualized the resulting clusters (Figure 8 and Figure 9) to gain better insight into how the examples were grouped.
In conclusion, the clustering approach appears to be very much on par with the classification approach in terms of performance. Looking at the results, the higher number of false negatives relative to false positives indicates that the main problem is the assignment of many complex sentences to the simple cluster; the opposite error was far less common in our tests.

5. Conclusions and Future Work

In this paper, we analyzed the performance of novel shallow feature sets on a specific Wikipedia dataset and presented results from a variety of optimally configured classifiers. Instead of the usual classification of whole texts or paragraphs, we performed classification at the sentence level, a step down in granularity. This makes the classification task technically more challenging, since individual sentences offer less length and context, but it is more useful for certain of the aforementioned applications.
Through supervised learning in WEKA, we ran multiple algorithms and presented the best-performing ones per case while attempting to analyze the reasons behind their performance. We identified a “sweet spot” in the KNN parametrization with respect to the number of neighbors, which in our case appears to lie at roughly 1.8/4 (about 45%) of the total number of examples. Furthermore, we examined in more depth why the Random Forest classifier behaves as it does on our dataset with our shallow feature sets, and concluded that increasing the number of iterations and lowering the bag size percentage produces better results in our tests; an illustrative parameter sweep is sketched below.
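For readers who wish to reproduce this kind of tuning loop outside WEKA, the sketch below uses scikit-learn analogues under stated assumptions (n_neighbors standing in for the lBk K value, n_estimators for the Random Forest -I iterations, and max_samples roughly for the -P bag size percentage); the feature matrix and labels are placeholders, so the printed numbers are not those reported in our tables.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(800, 17))      # placeholder shallow-feature matrix
y = rng.integers(0, 2, size=800)    # placeholder complexity labels

# Sweep the number of neighbours, mirroring the lBk-K runs.
for k in (1, 13, 71, 351):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10).mean()
    print(f"KNN, k={k}: {acc:.3f}")

# Vary the number of trees and the per-tree sample fraction, mirroring -I and -P.
for n_trees, bag_fraction in ((10, 1.0), (1000, 0.01)):
    rf = RandomForestClassifier(n_estimators=n_trees, max_samples=bag_fraction,
                                bootstrap=True, random_state=0, n_jobs=-1)
    print(f"RF, I={n_trees}, P~{bag_fraction:.0%}:",
          f"{cross_val_score(rf, X, y, cv=10).mean():.3f}")
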
As our ultimate goal is higher accuracy and a better fit of the features to the dataset, we highlighted useful observations about the nature of the algorithms and the features. We focused on testing features designed for, or at least suitable for, L2 readers, including infrequently used ones such as the McAlpine EFLAW score. EFLAW proved particularly stable and consistent across tests, and even as a single feature it offered fair accuracy. This demonstrates an important point: with proper formulation, and by selecting only features that accurately reflect the nature of the dataset, even simple shallow features can be reliable and tolerant across different datasets and classifiers. We therefore believe that using features specifically suited to the target audience of the text can lead to better or equally good accuracy.
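To illustrate how cheap such a shallow feature is to extract, the sketch below computes the EFLAW score directly from McAlpine’s published formula as we understand it, i.e., the number of words plus the number of mini-words (three letters or fewer) divided by the number of sentences [25]; in our experiments the value was obtained through the TextStat library [24,33], which, to the best of our knowledge, exposes it in recent releases as textstat.mcalpine_eflaw().

import re

def mcalpine_eflaw(text: str) -> float:
    # Mini-words are tokens of three letters or fewer; sentences are counted
    # from terminal punctuation. Both are deliberately crude approximations.
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    mini_words = [w for w in words if len(w) <= 3]
    return (len(words) + len(mini_words)) / sentences

print(mcalpine_eflaw("The cat sat on the mat. It was happy to be there."))  # 11.0
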
Moreover, we compared the classification and clustering approaches and concluded that they can be equally effective for this binary task. In the clustering results, the main factor hindering performance was that many complex examples were assigned to the simple cluster rather than the opposite; a better understanding of the dataset and of how linguistics grades sentence difficulty could improve this approach. Our peak accuracy with the classification algorithms was 60.56% using all of our employed features, while the highest accuracy with the clustering algorithms was 59.6%, obtained with the feature set selected by the information gain algorithm, which stood out noticeably from the other sets. Regarding the classification approach, regardless of the feature set and algorithm used, and with some minor exceptions for KNN and RF, the accuracies were surprisingly close, varying by less than 4–5%.
In the future, we plan to expand our dataset and test additional features, including features from the linguistic family. We also aim to compare results from different tools, further optimize the classifiers, and include tests with neural networks and deep learning approaches, which we believe will lead to significantly better results. There appears to be a lack of surface-level formulas designed specifically for sentence-level readability assessment. Further research into the linguistics behind this task is also needed to better understand what EFL and L2 readers actually perceive as complex or simple; this would allow us to create custom features and formulas that align with this understanding and measure readability more accurately. We hope this work will serve as a foundation for further research, contributing to the important endeavor of advancing readability assessment for the benefit of a broad audience.

Author Contributions

Conceptualization, D.K. and K.L.K.; methodology, K.L.K.; validation, T.A.; formal analysis, T.A.; investigation, D.K.; writing—original draft preparation, D.K.; writing—review and editing, T.A.; visualization, D.K.; supervision, K.L.K.; project administration, K.L.K. and T.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code used for data collection, as well as the dataset and the detailed classification and clustering results from WEKA showcased in this study, will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CCI    Correctly Classified Instances
ICI    Incorrectly Classified Instances
Pre.    Precision
Re.    Recall
POS    Part of Speech
L2    Language 2
EFL    English Foreign Learners
CEFR    Common European Framework of Reference
NPRM    Neural Pairwise Ranking Model
MLM    Masked Language Model
PCC    Pearson Correlation Coefficient
KNN    K-Nearest Neighbors
RF    Random Forest
RF-P    Random Forest algorithm with custom P value for the bag size percentage
RF-I    Random Forest algorithm with custom I value for the number of iterations
NB    Naive Bayes
NB-K    Naive Bayes with Kernel Density Estimation
NB-D    Naive Bayes with Supervised Discretization
J48-C    J48 algorithm with customized confidence interval value
J48-U    Unpruned J48 algorithm
lBk    K-Nearest Neighbors algorithm
lBk-K    K-Nearest Neighbors algorithm with custom K value for the number of neighbors
IG    Information Gain
EM    Expectation Maximization
Log L.    Log Likelihood
S.S.E.    Sum of Squared Errors
SMO    Sequential Minimal Optimization

References

  1. Zervopoulos, A.; Alvanou, A.G.; Bezas, K.; Papamichail, A.; Maragoudakis, M.; Kermanidis, K. Hong Kong Protests: Using Natural Language Processing for Fake News Detection on Twitter. In IFIP Advances in Information and Communication Technology; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 408–419. [Google Scholar] [CrossRef]
  2. Nikiforos, M.N.; Deliveri, K.; Kermanidis, K.L.; Pateli, A. Machine Learning on Wikipedia Text for the Automatic Identification of Vocational Domains of Significance for Displaced Communities. In Proceedings of the 2022 17th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP), Corfu, Greece, 3–4 November 2022. [Google Scholar] [CrossRef]
  3. Mouratidis, D.; Kermanidis, K. Ensemble and Deep Learning for Language-Independent Automatic Selection of Parallel Data. Algorithms 2019, 12, 26. [Google Scholar] [CrossRef]
  4. Zhang, L.; Liu, Z.; Ni, J. Feature-Based Assessment of Text Readability. In Proceedings of the 2013 Seventh International Conference on Internet Computing for Engineering and Science, Shanghai, China, 20–22 September 2013; pp. 51–54. [Google Scholar] [CrossRef]
  5. Feng, L.; Jansche, M.; Huenerfauth, M.; Elhadad, N. A Comparison of Features for Automatic Readability Assessment. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, Beijing, China, 23–27 August 2010; pp. 276–284. [Google Scholar]
  6. Kauchak, D.; Mouradi, O.; Pentoney, C.; Leroy, G. Text Simplification Tools: Using Machine Learning to Discover Features that Identify Difficult Text. In Proceedings of the 2014 47th Hawaii International Conference on System Sciences, Waikoloa, HI, USA, 6–9 January 2014; pp. 2616–2625. [Google Scholar] [CrossRef]
  7. Trokhymovych, M.; Sen, I.; Gerlach, M. An Open Multilingual System for Scoring Readability of Wikipedia. arXiv 2024, arXiv:2406.01835. [Google Scholar] [CrossRef]
  8. Xia, M.; Kochmar, E.; Briscoe, T. Text Readability Assessment for Second Language Learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications. Association for Computational Linguistics, San Diego, CA, USA, 16 June 2016. [Google Scholar] [CrossRef]
  9. Vajjala, S.; Meurers, D. Assessing the relative reading level of sentence pairs for text simplification. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, Gothenburg, Sweden, 26–30 April 2014; pp. 288–297. [Google Scholar] [CrossRef]
  10. CohMetrix Online Tool. Available online: http://cohmetrix.com/ (accessed on 5 November 2023).
  11. Feng, L. Automatic Readability Assessment. Ph.D. Thesis, The City University of New York, New York, NY, USA, 2010. [Google Scholar]
  12. Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for Data Mining: Practical Machine Learning Tools and Techniques, 4th ed.; Morgan Kaufmann: Cambridge, MA, USA, 2016. [Google Scholar]
  13. Vajjala, S.; Meurers, D. Readability assessment for text simplification: From analysing documents to identifying sentential simplifications. ITL—Int. J. Appl. Linguist. 2014, 165, 194–222. [Google Scholar] [CrossRef]
  14. Matricciani, E. Readability indices do not say it all on a text readability. Analytics 2023, 2, 296–314. [Google Scholar] [CrossRef]
  15. Stajner, S.; Evans, R.; Orasan, C.; Mitkov, R. What can readability measures really tell us about text complexity? In Proceedings of the Workshop on Natural Language Processing for Improving Textual Accessibility, Istanbul, Turkey, 27 May 2012; pp. 14–22. [Google Scholar]
  16. Aluisio, S.; Specia, L.; Gasperin, C.; Scarton, C. Readability Assessment for Text Simplification. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications, Los Angeles, CA, USA, 5 June 2010; pp. 1–9. [Google Scholar]
  17. Vajjala, S.; Meurers, D. On The Applicability of Readability Models to Web Texts. In Proceedings of the Second Workshop on Predicting and Improving Text Readability for Target Reader Populations, Sofia, Bulgaria, 8 August 2013; pp. 59–68. [Google Scholar]
  18. Napoles, C.; Dredze, M. Learning Simple Wikipedia: A Cogitation in Ascertaining Abecedarian Language. In Proceedings of the NAACL HLT 2010 Workshop on Computational Linguistics and Writing: Writing Processes and Authoring Aids, Los Angeles, CA, USA, 6 June 2010; pp. 42–50. [Google Scholar]
  19. Al-Thanyyan, S.S.; Azmi, A.M. Automated Text Simplification: A Survey. ACM Comput. Surv. 2021, 54, 1–36. [Google Scholar] [CrossRef]
  20. Cha, M.; Gwon, Y.; Kung, H.T. Language Modeling by Clustering with Word Embeddings for Text Readability Assessment. arXiv 2017, arXiv:1709.01888. [Google Scholar] [CrossRef]
  21. Setia, S.; Iyengar, S.R.S.; Verma, A.A.; Dubey, N. Is Wikipedia Easy to Understand?: A Study Beyond Conventional Readability Metrics. In Advances in Computational Collective Intelligence, Proceedings of the 15th International Conference, ICCCI 2023, Budapest, Hungary, 27–29 September 2023; Wojtkiewicz, K., Treur, J., Pimenidis, E., Maleszka, M., Eds.; Springer: Cham, Switzerland, 2021; pp. 175–187. [Google Scholar]
  22. Isaksson, F. Is Simple Wikipedia Simple?—A Study of Readability and Guidelines. 2019. Available online: https://api.semanticscholar.org/CorpusID:208050433 (accessed on 2 February 2023).
  23. Jatowt, A.; Tanaka, K. Is Wikipedia too difficult? Comparative analysis of readability of Wikipedia, simple Wikipedia and Britannica. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, Maui, HI, USA, 29 October–2 November 2012; pp. 2607–2610. [Google Scholar] [CrossRef]
  24. Bansal, S.; Aggarwal, C. Textstat on PyPI. Available online: https://pypi.org/project/textstat/ (accessed on 5 November 2023).
  25. McAlpine, R. From Plain English to Global English. 2006. Available online: https://www.angelfire.com/nd/nirmaldasan/journalismonline/fpetge.html (accessed on 2 February 2023).
  26. Davison, A.; Kantor, R.N. On the Failure of Readability Formulas to Define Readable Texts: A Case Study from Adaptations. Read. Res. Q. 1982, 17, 187–209. [Google Scholar] [CrossRef]
  27. Feng, L.; Elhadad, N.; Huenerfauth, M. Cognitively Motivated Features for Readability Assessment. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Athens, Greece, 30 March–3 April 2009; pp. 229–237. [Google Scholar]
  28. Collins-Thompson, K.; Callan, J.P. A Language Modeling Approach to Predicting Reading Difficulty. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics: HLT-NAACL 2004, Boston, MA, USA, 2–7 May 2004; pp. 193–200. [Google Scholar]
  29. Kauchak, D. Simple English Wikipedia: A New Simplification Task—Wikipedia Pages Data-Set. Available online: http://www.cs.pomona.edu/~dkauchak/simplification/ (accessed on 5 November 2023).
  30. Coster, W.; Kauchak, D. Simple English Wikipedia: A New Text Simplification Task. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 665–669. [Google Scholar]
  31. Wikipedia: The Free Online Encyclopedia. Available online: https://en.wikipedia.org/wiki/Main_Page (accessed on 5 November 2023).
  32. Barzilay, R.; Elhadad, N. Sentence Alignment for Monolingual Comparable Corpora. In Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, Sapporo, Japan, 11–12 July 2003; pp. 25–32. [Google Scholar]
  33. alxwrd, S. Textstat: The Easy to Use Library to Calculate Statistics from Text. Available online: https://github.com/textstat/textstat (accessed on 5 November 2023).
  34. Graesser, A.C.; McNamara, D.S.; Louwerse, M.M.; Cai, Z. Coh-Metrix: Analysis of text on cohesion and language. Behav. Res. Methods Instrum. Comput. 2004, 36, 193–202. [Google Scholar] [CrossRef] [PubMed]
  35. Crossley, S.A.; Kyle, K.; Dascalu, M. The Tool for the Automatic Analysis of Cohesion 2.0: Integrating semantic similarity and text overlap. Behav. Res. Methods 2018, 51, 14–27. [Google Scholar] [CrossRef] [PubMed]
  36. Kyle, K. Measuring Syntactic Development in L2 Writing: Fine Grained Indices of Syntactic Complexity and Usage-Based Indices of Syntactic Sophistication. Ph.D. Thesis, Georgia State University, Atlanta, GA, USA, 2016. [Google Scholar] [CrossRef]
Figure 1. Infographic illustrating each step of our methodology.
Figure 2. This figure shows the percentage differences in mean feature values between complex and simple classification groups, sorted by magnitude.
Figure 3. Visualization from the WEKA environment, showing the values of the features of the dataset examples per classification group. Red represents complex examples and blue represents simple ones.
Figure 4. Visualized results of the KNN algorithm run on different feature sets and numbers of neighbors. The figure shows the percentage of accurately classified instances based on the number of neighbors per run.
Figure 5. Visualized results of the Random Forest tests considering only 5 features from our set (Flesch–Kincaid, Gunning Fog, Automated Readability Index, Spache Readability Formula, and School Grade).
Figure 6. Visualized results of the Random Forest tests considering all 17 features employed from our set.
Figure 7. This graph shows the best and worst percentages of CCI per feature group in our tests, regardless of the classifier used. The best case is represented in blue, and the worst case is represented in red.
Figure 8. Visualized clusters from the K-means algorithm. The Y-axis represents the Flesch–Kincaid Grade feature values. Crosses and squares represent complex and simple examples, while colors indicate the different clusters formed.
Figure 9. Visualized clusters from the K-means algorithm. The Y-axis represents actual sentence complexity. Crosses and squares represent complex and simple examples, while colors indicate the different clusters formed.
Table 1. Summary of articles, datasets, languages, features, learning algorithms/tools, and results.
Article | Dataset | Language | Features | Learning Algorithms and/or Tools | Results/Key Takeaways
[4] | 15 news texts from Reuters | English | Shallow, cohesion, syntactic, lexical features, connectives | Coh-Metrix NLP tool | Long sentences, complex words, and syntactic complexity decrease readability. Pronoun density confuses readers.
[5] | 1433 articles from Weekly Reader magazine, LocalNews2007, LocalNews2008 | English | Discourse, entity density, lexical chain, coreference interference, entity grid features | 10-fold cross-validation, LIBSVM, WEKA | Entity-density features outperformed the others. Sentence length is the most efficient shallow feature. Noun-based POS features have strong predictive power.
[6] | Wikipedia and Simple Wikipedia corpus (11,800 examples) | English | 16 features including surface level, POS, vocabulary, concept density, aggregate features | Random Forest, Decision Trees, Linear Regression, K-Nearest Neighbors, Naive Bayes, SVM | Random Forest achieved over 74% accuracy. Shallow and POS features yielded the best results.
[7] | Wikipedia articles in 14 languages | Multilingual | Text-ranking and sentence-ranking models, number of sentences, Flesch reading ease, linguistic, language-agnostic features | Neural Pairwise Ranking Model (NPRM), Multilingual Masked Language Model (MLM), Siamese Network Architecture, Margin Ranking Loss | TRank outperformed SRank and all baselines. Flesch reading ease performed well in English but not in other languages. Sentence-based approach makes classification harder.
[8] | CEFR-graded dataset focusing on L2 | English | Shallow, lexical, POS, syntactic, language modeling, entity grid, entity density, parse tree syntactic, lexico-semantic, discourse features | Support Vector Machine (SVM) | Accuracy of 0.803 and PCC of 0.900 for native data. Accuracy of 0.785 and PCC of 0.924 for L2 data. Ranking models outperformed classification models for novel datasets.
[9] | Simple Wikipedia (100K sentence pairs), WeeBit (3125 articles), Common Core Standards Corpus (168 texts) | English | 151 features including shallow, lexical, POS, syntactic complexity, psycholinguistic, features from Celex lexical database | Pearson Correlation, Root Mean Square Error, 10-fold cross-validation, WEKA and WEKA SMO, SVM for binary classification | Best accuracy of 66% for binary classification. Sentence-level models are less effective than document-level models.
Table 2. Results from WEKA using 10-fold cross-validation and all our aforementioned features.
All Features
Classifier | CCI (%) | ICI (%) | Pre. | Re. | F1
NB | 57.7 | 42.3 | 0.583 | 0.577 | 0.569
NB-K | 59.725 | 40.275 | 0.599 | 0.589 | 0.594
NB-D | 59.9875 | 40.0125 | 0.600 | 0.600 | 0.600
J48 | 58.3375 | 41.6625 | 0.585 | 0.583 | 0.581
J48-C 0.05 | 59.7625 | 40.2375 | 0.598 | 0.598 | 0.598
J48-U | 54.25 | 45.75 | 0.544 | 0.543 | 0.539
lBk | 41.7 | 58.3 | 0.416 | 0.417 | 0.415
lBk-K 13 | 56.0375 | 43.9625 | 0.560 | 0.560 | 0.560
lBk-K 71 | 59.3125 | 40.6875 | 0.597 | 0.593 | 0.589
lBk-K 551 | 58.9125 | 41.0875 | 0.590 | 0.589 | 0.588
lBk-K 3501 | 59.9875 | 40.0125 | 0.600 | 0.600 | 0.600
SMO | 58.5625 | 41.4375 | 0.586 | 0.586 | 0.585
Random Forest | 45.35 | 54.65 | 0.453 | 0.454 | 0.453
Random Forest-P 25 | 51.4 | 48.6 | 0.514 | 0.514 | 0.514
Random Forest-P 1 | 59.925 | 40.075 | 0.600 | 0.599 | 0.599
Random Forest-P 1 -I 10,000 | 60.5625 | 39.4375 | 0.607 | 0.606 | 0.605
Decision Table | 59.675 | 40.325 | 0.597 | 0.597 | 0.596
Simple Logistic | 59.5875 | 40.4125 | 0.596 | 0.596 | 0.596
Table 3. Results from WEKA using 10-fold cross-validation and Flesch–Kincaid + Automated Readability Index + Linsear Write Formula + Difficult Words + Polysyllable Count.
Flesch–Kincaid + Auto. Read. + Linsear Formula + Diff. Words + Polysyll. Count
Classifier | CCI (%) | ICI (%) | Pre. | Re. | F1
NB | 57.9 | 42.1 | 0.588 | 0.579 | 0.568
NB-K | 59.675 | 40.325 | 0.597 | 0.597 | 0.597
NB-D | 59.7 | 40.3 | 0.597 | 0.597 | 0.597
J48 | 58.15 | 41.85 | 0.582 | 0.582 | 0.581
J48-C 0.05 | 58.8875 | 41.1125 | 0.589 | 0.589 | 0.588
J48-U | 57.425 | 42.575 | 0.575 | 0.574 | 0.573
lBk | 40.3375 | 59.6625 | 0.402 | 0.403 | 0.401
lBk-K 13 | 54.75 | 45.25 | 0.548 | 0.548 | 0.547
lBk-K 71 | 58.725 | 41.275 | 0.590 | 0.587 | 0.584
lBk-K 551 | 59.975 | 40.025 | 0.604 | 0.600 | 0.596
lBk-K 3501 | 59.5875 | 40.4125 | 0.596 | 0.596 | 0.596
SMO | 58.8625 | 41.1375 | 0.592 | 0.589 | 0.584
Random Forest | 44.125 | 55.875 | 0.441 | 0.441 | 0.441
Random Forest-P 25 | 49.95 | 50.05 | 0.499 | 0.500 | 0.499
Random Forest-P 1 | 58.8625 | 41.1375 | 0.589 | 0.589 | 0.588
Random Forest-P 1 -I 10,000 | 59.525 | 40.475 | 0.597 | 0.595 | 0.594
Decision Table | 59.1 | 40.9 | 0.591 | 0.591 | 0.591
Simple Logistic | 59.5125 | 40.4875 | 0.596 | 0.595 | 0.594
Table 4. Feature ranking results after applying the information gain ranking filter in the WEKA environment. The final order of the selected attributes is 6, 4, 9, 2, 18, 8, 1, 10, 17, 5, 14, 7, 11, 12, 15, 16, and 13.
Information Gain Ranking Filter Results
Entropy | Feature Number | Feature
0.03611 | 6 | Automated Readability Index
0.03583 | 4 | Gunning Fog
0.03571 | 9 | Spache Readability
0.03486 | 2 | Flesch–Kincaid Grade Level
0.03247 | 18 | School Grade
0.03237 | 8 | Linsear Write Formula
0.02712 | 1 | Flesch Reading Ease
0.02334 | 10 | McAlpine EFLAW
0.02191 | 17 | Difficult Words
0.02144 | 5 | Coleman–Liau Index
0.01825 | 14 | Polysyllable Count
0.01714 | 7 | Dale–Chall Readability Score
0.01597 | 11 | Reading Time
0.01572 | 12 | Syllable Count
0.0103 | 15 | Word Count
0.00822 | 16 | Mini-Word Count
0.00272 | 13 | Monosyllable Count
Table 5. Results from WEKA using 10-fold cross-validation and Flesch–Kincaid + Gunning Fog + Automated Readability Index + Spache Readability Formula + School Grade.
Flesch–Kincaid + Gunning Fog + Automated Read. + Spache Read. + School Grade
Classifier | CCI (%) | ICI (%) | Pre. | Re. | F1
NB | 59.3625 | 40.6375 | 0.594 | 0.594 | 0.593
NB-K | 59.7 | 40.3 | 0.597 | 0.597 | 0.597
NB-D | 59.95 | 40.05 | 0.601 | 0.600 | 0.598
J48 | 58.75 | 41.25 | 0.589 | 0.588 | 0.586
J48-C 0.05 | 58.8375 | 41.1625 | 0.590 | 0.588 | 0.587
J48-U | 57.2625 | 42.7375 | 0.573 | 0.573 | 0.572
lBk | 41.4 | 58.6 | 0.413 | 0.414 | 0.412
lBk-K 13 | 55.1125 | 44.8875 | 0.551 | 0.551 | 0.551
lBk-K 71 | 58.45 | 41.55 | 0.585 | 0.585 | 0.584
lBk-K 551 | 58.8625 | 41.1375 | 0.589 | 0.589 | 0.588
lBk-K 3501 | 59.725 | 40.275 | 0.598 | 0.597 | 0.596
SMO | 58.5625 | 41.4375 | 0.586 | 0.586 | 0.585
Random Forest | 43.4125 | 56.5875 | 0.434 | 0.434 | 0.434
Random Forest-P 25 | 49.7875 | 50.2125 | 0.498 | 0.498 | 0.498
Random Forest-P 1 | 58.2125 | 41.7875 | 0.583 | 0.582 | 0.581
Random Forest-P 1 -I 10,000 | 59.1625 | 40.8375 | 0.592 | 0.592 | 0.591
Decision Table | 59.6375 | 40.3625 | 0.598 | 0.596 | 0.595
Simple Logistic | 59.4375 | 40.5625 | 0.594 | 0.594 | 0.594
Table 6. Results from WEKA using 10-fold cross-validation and Flesch–Kincaid + Word Count.
Flesch–Kincaid + Word Count
Classifier | CCI (%) | ICI (%) | Pre. | Re. | F1
NB | 56.675 | 43.325 | 0.574 | 0.567 | 0.556
NB-K | 58.6875 | 41.3125 | 0.587 | 0.587 | 0.586
NB-D | 59.3875 | 40.6125 | 0.594 | 0.594 | 0.593
J48 | 59.2125 | 40.7875 | 0.593 | 0.592 | 0.591
J48-C 0.05 | 59.3625 | 40.6375 | 0.595 | 0.594 | 0.592
J48-U | 59.15 | 40.85 | 0.593 | 0.592 | 0.590
lBk | 54.3 | 45.7 | 0.543 | 0.543 | 0.543
lBk-K 13 | 55.95 | 44.05 | 0.560 | 0.560 | 0.559
lBk-K 71 | 58.275 | 41.725 | 0.584 | 0.583 | 0.581
lBk-K 551 | 59.4125 | 40.5875 | 0.598 | 0.594 | 0.590
lBk-K 3501 | 59.3875 | 40.6125 | 0.595 | 0.594 | 0.593
SMO | 58.6375 | 41.3625 | 0.588 | 0.586 | 0.585
Random Forest | 54.8 | 45.2 | 0.549 | 0.548 | 0.547
Random Forest-P 25 | 55.7375 | 44.2625 | 0.558 | 0.557 | 0.556
Random Forest-P 1 | 58.65 | 41.35 | 0.589 | 0.587 | 0.584
Random Forest-P 1 -I 10,000 | 59.2625 | 40.7375 | 0.595 | 0.593 | 0.590
Decision Table | 59.725 | 40.275 | 0.598 | 0.597 | 0.597
Simple Logistic | 58.7125 | 41.2875 | 0.588 | 0.587 | 0.587
Table 7. Results from WEKA using 10-fold cross-validation and McAlpine EFLAW as a single feature.
McAlpine EFLAW
Classifier | CCI (%) | ICI (%) | Pre. | Re. | F1
NB | 55.9625 | 44.0375 | 0.571 | 0.560 | 0.541
NB-K | 57.525 | 42.475 | 0.577 | 0.575 | 0.572
NB-D | 57.925 | 42.075 | 0.586 | 0.579 | 0.571
J48 | 57.9375 | 42.0625 | 0.586 | 0.579 | 0.572
J48-C 0.05 | 57.9375 | 42.0625 | 0.586 | 0.579 | 0.572
J48-U | 57.9125 | 42.0875 | 0.586 | 0.579 | 0.571
lBk | 57.25 | 42.75 | 0.574 | 0.573 | 0.571
lBk-K 13 | 57.425 | 42.575 | 0.576 | 0.574 | 0.572
lBk-K 71 | 57.1375 | 42.8625 | 0.574 | 0.571 | 0.568
lBk-K 551 | 57.7125 | 42.2875 | 0.579 | 0.577 | 0.574
lBk-K 3501 | 57.4375 | 42.5625 | 0.575 | 0.574 | 0.574
SMO | 56.0375 | 43.9625 | 0.572 | 0.560 | 0.542
Random Forest | 57.5 | 42.5 | 0.577 | 0.575 | 0.573
Random Forest-P 25 | 57.5625 | 42.4375 | 0.577 | 0.576 | 0.573
Random Forest-P 1 | 56.925 | 43.075 | 0.570 | 0.569 | 0.569
Random Forest-P 1 -I 10,000 | 57.875 | 42.125 | 0.581 | 0.579 | 0.576
Decision Table | 57.925 | 42.075 | 0.586 | 0.579 | 0.571
Simple Logistic | 57.35 | 42.65 | 0.576 | 0.574 | 0.571
Table 8. Results from WEKA using 10-fold cross-validation and different settings for the Random Forest classifier.
Random Forest Results per Setting
All 17 Features
Parameters | CCI (%) | ICI (%) | Pre. | Re. | F1
-P 1 -I 10 | 55.8625 | 44.1375 | 0.559 | 0.559 | 0.558
-P 1 -I 1000 | 60.3 | 39.7 | 0.604 | 0.603 | 0.602
-P 1 -I 10,000 | 60.475 | 39.525 | 0.606 | 0.605 | 0.604
-P 50 -I 10 | 47.225 | 52.775 | 0.472 | 0.472 | 0.471
-P 50 -I 1000 | 46.3625 | 53.6375 | 0.464 | 0.464 | 0.463
-P 100 -I 10 | 44.2375 | 55.7625 | 0.442 | 0.442 | 0.442
-P 100 -I 1000 | 45.025 | 54.975 | 0.450 | 0.450 | 0.450
5 Features
Parameters | CCI (%) | ICI (%) | Pre. | Re. | F1
-P 1 -I 10 | 54.375 | 45.625 | 0.545 | 0.544 | 0.542
-P 1 -I 1000 | 58.8875 | 41.1125 | 0.590 | 0.589 | 0.588
-P 1 -I 10,000 | 59.1625 | 40.8375 | 0.592 | 0.592 | 0.591
-P 50 -I 10 | 46.775 | 53.225 | 0.468 | 0.468 | 0.467
-P 50 -I 1000 | 45.3 | 54.7 | 0.453 | 0.453 | 0.453
-P 100 -I 10 | 42.4 | 57.6 | 0.424 | 0.424 | 0.424
-P 100 -I 1000 | 43.825 | 56.175 | 0.438 | 0.438 | 0.438
Table 9. Results from WEKA in the form of metrics using the Expectation-Maximization algorithm, with class-to-cluster evaluation across different feature sets.
Performance Results for Expectation Maximization
Feature Set | Accuracy (%) | Precision | Recall | F-Measure
ALL | 57.2625 | 0.691 | 0.559 | 0.618
TOP 5 (IG) | 59.6 | 0.625 | 0.591 | 0.607
Flesch–Kincaid | 56.6375 | 0.799 | 0.545 | 0.648
EFLAW | 55.725 | 0.799 | 0.539 | 0.644
Kin. + EFLAW | 56.7 | 0.788 | 0.546 | 0.645
Table 10. Results from WEKA using the Expectation-Maximization algorithm, with class-to-cluster evaluation across different feature sets.
Clustering Results for Expectation Maximization
Feature Set | TP | TN | FN | FP | Log L.
ALL | 2763 | 1818 | 2182 | 1237 | −46.2075
TOP 5 (IG) | 2498 | 2270 | 1730 | 1502 | −12.2730
Flesch–Kincaid | 3197 | 1334 | 2666 | 803 | −2.9014
EFLAW | 3197 | 1261 | 2739 | 803 | −3.8414
Kin. + EFLAW | 3152 | 1384 | 2616 | 848 | −6.5203
Table 11. Results from WEKA in the form of metrics using the K-means algorithm, with class-to-cluster evaluation across different feature sets.
Performance Results for K-Means
Feature Set | Accuracy (%) | Precision | Recall | F-Measure
ALL | 56.7375 | 0.791 | 0.547 | 0.646
TOP 5 (IG) | 58.9875 | 0.645 | 0.581 | 0.611
Flesch–Kincaid | 57.7625 | 0.725 | 0.560 | 0.632
EFLAW | 55.9125 | 0.779 | 0.541 | 0.639
Kin. + EFLAW | 56.6 | 0.790 | 0.546 | 0.645
Table 12. Results from WEKA using the K-means algorithm, with class-to-cluster evaluation across different feature sets.
Clustering Results for K-Means
Feature Set | TP | TN | FN | FP | S.S.E.
ALL | 3164 | 1375 | 2625 | 836 | 7688.14
TOP 5 (IG) | 2580 | 2139 | 1861 | 1420 | 6463.61
Flesch–Kincaid | 2900 | 1721 | 2279 | 1100 | 43.87
EFLAW | 3117 | 1356 | 2644 | 883 | 59.24
Kin. + EFLAW | 3159 | 1369 | 2631 | 841 | 126.38
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
