1. Background of the Study
Somaliland, a self-declared independent state in the Horn of Africa, faces multifaceted challenges in its educational landscape. Despite commendable efforts to expand access to education, dropout rates remain a pressing concern, posing a significant barrier to the country’s development goals. Factors such as poverty, inadequate infrastructure, cultural norms, and limited resources contribute to this challenge, underscoring the need for targeted, data-driven interventions.
According to [1], dropout refers to children who leave the educational system before completing their academic year. This means that they do not receive the final mark for that year or an official document proving that they finished that specific year of primary or secondary school.
Dropout, in the context of education, refers to the phenomenon in which students leave school before completing their studies. The dropout rate is a crucial metric that reflects the number or percentage of students who disengage prematurely from the educational system. Various studies have highlighted the prevalence of dropouts across different educational levels, including primary and secondary education [2].
Some studies have highlighted the complexity of defining dropout, especially in the context of online education, where factors such as work and family constraints can influence students’ decisions to discontinue their studies [3].
The student dropout rate is a critical issue affecting educational systems worldwide, with significant implications for individual students, educational institutions, and society. Despite efforts to address this issue, there remains a need to comprehensively analyze the determinants of student dropout rates to develop effective interventions and policies. The National Education Accessibility Survey 2022 provides a valuable dataset for investigating the factors that influence student dropout rates. By examining this dataset, this study aimed to identify and understand the key determinants contributing to student dropout rates, thereby informing targeted strategies to improve student retention and educational outcomes.
2. Significance and Novelty of the Study
This study marks a significant milestone in Somaliland, being the first to investigate student dropout rates using data from the 2022 National Education Accessibility Survey conducted in collaboration with the Ministry of Education. By employing a diverse set of machine learning models such as logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN), this research delves into the complexities of why students leave school early. It reveals how factors like age, household income, and family size play a role in dropout rates. Offering innovative strategies to address these challenges, this study is a pivotal contribution to enhancing educational outcomes in Somaliland, providing a nuanced understanding for policymakers, educators, and stakeholders.
For the government, this study offers invaluable insights for policymaking and resource allocation in the education sector, enabling targeted interventions to reduce dropout rates and enhance educational outcomes. International NGOs can leverage these findings to develop tailored programs and initiatives that address the identified risk factors, fostering collaboration with local stakeholders for impactful interventions. Researchers and academicians are presented with a pioneering research framework that not only advances the understanding of educational challenges in Somaliland but also showcases the application of diverse machine learning models in educational data analysis, offering methodological insights for future studies.
Parents and community members benefit from increased awareness about the complexities of student dropout, empowering them to actively participate in creating a supportive educational environment for children. Applied scientists and data science practitioners find a valuable case study in this research, highlighting the significance of utilizing advanced analytical techniques to uncover actionable insights from educational data. By proposing innovative strategies to tackle dropout challenges, this study lays the groundwork for sustainable improvements in educational outcomes, ultimately contributing to a more inclusive and successful educational landscape in Somaliland.
3. Methodology
3.1. Research Design
This study aimed to analyze student dropout rates using data from the National Education Accessibility Survey 2022, focusing on socioeconomic, demographic, and educational factors.
This study used machine learning algorithms to identify significant predictors and explore variations across different demographic and socioeconomic groups. Subgroup analyses were conducted to explore potential variations in dropout determinants across demographics. The research aims to provide insights into the complex factors driving student dropout rates and inform targeted interventions to promote inclusive schooling.
3.2. Data Source
The National Education Accessibility Survey reached 1957 of the 2000 sampled households across the selected districts, as shown in Table 2.
3.3. Study Variables
3.3.1. Outcome Variable
In this study, the outcome variable, defined based on [1], pertains to school-age children in Somaliland who disengage from the educational system before successfully completing the academic year. A child is categorized as a “dropout” if he/she does not attain the final mark for the academic year or lacks an official document certifying the completion of that specific year within primary or secondary school. Consequently, for the purposes of this research, the outcome variable is binary: assigned a value of 1 to denote a child who has dropped out of school and 0 to indicate a child who has not.
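This binary coding can be sketched in plain Python as follows; the record structure and field names are hypothetical illustrations, not the actual NEAS 2022 schema:

```python
# Hypothetical child records; real NEAS 2022 field names may differ.
children = [
    {"child_id": 101, "completed_year": True},   # has final mark / certificate
    {"child_id": 102, "completed_year": False},  # left before the year ended
]

# Binary outcome per the definition in [1]:
# 1 = dropped out (no final mark or completion certificate), 0 = otherwise.
for c in children:
    c["dropout"] = 0 if c["completed_year"] else 1
```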
3.3.2. Predictor Variables
The explanatory variables were classified into two categories: socioeconomic factors such as household leader educational level, household occupational status, region of residence, district of residence, and household wealth status, and demographic factors, including sex of children, number of children in the household, school time, age of the child, disability, and school distance.
3.4. Data Preprocessing
The raw data obtained from the 2022 NEAS dataset underwent comprehensive preprocessing to ensure its readiness for analysis. This phase involved multiple steps, including data cleaning, handling of missing values, and variable transformation, as necessary. These processes were essential in preparing a reliable and accurate dataset for subsequent analysis.
3.5. Data Cleaning
Data cleaning was a vital step to ensure the dataset’s quality and integrity. In this study, a meticulous data cleaning process was conducted to identify and correct errors, inconsistencies, and outliers within the data. This process included the removal of duplicate entries, correction of formatting issues, and addressing data entry errors. By implementing these steps, a high-quality dataset was created, which served as the foundation for all further analyses.
3.6. Missing Value Imputation
Handling missing data is crucial in preserving the dataset’s integrity and ensuring accurate analyses. In this study, missing value imputation was carried out iteratively to achieve complete data for all variables. This rigorous approach minimized the potential bias and errors associated with missing data, enhancing the reliability of the results. Appropriate imputation techniques were employed, resulting in a robust and comprehensive dataset, suitable for in-depth analysis and interpretation.
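The study does not name its specific imputation routine; as one hedged illustration of iterative, model-based imputation of the kind described, scikit-learn’s `IterativeImputer` regresses each feature with missing values on the others until the estimates stabilize (shown here on a toy matrix, not the NEAS data):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries; columns could stand for age,
# household size, and an income measure (illustrative only).
X = np.array([
    [7.0,  5.0,    3.0],
    [10.0, np.nan, 4.0],
    [12.0, 8.0,    np.nan],
    [9.0,  6.0,    3.0],
])

# Round-robin imputation: each incomplete feature is predicted from
# the others, repeated until convergence or max_iter is reached.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
```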
3.7. Proposed ML Models
In this study, we employed seven supervised machine learning algorithms to predict student dropout rates using data from the National Education Accessibility Survey 2022: decision tree, random forest, support vector machine (SVM), naïve Bayes, K-nearest neighbors (KNN), logistic regression, and probit regression.
3.7.1. Decision Tree
A decision tree is a non-parametric supervised learning algorithm that is adept at classification and regression tasks. It iteratively partitions the dataset based on the input feature values and makes decisions along the branches of the tree. In classification, it assigns data points to distinct classes, while in regression, it predicts the target variable’s value. Renowned for simplicity and interpretability, decision trees offer insights into feature importance and decision processes. However, they may be prone to overfitting, necessitating techniques like pruning for improved generalization [31].
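The pruning technique mentioned above can be sketched with scikit-learn’s cost-complexity pruning on synthetic data (an illustration only; the hyperparameter value and data are not the study’s settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features (the NEAS data are not public).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ccp_alpha > 0 applies cost-complexity pruning, trading a little training
# fit for better generalization, as the text suggests.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3))
```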
3.7.2. Random Forest
Random forest is an ensemble learning method celebrated for its resilience and adaptability. By generating numerous decision trees during training, each on a distinct data subset, it mitigates overfitting and enhances predictive accuracy. In classification tasks, the final prediction stems from the mode of classes predicted by individual trees, while in regression, it is derived from their mean prediction. This ensemble approach ensures robustness and stability across diverse datasets. Moreover, random forest’s capacity to handle high-dimensional data while maintaining interpretability renders it a widely favored choice in various domains [32].
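The ensemble idea can be sketched as follows, again on synthetic data (the number of trees is illustrative, not the study’s configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Each of the 200 trees is grown on a bootstrap sample with a random feature
# subset at each split; the forest's prediction is the trees' majority vote.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(round(scores.mean(), 3))
```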
3.7.3. Support Vector Machine (SVM)
Support vector machine (SVM) is a powerful supervised learning algorithm used for both classification and regression tasks. It constructs one or more hyperplanes in a high-dimensional space to separate data into distinct classes. The optimal hyperplane maximizes the margin between classes, improving model generalizability and reducing classification errors. For non-linearly separable data, SVM employs kernel functions to transform the input space, enabling effective separation in a higher-dimensional space. This approach enhances the algorithm’s flexibility in handling complex datasets [33].
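The effect of the kernel trick can be demonstrated on a toy dataset that is not linearly separable in the input space (an illustration, not the study’s configuration):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit mapping to a higher-dim space

# The RBF kernel separates the circles; the linear kernel cannot.
print(round(linear.score(X, y), 2), round(rbf.score(X, y), 2))
```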
3.7.4. Naïve Bayes
Naïve Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from a finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [34].
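Formally, this conditional-independence assumption lets the class posterior factor into per-feature likelihoods:

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The predicted class is the one maximizing this product, which is why only univariate likelihoods per class need to be estimated.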
3.7.5. K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a non-parametric algorithm used for classification and regression tasks. It classifies a data point by identifying the majority class among its k-nearest neighbors in the feature space. For regression, the prediction is the average value of the neighbors. KNN is simple and intuitive because it requires no prior model training and assumes that similar data points are near each other. However, it can be computationally intensive with large datasets and is sensitive to the choice of k and feature scaling, affecting its performance and accuracy [35].
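The sensitivity to feature scaling mentioned above can be illustrated as follows; the inflated first feature is contrived for demonstration, and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X[:, 0] *= 1000  # one feature on a much larger scale dominates raw distances

raw = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    X, y, cv=5,
).mean()
print(round(raw, 3), round(scaled, 3))
```

Standardizing features before computing distances typically restores the influence of the smaller-scale features.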
3.7.6. Logistic Regression
Logistic regression is a statistical method used in binary classification problems. It predicts the probability of an event’s occurrence by fitting the data to a logistic curve, which is an S-shaped function. This function outputs values between zero and one, representing the probability of an event. The model uses a linear combination of input features transformed by a logistic function to determine the probabilities. Logistic regression is valued for its interpretability, with coefficients indicating the influence of each feature. It works well with linearly separable data but may require regularization to prevent overfitting in complex scenarios [36].
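The S-shaped link described above is the logistic (sigmoid) function applied to a linear combination of the features:

```latex
p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})}}
```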
3.7.7. Probit Regression
Probit regression is a statistical method used for binary dependent variables similar to logistic regression. It predicts event probabilities by using a normal cumulative distribution function (CDF) instead of a logistic CDF. This model links input features to the probability of an outcome assuming a normal distribution of the latent variable. Probit regression is useful when the assumption of normally distributed errors is correct. It is often applied in fields such as econometrics and biometrics as an alternative to logistic regression [37].
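The only difference from the logistic model is the link function: the same linear predictor is passed through the standard normal CDF instead of the sigmoid:

```latex
p(y = 1 \mid \mathbf{x}) = \Phi\!\left(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
```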
3.8. Model Comparison and Evaluation
In this section, we compare the performance of the seven machine learning algorithms introduced earlier based on key evaluation metrics commonly used in assessing classification models. These metrics include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC). By comparing these metrics, our aim is to identify the most effective model for predicting student dropout rates.
Definitions of Notations:
True Positive (TP): The number of instances correctly predicted as positive.
True Negative (TN): The number of instances correctly predicted as negative.
False Positive (FP): The number of instances incorrectly predicted as positive.
False Negative (FN): The number of instances incorrectly predicted as negative.
Metrics:
Accuracy: The ratio of correctly predicted instances to the total instances.
Precision: The ratio of correctly predicted positive observations to the total predicted positives.
Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
F1 score: Harmonic mean of precision and recall.
AUC-ROC: Area under the ROC, which plots the true positive rate (TPR) against the false positive rate (FPR).
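These definitions translate directly into formulas over the confusion-matrix counts. The counts below are illustrative only, not the study’s results:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 90, 80, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 170 / 200 = 0.85
precision = TP / (TP + FP)                   # 90 / 100 = 0.90
recall = TP / (TP + FN)                      # sensitivity / true positive rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```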
The performance of each model was evaluated using these metrics on a test dataset. The model with the highest scores across these metrics is considered to be the best for predicting student dropout rates.
Model evaluation is a crucial step in assessing the performance of machine learning models. This process involves testing the models on a separate validation dataset that was not utilized during the training phase. The evaluation process includes specific steps to ensure clarity and repeatability:
Train/Test Split: The dataset was divided into 80% training data and 20% testing data to maintain consistency and transparency in the evaluation process.
Cross-validation: In this study, K-fold cross-validation (k = 10) was adopted to enhance robustness and generalizability of the model.
Evaluation Metrics: We calculate the evaluation metrics, such as accuracy, precision, recall, F1 score, and AUC-ROC, for each model, as described in the Model Comparison subsection.
Confusion Matrix: We ensured a detailed breakdown of the model’s predictions by generating a confusion matrix.
ROC: The ROC curve for each model was plotted to visualize its performance across various threshold values.
Hyperparameter Tuning: The hyperparameters of each model were tuned using methods like grid search or random search to discover the optimal parameters that maximize performance.
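The protocol above (80/20 split, 10-fold cross-validation, grid search) can be sketched with scikit-learn as follows; the data are synthetic and the parameter grid is illustrative, not the study’s exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=7)

# Step 1: 80/20 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# Steps 2 and 6: 10-fold CV inside a small grid search on the training part.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=7),
    param_grid={"max_depth": [3, 5, None], "ccp_alpha": [0.0, 0.01]},
    cv=10,
    scoring="f1",
)
grid.fit(X_tr, y_tr)

# Step 3: score the tuned model on the held-out 20%.
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```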
By specifying parameters for model evaluation, such as those outlined above, this study can be repeated independently or expanded upon in future research. This process is vital for enhancing the repeatability and validity of the study results. The model evaluation process assists in identifying the strengths and weaknesses of individual algorithms, aiding in the selection of the most suitable model for predicting student dropout rates using data from the NEAS 2022 dataset, as summarized in Figure 1.
4. Findings and Results
4.1. Descriptive Statistics
Table 3 provides a thorough representation of the demographic and socioeconomic characteristics of all students residing in households that participated in the National Education Accessibility Survey 2022. This dataset provides essential information about the characteristics of households, which is crucial for understanding the specific details that influence educational access and socioeconomic development in the examined area. An exhaustive analysis of each category revealed significant trends and patterns, providing valuable insights for academic discussions and policy development.
The category “Age of Household Head” reveals a significant predominance of young individuals taking on the responsibility of leading households, with 88.57% of heads under the age of 20 years. This demographic phenomenon highlights the early assumption of familial obligations by young people, influenced by factors including economic pressure, family dynamics, and cultural standards.
Moreover, the examination of “Household Size” reveals that a significant majority of households consist of larger family units with five or more individuals, accounting for 85.01% of the households questioned. These findings have significant consequences for how resources are distributed, the quality of housing, and dynamics within families. Careful and detailed interventions are required to tackle related difficulties and promote strong and adaptable family structures.
When examining the “Sex of Household Head”, the data show a roughly equal distribution of male (51.80%) and female (48.20%) household heads. The presence of equal representation of gender in home leadership positions highlights the progress made towards achieving gender equality in decision-making and household management. This reflects changing societal norms and progressive sociopolitical environments.
Examining geographical divisions, the study of “Region” reveals diverse distribution patterns, with Marodijeh standing out as the region with the largest percentage of households (31.92%). These differences highlight variances in population density, economic activity, and infrastructure development across different regions. This calls for policy interventions that focus on the specific needs and requirements of each region.
Similarly, the analysis of “District” highlights the specific demographic patterns in different areas, revealing that Hargeisa has the highest concentration of households, accounting for 31.94% of the total. This highlights the crucial importance of Hargeisa as a central location for population and economic activity in the examined area, necessitating specific actions to meet related infrastructure and socioeconomic needs.
Furthermore, the classification based on “Location Type” revealed a preference for urban living, as 66.01% of households were in urban regions. These findings highlight the impact of urbanization on resource allocation, infrastructure development, and socioeconomic inequalities. This emphasizes the need for strong urban planning methods that provide fair access to resources and facilities.
An analysis of the “Occupation of the Household Head” reveals a wide range of livelihood pursuits, including agricultural activities, government jobs, and self-employment operations. The presence of various types of jobs within families highlights the diversified character of economic activity. This diversity is the result of different skills, resources, and market conditions.
In addition, the categorization of “Type of Housing” provides insight into the predominant housing circumstances, with “Daar no fence” housing being the most common type, accounting for 39.31% of the total. This indicates a high occurrence of informal or temporary housing, highlighting the necessity for changes in housing infrastructure and urban development projects to address housing deficiencies and improve residential living conditions.
An examination of “School Type” revealed common educational enrollment trends, with most pupils (72.16%) attending public schools. This highlights the dependence on educational institutions sponsored by the government and emphasizes the need to strengthen public education systems to guarantee fair access to high-quality education and promote inclusive socio-educational development paths.
To summarize, the table provides important information on the demographic and socioeconomic trends that influence educational accessibility and socioeconomic disparities in the investigated region. These detailed observations are essential for guiding policy decisions founded on evidence, actions that are focused, and academic investigations that aim to promote fair socio-educational development paths and reduce systemic inequalities.
4.2. Magnitude of the School Dropout in Somaliland Based on NEAS Dataset 2022
Figure 2 depicts the extent of school dropout among children of school age in Somaliland as per the NEAS data from 2022.
4.3. Supervised Machine Learning Models
Table 4 presents a comprehensive analysis of various machine learning models used to predict student dropout and completion rates. The models evaluated included logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN). Each model’s performance was assessed using several critical metrics: accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), prevalence, detection rate, detection prevalence, balanced accuracy, and area under the curve (AUC).
The table first delineates the predictions made by each model, distinguishing correctly predicted completions from dropouts. For instance, logistic regression correctly predicted 773 completions and 7 dropouts, and probit regression demonstrated identical results with 773 completions and 7 dropouts. In contrast, naïve Bayes showed 777 completions and 5 dropouts, indicating a slight improvement in correctly identifying the completions. These predictions are critical for understanding how each model performs in real-world scenarios, where accurate identification of both completions and dropouts is essential.
Accuracy measures the proportion of correct predictions (both completions and dropouts) of the total predictions made. The accuracy values for the models ranged from 0.980 for KNN to 0.988 for decision tree and SVM, as shown in the accompanying figure. This indicates that most models are highly accurate, with the decision tree and SVM slightly outperforming the other models. High accuracy is a desirable attribute in predictive models as it directly correlates with the reliability of the predictions made by the model.
The sensitivity, or recall, is the proportion of actual completions correctly identified by the model. All models exhibited high sensitivity, ranging from 0.984 for KNN to 0.993 for decision tree, SVM, and random forest. This consistency indicates that the models are effective in identifying students who will complete their courses, as illustrated in the sensitivity graph. High sensitivity is particularly crucial in educational contexts where accurately identifying students who are likely to succeed can help provide appropriate support to those at risk.
Specificity measures the proportion of actual dropouts that are correctly identified by the model. Specificity values ranged from 0.923 for logistic regression and probit regression to 0.956 for naïve Bayes, decision tree, and SVM. A higher specificity indicates better performance in identifying students who are likely to drop out, as depicted in the specificity figure. This metric is essential for understanding how well the model can discern between students who will drop out and those who will complete their courses, thus aiding targeted intervention strategies.
PPV or precision represents the proportion of correctly predicted completions. Most models had a PPV of approximately 0.993, except for logistic regression and probit regression, both at 0.988. This high PPV signifies that when a model predicts that a student will complete the task, it is very likely to be correct. High precision ensures that resources are not wasted on false positives, thereby enhancing the efficiency of the support systems.
NPV is the proportion of predicted dropouts that are accurate. Here, the values range from 0.896 for KNN to 0.956 for the decision tree and SVM. The lower NPV for KNN indicates that it is less reliable in predicting dropouts compared to the decision tree and SVM, which maintain a high accuracy. Understanding NPV helps evaluate how well the model predicts students who are actually at risk of dropping out, ensuring that necessary interventions are directed appropriately.
Prevalence indicates the proportion of the dataset that comprises actual completions, whereas the detection rate measures the proportion of actual completions detected by the model. All models showed a similar prevalence, around 0.870, and detection rates were slightly lower, reflecting their effectiveness in identifying completions. These metrics aid in understanding the baseline distribution of the dataset and the effectiveness of the models in maintaining this distribution in their predictions.
Balanced accuracy, which is the average of sensitivity and specificity, provides a more comprehensive performance measure. The values range from 0.957 for logistic regression and probit regression to 0.975 for decision tree and SVM, indicating that the latter models perform the best overall. A high balanced accuracy signifies that the model is not biased towards predicting either completions or dropouts, providing a balanced approach to prediction.
The AUC measures the model’s ability to distinguish between classes, with values above 0.970 indicating excellent performance. The AUC values for the models ranged from 0.971 for naïve Bayes to 0.988 for random forest, demonstrating that all models had strong discriminative power, with random forest and SVM being particularly notable. A high AUC indicates that the model has a high capability to correctly classify students in terms of whether they will complete their courses or drop out.
From a detailed analysis, it is clear that decision tree, SVM, and random forest are the top-performing models. They exhibited the highest accuracy, sensitivity, specificity, and AUC values, making them highly reliable for predicting student dropouts and completions. The high balanced accuracy of these models also suggests that they are well-calibrated and robust across different metrics. The detailed performance metrics in the figures underscore the effectiveness of these machine learning models in educational data mining.
For educational institutions aiming to reduce dropout rates, implementing predictive models, such as decision trees, SVM, and random forest, can provide significant insights. These models can help identify at-risk students early, allowing for timely intervention and support systems to improve student retention rates. The high specificity and sensitivity of these models ensure that both completions and dropouts are accurately identified, thereby enhancing the overall effectiveness of predictive analytics in education.
4.4. Features Selection
These features highlight the critical role of socioeconomic factors in influencing student dropout rates in Somaliland. The graphical representation of household income levels suggests that family financial background is a significant predictor of whether a student continues their education or leaves the schooling system prematurely. This finding aligns with the robust body of research that has consistently demonstrated the impact of poverty and resource constraints on educational outcomes, particularly in developing contexts, such as Somaliland.
Figure 3 shows the importance of various features in a logistic regression model, with each bar representing the significance of a feature. Age emerged as the most crucial predictor, followed closely by grade, indicating that these two features played the most significant roles in the model’s outcomes. Sex also shows notable importance, whereas school type, residence, and household size (HHSize) contribute to a lesser extent. Features such as Typehousing, Occupation, District, and Region have minimal to negligible impacts on the model’s predictions, with Region being the least influential. This analysis highlights that demographic and educational factors are the primary drivers of the model predictions.
Figure 3 illustrates the importance of various features in a probit regression model, revealing that age and grade are the most significant predictors, each with an importance score of just above 0.6. Sex also plays a notable role, with a score of approximately 0.25, while household size contributes moderately, with a score of about 0.1. School type, type of housing, and occupation had minimal impact, with scores ranging from approximately 0.05 to just below 0.1. Residence, district, and region had negligible influence, with importance scores close to 0. This analysis underscores that demographic and educational factors are the primary drivers in the model, whereas geographic and housing characteristics have minimal effects.
Figure 3 demonstrates the feature importance of various variables in a random forest model, revealing that “Grade” is overwhelmingly the most influential feature, dwarfing all others such as “School type”, “Region”, “Household size”, “Age”, “Residence”, “District”, “Sex”, “Typehousing”, and “Occupation”. This suggests that the model’s predictive power is primarily derived from the “Grade” feature, with other variables contributing minimally. This heavy reliance on a single feature indicates a need to validate the model to ensure that it is not overfitting and to consider additional feature engineering or transformation to enhance the importance of other variables, align the model’s behavior more closely with domain knowledge, and reduce potential biases.
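Importance scores such as those in Figure 3 can be extracted as follows. This is a sketch on synthetic data with hypothetical feature names; for logistic regression, the absolute standardized coefficient is used as an importance score, which is one common convention rather than necessarily the paper’s exact procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names mirroring the paper's predictors.
names = ["Age", "Grade", "Sex", "Schooltype", "HHSize", "Region"]
X, y = make_classification(n_samples=400, n_features=6, random_state=3)

# Random forest: impurity-based importances, which sum to 1.
rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)
rf_rank = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])

# Logistic regression: |coefficient| on standardized features.
lr = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
lr_rank = sorted(zip(names, np.abs(lr.coef_[0])), key=lambda t: -t[1])

print(rf_rank[0][0], lr_rank[0][0])  # most influential feature per model
```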
The KNN model, as shown in Figure 3, relies predominantly on “Grade”, followed by “Age” and “Schooltype”. The remaining features contribute little to the decision-making process of the model. This suggests that while “Grade” is the primary driver of the model’s predictions, “Age” and “Schooltype” also provide some additional predictive power, albeit to a much lesser extent. As with the random forest model, it is important to validate the KNN model to ensure that the reliance on “Grade” is justified and to consider ways to enhance the significance of other features if necessary.
As with the random forest and KNN models, this strong reliance on a single feature necessitates careful validation to ensure that the model is not overfitting, and that the dependence on “Grade” is justified. In addition, it may be beneficial to explore feature engineering or selection techniques to enhance the importance of other variables and provide a more balanced model.
In conclusion, feature selection analysis across multiple models underscores the dominant role of demographic and educational factors, particularly age and grade, in predicting student dropout rates in Somaliland. While socioeconomic factors such as household income are significant, their impact varies across models. Logistic and probit regressions highlighted age and grade as critical predictors, and sex and household size also played notable roles. However, in models such as naïve Bayes, random forest, KNN, and SVM, grade overwhelmingly influences predictions, indicating a need for careful model validation to avoid overfitting and ensure a balanced consideration of all relevant features. This comprehensive evaluation provides a clear direction for future feature engineering efforts to enhance predictive accuracy and model robustness, ultimately aiding the development of targeted interventions to reduce dropout rates.
4.5. Model Comparison
The generated plot provides a comprehensive comparison of seven supervised machine learning models based on four critical performance metrics: accuracy, sensitivity, F1 score, and AUROC. The support vector machine (SVM) and decision tree models demonstrated superior performance across most of these metrics, with both achieving the highest values for accuracy and F1 score (0.988 and 0.993, respectively). Furthermore, these models exhibited very high sensitivity (0.993), suggesting that they are particularly adept at identifying positive cases. This consistent performance across multiple metrics indicates that the SVM and decision tree models are highly effective and reliable for classification tasks.
Random forest, while also performing admirably in terms of accuracy and sensitivity, distinguishes itself with the highest AUROC (0.988). AUROC is a measure of the model’s ability to distinguish between classes, with a higher value indicating better discriminatory power. This makes RF particularly suitable for applications where distinguishing between positive and negative classes with high precision is crucial. Conversely, the naïve Bayes and K-nearest neighbors (KNN) models exhibited slightly lower AUROC values (0.971 and 0.973, respectively), suggesting a marginally reduced ability to differentiate between classes. Despite this, they still maintain high levels of accuracy and sensitivity, indicating their overall robustness as classifiers.
The visualization effectively illustrates the comparative strengths and weaknesses of each model, offering valuable insights into selecting the most appropriate model based on specific performance criteria, as presented in Figure 4. For instance, if the primary objective is to maximize discriminatory power, the random forest model is the optimal choice because of its superior AUROC. Conversely, if balancing precision and recall is more critical, models such as SVM and decision tree, with their high F1 scores, would be preferable. This nuanced comparison allows researchers and practitioners to make informed decisions tailored to their specific application needs, ensuring that the chosen model aligns with their performance priorities.
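The selection logic described in this subsection (choosing the model that maximizes the metric matching the application’s priority) can be sketched in a few lines. The metric values are those reported above; the dictionary layout and helper function are illustrative rather than part of the study:

```python
# Metric values as reported in Section 4.5; entries the text does not
# report are omitted rather than guessed.
reported = {
    "accuracy":    {"SVM": 0.988, "Decision tree": 0.988},
    "f1":          {"SVM": 0.993, "Decision tree": 0.993},
    "sensitivity": {"SVM": 0.993, "Decision tree": 0.993, "Random forest": 0.993},
    "auroc":       {"Random forest": 0.988, "KNN": 0.973, "Naive Bayes": 0.971},
}

def best_models(metric):
    """Return the model(s) achieving the highest reported value for a metric."""
    scores = reported[metric]
    top = max(scores.values())
    return sorted(m for m, v in scores.items() if v == top)

print(best_models("auroc"))  # ['Random forest'] -> best discriminatory power
print(best_models("f1"))     # ['Decision tree', 'SVM'] -> best precision/recall balance
```

A practitioner prioritizing class discrimination would thus pick random forest, while one prioritizing the precision–recall balance would pick SVM or the decision tree, mirroring the discussion above.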
4.6. Model Evaluation
Assessing classification models using diverse performance indicators is crucial for understanding their efficacy in forecasting student attrition rates. The F1 score is a crucial indicator that combines precision and recall, providing a thorough assessment of a model’s accuracy. The support vector machine (SVM) and decision tree models exhibited outstanding performance in this investigation, with F1 scores of 0.993. This high score demonstrates the competency of these models in correctly predicting both dropouts and students who will continue their education, and highlights their resilience in addressing the intricacies of dropout prediction.
Sensitivity, also known as recall, is a crucial parameter that quantifies the percentage of true dropouts successfully detected by the model. In educational environments, high sensitivity is essential for promptly identifying all at-risk students and providing timely interventions. In this study, the decision tree, SVM, and random forest models demonstrated a notable sensitivity of 0.993, indicating great efficacy in identifying students who are likely to discontinue their education. This high sensitivity permits precise targeting of interventions, which may substantially improve student retention rates. The capacity to identify almost all students at risk of discontinuing their education makes these models exceptionally beneficial for educational authorities aiming to minimize dropout rates.
Precision is a crucial metric that quantifies the ratio of correct positive predictions to all positive predictions generated by the model. High precision prevents the allocation of resources to false positives, that is, students who are not genuinely at risk of dropping out. The logistic and probit regression models produced precision scores of approximately 0.988, indicating highly reliable predictions. This level of precision guarantees that interventions are both effective and efficient, targeting only students who truly require assistance. The F1 score, which captures the balance between precision and recall, validates the performance of these models in real applications.
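As a check on these definitions, precision, recall (sensitivity), and the F1 score can be computed directly from confusion-matrix counts. The counts below are illustrative and not taken from the survey data:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts
    (tp = true positives, fp = false positives, fn = false negatives)."""
    precision = tp / (tp + fp)            # fraction of flagged students truly at risk
    recall = tp / (tp + fn)               # fraction of true dropouts that were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts: 90 true dropouts caught, 10 false alarms, 5 missed.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.947 0.923
```

The harmonic mean penalizes imbalance, which is why a model must do well on both precision and recall to reach the F1 values reported here.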
The AUROC (area under the receiver operating characteristic curve) evaluates the performance of a model across all possible classification thresholds. A higher AUROC signifies superior overall performance, demonstrating the model’s capacity to differentiate accurately between dropouts and non-dropouts. The random forest (RF) model achieved an AUROC of 0.988, demonstrating exceptional discriminatory capability. This high AUROC indicates that the RF model is highly dependable in forecasting student dropout, providing a strong foundation for the development of early intervention programs. Taken together, these metrics provide a comprehensive understanding of the strengths and limitations of each model, ensuring that the chosen model performs well both in theory and in practical real-world applications, effectively reducing student dropout rates.
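AUROC also has an equivalent pairwise reading: it is the probability that a randomly chosen dropout receives a higher risk score than a randomly chosen non-dropout. A minimal sketch of this definition, using illustrative scores rather than the study’s predictions:

```python
def auroc(scores, labels):
    """AUROC via the pairwise definition: the fraction of (positive, negative)
    pairs in which the positive case is ranked higher (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative risk scores: higher should mean more likely to drop out.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(scores, labels))  # 8 of 9 pairs correctly ordered -> 8/9
```

A perfect ranking yields 1.0 and a random one about 0.5, which is why values near 0.99 indicate the strong class separation attributed to the random forest model.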
A thorough assessment of the various machine learning models emphasizes their diverse capabilities in forecasting student attrition rates. Models such as decision trees, support vector machines (SVMs), and random forests perform exceptionally well on metrics such as F1 score, sensitivity, and AUROC, showing their capacity to reliably detect students who are at risk while limiting false positives. Logistic and probit regression models are notable for their excellent precision, which helps guarantee the efficient allocation of resources to students who truly require assistance. The comprehensive performance indicators offered by these models empower educational institutions to make well-informed decisions regarding their implementation, striking a balance between the requirement for precise predictions and practical considerations of intervention tactics. By utilizing these observations, educational institutions can improve their efforts to retain students by offering timely and focused assistance to those who are most likely to leave. This study’s rigorous evaluation framework not only confirms the efficacy of these models but also provides a practical roadmap for their implementation in real-world educational environments, ultimately leading to enhanced student outcomes and decreased dropout rates.
5. Discussion, Conclusions, and Recommendations
5.1. Introduction
The objective of this section is to combine the results obtained from our machine learning models to provide a thorough understanding of student dropout in Somaliland. The process involves synthesizing the knowledge acquired from the predictive analysis, deriving conclusions from it, and offering practical recommendations. This comprehensive approach guarantees that the results of this study are not only academically rigorous but also directly relevant to educational policy and intervention strategies.
5.2. Discussion
This study employed various machine learning techniques, namely logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN), to forecast student attrition rates. Each model offers a distinct perspective on the factors that contribute to dropout. Demographic variables such as age and sex had a substantial impact: younger students, particularly those going through important transitional periods, exhibited elevated dropout rates, indicating the need for interventions tailored to their age groups. Disparities in dropout rates between the sexes were also observed, emphasizing the necessity of measures that account for sex-specific considerations.
As an educational indicator, grade repeatedly proved to be the most influential predictor in several models, including naïve Bayes and random forest. This implies that the particular difficulties encountered at each grade level may be a significant factor in students leaving school before completion. Socioeconomic factors, such as household income and household size, also had an influence. Lower household income was linked to higher rates of students leaving school before completing their education, showing that economic obstacles hinder access to education. Similarly, larger household sizes were related to higher dropout rates, suggesting that larger families have limited access to the resources necessary for education [38].
The random forest model consistently outperformed the other algorithms in discriminatory power, achieving the highest AUROC score alongside strong accuracy in predicting student dropout. This performance highlights random forest as a highly accurate and reliable model for predicting student attrition [23].
The performance of the models varied, with the SVM and decision tree models achieving the highest accuracy and F1 scores, indicating that they are dependable in predicting student dropout. The random forest model also performed exceptionally well, achieving the highest AUROC score, indicating an excellent ability to discriminate between classes. Although their AUROC values were slightly lower, the naïve Bayes and KNN models demonstrated high accuracy and sensitivity, further supporting their usefulness in classification applications. The prevalence of the “Grade” attribute in several models raises concerns about overfitting, thus requiring meticulous validation [19]. This study highlights the importance of a well-balanced feature engineering approach to enhance the generalization ability of the models.
In addressing the potential concern of overfitting in our study, we took several precautionary measures to ensure the reliability and generalizability of our predictive models for school dropout in Somaliland. To mitigate overfitting, we carefully evaluated our models using a variety of performance metrics on both the training and validation datasets. We also employed hyperparameter tuning, model complexity constraints, and ensemble techniques. By fine-tuning hyperparameters, such as the maximum depth of the trees in random forest or the number of neighbors in KNN, we aimed to strike a balance between model complexity and performance. We further restricted the complexity of the SVM models by optimizing the regularization parameter and selecting appropriate kernels. For the logistic, probit, and naïve Bayes models, we focused on feature selection and model simplification to prevent overfitting and improve predictive accuracy. These strategies collectively promoted model simplicity and robustness, guarding against overfitting while ensuring the models’ effectiveness in predicting school dropout outcomes.
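The validation discipline described above relies on holding data out in rotation. A plain k-fold splitter illustrates the idea; this is a generic sketch rather than the study’s actual pipeline, which may have used library routines for splitting and tuning:

```python
def kfold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation:
    each fold serves once as the validation set while the rest train."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        val = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, val

# Each sample appears in exactly one validation fold across the k rounds,
# so every observation is used for both fitting and validation.
splits = list(kfold_indices(n_samples=10, k=5))
all_val = sorted(i for _, val in splits for i in val)
print(len(splits), all_val)  # 5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Hyperparameters (tree depth, number of neighbors, SVM regularization) would then be chosen by the value that maximizes the average validation-fold metric, never the training-set metric.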
5.3. Conclusions
The predictive analysis indicated that demographic and educational parameters play a crucial role in understanding student dropout in Somaliland. Although socioeconomic determinants have an impact, their influence differs across models, highlighting the need for model-specific validation. The findings indicate that age and grade play crucial roles in predicting dropout rates, highlighting the need for targeted interventions tailored to specific age groups and grade levels. SVM, decision tree, and random forest models offer reliable frameworks for predicting dropout; however, caution is needed to handle potential overfitting, particularly with regard to the “Grade” attribute. By integrating insights from several models, a comprehensive understanding of the factors contributing to dropout can be obtained and used to develop more efficient intervention strategies.
5.4. Recommendations
Thorough model validation is essential to prevent overfitting. Cross-validation techniques and regularization approaches should be incorporated, especially in models that depend heavily on the “Grade” feature. Subsequent research should investigate sophisticated feature engineering methods, such as polynomial features, interaction terms, and domain-specific transformations, to increase the predictive contribution of underrepresented features. This may entail constructing novel features that capture intricate relationships within the data.
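The polynomial and interaction terms suggested above can be generated mechanically from the base features. The sketch below uses hypothetical feature names, not variables from the survey:

```python
from itertools import combinations_with_replacement

def degree2_features(row):
    """Augment a feature dict with all degree-2 products:
    squares (a*a) and pairwise interaction terms (a*b)."""
    out = dict(row)
    for a, b in combinations_with_replacement(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

# Hypothetical numeric features for one student record.
row = {"age": 14, "household_size": 6}
expanded = degree2_features(row)
print(sorted(expanded))
```

Interaction terms such as the hypothetical `age*household_size` let otherwise weak features contribute jointly, which is one way to rebalance models currently dominated by a single predictor.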
Focused interventions are needed to assist younger students, who are more likely to drop out, and particular support systems should be implemented to address the specific difficulties faced by students at critical grade levels. Gender-sensitive interventions are crucial to guarantee that both males and females receive sufficient assistance to remain enrolled in educational institutions. Policymakers should mitigate socioeconomic gaps by allocating financial assistance and resources to low-income households and larger families. A fair allocation of educational resources across regions is essential to reduce regional differences in dropout rates. Policies should offer further assistance to students from diverse backgrounds, including those who face distinctive socioeconomic obstacles.
Regular, ongoing monitoring and updating of the prediction models are crucial. Consistently incorporating fresh data through regular updates would facilitate adaptation to evolving educational trends and ensure the models’ continued accuracy and relevance. Implementing a feedback loop that continuously monitors the success of interventions and uses this information to refine the predictive models and intervention tactics would significantly enhance the effectiveness of these efforts. Implementing these suggestions could help educational policymakers and practitioners in Somaliland create more efficient approaches to decreasing student attrition, thereby enhancing overall educational achievement and fostering socioeconomic progress in the region.