1. Background of the Study
Somaliland, a self-declared independent state in the Horn of Africa, faces multifaceted challenges in its educational landscape. Despite commendable efforts to expand access to education, dropout rates remain a pressing concern, posing a significant barrier to the country’s development goals. Factors such as poverty, inadequate infrastructure, cultural norms, and limited resources contribute to this challenge, underscoring the need for targeted, data-driven interventions.
According to [1], dropout refers to children who leave the educational system before completing their academic year. This means that they do not receive the final mark for that year or an official document proving that they finished that specific year of primary or secondary school.
Dropout, in the context of education, refers to the phenomenon in which students leave school before completing their studies. The dropout rate is a crucial metric that reflects the number or percentage of students who disengage prematurely from the educational system. Various studies have highlighted the prevalence of dropouts across different educational levels, including primary and secondary education [2].
Some studies have highlighted the complexity of defining dropout, especially in the context of online education, where factors such as work and family constraints can influence students’ decisions to discontinue their studies [3].
The student dropout rate is a critical issue affecting educational systems worldwide, with significant implications for individual students, educational institutions, and society. Despite efforts to address this issue, there remains a need to comprehensively analyze the determinants of student dropout rates to develop effective interventions and policies. The National Education Accessibility Survey 2022 provides a valuable dataset for investigating the factors that influence student dropout rates. By examining this dataset, this study aimed to identify and understand the key determinants contributing to student dropout rates, thereby informing targeted strategies to improve student retention and educational outcomes.
2. Significance and Novelty of the Study
This study marks a significant milestone in Somaliland, being the first to investigate student dropout rates using data from the 2022 National Education Accessibility Survey conducted in collaboration with the Ministry of Education. By employing a diverse set of machine learning models such as logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN), this research delves into the complexities of why students leave school early. It reveals how factors like age, household income, and family size play a role in dropout rates. Offering innovative strategies to address these challenges, this study is a pivotal contribution to enhancing educational outcomes in Somaliland, providing a nuanced understanding for policymakers, educators, and stakeholders.
For the government, this study offers invaluable insights for policymaking and resource allocation in the education sector, enabling targeted interventions to reduce dropout rates and enhance educational outcomes. International NGOs can leverage these findings to develop tailored programs and initiatives that address the identified risk factors, fostering collaboration with local stakeholders for impactful interventions. Researchers and academicians are presented with a pioneering research framework that not only advances the understanding of educational challenges in Somaliland but also showcases the application of diverse machine learning models in educational data analysis, offering methodological insights for future studies.
Parents and community members benefit from increased awareness about the complexities of student dropout, empowering them to actively participate in creating a supportive educational environment for children. Applied scientists and data science practitioners find a valuable case study in this research, highlighting the significance of utilizing advanced analytical techniques to uncover actionable insights from educational data. By proposing innovative strategies to tackle dropout challenges, this study lays the groundwork for sustainable improvements in educational outcomes, ultimately contributing to a more inclusive and successful educational landscape in Somaliland.
3. Methodology
3.1. Research Design
This study aimed to analyze student dropout rates using data from the National Education Accessibility Survey 2022, focusing on socioeconomic, demographic, and educational factors.
This study used machine learning algorithms to identify significant predictors and explore variations across different demographic and socioeconomic groups. Subgroup analyses were conducted to explore potential variations in dropout determinants across demographics. The research aims to provide insights into the complex factors driving student dropout rates and inform targeted interventions to promote inclusive schooling.
3.2. Data Source
The National Education Accessibility Survey reached 1957 of the 2000 sampled households across the selected districts, as shown in Table 2.
3.3. Study Variables
3.3.1. Outcome Variable
In this study, the outcome variable, defined based on [1], pertains to school-age children in Somaliland who disengage from the educational system before successfully completing the academic year. A child is categorized as a “dropout” if he/she does not attain the final mark for the academic year or lacks an official document certifying the completion of that specific year within primary or secondary school. Consequently, for the purposes of this research, the outcome variable is binary: assigned a value of 1 to denote a child who has dropped out of school and 0 to indicate a child who has not.
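This binary coding can be sketched in plain Python as follows; the record structure and field names are hypothetical illustrations, not the actual NEAS 2022 schema:

```python
# Hypothetical child records; real NEAS 2022 field names may differ.
children = [
    {"child_id": 101, "completed_year": True},   # has final mark / certificate
    {"child_id": 102, "completed_year": False},  # left before the year ended
]

# Binary outcome per the definition in [1]:
# 1 = dropped out (no final mark or completion certificate), 0 = otherwise.
for c in children:
    c["dropout"] = 0 if c["completed_year"] else 1
```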
3.3.2. Predictor Variables
The explanatory variables were classified into two categories: socioeconomic factors such as household leader educational level, household occupational status, region of residence, district of residence, and household wealth status, and demographic factors, including sex of children, number of children in the household, school time, age of the child, disability, and school distance.
3.4. Data Preprocessing
The raw data obtained from the 2022 NEAS dataset underwent comprehensive preprocessing to ensure its readiness for analysis. This phase involved multiple steps, including data cleaning, handling of missing values, and variable transformation, as necessary. These processes were essential in preparing a reliable and accurate dataset for subsequent analysis.
3.5. Data Cleaning
Data cleaning was a vital step to ensure the dataset’s quality and integrity. In this study, a meticulous data cleaning process was conducted to identify and correct errors, inconsistencies, and outliers within the data. This process included the removal of duplicate entries, correction of formatting issues, and addressing data entry errors. By implementing these steps, a high-quality dataset was created, which served as the foundation for all further analyses.
3.6. Missing Value Imputation
Handling missing data is crucial in preserving the dataset’s integrity and ensuring accurate analyses. In this study, missing value imputation was carried out iteratively to achieve complete data for all variables. This rigorous approach minimized the potential bias and errors associated with missing data, enhancing the reliability of the results. Appropriate imputation techniques were employed, resulting in a robust and comprehensive dataset, suitable for in-depth analysis and interpretation.
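The study does not name its specific imputation routine; as one hedged illustration of iterative, model-based imputation of the kind described, scikit-learn’s `IterativeImputer` regresses each feature with missing values on the others until the estimates stabilize (shown here on a toy matrix, not the NEAS data):

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Toy matrix with missing entries; columns could stand for age,
# household size, and an income measure (illustrative only).
X = np.array([
    [7.0,  5.0,    3.0],
    [10.0, np.nan, 4.0],
    [12.0, 8.0,    np.nan],
    [9.0,  6.0,    3.0],
])

# Round-robin imputation: each incomplete feature is predicted from
# the others, repeated until convergence or max_iter is reached.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_complete = imputer.fit_transform(X)
```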
3.7. Proposed ML Models
In this study, we employed seven supervised machine learning algorithms to predict student dropout rates using data from the National Education Accessibility Survey 2022: decision tree, random forest, support vector machine (SVM), naïve Bayes, K-nearest neighbors (KNN), logistic regression, and probit regression.
3.7.1. Decision Tree
A decision tree is a non-parametric supervised learning algorithm that is adept at classification and regression tasks. It iteratively partitions the dataset based on the input feature values and makes decisions along the branches of the tree. In classification, it assigns data points to distinct classes, while in regression, it predicts the target variable’s value. Renowned for simplicity and interpretability, decision trees offer insights into feature importance and decision processes. However, they may be prone to overfitting, necessitating techniques like pruning for improved generalization [31].
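The pruning technique mentioned above can be sketched with scikit-learn’s cost-complexity pruning on synthetic data (an illustration only; the hyperparameter value and data are not the study’s settings):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the survey features (the NEAS data are not public).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

# ccp_alpha > 0 applies cost-complexity pruning, trading a little training
# fit for better generalization, as the text suggests.
tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_tr, y_tr)
print(round(tree.score(X_te, y_te), 3))
```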
3.7.2. Random Forest
Random forest is an ensemble learning method celebrated for its resilience and adaptability. By generating numerous decision trees during training, each on a distinct data subset, it mitigates overfitting and enhances predictive accuracy. In classification tasks, the final prediction stems from the mode of classes predicted by individual trees, while in regression, it is derived from their mean prediction. This ensemble approach ensures robustness and stability across diverse datasets. Moreover, random forest’s capacity to handle high-dimensional data while maintaining interpretability renders it a widely favored choice in various domains [32].
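The ensemble idea can be sketched as follows, again on synthetic data (the number of trees is illustrative, not the study’s configuration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Each of the 200 trees is grown on a bootstrap sample with a random feature
# subset at each split; the forest's prediction is the trees' majority vote.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
scores = cross_val_score(forest, X, y, cv=5)
print(round(scores.mean(), 3))
```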
3.7.3. Support Vector Machine (SVM)
Support vector machine (SVM) is a powerful supervised learning algorithm used for both classification and regression tasks. It constructs one or more hyperplanes in a high-dimensional space to separate data into distinct classes. The optimal hyperplane maximizes the margin between classes, improving model generalizability and reducing classification errors. For non-linearly separable data, SVM employs kernel functions to transform the input space, enabling effective separation in a higher-dimensional space. This approach enhances the algorithm’s flexibility in handling complex datasets [33].
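The effect of the kernel trick can be demonstrated on a toy dataset that is not linearly separable in the input space (an illustration, not the study’s configuration):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles: no straight line separates the two classes.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit mapping to a higher-dim space

# The RBF kernel separates the circles; the linear kernel cannot.
print(round(linear.score(X, y), 2), round(rbf.score(X, y), 2))
```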
3.7.4. Naïve Bayes
Naïve Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from a finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naïve Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable [34].
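Formally, this conditional-independence assumption lets the class posterior factor into per-feature likelihoods:

```latex
P(C \mid x_1, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The predicted class is the one maximizing this product, which is why only univariate likelihoods per class need to be estimated.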
3.7.5. K-Nearest Neighbors (KNN)
K-nearest neighbors (KNN) is a non-parametric algorithm used for classification and regression tasks. It classifies a data point by identifying the majority class among its k-nearest neighbors in the feature space. For regression, the prediction is the average value of the neighbors. KNN is simple and intuitive because it requires no prior model training and assumes that similar data points are near each other. However, it can be computationally intensive with large datasets and is sensitive to the choice of k and feature scaling, affecting its performance and accuracy [35].
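The sensitivity to feature scaling mentioned above can be illustrated as follows; the inflated first feature is contrived for demonstration, and the data are synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=6, random_state=1)
X[:, 0] *= 1000  # one feature on a much larger scale dominates raw distances

raw = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    X, y, cv=5,
).mean()
print(round(raw, 3), round(scaled, 3))
```

Standardizing features before computing distances typically restores the influence of the smaller-scale features.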
3.7.6. Logistic Regression
Logistic regression is a statistical method used in binary classification problems. It predicts the probability of an event’s occurrence by fitting the data to a logistic curve, which is an S-shaped function. This function outputs values between zero and one, representing the probability of an event. The model uses a linear combination of input features transformed by a logistic function to determine the probabilities. Logistic regression is valued for its interpretability, with coefficients indicating the influence of each feature. It works well with linearly separable data but may require regularization to prevent overfitting in complex scenarios [36].
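The S-shaped link described above is the logistic (sigmoid) function applied to a linear combination of the features:

```latex
p(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x})}}
```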
3.7.7. Probit Regression
Probit regression is a statistical method used for binary dependent variables similar to logistic regression. It predicts event probabilities by using a normal cumulative distribution function (CDF) instead of a logistic CDF. This model links input features to the probability of an outcome assuming a normal distribution of the latent variable. Probit regression is useful when the assumption of normally distributed errors is correct. It is often applied in fields such as econometrics and biometrics as an alternative to logistic regression [37].
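The only difference from the logistic model is the link function: the same linear predictor is passed through the standard normal CDF instead of the sigmoid:

```latex
p(y = 1 \mid \mathbf{x}) = \Phi\!\left(\beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}\right),
\qquad
\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}\, dt
```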
3.8. Model Comparison and Evaluation
In this section, we compare the performance of the seven machine learning algorithms introduced earlier based on key evaluation metrics commonly used in assessing classification models. These metrics include accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve (AUC-ROC). By comparing these metrics, our aim is to identify the most effective model for predicting student dropout rates.
Definitions of Notations:
True Positive (TP): The number of instances correctly predicted as positive.
True Negative (TN): The number of instances correctly predicted as negative.
False Positive (FP): The number of instances incorrectly predicted as positive.
False Negative (FN): The number of instances incorrectly predicted as negative.
Metrics:
Accuracy: The ratio of correctly predicted instances to the total instances.
Precision: The ratio of correctly predicted positive observations to the total predicted positives.
Recall: The ratio of correctly predicted positive observations to all observations in the actual class.
F1 score: Harmonic mean of precision and recall.
AUC-ROC: Area under the ROC, which plots the true positive rate (TPR) against the false positive rate (FPR).
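These definitions translate directly into formulas over the confusion-matrix counts. The counts below are illustrative only, not the study’s results:

```python
# Hypothetical confusion-matrix counts, for illustration only.
TP, TN, FP, FN = 90, 80, 10, 20

accuracy = (TP + TN) / (TP + TN + FP + FN)   # 170 / 200 = 0.85
precision = TP / (TP + FP)                   # 90 / 100 = 0.90
recall = TP / (TP + FN)                      # sensitivity / true positive rate
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
```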
The performance of each model was evaluated using these metrics on a test dataset. The model with the highest scores across these metrics is considered to be the best for predicting student dropout rates.
Model evaluation is a crucial step in assessing the performance of machine learning models. This process involves testing the models on a separate validation dataset that was not utilized during the training phase. The evaluation process includes specific steps to ensure clarity and repeatability:
Train/Test Split: The dataset was divided into 80% training data and 20% testing data to maintain consistency and transparency in the evaluation process.
Cross-validation: In this study, K-fold cross-validation (k = 10) was adopted to enhance robustness and generalizability of the model.
Evaluation Metrics: We calculate the evaluation metrics, such as accuracy, precision, recall, F1 score, and AUC-ROC, for each model, as described in the Model Comparison subsection.
Confusion Matrix: We ensured a detailed breakdown of the model’s predictions by generating a confusion matrix.
ROC: The ROC curve for each model was plotted to visualize its performance across various threshold values.
Hyperparameter Tuning: The hyperparameters of each model were tuned using methods like grid search or random search to discover the optimal parameters that maximize performance.
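The protocol above (80/20 split, 10-fold cross-validation, grid search) can be sketched with scikit-learn as follows; the data are synthetic and the parameter grid is illustrative, not the study’s exact settings:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=7)

# Step 1: 80/20 train/test split.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=7)

# Steps 2 and 6: 10-fold CV inside a small grid search on the training part.
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=7),
    param_grid={"max_depth": [3, 5, None], "ccp_alpha": [0.0, 0.01]},
    cv=10,
    scoring="f1",
)
grid.fit(X_tr, y_tr)

# Step 3: score the tuned model on the held-out 20%.
print(grid.best_params_, round(grid.score(X_te, y_te), 3))
```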
By specifying parameters for model evaluation, such as those outlined above, this study can be repeated independently or expanded upon in future research. This process is vital for enhancing the repeatability and validity of the study results. The model evaluation process assists in identifying the strengths and weaknesses of individual algorithms, aiding in the selection of the most suitable model for predicting student dropout rates using data from the NEAS 2022 dataset, as summarized in Figure 1.
4. Findings and Results
4.1. Descriptive Statistics
Table 3 provides a thorough representation of the demographic and socioeconomic characteristics of all students residing in households that participated in the National Education Accessibility Survey 2022. This dataset provides essential information about the characteristics of households, which is crucial for understanding the specific details that influence educational access and socioeconomic development in the examined area. An exhaustive analysis of each category revealed significant trends and patterns, providing valuable insights for academic discussions and policy development.
The category “Age of Household Head” reveals a significant predominance of young individuals taking on the responsibility of leading households, with 88.57% of heads under the age of 20 years. This demographic phenomenon highlights the early assumption of familial obligations by young people, influenced by factors including economic pressure, family dynamics, and cultural standards.
Moreover, the examination of “Household Size” reveals that a significant majority of households consist of larger family units with five or more individuals, accounting for 85.01% of the households questioned. These findings have significant consequences for how resources are distributed, the quality of housing, and dynamics within families. Careful and detailed interventions are required to tackle related difficulties and promote strong and adaptable family structures.
When examining the “Sex of Household Head”, the data show a roughly equal distribution of male (51.80%) and female (48.20%) household heads. The presence of equal representation of gender in home leadership positions highlights the progress made towards achieving gender equality in decision-making and household management. This reflects changing societal norms and progressive sociopolitical environments.
Examining geographical divisions, the study of “Region” reveals diverse distribution patterns, with Marodijeh standing out as the region with the largest percentage of households (31.92%). These differences highlight variances in population density, economic activity, and infrastructure development across different regions. This calls for policy interventions that focus on the specific needs and requirements of each region.
Similarly, the analysis of “District” highlights the specific demographic patterns in different areas, revealing that Hargeisa has the highest concentration of households, accounting for 31.94% of the total. This highlights the crucial importance of Hargeisa as a central location for population and economic activity in the examined area, necessitating specific actions to meet related infrastructure and socioeconomic needs.
Furthermore, the classification based on “Location Type” revealed a preference for urban living, as 66.01% of households were in urban regions. These findings highlight the impact of urbanization on resource allocation, infrastructure development, and socioeconomic inequalities. This emphasizes the need for strong urban planning methods that provide fair access to resources and facilities.
An analysis of the “Occupation of the Household Head” reveals a wide range of livelihood pursuits, including agricultural activities, government jobs, and self-employment operations. The presence of various types of jobs within families highlights the diversified character of economic activity. This diversity is the result of different skills, resources, and market conditions.
In addition, the categorization of “Type of Housing” provides insight into the predominant housing circumstances, with “Daar no fence” housing being the most common type, accounting for 39.31% of the total. This indicates a high occurrence of informal or temporary housing, highlighting the necessity for changes in housing infrastructure and urban development projects to address housing deficiencies and improve residential living conditions.
An examination of “School Type” revealed common educational enrollment trends, with most pupils (72.16%) attending public schools. This highlights the dependence on educational institutions sponsored by the government and emphasizes the need to strengthen public education systems to guarantee fair access to high-quality education and promote inclusive socio-educational development paths.
To summarize, the table provides important information on the demographic and socioeconomic trends that influence educational accessibility and socioeconomic disparities in the investigated region. These detailed observations are essential for guiding policy decisions founded on evidence, actions that are focused, and academic investigations that aim to promote fair socio-educational development paths and reduce systemic inequalities.
4.2. Magnitude of the School Dropout in Somaliland Based on NEAS Dataset 2022
Figure 2 depicts the extent of school dropout among children of school age in Somaliland as per the NEAS data from 2022.
4.3. Supervised Machine Learning Models
Table 4 presents a comprehensive analysis of various machine learning models used to predict student dropout and completion rates. The models evaluated included logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN). Each model’s performance was assessed using several critical metrics: accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), prevalence, detection rate, detection prevalence, balanced accuracy, and area under the curve (AUC).
The table first delineates the predictions made by each model, distinguishing correctly predicted completions from dropouts. For instance, logistic regression correctly predicted 773 completions and 7 dropouts, and probit regression demonstrated identical results with 773 completions and 7 dropouts. In contrast, naïve Bayes showed 777 completions and 5 dropouts, indicating a slight improvement in correctly identifying the completions. These predictions are critical for understanding how each model performs in real-world scenarios, where accurate identification of both completions and dropouts is essential.
Accuracy measures the proportion of correct predictions (both completions and dropouts) of the total predictions made. The accuracy values for the models ranged from 0.980 for KNN to 0.988 for decision tree and SVM, as shown in the accompanying figure. This indicates that most models are highly accurate, with the decision tree and SVM slightly outperforming the other models. High accuracy is a desirable attribute in predictive models as it directly correlates with the reliability of the predictions made by the model.
The sensitivity, or recall, is the proportion of actual completions correctly identified by the model. All models exhibited high sensitivity, ranging from 0.984 for KNN to 0.993 for decision tree, SVM, and random forest. This consistency indicates that the models are effective in identifying students who will complete their courses, as illustrated in the sensitivity graph. High sensitivity is particularly crucial in educational contexts where accurately identifying students who are likely to succeed can help provide appropriate support to those at risk.
Specificity measures the proportion of actual dropouts that are correctly identified by the model. Specificity values ranged from 0.923 for logistic regression and probit regression to 0.956 for naïve Bayes, decision tree, and SVM. A higher specificity indicates better performance in identifying students who are likely to drop out, as depicted in the specificity figure. This metric is essential for understanding how well the model can discern between students who will drop out and those who will complete their courses, thus aiding targeted intervention strategies.
PPV or precision represents the proportion of correctly predicted completions. Most models had a PPV of approximately 0.993, except for logistic regression and probit regression, both at 0.988. This high PPV signifies that when a model predicts that a student will complete the task, it is very likely to be correct. High precision ensures that resources are not wasted on false positives, thereby enhancing the efficiency of the support systems.
NPV is the proportion of predicted dropouts that are accurate. Here, the values range from 0.896 for KNN to 0.956 for the decision tree and SVM. The lower NPV for KNN indicates that it is less reliable in predicting dropouts compared to the decision tree and SVM, which maintain a high accuracy. Understanding NPV helps evaluate how well the model predicts students who are actually at risk of dropping out, ensuring that necessary interventions are directed appropriately.
Prevalence indicates the proportion of the dataset that comprises actual completions, whereas the detection rate measures the proportion of actual completions detected by the model. All models showed a similar prevalence, around 0.870, and detection rates were slightly lower, reflecting their effectiveness in identifying completions. These metrics aid in understanding the baseline distribution of the dataset and the effectiveness of the models in maintaining this distribution in their predictions.
Balanced accuracy, which is the average of sensitivity and specificity, provides a more comprehensive performance measure. The values range from 0.957 for logistic regression and probit regression to 0.975 for decision tree and SVM, indicating that the latter models perform the best overall. A high balanced accuracy signifies that the model is not biased towards predicting either completions or dropouts, providing a balanced approach to prediction.
The AUC measures the model’s ability to distinguish between classes, with values above 0.970 indicating excellent performance. The AUC values for the models ranged from 0.971 for naïve Bayes to 0.988 for random forest, demonstrating that all models had strong discriminative power, with random forest and SVM being particularly notable. A high AUC indicates that the model has a high capability to correctly classify students in terms of whether they will complete their courses or drop out.
From a detailed analysis, it is clear that decision tree, SVM, and random forest are the top-performing models. They exhibited the highest accuracy, sensitivity, specificity, and AUC values, making them highly reliable for predicting student dropouts and completions. The high balanced accuracy of these models also suggests that they are well-calibrated and robust across different metrics. The detailed performance metrics in the figures underscore the effectiveness of these machine learning models in educational data mining.
For educational institutions aiming to reduce dropout rates, implementing predictive models, such as decision trees, SVM, and random forest, can provide significant insights. These models can help identify at-risk students early, allowing for timely intervention and support systems to improve student retention rates. The high specificity and sensitivity of these models ensure that both completions and dropouts are accurately identified, thereby enhancing the overall effectiveness of predictive analytics in education.
4.4. Features Selection
These features highlight the critical role of socioeconomic factors in influencing student dropout rates in Somaliland. The graphical representation of household income levels suggests that family financial background is a significant predictor of whether a student continues their education or leaves the schooling system prematurely. This finding aligns with the robust body of research that has consistently demonstrated the impact of poverty and resource constraints on educational outcomes, particularly in developing contexts, such as Somaliland.
Figure 3 shows the importance of various features in a logistic regression model, with each bar representing the significance of a feature. Age emerged as the most crucial predictor, followed closely by grade, indicating that these two features played the most significant roles in the model’s outcomes. Sex also shows notable importance, whereas school type, residence, and household size (HHSize) contribute to a lesser extent. Features such as Typehousing, Occupation, District, and Region have minimal to negligible impacts on the model’s predictions, with Region being the least influential. This analysis highlights that demographic and educational factors are the primary drivers of the model predictions.
Figure 3 illustrates the importance of various features in a probit regression model, revealing that age and grade are the most significant predictors, each with an importance score of just above 0.6. Sex also plays a notable role, with a score of approximately 0.25, while household size contributes moderately, with a score of about 0.1. School type, type of housing, and occupation had minimal impact, with scores ranging from approximately 0.05 to just below 0.1. Residence, district, and region had negligible influence, with importance scores close to 0. This analysis underscores that demographic and educational factors are the primary drivers in the model, whereas geographic and housing characteristics have minimal effects.
Figure 3 demonstrates the feature importance of various variables in a random forest model, revealing that “Grade” is overwhelmingly the most influential feature, dwarfing all others such as “School type”, “Region”, “Household size”, “Age”, “Residence”, “District”, “Sex”, “Typehousing”, and “Occupation”. This suggests that the model’s predictive power is primarily derived from the “Grade” feature, with other variables contributing minimally. This heavy reliance on a single feature indicates a need to validate the model to ensure that it is not overfitting and to consider additional feature engineering or transformation to enhance the importance of other variables, align the model’s behavior more closely with domain knowledge, and reduce potential biases.
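Importance scores such as those in Figure 3 can be extracted as follows. This is a sketch on synthetic data with hypothetical feature names; for logistic regression, the absolute standardized coefficient is used as an importance score, which is one common convention rather than necessarily the paper’s exact procedure:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical feature names mirroring the paper's predictors.
names = ["Age", "Grade", "Sex", "Schooltype", "HHSize", "Region"]
X, y = make_classification(n_samples=400, n_features=6, random_state=3)

# Random forest: impurity-based importances, which sum to 1.
rf = RandomForestClassifier(n_estimators=100, random_state=3).fit(X, y)
rf_rank = sorted(zip(names, rf.feature_importances_), key=lambda t: -t[1])

# Logistic regression: |coefficient| on standardized features.
lr = LogisticRegression().fit(StandardScaler().fit_transform(X), y)
lr_rank = sorted(zip(names, np.abs(lr.coef_[0])), key=lambda t: -t[1])

print(rf_rank[0][0], lr_rank[0][0])  # most influential feature per model
```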
The KNN model, as shown in Figure 3, relies predominantly on “Grade”, followed by “Age” and “Schooltype”. The remaining features contribute little to the decision-making process of the model. This suggests that while “Grade” is the primary driver of the model’s predictions, “Age” and “Schooltype” also provide some additional predictive power, albeit to a much lesser extent. As with the random forest model, it is important to validate the KNN model to ensure that the reliance on “Grade” is justified and to consider ways to enhance the significance of other features if necessary.
As with the random forest and KNN models, this strong reliance on a single feature necessitates careful validation to ensure that the model is not overfitting, and that the dependence on “Grade” is justified. In addition, it may be beneficial to explore feature engineering or selection techniques to enhance the importance of other variables and provide a more balanced model.
In conclusion, feature selection analysis across multiple models underscores the dominant role of demographic and educational factors, particularly age and grade, in predicting student dropout rates in Somaliland. While socioeconomic factors such as household income are significant, their impact varies across models. Logistic and probit regressions highlighted age and grade as critical predictors, and sex and household size also played notable roles. However, in models such as naïve Bayes, random forest, KNN, and SVM, grade overwhelmingly influences predictions, indicating a need for careful model validation to avoid overfitting and ensure a balanced consideration of all relevant features. This comprehensive evaluation provides a clear direction for future feature engineering efforts to enhance predictive accuracy and model robustness, ultimately aiding the development of targeted interventions to reduce dropout rates.
4.5. Model Comparison
The generated plot provides a comprehensive comparison of seven supervised machine learning models based on four critical performance metrics: accuracy, sensitivity, F1 score, and AUROC. The support vector machine (SVM) and decision tree models demonstrated superior performance across most of these metrics, with both achieving the highest values for accuracy and F1 score (0.988 and 0.993, respectively). Furthermore, these models exhibited very high sensitivity (0.993), suggesting that they are particularly adept at identifying positive cases. This consistent performance across multiple metrics indicates that the SVM and decision tree models are highly effective and reliable for classification tasks.
Random forest, while also performing admirably in terms of accuracy and sensitivity, distinguishes itself with the highest AUROC (0.988). AUROC is a measure of the model’s ability to distinguish between classes, with a higher value indicating better discriminatory power. This makes RF particularly suitable for applications where distinguishing between positive and negative classes with high precision is crucial. Conversely, the naïve Bayes and K-nearest neighbors (KNN) models exhibited slightly lower AUROC values (0.971 and 0.973, respectively), suggesting a marginally reduced ability to differentiate between classes. Despite this, they still maintain high levels of accuracy and sensitivity, indicating their overall robustness as classifiers.
The visualization effectively illustrates the comparative strengths and weaknesses of each model, offering valuable insights into selecting the most appropriate model based on specific performance criteria, as presented in Figure 4. For instance, if the primary objective is to maximize discriminatory power, the random forest model is the optimal choice because of its superior AUROC. Conversely, if balancing precision and recall is more critical, models such as SVM and decision tree, with their high F1 scores, would be preferable. This nuanced comparison allows researchers and practitioners to make informed decisions tailored to their specific application needs, ensuring that the chosen model aligns with their performance priorities.
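The selection logic described in this subsection (choosing the model that maximizes the metric matching the application’s priority) can be sketched in a few lines. The metric values are those reported above; the dictionary layout and helper function are illustrative rather than part of the study:

```python
# Metric values as reported in Section 4.5; entries the text does not
# report are omitted rather than guessed.
reported = {
    "accuracy":    {"SVM": 0.988, "Decision tree": 0.988},
    "f1":          {"SVM": 0.993, "Decision tree": 0.993},
    "sensitivity": {"SVM": 0.993, "Decision tree": 0.993, "Random forest": 0.993},
    "auroc":       {"Random forest": 0.988, "KNN": 0.973, "Naive Bayes": 0.971},
}

def best_models(metric):
    """Return the model(s) achieving the highest reported value for a metric."""
    scores = reported[metric]
    top = max(scores.values())
    return sorted(m for m, v in scores.items() if v == top)

print(best_models("auroc"))  # ['Random forest'] -> best discriminatory power
print(best_models("f1"))     # ['Decision tree', 'SVM'] -> best precision/recall balance
```

A practitioner prioritizing class discrimination would thus pick random forest, while one prioritizing the precision–recall balance would pick SVM or the decision tree, mirroring the discussion above.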
4.6. Model Evaluation
Assessing classification models using diverse performance indicators is crucial for understanding their efficacy in forecasting student attrition rates. The F1 score is a crucial indicator that combines precision and recall, providing a thorough assessment of a model’s accuracy. The support vector machine (SVM) and decision tree models exhibited outstanding performance in this investigation, with F1 scores of 0.993. This high score demonstrates the competency of these models in correctly predicting both dropouts and students who will continue their education, and highlights their resilience in addressing the intricacies of dropout prediction.
Sensitivity, also known as recall, is a crucial parameter that quantifies the percentage of true dropouts successfully detected by the model. In educational environments, high sensitivity is essential for promptly identifying all at-risk students and providing timely interventions. In this study, the decision tree, SVM, and random forest models demonstrated a notable sensitivity of 0.993, indicating great efficacy in identifying students who are likely to discontinue their education. This high sensitivity permits precise targeting of interventions, which may substantially improve student retention rates. The capacity to identify almost all students at risk of discontinuing their education makes these models exceptionally beneficial for educational authorities aiming to minimize dropout rates.
Precision is a crucial metric that quantifies the ratio of correct positive predictions to all positive predictions generated by the model. High precision prevents the allocation of resources to false positives, that is, students who are not genuinely at risk of dropping out. The logistic and probit regression models produced precision scores of approximately 0.988, indicating highly reliable predictions. This level of precision guarantees that interventions are both effective and efficient, targeting only students who truly require assistance. The F1 score, which captures the balance between precision and recall, validates the performance of these models in real applications.
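As a check on these definitions, precision, recall (sensitivity), and the F1 score can be computed directly from confusion-matrix counts. The counts below are illustrative and not taken from the survey data:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts
    (tp = true positives, fp = false positives, fn = false negatives)."""
    precision = tp / (tp + fp)            # fraction of flagged students truly at risk
    recall = tp / (tp + fn)               # fraction of true dropouts that were caught
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Illustrative counts: 90 true dropouts caught, 10 false alarms, 5 missed.
p, r, f = precision_recall_f1(tp=90, fp=10, fn=5)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.9 0.947 0.923
```

The harmonic mean penalizes imbalance, which is why a model must do well on both precision and recall to reach the F1 values reported here.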
The AUROC (area under the receiver operating characteristic curve) evaluates the performance of a model across all possible classification thresholds. A higher AUROC signifies superior overall performance, demonstrating the model’s capacity to differentiate accurately between dropouts and non-dropouts. The random forest (RF) model achieved an AUROC of 0.988, demonstrating exceptional discriminatory capability. This high AUROC indicates that the RF model is highly dependable in forecasting student dropout, providing a strong foundation for the development of early intervention programs. Taken together, these metrics provide a comprehensive understanding of the strengths and limitations of each model, ensuring that the chosen model performs well both in theory and in practical real-world applications, effectively reducing student dropout rates.
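AUROC also has an equivalent pairwise reading: it is the probability that a randomly chosen dropout receives a higher risk score than a randomly chosen non-dropout. A minimal sketch of this definition, using illustrative scores rather than the study’s predictions:

```python
def auroc(scores, labels):
    """AUROC via the pairwise definition: the fraction of (positive, negative)
    pairs in which the positive case is ranked higher (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative risk scores: higher should mean more likely to drop out.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
print(auroc(scores, labels))  # 8 of 9 pairs correctly ordered -> 8/9
```

A perfect ranking yields 1.0 and a random one about 0.5, which is why values near 0.99 indicate the strong class separation attributed to the random forest model.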
A thorough assessment of the various machine learning models emphasizes their diverse capabilities in forecasting student attrition rates. Models such as decision trees, support vector machines (SVMs), and random forests perform exceptionally well on metrics such as F1 score, sensitivity, and AUROC, showing their capacity to reliably detect students who are at risk while limiting false positives. Logistic and probit regression models are notable for their excellent precision, which helps guarantee the efficient allocation of resources to students who truly require assistance. The comprehensive performance indicators offered by these models empower educational institutions to make well-informed decisions regarding their implementation, striking a balance between the requirement for precise predictions and practical considerations of intervention tactics. By utilizing these observations, educational institutions can improve their efforts to retain students by offering timely and focused assistance to those who are most likely to leave. This study’s rigorous evaluation framework not only confirms the efficacy of these models but also provides a practical roadmap for their implementation in real-world educational environments, ultimately leading to enhanced student outcomes and decreased dropout rates.
5. Discussion, Conclusions, and Recommendations
5.1. Introduction
The objective of this section is to combine the results obtained from our machine learning models to provide a thorough understanding of student dropout in Somaliland. The process involves synthesizing the knowledge acquired from the predictive analysis, deriving conclusions from it, and offering practical recommendations. This comprehensive approach guarantees that the results of this study are not only academically rigorous but also directly relevant to educational policy and intervention strategies.
5.2. Discussion
This study employed various machine learning techniques, namely logistic regression, probit regression, naïve Bayes, random forest, decision tree, support vector machine (SVM), and K-nearest neighbors (KNN), to forecast student attrition rates. Each model offers a distinct perspective on the factors that contribute to dropout. Demographic variables such as age and sex had a substantial impact: younger students, particularly those going through important transitional periods, exhibited elevated dropout rates, indicating the need for interventions tailored to their age groups. Disparities in dropout rates between the sexes were also observed, emphasizing the necessity of measures that account for sex-specific considerations.
As an educational indicator, grade repeatedly proved to be the most influential predictor in several models, including naïve Bayes and random forest. This implies that the particular difficulties encountered at each grade level may be a significant factor in students leaving school before completion. Socioeconomic factors, such as household income and household size, also had an influence. Lower household income was linked to higher rates of students leaving school before completing their education, showing that economic obstacles hinder access to education. Similarly, larger household sizes were related to higher dropout rates, suggesting that larger families have limited access to the resources necessary for education [38].
The random forest model consistently outperformed the other algorithms in discriminatory power, achieving the highest AUROC score alongside strong accuracy in predicting student dropout. This performance highlights random forest as a highly accurate and reliable model for predicting student attrition [23].
The performance of the models varied, with the SVM and decision tree models achieving the highest accuracy and F1 scores, indicating that they are dependable in predicting student dropout. The random forest model also performed exceptionally well, achieving the highest AUROC score, indicating an excellent ability to discriminate between classes. Although their AUROC values were slightly lower, the naïve Bayes and KNN models demonstrated high accuracy and sensitivity, further supporting their usefulness in classification applications. The prevalence of the “Grade” attribute in several models raises concerns about overfitting, thus requiring meticulous validation [19]. This study highlights the importance of a well-balanced feature engineering approach to enhance the generalization ability of the models.
In addressing the potential concern of overfitting in our study, we took several precautionary measures to ensure the reliability and generalizability of our predictive models for school dropout in Somaliland. To mitigate overfitting, we carefully evaluated our models using a variety of performance metrics on both the training and validation datasets. We also employed hyperparameter tuning, model complexity constraints, and ensemble techniques. By fine-tuning hyperparameters, such as the maximum depth of the trees in random forest or the number of neighbors in KNN, we aimed to strike a balance between model complexity and performance. We further restricted the complexity of the SVM models by optimizing the regularization parameter and selecting appropriate kernels. For the logistic, probit, and naïve Bayes models, we focused on feature selection and model simplification to prevent overfitting and improve predictive accuracy. These strategies collectively promoted model simplicity and robustness, guarding against overfitting while ensuring the models’ effectiveness in predicting school dropout outcomes.
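The validation discipline described above relies on holding data out in rotation. A plain k-fold splitter illustrates the idea; this is a generic sketch rather than the study’s actual pipeline, which may have used library routines for splitting and tuning:

```python
def kfold_indices(n_samples, k):
    """Yield (train, validation) index lists for k-fold cross-validation:
    each fold serves once as the validation set while the rest train."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        val = folds[held_out]
        train = [i for f in range(k) if f != held_out for i in folds[f]]
        yield train, val

# Each sample appears in exactly one validation fold across the k rounds,
# so every observation is used for both fitting and validation.
splits = list(kfold_indices(n_samples=10, k=5))
all_val = sorted(i for _, val in splits for i in val)
print(len(splits), all_val)  # 5 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Hyperparameters (tree depth, number of neighbors, SVM regularization) would then be chosen by the value that maximizes the average validation-fold metric, never the training-set metric.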
5.3. Conclusions
The predictive analysis indicated that demographic and educational parameters play a crucial role in understanding student dropout in Somaliland. Although socioeconomic determinants have an impact, their influence differs across models, highlighting the need for model-specific validation. The findings indicate that age and grade play crucial roles in predicting dropout rates, highlighting the need for targeted interventions tailored to specific age groups and grade levels. SVM, decision tree, and random forest models offer reliable frameworks for predicting dropout; however, caution is needed to handle potential overfitting, particularly with regard to the “Grade” attribute. By integrating insights from several models, a comprehensive understanding of the factors contributing to dropout can be obtained and used to develop more efficient intervention strategies.
5.4. Recommendations
Thorough model validation is essential to prevent overfitting. Cross-validation techniques and regularization approaches should be incorporated, especially in models that depend heavily on the “Grade” feature. Subsequent research should investigate sophisticated feature engineering methods, such as polynomial features, interaction terms, and domain-specific transformations, to increase the predictive contribution of underrepresented features. This may entail constructing novel features that capture intricate relationships within the data.
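The polynomial and interaction terms suggested above can be generated mechanically from the base features. The sketch below uses hypothetical feature names, not variables from the survey:

```python
from itertools import combinations_with_replacement

def degree2_features(row):
    """Augment a feature dict with all degree-2 products:
    squares (a*a) and pairwise interaction terms (a*b)."""
    out = dict(row)
    for a, b in combinations_with_replacement(sorted(row), 2):
        out[f"{a}*{b}"] = row[a] * row[b]
    return out

# Hypothetical numeric features for one student record.
row = {"age": 14, "household_size": 6}
expanded = degree2_features(row)
print(sorted(expanded))
```

Interaction terms such as the hypothetical `age*household_size` let otherwise weak features contribute jointly, which is one way to rebalance models currently dominated by a single predictor.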
Focused interventions are needed to assist younger students, who are more likely to drop out, and particular support systems should be implemented to address the specific difficulties faced by students at critical grade levels. Gender-sensitive interventions are crucial to guarantee that both males and females receive sufficient assistance to remain enrolled in educational institutions. Policymakers should mitigate socioeconomic gaps by allocating financial assistance and resources to low-income households and larger families. A fair allocation of educational resources across regions is essential to reduce regional differences in dropout rates. Policies should offer further assistance to students from diverse backgrounds, including those who face distinctive socioeconomic obstacles.
Regular, ongoing monitoring and updating of the prediction models are crucial. Consistently incorporating fresh data through regular updates would facilitate adaptation to evolving educational trends and ensure the models’ continued accuracy and relevance. Implementing a feedback loop that continuously monitors the success of interventions and uses this information to refine the predictive models and intervention tactics would significantly enhance the effectiveness of these efforts. Implementing these suggestions could help educational policymakers and practitioners in Somaliland create more efficient approaches to decreasing student attrition, thereby enhancing overall educational achievement and fostering socioeconomic progress in the region.