1. Introduction
In light of Sustainable Development Goal (SDG) #4, quality education can help society escape from poverty and improve socioeconomic conditions. Moreover, the pandemic crisis (COVID-19) has established a new normal in every walk of life. A similar digital transformation can be seen in the case of education and learning environments. Digital media has a significant influence on both adults and children. It is significantly changing society’s social and moral values. Digital media influences emotional, moral, and social development, especially in children and adolescents [
1]. The safety of children from digital media’s adult content, e.g., pornography, violence, and cyberbullying [
2], is critical. Nowadays, mobile applications (apps) are popular digital media commodities [3] and attract kids’ attention.
App stores, e.g., the Google Play Store, provide age-group information regarding the suitability of apps. Such information can be used to prevent kids from accessing adult apps [
4]. Alternatively, parental controls can be used to filter out unwanted apps/information. However, such controls have been proven to be underused; their adoption rate is significantly low [5,6]. Therefore, it is necessary to devise an automated method for classifying apps that are suitable for kids.
The users of apps usually provide feedback through their reviews. Such reviews not only provide information about bugs or enhancements [
7], but also provide information about the age groups of users, e.g., the reviews with certain words (violence, nudity, alcohol usage, sex, drugs, or tobacco usage) about an app may suggest that parents block the apps from their kids or avoid them [
8]. Consequently, reviews of apps can be used for the classification of the suitability of apps. Although many studies [
9,
10,
11] on text classification by applying NLP techniques have been conducted, none of them have focused on apps’ suitability for kids.
From this perspective, this paper proposes a machine-learning-based approach to the prediction of the suitability of apps for kids. The approach first leverages natural language processing (NLP) techniques to preprocess the user reviews of mobile apps. Second, it performs feature engineering based on a given bag of words (BOW), e.g., abusive words, and constructs a feature vector for each mobile app. Finally, it trains and tests a machine learning classifier (a support vector machine (SVM)) on the given feature vectors. To evaluate the proposed approach, we leverage the 10-fold cross-validation technique. The results indicate that the proposed model is effective.
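At a high level, this pipeline can be sketched with scikit-learn as follows; the reviews, labels, and learned vocabulary are illustrative placeholders, not the paper’s actual data or feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative reviews and labels (1 = secure for kids, 0 = insecure);
# the actual approach uses a real app-review dataset and a curated BOW.
reviews = ["great learning game for my child", "too much violence and gore",
           "fun puzzles and safe content", "contains nudity and drugs"] * 5
labels = [1, 0, 1, 0] * 5

# Bag-of-words features feeding a linear SVM classifier.
model = make_pipeline(CountVectorizer(), LinearSVC())

# 10-fold cross-validation, as used in the evaluation.
scores = cross_val_score(model, reviews, labels, cv=10, scoring="f1")
print(round(scores.mean(), 2))  # 1.0 on this separable toy data
```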
1.1. Research Significance
The significant contributions of this paper towards the identification of apps’ suitability for kids are as follows:
A machine-learning-based classification algorithm is exploited in the proposed approach for the prediction of the suitability of apps.
To the best of our knowledge, we are the first to introduce a machine learning method for keeping adult apps away from kids for their safety.
The evaluation’s results indicate that the proposed approach is accurate and outperforms the baseline approaches. The proposed approach’s average precision, recall, and F1-score are 92.76%, 99.33%, and 95.93%, respectively.
1.2. Organization of the Paper
The remainder of the paper is structured as follows.
Section 2 provides a brief overview of the related work.
Section 3 gives the details of the proposed model.
Section 4 explains the evaluation methods for the proposed model and the obtained results.
Section 6 concludes the paper and highlights future work.
2. Related Work
Machine intelligence is a highly multidisciplinary and active field, as it is implemented in multiple domains [
12,
13]. To this end, machine learning algorithms can be used to protect children from inappropriate apps, such as those involving cyberbullying. Many machine-learning-based apps have also been developed to monitor the interactions of mobile apps with children and adolescents [
14]. This interactivity can also be used to understand sustainable behavior and to generate a lead for educators to map proactive interest in the SDGs.
Thun et al. proposed an approach that uses a random forest algorithm to detect cyberbullying text on social networking sites [
15]. They applied this algorithm to open-source data from the popular social networking site Twitter. They also developed an app based on the random forest algorithm to help parents detect and prevent cyberbullying against their children. The accuracy of the random forest was 92%. They also applied other machine learning algorithms, such as a decision tree, an SVM, and naive Bayes; however, the accuracy of the random forest was the highest.
Liu et al. proposed a machine-learning-based approach to identifying and analyzing the privacy of kids in mobile apps [
16]. They applied the SVM algorithm to a dataset of mobile apps from the Google Play Store. The method was evaluated on 1738 apps and achieved 95% accuracy.
Hu et al. proposed a novel approach that used the Automatic App Maturity Rating (AAR) to predict inappropriate content in mobile apps [
17]. Moreover, they applied an SVM for multi-label classification. They applied these techniques to the data of mobile apps collected from the Google Play Store and the App Store. They achieved a mature content prediction accuracy of 85%.
Ying et al. found that the available maturity ratings of mobile apps are neither verified nor reliable, making mobile apps unsafe for children and adolescents [
4]. They analyzed data on Android apps collected from the Google Play Store in terms of violence, drugs/alcohol/tobacco usage, sex, and offensive language. They used the Automatic Label of Maturity (ALM) ratings to verify the available maturity ratings. They found that 30% of the mobile apps had false maturity ratings.
Deep learning is a promising machine learning technique that is far more effective than classical machine learning techniques [
Deep learning techniques exploit the unknown, abstract structure of the input to find better representations, often across multiple levels [
19]. These techniques use hierarchical neural networks that enable machines to analyze abstract data/inputs. Deep learning has been proven to be very efficient in the fields of text classification [
18], sentiment analysis [
20], and speech recognition [
21].
Deep learning is also very applicable and effective in natural language processing tasks, e.g., analyzing apps’ user reviews to find the discrepancies in the apps’ numeric ratings [
22]. According to Thun et al., deep learning techniques are useful for the prediction of content that is inappropriate for children, such as cyberbullying [
15]. The successful application of deep learning techniques in the analysis of user reviews inspired us to apply these techniques, e.g., CNNs. However, deep learning approaches have not been exploited for the prediction of the inappropriateness of apps for kids because of the small datasets related to apps. A summary of the proposed methods and the datasets that were used for the state-of-the-art approaches is provided in
Table 1.
In conclusion, researchers have identified many problems related to apps that are inappropriate for children and have proposed many machine/deep learning approaches in order to address these problems. However, the automatic prevention of problematic/inappropriate apps from being accessed by children requires attention, as keeping children away from adult apps is a social/parental responsibility.
4. Evaluation
This section discusses the evaluation process, dataset, and results of the machine-learning-based suitability prediction for mobile applications for kids (ML-SAK) through research questions.
4.1. Research Questions
The following research questions (RQs) were investigated for the evaluation:
RQ1: How accurate is ML-SAK in predicting suitable apps for kids?
RQ2: Does the review score influence ML-SAK in predicting suitable apps for kids?
RQ3: Is re-sampling required to improve the performance of ML-SAK?
RQ4: Is preprocessing required to improve the performance of ML-SAK?
RQ5: Does ML-SAK outperform other machine/deep learning algorithms in predicting suitable apps for kids?
RQ1 involved the computation of the performance of ML-SAK and compared it with that of the two basic algorithms: the random prediction (RP) and zero-rule (ZR) algorithms. These algorithms were used because none of the state-of-the-art approaches addressed the present problem. The RP algorithm collected the unique and actual outcome values from the training data and assigned random outcome values for the testing data. However, the ZR algorithm assigned the most frequently occurring class for the testing data.
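The two baseline algorithms can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import random
from collections import Counter

def random_prediction(train_labels, n_test, seed=0):
    # RP: collect the unique outcome values seen in the training data
    # and assign a random one of them to each test instance.
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n_test)]

def zero_rule(train_labels, n_test):
    # ZR: always predict the most frequently occurring training class.
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

train = ["secure", "secure", "insecure", "secure"]
print(zero_rule(train, 3))  # ['secure', 'secure', 'secure']
print(random_prediction(train, 3))
```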
RQ2 concerned the impact of the review score on the prediction of suitable apps for kids. To this end, we compared the results of ML-SAK by enabling/disabling the review score in the input of the proposed classifier.
RQ3 involved the investigation of the improvement of ML-SAK after applying re-sampling. Our dataset did not have equal samples in each class. Consequently, we performed re-sampling to address this problem. Re-sampling can be done by adjusting the threshold value of the classifier, over-sampling, or under-sampling. We were able to make the deviant dataset more consistent by adjusting the threshold of the classifier. Over-sampling could be achieved by including data in the deviant dataset. On the other hand, under-sampling could be achieved by trimming data from the deviant dataset. We applied under-sampling to construct a balanced dataset. Furthermore, we observed the performance of the classifier with/without re-sampling.
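Random under-sampling, as applied here, can be sketched as follows; the samples and labels are made up for illustration:

```python
import random

def undersample(samples, labels, seed=42):
    # Group sample indices by class.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Trim every class down to the size of the smallest one.
    n_min = min(len(idx) for idx in by_class.values())
    rng = random.Random(seed)
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [samples[i] for i in keep], [labels[i] for i in keep]

X = ["r1", "r2", "r3", "r4", "r5", "r6"]
y = ["secure", "secure", "secure", "secure", "insecure", "insecure"]
Xb, yb = undersample(X, y)
print(yb.count("secure"), yb.count("insecure"))  # 2 2
```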
To answer RQ4, we examined the differences in the performance of ML-SAK by enabling/disabling the preprocessing step.
RQ5 involved the comparison of the performance of other machine/deep learning classifiers with that of ML-SAK. We evaluated the performance of different machine learning classifiers based on the accuracy, precision, recall, f-measure, and MCC.
4.2. Dataset
We used the app review dataset from Kaggle (
https://www.kaggle.com/datasets/prakharrathi25/google-play-store-reviews, accessed on 10 July 2022) to evaluate ML-SAK. The statistical information of the dataset is presented in
Figure 2. The data contained over 12,000 reviews by real users of different app store applications. The data were classified into the groups of positive (secure) and negative (insecure) for kids. Notably, the data were rated on a five-point Likert scale. We combined the data and converted them into two classes: secure and insecure. The secure class contained samples with ratings of 4 and 5, whereas the insecure class contained samples with ratings of 1 and 2. The 1991 samples with a rating of 3 were ignored.
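The rating-to-class conversion described above can be sketched as follows (illustrative ratings only):

```python
def label_review(rating):
    # Ratings 4-5 -> secure, 1-2 -> insecure, 3 -> dropped (None).
    if rating >= 4:
        return "secure"
    if rating <= 2:
        return "insecure"
    return None

ratings = [5, 4, 3, 2, 1, 3]
labels = [label_review(r) for r in ratings if label_review(r) is not None]
print(labels)  # ['secure', 'secure', 'insecure', 'insecure']
```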
4.3. Process
After preprocessing the user reviews as described in
Section 3, ML-SAK was evaluated by applying 10-fold cross-validation. For the cross-validation, the user reviews R were divided into ten equal portions P1, P2, ..., P10. For each portion Pi, the training data Ti were selected from the user reviews R that were not included in Pi.
The evaluation steps applied for the cross-validation were as follows:
First, for each portion Pi, the user reviews for Ti were extracted from R, excluding Pi, and combined.
Second, a multinomial naive Bayes classifier (MNB), logistic regression classifier (LR), random forest classifier (RF), convolutional neural network (CNN), and the proposed classifier (SVM) were trained on Ti.
Third, given the trained MNB, LR, RF, CNN, and SVM, for each user review from Pi, it was predicted whether the app was secure or insecure for kids.
Finally, the exploited metrics (accuracy, precision, recall, and f-measure) were computed for each classifier.
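The fold-wise steps above can be sketched with scikit-learn’s KFold; the toy reviews and the linear SVM stand in for the actual dataset and trained classifiers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Placeholder reviews and labels (1 = secure, 0 = insecure).
reviews = np.array(["safe fun learning game", "violent gore scenes"] * 10)
labels = np.array([1, 0] * 10)

# For brevity, the vectorizer is fitted once; in practice it should be
# fitted on the training folds only.
X = CountVectorizer().fit_transform(reviews)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    clf = LinearSVC().fit(X[train_idx], labels[train_idx])      # train on Ti
    preds = clf.predict(X[test_idx])                            # predict Pi
    accuracies.append(accuracy_score(labels[test_idx], preds))  # score the fold
print(round(sum(accuracies) / len(accuracies), 2))  # 1.0 on this toy data
```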
The 10-fold cross-validation was performed to evaluate the performance of ML-SAK and minimize threats to its validity. To evaluate the performance of the classifiers, the familiar and well-known evaluation parameters, i.e., precision, recall, and f-measure, were computed. Such parameters can be defined as
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-measure = (2 × Precision × Recall) / (Precision + Recall),
where Precision, Recall, and F-measure indicate the metrics used to predict secure and insecure apps for kids. TP is the number of correctly predicted apps that are secure for kids, TN is the number of correctly predicted apps that are insecure for kids, FP is the number of incorrectly predicted apps that are secure for kids, and FN is the number of incorrectly predicted apps that are insecure for kids.
We also calculated the Matthews correlation coefficient (MCC) to measure the quality of the classifier by using the following equation:
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
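The four metrics can be computed directly from the confusion counts; the counts below are illustrative, not the paper’s results:

```python
import math

def metrics(tp, tn, fp, fn):
    # Direct transcription of the precision/recall/f-measure/MCC formulas.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f_measure, mcc

p, r, f, m = metrics(tp=90, tn=40, fp=10, fn=5)
print(round(p, 2), round(r, 2), round(f, 2), round(m, 2))  # 0.9 0.95 0.92 0.77
```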
The above-mentioned metrics were computed to investigate the RQs. First, we computed the accuracy of ML-SAK and compared it with that of the baseline approaches (RP and ZR). Second, we analyzed the performance improvement caused by the review score in the prediction of secure and insecure apps. We compared the performance results of ML-SAK with the review score added and removed to answer RQ2. Third, we analyzed the effect of sampling by comparing the results of ML-SAK on a balanced and an imbalanced dataset to answer RQ3. Fourth, we identified the impact of preprocessing by comparing the performance results of ML-SAK with preprocessing enabled/disabled. Finally, we analyzed the proposed classifier’s efficiency by comparing its results with those of other well-known classifiers to answer RQ5.
4.4. Results
4.4.1. RQ1: Performance of ML-SAK
The average performance results of
ML-SAK,
RP, and
ZR were compared to investigate RQ1. The average evaluation results of
ML-SAK for the analysis of the impact of the scores of the user reviews (mentioned in
Section 3.4) are presented in
Table 2. The results of the candidate approaches are presented in
Table 2. Column 1 and Columns 2–5 present the approaches and the performance results of the metrics for each approach, respectively. The rows present the performance of each candidate approach.
From
Table 2, we can make the following observations:
The average results of the metrics (precision, recall, f-measure, and MCC) of the candidate approaches were (92.76%, 99.33%, 95.93%, 0.485), (65.23%, 65.64%, 65.43%, 0.298), and (82.58%, 100.00%, 90.46%, 0.367), respectively.
ML-SAK outperformed the RP and ZR classifiers.
ML-SAK showed improvements in precision compared to RP and ZR by 42.20% = (92.76% − 65.23%)/65.23% and 12.33% = (92.76% − 82.58%)/82.58%, respectively.
ML-SAK showed improvements in recall compared to RP and ZR by 51.33% = (99.33% − 65.64%)/65.64% and −0.67% = (99.33% − 100.00%)/100.00%, respectively. The reason for the decrease in the performance of ML-SAK in terms of recall against ZR is that ZR always predicts the majority class.
ML-SAK showed improvements in the f-measure compared to RP and ZR by 46.61% = (95.93% − 65.43%)/65.43% and 6.05% = (95.93% − 90.46%)/90.46%, respectively.
On the basis of the above-discussed literature and findings, this study can conclude that machine learning can help direct, educate, and inform learners in accordance with the SDGs, specifically SDG #4 (quality education). It can help increase access to affordable and quality education, quality early childhood development, and essential skills for sustainable development (i.e., peace and non-violence).
ML-SAK showed improvements in the MCC compared to RP and ZR by 62.75% = (0.485 − 0.298)/0.298 and 32.15% = (0.485 − 0.367)/0.367, respectively.
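Each improvement above is a relative difference, (new − baseline)/baseline × 100; for example, the precision improvement over RP:

```python
def improvement(new, baseline):
    # Relative improvement of `new` over `baseline`, in percent.
    return (new - baseline) / baseline * 100

# Precision of ML-SAK (92.76%) vs. RP (65.23%), as reported above.
print(round(improvement(92.76, 65.23), 2))  # 42.2
```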
Moreover, we provide the distribution of the cross-validation results for ML-SAK, RP, and ZR in a beanplot. In
Figure 3, one bean is plotted for each candidate approach for the comparison of the distributions; the short horizontal lines in a bean represent the results of the 10-fold cross-validation, and the long horizontal line presents the average result of each approach. The results of ML-SAK deviate less than those of RP and ZR, suggesting that ML-SAK is more reliable.
Furthermore, an ANOVA was performed (as presented in
Figure 4) to study the differences among ML-SAK, RP, and ZR. An ANOVA test confirms whether the observed differences in performance are statistically significant. The ANOVA returned a p-value of 2.89057; the factor had a significant difference at p < 0.05.
The above analysis allowed the conclusion that ML-SAK significantly improved the performance in comparison with that of RP and ZR.
4.4.2. RQ2: Impact of the Review Score
We compared the performance results of
ML-SAK by adding and subtracting the feature of the review scores to investigate RQ2. The evaluation results of
ML-SAK with and without the feature of the review score are presented in
Table 3. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the
ML-SAK’s performance with different input settings.
From these results, we can make the following major observations:
The results with the review score disabled were less accurate than the results with the review score enabled in the prediction of the security/suitability of apps for kids. Disabling the review score decreased the performance significantly, i.e., precision decreased from 92.76% to 91.21%, recall decreased from 99.33% to 98.42%, the f-measure decreased from 95.93% to 94.68%, and the MCC decreased from 0.485 to 0.464.
Second, enabling the review score resulted in a significant improvement in the prediction results, i.e., the improvements in the average precision, recall, f-measure, and MCC of ML-SAK reached 1.70% = (92.76% − 91.21%)/91.21%, 0.92% = (99.33% − 98.42%)/98.42%, 1.32% = (95.93% − 94.68%)/94.68%, and 4.53% = (0.485 − 0.464)/0.464, respectively.
The above analysis allows the conclusion that the review score significantly impacts the prediction and performance of ML-SAK. Consequently, it is better to predict with the review score enabled.
4.4.3. RQ3: Impact of Re-Sampling
Re-sampling balances the class samples and corrects the bias of a dataset. We applied under-sampling and over-sampling to investigate the increase in the performance of ML-SAK with sampling. These results are presented in
Table 4. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the performance of
ML-SAK with different input settings. Notably, the synthetic minority over-sampling technique (SMOTE) was adopted for over-sampling, and random n samples were selected from the majority class for under-sampling.
It is evident from
Table 4 that the performance (precision, recall, f-measure, and MCC) of ML-SAK under the two re-sampling settings was (94.79%, 99.45%, 97.06%, 0.653) and (93.28%, 99.40%, 96.24%, 0.579), respectively. This indicates that both re-sampling techniques improve the overall performance of ML-SAK.
4.4.4. RQ4: Impact of Preprocessing
The texts of user reviews contain meaningless words that can reduce the learning ability of machine learning algorithms. Therefore, removing such words is essential for improving the performance of machine learning algorithms.
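A minimal preprocessing sketch of this step; the stopword list is an illustrative subset, not the exact list used by ML-SAK:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "and", "this", "it", "so"}  # illustrative subset

def preprocess(review):
    # Lowercase, strip special characters, and drop stopwords.
    review = review.lower()
    review = re.sub(r"[^a-z\s]", " ", review)
    return [t for t in review.split() if t not in STOPWORDS]

print(preprocess("This game is SO violent!!!"))  # ['game', 'violent']
```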
We compared the performance results of
ML-SAK by adding and subtracting the preprocessing step to investigate RQ4. The related performance results of
ML-SAK are presented in
Table 5. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the performance of
ML-SAK with different input settings.
From
Table 5, we can make the following observations:
The preprocessing of user reviews’ texts resulted in an improvement in performance and improved the average precision, recall, f-measure, and MCC of ML-SAK up to 0.34% = (92.76% − 92.45%)/92.45%, 0.38% = (99.33% − 98.95%)/98.95%, 0.36% = (95.93% − 95.59%)/95.59%, and 1.68% = (0.485 − 0.477)/0.477, respectively.
The possible reason for the improvement is that the user reviews included irrelevant and meaningless words, e.g., special characters.
We used the Lancaster stemming algorithm for stemming. For example, the Porter stemming algorithm returns the word ‘tri’ for the word ‘trying’, which has no meaning in textual analysis. However, the Lancaster stemming algorithm returns the word ‘try’ for ‘trying’, which positively impacts the textual analysis.
The above analysis allows the conclusion that the preprocessing of user reviews is critical for the prediction of the suitability of apps.
4.4.5. RQ5: Impacts of Other Classification Algorithms
We compared the performance results of
ML-SAK (SVM) with those of other machine/deep learning algorithms (
LR,
RF,
CNN, and
MNB). The related results for the classifiers are presented in
Table 6. Column 1 and Columns 2–5 present the classifiers and the evaluation metrics, whereas the rows of the table present the performance of the classifiers.
From
Table 6, we can make the following observations:
The SVM performed better than the LR, RF, MNB, and CNN classifiers in terms of precision, recall, f-measure, and MCC.
The
SVM is significant for different reasons. First, the
SVM draws a hyperplane in the feature space [
9,
33], which helps it provide a better generalization of the testing data, in contrast to that of
RF [
34]. Second, a linear
SVM searches different combinations within features and classifies samples with a low computational complexity [
33]. Moreover, an
SVM is considered to be better than other classification algorithms, e.g.,
LR,
RF, and
MNB, for the classification of long texts [
9].
Although deep learning algorithms, such as a
CNN, are better than machine learning algorithms for different classification problems [
35], the CNN did not perform better in the prediction of the suitability of apps for kids. A possible reason is that we had a small dataset for the evaluation of
ML-SAK, and deep learning classifiers, e.g., CNN, are good when dealing with large training datasets [
35].
The above analysis allows the conclusion that ML-SAK outperformed the other machine/deep learning classifiers in the prediction of the suitability of apps for kids.
6. Conclusions and Future Work
The identification of the suitability of mobile applications for kids is challenging due to the diversity of mobile applications. A support vector machine classifier was proposed to perform effective identification. The proposed model cleans the given reviews, conducts textual analysis to compute the review score of each review, and combines the preprocessed information and the review score of each review into a vector for the training and evaluation of the proposed model. The results of the 10-fold cross-validation indicate that the proposed model is effective, improving the precision, recall, and f-measure over the baselines by up to or more than 42%, 51%, and 46%, respectively.
Our work indicates that mobile application reviews significantly help in the prediction of the suitability of apps for kids. However, it would be interesting to validate the proposed approach on multiple, larger datasets. Furthermore, we would like to explore other metadata features of mobile applications and deep learning approaches to improve the performance of the proposed approach.