1. Introduction
In light of Sustainable Development Goal (SDG) #4, quality education can help society escape from poverty and improve socioeconomic conditions. Moreover, the pandemic crisis (COVID-19) has established a new normal in every walk of life. A similar digital transformation can be seen in the case of education and learning environments. Digital media has a significant influence on both adults and children. It is significantly changing society’s social and moral values. Digital media influences emotional, moral, and social development, especially in children and adolescents [
1]. The safety of children from digital media’s adult content, e.g., pornography, violence, and cyberbullying [
2], is critical. Nowadays, mobile applications (apps) are popular digital media commodities [3] and attract kids’ attention.
App stores, e.g., the Google Play Store, provide age-group information regarding the suitability of apps. Such information can be used to prevent kids from accessing adult apps [
4]. Alternatively, parental controls can be used to filter out unwanted apps/information. However, such controls have been proven to be underused; their adoption rate is significantly low [5,6]. Therefore, it is necessary to devise an automated method for classifying apps that are suitable for kids.
The users of apps usually provide feedback through their reviews. Such reviews not only provide information about bugs or enhancements [
7], but also provide information about the age groups of users, e.g., the reviews with certain words (violence, nudity, alcohol usage, sex, drugs, or tobacco usage) about an app may suggest that parents block the apps from their kids or avoid them [
8]. Consequently, reviews of apps can be used for the classification of the suitability of apps. Although many studies [
9,
10,
11] on text classification by applying NLP techniques have been conducted, none of them have focused on apps’ suitability for kids.
From this perspective, this paper proposes a machine-learning-based approach to the prediction of the suitability of apps for kids. The approach first leverages natural language processing (NLP) techniques to preprocess the user reviews of mobile apps. Second, it performs feature engineering based on a given bag of words (BOW), e.g., abusive words, and constructs a feature vector for each mobile app. Finally, it trains and tests a machine learning classifier (a support vector machine (SVM)) on the given feature vectors. To evaluate the proposed approach, we leverage the 10-fold cross-validation technique. The results indicate that the proposed model is effective.
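At a high level, this pipeline can be sketched with scikit-learn as follows; the reviews, labels, and learned vocabulary are illustrative placeholders, not the paper’s actual data or feature set:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Illustrative reviews and labels (1 = secure for kids, 0 = insecure);
# the actual approach uses a real app-review dataset and a curated BOW.
reviews = ["great learning game for my child", "too much violence and gore",
           "fun puzzles and safe content", "contains nudity and drugs"] * 5
labels = [1, 0, 1, 0] * 5

# Bag-of-words features feeding a linear SVM classifier.
model = make_pipeline(CountVectorizer(), LinearSVC())

# 10-fold cross-validation, as used in the evaluation.
scores = cross_val_score(model, reviews, labels, cv=10, scoring="f1")
print(round(scores.mean(), 2))  # 1.0 on this separable toy data
```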
1.1. Research Significance
The significant contributions of this paper towards the identification of apps’ suitability for kids are as follows:
A machine-learning-based classification algorithm is exploited in the proposed approach for the prediction of the suitability of apps.
To the best of our knowledge, we are the first to introduce a machine learning method for keeping adult apps away from kids for their safety.
The evaluation’s results indicate that the proposed approach is accurate and outperforms the baseline approaches. The proposed approach’s average precision, recall, and F1-score are 92.76%, 99.33%, and 95.93%, respectively.
1.2. Organization of the Paper
The remainder of the paper is structured as follows.
Section 2 provides a brief overview of the related work.
Section 3 gives the details of the proposed model.
Section 4 explains the evaluation methods for the proposed model and the obtained results.
Section 6 concludes the paper and highlights future work.
2. Related Work
Machine intelligence is a highly multidisciplinary and active field, as it is implemented in multiple domains [
12,
13]. To this end, machine learning algorithms can be used to protect children from inappropriate apps, such as those involving cyberbullying. Many machine-learning-based apps have also been developed to monitor the interactions of mobile apps with children and adolescents [
14]. This interactivity can also be used to understand sustainable behavior and to generate a lead for educators to map proactive interest in the SDGs.
Thun et al. proposed an approach that uses a random forest algorithm to detect cyberbullying text on social networking sites [
15]. They applied this algorithm to open-source data from the popular social networking site Twitter. They also developed an app based on the random forest algorithm to help parents detect and prevent cyberbullying against their children. The accuracy of the random forest was 92%. They also applied other machine learning algorithms, such as a decision tree, an SVM, and naive Bayes; however, the accuracy of the random forest was the highest.
Liu et al. proposed a machine-learning-based approach to identifying and analyzing the privacy of kids in mobile apps [
16]. They applied the SVM algorithm to a dataset of mobile apps from the Google Play Store. The method was evaluated on 1738 apps and achieved 95% accuracy.
Hu et al. proposed a novel approach that used the Automatic App Maturity Rating (AAR) to predict inappropriate content in mobile apps [
17]. Moreover, they applied an SVM for multi-label classification. They applied these techniques to the data of mobile apps collected from the Google Play Store and the App Store. They achieved a mature content prediction accuracy of 85%.
Ying et al. found that the available maturity ratings of mobile apps are neither verified nor reliable, making mobile apps unsafe for children and adolescents [
4]. They analyzed data on Android apps collected from the Google Play Store in terms of violence, drugs/alcohol/tobacco usage, sex, and offensive language. They used the Automatic Label of Maturity (ALM) ratings to verify the available maturity ratings. They found that 30% of the mobile apps had false maturity ratings.
Deep learning is a promising machine learning technique that is far more effective than classical machine learning techniques [
Deep learning techniques exploit the unknown, abstract structure of the input to find better representations, often across multiple levels [
19]. These techniques use hierarchical neural networks that enable machines to analyze abstract data/inputs. Deep learning has been proven to be very efficient in the fields of text classification [
18], sentiment analysis [
20], and speech recognition [
21].
Deep learning is also very applicable and effective in natural language processing tasks, e.g., analyzing apps’ user reviews to find the discrepancies in the apps’ numeric ratings [
22]. According to Thun et al., deep learning techniques are useful for the prediction of content that is inappropriate for children, such as cyberbullying [
15]. The successful application of deep learning techniques in the analysis of user reviews inspired us to apply these techniques, e.g., CNNs. However, deep learning approaches have not been exploited for the prediction of the inappropriateness of apps for kids because of the small datasets related to apps. A summary of the proposed methods and the datasets that were used for the state-of-the-art approaches is provided in
Table 1.
In conclusion, researchers have identified many problems related to apps that are inappropriate for children and have proposed many machine/deep learning approaches in order to address these problems. However, the automatic prevention of problematic/inappropriate apps from being accessed by children requires attention, as keeping children away from adult apps is a social/parental responsibility.
4. Evaluation
This section discusses the evaluation process, dataset, and results of the machine-learning-based suitability prediction for mobile applications for kids (ML-SAK) through research questions.
4.1. Research Questions
The following research questions (RQs) were investigated for the evaluation:
RQ1: How accurate is ML-SAK in predicting suitable apps for kids?
RQ2: Does the review score influence ML-SAK in predicting suitable apps for kids?
RQ3: Is re-sampling required to improve the performance of ML-SAK?
RQ4: Is preprocessing required to improve the performance of ML-SAK?
RQ5: Does ML-SAK outperform other machine/deep learning algorithms in predicting suitable apps for kids?
RQ1 involved the computation of the performance of ML-SAK and compared it with that of the two basic algorithms: the random prediction (RP) and zero-rule (ZR) algorithms. These algorithms were used because none of the state-of-the-art approaches addressed the present problem. The RP algorithm collected the unique and actual outcome values from the training data and assigned random outcome values for the testing data. However, the ZR algorithm assigned the most frequently occurring class for the testing data.
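The two baseline algorithms can be sketched as follows (a minimal illustration; the function and variable names are ours):

```python
import random
from collections import Counter

def random_prediction(train_labels, n_test, seed=0):
    # RP: collect the unique outcome values seen in the training data
    # and assign a random one of them to each test instance.
    rng = random.Random(seed)
    classes = sorted(set(train_labels))
    return [rng.choice(classes) for _ in range(n_test)]

def zero_rule(train_labels, n_test):
    # ZR: always predict the most frequently occurring training class.
    majority = Counter(train_labels).most_common(1)[0][0]
    return [majority] * n_test

train = ["secure", "secure", "insecure", "secure"]
print(zero_rule(train, 3))  # ['secure', 'secure', 'secure']
print(random_prediction(train, 3))
```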
RQ2 concerned the impact of the review score on the prediction of suitable apps for kids. To this end, we compared the results of ML-SAK by enabling/disabling the review score in the input of the proposed classifier.
RQ3 involved the investigation of the improvement of ML-SAK after applying re-sampling. Our dataset did not have equal samples in each class. Consequently, we performed re-sampling to address this problem. Re-sampling can be done by adjusting the threshold value of the classifier, over-sampling, or under-sampling. We were able to make the deviant dataset more consistent by adjusting the threshold of the classifier. Over-sampling could be achieved by including data in the deviant dataset. On the other hand, under-sampling could be achieved by trimming data from the deviant dataset. We applied under-sampling to construct a balanced dataset. Furthermore, we observed the performance of the classifier with/without re-sampling.
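Random under-sampling, as applied here, can be sketched as follows; the samples and labels are made up for illustration:

```python
import random

def undersample(samples, labels, seed=42):
    # Group sample indices by class.
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Trim every class down to the size of the smallest one.
    n_min = min(len(idx) for idx in by_class.values())
    rng = random.Random(seed)
    keep = []
    for idx in by_class.values():
        keep.extend(rng.sample(idx, n_min))
    keep.sort()
    return [samples[i] for i in keep], [labels[i] for i in keep]

X = ["r1", "r2", "r3", "r4", "r5", "r6"]
y = ["secure", "secure", "secure", "secure", "insecure", "insecure"]
Xb, yb = undersample(X, y)
print(yb.count("secure"), yb.count("insecure"))  # 2 2
```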
To answer RQ4, we examined the differences in the performance of ML-SAK by enabling/disabling the preprocessing step.
RQ5 involved the comparison of the performance of other machine/deep learning classifiers with that of ML-SAK. We evaluated the performance of different machine learning classifiers based on the accuracy, precision, recall, f-measure, and MCC.
4.2. Dataset
We used the app review dataset from Kaggle (
https://www.kaggle.com/datasets/prakharrathi25/google-play-store-reviews, accessed on 10 July 2022) to evaluate ML-SAK. The statistical information of the dataset is presented in
Figure 2. The data contained over 12,000 reviews by real users of different app store applications. The data were classified into the groups of positive (secure) and negative (insecure) for kids. Notably, the data were rated on a five-point Likert scale. We combined the data and converted them into two classes: secure and insecure. The secure class contained samples with ratings of 4 and 5, whereas the insecure class contained samples with ratings of 1 and 2. The 1991 samples with a rating of 3 were ignored.
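The rating-to-class conversion described above can be sketched as follows (illustrative ratings only):

```python
def label_review(rating):
    # Ratings 4-5 -> secure, 1-2 -> insecure, 3 -> dropped (None).
    if rating >= 4:
        return "secure"
    if rating <= 2:
        return "insecure"
    return None

ratings = [5, 4, 3, 2, 1, 3]
labels = [label_review(r) for r in ratings if label_review(r) is not None]
print(labels)  # ['secure', 'secure', 'insecure', 'insecure']
```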
4.3. Process
After preprocessing the user reviews as described in
Section 3, ML-SAK was evaluated by applying 10-fold cross-validation. For the cross-validation, the user reviews R were divided into ten equal portions P1, P2, ..., P10. For each portion Pi, the training data Ti were selected from the user reviews R that were not included in Pi.
The evaluation steps applied for the cross-validation were as follows:
First, for each portion Pi, the user reviews for Ti were extracted from R, excluding Pi, and combined.
Second, a multinomial naive Bayes classifier (MNB), logistic regression classifier (LR), random forest classifier (RF), convolutional neural network (CNN), and the proposed classifier (SVM) were trained on Ti.
Third, given the trained MNB, LR, RF, CNN, and SVM, for each user review from Pi, it was predicted whether the app was secure or insecure for kids.
Finally, the exploited metrics (accuracy, precision, recall, and f-measure) were computed for each classifier.
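The fold-wise steps above can be sketched with scikit-learn’s KFold; the toy reviews and the linear SVM stand in for the actual dataset and trained classifiers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.svm import LinearSVC

# Placeholder reviews and labels (1 = secure, 0 = insecure).
reviews = np.array(["safe fun learning game", "violent gore scenes"] * 10)
labels = np.array([1, 0] * 10)

# For brevity, the vectorizer is fitted once; in practice it should be
# fitted on the training folds only.
X = CountVectorizer().fit_transform(reviews)

accuracies = []
for train_idx, test_idx in KFold(n_splits=10, shuffle=True, random_state=1).split(X):
    clf = LinearSVC().fit(X[train_idx], labels[train_idx])      # train on Ti
    preds = clf.predict(X[test_idx])                            # predict Pi
    accuracies.append(accuracy_score(labels[test_idx], preds))  # score the fold
print(round(sum(accuracies) / len(accuracies), 2))  # 1.0 on this toy data
```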
The 10-fold cross-validation was performed to evaluate the performance of ML-SAK and minimize threats to its validity. To evaluate the performance of the classifiers, the familiar and well-known evaluation parameters, i.e., precision, recall, and f-measure, were computed. Such parameters can be defined as
Precision = TP / (TP + FP),
Recall = TP / (TP + FN),
F-measure = (2 × Precision × Recall) / (Precision + Recall),
where Precision, Recall, and F-measure indicate the metrics used to predict secure and insecure apps for kids. TP is the number of correctly predicted apps that are secure for kids, TN is the number of correctly predicted apps that are insecure for kids, FP is the number of incorrectly predicted apps that are secure for kids, and FN is the number of incorrectly predicted apps that are insecure for kids.
We also calculated the Matthews correlation coefficient (MCC) to measure the quality of the classifier by using the following equation:
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))
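The four metrics can be computed directly from the confusion counts; the counts below are illustrative, not the paper’s results:

```python
import math

def metrics(tp, tn, fp, fn):
    # Direct transcription of the precision/recall/f-measure/MCC formulas.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return precision, recall, f_measure, mcc

p, r, f, m = metrics(tp=90, tn=40, fp=10, fn=5)
print(round(p, 2), round(r, 2), round(f, 2), round(m, 2))  # 0.9 0.95 0.92 0.77
```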
The above-mentioned metrics were computed to investigate the RQs. First, we computed the accuracy of ML-SAK and compared it with that of the baseline approaches (RP and ZR). Second, we analyzed the performance improvement caused by the review score in the prediction of secure and insecure apps. We compared the performance results of ML-SAK with the review score added and removed to answer RQ2. Third, we analyzed the effect of sampling by comparing the results of ML-SAK on a balanced and an imbalanced dataset to answer RQ3. Fourth, we identified the impact of preprocessing by comparing the performance results of ML-SAK with preprocessing enabled/disabled. Finally, we analyzed the proposed classifier’s efficiency by comparing its results with those of other well-known classifiers to answer RQ5.
4.4. Results
4.4.1. RQ1: Performance of ML-SAK
The average performance results of
ML-SAK,
RP, and
ZR were compared to investigate RQ1. The average evaluation results of
ML-SAK for the analysis of the impact of the scores of the user reviews (mentioned in
Section 3.4) are presented in
Table 2. The results of the candidate approaches are presented in
Table 2. Column 1 and Columns 2–5 present the approaches and the performance results of the metrics for each approach, respectively. The rows present the performance of each candidate approach.
From
Table 2, we can make the following observations:
The average results of the metrics (precision, recall, f-measure, and MCC) of the candidate approaches were (92.76%, 99.33%, 95.93%, 0.485), (65.23%, 65.64%, 65.43%, 0.298), and (82.58%, 100.00%, 90.46%, 0.367), respectively.
ML-SAK outperformed the RP and ZR classifiers.
ML-SAK showed improvements in precision compared to RP and ZR by 42.20% = (92.76% − 65.23%)/65.23% and 12.33% = (92.76% − 82.58%)/82.58%, respectively.
ML-SAK showed improvements in recall compared to RP and ZR by 51.33% = (99.33% − 65.64%)/65.64% and −0.67% = (99.33% − 100.00%)/100.00%, respectively. The reason for the decrease in the performance of ML-SAK in terms of recall against ZR is that ZR always predicts the majority class.
ML-SAK showed improvements in the f-measure compared to RP and ZR by 46.61% = (95.93% − 65.43%)/65.43% and 6.05% = (95.93% − 90.46%)/90.46%, respectively.
On the basis of the above-discussed literature and findings, this study can conclude that machine learning can help direct, educate, and inform learners in accordance with the SDGs, specifically SDG #4 (quality education). It can help increase access to affordable and quality education, quality early childhood development, and essential skills for sustainable development (i.e., peace and non-violence).
ML-SAK showed improvements in the MCC compared to RP and ZR by 62.75% = (0.485 − 0.298)/0.298 and 32.15% = (0.485 − 0.367)/0.367, respectively.
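Each improvement above is a relative difference, (new − baseline)/baseline × 100; for example, the precision improvement over RP:

```python
def improvement(new, baseline):
    # Relative improvement of `new` over `baseline`, in percent.
    return (new - baseline) / baseline * 100

# Precision of ML-SAK (92.76%) vs. RP (65.23%), as reported above.
print(round(improvement(92.76, 65.23), 2))  # 42.2
```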
Moreover, we provide the distribution of the cross-validation results for ML-SAK, RP, and ZR in a beanplot. In
Figure 3, one bean is plotted for each candidate approach for the comparison of the distributions; the short horizontal lines in a bean represent the results of the 10-fold cross-validation, and the long horizontal line presents the average result of each approach. The results of ML-SAK deviate less than those of RP and ZR, suggesting that ML-SAK is more reliable.
Furthermore, an ANOVA was performed (as presented in
Figure 4) to study the differences among ML-SAK, RP, and ZR. An ANOVA test confirms whether the observed differences in performance are statistically significant. The ANOVA returned a p-value of 2.89057; the factor had a significant difference at p < 0.05.
The above analysis allowed the conclusion that ML-SAK significantly improved the performance in comparison with that of RP and ZR.
4.4.2. RQ2: Impact of the Review Score
We compared the performance results of
ML-SAK by adding and subtracting the feature of the review scores to investigate RQ2. The evaluation results of
ML-SAK with and without the feature of the review score are presented in
Table 3. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the
ML-SAK’s performance with different input settings.
From these results, we can make the following major observations:
The results with the review score disabled were less accurate than the results with the review score enabled in the prediction of the security/suitability of apps for kids. Disabling the review score decreased the performance significantly, i.e., precision decreased from 92.76% to 91.21%, recall decreased from 99.33% to 98.42%, the f-measure decreased from 95.93% to 94.68%, and the MCC decreased from 0.485 to 0.464.
Second, enabling the review score resulted in a significant improvement in the prediction results, i.e., the improvements in the average precision, recall, f-measure, and MCC of ML-SAK reached 1.70% = (92.76% − 91.21%)/91.21%, 0.92% = (99.33% − 98.42%)/98.42%, 1.32% = (95.93% − 94.68%)/94.68%, and 4.53% = (0.485 − 0.464)/0.464, respectively.
The above analysis allows the conclusion that the review score significantly impacts the prediction and performance of ML-SAK. Consequently, it is better to predict with the review score enabled.
4.4.3. RQ3: Impact of Re-Sampling
Re-sampling balances the class samples and corrects the bias of a dataset. We applied under-sampling and over-sampling to investigate the increase in the performance of ML-SAK with sampling. These results are presented in
Table 4. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the performance of
ML-SAK with different input settings. Notably, the synthetic minority over-sampling technique (SMOTE) was adopted for over-sampling, and random n samples were selected from the majority class for under-sampling.
It is evident from
Table 4 that the performance (precision, recall, f-measure, and MCC) of ML-SAK under the two re-sampling settings was (94.79%, 99.45%, 97.06%, 0.653) and (93.28%, 99.40%, 96.24%, 0.579), respectively. This indicates that both re-sampling techniques improve the overall performance of ML-SAK.
4.4.4. RQ4: Impact of Preprocessing
The texts of user reviews contain meaningless words that can reduce the learning ability of machine learning algorithms. Therefore, removing such words is essential for improving the performance of machine learning algorithms.
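A minimal preprocessing sketch of this step; the stopword list is an illustrative subset, not the exact list used by ML-SAK:

```python
import re

STOPWORDS = {"the", "is", "a", "an", "and", "this", "it", "so"}  # illustrative subset

def preprocess(review):
    # Lowercase, strip special characters, and drop stopwords.
    review = review.lower()
    review = re.sub(r"[^a-z\s]", " ", review)
    return [t for t in review.split() if t not in STOPWORDS]

print(preprocess("This game is SO violent!!!"))  # ['game', 'violent']
```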
We compared the performance results of
ML-SAK by adding and subtracting the preprocessing step to investigate RQ4. The related performance results of
ML-SAK are presented in
Table 5. Column 1 and Columns 2–5 present the input settings and the evaluation metrics, whereas the rows of the table present the performance of
ML-SAK with different input settings.
From
Table 5, we can make the following observations:
The preprocessing of user reviews’ texts resulted in an improvement in performance and improved the average precision, recall, f-measure, and MCC of ML-SAK up to 0.34% = (92.76% − 92.45%)/92.45%, 0.38% = (99.33% − 98.95%)/98.95%, 0.36% = (95.93% − 95.59%)/95.59%, and 1.68% = (0.485 − 0.477)/0.477, respectively.
The possible reason for the improvement is that the user reviews included irrelevant and meaningless words, e.g., special characters.
We used the Lancaster stemming algorithm for stemming. For example, the Porter stemming algorithm returns the word ‘tri’ for the word ‘trying’, which has no meaning in textual analysis. However, the Lancaster stemming algorithm returns the word ‘try’ for ‘trying’, which positively impacts the textual analysis.
The above analysis allows the conclusion that the preprocessing of user reviews is critical for the prediction of the suitability of apps.
4.4.5. RQ5: Impacts of Other Classification Algorithms
We compared the performance results of
ML-SAK (SVM) with those of other machine/deep learning algorithms (
LR,
RF,
CNN, and
MNB). The related results for the classifiers are presented in
Table 6. Column 1 and Columns 2–5 present the classifiers and the evaluation metrics, whereas the rows of the table present the performance of the classifiers.
From
Table 6, we can make the following observations:
The SVM performed better than the LR, RF, MNB, and CNN classifiers in terms of precision, recall, f-measure, and MCC.
The
SVM is significant for different reasons. First, the
SVM draws a hyperplane in the feature space [
9,
33], which helps it provide a better generalization of the testing data, in contrast to that of
RF [
34]. Second, a linear
SVM searches different combinations within features and classifies samples with a low computational complexity [
33]. Moreover, an
SVM is considered to be better than other classification algorithms, e.g.,
LR,
RF, and
MNB, for the classification of long texts [
9].
Although deep learning algorithms, such as a
CNN, are better than machine learning algorithms for different classification problems [
35], the CNN did not perform better in the prediction of the suitability of apps for kids. A possible reason is that we had a small dataset for the evaluation of
ML-SAK, and deep learning classifiers, e.g., CNN, are good when dealing with large training datasets [
35].
The above analysis allows the conclusion that ML-SAK outperformed the other machine/deep learning classifiers in the prediction of the suitability of apps for kids.
6. Conclusions and Future Work
The identification of the suitability of mobile applications for kids is challenging due to the diversity of mobile applications. A support vector machine classifier was proposed to perform effective identification. The proposed model cleans the given reviews, conducts textual analysis to compute the review score of each review, and combines the preprocessed information and the review score of each review into a vector for the training and evaluation of the proposed model. The results of the 10-fold cross-validation indicate that the proposed model is effective, improving the precision, recall, and f-measure over the baselines by up to or more than 42%, 51%, and 46%, respectively.
Our work indicates that mobile application reviews significantly help in the prediction of the suitability of apps for kids. However, it would be interesting to validate the proposed approach on multiple, larger datasets. Furthermore, we would like to explore other metadata features of mobile applications and deep learning approaches to improve the performance of the proposed approach.