4.1.1. Situation Analysis
Let us consider the situation analysis step of the policymaking process. Here, a policy-maker first inspects the current literature on the factors affecting the dropout rate in the context of HL. Using our platform, the policy-maker can check whether the features reported in the literature are available in the data lake connected to the platform, or whether they can be derived from what is available. For instance, one feature relevant to the scenario of this walkthrough is the total usage of the HA device per month; if the dataset contains only the timestamps at which the HA device was turned on and off by a given subject, the usage time must be computed as the sum of these periods for each subject for each month. The platform supports policy-makers in these preliminary analyses by offering all the required pre-processing capabilities, and the policy-maker can combine analytics tasks into workflows to perform the data pre-processing.
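A pre-processing step of this kind can be sketched with pandas. The column names and event log below are illustrative assumptions, not the platform's actual schema:

```python
import pandas as pd

# Hypothetical on/off event log for the HA device (illustrative data).
events = pd.DataFrame({
    "subject_id": [1, 1, 1, 1],
    "on_ts":  pd.to_datetime(["2021-01-03 08:00", "2021-01-03 14:00",
                              "2021-02-01 09:00", "2021-02-02 09:00"]),
    "off_ts": pd.to_datetime(["2021-01-03 12:00", "2021-01-03 20:00",
                              "2021-02-01 17:00", "2021-02-02 18:00"]),
})

# Duration of each on/off interval, in hours.
events["usage_h"] = (events["off_ts"] - events["on_ts"]).dt.total_seconds() / 3600

# Total usage per subject per calendar month: sum the interval durations.
monthly = (events
           .assign(month=events["on_ts"].dt.to_period("M"))
           .groupby(["subject_id", "month"])["usage_h"]
           .sum()
           .reset_index(name="total_ha_usage_h"))
print(monthly)
```

For the sample events above, this yields 10 hours of usage in January and 17 hours in February for subject 1.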
Once the pre-processing workflow has run, the policy-maker can use the computed total HA usage (6,262,105 records in our case, covering all 1000 subjects of the trial) to carry out two trend analyses that have been shown in the literature to correlate with patient dropout, namely:
Average HA usage among the entire population grouped by the participants’ age for each month since the fitting date.
Average HA usage among the entire population grouped by the participants’ education level for each month since the fitting date.
For the first analysis, the participants were divided into six groups according to their age and its overall distribution: (i) participants younger than 35 years; (ii) participants between 35 and 45 years old; (iii) participants between 45 and 55 years old; (iv) participants between 55 and 65 years old; (v) participants between 65 and 75 years old; and (vi) participants older than 75 years.
Similarly, participants are divided into three groups according to their education level, namely:
Participants with a low education level (eight or fewer years in education, i.e., equivalent to elementary school).
Participants with a medium education level (between 9 and 14 years in education, i.e., equivalent to secondary education).
Participants with a high education level (14 or more years in education, i.e., equivalent to a university degree).
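The two groupings described above can be sketched with pandas. The sample table and column names are illustrative assumptions; note that the text places 14 years of education on the boundary of both the medium and high bands, so the sketch assigns it to the medium band:

```python
import pandas as pd

# Hypothetical per-subject monthly usage table (illustrative data).
df = pd.DataFrame({
    "subject_id": [1, 2, 3, 4],
    "age": [33, 50, 68, 80],
    "education_years": [7, 12, 16, 10],
    "month_since_fitting": [1, 1, 1, 1],
    "usage_h": [30.0, 40.0, 90.0, 70.0],
})

# Age bands as in the six groups above (right=False: [lo, hi) intervals).
age_bins = [0, 35, 45, 55, 65, 75, 200]
age_labels = ["<35", "35-45", "45-55", "55-65", "65-75", ">75"]
df["age_group"] = pd.cut(df["age"], bins=age_bins, labels=age_labels, right=False)

# Education bands: low (<=8 years), medium (9-14), high (>14 here).
df["edu_group"] = pd.cut(df["education_years"], bins=[0, 8, 14, 100],
                         labels=["low", "medium", "high"])

# Average usage per group for each month since the fitting date.
avg_by_age = (df.groupby(["age_group", "month_since_fitting"],
                         observed=True)["usage_h"].mean())
avg_by_edu = (df.groupby(["edu_group", "month_since_fitting"],
                         observed=True)["usage_h"].mean())
```

These grouped averages are the quantities plotted, per month, in the two trend analyses.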
The results of these two trend analyses are shown in Figure 4 and Figure 5.
As can be observed in Figure 4, the older participants, aged 65 to 75 years, used the HA device most consistently throughout the investigated period, followed by the groups aged 75 years or more and 55 to 65 years. On the other hand, younger (below 35 years old) and middle-aged (45 to 55 years old) participants show a similar, significantly lower trend of HA usage. Participants aged 35 to 45 years appear to be the group at highest risk of dropping out, sitting constantly at the bottom of average HA usage; moreover, participants in this age group exhibit a gradual decrease in HA usage over time.
Figure 5 shows the average HA usage per education-level group over time. Highly educated participants show both the best progress, i.e., steady growth of HA usage except for the last month, and the highest average HA usage over the examined period. The second group, i.e., participants with a medium education level, exhibits the highest HA usage in the starting month; however, as time passes, their HA usage grows more slowly and then stagnates. Lastly, participants with a low education level exhibit strong oscillations in their average HA usage. This group can therefore be considered the most likely to drop out of the study.
The showcased analyses suggest that highly educated patients aged 65 to 75 years are the safest group for the study.
Figure 6 compares the average HA usage of this group with that of the remaining participants.
As can be observed from the figure, this group exhibits a very positive usage trend and an overall high average usage time.
The performed analyses reveal some of the essential trends that policy-makers could use when deciding which groups of participants to focus on in future studies. These analyses conclude the situation analysis stage of the policymaking process and suggest how to proceed with the other stages, for which our platform provides prediction and simulation capabilities. In this case, for instance, it appears that older, highly educated patients do not drop out on average.
This evidence will be used in the next stages to tune the analysis and decide the dropout prediction analytics to be used and the features to be considered.
4.1.2. Action Plan Stage
With respect to the action plan, based on the preliminary results (in terms of features and corresponding correlations), the policy-maker can select the most suitable analysis approach for achieving the policy goal among the approaches suggested by the platform. In our scenario, the goal is to predict dropout in order to understand what drives it and to release a policy capable of selectively preventing it. In this walkthrough, we show how the policy-maker adopts a trial-and-error approach to find suitable analytics for this goal.
Given the analysis in the previous policymaking stage, the policy-maker selects the following features, available in the dataset of our scenario, that show a certain degree of correlation with the dropout: AVG_HA_USAGE (average hearing aid usage), TOTAL_HA_USAGE (total hearing aid usage), VARIANCE_HA_USAGE (variance of hearing aid usage), TOTAL_SCORE (MoCA mental score), AGE (patient’s age), HI_DEGREE_CURRHL_L (degree of hearing loss in the left ear), HI_DEGREE_CURRHL_R (degree of hearing loss in the right ear), HEARING_LOSS_SEVERITY (combined hearing loss level for both ears), and EDUCATION_LEVEL (education level).
In addition, since the entire dataset was labeled for dropout, the policy-maker selects DROPOUT (1: dropout, 0: non-dropout) as the classification label, covering both the known dropout patients and the patients still in the study. To reduce bias and deal with the class imbalance, the number of non-dropout records was reduced to match the number of patients marked as dropouts (134 records for each label).
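This balancing step amounts to random undersampling of the majority class, which can be sketched with pandas (the toy dataset below stands in for the real one):

```python
import pandas as pd

# Illustrative labeled dataset; DROPOUT as in the text (1 = dropout).
df = pd.DataFrame({
    "subject_id": range(10),
    "DROPOUT": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

dropouts = df[df["DROPOUT"] == 1]
# Randomly sample as many non-dropout records as there are dropouts
# (134 per label in the paper; 3 per label in this toy example).
non_dropouts = (df[df["DROPOUT"] == 0]
                .sample(n=len(dropouts), random_state=42))
balanced = pd.concat([dropouts, non_dropouts])
```

The fixed `random_state` only makes the sketch reproducible; any random seed serves the same purpose.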
To demonstrate the prediction capabilities of our platform, three different classification algorithms were selected, namely decision tree classification, logistic regression, and support vector machines (SVM). Afterward, the previously selected dataset was randomly split into training and testing sets using three different ratios: 70:30 (i.e., 70% training and 30% testing), 60:40 (i.e., 60% training and 40% testing), and 50:50 (i.e., 50% training and 50% testing). Furthermore, for the purpose of this demonstration, four of the available features were selected: AVG_HA_USAGE, VAR_HA_USAGE, AGE, and EDUCATION_LEVEL. Decision tree, logistic regression, and SVM models were then trained on all training sets. The decision tree classifier was built using the default parameters, whereas the SVM and logistic regression model parameters were tuned to boost their performance.
Afterward, predictions were made on the corresponding test datasets; in the case of the model trained on the entire dataset, the prediction was also computed on the whole dataset.
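A minimal sketch of this experimental setup with scikit-learn follows. The synthetic data and the untuned hyperparameters are assumptions for illustration; the real experiment used the four selected features and tuned parameter values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in for the 268-record balanced dataset with the four
# features (AVG_HA_USAGE, VAR_HA_USAGE, AGE, EDUCATION_LEVEL).
X, y = make_classification(n_samples=268, n_features=4, random_state=0)

models = {
    "decision_tree": DecisionTreeClassifier(random_state=0),   # defaults
    "logistic_regression": LogisticRegression(max_iter=1000),  # tuned in the paper
    "svm": SVC(kernel="rbf", C=1.0),                           # tuned in the paper
}

# Evaluate each model on the three splits used in the walkthrough.
scores = {}
for test_size in (0.3, 0.4, 0.5):  # 70:30, 60:40, 50:50
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=test_size, random_state=0)
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        scores[(name, test_size)] = model.score(X_te, y_te)
```

Each entry in `scores` is the test-set accuracy of one model on one split, mirroring the nine model/split combinations reported below.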
Table 1 summarizes the performance of each algorithm. In terms of correctly predicting dropout cases (true positives, TP), decision tree classification performed best in all three cases (correctly predicting 62 out of 64, 49 out of 51, and 36 out of 37 dropout cases in the 50:50, 60:40, and 70:30 splits, respectively), while SVM performed worst in all three cases (correctly predicting 57 out of 64, 40 out of 51, and 32 out of 37 dropout cases, respectively). Furthermore, for correctly predicting non-dropouts (true negatives, TN), decision tree classification again showed the best overall performance, leading in two of the three cases, namely the 50:50 and 70:30 splits with 69 out of 70 and 38 out of 39 correctly predicted labels, respectively. In the 60:40 split, SVM outperformed decision tree classification by correctly predicting all 55 cases, compared to 50 for decision tree classification.
The corresponding precision, recall, and F-measure results are shown in
Figure 7. As can be observed, decision tree classification performed best in all three cases in terms of recall and F-measure, and in two out of three cases in terms of precision. Logistic regression and SVM showed comparable performance, which was in turn significantly lower than that of decision tree classification, especially in the third case, where the dataset was split 70:30. This performance drop can potentially be attributed to the need for further parameter tuning of those models. Given its ease of use and high performance, decision tree classification is the recommended classification algorithm for this specific use case.
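The reported metrics follow directly from the confusion-matrix counts in Table 1. A small sketch, applied to the decision tree's 50:50 split as an example:

```python
def prf(tp, fp, fn):
    """Precision, recall and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Decision tree, 50:50 split (Table 1): 62/64 dropouts predicted
# (TP=62, FN=2) and 69/70 non-dropouts (TN=69, FP=1).
p, r, f = prf(tp=62, fp=1, fn=2)
```

Here recall is 62/64 ≈ 0.969 and precision 62/63 ≈ 0.984, with the F-measure as their harmonic mean.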
By following the given use case, a policy-maker can apply the same algorithms to any combination of the available features, as well as perform further parameter tuning to achieve the desired performance.