The following sections present the results for each of the three sets of experiments performed. First, we report baseline results for all the proposed models. Next, we study how the inclusion of Doc2Vec features affects those models. Finally, we incorporate multiple instance learning into the fixed and threshold early detection models.
4.4.1. Baseline
In this section, we present the baseline results, using the metric introduced in [43], for all three models used in the experiments: the fixed model, the threshold model, and the dual model.
Firstly, it must be noticed in Table 2 that, even if the general values differ, the best decision point for each fixed model is almost the same for all models and datasets. This can be explained by the fact that the amount of information processed is almost the same for all models and both datasets (i.e., point 10 in eight cases out of ten), which leads to the same optimal decision point with respect to the penalisation applied by the metric. On Instagram, the best baseline performance is obtained with LDA features for naïve Bayes (at point 10). For the Vine dataset, the best baseline performance combines Tf-idf and LDA features for naïve Bayes (also at point 10).
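The decision procedure of the fixed model described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and variable names are ours, and `clf`/`vectorizer` stand for any fitted scikit-learn classifier and text vectorizer, trained here on toy data rather than the Instagram or Vine sessions.

```python
def fixed_early_decision(clf, vectorizer, posts, point=10):
    """Fixed early detection model (sketch): concatenate the first
    `point` posts of a session and classify once, at that point."""
    seen = " ".join(posts[:point])           # only the first `point` items are read
    features = vectorizer.transform([seen])
    return int(clf.predict(features)[0])     # 1 = cyberbullying, 0 = otherwise

# Minimal usage with a toy corpus (illustrative data, not the paper's):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = TfidfVectorizer()
X_train = vec.fit_transform(["you are awful and stupid",
                             "what a nice picture friend"])
nb = MultinomialNB().fit(X_train, [1, 0])
label = fixed_early_decision(nb, vec, ["you are awful", "so stupid", "really"], point=2)
```

The key property is that the amount of information read is fixed in advance, which is why the metric's time penalisation depends only on the chosen decision point.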
Secondly, Table 3 shows the results for the threshold model, where more differences can be seen, since variation in the thresholds does not correlate directly with the number of items processed before making a decision. This contrasts with the results shown previously, but can be explained by the fact that a higher threshold may mean a lower probability of making the wrong choice, but may also increase the number of items processed before the final decision. As for the results, focusing on Instagram, the best baseline score is obtained with Tf-idf features using naïve Bayes. For the Vine dataset, the best baseline score uses LDA features, also with the naïve Bayes model.
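The threshold model's stopping rule can be sketched as follows; this is an illustrative reconstruction under our own naming, assuming a binary scikit-learn classifier with `classes_ == [0, 1]`, and the default threshold values here are placeholders rather than the paper's.

```python
def threshold_early_decision(clf, vectorizer, posts, pos_th=0.3, neg_th=0.9):
    """Threshold early detection model (sketch): after each new post,
    re-classify everything seen so far; commit to the positive class as
    soon as its probability reaches the (low) positive threshold, and to
    the negative class once it reaches the (high) negative threshold.
    Returns (label, number_of_posts_read)."""
    seen = []
    for i, post in enumerate(posts, start=1):
        seen.append(post)
        proba = clf.predict_proba(vectorizer.transform([" ".join(seen)]))[0]
        if proba[1] >= pos_th:   # easy-to-surpass positive threshold
            return 1, i
        if proba[0] >= neg_th:   # high-confidence negative threshold
            return 0, i
    return 0, len(posts)         # no threshold crossed: default to negative

# Minimal usage with a toy corpus (illustrative data, not the paper's):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = TfidfVectorizer()
X_train = vec.fit_transform(["you are awful and stupid",
                             "what a nice picture friend"])
nb = MultinomialNB().fit(X_train, [1, 0])
label, n_read = threshold_early_decision(nb, vec, ["you are awful", "stupid"], pos_th=0.5)
```

This makes the trade-off above concrete: raising `neg_th` lowers the chance of a wrong negative decision but can delay it, increasing the number of items read.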
Finally, regarding Table 4, even taking into account the difficulty of interpreting the results due to the complexity of the dual model (comprising one model for positive cases and another for negative cases, each with its own features and thresholds), it can be seen that the same pattern applies, with slight differences in the configuration of the models, as stated for the fixed and threshold baselines. For instance, the same values are obtained for all combinations of negative features within the same model, as is the case of NB for both the Instagram and Vine datasets. For these results, different threshold configurations were tested, using low values for the positive threshold and high values for the negative threshold, as in the threshold early detection model. However, for the sake of simplicity, Table 4 shows the results for the configuration that achieved the best results.
In this case, for the Instagram dataset, the best outcome occurs when only Tf-idf features are used as positive features, regardless of the combination of negative features selected for the naïve Bayes model. For the Vine dataset, all combinations of negative features also yield the same output, likewise reaching the highest value for naïve Bayes when LDA features are used.
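The dual model's structure, with separate classifiers and feature sets for the positive and negative decisions, can be sketched along the same lines. All names and defaults here are illustrative, and the two sides are instantiated below with the same toy model only for brevity; in the experiments above, each side has its own features and thresholds.

```python
def dual_early_decision(pos_clf, pos_vec, neg_clf, neg_vec, posts,
                        pos_th=0.3, neg_th=0.9):
    """Dual early detection model (sketch): one classifier, with its own
    feature extractor, decides the positive class against a low threshold,
    while a second, independent classifier decides the negative class
    against a high threshold. Returns (label, number_of_posts_read)."""
    seen = []
    for i, post in enumerate(posts, start=1):
        seen.append(post)
        text = " ".join(seen)
        p_pos = pos_clf.predict_proba(pos_vec.transform([text]))[0][1]
        p_neg = neg_clf.predict_proba(neg_vec.transform([text]))[0][0]
        if p_pos >= pos_th:
            return 1, i
        if p_neg >= neg_th:
            return 0, i
    return 0, len(posts)

# Minimal usage with a toy corpus (illustrative data, not the paper's):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

vec = CountVectorizer()
X_train = vec.fit_transform(["you are awful and stupid",
                             "what a nice picture friend"])
nb = MultinomialNB().fit(X_train, [1, 0])
label, n_read = dual_early_decision(nb, vec, nb, vec,
                                    ["what a nice picture", "friend"],
                                    pos_th=0.9, neg_th=0.5)
```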
4.4.2. Doc2Vec Features
In this first set of experiments, we evaluated whether Doc2Vec features can significantly improve performance. In all the experiments, sentiment analysis and profane word features were included, and different combinations of syntactic and semantic features (i.e., Tf-idf, LDA, and Doc2Vec) were tested.
Figure 3a,b show the results for the fixed early detection model at points 1, 5, 10, 15, and 25 for the Instagram and Vine datasets, respectively. The X-axis represents the syntactic and semantic features extracted from the posts. The first three items correspond to standard features (i.e., Tf-idf, LDA, and the combination of both), which constitute our baseline. The remaining items correspond to Doc2Vec and its combination with Tf-idf and LDA features.
From Figure 3a, we observe that the use of Doc2Vec features (stand-alone or in combination) improves performance, especially for the logistic regression and SVM models, independently of the point at which the fixed model makes its decision. This behaviour is clearer on the Vine dataset (Figure 3b), where there is a higher improvement for all machine learning models except naïve Bayes.
Introducing Doc2Vec features immediately improves performance. On the Instagram dataset, the best-performing model combines Doc2Vec with LDA, improving on the previous 0.3755 value, while on Vine the use of Doc2Vec also improves, likewise at point 10, on the 0.3794 previously obtained.
Following this, we replicated the same experiments for the threshold early detection model.
Figure 4a,b show the results for both datasets. Since this model requires positive and negative threshold values, we show different line types for the different thresholds. We limited our results to low positive thresholds and high negative thresholds, since they have provided good results in previous works [6]. The rationale behind these values is clear: to provide an easy-to-surpass threshold for the early detection of cyberbullying cases with little information, while negative cases require a higher degree of confidence.
From the figures, we observe a similar behaviour to the previous case, with an important performance improvement: introducing Doc2Vec in combination with the other features significantly improves performance, except for naïve Bayes and extra trees, where the difference is smaller. With this model, on the Instagram dataset, the improvement is only clearly present for one threshold configuration, while on Vine a slight increase is present with the inclusion of Doc2Vec in some of the feature combinations; in this last case, the results are worse without the use of Tf-idf features. It must be noticed that, in any case, the best results for naïve Bayes are higher than those achieved with the other models. On Instagram, naïve Bayes is the best-performing model, using Doc2Vec with Tf-idf features when a low positive threshold is used, regardless of the negative threshold.
On the other hand, Vine’s best-performing threshold model uses AdaBoost and requires Doc2Vec and LDA features, significantly improving the baseline performance.
Finally, in this set of experiments, we validated the Doc2Vec behaviour for the dual model; the results are shown in Figure 5a,b. Each column represents the results for one machine learning model, while rows correspond to positive features. Since the best scores were achieved using Doc2Vec as positive features, we limited the rows to those cases. As in the previous figures, negative features are represented on the X-axis, and different values for the positive and negative thresholds are represented in each graph.
Focusing on the Instagram dataset (Figure 5a), the best-performing models are AdaBoost and naïve Bayes, and a low positive threshold tends to achieve better results in most cases. The best score corresponds to naïve Bayes using Doc2Vec and Tf-idf as positive features, for all negative feature combinations, with a low positive threshold.
With respect to the Vine dataset (Figure 5b), we observed a limited impact on all machine learning models for some feature combinations, both positive and negative. Only low negative thresholds present a notable variation, for the logistic regression model; in the rest of the cases, the performance remains quite stable. AdaBoost and extra trees achieve good results in general and present a consistent behaviour, independently of the positive and negative thresholds. In fact, the best score is obtained by AdaBoost, using all features as negative and Doc2Vec and LDA as positive. This result significantly improves the dual model baseline performance for the Vine dataset.
Table 5 summarises the main results for this first set of experiments, comprising the inclusion of Doc2Vec features. We observe that its incorporation significantly improves the results obtained by the early detection models for both datasets. Interestingly, the performance improvement is more pronounced on the Vine dataset. We consider that this may be motivated by the smaller post sizes (in terms of word count) and the limitations of Tf-idf and LDA in extracting valid features, while Doc2Vec better captures the available information.
Regarding the machine learning models, naïve Bayes for the Instagram dataset and AdaBoost for Vine provide the best results. Another interesting point is the fact that the threshold and dual models significantly improve performance over the fixed model for all cases, except for the threshold model on the Vine dataset. However, there is no significant statistical difference in performance between them. In the remaining experiments, we focus on the threshold early detection model, since it requires a simpler configuration, and the performance results can be considered equivalent to the dual model.
4.4.3. Multiple Instance Learning
In the final set of experiments, we study whether the use of MIL has a positive impact on the early detection of cyberbullying, as measured with the early detection metric. With this aim, we incorporated MIL into the fixed early detection model by adding a bag representation of the processed posts. These bags were generated in a step prior to training the ML models. This addresses the ambiguity that arises when a single label is assigned to an entity composed of multiple items, each of which constitutes an alternative instance. The features of the posts in a bag are aggregated using the following functions: minimum, maximum, average of maximum and minimum, arithmetic mean, and median.
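The aggregation step just described can be sketched as a single element-wise reduction over the per-post feature vectors of a bag. The function name and the toy values below are illustrative; only the five aggregation functions come from the text above.

```python
import numpy as np

def aggregate_bag(post_features, how="mean"):
    """Aggregate the per-post feature vectors of a bag into a single
    instance-level vector (sketch of the MIL aggregation step)."""
    X = np.asarray(post_features, dtype=float)   # shape: (n_posts, n_features)
    if how == "min":
        return X.min(axis=0)
    if how == "max":
        return X.max(axis=0)
    if how == "midrange":                        # average of maximum and minimum
        return (X.max(axis=0) + X.min(axis=0)) / 2.0
    if how == "mean":
        return X.mean(axis=0)
    if how == "median":
        return np.median(X, axis=0)
    raise ValueError(f"unknown aggregation: {how}")

# Toy bag of three posts with two features each (illustrative values):
bag = [[0.0, 2.0], [1.0, 4.0], [2.0, 0.0]]
agg_mean = aggregate_bag(bag, "mean")
agg_max = aggregate_bag(bag, "max")
```

Note that the output has the same dimensionality as a single post's feature vector, so the downstream classifier is unchanged; only its input representation differs from the non-MIL case.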
To ease the comparison, we included an aggregation function, denoted as None, that represents the fixed early detection model without MIL. In all these experiments, we ran the models with the combination of all the features defined: profane words, sentiment analysis, Tf-idf, LDA, and Doc2Vec.
Figure 6a,b show the results obtained for the fixed model.
Regarding the Instagram dataset (Figure 6a), the use of MIL in many cases reduces the performance of the model, with only slight improvements for all models (except logistic regression) at the higher points. In fact, the best score using MIL is obtained by SVM at point 10 using the minimum as aggregation function. However, there is no statistical difference in comparison with the fixed model without MIL, and the result is significantly worse than those of the threshold and dual models.
However, when considering the results for the Vine dataset (Figure 6b), there is a significant change, especially for the AdaBoost and extra trees models, for which MIL provides major improvements. The best score in this case is achieved by AdaBoost at point 10 using the arithmetic mean, and it is significantly better than the results without MIL for the fixed, threshold, and dual models.
We consider that this important performance improvement from multiple instance learning on the Vine dataset is directly related to the reduced post sizes on this social network and the lower use of standard English. On average, each post is composed of about 5 words, which provides limited information for the different syntactic and semantic features to extract. However, a simple aggregation function (e.g., the average) applied over a number of posts combines the information from the different posts better than concatenating the posts themselves. On the other hand, Instagram posts are longer and mostly in English, and therefore the extracted features provide sufficient information from each post, making the aggregation less relevant.
We also conducted experiments testing different sampling alternatives for the training set, instead of using all post sessions. In particular, we analysed using the first 10 posts, the last 10 posts, or 10 random posts, with no relevant improvements obtained for the fixed early detection model using MIL.
We also explored the impact of MIL on the performance of the threshold early detection model. The experiments were limited to the AdaBoost and extra trees models, since they provided the highest improvements in our previous experiments, and again we used all features in our models. In this case, due to the important change in the features introduced by MIL, we considered the whole range of values for the positive and negative thresholds.
Figure 7a,b present the results for these experiments.
The behaviour on the Instagram dataset (Figure 7a), as expected, is below the results achieved by the baselines provided in Section 4.4.2 for the fixed, threshold, and dual models (being significantly worse than the last two). Increasing the negative threshold improves performance, and the median aggregation function consistently provides the best results in all cases, but there is no significant improvement for this dataset.
Regarding the Vine dataset (Figure 7b), we observe that AdaBoost provides better performance than extra trees. As in the previous case, increasing the negative threshold improves performance, while the best values for the positive threshold differ from those of the standard threshold model. This is a meaningful change, motivated by the aggregation of information from multiple posts, which increases the class probability for the detected cyberbullying cases. Among the aggregation functions, the maximum concentrates the highest scores.
In fact, the best score is obtained by AdaBoost using the maximum aggregation function. Again, this model significantly improves performance over the best fixed, threshold, and dual models in Section 4.4.2, although there is no statistical difference with respect to the fixed model with MIL from our previous experiments.