**6. Results**

The accuracy scores are validated via 10-fold cross-validation. The baseline neural network (NN) model [**?**] under-performs, as expected [**?**]. For the SEM2018 datasets, both ensembles outperform each individual model. The stacked ensemble provides the best results on the Train subset, while the weighted ensemble marginally outperforms the stacked ensemble on the Development subset (Table 2). The accuracy of both ensembles is limited, to a degree, by the inherent bias of the dataset. Both of our ensembles outperform all models submitted to the Codalab Competition (https://competitions.codalab.org/competitions/17751#results).
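The 10-fold protocol above can be sketched as follows. This is an illustrative skeleton, not the paper's actual pipeline: `predict` stands in for a trained model's prediction on sample `i`, and in the real setup each fold would retrain the network on the training indices, which is omitted here.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    fold_size = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        # Last fold absorbs any remainder when n_samples % k != 0.
        start = i * fold_size
        stop = (i + 1) * fold_size if i < k - 1 else n_samples
        test_idx = indices[start:stop]
        train_idx = indices[:start] + indices[stop:]
        yield train_idx, test_idx

def cross_validated_accuracy(labels, predict, k=10):
    """Mean accuracy over k held-out folds.

    `predict(i)` returns the predicted label for sample i; a real run
    would fit the model on train_idx before scoring test_idx.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        correct = sum(1 for i in test_idx if predict(i) == labels[i])
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)
```

The reported score is the mean over the ten held-out folds, so every sample is scored exactly once.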


**Table 2.** Accuracy score for SEM2018.

The baseline NN performed better on the TOXIC dataset. Its performance is boosted by the large number of unclassified samples in the dataset, Table **??**: more than 40% of the samples, as seen in Figure **??**b. The TOXIC dataset included more than 25,000 unique terms before cleaning. The number of unique terms determines the size of the tokenization vocabulary and, subsequently, the dimension of the embedding. The required dimensionality reduction shortened the training time of each model but hurt its performance. Our best-performing model is in the top 35% of the Kaggle Competition submissions (https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/leaderboard), but more than 1% behind the top-performing one.
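The vocabulary truncation described above can be sketched as follows: keep only the most frequent terms so the tokenization, and hence the embedding matrix, stays small. The cap of 20,000 terms and the shared out-of-vocabulary index 0 are illustrative choices, not the paper's actual settings.

```python
from collections import Counter

def build_vocab(texts, max_terms=20000):
    """Map the `max_terms` most frequent tokens to indices 1..max_terms.

    Every other token falls back to a shared out-of-vocabulary index 0,
    which bounds the embedding matrix at (max_terms + 1) rows.
    """
    counts = Counter(tok for text in texts for tok in text.split())
    return {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common(max_terms))}

def tokenize(text, vocab):
    """Convert a whitespace-split text into vocabulary indices."""
    return [vocab.get(tok, 0) for tok in text.split()]
```

Lowering `max_terms` trades training time for accuracy, which is the tension the paragraph above describes for TOXIC.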

**Table 3.** Accuracy score for TOXIC.


Our ensemble methods improved upon the single models in six out of eight cases; classification accuracy is improved by at least one of the ensembles across both datasets. The ensembles for the SEM2018 dataset performed excellently compared to the other architectures. On the other hand, the extensive data cleaning of TOXIC (a requirement due to computation/time constraints) hindered the performance of our models and their ensembles. Given the heavy class imbalance and the cleaning of TOXIC, the achieved accuracy of over 97% is decent. The baseline stacked ensemble under-performed our proposed weighted ensemble in three out of four cases.
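A minimal sketch of the weighted-ensemble combination discussed above: each model's class-probability vector is averaged under fixed weights. The weighting scheme (e.g. weights proportional to each model's validation accuracy) is an assumption for illustration, not necessarily the paper's exact scheme.

```python
def weighted_ensemble(prob_lists, weights):
    """Weighted average of per-model class-probability vectors.

    prob_lists: one probability vector per model, all of equal length.
    weights:    one non-negative weight per model (e.g. proportional to
                each model's validation accuracy -- an assumed scheme).
    """
    total = sum(weights)
    n_classes = len(prob_lists[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, prob_lists)) / total
        for c in range(n_classes)
    ]
```

A stacked ensemble would instead feed the per-model probabilities into a trained meta-classifier; the weighted variant above has no trainable combiner, which is what makes it cheap to evaluate.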

All the models presented, and by extension their ensembles, can be further improved by a range of techniques. Text augmentation [**?**], hyperparameter optimization [**?**], bias reduction [**?**], and tailored emotional embeddings [**? ?**] are some techniques that could further improve the generalization capabilities of our networks. However, the computational load over multiple iterations is extensive, as the most complex models required hours of training per epoch and dataset.
