**5. Datasets**

For our comparison, we require datasets with binary multilabel categorization. We identified two modern datasets: the SemEval-2018 Task 1 dataset (SEM2018), https://competitions.codalab.org/competitions/17751, and the Toxic Comments dataset from Kaggle (TOXIC), https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge.

Both datasets exhibit some class imbalance (Figures **??**a and **??**a). However, they differ not only in context, with SEM2018 drawn from Twitter and TOXIC from Wikipedia, but also in the properties of the text itself. After cleaning, sentence lengths differ from those of the original text, mainly because infrequent terms are removed (Table **??**). As discussed earlier, the dimensionality of our term embeddings needs to be low. We reduced it by removing terms that appear no more than 10 times, alongside a tailored stop-term removal.
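The vocabulary reduction described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function name `prune_vocabulary`, its parameters, and the toy documents are hypothetical, and it assumes that "appears no more than 10 times" refers to total occurrences across the corpus rather than document frequency. The tailored stop-term list is not reproduced here and is passed in as a plain set.

```python
from collections import Counter


def prune_vocabulary(tokenized_docs, max_rare_count=10, stop_terms=frozenset()):
    """Remove stop terms and terms occurring at most `max_rare_count` times.

    Assumption: frequency is counted as total occurrences over the corpus.
    Returns the pruned documents and the set of retained terms.
    """
    counts = Counter(term for doc in tokenized_docs for term in doc)
    keep = {
        term
        for term, count in counts.items()
        if count > max_rare_count and term not in stop_terms
    }
    pruned = [[term for term in doc if term in keep] for doc in tokenized_docs]
    return pruned, keep


# Toy corpus with a lowered threshold so the effect is visible.
docs = [["good", "day", "the"], ["good", "night", "the"], ["good", "the"]]
pruned, vocab = prune_vocabulary(docs, max_rare_count=1, stop_terms={"the"})
```

With the toy corpus, only `good` survives: `day` and `night` are too rare, and `the` is treated as a stop term. Shorter sentences after cleaning are a direct consequence of this filtering.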


**Table 1.** Sentence length.

**Figure 8.** SEM2018 class distribution (**a**) and frequency of unique class combinations (**b**).
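As a rough sketch of how the two quantities in such a figure could be computed from binary multilabel annotations, the following counts positives per class and occurrences of each distinct label combination. The function name `label_statistics`, the class names, and the toy label rows are illustrative assumptions, not data from either dataset.

```python
from collections import Counter


def label_statistics(label_rows, class_names):
    """Return per-class positive counts and counts of each label combination.

    `label_rows` is a list of 0/1 lists, one entry per class in `class_names`.
    """
    per_class = {
        name: sum(row[i] for row in label_rows)
        for i, name in enumerate(class_names)
    }
    combinations = Counter(tuple(row) for row in label_rows)
    return per_class, combinations


# Hypothetical toy annotations (not taken from SEM2018 or TOXIC).
names = ["anger", "joy", "sadness"]
rows = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [0, 0, 1]]
per_class, combinations = label_statistics(rows, names)
```

Here `per_class` corresponds to panel (**a**) and `combinations` to panel (**b**); a strongly skewed `per_class` or a few dominant entries in `combinations` would indicate the kind of imbalance discussed in this section.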
