Hateful Memes Detection Based on Multi-Task Learning
Abstract
1. Introduction
- A new artificial intelligence model is proposed for hateful memes detection; it improves detection accuracy, outperforming all compared models.
- The multi-task strategy and the adaptive weight adjustment strategy used in our model capture the consistency and the variability between modalities, measurably improving the model's generalization and robustness (a code sketch follows this list).
- Our auxiliary tasks, built on a self-supervised unimodal label generation module, enhance feature learning without human-defined unimodal labels or additional data.
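To make the second point concrete, here is a minimal PyTorch sketch of such a weighted multi-task objective. The class name, the binary cross-entropy choice, and the fixed default weights are illustrative assumptions, not the authors' exact implementation; the paper's actual adaptive weight adjustment is specified in Section 3.4.

```python
import torch
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Weighted sum of the main multimodal loss (M) and the two
    unimodal auxiliary losses (T, V). The fixed weights w_t and w_v
    are placeholders standing in for the adaptive weight adjustment
    strategy of Section 3.4."""

    def __init__(self, w_t: float = 0.5, w_v: float = 0.5):
        super().__init__()
        self.w_t, self.w_v = w_t, w_v
        self.bce = nn.BCEWithLogitsLoss()

    def forward(self, logits_m, logits_t, logits_v, y_m, y_t, y_v):
        loss_m = self.bce(logits_m, y_m)  # main hateful-meme task
        loss_t = self.bce(logits_t, y_t)  # text auxiliary task
        loss_v = self.bce(logits_v, y_v)  # image auxiliary task
        return loss_m + self.w_t * loss_t + self.w_v * loss_v
```

Called with batch tensors of shape (B, 1), the module returns a scalar loss on which `backward()` can be invoked.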
2. Related Works
2.1. Datasets
2.2. Textual Model
2.3. Visual Model
2.4. Multimodal Model
2.5. Multi-Task Learning
3. Method
3.1. Setup
3.2. Architecture
3.3. Unimodal Label Generation Module
3.4. Optimization Objectives
Algorithm 1: The training procedure of our model [24]
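Algorithm 1 itself is not reproduced here; the following is a minimal, self-contained sketch of one possible training stage in its spirit. Everything in it is a toy stand-in: the heads are hypothetical, and the pseudo-label rule is only an assumed, simplified form of the Unimodal Label Generation Module (ULGM) of Section 3.3.

```python
import torch
import torch.nn as nn

# Toy stand-ins so the loop runs end to end; the real encoders and
# fusion module follow Section 3.2 of the paper.
text_head = nn.Linear(32, 1)    # hypothetical text classifier head
image_head = nn.Linear(32, 1)   # hypothetical image classifier head
fusion_head = nn.Linear(64, 1)  # hypothetical multimodal fusion head
params = (list(text_head.parameters()) + list(image_head.parameters())
          + list(fusion_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
bce = nn.BCEWithLogitsLoss()

for step in range(100):
    text_feat = torch.randn(8, 32)            # placeholder text features
    image_feat = torch.randn(8, 32)           # placeholder image features
    y = torch.randint(0, 2, (8, 1)).float()   # hateful / non-hateful labels

    logits_t = text_head(text_feat)
    logits_v = image_head(image_feat)
    logits_m = fusion_head(torch.cat([text_feat, image_feat], dim=-1))

    # ULGM stand-in: derive unimodal auxiliary labels from the model's
    # own unimodal predictions, so no human-defined unimodal labels are
    # needed (the paper's actual rule is given in Section 3.3).
    with torch.no_grad():
        y_t = torch.sigmoid(logits_t).round()
        y_v = torch.sigmoid(logits_v).round()

    # Weighted multi-task objective; the 0.5 factors mirror the
    # placeholder auxiliary weights in the MultiTaskLoss sketch above.
    loss = (bce(logits_m, y) + 0.5 * bce(logits_t, y_t)
            + 0.5 * bce(logits_v, y_v))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```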
4. Experiments
4.1. Dataset
4.2. Compared Models
4.3. Results
4.4. Ablation Study
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Devroye, L.; Györfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition; Springer Science & Business Media: New York, NY, USA, 2013; Volume 31.
2. Fan, J.; Li, R.; Zhang, C.H.; Zou, H. Statistical Foundations of Data Science; Chapman and Hall/CRC: New York, NY, USA, 2020.
3. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009; Volume 2.
4. Mohri, M.; Rostamizadeh, A.; Talwalkar, A. Foundations of Machine Learning; MIT Press: Cambridge, MA, USA, 2018.
5. Bertsekas, D.P. Nonlinear programming. J. Oper. Res. Soc. 1997, 48, 334.
6. Tewari, A.; Bartlett, P.L. On the consistency of multiclass classification methods. J. Mach. Learn. Res. 2007, 8, 1007–1025.
7. Zhang, T. Statistical analysis of some multi-category large margin classification methods. J. Mach. Learn. Res. 2004, 5, 1225–1251.
8. Vapnik, V.N. An overview of statistical learning theory. IEEE Trans. Neural Netw. 1999, 10, 988–999.
9. Zadeh, A.; Chen, M.; Poria, S.; Cambria, E.; Morency, L.P. Tensor fusion network for multimodal sentiment analysis. arXiv 2017, arXiv:1707.07250.
10. Tsai, Y.H.H.; Bai, S.; Liang, P.P.; Kolter, J.Z.; Morency, L.P.; Salakhutdinov, R. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, 28 July–2 August 2019; p. 6558.
11. Poria, S.; Hazarika, D.; Majumder, N.; Mihalcea, R. Beneath the tip of the iceberg: Current challenges and new directions in sentiment analysis research. IEEE Trans. Affect. Comput. 2020, 1.
12. Bartlett, P.L.; Jordan, M.I.; McAuliffe, J.D. Convexity, classification, and risk bounds. J. Am. Stat. Assoc. 2006, 101, 138–156.
13. Garibo i Orts, Ò. Multilingual detection of hate speech against immigrants and women in Twitter at SemEval-2019 Task 5: Frequency analysis interpolation for hate in speech detection. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 460–463.
14. Burnap, P.; Williams, M.L. Hate speech, machine classification and statistical modelling of information flows on Twitter: Interpretation and communication for policy decision making. In Proceedings of the Internet, Policy & Politics Conference, Oxford, UK, 26 September 2014.
15. Baltrušaitis, T.; Ahuja, C.; Morency, L.P. Multimodal machine learning: A survey and taxonomy. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 423–443.
16. Guo, W.; Wang, J.; Wang, S. Deep multimodal representation learning: A survey. IEEE Access 2019, 7, 63373–63394.
17. Zadeh, A.; Liang, P.P.; Mazumder, N.; Poria, S.; Cambria, E.; Morency, L.P. Memory fusion network for multi-view sequential learning. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
18. Sun, Z.; Sarma, P.; Sethares, W.; Liang, Y. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 8992–8999.
19. Rahman, W.; Hasan, M.K.; Lee, S.; Zadeh, A.; Mao, C.; Morency, L.P.; Hoque, E. Integrating multimodal information in large pretrained transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; p. 2359.
20. Wang, S.; Zhang, H.; Wang, H. Object co-segmentation via weakly supervised data fusion. Comput. Vis. Image Underst. 2017, 155, 43–54.
21. Hazarika, D.; Zimmermann, R.; Poria, S. MISA: Modality-invariant and -specific representations for multimodal sentiment analysis. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1122–1131.
22. Yu, W.; Xu, H.; Meng, F.; Zhu, Y.; Ma, Y.; Wu, J.; Zou, J.; Yang, K. CH-SIMS: A Chinese multimodal sentiment analysis dataset with fine-grained annotation of modality. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 3718–3727.
23. Zhang, Y.; Yang, Q. A survey on multi-task learning. IEEE Trans. Knowl. Data Eng. 2021, 34, 5586–5609.
24. Yu, W.; Xu, H.; Yuan, Z.; Wu, J. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 10790–10797.
25. Kiela, D.; Firooz, H.; Mohan, A.; Goswami, V.; Singh, A.; Ringshia, P.; Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. Adv. Neural Inf. Process. Syst. 2020, 33, 2611–2624.
26. Zadeh, A.B.; Liang, P.P.; Poria, S.; Cambria, E.; Morency, L.P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, 15–20 July 2018; pp. 2236–2246.
27. Gomez, R.; Gibert, J.; Gomez, L.; Karatzas, D. Exploring hate speech detection in multimodal publications. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 1470–1478.
28. Warner, W.; Hirschberg, J. Detecting hate speech on the world wide web. In Proceedings of the Second Workshop on Language in Social Media, Montréal, QC, Canada, 7 June 2012; pp. 19–26.
29. Djuric, N.; Zhou, J.; Morris, R.; Grbovic, M.; Radosavljevic, V.; Bhamidipati, N. Hate speech detection with comment embeddings. In Proceedings of the 24th International Conference on World Wide Web, Florence, Italy, 18–22 May 2015; pp. 29–30.
30. Chen, Y. Convolutional Neural Network for Sentence Classification. Master's Thesis, University of Waterloo, Waterloo, ON, Canada, 2015.
31. Waseem, Z.; Davidson, T.; Warmsley, D.; Weber, I. Understanding abuse: A typology of abusive language detection subtasks. arXiv 2017, arXiv:1705.09899.
32. Benikova, D.; Wojatzki, M.; Zesch, T. What does this imply? Examining the impact of implicitness on the perception of hate speech. In Proceedings of the International Conference of the German Society for Computational Linguistics and Language Technology, Berlin, Germany, 13–14 September 2017; Springer: Berlin/Heidelberg, Germany, 2017; pp. 171–179.
33. Wiegand, M.; Siegel, M.; Ruppenhofer, J. Overview of the GermEval 2018 shared task on the identification of offensive language. In Proceedings of the 14th Conference on Natural Language Processing (KONVENS 2018), Vienna, Austria, 21 September 2018.
34. Kumar, R.; Ojha, A.K.; Malmasi, S.; Zampieri, M. Benchmarking aggression identification in social media. In Proceedings of the First Workshop on Trolling, Aggression and Cyberbullying (TRAC-2018), Santa Fe, NM, USA, 25 August 2018; pp. 1–11.
35. Nobata, C.; Tetreault, J.; Thomas, A.; Mehdad, Y.; Chang, Y. Abusive language detection in online user content. In Proceedings of the 25th International Conference on World Wide Web, Montreal, QC, Canada, 11–15 May 2016; pp. 145–153.
36. Aggarwal, P.; Horsmann, T.; Wojatzki, M.; Zesch, T. LTL-UDE at SemEval-2019 Task 6: BERT and two-vote classification for categorizing offensiveness. In Proceedings of the 13th International Workshop on Semantic Evaluation, Minneapolis, MN, USA, 6–7 June 2019; pp. 678–682.
37. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
38. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2017; Volume 30.
39. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
40. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
41. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
42. Sabat, B.O.; Ferrer, C.C.; Giro-i-Nieto, X. Hate speech in pixels: Detection of offensive memes towards automatic moderation. arXiv 2019, arXiv:1910.02334.
43. Liu, K.; Li, Y.; Xu, N.; Natarajan, P. Learn to combine modalities in multimodal deep learning. arXiv 2018, arXiv:1805.11730.
44. Chen, Y.C.; Li, L.; Yu, L.; El Kholy, A.; Ahmed, F.; Gan, Z.; Cheng, Y.; Liu, J. UNITER: Universal image-text representation learning. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 104–120.
45. Li, X.; Yin, X.; Li, C.; Zhang, P.; Hu, X.; Zhang, L.; Wang, L.; Hu, H.; Dong, L.; Wei, F.; et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 121–137.
46. Su, W.; Zhu, X.; Cao, Y.; Li, B.; Lu, L.; Wei, F.; Dai, J. VL-BERT: Pre-training of generic visual-linguistic representations. arXiv 2019, arXiv:1908.08530.
47. Aken, B.v.; Winter, B.; Löser, A.; Gers, F.A. VisBERT: Hidden-state visualizations for transformers. In Companion Proceedings of the Web Conference 2020, Taipei, Taiwan, 20–24 April 2020; pp. 207–211.
48. Yu, F.; Tang, J.; Yin, W.; Sun, Y.; Tian, H.; Wu, H.; Wang, H. ERNIE-ViL: Knowledge enhanced vision-language representations through scene graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 3208–3216.
49. Li, L.H.; Yatskar, M.; Yin, D.; Hsieh, C.J.; Chang, K.W. VisualBERT: A simple and performant baseline for vision and language. arXiv 2019, arXiv:1908.03557.
50. Lu, J.; Batra, D.; Parikh, D.; Lee, S. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2019; Volume 32.
51. Liu, W.; Mei, T.; Zhang, Y.; Che, C.; Luo, J. Multi-task deep visual-semantic embedding for video thumbnail selection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3707–3715.
52. Zhang, W.; Li, R.; Zeng, T.; Sun, Q.; Kumar, S.; Ye, J.; Ji, S. Deep model based transfer and multi-task learning for biological image analysis. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; pp. 1475–1484.
53. Akhtar, M.S.; Chauhan, D.S.; Ghosal, D.; Poria, S.; Ekbal, A.; Bhattacharyya, P. Multi-task learning for multi-modal emotion recognition and sentiment analysis. arXiv 2019, arXiv:1905.05812.
54. Chen, Z.; Badrinarayanan, V.; Lee, C.Y.; Rabinovich, A. GradNorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In Proceedings of the International Conference on Machine Learning (PMLR 2018), Stockholm, Sweden, 10–15 July 2018; pp. 794–803.
55. Efron, B.; Hastie, T. Computer Age Statistical Inference, Student Edition: Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2021; Volume 6.
56. Zhang, T. Statistical behavior and consistency of classification methods based on convex risk minimization. Ann. Stat. 2004, 32, 56–85.
57. Chen, D.R.; Sun, T. Consistency of multiclass empirical risk minimization methods based on convex loss. J. Mach. Learn. Res. 2006, 7, 2435–2447.
58. Su, W.; Boyd, S.; Candes, E. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. In Advances in Neural Information Processing Systems; NIPS: La Jolla, CA, USA, 2014; Volume 27.
59. Sandulescu, V. Detecting hateful memes using a multimodal deep ensemble. arXiv 2020, arXiv:2012.13235.
60. Mao, H.; Yao, S.; Tang, T.; Li, B.; Yao, J.; Wang, Y. Towards real-time object detection on embedded systems. IEEE Trans. Emerg. Top. Comput. 2016, 6, 417–431.
61. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
62. Mokady, R.; Hertz, A.; Bermano, A.H. ClipCap: CLIP prefix for image captioning. arXiv 2021, arXiv:2111.09734.
Comparison of our model with unimodal and multimodal baselines (validation/test):

| Type | Model | Validation | Test |
|---|---|---|---|
| Unimodal | Image-Grid | 52.73% | 52.00% |
| Unimodal | Image-Region | 52.66% | 52.13% |
| Unimodal | Text BERT | 58.26% | 59.20% |
| Multimodal | Late Fusion | 61.53% | 59.66% |
| Multimodal | Concat BERT | 58.60% | 59.13% |
| Multimodal | MMBT-Grid | 58.20% | 60.06% |
| Multimodal | MMBT-Region | 58.73% | 60.23% |
| Multimodal | ViLBERT | 62.20% | 62.30% |
| Multimodal | Visual BERT | 62.10% | 63.20% |
| Multimodal | ViLBERT CC | 61.40% | 61.10% |
| Multimodal | Visual BERT COCO | 65.06% | 64.73% |
| Multimodal | Our model | 65.92% | 66.30% |
Ablation over task combinations (M: multimodal main task, T: text auxiliary task, V: visual auxiliary task):

| Tasks | Validation | Test |
|---|---|---|
| M | 61.92% | 63.40% |
| M,T | 62.67% | 63.10% |
| M,V | 62.05% | 62.24% |
| M,T | 62.83% | 63.45% |
| M,V | 62.33% | 62.60% |
| M,T,V | 63.00% | 64.65% |
| M,T,V | 65.92% | 66.30% |