2.1. Illegal Website Identification
Methods for identifying illegal websites (phishing, gambling, pornography websites, etc.) can be classified into four categories: blacklist-based methods, URL-based methods, single-feature-based methods, and mixed-feature-based methods. Blacklist-based methods perform poorly because the URLs of illegal websites change constantly. Thus, we review related work on URL-based, single-feature-based, and mixed-feature-based methods in the following section.
URL-based methods. URL-based methods extract features from URLs for classification. Fan et al. [
6] use the Fast Unfolding algorithm to cluster websites and extract the URL features of illegal websites; an unknown website is then flagged as illegal if it exhibits these URL features. Garera et al. [
7] identify several fine-grained heuristics that distinguish phishing URLs from benign URLs and use these heuristics to build a logistic regression classifier. Ma et al. [
8] propose an automated URL classification method, which uses statistical approaches to discover lexical features and host-based properties of malicious website URLs. However, URL-based methods cannot achieve high identification accuracy because URLs alone carry insufficient information.
Single-feature-based methods. Single-feature-based methods extract a single feature from the content of a webpage for classification. Webpages contain many kinds of information, such as images, text, links, JavaScript code, and CSS. Single-feature-based methods extract a single feature from one of these sources to identify illegal websites. Liu et al. [
12] propose a CNN-based method, which extracts visual features from web snapshots for detecting malicious websites. Zhang et al. [
9] extract Chinese text content from webpages and combine a Bloom filter with the TF-IDF algorithm to classify webpages into different themes. Sun et al. [
10] propose a BERT fine-tuning-based text classification method to identify online gambling websites. Li et al. [
11] develop a screenshot tool to collect screenshots of gambling and pornography websites. They extract Speeded-Up Robust Features (SURF) from the screenshots based on the bag-of-words (BoW) model and use a support vector machine (SVM) classifier to distinguish gambling and pornography websites from normal websites. Jain et al. [
13] propose a phishing website detection method that analyzes the hyperlinks in a webpage. However, it is difficult to represent a website and achieve excellent performance with a single feature, especially when malicious websites evade detection through disguise, misleading content, and bypassing techniques.
Mixed-feature-based methods. Mixed-feature-based methods extract and combine several different features to improve the identification accuracy. Cernica et al. [
16] combine computer vision with other techniques to detect phishing webpages. Zhang et al. [
17] propose a two-stage extreme learning machine technique based on mixed URL, webpage, and text-content features for phishing website detection. Chen et al. [
18] use the Doc2Vec model to extract textual features and use the local spatial improved bag-of-visual-words (Spa-BoVW) model to extract visual features of website screenshots. They use a data fusion algorithm based on logistic regression (LR) to obtain the final prediction result. Zuhair et al. [
19] propose a hybrid feature-based classifier that hybridizes two machine learning algorithms for phishing website detection. Yang et al. [
20] combine URL statistical features, webpage code features, and webpage text features, and use a deep learning method for classification. However, an improper feature combination cannot effectively improve recognition performance. For example, simply concatenating visual and semantic feature vectors is often insufficient because the two kinds of features lie in different feature spaces and differ greatly in distribution. In addition, deep neural networks extract features layer by layer, and early-layer and late-layer features carry different information, so fully fusing the early and late features of multimodal data is a challenge. Thus, it is still difficult to fuse different features well.
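To make the feature-space mismatch concrete, the following is a minimal sketch, assuming a 2048-dimensional visual vector (e.g., a CNN pooled output) and a 768-dimensional semantic vector (e.g., a BERT-style sentence embedding); the dimensions and module names are illustrative assumptions, not the design of any cited work. It contrasts naive concatenation with first projecting both modalities into a shared space.

```python
import torch
import torch.nn as nn

visual_feat = torch.randn(8, 2048)  # batch of visual features (hypothetical CNN output)
text_feat = torch.randn(8, 768)     # batch of semantic features (hypothetical text embedding)

# Naive early fusion: direct concatenation. The two vectors live in
# different feature spaces with different dimensions and scales, so the
# downstream classifier must absorb that mismatch on its own.
naive_fused = torch.cat([visual_feat, text_feat], dim=1)  # shape (8, 2816)

# One common remedy: project both modalities into a shared space of the
# same dimension before combining them.
class SharedSpaceFusion(nn.Module):
    def __init__(self, v_dim=2048, t_dim=768, d=256):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, d)
        self.t_proj = nn.Linear(t_dim, d)

    def forward(self, v, t):
        v = torch.relu(self.v_proj(v))
        t = torch.relu(self.t_proj(t))
        return torch.cat([v, t], dim=1)  # fused vector with comparable sub-spaces

fused = SharedSpaceFusion()(visual_feat, text_feat)  # shape (8, 512)
```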
In conclusion, URL-based methods cannot achieve high identification accuracy because URLs contain insufficient information. Single-feature-based methods still suffer from a high false positive rate because a single feature is insufficient to represent a website. Existing mixed-feature-based illegal website identification methods extract features from the multi-source data of websites, which improves identification performance; however, collecting and processing multi-source data also increases the complexity of the algorithm. In addition, these methods ignore one important piece of information: the text embedded in images. Such text often carries clear semantic cues that point directly to gambling websites and is therefore very helpful for improving identification performance. Furthermore, multimodal data can be extracted simply from webpage screenshots, without additional processing of other data such as HTML, links, or CSS, which reduces complexity. Therefore, it is necessary to study how to extract the text embedded in images and how to fuse the visual and semantic features of website screenshots to improve the performance of gambling website identification.
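As a simple illustration, the text embedded in a screenshot can be recovered with an off-the-shelf OCR engine; the snippet below is a minimal sketch assuming the Tesseract engine via pytesseract, with a hypothetical file name and an illustrative language setting, not a description of the method proposed in this paper.

```python
from PIL import Image
import pytesseract

# Hypothetical screenshot file of a suspect webpage.
screenshot = Image.open("screenshot.png")

# Recover the text rendered inside the image (slogans, bonus offers,
# game names, etc.); the language setting is an assumption for illustration.
ocr_text = pytesseract.image_to_string(screenshot, lang="chi_sim+eng")

# The OCR text can then be encoded by a text model and fused with the
# visual features of the same screenshot.
print(ocr_text)
```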
2.2. Multimodal Data Fusion
Multimodal data fusion [
25,
26,
27] integrates information from multiple modalities and exploits the complementary strengths of different modalities. Multimodal data fusion methods can be roughly divided into two categories: model-independent fusion methods and model-based fusion methods. In this paper, we focus on model-independent fusion methods, which can be further classified into three categories: early fusion, late fusion, and hybrid fusion. Early fusion methods fuse the features extracted from multimodal data. Because the representation, distribution, and density of the modalities may differ, simply concatenating their features may ignore the unique attributes and correlations of each modality and may introduce redundancy. Late fusion methods fuse the prediction results of per-modality classifiers. Compared with early fusion, late fusion is insensitive to differences between feature spaces; however, it lacks low-level interaction between the modalities. Hybrid fusion combines the advantages of early and late fusion and can fuse multimodal data at both the feature level and the decision level.
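The following is a minimal sketch of the early/late distinction using synthetic data; the feature dimensions, classifiers, and averaging rule are illustrative assumptions rather than a recommended configuration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_img = rng.normal(size=(200, 64))   # visual features, one row per sample
X_txt = rng.normal(size=(200, 32))   # textual features for the same samples
y = rng.integers(0, 2, size=200)     # binary labels (e.g., gambling vs. normal)

# Early fusion: concatenate the feature vectors and train one classifier.
early_clf = LogisticRegression(max_iter=1000).fit(np.hstack([X_img, X_txt]), y)

# Late fusion: train one classifier per modality, then combine their
# predicted probabilities at the decision level (here a simple average).
img_clf = LogisticRegression(max_iter=1000).fit(X_img, y)
txt_clf = LogisticRegression(max_iter=1000).fit(X_txt, y)
late_prob = (img_clf.predict_proba(X_img)[:, 1]
             + txt_clf.predict_proba(X_txt)[:, 1]) / 2

# Hybrid fusion would combine both ideas, e.g., adding the early-fused
# classifier's probability as a third vote in the decision-level average.
```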
Multimodal data fusion has many application scenarios, such as image classification, document classification, emotion recognition, etc. Choi et al. [
28] propose a deep learning architecture called “EmbraceNet” for multimodal information-based classification tasks, which combines the representations of multiple modalities in a probabilistic manner. Gallo et al. [
29] propose a multimodal approach that fuses images and text descriptions to improve classification performance in real-world scenarios. Audebert et al. [
30] propose a multimodal deep network that learns from both a document image and its textual content extracted by OCR to perform document classification. Jain et al. [
31] experimentally demonstrate that combining both text and visual modalities of the document can improve the accuracy of document classification. Huang et al. [
32] propose an image–text multimodal attentive fusion method to exploit the discriminative features and the internal correlation between visual and semantic contents for sentiment analysis. Nemati [
33] proposes a feature-level and decision-level hybrid multimodal fusion method for emotion recognition. In conclusion, multimodal data fusion methods can integrate multimodal information to improve performance on real-world tasks.
In a webpage screenshot, the image itself and the text content extracted from it are different modalities of data, even though they come from the same source. Moreover, the text content can serve as an effective supplement to the image features and improve the performance of gambling website identification. Therefore, it is necessary to study an appropriate multimodal data fusion method that fuses the image data and the OCR text data from webpage screenshots for identifying gambling websites.