Article

MEDAL: A Multimodality-Based Effective Data Augmentation Framework for Illegal Website Identification

College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
* Author to whom correspondence should be addressed.
Electronics 2024, 13(11), 2199; https://doi.org/10.3390/electronics13112199
Submission received: 9 April 2024 / Revised: 26 May 2024 / Accepted: 31 May 2024 / Published: 5 June 2024
(This article belongs to the Section Artificial Intelligence)

Abstract

The emergence of illegal (gambling, pornography, and attraction) websites seriously threatens the security of society. Due to the concealment of illegal websites, it is difficult to obtain labeled data in large quantities. Meanwhile, most illegal websites disguise themselves to avoid detection; for example, some gambling websites may visually resemble gaming websites. However, existing methods ignore these means of camouflage within a single modality. To address the above problems, this paper proposes MEDAL, a multimodality-based effective data augmentation framework for illegal website identification. First, we established an illegal website identification framework based on tri-training that combines information from different modalities (including image, text, and HTML) while making full use of numerous unlabeled data. Then, we designed a multimodal mutual assistance module that is integrated with the tri-training framework to mitigate the introduction of erroneous information resulting from unbalanced single-modal classifier performance in the tri-training process. Finally, experimental results on a self-developed dataset demonstrate the effectiveness of the proposed framework, which performs well on the accuracy, precision, recall, and F1 metrics.

1. Introduction

The Internet has long become an indispensable information network in people’s lives. However, the emergence of a large number of websites with malicious and illegal content, such as gambling, pornography, and phishing, has dramatically damaged the ecological environment of the Internet [1,2,3]. The spread of illegal information on these sites has a serious impact on the well-being of families and the security of society. Therefore, automated identification methods are now particularly important. To clean up cyberspace and effectively curb the spread of cyber hazards, many detection methods have been proposed.
The current methods for identifying these illegal websites can be divided into four main types: blacklist-based [4,5], URL-based [6,7,8,9], single modal-based [10,11,12,13,14,15,16], and multimodal fusion-based methods [17,18,19,20].
Blacklist-based methods build uniform resource locator (URL) blacklists or domain lists and detect illegal websites by matching suspicious sites against an established blacklist. However, building blacklists is time consuming, and the cost of maintaining and updating the data is enormous. URL-based methods extract features only from URLs, which leads to a serious shortage of features and makes it difficult to achieve a high recognition rate.
Due to the rich textual and visual content of web pages, a large number of detection methods based on text or image classification tasks have been proposed. However, these methods often build classifiers on single-modal data; limited by insufficient features, they suffer from low accuracy and are easily deceived. For example, some gambling websites display visual content similar to game or sports websites, leading them to be mistaken for normal websites.
As shown in Figure 1, a website may be considered a game promotion website from a visual point of view. But the text in the HTML (HyperText Markup Language), “博彩游戏” (gambling games), indicates that the games on the website are gambling games. The classification results of different modalities may therefore differ, and confusing information in a single modality may mislead the classification results. Some criminals even use this kind of camouflage to evade detection [21]. Therefore, a good combination of information from different modalities can effectively reduce misclassification and thus improve the identification accuracy of illegal websites.
Multimodal fusion-based methods are more effective website detection schemes that combine multiple modalities of website data for analysis. In existing detection methods, however, the data from different modalities are not effectively fused. The content information obtainable from a website includes web page screenshots, HTML text, and image OCR text; these three modalities contain distinct features from different views. Existing methods also suffer from insufficient high-quality labeled data and a lack of utilization of unlabeled data. Although there is no publicly available dataset of illegal websites, crawling web content is easy, so we were able to obtain large amounts of unlabeled data. Labeling illegal websites is also a hazard to the labelers' mental health; we conducted psychological tests and counseling for the researchers concerned. Therefore, unsupervised or semi-supervised methods have greater research and application value in illegal website detection, as they also mitigate the risk to the researchers involved.
In summary, there are still three problems to be solved. The first is how to combine the classifiers built from the three modalities and train them together to improve performance. The second is how to utilize unlabeled data effectively. The third is how to cope with unbalanced information across modalities. Tri-training [22] is a popular semi-supervised learning algorithm that uses unlabeled data with three classifiers and explores the inconsistencies among three different views. In this work, we used tri-training to fuse the three modalities of information, but doing so raises the third problem during the tri-training process: the performances of the individual classifiers built on the three modalities are significantly unbalanced, and the poorer classifiers may introduce larger errors when producing pseudo-labels, thus misleading the training process. Therefore, it is necessary to consider how to address this issue.
To cope with the above three issues, we propose a multimodality-based effective data augmentation framework for illegal website identification that effectively utilizes multimodal information and unlabeled data. Firstly, we constructed three basic classifiers and trained them on the three modalities of data extracted from web pages: HTML text, images, and text extracted from images. Secondly, to improve the performances of the three types of classifiers, we used tri-training to expand the dataset and retrained the three basic classifiers iteratively. Within the tri-training process, we designed a multimodal mutual assistance module that first predicts the unlabeled data using the stronger basic classifiers and then updates the training dataset based on their prediction results. The classifiers are then further trained on top of the basic tri-training method, utilizing the unlabeled data to improve the generalization of the overall model. Finally, the hard voting rule is used to predict the test samples through disagreement judgment.
The contributions of our work are as follows:
  • We propose MEDAL, a multimodality-based effective data augmentation framework for illegal website identification. Compared to existing methods, this method combines more information from different modalities (including image, OCR text, and HTML). Through a semi-supervised retraining method, we make full use of a large amount of unlabeled data.
  • We designed a multimodal mutual assistance module attached to the tri-training framework to effectively mitigate the impact of the error information due to the imbalanced performance of the basic classifiers. We achieved mutual lifting across modalities.
  • We evaluated the performance by conducting experiments on the collected datasets. The experimental results demonstrate that the proposed method can effectively improve the performance in identifying illegal websites.
The remainder of this paper is organized as follows. Section 2 introduces related work on illegal website identification and tri-training. Section 3 introduces the proposed method. The experimental results are reported in Section 4, and Section 5 discusses the remaining limitations. Section 6 concludes this paper.

2. Related Work

Research on detecting illegal websites, including phishing, gambling, pornography, etc., can roughly be divided into two categories: single-modal based detection methods and multimodal fusion-based detection methods.

2.1. Single-Modal Based Identification Methods

Early detection methods aimed to establish blacklists and whitelists by creating lists of URLs known to be malicious and using matching methods to detect them. This kind of method is based on simple queries, so it is fast [5]. However, the domains of illegal websites change regularly and rapidly, so it is difficult to keep up with the large number of new URLs generated every day, and maintaining blacklists is time consuming and laborious. Later, URL-based methods extracted features from website URLs, such as URL strings [6,7,8] and URL statistical rules [9], and used machine learning and deep learning to classify websites. However, this kind of method struggles to cope with the rapid changes in illegal websites, and because a URL carries little information, the extracted features are limited.
Websites contain rich information, such as text, images, links, cascading style sheets (CSS), HTML tags, web fingerprints, etc. Some studies have utilized hyperlinks [10] or HTML tags [11]. Ma et al. [12] proposed a lightweight graph-based method to detect pornographic and gambling websites by extracting the textual content in HTML. However, the text collected from the crawled HTML is only a part of the web page content, and it is challenging to classify backend-rendered pages, which contain little textual content in the HTML and carry most of their information in the page images. Li et al. [23] and Li et al. [24] focused on open-world app fingerprinting and achieved high performances; however, we argue that website content has more distinct characteristics. Sun et al. [13] proposed a classification method for detecting gambling domains based on textual analysis and certificate detection. However, the information contained in a domain name is extremely limited, and certificates can also be forged. Li et al. [15] argued that due to the growing complexity and size of websites, textual content-based approaches are limited by the feature dimensionality. They chose visual content to identify gambling and pornographic websites and found that SURF features could be used to improve the performance. Many more methods for image feature extraction are now available. Liu et al. [14] used the logo images of web pages as recognition objects for feature matching and recognition. The main problem is that gambling logo images are relatively homogeneous, carrying little individual logo information, and the method requires a large amount of labeled sample data. Yuan et al. [16] studied adversarial images and proposed a detection method for the adversarial images used to hide pornography in the real world. In general, detection methods based on a single type of content are limited by their information sources, and their recognition accuracy and generalization need to be improved.

2.2. Multimodal Fusion-Based Identification Methods

Multimodal-based approaches can fuse information from multiple perspectives and can confirm identification from multiple views. Kumar and Sachdeva [25] proposed a deep neural model for cyberbullying detection in three different modalities of social data. This study enriched the data sources but did not enable the cross-corroboration of multimodal data models. Lin et al. [26] combined visual and text information to collect evidence for detection using query–document matching; this research demonstrated that multimodal information can corroborate across modalities. Zhou et al. [27] adopted Stacking, an ensemble learning algorithm, combined multiple modalities such as text, images, and URLs, and proposed a multimodal fraudulent website identification method. However, the method was limited by the amount of data (fewer than 100 samples per category) and could not exploit unlabeled data despite combining multimodal data. Chen et al. [17] designed an automatic system for detecting pornographic and gambling websites based on visual and textual content with a decision mechanism. Zhao et al. [18] proposed a pornographic website detection method that learns website representation features by jointly aggregating image-based, text-based, and structure-based features, formalizing pornographic website detection as a node classification task on a graph. Wang et al. [20] proposed ITSA for gambling web page detection and discussed the impact of different multimodal fusion approaches on detection effectiveness. These three approaches present different ways of utilizing multimodal features; however, they cannot update the model and may not work well with new samples. Wang et al. [19] proposed a gambling website identification method based on a co-training algorithm that utilizes unlabeled data so that the classification models improve each other. However, this method does not combine the results of the two modalities and ignores the available text in HTML files. Inspired by this approach, we propose our method. We summarize the research done in this field in Table 1.
Moreover, we have to face the lack of high-quality labeled data. Ul Hassan et al. [28] applied undersampling, oversampling, and SMOTE for balancing the dataset in the malicious website detection task. The resampling method is constrained by the already labeled data and is unable to fully utilize the substantial quantity of unlabeled data. The study of multimodal data and how to effectively combine multiple features is an issue that needs to be discussed in more depth.
Tri-training [22] is a popular semi-supervised learning algorithm developed from the co-training [29] paradigm. Shivani Sri Varshini et al. [30] designed a self-ensembled semi-supervised fake news detection model that learns the representations of labeled and unlabeled fake news by incorporating an adaptive pseudo-labeling mechanism on unlabeled data. Qian et al. [31] proposed a three-view tri-training method for the literature–author identification task that iteratively identifies the authors of unlabeled data to expand the training dataset. The core idea is to first represent each document as three different views and then use tri-training to exploit a large number of unlabeled documents. Wang et al. [32] used a tri-attention gated mechanism to construct a reliable training set from unlabeled data in the training phase, using a convolutional neural network (CNN) to extract multimodal deep features, i.e., spatial, motion, and audio, from live video streams. Yu et al. [33] proposed a general socially aware self-supervised tri-training framework for recommendations and achieved significant performance gains. An et al. [34] proposed a deep tri-training (DTT) framework in which the three networks have the same initialization structure but different parameters; at each optimization step, one of the networks is trained under the guidance of the other two. In basic tri-training methods, however, the two guiding networks are treated equally, which is unreasonable.
In conclusion, existing research on the detection of illegal websites is starting to evolve toward the application of multimodal hybrid features, which offer better performances and broad development prospects. While website content is easy to collect and rich in information sources, the labeling work is time- and energy-consuming. To cope with the situation of sparse labeled data and rich unlabeled data, we adopted a semi-supervised learning approach to improve the detection performance and enhance generalization. We also explored improvements to the tri-training framework in order to use unlabeled data effectively when the performances of the multiple classifiers are unbalanced. Moreover, current detection methods for illegal websites do not detect attraction websites. We define an attraction website as a website that funnels a large amount of traffic to illegal websites, such as gambling and pornographic websites, and contains many links to them. For example, illegal groups may operate both pornographic and gambling websites and advertise the gambling websites through specific attraction websites filled with pornographic images and gambling adverts to attract visitors. The urgency of detecting attraction websites deserves attention, so we added them to the model's classification categories.

3. Methodology

Based on the current development of illegal website detection methods, we propose a multimodal data augmentation framework for illegal website identification, introducing a multimodal mutual assistance approach to improve the tri-training process. In the following sections, we first describe the multimodal data extracted from illegal websites, then introduce the construction and selection of the basic classifiers, and finally introduce the training process with the multimodal mutual assistance module and the multimodal data augmentation framework in detail.

3.1. Overview of the Proposed Method

Figure 2 shows the overall architecture of the proposed illegal website identification method. Structurally, the model can be divided into three parts: the basic classifiers trained on a small amount of manually labeled data, the multimodal mutual assistance module, and the subsequent tri-training fusion module.
Based on the collected website data, we built labeled and unlabeled datasets, each including web page screenshots, HTML text, and image OCR text. We trained models on the image data to extract visual features, which were then fed into a multi-layer perceptron (MLP) for classification. We extracted text from the HTML code files, used OCR to extract text from the web screenshots, and used pre-trained word embedding layers to obtain word vectors; we then extracted semantic features from the two text sources separately and fed them into MLPs for classification. Trained on only a small amount of labeled data, the resulting classifiers differ markedly in performance. We designed a multimodal mutual assistance module to exploit these differences in training effects across modalities and filter samples conducive to improving the performances of the low-performance models. Then, the three classes of classifiers were used to simultaneously classify the unlabeled data, and pseudo-labels were assigned based on disagreement judgments. The samples with pseudo-labels were added to the corresponding training datasets. Through iteration, this process fully utilizes the unlabeled data to improve the generalization of the overall model. Finally, the prediction results $y_1$ from the image modality, $y_2$ from the OCR text modality, and $y_3$ from the HTML text modality were obtained separately, and the final prediction result was obtained using the hard voting rule.
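For concreteness, the following is a minimal sketch of the hard voting rule; the class encoding and the tie-breaking choice are our assumptions, since the paper does not specify them.

```python
from collections import Counter

def hard_vote(y1: int, y2: int, y3: int) -> int:
    """Fuse the three single-modality predictions by majority vote.

    y1, y2, y3 are class indices from the image, OCR text, and HTML text
    classifiers (e.g., 0 = normal, 1 = gambling, 2 = porn, 3 = attraction;
    this encoding is illustrative).
    """
    counts = Counter([y1, y2, y3])
    label, votes = counts.most_common(1)[0]
    # With three voters and four classes a 1-1-1 tie is possible; we fall
    # back to the image classifier's prediction, an assumption of ours.
    return label if votes >= 2 else y1

# Example: image says gambling (1), OCR says gambling (1), HTML says normal (0).
assert hard_vote(1, 1, 0) == 1
```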

3.2. Model Selection

3.2.1. Basic Image Classifier

With the rapid development of computer vision, the convolutional neural network (CNN) has achieved great success in the field of digital image processing. Convolutional and pooling layers give the CNN a strong inductive bias: under the assumption of spatial invariance, the network processes the whole feature map with shared weights in a sliding-window fashion, which gives it a strong image feature extraction capability.
The image classifier for illegal website detection is based on a CNN model. An example of the basic image classifier structure is shown in Figure 3. The inputs are web page screenshots, and the outputs of the fully connected network are prediction labels: gambling, pornography, attraction, or normal. We constructed feature extraction networks based on VGG16 [35], ResNet50, ResNet101 [36], and ViT [37]. We used these backbones to extract features from the image information and generate feature maps, and then used a fully connected layer to perform the four-class classification task. After comparison, we selected the optimal backbone.
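For illustration, below is a minimal PyTorch sketch of such a backbone-plus-head classifier using the VGG16 feature extractor; the head dimensions and dropout rate are our placeholders rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

NUM_CLASSES = 4  # gambling, pornography, attraction, normal

class ImageClassifier(nn.Module):
    """CNN backbone plus a fully connected head for the four-way task."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        backbone = models.vgg16(weights=None)  # VGG16, as selected in the paper
        self.features = backbone.features      # convolutional feature extractor
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.classifier = nn.Sequential(        # MLP head; sizes are assumptions
            nn.Flatten(),
            nn.Linear(512 * 7 * 7, 512),
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(512, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: batch of web page screenshots, shape (B, 3, 224, 224)
        return self.classifier(self.pool(self.features(x)))

model = ImageClassifier()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 4)
```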

3.2.2. Basic Text Classifier

Text classification is an important task in natural language processing and has a wide range of applications in web search, information retrieval, sorting, and classification tasks. To detect illegal websites based on text, we constructed text classification models based on TextRNN [38], TextRNN-Attention [39], FastText [40], and a Transformer [41], and through experimental analysis we finally selected FastText as the text classification model of our method. The structure of the FastText-based text classifier is shown in Figure 4. FastText is a text classification tool introduced by Facebook AI Research that can also be used to train word vectors, sentence vectors, etc. Its main advantages are the simplicity of the model, fast training speed, and satisfactory text classification accuracy.
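For illustration, a minimal sketch of supervised FastText training with the official fasttext Python package follows; the file name, example text, and hyperparameters (other than the 300-dimensional embedding noted in Section 4.2) are assumptions of ours.

```python
import fasttext

# Training file format: one example per line, "__label__<class> <tokenized text>",
# e.g. "__label__gambling 高额 佣金 每日 结算 ...". The file name and the
# hyperparameters below are placeholders, not the paper's tuned values.
model = fasttext.train_supervised(
    input="ocr_text_train.txt",
    lr=0.5,        # learning rate
    epoch=25,      # training epochs
    wordNgrams=2,  # use bigrams in addition to unigrams
    dim=300,       # embedding dimension, matching the paper's setting
)

labels, probs = model.predict("充值 送 彩金 博彩 游戏")  # OCR text of a page
print(labels[0], probs[0])  # e.g. ('__label__gambling', 0.97)
```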

3.3. MEDAL

3.3.1. Multimodal Mutual Assistance Module

In the basic tri-training framework, the three basic models are treated equally. However, we argue that when multimodality is incorporated, models of different modalities should be treated differently due to the imbalance in information across modalities (which leads to an imbalance in the performances of the underlying classifiers). Thus, we propose an improved multimodal tri-training approach, namely multimodal mutual assistance.
Figure 5 shows the multimodal mutual assistance module we used during the tri-training process. As the performances of the basic classifiers constructed on the basis of the three modalities differ significantly, we also explored which training strategy could achieve a better performance improvement.
Firstly, we evaluated the basic models of the three modalities and ranked them as high, medium, and low according to their performance. When the prediction results from the two relatively better models were the same, the predicted labels were used as pseudo-labels, and the corresponding samples were added to the training dataset. The prediction results of the best model were used as pseudo-labels, and the corresponding samples were added to the tri-training dataset of the second-best model. The models were iteratively trained until the performances of the two poorer classifiers improved to a certain standard.
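A minimal sketch of one such round is given below; the classifier interface and the exact routing of pseudo-labeled samples (the two stronger models' agreement feeding the weakest model, and the best model guiding the second best) follow our reading of this section rather than a specification in the paper.

```python
def mutual_assistance_round(high, mid, low, unlabeled_views):
    """One round of multimodal mutual assistance (illustrative sketch).

    `high`, `mid`, and `low` are the three single-modality classifiers ranked
    by validation performance; `unlabeled_views` yields (x_high, x_mid, x_low)
    triples, i.e., the three modality views of the same unlabeled website.
    """
    new_for_low, new_for_mid = [], []
    for x_h, x_m, x_l in unlabeled_views:
        y_h = high.predict(x_h)
        y_m = mid.predict(x_m)
        # Rule 1: the two stronger models agree, so their shared label becomes
        # a pseudo-label and the sample augments the weakest model's data.
        if y_h == y_m:
            new_for_low.append((x_l, y_h))
        # Rule 2: the best model's prediction serves as a pseudo-label for the
        # second-best model's training set.
        new_for_mid.append((x_m, y_h))
    return new_for_low, new_for_mid

# The caller retrains `low` and `mid` on their augmented sets, repeating until
# both reach the required performance standard.
```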

3.3.2. Multimodal Data Augmentation Framework for Illegal Website Identification

We added the multimodal mutual assistance approach as an add-on module to the tri-training framework. In practical applications, the module can be appended whenever needed to improve the classifier performance as more samples become available. The overall procedure is described as follows.
First, we define the labeled dataset $L$, the unlabeled dataset $T$, and the pseudo-labeled dataset $T_{pseudo}$ as follows:

$L = \{(x_i^A, x_i^B, x_i^C, y_i)\}_{i=1}^{n}$

$T = \{(x_j^A, x_j^B, x_j^C)\}_{j=1}^{m}$

$T_{pseudo} = \varnothing$

where $x_i^A$ denotes the web screenshot data of sample $x_i$, $x_i^B$ denotes the HTML text data of sample $x_i$, $x_i^C$ denotes the text data extracted from the web screenshot of sample $x_i$, $n$ is the total number of labeled samples, and $y_i$ denotes the class label of sample $x_i$.

Second, $F_1$, $F_2$, and $F_3$ are the three types of models built on $x^A$, $x^B$, and $x^C$. After the multimodal mutual assistance training, the improved classifiers are denoted $F_1'$, $F_2'$, and $F_3'$, respectively. During the multimodal tri-training process, $F_1'$, $F_2'$, and $F_3'$ each predict the data selected from the unlabeled dataset $T$, yielding the prediction results of the three types of classifiers on the data of the corresponding modalities.

The hard voting results are used as the pseudo-labels $L_{pseudo}$, and the pseudo-labeled samples are added to the training dataset to obtain the expanded dataset $L'$:

$L_{pseudo}(x_j) = \arg\max(\mathrm{Softmax}(F'(x_j)))$

$T_{pseudo} = \left\{ \left( x_j,\ \arg\max_{c} \mathrm{Count}\big(L_{pseudo}(x_j^A), L_{pseudo}(x_j^B), L_{pseudo}(x_j^C)\big) \right) \right\}_{j=1}^{m}$

$L' = L \cup T_{pseudo}$

The training is then repeated for $F_1'$, $F_2'$, and $F_3'$.
The multimodal data augmentation process is shown in Algorithm 1.
Algorithm 1 Multimodal Data Augmentation Algorithm
Require: Labeled dataset $L = \{(x_i^A, x_i^B, x_i^C, y_i)\}_{i=1}^{n}$
    Unlabeled dataset $T = \{(x_j^A, x_j^B, x_j^C)\}_{j=1}^{m}$
    Pseudo-labeled dataset $T_{pseudo} = \varnothing$
    Iteration numbers $n_1$, $n_2$
Ensure: The three trained classifiers
for $i = 1$ to $n_1$ do
    $F_1, F_2, F_3 \leftarrow \mathrm{train}(L)$
end for
while $\mathrm{Accuracy}(F_1, F_2, F_3) < \theta$ do
    $F_1', F_2', F_3' \leftarrow \mathrm{MultimodalMutualAssistance}(F_1, F_2, F_3, T)$
    for $iter = 1$ to $n_2$ do
        $L_{pseudo}^A \leftarrow F_1'(x_j^A)$
        $L_{pseudo}^B \leftarrow F_2'(x_j^B)$
        $L_{pseudo}^C \leftarrow F_3'(x_j^C)$
        $T_{pseudo} \leftarrow \{\mathrm{Vote}(L_{pseudo}^A, L_{pseudo}^B, L_{pseudo}^C)\}$
        $L' \leftarrow L \cup T_{pseudo}$
        Further train $F_1'$, $F_2'$, $F_3'$ with dataset $L'$
    end for
end while
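To make the control flow concrete, a Python sketch of Algorithm 1 follows. The classifier and dataset interfaces (fit, predict, accuracy, the .A/.B/.C modality views), the aggregation used in the while-condition, and the concrete values of theta, n1, and n2 are our assumptions, not the paper's.

```python
def medal_train(classifiers, labeled, unlabeled, theta=0.95, n1=10, n2=5):
    """Illustrative sketch of Algorithm 1 with hypothetical interfaces."""
    # Basic training on the small labeled set (first for-loop in Algorithm 1).
    for _ in range(n1):
        for f in classifiers:
            f.fit(labeled)

    # The paper does not say how Accuracy(F1, F2, F3) aggregates the three
    # classifiers; we use the minimum, ideally measured on a held-out split.
    while min(f.accuracy(labeled) for f in classifiers) < theta:
        # Multimodal mutual assistance (Section 3.3.1) lifts the weaker
        # classifiers before pseudo-labeling; called here as a black box.
        classifiers = multimodal_mutual_assistance(classifiers, unlabeled)

        for _ in range(n2):
            pseudo = []
            for x in unlabeled:  # x carries the modality views x.A, x.B, x.C
                votes = [f.predict(v)
                         for f, v in zip(classifiers, (x.A, x.B, x.C))]
                pseudo.append((x, hard_vote(*votes)))  # majority vote, as above
            expanded = labeled + pseudo  # L' = L ∪ T_pseudo
            for f in classifiers:
                f.fit(expanded)          # further train F1', F2', F3'
    return classifiers
```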

4. Experiments and Analysis

In this study, we conducted experiments to evaluate the performance of the proposed framework. Our experiments were conducted on a server with an Intel(R) Xeon(R) Gold 6230 CPU @ 2.10 GHz and an NVIDIA GeForce RTX 3090 with 24 GB of video memory. The deep learning framework we used was PyTorch (2.0.1).

4.1. Datasets and Evaluation Metrics

Datasets: Given the scarcity of public datasets on illegal websites, we started with a list of data from previous related competitions (https://datacon.qianxin.com/opendata/openpage?resourcesId=8, accessed on 9 April 2024), discovering connected sites and expanding the data sources. We used Python crawlers based on the Beautiful Soup (https://pypi.org/project/beautifulsoup4/, accessed on 9 April 2024) and Requests (https://pypi.org/project/requests/, accessed on 9 April 2024) libraries to filter out active websites and obtain the HTML and web page screenshots of each website. Due to moral and ethical issues, our data are not publicly accessible. The datasets contain a total of 3679 illegal websites across three illegal categories: 1253 gambling websites, 1221 pornographic websites, and 1205 attraction websites. To simulate an imbalanced data distribution, we also collected 4892 normal websites (in categories including news, medical, shopping, education, entertainment, movies, sports, etc.). We removed duplicate data, keeping 1200 websites each for the gambling, pornography, and attraction categories and 4800 normal websites as the experimental dataset. Table 2 shows our experimental dataset, which is divided into a labeled dataset and an unlabeled dataset. The labeled dataset contains 500 gambling, 500 pornographic, 500 attraction, and 2000 normal websites as the training set, and the test set includes 200 gambling, 200 pornographic, 200 attraction, and 800 normal websites; all samples include web page screenshots, HTML text, and text extracted from the screenshots using OCR. The unlabeled dataset contains 500 gambling, 500 pornographic, 500 attraction, and 2000 normal websites.
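The collection pipeline is not specified beyond the libraries named above; the following is a hedged sketch of the HTML-fetching step using Requests and Beautiful Soup (the function name and cleanup choices are ours). Screenshot capture (e.g., via a headless browser) and OCR on the screenshots are additional steps not shown here, and the paper does not name the tools used for them.

```python
import requests
from bs4 import BeautifulSoup

def fetch_site(url: str, timeout: float = 10.0):
    """Fetch one site and return (raw_html, visible_text); illustrative only."""
    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()  # treat HTTP errors as inactive sites
    soup = BeautifulSoup(resp.text, "html.parser")
    for tag in soup(["script", "style"]):  # strip non-content tags
        tag.decompose()
    visible_text = " ".join(soup.get_text(separator=" ").split())
    return resp.text, visible_text

# raw_html, text = fetch_site("http://example.com")
```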
Evaluation Metrics: In the experiments, the accuracy, precision, recall, and $F_1$ metrics were used for evaluation. The confusion matrix consists of four quantities: true positives ($TP$), true negatives ($TN$), false positives ($FP$), and false negatives ($FN$). We evaluated the model performance over all classes. For the $i$th (gambling, pornography, attraction, or normal) class, $TP_i$ is the number of websites correctly predicted by the model to be class $i$, $TN_i$ is the number of websites correctly predicted as not in class $i$, $FP_i$ is the number of websites incorrectly predicted as class $i$, and $FN_i$ is the number of websites incorrectly predicted as not in class $i$.
Accuracy measures the proportion of illegal (gambling, porn, or attraction) and normal websites that are predicted correctly among all websites, where $n$ is the number of classes and $N$ is the total number of samples:

$Accuracy = \frac{\sum_{i=1}^{n} TP_i}{N}$
Precision measures the rate of class i (gambling, porn, attraction, or normal) websites that were predicted correctly as a proportion of all websites predicted as class i:
$Precision_i = \frac{TP_i}{TP_i + FP_i}$
Recall measures the rate of instances correctly predicted as illegal (gambling, porn, or attraction) as a proportion of all illegal websites (gambling, porn, or attraction):
$Recall_i = \frac{TP_i}{TP_i + FN_i}$
Since we are more concerned with the classification of illegal sites, we chose the macro-averaging method to calculate the precision, recall, and $F_1$ metrics:

$Precision_{macro} = \frac{1}{n}\sum_{i=1}^{n} Precision_i$

$Recall_{macro} = \frac{1}{n}\sum_{i=1}^{n} Recall_i$

$F1_{macro} = \frac{1}{n}\sum_{i=1}^{n} F1_i = \frac{1}{n}\sum_{i=1}^{n} \frac{2 \times Precision_i \times Recall_i}{Precision_i + Recall_i}$
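As a sanity check of these definitions, the macro-averaged metrics can be computed with scikit-learn, whose "macro" average matches the formulas above; the toy labels below are purely illustrative.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Toy labels: 0 = normal, 1 = gambling, 2 = porn, 3 = attraction (illustrative).
y_true = [0, 1, 2, 3, 1, 0, 2, 3]
y_pred = [0, 1, 2, 3, 0, 0, 2, 3]

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"acc={accuracy:.4f} P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```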

4.2. Evaluation of Basic Classifiers

For this part, we built four models for each modality for comparative analysis and selected the one with the best basic classification ability as the basic classifier for multimodal tri-training. We trained VGG16-based, ResNet50-based, ResNet101-based, and ViT-based image classifiers on the screenshot image data, and TextRNN-based, TextRNN-Attention-based, FastText-based, and Transformer-based text classifiers on both the OCR text data and the HTML text data. The structures of the CNN backbones remained the same as those of the original models [35,36,37], and the image input size was 224 × 224 × 3. The hidden layer sizes of the text backbones were set to 256 (two layers), and the embedding dimension was set to 300. All the models were trained with the Adam optimizer (https://pytorch.org/docs/stable/generated/torch.optim.Adam.html, accessed on 9 April 2024) using a basic setup. The optimal hyperparameters, such as the learning rate, batch size, maximum training epochs, and dropout rate, were tuned for each task and each model separately. The test results of the three modalities' basic classifiers during basic training with the best hyperparameters are shown in Figure 6.
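For illustration, a minimal PyTorch training-loop sketch with the Adam optimizer follows; the model, data, learning rate, and epoch count are self-contained placeholders, since the actual values were tuned per task and per model and are not reported.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-ins: in the paper these are screenshot batches and tuned values.
model = torch.nn.Sequential(torch.nn.Flatten(),
                            torch.nn.Linear(3 * 224 * 224, 4))
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 4, (8,))),
    batch_size=4,
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # lr is a placeholder
criterion = torch.nn.CrossEntropyLoss()

for epoch in range(2):  # the epoch count was tuned per model in the paper
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```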
The values of the evaluation metrics for the most effective model in each of the three modalities are given in Table 3, where the bolded text indicates the best value for each metric. From Table 3, we can see that FastText achieved better classification results on OCR text than on HTML text; its accuracy, precision, recall, and $F_1$ metrics are better than those of the other neural network models. The VGG16 model achieved the highest accuracy among the image classifiers. The ViT and Transformer models were less effective at recognizing illegal websites given the small amount of labeled training data available.
For the four-class classification problem of gambling, pornography, attraction, and normal websites, there are obvious imbalances in classifier performance across modalities, and the performances of the classifiers in the three modalities are distributed in a stepped manner. The accuracies of the classifiers trained on the OCR text obtained from the web screenshots were significantly higher than those of the other two modalities, owing to the semantic richness and obvious features of that text. This is because the text content in the page screenshot is what is displayed to the visitor after the page is loaded, typically the illegal information that the visitor wants to access. At the same time, the information in the HTML is also useful and can provide categorization features. The best classifier model for each of the three modalities in the training process was then selected as the basic classifier for the next step of tri-training.

4.3. Evaluation of Multimodal Mutual Assistance Module

The imbalance in prediction performance between modalities may introduce more noise during the tri-training process, thus misleading the further training of the models. Figure 7 shows the results of experiments that used only the prediction results of the HTML text-based classification model, the one with the worst basic performance, as the pseudo-labels. The trend lines in Figure 7 show that the accuracy of each modal classifier decreases. Therefore, the tri-training process must be well designed, and we designed a training approach with a multimodal mutual assistance module. To verify the effectiveness of this approach, we compared co-training, the basic classifiers after basic tri-training, and the basic single-modality models after the tri-training framework with the enhancement module. The results of these experiments are shown in Table 4, where MA means that the model was trained with the multimodal mutual assistance module, and the bolded text shows the better metrics.
Compared to the original tri-training, our proposed multimodal mutual assistance module yields the most significant improvement for the image classifier, with over a 1.1% improvement in all metrics, including a 2.19% increase in recall. While the metrics of the FastText model built on OCR data do not improve significantly, the other two classifiers improve significantly. This shows that, as a single modality, the dataset built on OCR text already has strong features, so training with fusion enhancement does not improve that model; meanwhile, the less effective models learn more from the training process. From Table 4, we can see that each classifier has room for improvement via tri-training, as effective information is introduced to the model through the pseudo-labels obtained from the other modalities' data. The basic classifiers improved more when the multimodal mutual assistance approach was used during tri-training than when it was not. It is evident that the evaluation metrics of the basic classifiers trained with our proposed method outperform those of the basic models. This suggests that information from different modalities can guide the training of models for other modalities through pseudo-labeled data and thereby improve the results.

4.4. Evaluation of Multimodal Data Augmentation Framework

We compared our approach to the co-training-based approach [19], which does not combine the results of its two modalities; in our framework, the modality results are combined. In Table 5, Co-TI represents the co-training results of the models trained on OCR text and web screenshot data, Co-TH represents the co-training results of the models trained on OCR text and HTML text data, Co-HI represents the co-training results of the models trained on HTML text and web screenshot data, Tri-TIH denotes the model based on the three types of data using the basic tri-training method, and MEDAL represents the framework we propose. The bolded text shows the better metrics.
The models fused using the hard voting integration approach show significant improvement in every evaluation metric compared to the co-training models. The graphical comparison in Figure 8 shows that co-training does not combine the advantages of both classifiers well enough to increase the detection rate; instead, the co-training method decreases the accuracy by 3.11%, the precision by 3.55%, the recall by 2.44%, and the $F_1$ metric by 3.36% compared to the best unimodal classifier, FastText(OCR). In comparison with co-training [19], the average improvements of MEDAL in the accuracy, precision, recall, and $F_1$ metrics were 4.71%, 6.23%, 5.75%, and 6.36%, respectively. This demonstrates a dramatic improvement of our approach over co-training [19]. We can also see that the basic tri-training method fuses the three classifiers well, while our method has a further advantage over basic tri-training, improving the accuracy and precision by 1% and the recall and $F_1$ metric by more than 1.4%. Overall, our method performed well on the accuracy, precision, recall, and $F_1$ metrics. The tri-training method can effectively improve the performance, and with the help of the multimodal mutual assistance module and hard voting, the model improves further. MEDAL achieves the accurate detection of illegal websites from a small amount of labeled data and efficiently uses a large amount of unlabeled data.

5. Discussion

Multimodal Mutual Assistance Module. In Section 4.3, we evaluated the effectiveness of the multimodal mutual assistance module in improving the basic classifiers. The results show that the performance of the weak classifiers is substantially improved, while the performance of the best unimodal classifier decreases slightly. On the one hand, this proves that different modalities can indeed improve each other; on the other hand, it shows that the sample selection method of the mutual assistance module needs further improvement, which is a limitation of our work, although this small decrease does not offset the significant improvement in the overall framework performance. Ideally, the module would improve the performance of all basic classifiers, but it cannot guarantee this.
Multimodal Data Augmentation Framework. In Section 4.4, we evaluated the effectiveness of the proposed framework against co-training. Although we achieved better evaluation metrics in the experiments, the time complexity of our algorithm is higher, and the model training process is relatively complex. The experimental results show the validity of the multimodal cross-enhancement approach. Also, the schedule for when to conduct retraining needs to be set in a more principled way.
Web Page Content Defacement. As the relevant departments continue to crack down on illegal websites, these websites have begun to use elaborate camouflage to escape detection; for example, the simplest camouflage is disguising the site title as the name of a company, so that simple title-based identification methods fail. Moreover, attraction websites, which funnel traffic to illegal gambling and pornographic websites, are usually server-side rendered, so the crawlable HTML contains sparse information, making HTML-based detection infeasible. An explicit method for identifying the camouflage techniques used by illegal websites in different modalities is still needed. We must also recognize that the categories of illegal websites are diverse and that there is significant value in identifying illegal websites in an open-world setting.

6. Conclusions

In this paper, we propose a multimodality-based data augmentation framework for illegal website identification. We combined three types of modal data: text extracted from the HTML source code, text extracted from web screenshots using OCR, and the web screenshots themselves. Based on these three types of modal data, we constructed three basic classifiers for further improvement. Our method addresses the problems of effectively using large amounts of unlabeled data and utilizing disagreement information during modal fusion. Because of the imbalanced baseline performances, we designed a multimodal mutual assistance module for the tri-training process, which makes the final prediction results more accurate and reliable; this module achieves mutual lifting across modalities. The experimental results show that our method performs well on the accuracy, precision, recall, and $F_1$ metrics. Our proposed method also has the potential to improve further as more samples accumulate in practical applications. In future work, we will reduce the complexity of the algorithm in application and design an effective attention model to handle new samples in order to better cope with new types of illegal websites. We will also endeavor to reveal the camouflage strategies of illegal websites in order to better detect concealed ones.

Author Contributions

Conceptualization, L.W. and P.X.; methodology, L.W. and P.X.; software, C.W.; validation, L.W. and C.W.; formal analysis, B.G. and H.M.; investigation, M.Z. and W.D.; resources, B.G. and H.M.; data curation, J.Z.; writing—original draft preparation, L.W.; writing—review and editing, M.Z.; visualization, P.X.; supervision, J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Program of China (Grant No. 2021YFB3100500).

Informed Consent Statement

Not applicable.

Data Availability Statement

Data is contained within the article.

Acknowledgments

We appreciate the fund from the National Key R&D Program of China.

Conflicts of Interest

The authors have no competing interests to declare that are relevant to the content of this article.

References

  1. Yang, H.; Du, K.; Zhang, Y.; Hao, S.; Li, Z.; Liu, M.; Wang, H.; Duan, H.; Shi, Y.; Su, X.; et al. Casino royale: A deep exploration of illegal online gambling. In Proceedings of the 35th Annual Computer Security Applications Conference, San Juan, PR, USA, 9–13 December 2019. [Google Scholar]
  2. Gao, Y.; Wang, H.; Li, L.; Luo, X.; Xu, G.; Liu, X. Demystifying Illegal Mobile Gambling Apps. Proc. Web Conf. 2021, 2021, 1447–1458. [Google Scholar]
  3. Gu, Z.; Gou, G.; Liu, C.; Yang, C.; Zhang, X.; Li, Z.; Xiong, G. Let gambling hide nowhere: Detecting illegal mobile gambling apps via heterogeneous graph-based encrypted traffic analysis. Comput. Netw. 2024, 243, 110278. [Google Scholar] [CrossRef]
  4. Sheng, S.; Wardman, B.; Warner, G.; Cranor, L.F.; Hong, J.I.; Zhang, C. An Empirical Analysis of Phishing Blacklists. In Proceedings of the International Conference on Email and Anti-Spam, Perth, Australia, 1–2 September 2011. [Google Scholar]
  5. Sahoo, D.; Liu, C.; Hoi, S.C.H. Malicious URL Detection using Machine Learning: A Survey. arXiv 2017, arXiv:1701.07179. [Google Scholar]
  6. Fan, Y.; Yang, T.; Wang, Y.; Jiang, G. Illegal Website Identification Method Based on URL Feature Detection. Comput. Eng. 2018, 44, 171–177. [Google Scholar] [CrossRef]
  7. Huang, Y.; Yang, Q.; Qin, J.; Wen, W. Phishing URL Detection via CNN and Attention-Based Hierarchical RNN. In Proceedings of the 2019 18th IEEE International Conference on Trust, Security and Privacy in Computing and Communications/13th IEEE International Conference on Big Data Science and Engineering (TrustCom/BigDataSE), Rotorua, New Zealand, 5–8 August 2019; pp. 112–119. [Google Scholar]
  8. Le, H.; Pham, Q.; Sahoo, D.; Hoi, S.C.H. URLNet: Learning a URL Representation with Deep Learning for Malicious URL Detection. arXiv 2018, arXiv:1802.03162. [Google Scholar]
  9. Verma, R.M.; Das, A. What’s in a URL: Fast Feature Extraction and Malicious URL Detection. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy Analytics, Scottsdale, AZ, USA, 22–24 March 2017. [Google Scholar]
  10. Shin, J.; Lee, S.; Wang, T. Semantic Approach for Identifying Harmful Sites Using the Link Relations. In Proceedings of the 2014 IEEE International Conference on Semantic Computing, Newport Beach, CA, USA, 16–18 June 2014; pp. 256–257. [Google Scholar]
  11. Sheu, J.J. Distinguishing Medical Web Pages from Pornographic Ones: An Efficient Pornography Websites Filtering Method. Int. J. Netw. Secur. 2017, 19, 839–850. [Google Scholar]
  12. Ma, X.; Zheng, C.; Li, Z.; Yin, J.; Liu, Q.; Chen, X. A Lightweight Graph-based Method to Detect Pornographic and Gambling Websites with Imperfect Datasets. In Proceedings of the 2022 IEEE International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), Wuhan, China, 9–11 December 2022. [Google Scholar]
  13. Sun, G.; Ye, F.; Chai, T.; Zhang, Z.; Tong, X.; Prasad, S. Gambling Domain Name Recognition via Certificate and Textual Analysis. Comput. J. 2022, 66, 1829–1839. [Google Scholar] [CrossRef]
  14. Liu, D.; Lee, J.; Wang, W.; Wang, Y. Malicious Websites Detection via CNN based Screenshot Recognition. In Proceedings of the 2019 International Conference on Intelligent Computing and its Emerging Applications (ICEA), Tainan City, Taiwan, 30 August–1 September 2019; pp. 115–119. [Google Scholar]
  15. Li, L.; Gou, G.; Xiong, G.; Cao, Z.; Li, Z. Identifying Gambling and Porn Websites with Image Recognition. In Proceedings of the Pacific Rim Conference on Multimedia, Harbin, China, 28–29 September 2017. [Google Scholar]
  16. Yuan, K.; Tang, D.; Liao, X.; Wang, X.; Feng, X.; Chen, Y.; Sun, M.; Lu, H.; Zhang, K. Stealthy Porn: Understanding Real-World Adversarial Images for Illicit Online Promotion. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 20–22 May 2019; pp. 952–966. [Google Scholar]
  17. Chen, Y.; Zheng, R.; Zhou, A.; Liao, S.; Liu, L. Automatic Detection of Pornographic and Gambling Websites Based on Visual and Textual Content Using a Decision Mechanism. Sensors 2020, 20, 3839. [Google Scholar] [CrossRef]
  18. Zhao, J.; Shao, M.; Peng, H.; Wang, H.; Li, B.; Liu, X. Porn2Vec: A robust framework for detecting pornographic websites based on contrastive learning. Knowl. Based Syst. 2021, 228, 107296. [Google Scholar] [CrossRef]
  19. Wang, C.; Xue, P.; Zhang, M.; Hu, M. Identifying Gambling Websites with Co-training. In Proceedings of the International Conference on Software Engineering and Knowledge Engineering, Virtual, 1–10 July 2022. [Google Scholar]
  20. Wang, C.; Zhang, M.; Shi, F.; Xue, P.; Li, Y. A Hybrid Multimodal Data Fusion-Based Method for Identifying Gambling Websites. Electronics 2022, 11, 2489. [Google Scholar] [CrossRef]
  21. Zhao, R. The Chameleon on the Web: An Empirical Study of the Insidious Proactive Web Defacements. Proc. ACM Web Conf. 2023, 2023, 2241–2251. [Google Scholar]
  22. Zhou, Z.H.; Li, M. Tri-training: Exploiting unlabeled data using three classifiers. IEEE Trans. Knowl. Data Eng. 2005, 17, 1529–1541. [Google Scholar] [CrossRef]
  23. Li, J.; Zhou, H.; Wu, S.; Luo, X.; Wang, T.; Zhan, X.; Ma, X. FOAP: Fine-Grained Open-World Android App Fingerprinting. In Proceedings of the 31st USENIX Security Symposium (USENIX Security 22), Boston, MA, USA, 10–12 August 2022; pp. 1579–1596. [Google Scholar]
  24. Li, J.; Wu, S.; Zhou, H.; Luo, X.; Wang, T.; Liu, Y.; Ma, X. Packet-Level Open-World App Fingerprinting on Wireless Traffic. In Proceedings of the 2022 Network and Distributed System Security Symposium, San Diego, CA, USA, 24–28 April 2022. [Google Scholar]
  25. Kumar, A.; Sachdeva, N. Multimodal Cyberbullying Detection Using Capsule Network with Dynamic Routing and Deep Convolutional Neural Network. Multimed. Syst. 2022, 28, 2043–2052. [Google Scholar] [CrossRef]
  26. Lin, D.; Ma, Y.; Li, Y.; Song, X.; Wu, J.; Nie, L. OFAR: A Multimodal Evidence Retrieval Framework for Illegal Live-streaming Identification. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, Taipei, Taiwan, 23–27 July 2023; SIGIR ’23. pp. 3410–3414. [Google Scholar] [CrossRef]
  27. Zhou, S.; Ruan, L.; Xu, Q.; Chen, M. Multimodal fraudulent website identification method based on heterogeneous model ensemble. China Commun. 2023, 20, 263–274. [Google Scholar] [CrossRef]
  28. Ul Hassan, I.; Ali, R.H.; Ul Abideen, Z.; Khan, T.A.; Kouatly, R. Significance of Machine Learning for Detection of Malicious Websites on an Unbalanced Dataset. Digital 2022, 2, 501–519. [Google Scholar] [CrossRef]
  29. Blum, A.; Mitchell, T.M. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, Madison, WI, USA, 24–26 July 1998. [Google Scholar]
  30. Shivani Sri Varshini, U.; Praneetha Sree, R.; Srinivas, M.; Subramanyam, R.B.V. I-S2FND: A novel interpretable self-ensembled semi-supervised model based on transformers for fake news detection. J. Intell. Inf. Syst. 2024, 62, 355–375. [Google Scholar] [CrossRef]
  31. Qian, T.; Liu, B.; Chen, L.; Peng, Z.; Zhong, M.; He, G.; Li, X.; Xu, G. Tri-Training for Authorship Attribution with Limited Training Data. Neurocomputing 2014, 171, 798–806. [Google Scholar] [CrossRef]
  32. Wang, L.; Zhang, J.; Tian, Q.; Li, C.; Zhuo, L. Porn Streamer Recognition in Live Video Streaming via Attention-Gated Multimodal Deep Features. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4876–4886. [Google Scholar] [CrossRef]
  33. Yu, J.; Yin, H.; Gao, M.; Xia, X.; Zhang, X.; Hung, N.Q.V. Socially-Aware Self-Supervised Tri-Training for Recommendation. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021. [Google Scholar]
  34. An, S.; Zhu, H.; Zhang, J.; Ye, J.; Wang, S.; Yin, J.; Zhang, H. Deep Tri-Training for Semi-Supervised Image Segmentation. IEEE Robot. Autom. Lett. 2022, 7, 10097–10104. [Google Scholar] [CrossRef]
  35. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  36. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  37. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  38. Liu, P.; Qiu, X.; Huang, X. Recurrent Neural Network for Text Classification with Multi-Task Learning. arXiv 2016, arXiv:1605.05101. [Google Scholar]
  39. Zhou, P.; Shi, W.; Tian, J.; Qi, Z.; Li, B.; Hao, H.; Xu, B. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Berlin, Germany, 7–12 August 2016. [Google Scholar]
  40. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), Valencia, Spain, 3–7 April 2017. [Google Scholar]
  41. Vaswani, A.; Shazeer, N.M.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All you Need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
Figure 1. A web page screenshot of an illegal website and the extracted OCR text and HTML text. One Chinese sentence in the extracted text means “high commissions are settled daily”, and another means “we promise to provide every customer with the safest, fairest, and most equitable gambling games”. Classification from different modalities may yield inconsistent results.
Figure 2. Multimodality-based Effective Data Augmentation Framework for Illegal Website Identification.
Figure 3. Basic image classifier structure.
Figure 4. Basic text classifier structure.
Figure 5. The multimodal mutual assistance module attached to the framework.
Figure 6. Basic classifier selection.
Figure 7. Decreasing trends in accuracy when misled by the worst model during basic tri-training.
Figure 8. Final performance improvement results.
Table 1. Literature review of important research works in the field of illegal website identification based on multimodal methods.

| Author Name | Categories | Datatypes | Basic Model |
|---|---|---|---|
| Lin et al. [26] | Live-streaming | Image–Text pairs | Encoder |
| Kumar and Sachdeva [25] | Cyberbullying | Text, Image | CapsNet, CNN |
| Zhou et al. [27] | Porn, Gambling, Fake, etc. | Text, Image, URL | BERT, ResNet, LR |
| Chen et al. [17] | Porn, Gambling | Text, Image, URL | Doc2Vec, SVM, RF |
| Zhao et al. [18] | Porn | Text, Image, HTML Structure | HGNN |
| Wang et al. [20] | Gambling | OCR Text, Image | LSTM, ResNet |
| Wang et al. [19] | Gambling | OCR Text, Image | TextRNN, CNN |
Table 2. Dataset descriptions.

| Classes | Total | Labeled Train | Labeled Test | Unlabeled |
|---|---|---|---|---|
| Gambling | 1200 | 500 | 200 | 500 |
| Porn | 1200 | 500 | 200 | 500 |
| Attraction | 1200 | 500 | 200 | 500 |
| Normal | 4800 | 2000 | 800 | 2000 |
Table 3. Evaluation results of the basic classifiers.

| Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| VGG16 | 0.9557 | 0.9583 | 0.9338 | 0.9454 |
| FastText(HTML) | 0.9062 | 0.8883 | 0.8565 | 0.8713 |
| FastText(OCR) | 0.9797 | 0.9676 | 0.9594 | 0.9625 |
Table 4. Comparisons of detection performances between the basic models and the single-modality models after the basic tri-training and mutual assistance processes.

| Methods | Model | MA | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| Basic | FastText(HTML) | – | 0.9062 | 0.8883 | 0.8565 | 0.8713 |
| Basic | FastText(OCR) | – | 0.9797 | 0.9676 | 0.9594 | 0.9625 |
| Basic | VGG16 | – | 0.9557 | 0.9583 | 0.9337 | 0.9453 |
| Over Sampling | FastText(HTML) | – | 0.8703 | 0.8198 | 0.8060 | 0.8108 |
| Over Sampling | FastText(OCR) | – | 0.8656 | 0.8152 | 0.7974 | 0.8048 |
| Over Sampling | VGG16 | – | 0.9257 | 0.8980 | 0.9075 | 0.9022 |
| Under Sampling | FastText(HTML) | – | 0.8750 | 0.8357 | 0.8064 | 0.8185 |
| Under Sampling | FastText(OCR) | – | 0.8812 | 0.8379 | 0.8177 | 0.8272 |
| Under Sampling | VGG16 | – | 0.9228 | 0.9129 | 0.8875 | 0.8994 |
| Tri-training | FastText(HTML) | None | 0.9071 | 0.8840 | 0.8619 | 0.8723 |
| Tri-training | FastText(HTML) | MA | 0.9143 | 0.9032 | 0.8669 | 0.8830 |
| Tri-training | FastText(OCR) | None | 0.9829 | 0.9814 | 0.9700 | 0.9753 |
| Tri-training | FastText(OCR) | MA | 0.9814 | 0.9768 | 0.9712 | 0.9738 |
| Tri-training | VGG16 | None | 0.9700 | 0.9653 | 0.9494 | 0.9558 |
| Tri-training | VGG16 | MA | 0.9814 | 0.9808 | 0.9713 | 0.9757 |
Table 5. Comparisons of detection performances between our method and the co-training method.

| Methods | Model | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|
| Co-training [19] | Co-TI | 0.9486 | 0.9283 | 0.9363 | 0.9289 |
| Co-training [19] | Co-TH | 0.9543 | 0.9400 | 0.9406 | 0.9361 |
| Co-training [19] | Co-HI | 0.9429 | 0.9282 | 0.9281 | 0.9244 |
| Tri-training | Tri-TIH | 0.9857 | 0.9846 | 0.9750 | 0.9794 |
| Tri-training | MEDAL | 0.9957 | 0.9944 | 0.9925 | 0.9934 |
