Article

Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9837; https://doi.org/10.3390/app13179837
Submission received: 29 May 2023 / Revised: 1 August 2023 / Accepted: 28 August 2023 / Published: 30 August 2023

Abstract

With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focuses on limited-scale detection through manual annotation. However, the need for effective solutions to govern illicit websites is urgent, requiring the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in batch detection of illicit websites. Therefore, in this paper, we propose a method that combines web mapping engine big data to perform unsupervised multi-modal deep clustering (MDC) for illicit website discovery. By extracting features based on contrastive learning methods from webpage screenshots and OCR text, we conduct feature similarity clustering to identify illicit websites. Finally, our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels, and a 92.39% accuracy at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping data entries, we obtained 397,275 illicit websites primarily focused on gambling and pornography, each with 14 attributes. This dataset is made publicly available.

1. Introduction

Amidst the rapid expansion of the Internet, concerns regarding network security have become increasingly prominent. Notably, the proliferation of illicit websites, including those involved in gambling, pornography, and fraudulent activities, has witnessed a persistent surge, posing a grave threat to individuals’ financial assets and information security. According to “Facts and Figures” released by the ITU [1], as of December 2022, about 52.3% of users had been harassed by illegal websites. Illegal websites not only inflict financial losses upon their victims but also compromise personal privacy and sensitive information, with potential repercussions for national security as well. The regulation of illegal websites is confronted with numerous obstacles and challenges, and the prompt and accurate discovery and mitigation of such illicit platforms from a technical standpoint have emerged as pressing issues demanding immediate solutions. Hence, this research paper endeavors to explore an extensive and expedited approach for the detection of illegal websites. Additionally, it examines the use of technical means to enhance the efficacy and precision of governing such illicit platforms, thereby ensuring the wholesome development of the Internet ecosystem.
Typically, existing research on combating illegal websites primarily concentrates on identifying platforms associated with gambling and pornography. These studies often rely on inputting specific website information such as text or images into detection models, which then render judgments on the legality of the website. This approach is crucial for detecting illicit websites. However, when dealing with large-scale and constantly evolving illicit websites, this method struggles to adapt to the disruptive changes posed by these websites and to effectively control the entire illicit ecosystem. On the other hand, utilizing clustering methods allows for identification from a global sample perspective, enabling adaptive classification detection based on the similarities between samples. To achieve rapid monitoring and comprehensive control of the entire illegal landscape, innovative governance solutions need to be adopted.
Cyberspace surveying and mapping entails the detection of diverse assets within the cyberspace, acquiring various attributes associated with these assets, and subsequently conducting fusion processing and correlation analysis on the collected network asset data to construct a comprehensive map of cyberspace resources. The cyberspace surveying and mapping engine continuously scans the network assets across the entire network, capturing and retaining information pertaining to all network websites. This valuable resource provides crucial data support for the governance of illegal websites.
Therefore, in this research paper, we utilize the website asset data from the network surveying and mapping platform to perform image–text contrastive learning (Section 3.1), generating representations of website screenshots and the corresponding text. These representations serve a dual purpose: they facilitate subsequent website clustering for the discovery of illegal websites and expand the search capabilities of the network surveying and mapping platform. Additionally, we employ the semantic features obtained through image–text contrastive learning to conduct cluster analysis and systematically identify illegal websites. Through the analysis of 3.7 million website asset data entries within the network asset mapping platform, we successfully identified 397,275 instances of illegal websites.
In summary, our study offers the following contributions:
  • We present a novel approach that combines the extensive data support provided by the network mapping engine. Building upon the principles of the CLIP model, we modify the text model and employ transfer learning techniques to adapt it to the Chinese language environment. Additionally, we leverage unsupervised contrastive learning to extract semantic features from website assets. These semantic features can be directly utilized in the search function of the network mapping engine and enable semantic clustering of website assets.
  • Building upon deep clustering technology, we introduce a multi-modal deep clustering (MDC) method that synergizes with a network mapping engine to expedite the identification of illegal websites. This approach significantly reduces the costs associated with discovering illicit platforms and enables the rapid detection of large-scale illegal websites. Ultimately, our method offers valuable technical support for the governance of illegal websites.
  • In the experiment, we used our proposed method to analyze a large number of website assets in the network surveying and mapping platform and successfully discovered a large number of illegal websites, which proved the effectiveness and feasibility of our method in the governance of illegal websites. At the same time, for the first time, we share a database of illegal websites discovered by our method, which includes 397,275 pieces of illegal website data, each containing 14 attributes. This dataset is made publicly available at https://www.kaggle.com/listone/black-website (accessed on 1 August 2023). We hope that other researchers can also use these data to conduct research on illegal website governance.
The HTML text of web pages and screenshots of websites are important attributes of a website. The main objective of our research is to analyze website attributes using clustering techniques, combined with large-scale data from the network mapping engine, to discover illicit websites.
In conclusion, this research paper presents an innovative utilization of a network mapping engine within the realm of governing illegal websites. Furthermore, it introduces a multi-modal deep clustering approach for identifying illegal websites, leveraging an image–text contrastive learning method.

2. Background and Related Work

2.1. Background

Illegal websites refer to websites engaged in illegal activities, such as pornography, gambling, piracy, fraud, network attacks, etc. [2]. These websites often change domain names and servers, and also use some camouflage methods to avoid automated detection, to evade supervision and blocking. Regardless of their tactics for evasion and transformation, these illegal websites persist in their need to remain accessible to their victims. Therefore, these illegal websites inevitably become part of the comprehensive cyberspace map created by the network mapping engine.

2.1.1. Cyberspace Mapping Engine

At present, there are many cyberspace mapping systems in the world. For example, Shodan [3] in the United States mainly scans and identifies network infrastructure equipment such as servers and network cameras, and provides a convenient API that supports calls from multiple programming languages. The Censys [4] surveying and mapping system, developed by researchers at the University of Michigan, uses the self-developed ZMap scanning tool to collect detailed information on IPs, certificates, and websites, which can help users collect and sort out an organization’s attack exposure surface. BinaryEdge [5], a surveying and mapping product from Switzerland, scans the entire network, maps the attack exposure surface of nearly 5 billion devices and 15 million business groups, and is committed to providing real-time threat intelligence for enterprise organizations to reduce their risk of being attacked. China also has many network surveying and mapping engines, including ZoomEye [6] from Zhidaochuangyu, FOFA [7] from Huashunxinan, Quake [8], a network-wide cyberspace surveying and mapping system independently developed and designed by the 360 Network Security Response Center, and the RaySpace [9] platform from Surbana Security.
The network mapping engine encompasses the storage of IP network assets across the entire network. In previous applications, the primary focus of the network mapping engine was predominantly centered around network security attack and defense technologies. Through the detection of organizational assets and the simple correlation of vulnerability information, the attack surface of network target assets was sorted out. Nonetheless, the extensive repository of website assets within the network mapping engine holds potential beyond the realm of network attack and defense. It also harbors a significant number of hidden illegal websites. Proficient analysis of these assets can offer valuable technical resources for the governance of illicit industries.
The network assets stored within the network mapping engine exhibit complexity and diversity. However, supervised learning approaches necessitate labeled samples, which can be costly and time-consuming to obtain. Moreover, limited samples may not effectively represent the entirety of the network assets, further exacerbating the challenge. Under such circumstances, a supervised learning model trained solely on labeled samples may exhibit high accuracy within the training dataset but could be subject to bias. Consequently, when dealing with the analysis of extensive network surveying and mapping asset data, we opt to learn the semantic representation of the data through unsupervised methods. By employing unsupervised clustering techniques, we can effectively discover and identify illegal websites without relying on labeled samples.

2.1.2. Website Page Camouflage

Currently, numerous illegal websites employ various camouflage methods to evade detection by automated monitoring systems. These methods include, but are not limited to:
  • Variant text: These methods involve the use of Chinese and English replacements, numerical substitutions, special symbols, artistic fonts, and other techniques. Such strategies often render text classification models trained on standard text ineffective.
  • Anti-image: Although relatively uncommon, this method involves the addition of disordered symbols of varying colors, blurring effects, and other visual distortions to webpage images. Such techniques can significantly hinder the performance of certain image classification models.
  • Verification code: Certain web pages implement verification codes as a security measure, requiring users to enter the correct code to access the authentic content. This approach presents a challenge for web crawlers without the capability to recognize and process verification codes, thus impeding their ability to obtain the genuine web content.
  • Fake 404 pages: This deceptive method is relatively straightforward to implement. The page’s content remains that of the illegal website, while only the title section is modified to display ‘404 Not Found’. This technique can successfully deceive models that solely rely on title detection, as they may mistakenly classify the illegal website as a non-existent or inaccessible page.
  • Building webpages with images: This approach exclusively relies on images to compose webpage content, with only the <img> tag in the HTML referencing these images. Consequently, models solely reliant on HTML detection will be ineffective in identifying the webpage’s actual content.
Through observation of numerous illicit websites, we have found that constructing webpages consisting only of images to drive traffic is quite common. By examining the HTML source code of these webpages, we observed that they contain only <img> tags that reference images, while the valuable key text is embedded within the images themselves. As shown in Figure A1, the four gambling websites are all composed of images. The text “真人竞猜” (meaning ‘live real-person betting’) appears in Figure A1a, the text “开元棋牌” (meaning ‘Kaiyuan chess and cards’) in Figure A1b, the text “真人真钱真娱乐,棋牌体育” (meaning ‘real people, real money, real entertainment; board games and sports’) in Figure A1c, and the text “彩票投注” (meaning ‘lottery betting’) in Figure A1d. The key text on these images can effectively improve recognition accuracy.
The practice of constructing web pages solely with images undermines the effectiveness of methods solely focused on detecting web page HTML. Hence, in this paper, we opt to extract text from web screenshots as features instead of relying on HTML.

2.2. Related Work

2.2.1. Detection of Illegal Websites

Identification methods for illegal websites, such as gambling, pornography, phishing websites, etc., can be broadly categorized into two groups. The first category involves the utilization of supervised learning techniques, while the second category employs unsupervised methods.
(1)
Supervised learning for illegal website detection
Supervised methods for illegal website detection rely on labeled datasets to train models and learn the distinguishing features of illegal websites. This approach offers the advantage of high accuracy; however, it necessitates a substantial amount of labeled data. Here are a few commonly employed supervised approaches:
Method based on machine learning. The method based on machine learning is a commonly used detection method for illegal websites, which uses a variety of machine learning algorithms to train the model. The method first extracts various features from the website, such as the page title, keywords, description, HTML tags, etc. Then, it uses these features to train a classifier to separate illegal websites from normal websites. Commonly used machine learning algorithms include Naive Bayes [10], Support Vector Machines [11], decision trees [12], random forests [13], etc. The Naive Bayes algorithm is a classic machine learning algorithm that classifies based on Bayes’ theorem and the feature-independence assumption. A Support Vector Machine is a classifier that performs classification based on the margin-maximization principle. A decision tree is a tree-based classifier that processes data hierarchically to achieve higher accuracy. A random forest is an ensemble learning algorithm that performs classification by combining multiple decision trees. Sahingoz et al. [14] used seven different classification algorithms and features based on natural language processing (NLP) to detect phishing websites; using the URL features of phishing websites, they achieved the best performance with the random forest algorithm. Kalabarige et al. [15] proposed MLSELM, which combines MLP (Multilayer Perceptron), KNN (K-Nearest Neighbors), RF (random forest) and other algorithms into a three-layer stacked classifier, evaluated it on four phishing datasets, and the final results outperformed the baseline models.
Method based on deep learning. The deep learning-based method uses a multi-layer neural network for classification. This method first vectorizes the attribute information of the website and then uses a deep neural network to classify the data. Tang et al. [16] proposed a phishing website detection framework based on deep learning, which combines strategies such as whitelist filtering, blacklist interception, and model prediction to improve detection accuracy; their RNN-GRU model achieved the best performance on multiple datasets. Liu et al. [17] considered semantic information at different scales, passing text information such as the URL, title, and body text into a CNN model and an LSTM model, respectively, for learning; feature fusion was performed at the output layer, and the judgment result was finally output. In a real-world application, the model successfully detected 3016 phishing websites. Zhao et al. [18] proposed a robust end-to-end framework, Porn2Vec, that uses contrastive learning to detect pornographic websites. They model pornographic websites through heterogeneous graphs composed of websites, webpages, images, texts, and their interactions; the pornographic website detection task is transformed into a heterogeneous graph node classification task, and the robustness of the model is improved through the representation of multiple data types. Wang et al. [19] fused visual and textual semantic features through the self-attention mechanism and fused the prediction results of the image classifier, text classifier, and multi-modal classifier at a later stage, achieving excellent results on their self-collected gambling dataset.
In the realm of supervised methods, a remarkable characteristic is their impressive accuracy, often surpassing 95% in recognition rates. Certain pioneering approaches have even demonstrated exceptional achievements, exceeding 99% accuracy. An exemplary case is depicted in reference [19], where the highest recognition accuracy of 99.31% was accomplished using a self-labeled dataset comprising 800 gambling websites and 800 normal websites. However, it is crucial to acknowledge that models trained on small-scale datasets may encounter challenges in terms of robustness, thereby resulting in a disparity between experimental and real-world performance. Scaling up labeled samples can present formidable obstacles, demanding substantial human and financial resources.
(2)
Unsupervised learning for illegal website detection
Unsupervised methods for illegal website detection do not rely on labeled datasets for training. Instead, they classify websites by discovering patterns and structures within the data. The clustering-based method is a common unsupervised learning method, which classifies data into different clusters. This method first extracts various features from the website, such as page titles, keywords, descriptions, HTML tags, etc. Then, it uses the clustering algorithm to divide the data into multiple clusters; each cluster represents a type of website. The commonly used clustering algorithms include k-means, DBSCAN, hierarchical clustering, etc. K-means is a classical clustering algorithm, which divides data into K clusters, and the center of each cluster is the average value of all data points in the cluster. DBSCAN is a density-based clustering algorithm that clusters high-density data points into one category. Hierarchical clustering is a clustering algorithm based on a tree structure. It divides the data into a series of hierarchical structures, each of which represents a type of website. Fan et al. [20] used a Fast Unfolding algorithm to cluster websites and extract URL features of illegal websites, and judged whether they were illegal websites by detecting the URL features of unknown websites. Xu et al. [21] used a density clustering algorithm to cluster dark web sites, phishing sites, and normal websites, which proved the effectiveness of clustering on the classification of dark web sites and phishing sites. Li et al. [22] proposed a gambling website detection method based on the PAM probabilistic topic model. By giving different weights to the structured information of the website, it is analyzed whether it has a high probability of tending to the ‘gambling’ topic for classification.
In summary, the detection of illicit websites is an important problem that can be addressed using supervised and unsupervised learning methods. Supervised methods [19] generally outperform unsupervised methods [22] in terms of recognition accuracy, with unsupervised methods often struggling to exceed 90% accuracy. However, supervised methods heavily rely on annotated training data, which is a labor-intensive task. Insufficient annotated samples can lead to a decline in model robustness. On the other hand, unsupervised learning methods do not require annotated data and can scale up the sample size without significant limitations, enabling the handling of unknown types of illicit websites. In this study, we focus on the analysis of massive network asset data from web mapping platforms. It is challenging to effectively annotate such large-scale datasets to represent the overall data characteristics. Therefore, we have chosen to employ unsupervised methods for the analysis of the extensive web mapping data, to uncover potential illicit websites.

2.2.2. Deep Clustering

Deep clustering refers to a clustering analysis technique that utilizes deep learning algorithms. Unlike traditional clustering algorithms, deep clustering is particularly adept at handling high-dimensional data. By leveraging the power of deep learning, it can automatically uncover hidden structures and patterns within the data, enabling the discovery of more complex data relationships and ultimately improving the accuracy of clustering results.
The early deep clustering method [23,24] used an auto-encoder to learn the representation of the data, and the data representation directly used k-means to obtain the clustering results. Later, the representation learning module of deep clustering added clustering prior, and IDFD [25] improved the performance of clustering by learning the similarity between sample instances and reducing the correlation of features. DeepCluster [26] introduced the features obtained by the feature extraction network into the k-means algorithm to generate pseudo-labels, and then reduced the gap between the classification model prediction results and pseudo-labels to update the network weights. SCAN [27] first uses comparative learning to obtain a feature extractor that can obtain image semantic information, and then uses this semantic feature to cluster, to prevent clustering from relying on low-level features. Experiments show that semantic features have a significant improvement in clustering results. The previous methods separate the representation learning and clustering modules, and perform optimization learning independently. There are also some algorithms that optimize the representation learning module and the clustering module in an end-to-end manner. The representative method is DEC [28]. DEC combines the autoencoder and the self-trainer, initializes the representation learning module through the encoder in the autoencoder, and uses the self-training strategy to optimize the representation learning module and the clustering module simultaneously.
In summary, deep clustering leverages different network architectures to effectively process diverse types of data. For example, convolutional neural networks (CNNs) excel at handling image data, while recurrent neural networks (RNNs) or transformer models are well-suited for text sequence data. This versatility allows deep clustering to outperform traditional clustering methods when dealing with complex data. By leveraging specialized network structures, deep clustering can capture intricate patterns and dependencies within the data, leading to more accurate and robust clustering results.

3. Methodology

In this section, we propose a comprehensive method that utilizes the abundant asset data from the network mapping engine to cluster and uncover illegal websites. Our method comprises three key stages: acquiring the semantic representation of website data, clustering the website data based on the similarity of representations, and employing the characterization of website data and clustering model to identify illegal websites within a vast sample pool. By leveraging these stages, our method aims to achieve accurate and efficient discovery of illegal websites using the available network asset data.
Figure 1 illustrates the two-stage clustering method employed for the discovery of illegal websites. The proposed model utilizes two modal data inputs, with one being image data representing screenshots of the websites. To address the image-based webpage camouflage method discussed in Section 2.1.2, the text data used in our model are obtained through OCR (Optical Character Recognition), which includes both Chinese and English text. In our approach, we utilize the pre-trained image encoder model from CLIP [29] for processing the image data. However, since the original CLIP model is trained primarily on English corpus, it may not be ideal for handling mixed Chinese and English text extracted through OCR. To address this, we replace the text encoder component of the original CLIP model with the BERT [30] model, which is better suited for processing multilingual text. After the pre-training phase, both the image and text data can be transformed into semantic vectors, capturing the underlying semantic meaning. These semantic vectors are then fed into the MLP (Multi-Layer Perceptron) clustering model. The clustering model analyzes the similarity between the semantic vectors and groups samples with similar semantic distances into the same category, effectively achieving the clustering of the data.
In the following sections, we will introduce the specific content of each part in detail.

3.1. Data Representation

With the advancements in computer computing power, training large-scale deep learning models has become more viable. Furthermore, the internet is teeming with vast amounts of unlabeled data, which has fueled the growing interest in unsupervised learning. In recent years, contrastive learning has emerged as a prominent approach in unsupervised learning, yielding impressive outcomes. Contrastive learning aims to learn high-quality feature representations by maximizing the similarity between positive samples and minimizing the similarity between positive samples and negative samples. It eliminates the need for manual annotations on unlabeled data, relying instead on learning representative features from the data, which can be applied to various tasks. Moreover, contrastive learning leverages the latent information within the data to enhance the model’s generalization capability and robustness. The CLIP [29] model is a multimodal unsupervised learning model that effectively learns representations of both images and texts. It establishes a bridge between image and text representations through an objective function, ensuring that the representations of images and texts do not reside in separate distribution spaces.
In contrastive learning, the CLIP model is a multimodal model with parallel image and text branches, and its training target is constructed by computing the similarity of the feature vectors of the two branches. In this paper, $f_{img}(\cdot)$ denotes the image encoder, $f_{text}(\cdot)$ denotes the text encoder, $X_{img} \in \mathbb{R}^{N \times H \times W \times C}$ denotes the images in a batch, and $X_{text} \in \mathbb{R}^{N \times L}$ denotes the text in a batch; the feature representation of each modality is then obtained as:

$$F_{img} = f_{img}(X_{img}) \in \mathbb{R}^{N \times D_i}, \qquad F_{text} = f_{text}(X_{text}) \in \mathbb{R}^{N \times D_t}$$

$D_i$ and $D_t$ are the feature vector dimensions of the image and the text, respectively. After obtaining each modality’s feature representation, the image features and text features are projected to the same dimension $D_e$ through a linear layer and then L2-normalized to keep the numerical scales consistent:

$$F_{img}^{norm} = \frac{F_{img} W_{img}}{\left\| F_{img} W_{img} \right\|_{2}} \in \mathbb{R}^{N \times D_e}, \qquad F_{text}^{norm} = \frac{F_{text} W_{text}}{\left\| F_{text} W_{text} \right\|_{2}} \in \mathbb{R}^{N \times D_e}$$

At this point, as shown in the pretext model in Figure 1, multiplying the image feature matrix by the text feature matrix yields the scoring matrix for the batch. The diagonal contains the scores of the $N$ paired positive sample pairs, while the remaining elements of the matrix are the scores of the $N^2 - N$ negative sample pairs:

$$M = (F_{img}^{norm})(F_{text}^{norm})^{T} \in \mathbb{R}^{N \times N}$$

After that, the cross-entropy loss is calculated for each row and each column of $M$, and the total loss of the model is obtained by summing these losses.
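A minimal PyTorch sketch of this symmetric objective, assuming the two encoders already return the batch feature matrices; the temperature scaling is a common CLIP-style addition and not part of the formula above:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(F_img, F_text, W_img, W_text, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of N paired samples.

    F_img: (N, D_i) image features; F_text: (N, D_t) text features.
    W_img: (D_i, D_e) and W_text: (D_t, D_e) project both modalities to D_e.
    The temperature value is an assumed hyperparameter.
    """
    # project to the shared dimension D_e and L2-normalize each row
    z_img = F.normalize(F_img @ W_img, dim=-1)
    z_text = F.normalize(F_text @ W_text, dim=-1)

    # (N, N) scoring matrix M; the diagonal holds the N positive pairs,
    # the off-diagonal entries are the N^2 - N negative pairs
    M = z_img @ z_text.t() / temperature

    targets = torch.arange(M.size(0), device=M.device)
    # cross-entropy over each row (image -> text) and each column (text -> image)
    loss_img = F.cross_entropy(M, targets)
    loss_text = F.cross_entropy(M.t(), targets)
    return (loss_img + loss_text) / 2
```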

3.1.1. Image Representation

The image representation in our approach utilizes the CLIP image encoder model. However, due to the scale of the problem and the limitations of local computing power, we employ the ViT-B/32 model in this paper. The model structure is illustrated in Figure 2.
The model is based on the Transformer, and its structure can be divided into the following parts: an input embedding layer, a Transformer Encoder layer, and an MLP Head layer. The embedding layer converts the image into a sequence of token vectors and is implemented by a convolution layer with a kernel size of 32 × 32, a stride of 32, and 768 convolution kernels. A class token is then prepended and a position encoding is added; together these constitute the embedding layer of ViT. The Transformer Encoder layer stacks the Encoder block 12 times. The MLP block inside each Encoder block is a linear projection layer that keeps the input and output dimensions of the Encoder block unchanged. The MLP Head layer is composed of a linear layer and an activation function layer.
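The embedding layer described above can be sketched as follows, assuming square 224-pixel inputs; this illustrates only the patchify, class-token, and position-encoding steps, not the full ViT-B/32:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT-B/32 style embedding: 32x32 conv patches, class token, position codes."""
    def __init__(self, img_size=224, patch=32, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # 768 kernels
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, num_patches, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                   # prepend the class token
        return x + self.pos_embed                        # add the position encoding
```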

3.1.2. Text Representation

The text representation model in our approach utilizes the BERT-BASE model, which consists of a total of 110 million parameters. Compared to the BERT-LARGE model with 340 million parameters, BERT-BASE is smaller in size, resulting in faster operation speed and making it more suitable for processing a large number of samples efficiently. The model structure is depicted in Figure 3.
Our text data are obtained by extracting the text from web page screenshots through OCR technology. First, each input sentence is split into $n$ words $[w_1, w_2, \ldots, w_n]$, and then a [CLS] token and a [SEP] token are added before and after the token sequence to construct the input token sequence for the BERT model. After encoding with the BERT model, we obtain the representation of each token $e^{[l]} = [e^{[l]}_{[CLS]}, e^{[l]}_{[token_1]}, \ldots, e^{[l]}_{[SEP]}]$; we take out the representation of the [CLS] token, pass it into a linear projection layer, and map it to the dimension of the text representation vector we need.
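A sketch of this text branch using the HuggingFace transformers interface; the checkpoint name follows the mengzi-bert-base model mentioned in Section 3.1.3 and the 768-to-512 projection, but the exact loading code is illustrative rather than the paper's implementation:

```python
import torch.nn as nn
from transformers import BertModel

class TextEncoder(nn.Module):
    """BERT-BASE text encoder: take the [CLS] representation, project 768 -> 512."""
    def __init__(self, name="Langboat/mengzi-bert-base", out_dim=512):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)      # assumed checkpoint name
        self.proj = nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                # representation of [CLS]
        return self.proj(cls)                            # (B, 512) text representation
```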

3.1.3. Transfer Learning

Due to the significant number of parameters in the representation model, a large dataset is required for training. The image encoder of the original CLIP model was trained on a dataset containing over 400 million samples, resulting in excellent image representation capabilities. Therefore, we have conducted two experiments on the image encoder of CLIP. The first experiment involved keeping the original parameters unchanged and solely training the text encoder model to align the text feature space with the image feature space. The second experiment involved fine-tuning the parameters of the image encoder during training. Detailed experimental results can be found in Section 4.2.
For the text encoder, since the original CLIP is trained on an English corpus, it is not suitable for this task, and the BERT-BASE model has a large number of parameters (110 million). In order to reduce the training cost and accelerate the convergence of the model, we train the text encoder with a transfer learning-based method. We use the mengzi-bert-base pre-trained model [31], which is trained on more than 300 GB of Chinese corpus. Our transfer learning-based method can therefore avoid overfitting and reduce training time when fine-tuning parameters on a small-scale text dataset. We only extract the representation of the [CLS] token from the BERT model output, whose dimension is 768, which is not suitable for our task. Therefore, we add a linear projection layer after the BERT output layer to reduce its dimension to 512.
Furthermore, during the fine-tuning process of the aforementioned encoder models, we employed a very small learning rate to prevent rapid distortion of the model parameters. The learning rate was set to be less than 0.00005, enabling the two encoder models to adapt to new data while maintaining the stability of the original excellent feature extraction capability.
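The two fine-tuning regimes described above (keeping the image encoder frozen versus also fine-tuning it) differ essentially in which parameters receive gradients. A sketch of the optimizer setup under that assumption, with cosine annealing and the small learning rate mentioned above; the function arguments are placeholders:

```python
import torch

def build_optimizer(image_encoder, text_encoder, finetune_image=False, lr=5e-5):
    """First regime: freeze the CLIP image encoder, train only the text branch.
    Second regime: also fine-tune the image encoder, still with a very small lr."""
    for p in image_encoder.parameters():
        p.requires_grad = finetune_image

    params = list(text_encoder.parameters())
    if finetune_image:
        params += list(image_encoder.parameters())

    optimizer = torch.optim.AdamW(params, lr=lr)
    # cosine annealing keeps the learning rate small throughout training
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    return optimizer, scheduler
```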

3.2. Cluster Analysis

In order to cluster websites engaged in illegal activities such as gambling, pornography, or fraud, this paper proposes a multimodal deep clustering model (MDC) based on a fully connected neural network. The multimodal deep clustering model exhibits stronger data fitting capabilities compared to traditional clustering algorithms such as the k-means model. It enhances the similarity between similar samples and increases the distance between dissimilar samples, leading to a significant improvement in clustering performance. Please refer to Section 4.2 for specific experimental results. In the design of the model’s objective function, we extended the objective function proposed in reference [27] and incorporated some supervised signals to enhance the model’s effectiveness. The details of the objective function design are described in Section 3.2.2.
The structure of the clustering model is depicted in Figure 4. The process involves combining the image representation vector and the text representation vector obtained from the image encoder and text encoder, respectively, to generate a representation vector for each sample. Then, by using the representation vectors, approximately k samples for each sample are identified. Finally, the clustering model is trained by ensuring the consistency of the outputs of similar samples inputted into the clustering model.
In Figure 4, “Anchor” represents the input vector of the clustering model, which is formed by concatenating the webpage screenshot feature vector and the OCR text feature vector of the input sample. “Neighbor” denotes the representation vector obtained by concatenating the webpage screenshot feature vector and the OCR text feature vector of a randomly selected sample from the 50 most similar samples. The subsequent clustering model consists of a two-layer fully connected neural network, with the number of neurons in the hidden layer being half of the number of neurons in the input layer.
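A minimal sketch of such a clustering head, following the description above (input formed by concatenating the two 512-dimensional feature vectors, hidden layer half the width of the input); the number of clusters is a hyperparameter, not a value fixed by the paper:

```python
import torch
import torch.nn as nn

class ClusterHead(nn.Module):
    """Fully connected clustering head over concatenated image+text features."""
    def __init__(self, in_dim=1024, num_clusters=10):   # num_clusters is illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim // 2),   # hidden layer: half of the input width
            nn.ReLU(),
            nn.Linear(in_dim // 2, num_clusters),
        )

    def forward(self, img_vec, text_vec):
        x = torch.cat([img_vec, text_vec], dim=-1)    # "Anchor"/"Neighbor" input vector
        return torch.softmax(self.net(x), dim=-1)     # soft cluster assignment
```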

3.2.1. Sample Similarity Calculation

Since sample similarity must be computed efficiently over a large number of samples, it is computed with normalized dot-product similarity. The calculation formula is as follows:

$$\mathrm{Similarity} = \frac{x_1 \cdot x_2}{|x_1| \, |x_2|}$$

where $x_1$ and $x_2$ are the feature vectors of two samples, and $|x_1|$ and $|x_2|$ are the norms of the two feature vectors. For each sample, the dot-product similarity with all other samples in the dataset is computed, and the $k$ samples with the highest similarity are kept; in this paper, $k$ is 50. Finally, the similarity relationships between samples in the whole dataset are stored, and these stored relationships are the input of the subsequent clustering model.
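A minimal sketch of this neighbor-mining step (k = 50), computing normalized dot products block by block to bound memory; at the full 3.7-million-sample scale an approximate nearest-neighbor index would be preferable:

```python
import torch
import torch.nn.functional as F

def topk_neighbors(features, k=50, block=1024):
    """For each sample, return the indices of the k most similar samples
    under normalized dot-product (cosine) similarity, excluding the sample itself."""
    feats = F.normalize(features, dim=-1)
    all_idx = []
    for start in range(0, feats.size(0), block):
        sims = feats[start:start + block] @ feats.t()          # (block, N) similarities
        rows = torch.arange(sims.size(0))
        sims[rows, torch.arange(start, start + sims.size(0))] = -1.0   # drop self-matches
        all_idx.append(sims.topk(k, dim=-1).indices)
    return torch.cat(all_idx)                                   # (N, k) neighbor indices
```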

3.2.2. Clustering Loss

The clustering module is implemented by a three-layer MLP, and its parameters are optimized by making similar samples consistent in their outputs. Let $f_{cluster}$ denote the clustering model, $X$ a training sample, and $X_k$ a sample similar to $X$; we then need to maximize $\log \left\langle f_{cluster}(X), f_{cluster}(X_k) \right\rangle$, where $\left\langle \cdot, \cdot \right\rangle$ denotes the dot product. The dot product is maximal when the two predictions become identical one-hot vectors. However, an MLP has a strong mapping ability and can map all samples to a single class, leading to a degeneration problem. Therefore, to prevent the MLP from assigning all samples to the same class, we maximize the entropy of the average cluster assignment so that samples are not all assigned to one cluster. In addition, during clustering, a cross-entropy loss is added by generating pseudo-labels for high-confidence samples. The final optimization function of the clustering module is therefore:
$$L = -\frac{1}{|D|}\sum_{X \in D} \sum_{X_k \in \mathcal{N}_X} \log \left\langle f_{cluster}(X), f_{cluster}(X_k) \right\rangle + \lambda \sum_{c \in \mathcal{C}} f_{cluster}^{\prime c} \log f_{cluster}^{\prime c} - \gamma \sum_{i=1}^{I} \sum_{c=1}^{C} y_{ic} \log f_{cluster}^{c}(X_i), \quad \text{with} \quad f_{cluster}^{\prime c} = \frac{1}{|D|}\sum_{X \in D} f_{cluster}^{c}(X)$$
Here, $\mathcal{C} = \{1, \ldots, C\}$ denotes the set of clusters, $f_{cluster}^{c}(X_i)$ denotes the probability that $X_i$ belongs to cluster $c$, and the weight of the entropy term is controlled by $\lambda$. The index $i = 1, \ldots, I$ runs over the $I$ samples whose maximum output probability from $f_{cluster}$ exceeds the set threshold, $y_i$ denotes the one-hot pseudo-label of the $i$-th sample, $f_{cluster}(X_i)$ denotes the probability distribution of the $i$-th sample output by the clustering model, and the weight of the cross-entropy loss is controlled by $\gamma$.
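A sketch of this objective in PyTorch, operating on the soft assignments of a batch of anchor–neighbor pairs; the values of λ, γ, and the pseudo-label confidence threshold are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def mdc_loss(p_anchor, p_neighbor, lam=2.0, gamma=1.0, threshold=0.99):
    """p_anchor, p_neighbor: (B, C) soft cluster assignments of each sample and
    one of its k nearest neighbors. Hyperparameter values are illustrative only."""
    eps = 1e-8
    # 1) consistency term: maximize log <f(X), f(X_k)> over neighbor pairs
    consistency = -torch.log((p_anchor * p_neighbor).sum(dim=-1) + eps).mean()

    # 2) entropy term: penalize collapsing all samples into a single cluster
    mean_p = p_anchor.mean(dim=0)
    entropy_term = (mean_p * torch.log(mean_p + eps)).sum()

    # 3) pseudo-label cross entropy on high-confidence samples
    conf, pseudo = p_anchor.max(dim=-1)
    mask = conf > threshold
    ce = torch.tensor(0.0, device=p_anchor.device)
    if mask.any():
        # nll_loss on log-probabilities equals cross entropy with the pseudo-labels
        ce = F.nll_loss(torch.log(p_anchor[mask] + eps), pseudo[mask])

    return consistency + lam * entropy_term + gamma * ce
```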

3.3. Discovery of Illegal Websites

In the process of discovering illegal websites, the pre-trained fully connected neural network (FCNN) clustering model, introduced in Section 3.2, is utilized. This clustering model employs a soft label approach, where the model output for each data sample includes probability values indicating its likelihood of belonging to a particular category. By setting a threshold, datasets predominantly composed of illegal websites can be filtered out. Subsequently, through manual screening, a substantial number of illegal websites can be efficiently identified. The entire process is illustrated in Figure 5.
First and foremost, a significant amount of website data is required, encompassing both legitimate websites and potential illegal websites. These data can be acquired through the network mapping engine. Once obtained, the data need to undergo preprocessing before being fed into the clustering model. The processing steps mainly include extracting text from webpage screenshots and obtaining representation vectors for webpage screenshots and text using a pre-trained CLIP model. Webpage screenshots are obtained by simulating browser access, and the screenshot is taken after the webpage has finished loading. The resolution of the webpage screenshot is 1000 × 600. Text on the webpage screenshots is recognized using PaddleOCR [32]. PaddleOCR has a high accuracy in character recognition, and its model size is small, resulting in fast character recognition speed.
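For illustration, the OCR step could look like the following sketch using the PaddleOCR Python package; the constructor arguments and the structure of the returned result vary across PaddleOCR versions, so treat this as an assumed interface rather than the exact code used in the paper:

```python
from paddleocr import PaddleOCR

# Chinese + English OCR over a 1000x600 webpage screenshot (file name is a placeholder)
ocr = PaddleOCR(lang="ch")
result = ocr.ocr("screenshot.png")

# result is assumed to be a nested list: one entry per image, each a list of
# (bounding_box, (text, confidence)) detections; concatenate the recognized text
lines = [item[1][0] for page in result for item in page]
page_text = " ".join(lines)   # fed to the text encoder as the sample's OCR text
```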
Subsequently, the obtained representation vector data are fed into the pre-trained fully connected neural network clustering model, with the model parameters kept fixed. To account for the uncertainty of data points during the clustering process, a soft label strategy is employed. This strategy assigns each data point a probability value indicating its likelihood of belonging to a specific category. These probability values reflect the degree of ambiguity of data points between different categories and can be utilized to assess the accuracy and stability of the clustering. In practical terms, a threshold can be set for each data point based on these probability values. If the probability of a data point representing an illegal website surpasses this threshold, it is considered a potential illegal website.
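A minimal sketch of this soft-label thresholding step, assuming the cluster corresponding to illicit websites has already been identified and using the 0.999 confidence level reported in Section 4 as the threshold:

```python
import torch

def filter_candidates(probs, illegal_cluster, threshold=0.999):
    """probs: (N, C) soft cluster assignments from the clustering model.
    Returns the indices of samples assigned to the illegal cluster with
    probability above the threshold; these then go to manual screening."""
    conf, pred = probs.max(dim=-1)
    mask = (pred == illegal_cluster) & (conf >= threshold)
    return mask.nonzero(as_tuple=True)[0]
```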
After the threshold screening, a dataset containing numerous potential illegal websites is obtained. However, to ensure the dataset’s accuracy and establish a reliable foundation for further research, manual screening remains necessary to validate the categories of data points. During this process, a thorough review of the content, structure, domain names, and other pertinent information of potential illegal websites is conducted to make a more precise determination of their legality.

4. Experiments and Analysis

In this section, we conduct experiments to evaluate the proposed method for cluster discovery of illegal websites. Our experimental environment is a workstation equipped with Intel(R) Xeon(R) Gold 6230R CPU, 256 GB memory, and Nvidia Geforce RTX4090 with 24 GB video memory.
Datasets. The dataset used in this study was obtained from the asset data available on the cyberspace mapping platform. Over 3.7 million data entries were downloaded, each containing 14 attributes. A detailed description of these attributes can be found in Section 4.3. To address efficiency concerns and considering that many illegal websites primarily utilize images for webpage construction, only the ‘screen’ attribute from the website asset data was used in this research. Text information was extracted from webpage screenshots using OCR technology. The dataset comprises a random selection of data downloaded from the surveying and mapping platform, encompassing both normal and illegal websites.
Evaluation metrics. In this paper, we use ACC (Accuracy), NMI (Normalized Mutual Information) and ARI (Adjusted Rand index) to evaluate the performance of the clustering algorithm. The specific formula is as follows:
$$ACC = \frac{TP + TN}{TP + FP + TN + FN}$$

where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives.

$$NMI = \frac{2 \times I(Y, C)}{H(Y) + H(C)}$$

where $Y$ represents the clustering result, $C$ represents the real cluster labels, $I(Y, C)$ is the mutual information between $Y$ and $C$, and $H(Y)$ and $H(C)$ are the entropies of $Y$ and $C$, respectively. The higher the value of $NMI$, the closer the clustering result is to the real labels; its value range is $[0, 1]$.

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
where $RI$ is the Rand index, $E[RI]$ is the expected Rand index, and $\max(RI)$ is the maximum possible Rand index. The value range of $ARI$ is $[-1, 1]$, and the larger the value, the more consistent the clustering result is with the real result.
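For reference, these metrics can be computed with scikit-learn together with a Hungarian assignment for the cluster-to-label matching; the matching-based ACC below is the standard clustering accuracy and reduces to the formula above in the two-class case:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_metrics(y_true, y_pred):
    """ACC via optimal cluster-to-label matching, plus NMI and ARI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                                  # co-occurrence of cluster p and label t
    row, col = linear_sum_assignment(count.max() - count) # best cluster-to-label permutation
    acc = count[row, col].sum() / len(y_true)
    return {
        "ACC": acc,
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```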

4.1. Data Representation Evaluation

Our data representation model is implemented by replacing the text encoder module of the original CLIP model with BERT, while preserving the structure and parameters of the image encoder in the CLIP model. During the training of the representation model, we employed two methods. The first method involved keeping the parameters of the image encoder model unchanged and only adjusting the parameters of the BERT model to align with the image encoder parameters. The second method involved tuning the parameters of both the image encoder model and the BERT model simultaneously. The training loss curves for both methods are depicted in Figure 6.
In Figure 6, the label “NMV” represents the scenario where no fine-tuning of the image encoder parameters was performed, while “MV” indicates the scenario where the image encoder parameters were fine-tuned. From the figure, it can be observed that both methods were able to converge the model parameters, and the NMV method achieved a lower loss value within the same number of training epochs. In the subsequent experiments, clustering analyses were conducted on the representations generated by the trained representation models obtained from both training methods. It was further observed that the clustering performance was better when the image encoder model was not fine-tuned.
The original CLIP model is trained on a significantly larger dataset of over 400 million image–text pairs, which exceeds the scale of the experimental dataset used in this paper, which consists of 3.7 million samples. Fine-tuning the CLIP parameters on a small dataset may risk disrupting the already excellent parameters of the model. To mitigate this effect, this paper employs the cosine annealing algorithm to dynamically adjust the learning rate during the training of the representation model parameters. The specific learning rate adjustments can be observed in Figure 7.
During the training process, the learning rate is kept relatively small, typically not exceeding $5 \times 10^{-5}$. This cautious approach is taken because the BERT model parameters also undergo transfer learning, utilizing pre-trained parameters from large text datasets. To ensure stability and avoid rapid modification of the model parameters, a small learning rate is used.

4.2. Website Clustering Evaluation

We trained a deep clustering model based on a multi-layer perceptron using the obtained data representation. The training dataset for the clustering model consisted of 200,000 data points extracted from the total dataset of 3.7 million. To compare with the traditional clustering method, we also conducted a clustering experiment using the k-means algorithm on the same data representation. The experimental results are summarized in Table 1.
The * in Table 1 indicates the absence of that experimental result, as the k-means results do not have a notion of confidence. MDC in Table 1 is the abbreviation of the multi-modal deep clustering model used in this paper. NMV means “not modify vision-encoder”: rows marked NMV use the representation generated by the model without fine-tuning the parameters of the CLIP image encoder. MV means “modify vision-encoder”: rows marked MV use the representation generated by the model with fine-tuned CLIP image encoder parameters. Rows marked img use only the representation of the image data in the clustering process, and rows marked text use only the representation of the text data. CEloss means that a cross-entropy loss is added to the clustering loss function, providing guidance similar to supervised training for the clustering model parameters through high-confidence samples. The column Confident-0.999 reports the accuracy of the clustering model for samples assigned to a category with a confidence higher than 0.999; this column shows that, for high-confidence samples, the clustering accuracy is higher.
It can be seen from Table 1 that, in the group of ‘k-means NMV’ experiments, the results using only the img features are the best, and combining text features with image features is slightly worse, which also shows the strength of the original image encoder parameters. In the ‘MDC NMV’ group, the combination of the image data representation and the text data representation performs best, and the effect is significantly better than using only the image representation or only the text representation. Compared with the direct use of the representation data by the k-means algorithm, the multi-layer perceptron model used by MDC is better able to process these complex data. In the ‘k-means MV’ group, the combination of image representation and text representation is the best. In the ‘MDC MV’ experiments, the image encoder parameters are fine-tuned on our data, which makes the performance of the model lower than that of ‘MDC NMV’. Interestingly, within this group, clustering using only the text representation performs best; this is likely because adjusting the image encoder parameters means the text encoder parameters change relatively little, retaining the better text encoder parameters. These clustering experiments suggest that model parameters trained on a large dataset do not require fine-tuning for many downstream tasks. Our representation model is trained on 3.7 million image–text pairs, which is still small compared with the training data of the original CLIP image encoder and the BERT text encoder; more data would therefore be needed before fine-tuning these large pre-trained models improves their performance.
The experimental results unequivocally demonstrate that the utilization of deep clustering methods yields a substantial improvement in accuracy compared to traditional clustering algorithms, such as k-means and DBSCAN. In addition to their performance boost, deep clustering algorithms offer enhanced flexibility, enabling adjustments to the model’s structure and objective function to optimize clustering outcomes. Nevertheless, this increased flexibility presents a challenge in finding optimal hyperparameters, necessitating a considerable investment of time and computational resources for the tuning process.
All of the experimental groups were compared together, and the results are shown in Figure 8.
The top line in Figure 8 is ‘MDC NMV CEloss’, indicating that it is the best model, with an accuracy of 84.1%. If only samples with a confidence level exceeding 0.999 are selected, its accuracy reaches 92.39% in a completely unsupervised manner. Therefore, we use the ‘MDC NMV CEloss’ model for clustering in the subsequent large-scale illegal website discovery.

4.3. Clustering Discovery of Illegal Websites

In the process of clustering the 3.7 million data, the previously trained multi-layer perceptron clustering model is used. To handle the large amount of data, 10,000 data points are read at a time for clustering, and the clustering results are aggregated. Afterward, a manual review is conducted to remove a small number of normal websites, resulting in a dataset consisting entirely of illegal websites. The selected illegal website dataset includes the attributes listed in Table 2.
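A sketch of this batched pass, assuming the representation vectors have been saved to disk as paired NumPy shards and that the trained clustering head from Section 3.2 is available as `cluster_head`; the file layout and names are placeholders:

```python
import numpy as np
import torch

def cluster_in_batches(shards, cluster_head, batch_size=10000):
    """shards: list of (image_feature_file, text_feature_file) pairs.
    Runs the frozen clustering model 10,000 samples at a time and
    aggregates the soft cluster assignments for later thresholding."""
    cluster_head.eval()
    all_probs = []
    with torch.no_grad():
        for img_path, txt_path in shards:
            img = torch.from_numpy(np.load(img_path)).float()
            txt = torch.from_numpy(np.load(txt_path)).float()
            for s in range(0, img.size(0), batch_size):
                probs = cluster_head(img[s:s + batch_size], txt[s:s + batch_size])
                all_probs.append(probs.cpu())
    return torch.cat(all_probs)      # (N_total, C) soft cluster assignments
```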
After applying the clustering model, we identified 467,382 data points as classified illegal websites. From these results, we further screened out 431,820 potential illegal website data points based on threshold criteria. Subsequently, through manual screening, we selected a total of 397,275 illegal website data points. Each attribute of these illegal website data points is stored in a JSON file. The “screen” attribute and “html” attribute are stored separately as independent files, residing in two distinct directories. In the JSON file, only the relative paths to these files are stored.

5. Conclusions

In this paper, we propose an unsupervised discovery method for illegal websites combined with big data from a network surveying and mapping platform. The method uses image–text contrastive learning to extract visual features of web page screenshots and semantic features of the text on those images. Moreover, through observation and analysis of a large number of illegal website samples, we found that many illegal websites build their pages solely by arranging images, which makes it impossible to obtain semantic text features through HTML analysis. Illegal websites employ various camouflage techniques to evade detection and regulation. Thus, for those tasked with governing such websites, we recommend exploring and identifying more of the camouflage methods used by these illicit entities. Addressing different camouflage techniques with specific countermeasures can improve the accuracy of curbing illegal websites, and employing a series of sequential detection methods can also improve the recall rate in identifying illegal websites.
In this paper, we primarily focused on the camouflage technique of constructing websites solely with images. We perform feature similarity clustering on the acquired features and learn by encouraging similar samples to obtain the same category. In the end, the experimental results prove the effectiveness of discovering illegal websites on large-scale data through unsupervised methods, and the various evaluation indicators achieved the expected goals. Moreover, through the proposed method, a large number of illegal websites were found in the real asset data on the network mapping platform, allowing us to construct a large-scale illegal website dataset. The illicit website dataset comprises nearly 400,000 records, and we have made it open-source at https://www.kaggle.com/listone/black-website, accessed on 1 August 2023. Each record in the dataset contains the attributes listed in Table 2. The data are stored offline, ensuring that, even if the websites are no longer active in the future, their features can still be analyzed.
Although the proposed method realizes large-scale illegal website discovery, the surveying and mapping platform provides many attributes for each website asset, while this paper only uses the webpage screenshot attribute. In the future, the performance of the model could be improved by adding features from more asset attributes. In this paper, we fuse features by concatenating webpage screenshot features and webpage text features, which is effective when there are few feature attributes; when more feature attributes are used in the future, the feature dimension may become too high, so it is necessary to explore how to better integrate multimodal data features. When constructing the dataset of illegal websites, we improved the model’s accuracy by setting confidence thresholds. However, because the approach is purely unsupervised, the highest achieved accuracy was only 92.39%, necessitating manual screening to ensure the dataset’s high quality. In the future, leveraging the extensive dataset of illegal websites provided in this paper, it would be possible to build highly accurate and robust identification models to achieve fully automated detection of illegal websites.

Author Contributions

Conceptualization, B.W. and F.S.; methodology, B.W. and F.S.; software, B.W.; validation, B.W. and F.S.; formal analysis, F.S. and H.Z.; investigation B.W. and F.S.; resources, F.S.; data curation, B.W. and H.Z.; writing—original draft preparation B.W.; writing—review and editing F.S.; visualization, B.W.; supervision, F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under Grant 2021YFB3100500.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available at https://www.kaggle.com/listone/black-website, accessed on 1 August 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDC    Multi-modal Deep Clustering
NMV    Not Modify Vision-Encoder
MV     Modify Vision-Encoder

Appendix A

The four images in Figure A1 are all website screenshots. To the right of each image is the HTML source code of the respective webpage. The code displayed in the four images of Figure A1 demonstrates that the HTML of these webpages does not contain distinctive semantic text; the webpage is entirely composed of images. The Chinese text on the images in Figure A1 is explained in the main text.
Figure A1. (a–d) Website page camouflage.

References

  1. ITU. Measuring Digital Development: Facts and Figures 2022. 2022. Available online: https://www.itu.int/hub/publication/d-ind-ict_mdd-2022/ (accessed on 1 August 2023).
  2. Senker, C. Cybercrime and the DarkNet: Revealing the Hidden Underworld of the Internet; Arcturus Publishing: London, UK, 2016. [Google Scholar]
  3. Shodan. 2023. Available online: https://www.shodan.io/ (accessed on 1 August 2023).
  4. Censys. 2023. Available online: https://search.censys.io/ (accessed on 1 August 2023).
  5. Binaryedge. 2023. Available online: https://www.binaryedge.io/ (accessed on 1 August 2023).
  6. Zoomeye. 2023. Available online: https://www.zoomeye.org/ (accessed on 1 August 2023).
  7. Fofa. 2023. Available online: https://fofa.info/ (accessed on 1 August 2023).
  8. Quake. 2023. Available online: https://quake.360.net/quake/ (accessed on 1 August 2023).
  9. RaySpace. 2023. Available online: https://www.webray.com.cn/channel/RaySpace.html (accessed on 1 August 2023).
  10. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  11. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar]
  12. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar]
  13. Zhou, Z.H. Machine Learning; Springer Nature: Berlin, Germany, 2021. [Google Scholar]
  14. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357. [Google Scholar] [CrossRef]
  15. Kalabarige, L.R.; Rao, R.S.; Abraham, A.; Gabralla, L.A. Multilayer stacked ensemble learning model to detect phishing websites. IEEE Access 2022, 10, 79543–79552. [Google Scholar] [CrossRef]
  16. Tang, L.; Mahmoud, Q.H. A deep learning-based framework for phishing website detection. IEEE Access 2021, 10, 1509–1521. [Google Scholar] [CrossRef]
  17. Liu, D.J.; Geng, G.G.; Zhang, X.C. Multi-scale semantic deep fusion models for phishing website detection. Expert Syst. Appl. 2022, 209, 118305. [Google Scholar] [CrossRef]
  18. Zhao, J.; Shao, M.; Peng, H.; Wang, H.; Li, B.; Liu, X. Porn2Vec: A robust framework for detecting pornographic websites based on contrastive learning. Knowl.-Based Syst. 2021, 228, 107296. [Google Scholar] [CrossRef]
  19. Wang, C.; Zhang, M.; Shi, F.; Xue, P.; Li, Y. A Hybrid Multimodal Data Fusion-Based Method for Identifying Gambling Websites. Electronics 2022, 11, 2489. [Google Scholar] [CrossRef]
  20. Fan, Y.; Yang, T.; Wang, Y.; Jiang, G. Illegal Website Identification Method Based on URL Feature Detection. Comput. Eng. 2018, 44, 171–177. [Google Scholar]
  21. Jie, X.; Haoliang, L.; Ao, J. A new model for simultaneous detection of phishing and darknet websites. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 2002–2006. [Google Scholar]
  22. Li, G.; Yin, T.; Zhang, X. A detection method of gambling websites based on pam. Comput. Appl. Softw. 2021, 38, 167–172. [Google Scholar]
  23. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep embedding network for clustering. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1532–1537. [Google Scholar]
  24. Tian, F.; Gao, B.; Cui, Q.; Chen, E.; Liu, T.Y. Learning deep representations for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Quebec City, QC, Canada, 27–31 July 2014; Volume 28. [Google Scholar]
  25. Tao, Y.; Takagi, K.; Nakata, K. Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv 2021, arXiv:2106.00131. [Google Scholar]
  26. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 132–149. [Google Scholar]
  27. Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Scan: Learning to classify images without labels. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X. Springer: Berlin/Heidelberg, Germany, 2020; pp. 268–285. [Google Scholar]
  28. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 478–487. [Google Scholar]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  31. Zhang, Z.; Zhang, H.; Chen, K.; Guo, Y.; Hua, J.; Wang, Y.; Zhou, M. Mengzi: Towards lightweight yet ingenious pre-trained models for Chinese. arXiv 2021, arXiv:2110.06696. [Google Scholar]
  32. PaddlePaddle. PaddleOCR. 2023. Available online: https://github.com/PaddlePaddle/PaddleOCR (accessed on 1 August 2023).
Figure 1. Illegal website clustering framework.
Figure 2. ViT-B/32 model structure.
Figure 3. Text encoder model structure.
Figure 4. Clustering model structure.
Figure 5. Discovery process of illegal websites.
Figure 6. Representation model training loss.
Figure 7. The lr value of the training representation model.
Figure 8. Comparison of experimental results.
Table 1. Clustering results.

| Method | ACC | ARI | NMI | Confident-0.999 |
|---|---|---|---|---|
| k-means NMV | 66.72 | 10.67 | 6.39 | * |
| k-means NMV img | 67.87 | 12.25 | 7.56 | * |
| k-means NMV text | 66.45 | 10.25 | 6.08 | * |
| MDC NMV | 83.17 | 43.98 | 38.26 | 90.68 |
| MDC NMV CEloss | 84.1 | 46.49 | 40.79 | 92.39 |
| MDC NMV img | 79.61 | 35.02 | 29.06 | 85.21 |
| MDC NMV img CEloss | 80.1 | 36.2 | 30.31 | 87.93 |
| MDC NMV text | 79.22 | 34.12 | 29.07 | 89.13 |
| MDC NMV text CEloss | 81.8 | 40.41 | 34.84 | 90.47 |
| k-means MV | 70.28 | 16.04 | 10.42 | * |
| k-means MV img | 69.57 | 15.02 | 9.68 | * |
| k-means MV text | 68.2 | 12.78 | 7.95 | * |
| MDC MV | 82.23 | 41.53 | 36.81 | 90.18 |
| MDC MV CEloss | 82.79 | 42.96 | 36.92 | 90.95 |
| MDC MV img | 74.78 | 24.51 | 21.49 | 81.35 |
| MDC MV img CEloss | 75.22 | 25.37 | 23.11 | 82.06 |
| MDC MV text | 82.62 | 42.53 | 37.89 | 90.73 |
| MDC MV text CEloss | 83.39 | 44.56 | 38.66 | 92.02 |
Table 2. Attributes description.

| Attributes | Description | Data Type |
|---|---|---|
| ip | IP address | character string |
| port | port number | continuous data |
| server | web container | discrete data |
| domain | domain name | text (domain name) |
| title | site title | text |
| org | organization | discrete data |
| country | country | discrete data |
| city | city | discrete data |
| html | HTML original code | text |
| screen | website screenshot | image |
| header | web response header information | text |
| subject.CN | common name information for SSL certificates | text (domain name) |
| subject.N | SSL certificate subject optional name | text (list of domain names) |
| links | site external links | text (list of domain names) |
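As an illustration of how the released records map onto the attributes in Table 2, the snippet below loads them with pandas. It assumes the Kaggle export is a flat table (e.g., a CSV) with one row per website; the actual file name and layout on Kaggle may differ, so treat this purely as a sketch.

```python
import pandas as pd

# Hypothetical file name; check the Kaggle dataset page for the actual layout.
records = pd.read_csv("black-website/illegal_websites.csv")

# The 14 attributes described in Table 2.
attributes = ["ip", "port", "server", "domain", "title", "org", "country",
              "city", "html", "screen", "header", "subject.CN", "subject.N", "links"]
print(records[attributes].dtypes)

# Example: distribution of hosting countries among the candidate sites.
print(records["country"].value_counts().head(10))
```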