Article

Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9837; https://doi.org/10.3390/app13179837
Submission received: 29 May 2023 / Revised: 1 August 2023 / Accepted: 28 August 2023 / Published: 30 August 2023

Abstract

With the development of internet technology, the number of illicit websites, such as gambling and pornography sites, has increased dramatically, posing serious threats to people's physical and mental health as well as their financial security. Currently, the governance of such websites relies mainly on limited-scale detection through manual annotation. Effective governance, however, urgently requires the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in the batch detection of illicit websites. In this paper, we therefore propose a method that uses web mapping engine big data to perform unsupervised multimodal clustering (MDC) for illicit website discovery. We extract features from webpage screenshots and OCR text using contrastive learning methods, and then cluster websites by feature similarity to identify illicit ones. Our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels, and an accuracy of 92.39% at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping records, we obtained 397,275 illicit websites, primarily gambling and pornography sites, each described by 14 attributes. This dataset is made publicly available.
Keywords: unsupervised learning; clustering; multimodal; network mapping
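The confidence-thresholded clustering step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' MDC model: the fusion rule (normalize and concatenate), the softmax over cosine similarities, the temperature value, and all function names are assumptions chosen for illustration. Only the idea of combining screenshot and OCR-text features and keeping assignments at a confidence of 0.999 or higher comes from the text.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_vec, text_vec):
    """Assumed fusion rule: normalize each modality, concatenate, renormalize."""
    return l2_normalize(l2_normalize(image_vec) + l2_normalize(text_vec))


def cosine(a, b):
    """Cosine similarity of two unit vectors (a plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores, temperature=0.1):
    """Turn similarity scores into a probability distribution over clusters."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def assign(features, centroids, threshold=0.999, temperature=0.1):
    """Return (cluster index, confidence, is_confident) for each fused feature."""
    results = []
    for feat in features:
        sims = [cosine(feat, c) for c in centroids]
        probs = softmax(sims, temperature)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((best, probs[best], probs[best] >= threshold))
    return results


# Hypothetical toy embeddings: two cluster centroids and two websites.
centroids = [fuse([1.0, 0.0], [1.0, 0.0]), fuse([0.0, 1.0], [0.0, 1.0])]
websites = [fuse([1.0, 0.0], [1.0, 0.0]), fuse([1.0, 1.0], [1.0, 1.0])]
assignments = assign(websites, centroids)
```

In this toy setup, the first website matches one centroid exactly and passes the 0.999 threshold, while the second lies midway between the two centroids and is left out of the high-confidence set. Raising the threshold trades coverage for precision, which is consistent with the higher accuracy reported at the 0.999 confidence level.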

Share and Cite

MDPI and ACS Style

Wang, B.; Shi, F.; Zheng, H. Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data. Appl. Sci. 2023, 13, 9837. https://doi.org/10.3390/app13179837
