Article

Multi-Modal Clustering Discovery Method for Illegal Websites Based on Network Surveying and Mapping Big Data

College of Electronic Engineering, National University of Defense Technology, Hefei 230037, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(17), 9837; https://doi.org/10.3390/app13179837
Submission received: 29 May 2023 / Revised: 1 August 2023 / Accepted: 28 August 2023 / Published: 30 August 2023

Abstract

With the development of internet technology, the number of illicit websites such as gambling and pornography has dramatically increased, posing serious threats to people’s physical and mental health, as well as their financial security. Currently, the governance of such illicit websites mainly focuses on limited-scale detection through manual annotation. However, the need for effective solutions to govern illicit websites is urgent, requiring the ability to rapidly acquire large volumes of existing website data from the internet. Web mapping engines can provide massive, near real-time web data, which plays a crucial role in batch detection of illicit websites. Therefore, in this paper, we propose a method that combines web mapping engine big data to perform unsupervised multi-modal deep clustering (MDC) for illicit website discovery. By extracting features based on contrastive learning methods from webpage screenshots and OCR text, we conduct feature similarity clustering to identify illicit websites. Finally, our unsupervised clustering model achieved an overall accuracy of 84.1% across all confidence levels, and a 92.39% accuracy at a confidence level of 0.999 or higher. By applying the MDC model to 3.7 million real web mapping data entries, we obtained 397,275 illicit websites primarily focused on gambling and pornography, each with 14 attributes. This dataset is made publicly available.

1. Introduction

Amidst the rapid expansion of the Internet, concerns regarding network security have become increasingly prominent. Notably, the proliferation of illicit websites, including those involved in gambling, pornography, and fraudulent activities, has witnessed a persistent surge, posing a grave threat to individuals’ financial assets and information security. According to “Facts and Figures” released by the ITU [1], as of December 2022, about 52.3% of users had been harassed by illegal websites. Illegal websites not only inflict financial losses upon their victims but also compromise personal privacy and sensitive information, with potential repercussions for national security as well. The regulation of illegal websites is confronted with numerous obstacles and challenges, and the prompt and accurate discovery and mitigation of such illicit platforms from a technical standpoint have emerged as pressing issues demanding immediate solutions. Hence, this research paper endeavors to explore an extensive and expedited approach for the detection of illegal websites. Additionally, it examines the use of technical means to enhance the efficacy and precision of governing such illicit platforms, thereby ensuring the wholesome development of the Internet ecosystem.
Typically, existing research on combating illegal websites primarily concentrates on identifying platforms associated with gambling and pornography. These studies often rely on inputting specific website information such as text or images into detection models, which then render judgments on the legality of the website. This approach is crucial for detecting illicit websites. However, when dealing with large-scale and constantly evolving illicit websites, this method struggles to adapt to the disruptive changes posed by these websites and to effectively control the entire illicit ecosystem. On the other hand, utilizing clustering methods allows for identification from a global sample perspective, enabling adaptive classification detection based on the similarities between samples. To achieve rapid monitoring and comprehensive control of the entire illegal landscape, innovative governance solutions need to be adopted.
Cyberspace surveying and mapping entails the detection of diverse assets within the cyberspace, acquiring various attributes associated with these assets, and subsequently conducting fusion processing and correlation analysis on the collected network asset data to construct a comprehensive map of cyberspace resources. The cyberspace surveying and mapping engine continuously scans the network assets across the entire network, capturing and retaining information pertaining to all network websites. This valuable resource provides crucial data support for the governance of illegal websites.
Therefore, in this research paper, we utilize the website asset data from the network surveying and mapping platform to perform image–text contrastive learning (Section 3.1), generating representations of website screenshots and the corresponding text. These representations serve a dual purpose: they facilitate subsequent website clustering for the discovery of illegal websites and expand the search capabilities of the network surveying and mapping platform. Additionally, we employ the semantic features obtained through image–text contrastive learning to conduct cluster analysis and systematically identify illegal websites. Through the analysis of 3.7 million website asset data entries within the network asset mapping platform, we successfully identified 397,275 instances of illegal websites.
In summary, our study offers the following contributions:
  • We present a novel approach that combines the extensive data support provided by the network mapping engine. Building upon the principles of the CLIP model, we modify the text model and employ transfer learning techniques to adapt it to the Chinese language environment. Additionally, we leverage unsupervised contrastive learning to extract semantic features from website assets. These semantic features can be directly utilized in the search function of the network mapping engine and enable semantic clustering of website assets.
  • Building upon deep clustering technology, we introduce a multi-modal deep clustering (MDC) method that synergizes with a network mapping engine to expedite the identification of illegal websites. This approach significantly reduces the costs associated with discovering illicit platforms and enables the rapid detection of large-scale illegal websites. Ultimately, our method offers valuable technical support for the governance of illegal websites.
  • In the experiment, we used our proposed method to analyze a large number of website assets in the network surveying and mapping platform and successfully discovered a large number of illegal websites, which proved the effectiveness and feasibility of our method in the governance of illegal websites. At the same time, for the first time, we share a database of illegal websites discovered by our method, which includes 397,275 pieces of illegal website data, each containing 14 attributes. This dataset is made publicly available at https://www.kaggle.com/listone/black-website (accessed on 1 August 2023). We hope that other researchers can also use these data to conduct research on illegal website governance.
The HTML text of web pages and screenshots of websites are important attributes of a website. The main objective of our research is to analyze website attributes using clustering techniques, combined with large-scale data from the network mapping engine, to discover illicit websites.
In conclusion, this research paper presents an innovative utilization of a network mapping engine within the realm of governing illegal websites. Furthermore, it introduces a multi-modal deep clustering approach for identifying illegal websites, leveraging an image–text contrastive learning method.

2. Background and Related Work

2.1. Background

Illegal websites refer to websites engaged in illegal activities, such as pornography, gambling, piracy, fraud, network attacks, etc. [2]. These websites often change domain names and servers, and also use some camouflage methods to avoid automated detection, to evade supervision and blocking. Regardless of their tactics for evasion and transformation, these illegal websites persist in their need to remain accessible to their victims. Therefore, these illegal websites inevitably become part of the comprehensive cyberspace map created by the network mapping engine.

2.1.1. Cyberspace Mapping Engine

At present, there are many cyberspace mapping systems in the world. For example, Shodan [3] in the United States mainly scans and identifies network infrastructure equipment such as servers and network cameras, and provides a convenient API that supports calls from multiple programming languages. The Censys [4] surveying and mapping system, developed by researchers at the University of Michigan, uses the self-developed ZMap scanning tool to collect detailed information on IPs, certificates, and websites, which can help users collect and sort out an organization’s attack exposure surface. BinaryEdge [5], a surveying and mapping product from Switzerland, scans the entire network, maps the attack exposure surface of nearly 5 billion devices and 15 million business groups, and is committed to providing real-time threat intelligence for enterprise organizations to reduce their risk of being attacked. China also has many network surveying and mapping engines, including ZoomEye [6] from Zhidaochuangyu, FOFA [7] from Huashunxinan, Quake [8], a network-wide cyberspace surveying and mapping system independently developed and designed by the 360 Network Security Response Center, and the RaySpace [9] platform from Surbana Security.
The network mapping engine encompasses the storage of IP network assets across the entire network. In previous applications, the primary focus of the network mapping engine was predominantly centered around network security attack and defense technologies. Through the detection of organizational assets and the simple correlation of vulnerability information, the attack surface of network target assets was sorted out. Nonetheless, the extensive repository of website assets within the network mapping engine holds potential beyond the realm of network attack and defense. It also harbors a significant number of hidden illegal websites. Proficient analysis of these assets can offer valuable technical resources for the governance of illicit industries.
The network assets stored within the network mapping engine exhibit complexity and diversity. However, supervised learning approaches necessitate labeled samples, which can be costly and time-consuming to obtain. Moreover, limited samples may not effectively represent the entirety of the network assets, further exacerbating the challenge. Under such circumstances, a supervised learning model trained solely on labeled samples may exhibit high accuracy within the training dataset but could be subject to bias. Consequently, when dealing with the analysis of extensive network surveying and mapping asset data, we opt to learn the semantic representation of the data through unsupervised methods. By employing unsupervised clustering techniques, we can effectively discover and identify illegal websites without relying on labeled samples.

2.1.2. Website Page Camouflage

Currently, numerous illegal websites employ various camouflage methods to evade detection by automated monitoring systems. These methods include, but are not limited to:
  • Variant text: These methods involve the use of Chinese and English replacements, numerical substitutions, special symbols, artistic fonts, and other techniques. Such strategies often render text classification models trained on standard text ineffective.
  • Anti-image: Although relatively uncommon, this method involves the addition of disordered symbols of varying colors, blurring effects, and other visual distortions to webpage images. Such techniques can significantly hinder the performance of certain image classification models.
  • Verification code: Certain web pages implement verification codes as a security measure, requiring users to enter the correct code to access the authentic content. This approach presents a challenge for web crawlers without the capability to recognize and process verification codes, thus impeding their ability to obtain the genuine web content.
  • Fake 404 pages: This deceptive method is relatively straightforward to implement. The page’s content remains that of the illegal website, while only the title section is modified to display ‘404 Not Found’. This technique can successfully deceive models that solely rely on title detection, as they may mistakenly classify the illegal website as a non-existent or inaccessible page.
  • Building webpages with images: This approach exclusively relies on images to compose webpage content, with only the <img> tag in the HTML referencing these images. Consequently, models solely reliant on HTML detection will be ineffective in identifying the webpage’s actual content.
Through observation of numerous illicit websites, we have found that constructing webpages consisting only of images to drive traffic is quite common. By examining the HTML source code of these webpages, we observed that they contain only <img> tags that reference images, while the valuable key text is embedded within the images themselves. As shown in Figure A1, the four gambling websites are all composed of images. The text “真人竞猜” (meaning ‘live real-person betting’) appears in Figure A1a, the text “开元棋牌” (meaning ‘Kaiyuan chess and cards’) in Figure A1b, the text “真人真钱真娱乐,棋牌体育” (meaning ‘real people, real money, real entertainment; board games and sports’) in Figure A1c, and the text “彩票投注” (meaning ‘lottery betting’) in Figure A1d. The key text on these images can effectively improve recognition accuracy.
The practice of constructing web pages solely with images undermines the effectiveness of methods solely focused on detecting web page HTML. Hence, in this paper, we opt to extract text from web screenshots as features instead of relying on HTML.

2.2. Related Work

2.2.1. Detection of Illegal Websites

Identification methods for illegal websites, such as gambling, pornography, phishing websites, etc., can be broadly categorized into two groups. The first category involves the utilization of supervised learning techniques, while the second category employs unsupervised methods.
(1)
Supervised learning for illegal website detection
Supervised methods for illegal website detection rely on labeled datasets to train models and learn the distinguishing features of illegal websites. This approach offers the advantage of high accuracy; however, it necessitates a substantial amount of labeled data. Here are a few commonly employed supervised approaches:
Method based on machine learning. The method based on machine learning is a commonly used detection method for illegal websites, which uses a variety of machine learning algorithms to train the model. The method first extracts various features from the website, such as the page title, keywords, description, HTML tags, etc. Then, it uses these features to train a classifier to separate illegal websites from normal websites. Commonly used machine learning algorithms include Naive Bayes [10], Support Vector Machines [11], decision trees [12], random forests [13], etc. The Naive Bayes algorithm is a classic machine learning algorithm that classifies based on Bayes’ theorem and the feature-independence assumption. A Support Vector Machine is a classifier that performs classification based on the margin-maximization principle. A decision tree is a tree-based classifier that processes data hierarchically to achieve higher accuracy. A random forest is an ensemble learning algorithm that performs classification by combining multiple decision trees. Sahingoz et al. [14] used seven different classification algorithms and features based on natural language processing (NLP) to detect phishing websites; using the URL features of phishing websites, they achieved the best performance with the random forest algorithm. Kalabarige et al. [15] proposed MLSELM, which combines MLP (Multilayer Perceptron), KNN (K-Nearest Neighbors), RF (random forest) and other algorithms into a three-layer stacked classifier, evaluated it on four phishing datasets, and the final results outperformed the baseline models.
Method based on deep learning. The deep learning-based method uses a multi-layer neural network for classification. This method first vectorizes the attribute information of the website and then uses a deep neural network to classify the data. Tang et al. [16] proposed a phishing website detection framework based on deep learning, which combines strategies such as whitelist filtering, blacklist interception, and model prediction to improve detection accuracy; their RNN-GRU model achieved the best performance on multiple datasets. Liu et al. [17] considered semantic information at different scales, passing text information such as the URL, title, and body text into a CNN model and an LSTM model, respectively, for learning; feature fusion was performed at the output layer, and the judgment result was finally output. In a real-world application, the model successfully detected 3016 phishing websites. Zhao et al. [18] proposed a robust end-to-end framework, Porn2Vec, that uses contrastive learning to detect pornographic websites. They model pornographic websites through heterogeneous graphs composed of websites, webpages, images, texts, and their interactions; the pornographic website detection task is transformed into a heterogeneous graph node classification task, and the robustness of the model is improved through the representation of multiple data types. Wang et al. [19] fused visual and textual semantic features through the self-attention mechanism and fused the prediction results of the image classifier, text classifier, and multi-modal classifier at a later stage, achieving excellent results on their self-collected gambling dataset.
In the realm of supervised methods, a remarkable characteristic is their impressive accuracy, often surpassing 95% in recognition rates. Certain pioneering approaches have even demonstrated exceptional achievements, exceeding 99% accuracy. An exemplary case is depicted in reference [19], where the highest recognition accuracy of 99.31% was accomplished using a self-labeled dataset comprising 800 gambling websites and 800 normal websites. However, it is crucial to acknowledge that models trained on small-scale datasets may encounter challenges in terms of robustness, thereby resulting in a disparity between experimental and real-world performance. Scaling up labeled samples can present formidable obstacles, demanding substantial human and financial resources.
(2)
Unsupervised learning for illegal website detection
Unsupervised methods for illegal website detection do not rely on labeled datasets for training. Instead, they classify websites by discovering patterns and structures within the data. The clustering-based method is a common unsupervised learning method, which classifies data into different clusters. This method first extracts various features from the website, such as page titles, keywords, descriptions, HTML tags, etc. Then, it uses the clustering algorithm to divide the data into multiple clusters; each cluster represents a type of website. The commonly used clustering algorithms include k-means, DBSCAN, hierarchical clustering, etc. K-means is a classical clustering algorithm, which divides data into K clusters, and the center of each cluster is the average value of all data points in the cluster. DBSCAN is a density-based clustering algorithm that clusters high-density data points into one category. Hierarchical clustering is a clustering algorithm based on a tree structure. It divides the data into a series of hierarchical structures, each of which represents a type of website. Fan et al. [20] used a Fast Unfolding algorithm to cluster websites and extract URL features of illegal websites, and judged whether they were illegal websites by detecting the URL features of unknown websites. Xu et al. [21] used a density clustering algorithm to cluster dark web sites, phishing sites, and normal websites, which proved the effectiveness of clustering on the classification of dark web sites and phishing sites. Li et al. [22] proposed a gambling website detection method based on the PAM probabilistic topic model. By giving different weights to the structured information of the website, it is analyzed whether it has a high probability of tending to the ‘gambling’ topic for classification.
In summary, the detection of illicit websites is an important problem that can be addressed using supervised and unsupervised learning methods. Supervised methods [19] generally outperform unsupervised methods [22] in terms of recognition accuracy, with unsupervised methods often struggling to exceed 90% accuracy. However, supervised methods heavily rely on annotated training data, which is a labor-intensive task. Insufficient annotated samples can lead to a decline in model robustness. On the other hand, unsupervised learning methods do not require annotated data and can scale up the sample size without significant limitations, enabling the handling of unknown types of illicit websites. In this study, we focus on the analysis of massive network asset data from web mapping platforms. It is challenging to effectively annotate such large-scale datasets to represent the overall data characteristics. Therefore, we have chosen to employ unsupervised methods for the analysis of the extensive web mapping data, to uncover potential illicit websites.

2.2.2. Deep Clustering

Deep clustering refers to a clustering analysis technique that utilizes deep learning algorithms. Unlike traditional clustering algorithms, deep clustering is particularly adept at handling high-dimensional data. By leveraging the power of deep learning, it can automatically uncover hidden structures and patterns within the data, enabling the discovery of more complex data relationships and ultimately improving the accuracy of clustering results.
The early deep clustering method [23,24] used an auto-encoder to learn the representation of the data, and the data representation directly used k-means to obtain the clustering results. Later, the representation learning module of deep clustering added clustering prior, and IDFD [25] improved the performance of clustering by learning the similarity between sample instances and reducing the correlation of features. DeepCluster [26] introduced the features obtained by the feature extraction network into the k-means algorithm to generate pseudo-labels, and then reduced the gap between the classification model prediction results and pseudo-labels to update the network weights. SCAN [27] first uses comparative learning to obtain a feature extractor that can obtain image semantic information, and then uses this semantic feature to cluster, to prevent clustering from relying on low-level features. Experiments show that semantic features have a significant improvement in clustering results. The previous methods separate the representation learning and clustering modules, and perform optimization learning independently. There are also some algorithms that optimize the representation learning module and the clustering module in an end-to-end manner. The representative method is DEC [28]. DEC combines the autoencoder and the self-trainer, initializes the representation learning module through the encoder in the autoencoder, and uses the self-training strategy to optimize the representation learning module and the clustering module simultaneously.
In summary, deep clustering leverages different network architectures to effectively process diverse types of data. For example, convolutional neural networks (CNNs) excel at handling image data, while recurrent neural networks (RNNs) or transformer models are well-suited for text sequence data. This versatility allows deep clustering to outperform traditional clustering methods when dealing with complex data. By leveraging specialized network structures, deep clustering can capture intricate patterns and dependencies within the data, leading to more accurate and robust clustering results.

3. Methodology

In this section, we propose a comprehensive method that utilizes the abundant asset data from the network mapping engine to cluster and uncover illegal websites. Our method comprises three key stages: acquiring the semantic representation of website data, clustering the website data based on the similarity of representations, and employing the characterization of website data and clustering model to identify illegal websites within a vast sample pool. By leveraging these stages, our method aims to achieve accurate and efficient discovery of illegal websites using the available network asset data.
Figure 1 illustrates the two-stage clustering method employed for the discovery of illegal websites. The proposed model utilizes two modal data inputs, with one being image data representing screenshots of the websites. To address the image-based webpage camouflage method discussed in Section 2.1.2, the text data used in our model are obtained through OCR (Optical Character Recognition), which includes both Chinese and English text. In our approach, we utilize the pre-trained image encoder model from CLIP [29] for processing the image data. However, since the original CLIP model is trained primarily on English corpus, it may not be ideal for handling mixed Chinese and English text extracted through OCR. To address this, we replace the text encoder component of the original CLIP model with the BERT [30] model, which is better suited for processing multilingual text. After the pre-training phase, both the image and text data can be transformed into semantic vectors, capturing the underlying semantic meaning. These semantic vectors are then fed into the MLP (Multi-Layer Perceptron) clustering model. The clustering model analyzes the similarity between the semantic vectors and groups samples with similar semantic distances into the same category, effectively achieving the clustering of the data.
In the following sections, we will introduce the specific content of each part in detail.

3.1. Data Representation

With the advancements in computer computing power, training large-scale deep learning models has become more viable. Furthermore, the internet is teeming with vast amounts of unlabeled data, which has fueled the growing interest in unsupervised learning. In recent years, contrastive learning has emerged as a prominent approach in unsupervised learning, yielding impressive outcomes. Contrastive learning aims to learn high-quality feature representations by maximizing the similarity between positive samples and minimizing the similarity between positive samples and negative samples. It eliminates the need for manual annotations on unlabeled data, relying instead on learning representative features from the data, which can be applied to various tasks. Moreover, contrastive learning leverages the latent information within the data to enhance the model’s generalization capability and robustness. The CLIP [29] model is a multimodal unsupervised learning model that effectively learns representations of both images and texts. It establishes a bridge between image and text representations through an objective function, ensuring that the representations of images and texts do not reside in separate distribution spaces.
In contrastive learning, the CLIP model is a multimodal model with parallel image and text branches, and its training target is constructed by computing the similarity of the feature vectors of the two branches. In this paper, $f_{img}(\cdot)$ denotes the image encoder, $f_{text}(\cdot)$ denotes the text encoder, $X_{img} \in \mathbb{R}^{N \times H \times W \times C}$ denotes the images in a batch, and $X_{text} \in \mathbb{R}^{N \times L}$ denotes the text in a batch; the feature representation of each modality is then obtained as:

$$F_{img} = f_{img}(X_{img}) \in \mathbb{R}^{N \times D_i}, \qquad F_{text} = f_{text}(X_{text}) \in \mathbb{R}^{N \times D_t}$$

$D_i$ and $D_t$ are the feature vector dimensions of the image and the text, respectively. After obtaining each modality’s feature representation, the image features and text features are projected to the same dimension $D_e$ through a linear layer and then L2-normalized to keep the numerical scales consistent:

$$F_{img}^{norm} = \frac{F_{img} W_{img}}{\left\| F_{img} W_{img} \right\|_{2}} \in \mathbb{R}^{N \times D_e}, \qquad F_{text}^{norm} = \frac{F_{text} W_{text}}{\left\| F_{text} W_{text} \right\|_{2}} \in \mathbb{R}^{N \times D_e}$$

At this point, as shown in the pretext model in Figure 1, multiplying the image feature matrix by the text feature matrix yields the scoring matrix for the batch. The diagonal contains the scores of the $N$ paired positive sample pairs, while the remaining elements of the matrix are the scores of the $N^2 - N$ negative sample pairs:

$$M = (F_{img}^{norm})(F_{text}^{norm})^{T} \in \mathbb{R}^{N \times N}$$

After that, the cross-entropy loss is calculated for each row and each column of $M$, and the total loss of the model is obtained by summing these losses.
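A minimal PyTorch sketch of this symmetric objective, assuming the two encoders already return the batch feature matrices; the temperature scaling is a common CLIP-style addition and not part of the formula above:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(F_img, F_text, W_img, W_text, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of N paired samples.

    F_img: (N, D_i) image features; F_text: (N, D_t) text features.
    W_img: (D_i, D_e) and W_text: (D_t, D_e) project both modalities to D_e.
    The temperature value is an assumed hyperparameter.
    """
    # project to the shared dimension D_e and L2-normalize each row
    z_img = F.normalize(F_img @ W_img, dim=-1)
    z_text = F.normalize(F_text @ W_text, dim=-1)

    # (N, N) scoring matrix M; the diagonal holds the N positive pairs,
    # the off-diagonal entries are the N^2 - N negative pairs
    M = z_img @ z_text.t() / temperature

    targets = torch.arange(M.size(0), device=M.device)
    # cross-entropy over each row (image -> text) and each column (text -> image)
    loss_img = F.cross_entropy(M, targets)
    loss_text = F.cross_entropy(M.t(), targets)
    return (loss_img + loss_text) / 2
```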

3.1.1. Image Representation

The image representation in our approach utilizes the CLIP image encoder model. However, due to the scale of the problem and the limitations of local computing power, we employ the ViT-B/32 model in this paper. The model structure is illustrated in Figure 2.
The model is based on the Transformer, and its structure can be divided into the following parts: an input embedding layer, a Transformer Encoder layer, and an MLP Head layer. The embedding layer converts the image into a sequence of token vectors and is implemented by a convolution layer with a kernel size of 32 × 32, a stride of 32, and 768 convolution kernels. A class token is then prepended and a position encoding is added; together these constitute the embedding layer of ViT. The Transformer Encoder layer stacks the Encoder block 12 times. The MLP block inside each Encoder block is a linear projection layer that keeps the input and output dimensions of the Encoder block unchanged. The MLP Head layer is composed of a linear layer and an activation function layer.
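The embedding layer described above can be sketched as follows, assuming square 224-pixel inputs; this illustrates only the patchify, class-token, and position-encoding steps, not the full ViT-B/32:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """ViT-B/32 style embedding: 32x32 conv patches, class token, position codes."""
    def __init__(self, img_size=224, patch=32, dim=768):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # 768 kernels
        num_patches = (img_size // patch) ** 2
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x):                                # x: (B, 3, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, num_patches, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)                   # prepend the class token
        return x + self.pos_embed                        # add the position encoding
```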

3.1.2. Text Representation

The text representation model in our approach utilizes the BERT-BASE model, which consists of a total of 110 million parameters. Compared to the BERT-LARGE model with 340 million parameters, BERT-BASE is smaller in size, resulting in faster operation speed and making it more suitable for processing a large number of samples efficiently. The model structure is depicted in Figure 3.
Our text data are obtained by extracting the text from web page screenshots through OCR technology. First, each input sentence is split into $n$ words $[w_1, w_2, \ldots, w_n]$, and then a [CLS] token and a [SEP] token are added before and after the token sequence to construct the input token sequence for the BERT model. After encoding with the BERT model, we obtain the representation of each token $e^{[l]} = [e^{[l]}_{[CLS]}, e^{[l]}_{[token_1]}, \ldots, e^{[l]}_{[SEP]}]$; we take out the representation of the [CLS] token, pass it into a linear projection layer, and map it to the dimension of the text representation vector we need.
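A sketch of this text branch using the HuggingFace transformers interface; the checkpoint name follows the mengzi-bert-base model mentioned in Section 3.1.3 and the 768-to-512 projection, but the exact loading code is illustrative rather than the paper's implementation:

```python
import torch.nn as nn
from transformers import BertModel

class TextEncoder(nn.Module):
    """BERT-BASE text encoder: take the [CLS] representation, project 768 -> 512."""
    def __init__(self, name="Langboat/mengzi-bert-base", out_dim=512):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)      # assumed checkpoint name
        self.proj = nn.Linear(self.bert.config.hidden_size, out_dim)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                # representation of [CLS]
        return self.proj(cls)                            # (B, 512) text representation
```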

3.1.3. Transfer Learning

Due to the significant number of parameters in the representation model, a large dataset is required for training. The image encoder of the original CLIP model was trained on a dataset containing over 400 million samples, resulting in excellent image representation capabilities. Therefore, we have conducted two experiments on the image encoder of CLIP. The first experiment involved keeping the original parameters unchanged and solely training the text encoder model to align the text feature space with the image feature space. The second experiment involved fine-tuning the parameters of the image encoder during training. Detailed experimental results can be found in Section 4.2.
For the text encoder, since the original CLIP is trained on an English corpus, it is not suitable for this task, and the BERT-BASE model has a large number of parameters (110 million). In order to reduce the training cost and accelerate the convergence of the model, we train the text encoder with a transfer learning-based method. We use the mengzi-bert-base pre-trained model [31], which is trained on more than 300 GB of Chinese corpus. Our transfer learning-based method can therefore avoid overfitting and reduce training time when fine-tuning parameters on a small-scale text dataset. We only extract the representation of the [CLS] token from the BERT model output, whose dimension is 768, which is not suitable for our task. Therefore, we add a linear projection layer after the BERT output layer to reduce its dimension to 512.
Furthermore, during the fine-tuning process of the aforementioned encoder models, we employed a very small learning rate to prevent rapid distortion of the model parameters. The learning rate was set to be less than 0.00005, enabling the two encoder models to adapt to new data while maintaining the stability of the original excellent feature extraction capability.
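The two fine-tuning regimes described above (keeping the image encoder frozen versus also fine-tuning it) differ essentially in which parameters receive gradients. A sketch of the optimizer setup under that assumption, with cosine annealing and the small learning rate mentioned above; the function arguments are placeholders:

```python
import torch

def build_optimizer(image_encoder, text_encoder, finetune_image=False, lr=5e-5):
    """First regime: freeze the CLIP image encoder, train only the text branch.
    Second regime: also fine-tune the image encoder, still with a very small lr."""
    for p in image_encoder.parameters():
        p.requires_grad = finetune_image

    params = list(text_encoder.parameters())
    if finetune_image:
        params += list(image_encoder.parameters())

    optimizer = torch.optim.AdamW(params, lr=lr)
    # cosine annealing keeps the learning rate small throughout training
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
    return optimizer, scheduler
```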

3.2. Cluster Analysis

In order to cluster websites engaged in illegal activities such as gambling, pornography, or fraud, this paper proposes a multimodal deep clustering model (MDC) based on a fully connected neural network. The multimodal deep clustering model exhibits stronger data fitting capabilities compared to traditional clustering algorithms such as the k-means model. It enhances the similarity between similar samples and increases the distance between dissimilar samples, leading to a significant improvement in clustering performance. Please refer to Section 4.2 for specific experimental results. In the design of the model’s objective function, we extended the objective function proposed in reference [27] and incorporated some supervised signals to enhance the model’s effectiveness. The details of the objective function design are described in Section 3.2.2.
The structure of the clustering model is depicted in Figure 4. The process involves combining the image representation vector and the text representation vector obtained from the image encoder and text encoder, respectively, to generate a representation vector for each sample. Then, by using the representation vectors, approximately k samples for each sample are identified. Finally, the clustering model is trained by ensuring the consistency of the outputs of similar samples inputted into the clustering model.
In Figure 4, “Anchor” represents the input vector of the clustering model, which is formed by concatenating the webpage screenshot feature vector and the OCR text feature vector of the input sample. “Neighbor” denotes the representation vector obtained by concatenating the webpage screenshot feature vector and the OCR text feature vector of a randomly selected sample from the 50 most similar samples. The subsequent clustering model consists of a two-layer fully connected neural network, with the number of neurons in the hidden layer being half of the number of neurons in the input layer.
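A minimal sketch of such a clustering head, following the description above (input formed by concatenating the two 512-dimensional feature vectors, hidden layer half the width of the input); the number of clusters is a hyperparameter, not a value fixed by the paper:

```python
import torch
import torch.nn as nn

class ClusterHead(nn.Module):
    """Fully connected clustering head over concatenated image+text features."""
    def __init__(self, in_dim=1024, num_clusters=10):   # num_clusters is illustrative
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, in_dim // 2),   # hidden layer: half of the input width
            nn.ReLU(),
            nn.Linear(in_dim // 2, num_clusters),
        )

    def forward(self, img_vec, text_vec):
        x = torch.cat([img_vec, text_vec], dim=-1)    # "Anchor"/"Neighbor" input vector
        return torch.softmax(self.net(x), dim=-1)     # soft cluster assignment
```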

3.2.1. Sample Similarity Calculation

Since sample similarity must be computed efficiently over a large number of samples, it is computed with normalized dot-product similarity. The calculation formula is as follows:

$$\mathrm{Similarity} = \frac{x_1 \cdot x_2}{|x_1| \, |x_2|}$$

where $x_1$ and $x_2$ are the feature vectors of two samples, and $|x_1|$ and $|x_2|$ are the norms of the two feature vectors. For each sample, the dot-product similarity with all other samples in the dataset is computed, and the $k$ samples with the highest similarity are kept; in this paper, $k$ is 50. Finally, the similarity relationships between samples in the whole dataset are stored, and these stored relationships are the input of the subsequent clustering model.
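A minimal sketch of this neighbor-mining step (k = 50), computing normalized dot products block by block to bound memory; at the full 3.7-million-sample scale an approximate nearest-neighbor index would be preferable:

```python
import torch
import torch.nn.functional as F

def topk_neighbors(features, k=50, block=1024):
    """For each sample, return the indices of the k most similar samples
    under normalized dot-product (cosine) similarity, excluding the sample itself."""
    feats = F.normalize(features, dim=-1)
    all_idx = []
    for start in range(0, feats.size(0), block):
        sims = feats[start:start + block] @ feats.t()          # (block, N) similarities
        rows = torch.arange(sims.size(0))
        sims[rows, torch.arange(start, start + sims.size(0))] = -1.0   # drop self-matches
        all_idx.append(sims.topk(k, dim=-1).indices)
    return torch.cat(all_idx)                                   # (N, k) neighbor indices
```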

3.2.2. Clustering Loss

The clustering module is implemented by a three-layer MLP, and its parameters are optimized by making similar samples consistent in their outputs. Let $f_{cluster}$ denote the clustering model, $X$ a training sample, and $X_k$ a sample similar to $X$; we then need to maximize $\log \left\langle f_{cluster}(X), f_{cluster}(X_k) \right\rangle$, where $\left\langle \cdot, \cdot \right\rangle$ denotes the dot product. The dot product is maximal when the two predictions become identical one-hot vectors. However, an MLP has a strong mapping ability and can map all samples to a single class, leading to a degeneration problem. Therefore, to prevent the MLP from assigning all samples to the same class, we maximize the entropy of the average cluster assignment so that samples are not all assigned to one cluster. In addition, during clustering, a cross-entropy loss is added by generating pseudo-labels for high-confidence samples. The final optimization function of the clustering module is therefore:
$$L = -\frac{1}{|D|}\sum_{X \in D} \sum_{X_k \in \mathcal{N}_X} \log \left\langle f_{cluster}(X), f_{cluster}(X_k) \right\rangle + \lambda \sum_{c \in \mathcal{C}} f_{cluster}^{\prime c} \log f_{cluster}^{\prime c} - \gamma \sum_{i=1}^{I} \sum_{c=1}^{C} y_{ic} \log f_{cluster}^{c}(X_i), \quad \text{with} \quad f_{cluster}^{\prime c} = \frac{1}{|D|}\sum_{X \in D} f_{cluster}^{c}(X)$$
Here, $\mathcal{C} = \{1, \ldots, C\}$ denotes the set of clusters, $f_{cluster}^{c}(X_i)$ denotes the probability that $X_i$ belongs to cluster $c$, and the weight of the entropy term is controlled by $\lambda$. The index $i = 1, \ldots, I$ runs over the $I$ samples whose maximum output probability from $f_{cluster}$ exceeds the set threshold, $y_i$ denotes the one-hot pseudo-label of the $i$-th sample, $f_{cluster}(X_i)$ denotes the probability distribution of the $i$-th sample output by the clustering model, and the weight of the cross-entropy loss is controlled by $\gamma$.
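A sketch of this objective in PyTorch, operating on the soft assignments of a batch of anchor–neighbor pairs; the values of λ, γ, and the pseudo-label confidence threshold are illustrative, not the paper's settings:

```python
import torch
import torch.nn.functional as F

def mdc_loss(p_anchor, p_neighbor, lam=2.0, gamma=1.0, threshold=0.99):
    """p_anchor, p_neighbor: (B, C) soft cluster assignments of each sample and
    one of its k nearest neighbors. Hyperparameter values are illustrative only."""
    eps = 1e-8
    # 1) consistency term: maximize log <f(X), f(X_k)> over neighbor pairs
    consistency = -torch.log((p_anchor * p_neighbor).sum(dim=-1) + eps).mean()

    # 2) entropy term: penalize collapsing all samples into a single cluster
    mean_p = p_anchor.mean(dim=0)
    entropy_term = (mean_p * torch.log(mean_p + eps)).sum()

    # 3) pseudo-label cross entropy on high-confidence samples
    conf, pseudo = p_anchor.max(dim=-1)
    mask = conf > threshold
    ce = torch.tensor(0.0, device=p_anchor.device)
    if mask.any():
        # nll_loss on log-probabilities equals cross entropy with the pseudo-labels
        ce = F.nll_loss(torch.log(p_anchor[mask] + eps), pseudo[mask])

    return consistency + lam * entropy_term + gamma * ce
```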

3.3. Discovery of Illegal Websites

In the process of discovering illegal websites, the pre-trained fully connected neural network (FCNN) clustering model, introduced in Section 3.2, is utilized. This clustering model employs a soft label approach, where the model output for each data sample includes probability values indicating its likelihood of belonging to a particular category. By setting a threshold, datasets predominantly composed of illegal websites can be filtered out. Subsequently, through manual screening, a substantial number of illegal websites can be efficiently identified. The entire process is illustrated in Figure 5.
First and foremost, a significant amount of website data is required, encompassing both legitimate websites and potential illegal websites. These data can be acquired through the network mapping engine. Once obtained, the data need to undergo preprocessing before being fed into the clustering model. The processing steps mainly include extracting text from webpage screenshots and obtaining representation vectors for webpage screenshots and text using a pre-trained CLIP model. Webpage screenshots are obtained by simulating browser access, and the screenshot is taken after the webpage has finished loading. The resolution of the webpage screenshot is 1000 × 600. Text on the webpage screenshots is recognized using PaddleOCR [32]. PaddleOCR has a high accuracy in character recognition, and its model size is small, resulting in fast character recognition speed.
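For illustration, the OCR step could look like the following sketch using the PaddleOCR Python package; the constructor arguments and the structure of the returned result vary across PaddleOCR versions, so treat this as an assumed interface rather than the exact code used in the paper:

```python
from paddleocr import PaddleOCR

# Chinese + English OCR over a 1000x600 webpage screenshot (file name is a placeholder)
ocr = PaddleOCR(lang="ch")
result = ocr.ocr("screenshot.png")

# result is assumed to be a nested list: one entry per image, each a list of
# (bounding_box, (text, confidence)) detections; concatenate the recognized text
lines = [item[1][0] for page in result for item in page]
page_text = " ".join(lines)   # fed to the text encoder as the sample's OCR text
```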
Subsequently, the obtained representation vector data are fed into the pre-trained fully connected neural network clustering model, with the model parameters kept fixed. To account for the uncertainty of data points during the clustering process, a soft label strategy is employed. This strategy assigns each data point a probability value indicating its likelihood of belonging to a specific category. These probability values reflect the degree of ambiguity of data points between different categories and can be utilized to assess the accuracy and stability of the clustering. In practical terms, a threshold can be set for each data point based on these probability values. If the probability of a data point representing an illegal website surpasses this threshold, it is considered a potential illegal website.
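A minimal sketch of this soft-label thresholding step, assuming the cluster corresponding to illicit websites has already been identified and using the 0.999 confidence level reported in Section 4 as the threshold:

```python
import torch

def filter_candidates(probs, illegal_cluster, threshold=0.999):
    """probs: (N, C) soft cluster assignments from the clustering model.
    Returns the indices of samples assigned to the illegal cluster with
    probability above the threshold; these then go to manual screening."""
    conf, pred = probs.max(dim=-1)
    mask = (pred == illegal_cluster) & (conf >= threshold)
    return mask.nonzero(as_tuple=True)[0]
```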
After the threshold screening, a dataset containing numerous potential illegal websites is obtained. However, to ensure the dataset’s accuracy and establish a reliable foundation for further research, manual screening remains necessary to validate the categories of data points. During this process, a thorough review of the content, structure, domain names, and other pertinent information of potential illegal websites is conducted to make a more precise determination of their legality.

4. Experiments and Analysis

In this section, we conduct experiments to evaluate the proposed method for cluster discovery of illegal websites. Our experimental environment is a workstation equipped with Intel(R) Xeon(R) Gold 6230R CPU, 256 GB memory, and Nvidia Geforce RTX4090 with 24 GB video memory.
Datasets. The dataset used in this study was obtained from the asset data available on the cyberspace mapping platform. Over 3.7 million data entries were downloaded, each containing 14 attributes. A detailed description of these attributes can be found in Section 4.3. To address efficiency concerns and considering that many illegal websites primarily utilize images for webpage construction, only the ‘screen’ attribute from the website asset data was used in this research. Text information was extracted from webpage screenshots using OCR technology. The dataset comprises a random selection of data downloaded from the surveying and mapping platform, encompassing both normal and illegal websites.
Evaluation metrics. In this paper, we use ACC (Accuracy), NMI (Normalized Mutual Information) and ARI (Adjusted Rand index) to evaluate the performance of the clustering algorithm. The specific formula is as follows:
$$ACC = \frac{TP + TN}{TP + FP + TN + FN}$$

where $TP$ is the number of true positives, $TN$ the number of true negatives, $FP$ the number of false positives, and $FN$ the number of false negatives.

$$NMI = \frac{2 \times I(Y, C)}{H(Y) + H(C)}$$

where $Y$ represents the clustering result, $C$ represents the real cluster labels, $I(Y, C)$ is the mutual information between $Y$ and $C$, and $H(Y)$ and $H(C)$ are the entropies of $Y$ and $C$, respectively. The higher the value of $NMI$, the closer the clustering result is to the real labels; its value range is $[0, 1]$.

$$ARI = \frac{RI - E[RI]}{\max(RI) - E[RI]}$$
where $RI$ is the Rand index, $E[RI]$ is the expected Rand index, and $\max(RI)$ is the maximum possible Rand index. The value range of $ARI$ is $[-1, 1]$, and the larger the value, the more consistent the clustering result is with the real result.
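For reference, these metrics can be computed with scikit-learn together with a Hungarian assignment for the cluster-to-label matching; the matching-based ACC below is the standard clustering accuracy and reduces to the formula above in the two-class case:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_metrics(y_true, y_pred):
    """ACC via optimal cluster-to-label matching, plus NMI and ARI."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    count = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        count[p, t] += 1                                  # co-occurrence of cluster p and label t
    row, col = linear_sum_assignment(count.max() - count) # best cluster-to-label permutation
    acc = count[row, col].sum() / len(y_true)
    return {
        "ACC": acc,
        "NMI": normalized_mutual_info_score(y_true, y_pred),
        "ARI": adjusted_rand_score(y_true, y_pred),
    }
```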

4.1. Data Representation Evaluation

Our data representation model is implemented by replacing the text encoder module of the original CLIP model with BERT, while preserving the structure and parameters of the image encoder in the CLIP model. During the training of the representation model, we employed two methods. The first method involved keeping the parameters of the image encoder model unchanged and only adjusting the parameters of the BERT model to align with the image encoder parameters. The second method involved tuning the parameters of both the image encoder model and the BERT model simultaneously. The training loss curves for both methods are depicted in Figure 6.
In Figure 6, the label “NMV” represents the scenario where no fine-tuning of the image encoder parameters was performed, while “MV” indicates the scenario where the image encoder parameters were fine-tuned. From the figure, it can be observed that both methods were able to converge the model parameters, and the NMV method achieved a lower loss value within the same number of training epochs. In the subsequent experiments, clustering analyses were conducted on the representations generated by the trained representation models obtained from both training methods. It was further observed that the clustering performance was better when the image encoder model was not fine-tuned.
The original CLIP model is trained on a significantly larger dataset of over 400 million image–text pairs, which exceeds the scale of the experimental dataset used in this paper, which consists of 3.7 million samples. Fine-tuning the CLIP parameters on a small dataset may risk disrupting the already excellent parameters of the model. To mitigate this effect, this paper employs the cosine annealing algorithm to dynamically adjust the learning rate during the training of the representation model parameters. The specific learning rate adjustments can be observed in Figure 7.
During the training process, the learning rate is kept relatively small, typically not exceeding $5 \times 10^{-5}$. This cautious approach is taken because the BERT model parameters also undergo transfer learning, utilizing pre-trained parameters from large text datasets. To ensure stability and avoid rapid modification of the model parameters, a small learning rate is used.

4.2. Website Clustering Evaluation

We trained a deep clustering model based on a multi-layer perceptron using the obtained data representation. The training dataset for the clustering model consisted of 200,000 data points extracted from the total dataset of 3.7 million. To compare with the traditional clustering method, we also conducted a clustering experiment using the k-means algorithm on the same data representation. The experimental results are summarized in Table 1.
The * in Table 1 indicates the absence of that experimental result, as the k-means results do not have a notion of confidence. MDC in Table 1 is the abbreviation of the multi-modal deep clustering model used in this paper. NMV means “not modify vision-encoder”: rows marked NMV use the representation generated by the model without fine-tuning the parameters of the CLIP image encoder. MV means “modify vision-encoder”: rows marked MV use the representation generated by the model with fine-tuned CLIP image encoder parameters. Rows marked img use only the representation of the image data in the clustering process, and rows marked text use only the representation of the text data. CEloss means that a cross-entropy loss is added to the clustering loss function, providing guidance similar to supervised training for the clustering model parameters through high-confidence samples. The column Confident-0.999 reports the accuracy of the clustering model for samples assigned to a category with a confidence higher than 0.999; this column shows that, for high-confidence samples, the clustering accuracy is higher.
It can be seen from Table 1 that, in the group of ‘k-means NMV’ experiments, the results using only the img features are the best, and combining text features with image features is slightly worse, which also shows the strength of the original image encoder parameters. In the ‘MDC NMV’ group, the combination of the image data representation and the text data representation performs best, and the effect is significantly better than using only the image representation or only the text representation. Compared with the direct use of the representation data by the k-means algorithm, the multi-layer perceptron model used by MDC is better able to process these complex data. In the ‘k-means MV’ group, the combination of image representation and text representation is the best. In the ‘MDC MV’ experiments, the image encoder parameters are fine-tuned on our data, which makes the performance of the model lower than that of ‘MDC NMV’. Interestingly, within this group, clustering using only the text representation performs best; this is likely because adjusting the image encoder parameters means the text encoder parameters change relatively little, retaining the better text encoder parameters. These clustering experiments suggest that model parameters trained on a large dataset do not require fine-tuning for many downstream tasks. Our representation model is trained on 3.7 million image–text pairs, which is still small compared with the training data of the original CLIP image encoder and the BERT text encoder; more data would therefore be needed before fine-tuning these large pre-trained models improves their performance.
The experimental results unequivocally demonstrate that the utilization of deep clustering methods yields a substantial improvement in accuracy compared to traditional clustering algorithms, such as k-means and DBSCAN. In addition to their performance boost, deep clustering algorithms offer enhanced flexibility, enabling adjustments to the model’s structure and objective function to optimize clustering outcomes. Nevertheless, this increased flexibility presents a challenge in finding optimal hyperparameters, necessitating a considerable investment of time and computational resources for the tuning process.
All of the experimental groups were compared together, and the results are shown in Figure 8.
The top line in Figure 8 is ‘MDC NMV CEloss’, indicating that it is the best model, with an accuracy of 84.1%. If only samples with a confidence level exceeding 0.999 are selected, its accuracy reaches 92.39% in a completely unsupervised manner. Therefore, we use the ‘MDC NMV CEloss’ model for clustering in the subsequent large-scale illegal website discovery.

4.3. Clustering Discovery of Illegal Websites

In the process of clustering the 3.7 million data, the previously trained multi-layer perceptron clustering model is used. To handle the large amount of data, 10,000 data points are read at a time for clustering, and the clustering results are aggregated. Afterward, a manual review is conducted to remove a small number of normal websites, resulting in a dataset consisting entirely of illegal websites. The selected illegal website dataset includes the attributes listed in Table 2.
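A sketch of this batched pass, assuming the representation vectors have been saved to disk as paired NumPy shards and that the trained clustering head from Section 3.2 is available as `cluster_head`; the file layout and names are placeholders:

```python
import numpy as np
import torch

def cluster_in_batches(shards, cluster_head, batch_size=10000):
    """shards: list of (image_feature_file, text_feature_file) pairs.
    Runs the frozen clustering model 10,000 samples at a time and
    aggregates the soft cluster assignments for later thresholding."""
    cluster_head.eval()
    all_probs = []
    with torch.no_grad():
        for img_path, txt_path in shards:
            img = torch.from_numpy(np.load(img_path)).float()
            txt = torch.from_numpy(np.load(txt_path)).float()
            for s in range(0, img.size(0), batch_size):
                probs = cluster_head(img[s:s + batch_size], txt[s:s + batch_size])
                all_probs.append(probs.cpu())
    return torch.cat(all_probs)      # (N_total, C) soft cluster assignments
```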
After applying the clustering model, we identified 467,382 data points as classified illegal websites. From these results, we further screened out 431,820 potential illegal website data points based on threshold criteria. Subsequently, through manual screening, we selected a total of 397,275 illegal website data points. Each attribute of these illegal website data points is stored in a JSON file. The “screen” attribute and “html” attribute are stored separately as independent files, residing in two distinct directories. In the JSON file, only the relative paths to these files are stored.

5. Conclusions

In this paper, we propose an unsupervised discovery method for illegal websites combined with big data from a network surveying and mapping platform. The method uses image–text contrastive learning to extract visual features of web page screenshots and semantic features of the text on those images. Moreover, through observation and analysis of a large number of illegal website samples, we found that many illegal websites build their pages solely by arranging images, which makes it impossible to obtain semantic text features through HTML analysis. Illegal websites employ various camouflage techniques to evade detection and regulation. Thus, for those tasked with governing such websites, we recommend exploring and identifying more of the camouflage methods used by these illicit entities. Addressing different camouflage techniques with specific countermeasures can improve the accuracy of curbing illegal websites, and employing a series of sequential detection methods can also improve the recall rate in identifying illegal websites.
In this paper, we primarily focused on the camouflage technique of constructing websites solely with images. We perform feature similarity clustering on the acquired features and learn by encouraging similar samples to obtain the same category. In the end, the experimental results prove the effectiveness of discovering illegal websites on large-scale data through unsupervised methods, and the various evaluation indicators achieved the expected goals. Moreover, through the proposed method, a large number of illegal websites were found in the real asset data on the network mapping platform, allowing us to construct a large-scale illegal website dataset. The illicit website dataset comprises nearly 400,000 records, and we have made it open-source at https://www.kaggle.com/listone/black-website, accessed on 1 August 2023. Each record in the dataset contains the attributes listed in Table 2. The data are stored offline, ensuring that, even if the websites are no longer active in the future, their features can still be analyzed.
Although the proposed method realizes large-scale illegal website discovery, the surveying and mapping platform provides many attributes for each website asset, while this paper only uses the webpage screenshot attribute. In the future, the performance of the model could be improved by adding features from more asset attributes. In this paper, we fuse features by concatenating webpage screenshot features and webpage text features, which is effective when there are few feature attributes; when more feature attributes are used in the future, the feature dimension may become too high, so it is necessary to explore how to better integrate multimodal data features. When constructing the dataset of illegal websites, we improved the model’s accuracy by setting confidence thresholds. However, because the approach is purely unsupervised, the highest achieved accuracy was only 92.39%, necessitating manual screening to ensure the dataset’s high quality. In the future, leveraging the extensive dataset of illegal websites provided in this paper, it would be possible to build highly accurate and robust identification models to achieve fully automated detection of illegal websites.

Author Contributions

Conceptualization, B.W. and F.S.; methodology, B.W. and F.S.; software, B.W.; validation, B.W. and F.S.; formal analysis, F.S. and H.Z.; investigation B.W. and F.S.; resources, F.S.; data curation, B.W. and H.Z.; writing—original draft preparation B.W.; writing—review and editing F.S.; visualization, B.W.; supervision, F.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China under Grant 2021YFB3100500.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are publicly available at https://www.kaggle.com/listone/black-website, accessed on 1 August 2023.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
MDC    Multi-modal Deep Clustering
NMV    Not Modify Vision-Encoder
MV     Modify Vision-Encoder

Appendix A

The four images in Figure A1 are all website screenshots. To the right of each image is the HTML source code of the respective webpage. The code displayed in the four images of Figure A1 demonstrates that the HTML of these webpages does not contain distinctive semantic text; the webpage is entirely composed of images. The Chinese text on the images in Figure A1 is explained in the main text.
Figure A1. (a–d) Website page camouflage.

References

  1. ITU. Measuring Digital Development: Facts and Figures 2022. 2022. Available online: https://www.itu.int/hub/publication/d-ind-ict_mdd-2022/ (accessed on 1 August 2023).
  2. Senker, C. Cybercrime and the DarkNet: Revealing the Hidden Underworld of the Internet; Arcturus Publishing: London, UK, 2016. [Google Scholar]
  3. Shodan. 2023. Available online: https://www.shodan.io/ (accessed on 1 August 2023).
  4. Censys. 2023. Available online: https://search.censys.io/ (accessed on 1 August 2023).
  5. Binaryedge. 2023. Available online: https://www.binaryedge.io/ (accessed on 1 August 2023).
  6. Zoomeye. 2023. Available online: https://www.zoomeye.org/ (accessed on 1 August 2023).
  7. Fofa. 2023. Available online: https://fofa.info/ (accessed on 1 August 2023).
  8. Quake. 2023. Available online: https://quake.360.net/quake/ (accessed on 1 August 2023).
  9. RaySpace. 2023. Available online: https://www.webray.com.cn/channel/RaySpace.html (accessed on 1 August 2023).
  10. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  11. Hearst, M.A.; Dumais, S.T.; Osuna, E.; Platt, J.; Scholkopf, B. Support vector machines. IEEE Intell. Syst. Their Appl. 1998, 13, 18–28. [Google Scholar]
  12. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar]
  13. Zhou, Z.H. Machine Learning; Springer Nature: Berlin, Germany, 2021. [Google Scholar]
  14. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357. [Google Scholar] [CrossRef]
  15. Kalabarige, L.R.; Rao, R.S.; Abraham, A.; Gabralla, L.A. Multilayer stacked ensemble learning model to detect phishing websites. IEEE Access 2022, 10, 79543–79552. [Google Scholar] [CrossRef]
  16. Tang, L.; Mahmoud, Q.H. A deep learning-based framework for phishing website detection. IEEE Access 2021, 10, 1509–1521. [Google Scholar] [CrossRef]
  17. Liu, D.J.; Geng, G.G.; Zhang, X.C. Multi-scale semantic deep fusion models for phishing website detection. Expert Syst. Appl. 2022, 209, 118305. [Google Scholar] [CrossRef]
  18. Zhao, J.; Shao, M.; Peng, H.; Wang, H.; Li, B.; Liu, X. Porn2Vec: A robust framework for detecting pornographic websites based on contrastive learning. Knowl.-Based Syst. 2021, 228, 107296. [Google Scholar] [CrossRef]
  19. Wang, C.; Zhang, M.; Shi, F.; Xue, P.; Li, Y. A Hybrid Multimodal Data Fusion-Based Method for Identifying Gambling Websites. Electronics 2022, 11, 2489. [Google Scholar] [CrossRef]
  20. Fan, Y.; Yang, T.; Wang, Y.; Jiang, G. Illegal Website Identification Method Based on URL Feature Detection. Comput. Eng. 2018, 44, 171–177. [Google Scholar]
  21. Jie, X.; Haoliang, L.; Ao, J. A new model for simultaneous detection of phishing and darknet websites. In Proceedings of the 2021 7th International Conference on Computer and Communications (ICCC), Chengdu, China, 10–13 December 2021; pp. 2002–2006. [Google Scholar]
  22. Li, G.; Yin, T.; Zhang, X. A detection method of gambling websites based on pam. Comput. Appl. Softw. 2021, 38, 167–172. [Google Scholar]
  23. Huang, P.; Huang, Y.; Wang, W.; Wang, L. Deep embedding network for clustering. In Proceedings of the 2014 22nd International Conference on Pattern Recognition, Stockholm, Sweden, 24–28 August 2014; pp. 1532–1537. [Google Scholar]
  24. Tian, F.; Gao, B.; Cui, Q.; Chen, E.; Liu, T.Y. Learning deep representations for graph clustering. In Proceedings of the AAAI Conference on Artificial Intelligence, Quebec City, QC, Canada, 27–31 July 2014; Volume 28. [Google Scholar]
  25. Tao, Y.; Takagi, K.; Nakata, K. Clustering-friendly representation learning via instance discrimination and feature decorrelation. arXiv 2021, arXiv:2106.00131. [Google Scholar]
  26. Caron, M.; Bojanowski, P.; Joulin, A.; Douze, M. Deep clustering for unsupervised learning of visual features. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 132–149. [Google Scholar]
  27. Van Gansbeke, W.; Vandenhende, S.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Scan: Learning to classify images without labels. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part X. Springer: Berlin/Heidelberg, Germany, 2020; pp. 268–285. [Google Scholar]
  28. Xie, J.; Girshick, R.; Farhadi, A. Unsupervised deep embedding for clustering analysis. In Proceedings of the International Conference on Machine Learning, PMLR, New York, NY, USA, 19–24 June 2016; pp. 478–487. [Google Scholar]
  29. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 8748–8763. [Google Scholar]
  30. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  31. Zhang, Z.; Zhang, H.; Chen, K.; Guo, Y.; Hua, J.; Wang, Y.; Zhou, M. Mengzi: Towards lightweight yet ingenious pre-trained models for Chinese. arXiv 2021, arXiv:2110.06696. [Google Scholar]
  32. PaddlePaddle. PaddleOCR. 2023. Available online: https://github.com/PaddlePaddle/PaddleOCR (accessed on 1 August 2023).
Figure 1. Illegal website clustering framework.
Figure 2. ViT-B/32 model structure.
Figure 3. Text encoder model structure.
Figure 4. Clustering model structure.
Figure 5. Discovery process of illegal websites.
Figure 6. Representation model training loss.
Figure 7. The lr value of the training representation model.
Figure 8. Comparison of experimental results.
Table 1. Clustering results.

| Method | ACC | ARI | NMI | Confident-0.999 |
|---|---|---|---|---|
| k-means NMV | 66.72 | 10.67 | 6.39 | * |
| k-means NMV img | 67.87 | 12.25 | 7.56 | * |
| k-means NMV text | 66.45 | 10.25 | 6.08 | * |
| MDC NMV | 83.17 | 43.98 | 38.26 | 90.68 |
| MDC NMV CEloss | 84.1 | 46.49 | 40.79 | 92.39 |
| MDC NMV img | 79.61 | 35.02 | 29.06 | 85.21 |
| MDC NMV img CEloss | 80.1 | 36.2 | 30.31 | 87.93 |
| MDC NMV text | 79.22 | 34.12 | 29.07 | 89.13 |
| MDC NMV text CEloss | 81.8 | 40.41 | 34.84 | 90.47 |
| k-means MV | 70.28 | 16.04 | 10.42 | * |
| k-means MV img | 69.57 | 15.02 | 9.68 | * |
| k-means MV text | 68.2 | 12.78 | 7.95 | * |
| MDC MV | 82.23 | 41.53 | 36.81 | 90.18 |
| MDC MV CEloss | 82.79 | 42.96 | 36.92 | 90.95 |
| MDC MV img | 74.78 | 24.51 | 21.49 | 81.35 |
| MDC MV img CEloss | 75.22 | 25.37 | 23.11 | 82.06 |
| MDC MV text | 82.62 | 42.53 | 37.89 | 90.73 |
| MDC MV text CEloss | 83.39 | 44.56 | 38.66 | 92.02 |
Table 2. Attributes description.

| Attributes | Description | Data Type |
|---|---|---|
| ip | IP address | character string |
| port | port number | continuous data |
| server | web container | discrete data |
| domain | domain name | text (domain name) |
| title | site title | text |
| org | organization | discrete data |
| country | country | discrete data |
| city | city | discrete data |
| html | HTML original code | text |
| screen | website screenshot | image |
| header | web response header information | text |
| subject.CN | common name information for SSL certificates | text (domain name) |
| subject.N | SSL certificate subject optional name | text (list of domain names) |
| links | site external links | text (list of domain names) |
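As an illustration of how the released records map onto the attributes in Table 2, the snippet below loads them with pandas. It assumes the Kaggle export is a flat table (e.g., a CSV) with one row per website; the actual file name and layout on Kaggle may differ, so treat this purely as a sketch.

```python
import pandas as pd

# Hypothetical file name; check the Kaggle dataset page for the actual layout.
records = pd.read_csv("black-website/illegal_websites.csv")

# The 14 attributes described in Table 2.
attributes = ["ip", "port", "server", "domain", "title", "org", "country",
              "city", "html", "screen", "header", "subject.CN", "subject.N", "links"]
print(records[attributes].dtypes)

# Example: distribution of hosting countries among the candidate sites.
print(records["country"].value_counts().head(10))
```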