Ground Truth Dataset: Objectionable Web Content
Abstract
1. Introduction
2. Related Work
3. Data Description
3.1. Domain Metadata File
Attribute | Data Type | Description |
---|---|---|
domain | String | A code (D#) replacing the domain name of the website |
geo_locs | String | Names of the countries based on the domain's IP address location using GeoIP Databases [23] |
domain_length | Numeric | Number of domain’s characters |
tld | String | Top-Level Domain (TLD) of the webpage using Tld Library [24] |
avg_time_response | Numeric | The response time of webpage request in milliseconds |
start_scrapping_timestamp | Numeric | The timestamp, in milliseconds, at which scraping of the webpage started |
domain_tls_ssl_certificate | Numeric | Value 0 if the webpage does not use a certificate and 1 if the webpage uses a certificate |
internal_urls_no | Numeric | |
internal_urls | String list | |
source | String | The collected source of the website |
label | String | A categorical string of the webpage, either objectionable or unobjectionable |
3.2. Internal Web Pages Detailed File
4. Data and Methods
4.1. Web Pages Collection
4.2. Web Pages Content Extraction
4.3. Labeling
5. Presence of Bias Results
Kappa Coefficient Inspection
- Observed agreement = 730 + 790 = 1520
- Expected agreement = ((800 × 740) + (800 × 860))/1600 = 800
- Kappa score = (1520 − 800)/(1600 − 800) = 0.90
- Kappa score > 0.80 (almost perfect agreement between human classification and ground truth classification).
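The kappa calculation can be reproduced from the four cell counts of the agreement table (730, 70, 10, 790). The following is a minimal sketch, not the authors' code; the function name `cohen_kappa` and the variable `counts` are illustrative assumptions.

```python
def cohen_kappa(table):
    """Cohen's kappa for a square agreement table.

    Rows are the source (automatic) labels, columns the human (manual) labels.
    """
    total = sum(sum(row) for row in table)
    # Observed agreement: cases on the diagonal, where both ratings match
    observed = sum(table[i][i] for i in range(len(table)))
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Agreement expected by chance, expressed in counts
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total
    return (observed - expected) / (total - expected)


# Cell counts taken from the agreement table:
# rows = source label, columns = human label
counts = [[730, 70],
          [10, 790]]
kappa = cohen_kappa(counts)
print(round(kappa, 3))
```

Because both row subtotals are 800, the chance-expected agreement is 800 regardless of how the human labels split between the two columns.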
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Sasson, H.; Mesch, G. Parental mediation, peer norms and risky online behavior among adolescents. Comput. Hum. Behav. 2014, 33, 32–38.
2. Ofcom. Children and Parents: Media Use and Attitudes Report 2018. Available online: https://www.ofcom.org.uk/__data/assets/pdf_file/0024/134907/children-and-parents-media-use-and-attitudes-2018.pdf (accessed on 24 November 2019).
3. Altarturi, H.; Saadoon, M.; Anuar, N.B. Cyber parental control: A bibliometric study. Child. Youth Serv. Rev. 2020, 116, 105134.
4. Altarturi, H.H.; Anuar, N.B. A preliminary study of cyber parental control and its methods. In Proceedings of the 2020 IEEE Conference on Application, Information and Network Security (AINS), Kota Kinabalu, Malaysia, 17–19 November 2020; pp. 53–57.
5. Altay, B.; Dokeroglu, T.; Cosar, A. Context-sensitive and keyword density-based supervised machine learning techniques for malicious webpage detection. Soft Comput. 2019, 23, 4177–4191.
6. Liu, S.; Forss, T. New classification models for detecting Hate and Violence web content. In Proceedings of the 2015 7th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management, Lisbon, Portugal, 2–14 November 2015; pp. 487–495.
7. Marchal, S.; François, J.; State, R.; Engel, T. PhishStorm: Detecting phishing with streaming analytics. IEEE Trans. Netw. Serv. Manag. 2014, 11, 458–471.
8. Sahingoz, O.K.; Buber, E.; Demir, O.; Diri, B. Machine learning based phishing detection from URLs. Expert Syst. Appl. 2019, 117, 345–357.
9. Rao, R.S.; Vaishnavi, T.; Pais, A.R. CatchPhish: Detection of phishing websites by inspecting URLs. J. Ambient. Intell. Humaniz. Comput. 2020, 11, 813–825.
10. Kotenko, I.; Chechulin, A.; Shorov, A.; Komashinsky, D. Analysis and evaluation of web pages classification techniques for inappropriate content blocking. In Advances in Data Mining: Applications and Theoretical Aspects, Proceedings of The 14th Industrial Conference, ICDM 2014, St. Petersburg, Russia, 16–20 July 2014; Springer: Cham, Switzerland, 2014; pp. 39–54.
11. Narwal, N. Web page filtering for kids. Int. J. Inf. Technol. 2021, 13, 19–25.
12. Zeng, J.; Duan, J.; Wu, C. Adaptive Topic Modeling for Detection Objectionable Text. In Proceedings of the 2013 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT), Atlanta, GA, USA, 17–20 November 2013; pp. 381–388.
13. Duan, J.; Zeng, J. Web objectionable text content detection using topic modeling technique. Expert Syst. Appl. 2013, 40, 6094–6104.
14. Rajalakshmi, R.; Tiwari, H.; Patel, J.; Kumar, A.; Karthik, R. Design of Kids-specific URL Classifier using Recurrent Convolutional Neural Network. Procedia Comput. Sci. 2020, 167, 2124–2131.
15. Patel, O.; Tiwari, A.; Patel, V.; Gupta, O. Quantum based neural network classifier and its application for firewall to detect malicious web request. In Proceedings of the 2015 IEEE Symposium Series on Computational Intelligence, Cape Town, South Africa, 7–10 December 2015; pp. 67–74.
16. Zhao, C.; Zhang, Y.; Zang, T.; Liang, Z.; Wang, Y. A Stacking Approach to Objectionable-Related Domain Names Identification by Passive DNS Traffic (Short Paper). In Proceedings of the International Conference on Collaborative Computing: Networking, Applications and Worksharing, Shanghai, China, 1–3 December 2018; pp. 284–294.
17. Hussain, M.; Ahmed, M.; Khattak, H.A.; Imran, M.; Khan, A.; Din, S.; Ahmad, A.; Jeon, G.; Reddy, A.G. Towards ontology-based multilingual URL filtering: A big data problem. J. Supercomput. 2018, 74, 5003–5021.
18. Zamry, N.M.; Maarof, M.A.; Zainal, A. Islamic Web Content Filtering and Categorization on Deviant Teaching. In Recent Advances on Soft Computing and Data Mining, Proceedings of The First International Conference on Soft Computing and Data Mining (SCDM-2014), Johor, Malaysia, 16–18 June 2014; Springer: Cham, Switzerland, 2014; pp. 667–678.
19. Singh, A. Malicious and benign webpages dataset. Data Brief 2020, 32, 106304.
20. Vrbančič, G.; Fister, I., Jr.; Podgorelec, V. Datasets for phishing websites detection. Data Brief 2020, 33, 106438.
21. Selenium for Python. Available online: https://pypi.org/project/selenium (accessed on 1 March 2022).
22. BeautifulSoup Library. Available online: https://pypi.org/project/beautifulsoup4 (accessed on 1 March 2022).
23. GeoIP Database. Available online: https://geolocation-db.com (accessed on 1 March 2022).
24. Tld Library. Available online: https://pypi.org/project/tld (accessed on 1 March 2022).
25. Chen, C.; Zhang, J.; Chen, X.; Xiang, Y.; Zhou, W. 6 million spam tweets: A large ground truth for timely Twitter spam detection. In Proceedings of the 2015 IEEE International Conference on Communications (ICC), London, UK, 8–12 June 2015; pp. 7065–7070.
26. Khalil, A.; Jarrah, M.; Aldwairi, M.; Jaradat, M. AFND: Arabic fake news dataset for the detection and classification of articles credibility. Data Brief 2022, 42, 108141.
27. Ashouri, S.; Suominen, A.; Hajikhani, A.; Pukelis, L.; Schubert, T.; Türkeli, S.; Van Beers, C.; Cunningham, S. Indicators on firm level innovation activities from web scraped data. Data Brief 2022, 42, 108246.
Reference | Limitation | Dataset Description |
---|---|---|
[5] | | |
[6] | | |
[7] | | 101,098 URLs; 2 categories: legitimate and phishing |
[8] | | 73,575 URLs; 2 categories: legitimate and phishing |
[9] | | |
[10] | | |
[11] | | 140 websites; 5 categories: science, academics, fiction, sports, and news |
[12] | | 35,500 documents from different websites; 2 categories: objectionable and non-objectionable |
[13] | | |
[14] | | |
[15] | | |
[16] | | |
[17] | | |
[18] | | |
Attribute | Data Type | Description |
---|---|---|
url | String | A code (D#_URL#) replacing the URL of the webpage |
domain_name | String | The code (D#) of the domain that the webpage belongs to |
created_time | String | Time at which the record was created (format yyyy-MM-dd HH:mm:ss) |
geo_loc | String | Name of the country based on the webpage's IP address location using GeoIP Databases [23] |
domain_length | Numeric | Number of domain's characters |
url_length | Numeric | Number of URL characters |
time_response | Numeric | The response time of webpage request in milliseconds |
html_char_length | Numeric | Number of characters in the full HTML |
text_char_length | Numeric | Number of characters in all visible texts |
textual_tags_cnt | Numeric | Number of items in the list of all visible texts on the webpage |
visual_content_no | Numeric | Number of items in the list of all visuals on the webpage |
label | String | A categorical string of the webpage, either objectionable or unobjectionable |
label_details | String | A sub-categorical string of the webpage, including but not limited to porn, gambling, erotica, sport, news, and kids |
tld | String | Top-Level Domain (TLD) of the webpage using Tld Library [24] |
protocol | String | Name of the protocol used by the webpage URL (http, https, ftp, etc.) |
tls_ssl_certificate | Numeric | False if the webpage does not use a certificate; true if the webpage uses a certificate |
source | String | The collected source of the website |
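Several of the attributes above (`url_length`, `protocol`, `domain_length`) are simple functions of the URL string. Since the published files replace URLs with codes, the sketch below only illustrates how such values could be derived from a raw URL with the Python standard library. The function name `url_features` and the example URL are hypothetical, and counting the full host name for `domain_length` is an assumption rather than the authors' documented definition.

```python
from urllib.parse import urlsplit


def url_features(url):
    """Derive a few URL-level attributes from a raw URL string."""
    parts = urlsplit(url)
    # Assumption: the "domain" is the full host name, including subdomains
    domain = parts.hostname or ""
    return {
        "url_length": len(url),        # number of characters in the URL
        "protocol": parts.scheme,      # e.g. http, https, ftp
        "domain_length": len(domain),  # number of characters in the host name
    }


# Hypothetical example URL, not taken from the dataset
features = url_features("https://example.org/news/index.html")
print(features)
```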
Source | Objectionable Sites | Unobjectionable Sites | Total |
---|---|---|---|
Alexa | 0 | 1500 | 1500 |
DMOZ | 1500 | 1000 | 2500 |
| 500 | 500 | 1000 |
Yandex | 500 | 500 | 1000 |
Yahoo | 500 | 500 | 1000 |
Internal links | 1000 | 0 | 1000 |
Total | 4000 | 4000 | 8000 |
Source (Automatic) Classification | Human (Manual): Objectionable | Human (Manual): Unobjectionable | Subtotal |
---|---|---|---|
Objectionable | 730 | 70 | 800 |
Unobjectionable | 10 | 790 | 800 |
Subtotal | 740 | 860 | 1600 |
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Altarturi, H.H.M.; Anuar, N.B. Ground Truth Dataset: Objectionable Web Content. Data 2022, 7, 153. https://doi.org/10.3390/data7110153