Pattern-Based and Visual Analytics for Visitor Analysis on Websites
Abstract
1. Introduction
- The design of an interactive visualization that allows users to have a comprehensive snapshot of visitors on a website, but also enables a fine-grained analysis by means of navigation graphs.
- The application of pattern mining techniques to extract patterns that characterize traffic segments of interest. The obtained patterns can aid in selecting groups of users whose behavior is worth observing. For example, we have discovered patterns that capture groups of interest, including (some types of) human and bot traffic. Separating these groups matters for two reasons: marketing can focus on clean, well-defined traffic segments, and IT can block unwanted traffic, for example with firewall rules.
2. Related Work
3. Dataset
4. Visual Model
4.1. Visits View
4.2. Navigation Path (F4)
5. Traffic Segmentation and Characterization
5.1. Filtering Visits View
- className represents a custom class assigned to a particular kind of data. In our case we have three classes: visit, page, and objective. Instead of the generic node or edge, we use these classes, which are less abstract. For example, .page targets only web page attributes; similarly, .visit queries the visit information.
- attribute represents the attribute used for filtering. The available attributes depend on the context represented by the group. For the visit group, the attributes include duration, country, browser, and events, among others. For the page group, they include url, visits, pageviews, bounceRate, exitRate, and avgTimeSpent, among others.
- operator represents any binary operator. Which operators apply depends on the data type of the attribute. The available operators are: =, !=, >, <, >=, <=, *=, ^=, $=.
- value is the value that the selected operator matches against the attribute. Depending on the data type of the attribute, value is a string, a number, or a boolean.
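The filter components described above can be illustrated with a minimal Python sketch. It assumes a filter has the shape `.className[attribute operator value]`, as in the example queries of this article, and that `*=`, `^=`, and `$=` mean "contains", "starts with", and "ends with", as they do in Cytoscape.js selectors. The record layout (plain dictionaries with a `class` key) and the helper names `parse_filter` and `apply_filter` are hypothetical and only serve the illustration; this is not the authors' implementation.

```python
import operator
import re

# Operators listed in the text, mapped to Python predicates.
# Assumption: *=, ^= and $= mean contains / starts with / ends with.
OPERATORS = {
    "=": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
    "*=": lambda a, b: b in a,
    "^=": lambda a, b: a.startswith(b),
    "$=": lambda a, b: a.endswith(b),
}

# Two-character operators come first so that ">=" is not read as ">".
FILTER_RE = re.compile(
    r"""^\.(?P<cls>\w+)\[\s*(?P<attr>\w+)\s*
        (?P<op>!=|>=|<=|\*=|\^=|\$=|=|>|<)\s*
        (?P<value>.+?)\s*\]$""",
    re.VERBOSE,
)


def parse_value(text):
    """Turn the right-hand literal into a str, a number or a bool."""
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        return text[1:-1]
    if text in ("true", "false"):
        return text == "true"
    return float(text) if "." in text else int(text)


def parse_filter(query):
    """Split one filter into (class, attribute, predicate, value)."""
    match = FILTER_RE.match(query.strip())
    if match is None:
        raise ValueError(f"unsupported filter: {query!r}")
    return (match["cls"], match["attr"],
            OPERATORS[match["op"]], parse_value(match["value"]))


def apply_filter(query, records):
    """Return the records of the queried class that satisfy the filter."""
    cls, attr, predicate, value = parse_filter(query)
    return [r for r in records
            if r.get("class") == cls and attr in r
            and predicate(r[attr], value)]


# Hypothetical records using attribute names mentioned in the text.
records = [
    {"class": "visit", "country": "us", "duration": 42},
    {"class": "visit", "country": "mx", "duration": 5},
    {"class": "page", "url": "/home", "visits": 10},
]
long_visits = apply_filter(".visit[duration > 10]", records)
```

The sketch handles a single bracketed condition and does not check that the operator suits the attribute's data type; comparing a string with `>` against a number, for instance, raises a `TypeError`.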
5.2. Pattern Mining Algorithm
5.3. Case Study Scenario: Humans vs. Bots
5.4. Case Study Scenario: Segmentation by Country
6. Further Work
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Field | Description |
---|---|
remotehost | IP address |
rfc931 | The remote logname of the user. |
authuser | The username that has been used for authentication. |
date | Date and time of the request. |
request | Resource requested and HTTP version. |
status | The HTTP status code returned to the client. |
bytes | The content-length of the document transferred. |
referrer | The URL which linked the user to the site. |
useragent | The Web browser and platform used by the visitor. |
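As an illustration of how the fields in the table above map onto a raw log entry, the following Python sketch parses one line. It assumes the log follows the Apache combined log format, which matches the listed fields; the regular expression and the function name `parse_line` are hypothetical, and the sample line is the well-known example from the Apache documentation rather than data from this study.

```python
import re

# One named group per field of the table above.
# Assumption: Apache combined log format, without escaped quotes
# inside the request, referrer or user-agent strings.
LOG_RE = re.compile(
    r'(?P<remotehost>\S+) (?P<rfc931>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None


sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" '
    '"Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
entry = parse_line(sample)
```

All values are returned as strings; converting `status` and `bytes` to integers (and treating `-` as missing) is left to the caller.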
Feature | Tag | Log Field |
---|---|---|
Hour | hour | date |
Day of the Week | dayOfWeek | date |
City | city | IP address |
Country | country | IP address |
Subdivision | subdivision | IP address |
Organization | organization | IP address |
URL | url | request |
Number of parameters | parameters | request |
Status | status | status |
Bytes | bytes | bytes |
Referrer | referrer | referrer |
Operating System | agentOS | useragent |
Browser | agentBrowser | useragent |
Device | agentDevice | useragent |
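Some of the features in the table above can be derived from the log fields with the standard library alone. The sketch below computes `hour`, `dayOfWeek`, `url`, and `parameters` from the `date` and `request` fields. It assumes the Apache timestamp layout, that `url` holds the path without the query string, and that `parameters` counts query-string pairs; these are guesses at the intended definitions, and `derive_features` is a hypothetical helper. Features derived from the IP address or the user agent need external data (a geolocation database, a user-agent parser) and are left out.

```python
from datetime import datetime
from urllib.parse import parse_qsl, urlsplit

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def derive_features(date, request):
    """Derive time and URL features from two raw log fields."""
    # Assumed timestamp layout: 10/Oct/2000:13:55:36 -0700
    # (%b expects English month abbreviations in the default locale).
    when = datetime.strptime(date, "%d/%b/%Y:%H:%M:%S %z")
    # Assumed request layout: "<method> <target> <HTTP version>".
    parts = request.split()
    target = urlsplit(parts[1] if len(parts) >= 2 else "")
    return {
        "hour": when.hour,                 # hour as logged, not UTC
        "dayOfWeek": DAYS[when.weekday()],
        "url": target.path,
        "parameters": len(parse_qsl(target.query, keep_blank_values=True)),
    }


features = derive_features(
    "10/Oct/2000:13:55:36 -0700",
    "GET /index.php?id=3&lang=en HTTP/1.1",
)
```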
Feature | Description |
---|---|
actions | Number of actions in the visit. |
daysSinceFirstVisit | Days elapsed since the user first visited the site. |
daysSinceLastVisit | Days elapsed since the user last visited the site. |
firstActionTimestamp | Timestamp of the first action in a user’s visit. |
lastActionTimestamp | Timestamp of the last action in a user’s visit. |
pageviews | Number of pages the visitor viewed. |
referrerType | Where the visit comes from, for example direct, search, or website. |
timeSpent | Average time the user spent on a page, in seconds. |
visitDuration | Visit duration, in seconds. |
visitorId | ID for identifying unique visitors. |
visitorType | Can take the values new and returning. |

Use Case | Segment Description | Query |
---|---|---|
Simple query with the .visit className | Visitors from the United States | .visit[countryCode = 'us'] |
Using the AND operator | Visitors using Google Chrome AND with a visit duration greater than 10 s AND coming from Mexico | .visit[browserCode = 'CH'][visitDuration > 10][countryCode = 'mx'] |
Using the OR operator | Visitors coming from Facebook OR Instagram | .visit[referrerName = 'Facebook'], .visit[referrerName = 'Instagram'] |
Using both AND and OR operators | Visitors from Mexico OR visitors from the United States that have visited the site more than once | .visit[visitCount > 1].visit[countryCode = 'us'], .visit[countryCode = 'mx'] |
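The way conditions combine in the table above can be sketched in a few lines of Python: chained brackets act as AND, and comma-separated queries act as OR. The example mirrors the last row of the table. The visit records, the `id` key, and the helper `select` are hypothetical and stand in for whatever the actual tool uses internally.

```python
# Hypothetical visit records; only the attributes used in the query appear.
visits = [
    {"id": 1, "countryCode": "us", "visitCount": 3},
    {"id": 2, "countryCode": "us", "visitCount": 1},
    {"id": 3, "countryCode": "mx", "visitCount": 1},
    {"id": 4, "countryCode": "de", "visitCount": 5},
]


def select(records, *alternatives):
    """Keep a record if ANY alternative holds (OR, the comma);
    an alternative holds if ALL of its predicates hold (AND, chained brackets)."""
    return [r for r in records
            if any(all(p(r) for p in conj) for conj in alternatives)]


# .visit[visitCount > 1].visit[countryCode = 'us']
returning_us = [lambda v: v["visitCount"] > 1,
                lambda v: v["countryCode"] == "us"]
# .visit[countryCode = 'mx']
from_mx = [lambda v: v["countryCode"] == "mx"]

segment = select(visits, returning_us, from_mx)
ids = [v["id"] for v in segment]
```

Here visit 1 satisfies both conditions of the first alternative and visit 3 satisfies the second, while visit 2 (a first-time visitor from the United States) and visit 4 are excluded.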
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Cervantes, B.; Gómez, F.; Monroy, R.; Loyola-González, O.; Medina-Pérez, M.A.; Ramírez-Márquez, J. Pattern-Based and Visual Analytics for Visitor Analysis on Websites. Appl. Sci. 2019, 9, 3840. https://doi.org/10.3390/app9183840