PhiShield: An AI-Based Personalized Anti-Spam Solution with Third-Party Integration
Abstract
1. Introduction
- Compatibility: PhiShield is developed and deployed as a browser extension that collects and analyzes emails in the user’s browser. It is therefore compatible with the third-party spam detection services of cloud-based email service providers such as Gmail and adds an extra layer of protection to those services.
- Privacy and Personalization: PhiShield allows a user (or a group of users) to build private detection rules without compromising user privacy. By training and deploying personalized AI models locally, it eliminates the need to share sensitive email data with a third party.
- Proactive Detection: PhiShield provides proactive protection by analyzing emails in real time before users access them. It delivers detailed analytical reports that explain the phishing detection results, helping users recognize threats and improve their security practices.
Related Work
2. Materials and Methods
2.1. System Overview
System Architecture
- The email data collection component uses the APIs offered by the third-party email service provider to collect email data such as the subject, sender, body, and attachments before the user clicks on an email. Once all the information has been collected, it sends the data to the API server.
- The detection report component presents the analysis results to users before they open the email. Using the result returned by the API server, the report includes detailed explanations that help users judge whether the email is spam or ham.
- The Django REST API is the management module. It receives email data from the browser extension and performs a comprehensive analysis of the email body, URLs, sender information, and attachments. The server uses an SQLite database to manage detection rules, and the API works with the AI-based detection model to identify phishing threats in real time and to generate reports for users based on the analysis.
- The AI-based phishing detection model: The core analytical engine of PhiShield combines signature-based detection rules with an LSTM-based AI model, so that each method can cover attacks the other misses. The signature-based rules identify suspicious terms or phrases, check URLs and email addresses against known harmful ones, evaluate the trustworthiness of URLs and senders, and check whether attachments contain malicious content. They also inspect emails for hidden links and abnormal patterns. LSTM [12] is a recurrent neural network architecture that learns the sequence of words in the email body and retains recurring patterns in the email data, which helps PhiShield detect spam emails that the signature-based rules miss. This hybrid approach improves phishing detection accuracy compared with a system that relies on signature-based detection alone.
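The way the two detectors and the report fit together can be sketched as follows. This is only an illustration under stated assumptions: the field names, the `hybrid_verdict` and `report_lines` helpers, the 0.5 threshold, and the rule that an email is flagged when either detector fires are all hypothetical and are not taken from the PhiShield implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EmailPayload:
    """Fields the extension is described as collecting (names are hypothetical)."""
    subject: str
    sender: str
    body: str
    attachments: List[str] = field(default_factory=list)


@dataclass
class Verdict:
    is_phishing: bool
    lstm_score: float
    triggered_rules: List[str]


def hybrid_verdict(lstm_score: float, triggered_rules: List[str],
                   threshold: float = 0.5) -> Verdict:
    """Flag the email if any signature rule fired OR the LSTM score is high.

    The OR-combination and the threshold are assumptions for this sketch.
    """
    flagged = bool(triggered_rules) or lstm_score >= threshold
    return Verdict(flagged, lstm_score, list(triggered_rules))


def report_lines(verdict: Verdict) -> List[str]:
    """Turn a verdict into short explanation lines for the detection report."""
    lines = [f"LSTM phishing probability: {verdict.lstm_score:.2f}"]
    lines += [f"Rule triggered: {name}" for name in verdict.triggered_rules]
    lines.append("Verdict: spam" if verdict.is_phishing else "Verdict: ham")
    return lines


# Example: the LSTM score is low, but one signature rule fired.
email = EmailPayload("Account notice", "support@paypa1.com", "Verify your account")
verdict = hybrid_verdict(lstm_score=0.12, triggered_rules=["Harmful URL Check"])
print("\n".join(report_lines(verdict)))
```

With an OR-combination, an email is reported as ham only when no rule fires and the model score stays below the threshold, which matches the stated goal of letting the LSTM catch what the signatures miss and vice versa.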
2.2. Implementing the AI-Based Detection Model
2.2.1. Dataset Description
- Dataset 1: Our first dataset, Phishing Email Detection from Kaggle https://www.kaggle.com/datasets/subhajournal/phishingemails (accessed on 10 April 2025) [13], contains approximately 18,600 emails, with 61% classified as legitimate emails and 39% as spam emails. The dataset captures the text body of emails, allowing for extensive analysis and classification through machine learning techniques.
- Dataset 2: The second dataset [14] was compiled by Al-Subaiey et al. to study phishing email tactics. They collected emails from various sources to provide a comprehensive resource for phishing detection analysis. The dataset includes approximately 82,500 emails in total, of which 42,891 are classified as spam and 39,595 as legitimate. Because the emails come from multiple origins, the dataset covers a diverse range of phishing email content and sending contexts, which makes it well suited to improving the effectiveness of phishing detection models.
- Dataset 3: The third dataset was obtained from the publicly available Large-scale Phishing Email Dataset https://zenodo.org/records/8339691 (accessed on 10 April 2025) [15] on Zenodo. It includes approximately 200,000 emails sourced from internationally recognized datasets such as CEAS-2008, TREC 2005/2007, Enron, Nazario, and Nigerian Fraud. Each of these datasets contains a variety of spam and phishing emails as well as legitimate emails. The emails are categorized by source and topic, enabling model training that reflects a wide range of phishing scenarios.
- Dataset 4: The fourth dataset, spam.csv https://github.com/Apaulgithub/oibsip_taskno4/blob/main/spam.csv (accessed on 10 April 2025) [16], is provided by an open-source project and consists of approximately 5500 email messages. Each message is labeled as either ‘ham’ or ‘spam’. Due to its simple structure, this dataset is well suited for initial model testing and benchmarking purposes.
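Because the four datasets come from different sources, their label columns are unlikely to share one vocabulary. The sketch below shows one way to map them onto the binary convention used later (1 for spam/phishing, 0 for legitimate) before combining them. The label strings in `LABEL_MAP`, the helper names, and the choice to drop empty and duplicate bodies are assumptions for illustration and should be checked against the actual files.

```python
# Hypothetical label vocabularies; verify against the real dataset columns.
LABEL_MAP = {
    "spam": 1, "phishing email": 1, "phishing": 1, "1": 1,
    "ham": 0, "safe email": 0, "legitimate": 0, "0": 0,
}


def normalize_label(raw) -> int:
    """Map a source-specific label to 1 (spam/phishing) or 0 (legitimate)."""
    key = str(raw).strip().lower()
    if key not in LABEL_MAP:
        raise ValueError(f"unknown label: {raw!r}")
    return LABEL_MAP[key]


def merge_datasets(*datasets):
    """Concatenate (text, label) rows, skipping empty and duplicate bodies."""
    seen = set()
    merged = []
    for rows in datasets:
        for text, label in rows:
            text = (text or "").strip()
            if not text or text in seen:
                continue
            seen.add(text)
            merged.append((text, normalize_label(label)))
    return merged


# Toy rows standing in for two differently labeled sources.
source_a = [("Hi, lunch tomorrow?", "ham"), ("Win a prize now", "spam")]
source_b = [("Win a prize now", 1), ("", "0"), ("Verify your account", "Phishing Email")]
print(merge_datasets(source_a, source_b))
```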
2.2.2. Data Preprocessing and Tokenization
2.2.3. LSTM Model Architecture
- Forget Gate: Determines which information to discard:
- Input Gate: Decides which new information to store:
- Output Gate: Determines the information passed to the next hidden state:
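For reference, the three gates listed above are commonly written as follows in the standard LSTM formulation. The notation is an assumption and may differ from the notation used elsewhere in this paper: $x_t$ is the input at time step $t$, $h_{t-1}$ the previous hidden state, $c_t$ the cell state, $\sigma$ the sigmoid function, $\odot$ the element-wise product, and $W_\ast$, $b_\ast$ the learned weights and biases.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\left(W_c\,[h_{t-1}, x_t] + b_c\right) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align}
```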
- Embedding layer: The first layer in the model is the embedding layer, which maps each word index in the input sequence to a dense vector representation. The ‘input_dim’ parameter is set to the total vocabulary size (‘vocab_size’), and the ‘output_dim’ parameter sets the embedding dimension to 128, enabling the model to capture semantic similarities between words. The ‘input_length’ parameter, which denotes the maximum sequence length of the sentences, ensures that the model handles input of a fixed sequence length. This layer outputs data in the form ‘(batch size, sequence length, vector dimension)’.
- Bidirectional LSTM layer: The bidirectional LSTM layer reads the text data in both forward and backward directions. This layer uses 64 units per direction and is configured with ‘return_sequences=True’ so that the output of every time step is passed to the following layer. As a result, the model can learn the contextual relationships of the data more comprehensively. The output of this layer has the shape ‘(batch size, sequence length, 2×units)’.
- Dropout layer: The dropout layer is used to prevent overfitting. A dropout rate of 0.5 is applied, so each unit is randomly disabled with probability 0.5 during training, which improves the model’s generalization capabilities. This layer preserves the output dimension of the ‘Bidirectional LSTM’ layer.
- Single-directional LSTM layer: A single-directional LSTM layer with 32 units processes the sequence. It is configured with ‘return_sequences=False’, so only the output of the last time step is passed to the next layer. This final output summarizes the input sequence into features that contribute to phishing email predictions.
- Dense layer: The final layer is the dense layer using a sigmoid activation function to output binary classification results, where phishing emails are predicted as 1 and legitimate emails as 0. This layer contains a single output node, and its output can be interpreted as a probability between 0 and 1.
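The output shapes and trainable parameter counts implied by the layer stack above can be worked out by hand. The sketch below does the arithmetic in plain Python. The vocabulary size of 10,000 and the sequence length of 100 are illustrative assumptions, and the LSTM parameter formula assumes the standard parameterization with one bias vector per gate; the helper names are hypothetical.

```python
def lstm_params(input_dim: int, units: int) -> int:
    """Four weight blocks (three gates plus the candidate state), each with
    input weights, recurrent weights, and one bias vector."""
    return 4 * (units * (input_dim + units) + units)


def describe_model(vocab_size: int, seq_len: int, embed_dim: int = 128,
                   bi_units: int = 64, uni_units: int = 32):
    """Return (layer name, output shape, parameter count) for each layer."""
    return [
        ("Embedding", ("batch", seq_len, embed_dim), vocab_size * embed_dim),
        ("Bidirectional LSTM", ("batch", seq_len, 2 * bi_units),
         2 * lstm_params(embed_dim, bi_units)),       # forward + backward copies
        ("Dropout", ("batch", seq_len, 2 * bi_units), 0),  # no trainable weights
        ("LSTM", ("batch", uni_units), lstm_params(2 * bi_units, uni_units)),
        ("Dense (sigmoid)", ("batch", 1), uni_units + 1),  # weights + bias
    ]


# vocab_size and seq_len are assumed values for illustration only.
layers = describe_model(vocab_size=10_000, seq_len=100)
for name, shape, params in layers:
    print(f"{name:<20} {str(shape):<22} {params:>10,}")
print("total parameters:", f"{sum(p for _, _, p in layers):,}")
```

Under these assumptions the embedding table dominates the model size (1,280,000 of 1,399,457 parameters), so the vocabulary size has far more influence on the footprint than the recurrent layers do.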
2.3. Signature-Based Spam Detection Model
3. Results
3.1. Model Performance
3.2. Phishing Detection Report
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
LSTM | Long short-term memory |
AI | Artificial intelligence |
S/MIME | Secure/Multipurpose Internet Mail Extensions |
TLS | Transport Layer Security |
LLM | Large language model |
REST | Representational state transfer |
API | Application programming interface |
AWS | Amazon Web Services |
EC2 | Elastic Compute Cloud |
RNN | Recurrent neural network |
FCNN | Fully connected neural network |
BERT | Bidirectional Encoder Representations from Transformers |
XAI | Explainable artificial intelligence |
SHAP | SHapley additive exPlanations |
LIME | Local interpretable model-agnostic explanations |
References
1. Griffiths, C. The Latest 2024 Phishing Statistics (Updated June 2024). 2024. Available online: https://aag-it.com/the-latest-phishing-statistics/ (accessed on 20 September 2024).
2. Gajjar, V.R.; Taherdoost, H. Cybercrime on a Global Scale: Trends, Policies, and Cybersecurity Strategies. In Proceedings of the 2024 5th International Conference on Mobile Computing and Sustainable Informatics (ICMCSI), Lalitpur, Nepal, 18–19 January 2024; IEEE: New York, NY, USA, 2024; pp. 668–676.
3. Agboola, O.T. Development of a Novel Approach to Phishing Detection Using Machine Learning. J. Sci. Technol. Educ. 2024, 12, 337–338.
4. Aljabri, M.; Altamimi, H.S.; Albelali, S.A.; Al-Harbi, M.; Alhuraib, H.T.; Alotaibi, N.K.; Salah, K. Detecting malicious URLs using machine learning techniques: Review and research directions. IEEE Access 2022, 10, 121395–121417.
5. Do, N.Q.; Selamat, A.; Krejcar, O.; Herrera-Viedma, E.; Fujita, H. Deep learning for phishing detection: Taxonomy, current challenges and future directions. IEEE Access 2022, 10, 36429–36448.
6. Almalkawi, I.T.; Al-Hammouri, M.F.; Mallouh, M.A.; Barakat, T. Improving Email Security Through Machine Learning-Based Phishing Attack Detection. In Proceedings of the 2024 International Jordanian Cybersecurity Conference (IJCC), Amman, Jordan, 17–18 December 2024; pp. 180–185.
7. Sarkar, S.; Yadav, A.; Balachander, T. Email Phishing Detection Using AI and ML. In Proceedings of the Deep Sciences for Computing and Communications, Chennai, India, 20–22 April 2023; R., A.U., Kottursamy, K., Raja, G., Bashir, A.K., Kose, U., Appavoo, R., Madhivanan, V., Eds.; Springer: Cham, Switzerland, 2024; pp. 357–377.
8. Li, Q.; Cheng, M.; Wang, J.; Sun, B. LSTM based phishing detection for big email data. IEEE Trans. Big Data 2022, 8, 278–288.
9. Alshingiti, Z.; Alaqel, R.; Al-Muhtadi, J.; Haq, Q.E.U.; Saleem, K.; Faheem, M.H. A deep learning-based phishing detection system using CNN, LSTM, and LSTM-CNN. Electronics 2023, 12, 232.
10. SpamFilter-ChromeExtension. Available online: https://github.com/surya-veer/SpamFilter-ChromeExtension (accessed on 20 September 2024).
11. Naive Bayes and LSTM Based Classifier Models. Available online: https://towardsdatascience.com/naive-bayes-and-lstm-based-classifier-models-63d521a48c20 (accessed on 20 September 2024).
12. Koide, T.; Fukushi, N.; Nakano, H.; Chiba, D. Chatspamdetector: Leveraging large language models for effective phishing email detection. arXiv 2024, arXiv:2402.18093.
13. Phishing Email Detection. Available online: https://www.kaggle.com/datasets/subhajournal/phishingemails (accessed on 20 September 2024).
14. Al-Subaiey, A.; Al-Thani, M.; Alam, N.A.; Antora, K.F.; Khandakar, A.; Zaman, S.A.U. Novel Interpretable and Robust Web-based AI Platform for Phishing Email Detection. arXiv 2024, arXiv:2405.11619.
15. Amrutkar, P. Large-Scale Phishing Email Dataset. 2023. Available online: https://zenodo.org/record/8339691 (accessed on 20 September 2024).
16. Apaulgithub. Spam Email Dataset (spam.csv). 2022. Available online: https://github.com/Apaulgithub/oibsip_taskno4/blob/main/spam.csv (accessed on 20 September 2024).
17. Said Elsayed, M.; Le-Khac, N.A.; Dev, S.; Jurcut, A.D. Network anomaly detection using LSTM-based autoencoder. In Proceedings of the 16th ACM Symposium on QoS and Security for Wireless and Mobile Networks, Alicante, Spain, 16–20 November 2020; pp. 37–45.
18. Khonji, M.; Iraqi, Y.; Jones, A. Phishing Detection: A Literature Survey. IEEE Commun. Surv. Tutorials 2013, 15, 2091–2121.
19. Sheng, S.; Wardman, B.; Warner, G.; Cranor, L.F.; Hong, J.; Zhang, C. An Empirical Analysis of Phishing Blacklists. In Proceedings of the 6th Conference on Email and Anti-Spam (CEAS 2009), Mountain View, CA, USA, 16–17 July 2009; pp. 60–69.
20. Sharifi, M.; Siadati, S.H. A phishing sites blacklist generator. In Proceedings of the 2008 IEEE/ACS International Conference on Computer Systems and Applications (AICCSA ’08), Washington, DC, USA, 31 March–4 April 2008; IEEE Computer Society: Washington, DC, USA, 2008; pp. 840–843.
21. Pascariu, C.; Bacivarov, I.C. Detecting Phishing Websites Through Domain and Content Analysis. In Proceedings of the 2021 13th International Conference on Electronics, Computers and Artificial Intelligence (ECAI), Pitesti, Romania, 1–3 July 2021; pp. 1–4.
22. Islavath, S.; Bhat, C.R. Uniform Resource Locator Phishing in Real Time Scenario Predicted Using Novel Term Frequency-Inverse Document Frequency +N Gram in Comparison with Support Vector Machine Algorithm. In Proceedings of the 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), Kamand, India, 24–28 June 2024; pp. 1–5.
23. Sokolov, M.; Olufowobi, K.; Herndon, N. Visual spoofing in content-based spam detection. In Proceedings of the 13th International Conference on Security of Information and Networks, Istanbul, Turkey, 4–6 November 2020; pp. 1–5.
24. Cook, D.L.; Gurbani, V.K.; Daniluk, M. Phishwish: A stateless phishing filter using minimal rules. In Proceedings of the Financial Cryptography and Data Security, Cozumel, Mexico, 28–31 January 2008; Tsudik, G., Ed.; Springer: Berlin/Heidelberg, Germany, 2008; pp. 182–186.
25. Zhang, Y.; Hong, J.I.; Cranor, L.F. Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the 16th International Conference on World Wide Web (WWW ’07), New York, NY, USA, 8–12 May 2007; pp. 639–648.
26. Tida, V.S.; Hsu, S.H.Y. Universal Spam Detection using Transfer Learning of BERT Model. arXiv 2022, arXiv:2202.03480.
27. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805.
28. Subba, B. A heterogeneous stacking ensemble-based security framework for detecting phishing attacks. In Proceedings of the 2023 National Conference on Communications (NCC), Guwahati, India, 23–26 February 2023; pp. 1–6.
29. Chien, A.; Khethavath, P. Email Feature Classification and Analysis of Phishing Email Detection Using Machine Learning Techniques. In Proceedings of the 2023 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE), Nadi, Fiji, 4–6 December 2023; pp. 1–8.
30. Chinnasamy, P.; Krishnamoorthy, P.; Alankruthi, K.; Mohanraj, T.; Kumar, B.S.; Chandran, L. AI Enhanced Phishing Detection System. In Proceedings of the 2024 Third International Conference on Intelligent Techniques in Control, Optimization and Signal Processing (INCOS), Krishnankoil, India, 14–16 March 2024; pp. 1–5.
System | Compatibility | Privacy and Personalization | Proactive Detection | AI Model |
---|---|---|---|---|
LSTM based phishing detection [8,9] | × | × | × | LSTM |
SpamFilter-ChromeExtension [10] | ◯ | △ | × | Naive Bayes |
ChatSpamDetector [12] | ◯ | × | × | LLM |
PhiShield (Ours) | ◯ | ◯ | ◯ | LSTM |
No. | Name | Description | Implementation Method |
---|---|---|---|
1 | Harmful URL Check [19,20] | Check whether the email’s URL is on a blacklist | Compare with URL blacklist |
2 | Harmful Email Address Check [20] | Check whether the email address is on a blacklist | Compare with email blacklist |
3 | Suspicious Pattern Detection | Detect suspicious phishing words or phrases | Keyword detection using regular expressions |
4 | Attachment File Extension Check | Check the file extension for potential malicious content | Compare with a list of suspicious file extensions |
5 | URL Similarity Check [21] | Compare the URL with legitimate websites for similarity | Use SequenceMatcher to compare similarity |
6 | Email Address Similarity Check [21] | Compare the email address with legitimate addresses for similarity | Use SequenceMatcher to compare similarity |
7 | N-gram Based URL Similarity Check [22] | Compare the URL with legitimate websites using N-gram technique | Use TfidfVectorizer and cosine similarity for N-gram comparison |
8 | N-gram Based Email Address Similarity Check [22] | Compare the email address with legitimate addresses using N-gram technique | Use TfidfVectorizer and cosine similarity for N-gram comparison |
9 | Homoglyph Detection [23] | Check for the use of homoglyphs | Detection through homoglyph mapping |
10 | Behavior Pattern Analysis | Analyze suspicious behavior patterns such as clickbait or urgent requests | Analysis using regular expressions and HTML parsing |
11 | Hidden Text and Link Detection [24,25] | Analyze for the presence of hidden text or links | HTML/CSS analysis using BeautifulSoup |
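A few of the rules in the table above (rows 3, 4, 5, and 9) can be sketched with the Python standard library alone. This is a minimal illustration, not the PhiShield rule set: the trusted-domain list, the homoglyph map, the keyword pattern, the extension list, and the 0.8 similarity threshold are all hypothetical values chosen for the example.

```python
import re
from difflib import SequenceMatcher

# Hypothetical reference data for illustration only.
TRUSTED_DOMAINS = ["paypal.com", "google.com", "microsoft.com"]
HOMOGLYPHS = {"0": "o", "1": "l",
              "\u0430": "a", "\u0435": "e", "\u043e": "o"}  # Cyrillic a, e, o
SUSPICIOUS_EXTENSIONS = (".exe", ".scr", ".js", ".vbs")
URGENCY = re.compile(r"\b(urgent|verify your account|act now|password expires)\b",
                     re.IGNORECASE)


def normalize_homoglyphs(text: str) -> str:
    """Row 9: replace look-alike characters with their plain ASCII form."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text.lower())


def lookalike_of(domain: str, threshold: float = 0.8):
    """Rows 5 and 9: return the trusted domain that `domain` imitates, or None."""
    d = domain.lower()
    if d in TRUSTED_DOMAINS:
        return None  # exact match with a trusted domain is not an imitation
    for trusted in TRUSTED_DOMAINS:
        if normalize_homoglyphs(d) == trusted:
            return trusted
        if SequenceMatcher(None, d, trusted).ratio() >= threshold:
            return trusted
    return None


def suspicious_phrases(body: str):
    """Row 3: list the distinct urgency phrases found in the email body."""
    return sorted({m.group(0).lower() for m in URGENCY.finditer(body)})


def has_suspicious_extension(filename: str) -> bool:
    """Row 4: check the attachment name against the extension list."""
    return filename.lower().endswith(SUSPICIOUS_EXTENSIONS)


print(lookalike_of("paypa1.com"))
print(lookalike_of("gooogle.com"))
print(suspicious_phrases("URGENT: please verify your account today"))
print(has_suspicious_extension("Invoice.PDF.exe"))
```

A mapping as blunt as `1` to `l` will also rewrite legitimate digits, so in practice the homoglyph check is better treated as a signal to combine with the other rules than as a verdict on its own.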
Metric | LSTM100 | LSTM500 | LSTM1000 | BERT [26,27] | FCNN [28] | Naive Bayes [29,30] |
---|---|---|---|---|---|---|
True Positives (TP) | 27,238 | 27,234 | 27,488 | 27,394 | 23,054 | 21,905 |
False Positives (FP) | 446 | 314 | 574 | 205 | 2825 | 10,967 |
False Negatives (FN) | 502 | 506 | 252 | 346 | 4686 | 5835 |
True Negatives (TN) | 28,759 | 28,891 | 28,631 | 29,000 | 26,380 | 18,238 |
Precision | 0.9839 | 0.9886 | 0.9795 | 0.9926 | 0.8908 | 0.6664 |
Recall | 0.9819 | 0.9818 | 0.9909 | 0.9875 | 0.8311 | 0.7897 |
F1 Score | 0.9829 | 0.9852 | 0.9852 | 0.9900 | 0.8599 | 0.7228 |
Test Accuracy | 0.9834 | 0.9856 | 0.9855 | 0.9903 | 0.8681 | 0.7049 |
Training Time (s) | 743.24 | 2005.22 | 2936.55 | 4639.72 | 227.11 | 1.11 |
Inference Time (s) | 24.30 | 61.20 | 108.07 | 113.99 | 3.96 | 0.05 |
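The precision, recall, F1 score, and accuracy rows of the table above follow directly from the four confusion-matrix counts. The short sketch below recomputes them for the LSTM100 column as a consistency check; the function name is hypothetical, and the formulas are the standard definitions.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Counts taken from the LSTM100 column of the table above.
lstm100 = classification_metrics(tp=27238, fp=446, fn=502, tn=28759)
for name, value in lstm100.items():
    print(f"{name:<10} {value:.4f}")
# precision  0.9839
# recall     0.9819
# f1         0.9829
# accuracy   0.9834
```

The same function applied to the other columns gives a quick way to confirm that each reported metric is consistent with its counts.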
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Mun, H.; Park, J.; Kim, Y.; Kim, B.; Kim, J. PhiShield: An AI-Based Personalized Anti-Spam Solution with Third-Party Integration. Electronics 2025, 14, 1581. https://doi.org/10.3390/electronics14081581