CIPHER: Cybersecurity Intelligent Penetration-Testing Helper for Ethical Researcher
Abstract
1. Introduction
- The methodology for CIPHER’s development, utilizing penetration testing write-ups structured as beginner- and expert-level conversations to train a large language model specifically for practical penetration testing assistance.
- A novel augmentation technique called Findings, Action, Reasoning, and Results (FARR) Flow that condenses these write-ups into structured formats for more efficient evaluation.
- The development and evaluation of an automated pentesting simulation benchmark based on FARR Flow, providing a rigorous and realistic framework for evaluating LLM performance in practical penetration testing guidance.
- A comprehensive evaluation demonstrating CIPHER’s effectiveness and superior performance over existing LLMs in providing accurate guidance in FARR Flow and MITRE ATT&CK capabilities from PurpleLlama.
2. Background and Related Works
2.1. Large Language Models in Cybersecurity
2.1.1. Large Language Models
2.1.2. Applications and Challenges in Cybersecurity
2.2. Existing Tools and Methodologies for Penetration Testing
2.2.1. Manual and Automated Tools
2.2.2. Advancements in AI for Penetration Testing
- Testing Strategy Limitations: LLMs often adopt a depth-first search approach, which may overlook other potential attack surfaces [9].
- Limited Social Engineering and Image Interpretation: LLMs lack capabilities in physical social engineering techniques and interpreting visual hints [10].
- Ethical Concerns: The potential misuse of LLMs for automating attacks raises ethical issues [40].
2.2.3. Leveraging LLMs for Penetration Testing
2.3. Specialized Cybersecurity Large Language Models
2.3.1. Development of Domain-Specific LLMs
2.3.2. Evaluation and Benchmarking Challenges
2.3.3. Reproducible Benchmark for Penetration Testing Scenario
3. Methodology
3.1. Architecture Design
- Explanation of the question.
- Intuitive, step-by-step instructions with reasoning.
- Insightful response that resembles expert penetration tester reasoning.
- Additional practical examples for the action.
3.2. Dataset
3.2.1. General Assistant Capability
- Creative prompts from Airoboros 2.2 (44.8 K);
- Domain-specific expertise from CamelAI in MATH, Physics, Chemistry, and Biology (50 K, 20 K, 20 K, and 20 K, respectively);
- Real-world complex prompts from Chatbot Arena (7 K);
- Up-to-date data from Collective Cognition (as of 9 November 2023) (156);
- Chain-of-thought examples from the Alpaca dataset, produced by GPT-4 (52 K);
- Evolving complexity prompts from Evol Instruct (70 K and 140 K versions);
- Tool usage scenarios from Glaive Code Assistant (136 K);
- General-purpose conversational data from GPT4-LLM (54.6 K) and GPTeacher (35.5 K);
- Specialized task-oriented datasets like Medical Tasks (4.6 K) and MetaMath (40 K);
- Reasoning-focused entries from SlimOrca (550 K) and Platypus (24.9 K);
- Real conversation enhancements from ShareGPT (356 K);
- Versatility-boosting prompts from Unnatural Instructions (9 K).
3.2.2. Penetration Testing Capability
Category | Content |
---|---|
Fundamental Knowledge | Cheatsheets [57], TLDR [58], Identity, Kali tools [59] |
Pentesting Knowledge | OSCP Notes [60], OSCP Playbook [61] |
Privilege Escalation | GTFOBins [62], LOLBAS [63] |
Hacking Techniques | Hacktricks [64], PayloadAllTheThings [65] |
Practical Knowledge | 0xdf Hack The Box Write-ups [66] |
3.2.3. Augmentation
3.2.4. Dataset Structure and Pre-Processing
3.3. Training
3.3.1. Supervised Fine-Tuning
3.3.2. Direct Preference Optimization
3.4. FARR Augmentation
- No automated benchmarks: There are no established automatic penetration testing benchmarks for LLMs.
- Reliance on manual testing: The best available method involves human testers performing manual penetration testing while using an LLM chatbot for guidance.
- Time-consuming: This manual approach requires approximately 1–4 h per task per evaluator.
- Inconsistent results: Outcomes can vary significantly based on the evaluator’s expertise, condition, and interpretation of the LLM’s guidance.
- Benchmark Creation: We have designed a benchmark to measure the accuracy of an LLM’s first suggestion in a penetration testing scenario.
- Data Augmentation: Unlike our previous synthetic conversation data generation method for supervised fine-tuning (SFT), this benchmark augments write-ups into compact, dense lists of information.
- FARR Penetration Testing Flow: We introduce a new method to augment penetration testing write-ups into a Findings, Action, Reasoning, Result (FARR) Flow. This structure reflects the typical phases of a penetration test, capturing the following: Findings: information discovered; Action: steps taken based on the findings; Reasoning: explanation for the chosen action; and Result: outcome of the action.
- Rationale: We observed that penetration testing write-ups consistently follow this ordered structure, providing a comprehensive view of the testing process and vulnerable points in a system.
- Findings experience: Assessing the model’s understanding of the required findings for a specific action.
- Reasoning: Evaluating the model’s ability to explain why an action is taken given certain findings.
- Result prediction: Measuring the model’s capability to predict the outcome of an action.
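As a concrete illustration, a single FARR Flow step can be modeled as a small record whose fields mirror the Findings, Action, Reasoning, and Result structure described above. The example values below are hypothetical, not taken from an actual write-up:

```python
from dataclasses import dataclass

@dataclass
class FARRStep:
    """One step of a FARR Flow: what was found, what was done,
    why it was done, and what happened."""
    findings: str   # information discovered at this point
    action: str     # step taken based on the findings
    reasoning: str  # explanation for the chosen action
    result: str     # outcome of the action

# Hypothetical step distilled from a write-up:
step = FARRStep(
    findings="Port 80 open running Apache 2.4.41",
    action="Enumerate web directories with gobuster",
    reasoning="Hidden endpoints often expose admin panels or backups",
    result="Discovered /backup containing credentials.txt",
)
```

A full FARR Flow for one machine is then simply an ordered list of such steps, which is what the benchmark iterates over.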
4. Experiment Results
- General Capabilities: We use the LLM Eval Harness benchmark to measure CIPHER’s general task capabilities, examining how integrating general and penetration testing data affects performance.
- Cybersecurity Expertise: We evaluate the model’s performance on cybersecurity-specific questions to gauge its domain knowledge.
- Pentesting Guidance Accuracy: Using our novel FARR Flow reasoning evaluation method, we assess CIPHER’s ability to guide novice penetration testers through the testing process automatically.
- MITRE ATT&CK Understanding: We employ the PurpleLlama CyberSecEval [44] framework to measure CIPHER’s comprehension of the MITRE ATT&CK knowledge base.
- Real-world Beginner Use Case Experiment: We analyze and evaluate CIPHER’s current capabilities in real-world usage by assisting a beginner through a penetration test.
4.1. Eval Harness General Capabilities
4.2. Cybersecurity Knowledge Evaluation
- Pentesting QA Evaluation: CIPHER is designed to bridge the gap between novice and expert, so the ability to explain penetration testing knowledge is crucial. We use the pentesting evaluation topics from the preemware/pentesting-eval Hugging Face repository [86], a dataset of question–explanation pairs generated by GPT-4. The questions serve as prompts and the explanations as ground truth; each model response is then compared against the ground truth by the Qwen1.5 72B Chat model, which produces a score indicating alignment.
- Comparative Analysis: We compare the results from the LM evaluation harness to identify any performance degradation in general tasks after training various CIPHER models with pentesting datasets.
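The LLM-as-judge scoring used in the Pentesting QA Evaluation can be sketched as below. The exact rubric wording and the 0–100 scale are illustrative assumptions, and `judge` stands for any callable wrapping the judge model (Qwen1.5 72B Chat in the paper):

```python
def score_response(question, ground_truth, model_answer, judge):
    """Ask a judge model how well an answer aligns with the ground-truth
    explanation, returning an integer score clamped to 0-100.
    `judge` is any callable mapping a prompt string to a reply string;
    the rubric text here is an assumption, not the paper's exact prompt."""
    rubric = (
        "You are grading a penetration-testing answer.\n"
        f"Question: {question}\n"
        f"Ground-truth explanation: {ground_truth}\n"
        f"Candidate answer: {model_answer}\n"
        "Reply with a single integer between 0 and 100 indicating how "
        "well the candidate aligns with the ground truth."
    )
    reply = judge(rubric)
    digits = "".join(ch for ch in reply if ch.isdigit())
    return max(0, min(100, int(digits))) if digits else 0
```

Parsing only the digits out of the judge's reply makes the scorer robust to chatty judges that answer with, e.g., "Score: 87".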
4.3. Penetration Testing Guidance Accuracy Evaluation (FARR Flow Reasoning)
Algorithm 1 Formulating FARR Step Question
Input: JSON file containing a list of findings, actions, reasonings, and results (Machine_X_FARR_flow.json)
Output: Suggested next action from inference model
Load Machine_X_FARR_flow.json
Parse JSON content into FARRflow list of dictionaries
Initialize all_findings as an empty string
for each flow in FARRflow do
  current_findings ← flow[“Findings”]
  current_result ← flow[“Result”]
  question_prompt ← “Below are my current findings: {all_findings} {current_findings} What is the most likely action to do next? Answer with one specific action only, not more than that.”
  model_output ← inference_model(question_prompt)
  all_findings ← all_findings + “{current_findings}, {current_result}”
end for
Return model_output as the model suggested action
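Algorithm 1 can be sketched in Python as follows. Here `inference_model` stands for any callable wrapping the evaluated LLM, and the JSON layout (a list of objects with "Findings" and "Result" keys) follows the inputs named in the algorithm; both are sketch-level assumptions rather than the authors' exact implementation:

```python
import json

def formulate_farr_questions(json_path, inference_model):
    """Walk a FARR Flow and ask the model for the next action at each
    step, accumulating previously revealed findings and ground-truth
    results as context for later questions."""
    with open(json_path) as f:
        farr_flow = json.load(f)  # list of dicts with Findings/Action/Reasoning/Result

    all_findings = ""
    suggestions = []
    for flow in farr_flow:
        current_findings = flow["Findings"]
        current_result = flow["Result"]
        question_prompt = (
            f"Below are my current findings: {all_findings} {current_findings} "
            "What is the most likely action to do next? "
            "Answer with one specific action only, not more than that."
        )
        suggestions.append(inference_model(question_prompt))
        # Reveal this step's ground-truth result before moving on,
        # so the next question is asked against the true trajectory.
        all_findings += f" {current_findings}, {current_result}"
    return suggestions
```

Because the ground-truth result (not the model's suggestion) is appended to `all_findings`, each step is scored independently: a wrong suggestion at step *k* does not derail the questions at later steps.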
4.3.1. Penetration Testing Guidance Accuracy Performance
4.3.2. Performance by Machine Difficulties
4.4. PurpleLlama CyberSecEval
4.5. Real-World Beginner Use Case Experiment
4.5.1. Reconnaissance
4.5.2. Tools Suggestion
4.5.3. Attack Vector Suggestion
4.5.4. Privilege Escalation Suggestion
5. Discussion and Future Works
5.1. Limitations
5.2. Scaling
5.3. Measuring Other Penetration Testing Capabilities
5.4. Ethical Consideration
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Denis, M.S.; Zena, C.; Hayajneh, T. Penetration testing: Concepts, attack methods, and defense strategies. In Proceedings of the 2016 IEEE Long Island Systems, Applications and Technology Conference (LISAT), Farmingdale, NY, USA, 29 April 2016; pp. 1–6. [Google Scholar]
- Jayasuryapal, G.; Pranay, P.; Kaur, H.; Swati. A Survey on Network Penetration Testing. In Proceedings of the 2021 2nd International Conference on Intelligent Engineering and Management (ICIEM), London, UK, 28–30 April 2021; pp. 373–378. [Google Scholar]
- Altulaihan, E.; Alismail, A.; Frikha, M. A Survey on Web Application Penetration Testing. Electronics 2023, 12, 1229. [Google Scholar] [CrossRef]
- Clintswood; Lie, D.G.; Kuswandana, L.; Nadia; Achmad, S.; Suhartono, D. The Usage of Machine Learning on Penetration Testing Automation. In Proceedings of the 2023 3rd International Conference on Electronic and Electrical Engineering and Intelligent System (ICE3IS), Yogyakarta, Indonesia, 9–10 August 2023; pp. 322–326. [Google Scholar] [CrossRef]
- Kennedy, D.; O’Gorman, J.; Kearns, D.; Aharoni, M. Metasploit: The Penetration Tester’s Guide, 1st ed.; No Starch Press: San Francisco, CA, USA, 2011. [Google Scholar]
- Greenbone Networks GmbH. OpenVAS—Open Vulnerability Assessment System. 2024. Available online: https://www.openvas.org/ (accessed on 4 July 2024).
- Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. Gpt-4 technical report. arXiv 2023, arXiv:2303.08774. [Google Scholar]
- Happe, A.; Kaplan, A.; Cito, J. LLMs as Hackers: Autonomous Linux Privilege Escalation Attacks. arXiv 2024, arXiv:2310.11409. [Google Scholar]
- Deng, G.; Liu, Y.; Mayoral-Vilches, V.; Liu, P.; Li, Y.; Xu, Y.; Zhang, T.; Liu, Y.; Pinzger, M.; Rass, S. PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. arXiv 2023, arXiv:2308.06782. [Google Scholar]
- Xu, J.; Stokes, J.W.; McDonald, G.; Bai, X.; Marshall, D.; Wang, S.; Swaminathan, A.; Li, Z. Autoattacker: A large language model guided system to implement automatic cyber-attacks. arXiv 2024, arXiv:2403.01038. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 6000–6010. [Google Scholar]
- Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. Llama: Open and efficient foundation language models. arXiv 2023, arXiv:2302.13971. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825. [Google Scholar]
- Young, A.; Chen, B.; Li, C.; Huang, C.; Zhang, G.; Zhang, G.; Li, H.; Zhu, J.; Chen, J.; Chang, J.; et al. Yi: Open Foundation Models by 01.AI. arXiv 2024, arXiv:2403.04652. [Google Scholar]
- Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; et al. Qwen Technical Report. arXiv 2023, arXiv:2309.16609. [Google Scholar]
- Jiang, A.Q.; Sablayrolles, A.; Roux, A.; Mensch, A.; Savary, B.; Bamford, C.; Chaplot, D.S.; Casas, D.d.l.; Hanna, E.B.; Bressand, F.; et al. Mixtral of Experts. arXiv 2024, arXiv:2401.04088. [Google Scholar]
- Qwen Team. Qwen-Agent. 2024. Available online: https://github.com/QwenLM/Qwen-Agent (accessed on 25 July 2024).
- Qwen1.5-7B-Chat. Available online: https://huggingface.co/Qwen/Qwen1.5-7B-Chat (accessed on 25 July 2024).
- Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
- Wei, J.; Bosma, M.; Zhao, V.Y.; Guu, K.; Yu, A.W.; Lester, B.; Du, N.; Dai, A.M.; Le, Q.V. Finetuned Language Models Are Zero-Shot Learners. arXiv 2022, arXiv:2109.01652. [Google Scholar]
- Mukherjee, S.; Mitra, A.; Jawahar, G.; Agarwal, S.; Palangi, H.; Awadallah, A. Orca: Progressive Learning from Complex Explanation Traces of GPT-4. arXiv 2023, arXiv:2306.02707. [Google Scholar]
- Wei, J.; Wang, X.; Schuurmans, D.; Bosma, M.; Xia, F.; Chi, E.; Le, Q.V.; Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 2022, 35, 24824–24837. [Google Scholar]
- Hassanin, M.; Moustafa, N. A Comprehensive Overview of Large Language Models (LLMs) for Cyber Defences: Opportunities and Directions. arXiv 2024, arXiv:2405.14487. [Google Scholar]
- Gururangan, S.; Marasović, A.; Swayamdipta, S.; Lo, K.; Beltagy, I.; Downey, D.; Smith, N.A. Don’t stop pretraining: Adapt language models to domains and tasks. arXiv 2020, arXiv:2004.10964. [Google Scholar]
- Christiano, P.F.; Leike, J.; Brown, T.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. arXiv 2017, arXiv:1706.03741. [Google Scholar]
- Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 2022, 35, 27730–27744. [Google Scholar]
- Chen, Z.; Cano, A.H.; Romanou, A.; Bonnet, A.; Matoba, K.; Salvi, F.; Pagliardini, M.; Fan, S.; Köpf, A.; Mohtashami, A.; et al. MEDITRON-70B: Scaling Medical Pretraining for Large Language Models. arXiv 2023, arXiv:2311.16079. [Google Scholar]
- Quan, S. Automatically Generating Numerous Context-Driven SFT Data for LLMs across Diverse Granularity. arXiv 2024, arXiv:2405.16579. [Google Scholar]
- Happe, A.; Cito, J. Getting pwn’d by AI: Penetration Testing with Large Language Models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE ’23 ACM, San Francisco CA USA, 30 November 2023. [Google Scholar] [CrossRef]
- Metta, S.; Chang, I.; Parker, J.; Roman, M.P.; Ehuan, A.F. Generative AI in Cybersecurity. arXiv 2024, arXiv:2405.01674. [Google Scholar]
- Holik, F.; Horalek, J.; Marik, O.; Neradova, S.; Zitta, S. Effective penetration testing with Metasploit framework and methodologies. In Proceedings of the 2014 IEEE 15th International Symposium on Computational Intelligence and Informatics (CINTI), Budapest, Hungary, 19–21 November 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 237–242. [Google Scholar]
- Aksu, M.U.; Altuncu, E.; Bicakci, K. A first look at the usability of openvas vulnerability scanner. In Proceedings of the Workshop on usable security (USEC), San Diego, CA, USA, 24 February 2019. [Google Scholar]
- Kumar, H. Learning Nessus for Penetration Testing; Packt Publishing: Birmingham, UK, 2014. [Google Scholar]
- Valencia, L.J. Artificial Intelligence as the New Hacker: Developing Agents for Offensive Security. arXiv 2024, arXiv:2406.07561. [Google Scholar]
- Huang, J.; Zhu, Q. PenHeal: A Two-Stage LLM Framework for Automated Pentesting and Optimal Remediation. arXiv 2024, arXiv:2407.17788. [Google Scholar]
- Goyal, D.; Subramanian, S.; Peela, A. Hacking, The Lazy Way: LLM Augmented Pentesting. arXiv 2024, arXiv:2409.09493. [Google Scholar]
- Alshehri, I.; Alshehri, A.; Almalki, A.; Bamardouf, M.; Akbar, A. BreachSeek: A Multi-Agent Automated Penetration Tester. arXiv 2024, arXiv:2409.03789. [Google Scholar]
- Fang, R.; Bindu, R.; Gupta, A.; Zhan, Q.; Kang, D. Llm agents can autonomously hack websites. arXiv 2024, arXiv:2402.06664. [Google Scholar]
- Fang, R.; Bindu, R.; Gupta, A.; Kang, D. Llm agents can autonomously exploit one-day vulnerabilities. arXiv 2024, arXiv:2404.08144. [Google Scholar]
- Kumar, A.; Singh, S.; Murty, S.V.; Ragupathy, S. The ethics of interaction: Mitigating security threats in llms. arXiv 2024, arXiv:2401.12273. [Google Scholar]
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
- Cheng, D.; Huang, S.; Wei, F. Adapting large language models via reading comprehension. arXiv 2023, arXiv:2309.09530. [Google Scholar]
- Jiang, T.; Huang, S.; Luo, S.; Zhang, Z.; Huang, H.; Wei, F.; Deng, W.; Sun, F.; Zhang, Q.; Wang, D.; et al. Improving domain adaptation through extended-text reading comprehension. arXiv 2024, arXiv:2401.07284. [Google Scholar]
- Bhatt, M.; Chennabasappa, S.; Li, Y.; Nikolaidis, C.; Song, D.; Wan, S.; Ahmad, F.; Aschermann, C.; Chen, Y.; Kapil, D.; et al. CyberSecEval 2: A Wide-Ranging Cybersecurity Evaluation Suite for Large Language Models. arXiv 2024, arXiv:2404.13161. [Google Scholar]
- Strom, B.E.; Applebaum, A.; Miller, D.P.; Nickels, K.C.; Pennington, A.G.; Thomas, C.B. MITRE ATT&CK: Design and Philosophy. In Technical Report; The MITRE Corporation: McLean, VA, USA, 2018. [Google Scholar]
- Teknium. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. 2023. Available online: https://huggingface.co/datasets/teknium/OpenHermes-2.5 (accessed on 25 July 2024).
- Axolotl-AI-Cloud. Axolotl: Go Ahead and Axolotl Questions. Available online: https://github.com/axolotl-ai-cloud/axolotl (accessed on 25 July 2024).
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.T.; Rocktäschel, T.; et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Vancouver, BC, Canada, 6 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
- Lee, S.; Shakir, A.; Koenig, D.; Lipp, J. Open Source Strikes Bread—New Fluffy Embeddings Model. 2024. Available online: https://www.mixedbread.ai/blog/mxbai-embed-large-v1 (accessed on 25 July 2024).
- Li, X.; Li, J. AnglE-optimized Text Embeddings. arXiv 2023, arXiv:2309.12871. [Google Scholar]
- Beijing Academy of Artificial Intelligence. BAAI/bge-Reranker-Large. 2024. Available online: https://huggingface.co/BAAI/bge-reranker-large (accessed on 25 July 2024).
- Xiao, S.; Liu, Z.; Zhang, P.; Muennighoff, N.; Lian, D.; Nie, J.Y. C-Pack: Packed Resources For General Chinese Embeddings. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, Washington, DC, USA, 14–18 July 2024; pp. 641–649. [Google Scholar] [CrossRef]
- Dong, Q.; Li, L.; Dai, D.; Zheng, C.; Ma, J.; Li, R.; Xia, H.; Xu, J.; Wu, Z.; Chang, B.; et al. A Survey on In-context Learning. arXiv 2024, arXiv:2301.00234. [Google Scholar]
- Hendrycks, D.; Burns, C.; Basart, S.; Zou, A.; Mazeika, M.; Song, D.; Steinhardt, J. Measuring Massive Multitask Language Understanding. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 4 May 2021. [Google Scholar]
- Chen, M.; Tworek, J.; Jun, H.; Yuan, Q.; Pinto, H.P.d.O.; Kaplan, J.; Edwards, H.; Burda, Y.; Joseph, N.; Brockman, G.; et al. Evaluating Large Language Models Trained on Code. arXiv 2021, arXiv:2107.03374. [Google Scholar]
- Lin, S.; Hilton, J.; Evans, O. TruthfulQA: Measuring How Models Mimic Human Falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Dublin, Ireland, 22–27 May 2022; pp. 3214–3252. [Google Scholar] [CrossRef]
- Lane, C.A.; Luciolebrillante; Jazzabeanie; Huber, S.; Wyssmann, A.; Tcpekin; Clifford, P.; Script, T.; Jishnu; Kalinke, S.; et al. Community-Sourced Cheatsheets. Available online: https://github.com/cheat/cheatsheets (accessed on 21 June 2024).
- tldr-pages/tldr: Collaborative Cheatsheets for Console Commands. Available online: https://github.com/tldr-pages/tldr (accessed on 25 July 2024).
- Kali Tools | Kali Linux. Available online: https://www.kali.org/tools/ (accessed on 25 July 2024).
- OSCP Notes. Available online: https://gabb4r.gitbook.io/oscp-notes (accessed on 25 July 2024).
- Fauzi, F. OSCP Playbook. Available online: https://web.archive.org/web/20240129203805/https://fareedfauzi.gitbook.io/oscp-playbook/ (accessed on 26 July 2024).
- GTFOBins. Available online: https://gtfobins.github.io/ (accessed on 25 July 2024).
- LOLBAS. Available online: https://lolbas-project.github.io/ (accessed on 25 July 2024).
- HackTricks. Available online: https://book.hacktricks.xyz/ (accessed on 25 July 2024).
- swisskyrepo/PayloadsAllTheThings: A List of Useful Payloads and Bypasses for Web Application Security and Pentest/CTF. Available online: https://github.com/swisskyrepo/PayloadsAllTheThings (accessed on 25 July 2024).
- 0xdf hacks stuff. Available online: https://0xdf.gitlab.io/ (accessed on 25 July 2024).
- Hack The Box: The #1 Cybersecurity Performance Center. Available online: http://hackthebox.com (accessed on 9 October 2024).
- TryHackMe | Cyber Security Training. Available online: http://tryhackme.com (accessed on 9 October 2024).
- VulnHub: Vulnerable by Design. Available online: https://www.vulnhub.com/ (accessed on 9 October 2024).
- argilla/dpo-mix-7k · Datasets at Hugging Face. Available online: https://huggingface.co/datasets/argilla/dpo-mix-7k (accessed on 25 July 2024).
- Sutawika, L.; Gao, L.; Schoelkopf, H.; Biderman, S.; Tow, J.; Abbasi, B.; Fattori, B.; Lovering, C.; Farzanehnakhaee, A.; Phang, J.; et al. A Framework for Few-Shot Evaluation of Language Models. Available online: https://github.com/EleutherAI/lm-evaluation-harness (accessed on 23 June 2024).
- Clark, P.; Cowhey, I.; Etzioni, O.; Khot, T.; Sabharwal, A.; Schoenick, C.; Tafjord, O. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv 2018, arXiv:1803.05457. [Google Scholar]
- Liu, J.; Cui, L.; Liu, H.; Huang, D.; Wang, Y.; Zhang, Y. LogiQA: A Challenge Dataset for Machine Reading Comprehension with Logical Reasoning. arXiv 2020, arXiv:2007.08124. [Google Scholar]
- Bisk, Y.; Zellers, R.; Bras, R.L.; Gao, J.; Choi, Y. PIQA: Reasoning about Physical Commonsense in Natural Language. arXiv 2019, arXiv:1911.11641. [Google Scholar] [CrossRef]
- Zellers, R.; Bisk, Y.; Schwartz, R.; Choi, Y. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. arXiv 2018, arXiv:1808.05326. [Google Scholar]
- Zellers, R.; Holtzman, A.; Bisk, Y.; Farhadi, A.; Choi, Y. HellaSwag: Can a Machine Really Finish Your Sentence? arXiv 2019, arXiv:1905.07830. [Google Scholar]
- Nie, Y.; Williams, A.; Dinan, E.; Bansal, M.; Weston, J.; Kiela, D. Adversarial NLI: A New Benchmark for Natural Language Understanding. arXiv 2020, arXiv:1910.14599. [Google Scholar]
- Mihaylov, T.; Clark, P.; Khot, T.; Sabharwal, A. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. arXiv 2018, arXiv:1809.02789. [Google Scholar]
- Qwen1.5-7B. Available online: https://huggingface.co/Qwen/Qwen1.5-7B (accessed on 25 July 2024).
- Colibri_8b_v0.1. Available online: https://huggingface.co/CyberNative-AI/Colibri_8b_v0.1 (accessed on 25 July 2024).
- Lily-Cybersecurity-7B-V0.2. Available online: https://huggingface.co/segolilylabs/Lily-Cybersecurity-7B-v0.2 (accessed on 25 July 2024).
- Meta-Llama-3-8B-Instruct. Available online: https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct (accessed on 25 July 2024).
- Llama-3-WhiteRabbitNeo-8B-v2.0. Available online: https://huggingface.co/WhiteRabbitNeo/Llama-3-WhiteRabbitNeo-8B-v2.0 (accessed on 25 July 2024).
- Hermes-2-Pro-Llama-3-8B. Available online: https://huggingface.co/NousResearch/Hermes-2-Pro-Llama-3-8B (accessed on 25 July 2024).
- Mistral-7B-Instruct-v0.3. Available online: https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 (accessed on 25 July 2024).
- Preemware. Preemware/Pentesting-Eval. 2024. Available online: https://huggingface.co/datasets/preemware/pentesting-eval (accessed on 25 July 2024).
- Qwen1.5-72B. Available online: https://huggingface.co/Qwen/Qwen1.5-72B (accessed on 25 July 2024).
- Xu, C.; Sun, Q.; Zheng, K.; Geng, X.; Zhao, P.; Feng, J.; Tao, C.; Jiang, D. WizardLM: Empowering Large Language Models to Follow Complex Instructions. arXiv 2023, arXiv:2304.12244. [Google Scholar]
- Zhang, C.; Bengio, S.; Hardt, M.; Recht, B.; Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Commun. ACM 2021, 64, 107–115. [Google Scholar] [CrossRef]
Name | Native LLM | Chat Assistant | Foothold | Priv. Escalation | Multiple Tools |
---|---|---|---|---|---|
LLM as Hackers [8] | ✗ | ✗ | ✗ | ✓ | ✗ |
PentestGPT [9] | ✗ | ✓ | ✓ | ✓ | ✓ |
AutoAttacker [10] | ✗ | ✗ | ✓ | ✓ | ✗ |
hackingBuddyGPT [29] | ✗ | ✗ | ✓ | ✓ | ✓ |
ReaperAI [34] | ✗ | ✗ | ✓ | ✓ | ✓ |
PenHeal [35] | ✗ | ✗ | ✓ | ✓ | ✓ |
Pentest Copilot [36] | ✗ | ✓ | ✓ | ✓ | ✓ |
BreachSeek [37] | ✗ | ✓ | ✗ | ✓ | ✓ |
LLM Agents Can Autonomously [38,39] | ✗ | ✗ | ✗ | ✓ | ✗ |
CIPHER (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ |
Name | Multiple Tools | Consistent Reproducibility | Open Source Availability |
---|---|---|---|
LLM as Hackers [8] | ✗ | ✓ | ✓ |
PentestGPT [9] | ✓ | ✗ | ✗ |
AutoAttacker [10] | ✗ | ✓ | ✗ |
hackingBuddyGPT [29] | ✓ | ✗ | ✓ |
ReaperAI [34] | ✓ | ✗ | ✓ |
PenHeal [35] | ✓ | ✓ | ✗ |
Pentest Copilot [36] | ✓ | ✗ | ✗ |
BreachSeek [37] | ✓ | ✗ | ✓ |
LLM Agents Can Autonomously [38,39] | ✗ | ✗ | ✗ |
FARR Flow (Ours) | ✓ | ✓ | ✓ |
Prompt Format |
---|
<500 tokens of write-up chunk> — Convert the write-up above into a self-sufficient generalized conversation without referring to this context. The conversation is a question from a novice pentester and a helpful answer from an expert pentester. The newbie always asking what to do next. The experts always provide reasoning explanations, then followed by examples. The conversation is multiple turns, step by step. |
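In code, this augmentation prompt can be assembled as below. The instruction text is taken verbatim from the prompt format above; the whitespace-based chunking (roughly one token per word, capped near 500) is a simplifying assumption standing in for tokenizer-based chunking:

```python
# Verbatim conversion instruction from the prompt format table.
CONVERT_INSTRUCTION = (
    "Convert the write-up above into a self-sufficient generalized "
    "conversation without referring to this context. The conversation is "
    "a question from a novice pentester and a helpful answer from an "
    "expert pentester. The newbie always asking what to do next. The "
    "experts always provide reasoning explanations, then followed by "
    "examples. The conversation is multiple turns, step by step."
)

def chunk_writeup(text: str, max_tokens: int = 500):
    """Split a write-up into ~max_tokens-word chunks.
    Word count as a token proxy is a rough assumption."""
    words = text.split()
    for i in range(0, len(words), max_tokens):
        yield " ".join(words[i:i + max_tokens])

def build_augmentation_prompt(chunk: str) -> str:
    """Place the chunk first, then the conversion instruction."""
    return f"{chunk}\n---\n{CONVERT_INSTRUCTION}"
```

Each resulting prompt is then sent to a teacher model, and its multi-turn novice/expert conversation becomes one SFT training sample.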
Model | ARC Challenge | ARC Easy | LogiQA | Average Score |
---|---|---|---|---|
qwen1.5-7b-chat [79] | 0.4352 | 0.6831 | 0.2995 | 0.4726 |
colibri-8b-v0.1 [80] | 0.4838 | 0.7904 | 0.2565 | 0.5102 |
lily-cybersecurity-7b-v0.2 [81] | 0.4974 | 0.7912 | 0.2903 | 0.5263 |
meta-llama-3-8b-instruct [82] | 0.5265 | 0.8161 | 0.2719 | 0.5381 |
llama-3-whiterabbitneo-8b-v2.0 [83] | 0.5247 | 0.8102 | 0.2873 | 0.5407 |
openhermes-2.5-mistral-7b [46] | 0.5631 | 0.8329 | 0.2980 | 0.5647 |
hermes-2-pro-llama-3-8b [84] | 0.5520 | 0.8342 | 0.3226 | 0.5696 |
mistral-7b-instruct-v0.3 [85] | 0.5717 | 0.8418 | 0.3195 | 0.5776 |
CIPHER 7B DPO (Ours) | 0.4838 | 0.7652 | 0.2642 | 0.5044 |
CIPHER 7B (Ours) | 0.5503 | 0.8232 | 0.2688 | 0.5475 |
Model | PiQA | Swag | Hellaswag | OpenBook QA | ANLI Average | Average Score |
---|---|---|---|---|---|---|
Lily-Cybersecurity-7B-v0.2 [81] | 0.7807 | 0.5610 | 0.6095 | 0.3280 | 0.3955 | 0.5350 |
Llama-3-Whiterabbitneo-8B-v2.0 [83] | 0.7938 | 0.5794 | 0.6093 | 0.3340 | 0.3846 | 0.5402 |
Meta-Llama-3-8B-Instruct [82] | 0.7856 | 0.5700 | 0.5773 | 0.3440 | 0.4653 | 0.5484 |
Qwen1.5-7b-Chat [79] | 0.7546 | 0.5776 | 0.5881 | 0.3220 | 0.5026 | 0.5490 |
Colibri-8B-v0.1 [80] | 0.7960 | 0.5866 | 0.5942 | 0.3640 | 0.4676 | 0.5617 |
Mistral-7B-Instruct-v0.3 [85] | 0.8177 | 0.5867 | 0.6475 | 0.3540 | 0.4560 | 0.5724 |
Hermes-2-Pro-Llama-3-8B [84] | 0.8003 | 0.6026 | 0.6266 | 0.3800 | 0.4776 | 0.5774 |
OpenHermes-2.5-Mistral-7B [46] | 0.8156 | 0.5935 | 0.6310 | 0.3460 | 0.5116 | 0.5795 |
CIPHER 7B DPO (Ours) | 0.7753 | 0.5682 | 0.5922 | 0.2920 | 0.4903 | 0.5436 |
CIPHER 7B (Ours) | 0.7965 | 0.5935 | 0.6288 | 0.3260 | 0.5194 | 0.5728 |
Model | High School CompSci | Computer Security | College CompSci | Formal Logic | Logical Fallacies | Average Score |
---|---|---|---|---|---|---|
colibri-8B-v0.1 [80] | 0.5900 | 0.6800 | 0.4200 | 0.4127 | 0.7055 | 0.5616 |
lily-cybersecurity-7B-v0.2 [81] | 0.5800 | 0.6800 | 0.5600 | 0.3333 | 0.7055 | 0.5718 |
openhermes-2.5-Mistral-7b [46] | 0.7100 | 0.7200 | 0.4500 | 0.4048 | 0.7607 | 0.6091 |
mistral-7B-Instruct-v0.3 [85] | 0.6200 | 0.6700 | 0.5400 | 0.4524 | 0.7730 | 0.6111 |
hermes-2-pro-llama-3-8B [84] | 0.6700 | 0.7100 | 0.4700 | 0.4762 | 0.7669 | 0.6186 |
qwen1.5-7B-Chat [79] | 0.7200 | 0.7600 | 0.6100 | 0.4127 | 0.6748 | 0.6355 |
llama-3-whiterabbitneo-8B-v2.0 [83] | 0.6700 | 0.7800 | 0.5400 | 0.4762 | 0.7239 | 0.6380 |
meta-llama-3-8B-Instruct [82] | 0.6800 | 0.7300 | 0.5200 | 0.5000 | 0.7607 | 0.6381 |
CIPHER 7B DPO (Ours) | 0.6600 | 0.6800 | 0.4800 | 0.3651 | 0.6933 | 0.5757 |
CIPHER 7B (Ours) | 0.6100 | 0.7300 | 0.5500 | 0.3651 | 0.7301 | 0.5970 |
Model | Average |
---|---|
lily-cybersecurity-7B-v0.2 [81] | 0.5688 |
llama-3-whiterabbitneo-8B-v2.0 [83] | 0.5759 |
colibri-8b-v0.1 [80] | 0.5887 |
hermes-2-pro-llama-3-8B [84] | 0.6626 |
qwen1.5-7B-chat [79] | 0.6846 |
meta-llama-3-8B-instruct [82] | 0.7008 |
openhermes-2.5-Mistral-7B [46] | 0.7157 |
mistral-7B-Instruct-v0.3 [85] | 0.7896 |
CIPHER 7B (Ours) | 0.8302 |
CIPHER DPO 7B (Ours) | 0.8394 |
Model | Outcome | Service | Vulnerability | Total Avg |
---|---|---|---|---|
whiterabbitneo-7b [83] | 25.81 | 38.78 | 27.49 | 30.69 |
lily-7b [81] | 29.35 | 43.77 | 35.57 | 36.23 |
qwen-7b [79] | 36.50 | 51.18 | 40.11 | 42.60 |
colibri-0.1-8b [80] | 40.85 | 56.61 | 46.30 | 47.92 |
hermes2pro-llama3-8b [84] | 41.27 | 58.13 | 47.15 | 48.85 |
mistral-ins-0.2-7b [85] | 41.36 | 59.91 | 46.60 | 49.29 |
qwen-72b [87] | 43.57 | 59.05 | 46.84 | 49.82 |
gpt-3.5-turbo [26] | 43.57 | 59.31 | 47.45 | 50.11 |
llama-3-70b-ins [82] | 44.92 | 61.67 | 49.62 | 52.07 |
CIPHER DPO 7B (Ours) | 44.73 | 62.65 | 50.39 | 52.59 |
Model | Easy | Medium | Hard | Insane | Average |
---|---|---|---|---|---|
whiterabbitneo-7b [83] | 33.75 | 30.01 | 28.47 | 28.75 | 30.25 |
lily-7b [81] | 40.73 | 34.66 | 34.50 | 32.03 | 35.48 |
qwen1.5-7b [79] | 44.68 | 41.31 | 42.02 | 41.88 | 42.47 |
colibri-0.1-8b [80] | 49.74 | 47.38 | 45.82 | 48.75 | 47.92 |
hermes2pro-llama3-8b [84] | 50.43 | 47.76 | 48.54 | 48.37 | 48.78 |
mistral-ins-0.2-7b [85] | 51.21 | 49.34 | 47.27 | 47.89 | 48.93 |
qwen1.5-72b [87] | 52.02 | 47.81 | 50.88 | 47.67 | 49.60 |
gpt-3.5-turbo [26] | 53.41 | 48.18 | 49.67 | 47.72 | 49.74 |
openhermes-2.5-7b [46] | 53.05 | 48.51 | 50.27 | 48.50 | 50.08 |
llama-3-70b-ins [82] | 54.56 | 51.28 | 51.48 | 48.74 | 51.52 |
CIPHER DPO 7B (Ours) | 56.25 | 50.80 | 51.05 | 50.96 | 52.26 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pratama, D.; Suryanto, N.; Adiputra, A.A.; Le, T.-T.-H.; Kadiptya, A.Y.; Iqbal, M.; Kim, H. CIPHER: Cybersecurity Intelligent Penetration-Testing Helper for Ethical Researcher. Sensors 2024, 24, 6878. https://doi.org/10.3390/s24216878