A Practical Approach to Defining a Framework for Developing an Agentic AIOps System
Abstract
1. Introduction
- What is our vision for the next generation of Agentic AIOps, one that helps maintain IT service health by supporting IT operations beyond observability alone, through autonomous diagnosis and resolution of incidents and problems and through prevention of potential problems via proactive event management?
- What are the approaches and key architectural components required to design and implement an effective Agentic AIOps system?
- What is the line of evolution of AIOps, what potential does Agentic AIOps hold for future development, and which core framework principles ensure that this potential is realized by maximizing autonomy levels reliably?
- Enablement of autonomous problem and incident management, request fulfillment, and prevention of potential issues using intelligent agents with contextual awareness and self-learning capabilities.
- Alignment to service management best practices, such as ITIL and the Unified Process Framework for IT (UPF-IT), while supporting the continuously evolving range of technologies found in real organizational environments.
- Scalable deployment across all layers of the IT landscape, including infrastructure, middleware, data, and applications.
2. Materials and Methods
2.1. Preamble
2.2. Review and Quantitative Analysis
- Anomaly detection,
- Event correlation,
- Automated remediation,
- Intelligent decision-making in AIOps workflows.
2.3. Own Industry Experience, Focus Groups and Expert Discussions
- Current challenges in IT operations automation,
- The applicability of Agent-Based AIOps,
- Adoption barriers and proactive incident risk prediction [26],
- Future research directions.
2.4. Integrating Quantitative and Qualitative Insights
- Identify gaps in current research,
- Assess the alignment between academic advancements and real-world implementations,
- Propose future research directions that bridge the gap between theory and practice.
2.5. Other Considerations
- Focuses on the development, deployment, and lifecycle management of ML models;
- Ensures that ML models are efficiently trained, versioned, monitored, and updated;
- Includes aspects such as data pipelines, model retraining, model deployment through continuous integration and continuous deployment (CI/CD), and governance. CI/CD is “the appropriate design pattern to support model-as-a-service in AIOps” [33];
- Primarily used in data science and ML applications.
- Uses AI/ML techniques to automate and enhance IT operations;
- Focuses on real-time monitoring, anomaly detection, event correlation, and incident response in IT systems;
- Helps IT teams manage logs, metrics, alerts, and incidents using intelligent automation;
- Primarily used in DevOps, IT infrastructure management, and cybersecurity.
3. Results
- Availability, as the ability of the service to serve client requests, typically measured as the percentage of a given time period during which the service is operational;
- Resilience, as the ability of the service to continuously operate even in conditions of (natural) disasters that may affect certain sites of its deployment. Indicators such as Recovery Point Objective (RPO) and Recovery Time Objective (RTO) are subsumed into this service aspect;
- Performance, which measures qualities such as response time and throughput in transaction processing;
- Capacity, which is the ability to accommodate a certain volume, such as concurrent user sessions or an amount of data;
- Scalability, as the trait allowing the service to accommodate additional load with a less-than-linear increase in its underlying infrastructure;
- Security;
- Maintainability, which refers to allowing the service to be patched or upgraded.
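As a minimal illustration of the availability aspect listed above, the following sketch computes availability as the percentage of a measurement period during which the service was up. The function name and the outage figures are hypothetical and chosen only for illustration; they are not taken from the framework itself.

```python
from datetime import timedelta


def availability_percent(period: timedelta, downtime: timedelta) -> float:
    """Return the share of `period` during which the service was available, in percent."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    if downtime < timedelta(0) or downtime > period:
        raise ValueError("downtime must lie between zero and the period length")
    # Dividing two timedelta objects yields a plain float ratio.
    return 100.0 * (1 - downtime / period)


# Hypothetical example: 43.2 minutes of downtime in a 30-day month.
month = timedelta(days=30)
outage = timedelta(minutes=43.2)
print(f"{availability_percent(month, outage):.2f}%")  # prints "99.90%"
```

The same ratio can be inverted to derive a downtime budget from a target percentage, which is often how availability targets are discussed in practice.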
3.1. AIOps Evolution to Date
- Scope: there are indications of proactive, preventive actions being implemented within the AIOps space [38];
- Level of autonomy.
3.2. Service Management Alignment
- Service Strategy
- Service Design
- Service Transition
- Service Operation
- Continual Service Improvement [42]
- Event Management
- Incident Management
- Request Fulfillment
- Access Management
- Problem Management
- IT Operations Control
- Facilities Management
- Application Management
- Technical Management
- Automated understanding of existing incidents and problems, as well as of potential, imminent ones.
- Automated application of resolutions and preventive actions, going beyond the current approaches in the literature and industry, which are essentially limited to generating operational insights.
4. Discussion
4.1. Proposed Framework Overview
- Actual applications;
- Platform services such as middleware, runtimes, or databases;
- Infrastructure, in terms of server, storage, and network.
4.2. Incident Management Example
- ‘Log Incident’ and ‘Classify Incident and Provide Initial Support’ refer to ‘Understand’. As such, we can design and implement a discriminative AI component to classify the incident and then identify the concerned service element, e.g., an MS SQL Server database.
- ‘Investigate and Diagnose Incident’ aims to identify the cause of the incident, for example, an exhausted storage volume. The diagnosis can be performed by a retrieval-augmented generation (RAG) component. As such, this is a ‘Decision’ element.
- ‘Resolve Incident and Recover Service’ ultimately ‘Takes Action’: it applies the resolution, which can be realized by an automation component such as a Jenkins pipeline containing steps that extend the affected storage volume and then start the SQL service.
- Finally, ‘Close Incident’ updates the ticket status and dispatches the related communication, potentially employing an LLM component.
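The mapping of the incident management flow onto understand, decide, and take action can be sketched as a simple pipeline. This is only an illustrative outline under strong assumptions: the class and function names are hypothetical, a keyword match stands in for the discriminative classifier, a lookup table stands in for the RAG component, and the automation steps are printed rather than executed by a real pipeline.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    ticket_id: str
    description: str
    service_element: str = "unknown"
    root_cause: str = "unknown"
    status: str = "logged"


def understand(incident: Incident) -> Incident:
    """Classify the incident (assumption: keyword match instead of a trained model)."""
    if "sql" in incident.description.lower():
        incident.service_element = "mssql-database"
    incident.status = "classified"
    return incident


def decide(incident: Incident) -> Incident:
    """Diagnose the cause (assumption: lookup table instead of a RAG component)."""
    known_causes = {"mssql-database": "storage-volume-exhausted"}
    incident.root_cause = known_causes.get(incident.service_element, "unknown")
    incident.status = "diagnosed"
    return incident


def take_action(incident: Incident) -> Incident:
    """Apply the resolution (assumption: steps are printed, not executed)."""
    runbooks = {"storage-volume-exhausted": ["extend storage volume", "start SQL service"]}
    steps = runbooks.get(incident.root_cause)
    if steps is None:
        incident.status = "escalated"  # no known resolution: hand over to a human
        return incident
    for step in steps:
        print(f"[{incident.ticket_id}] {step}")
    incident.status = "resolved"
    return incident


def close(incident: Incident) -> Incident:
    """Close the ticket only if the resolution was applied."""
    if incident.status == "resolved":
        incident.status = "closed"
    return incident


def handle(incident: Incident) -> Incident:
    """Run the four process elements in order."""
    for stage in (understand, decide, take_action, close):
        incident = stage(incident)
    return incident


result = handle(Incident("INC-001", "SQL Server database is not responding"))
print(result.status)
```

Keeping each process element behind its own function means a stand-in can later be swapped for an actual AI model or automation tool without changing the overall control flow.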
4.3. Considerations on Research Questions
- Proactive Event Management: The framework enables the continuous monitoring and analysis of system events using AI-powered components that can detect patterns indicative of potential service degradation before they impact users. By mapping event management processes to the “Understand” phase of our framework, we enable early detection of anomalies across all service layers (infrastructure, platform, and application).
- Autonomous Diagnosis: In the “Decision” phase, the framework takes advantage of advanced AI techniques such as knowledge-based systems and RAG to diagnose the root causes of incidents. This capability, demonstrated in our incident management example in Figure 5, allows the system to accurately determine the nature of the problem without human intervention.
- Self-Healing Actions: Our framework’s “Take Action” phase enables autonomous remediation through predefined automation workflows. As illustrated in our example, when a database incident occurs due to storage exhaustion, the system can independently deploy the appropriate resolution by extending storage and restarting the necessary services.
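To make the idea of detecting degradation before it affects users more concrete, the sketch below forecasts when a storage volume would fill up by fitting a straight line to recent usage samples. The linear-trend approach, the function name, and the sample values are assumptions made for illustration; the framework does not prescribe a particular detection technique.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity_gb: float
) -> Optional[float]:
    """Estimate hours from the last sample until usage reaches capacity.

    `samples` holds (hour, used_gb) pairs. Returns None when usage is not growing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    # Ordinary least-squares slope and intercept of usage over time.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_u - slope * mean_t
    t_full = (capacity_gb - intercept) / slope
    return max(0.0, t_full - samples[-1][0])


# Hypothetical usage history: the volume grows by 2 GB per hour.
history = [(0, 80.0), (1, 82.0), (2, 84.0), (3, 86.0)]
print(hours_until_full(history, capacity_gb=100.0))  # prints 7.0
```

A forecast like this could raise an event early enough for the volume to be extended before the incident described in the example occurs.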
- The Orchestrator component manages the control flow through the other components, end-to-end, as per the Incident Management process flow (as depicted by the process elements in the figure).
- ‘Understand’ capability components.
- Identify capability components.
- ‘Take Action’ capability components.
- Progressive Autonomy: Referring to the ten levels of autonomy described in Section 3.1, our framework advocates a progressive approach to increasing autonomy. Organizations can implement Agentic AIOps components at different autonomy levels based on risk tolerance and operational maturity.
- Process Standardization: The framework’s alignment with established ITIL/UPF-IT processes ensures that autonomy is constructed upon standardized operational practices, reducing the risk of unexpected behaviors while enabling consistent automation across domains.
- Service-Aware Design: By incorporating service layer awareness, the framework ensures that autonomous actions consider the hierarchical nature of IT services, preventing remediation actions that might resolve issues at one layer while creating problems at another.
- Continuous Learning: The framework includes feedback loops that allow the system to learn from the outcomes of autonomous actions, gradually improving the accuracy of its decisions and the effectiveness of its remediation strategies.
- Human Oversight: While maximizing autonomy, the framework maintains provisions for human oversight at critical decision points, particularly for high-risk actions or unprecedented scenarios, ensuring that reliability is not compromised in pursuit of autonomy.
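The interplay between progressive autonomy and human oversight can be illustrated with a small gating rule. The three-level risk scale, the function name, and the policy itself are hypothetical simplifications introduced here; they are not the autonomy levels referred to in Section 3.1.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def requires_human_approval(
    action_risk: Risk, max_autonomous_risk: Risk, seen_before: bool
) -> bool:
    """Decide whether an action must wait for a human.

    Assumed policy: unprecedented scenarios always need approval; otherwise an
    action runs autonomously only if its risk does not exceed the configured limit.
    """
    if not seen_before:
        return True
    return action_risk > max_autonomous_risk


# Hypothetical organization that currently allows autonomy up to medium risk.
limit = Risk.MEDIUM
print(requires_human_approval(Risk.LOW, limit, seen_before=True))
print(requires_human_approval(Risk.HIGH, limit, seen_before=True))
print(requires_human_approval(Risk.LOW, limit, seen_before=False))
```

Raising `max_autonomous_risk` over time is one way an organization could increase autonomy gradually as its operational maturity grows.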
5. Conclusions
- Full alignment with ITIL—via UPF-IT—process flows to provide full coverage of IT Operation activities;
- Establishment of an IT Service taxonomy, including Service Elements featuring distinct technologies for infrastructure, middleware, and applications;
- Application of the understand–decide–take action paradigm for allowing consistent architectural decision making for component realization (such as various AI models versus automation implementations);
- Continuous training of the target Agentic AIOps system in order to maximize its proactive posture.
- Defining the capabilities of the system for a given process. Indeed, any ITIL Operations process can be seen as requiring its own solution (or system capability);
- Specifying components for a given technology—infrastructure, middleware, and application—for a certain process element, such as an activity or task. Considerations such as AI model types and related implementation guidelines apply;
- A conceptual architecture will be proposed to support specific aspects of the aforementioned directions, which are discussed in a dedicated chapter.
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
AI | Artificial Intelligence |
AIOps | Artificial Intelligence for IT Operations |
CI/CD | Continuous Integration/Continuous Deployment |
DevSecOps | Development, Security and Operations |
ERP | Enterprise Resource Planning |
GenAI | Generative Artificial Intelligence |
IT | Information Technology |
ITIL | Information Technology Infrastructure Library |
ITOA | IT Operations Analytics |
ITSM | IT Service Management |
KPIs | Key Performance Indicators |
LLMs | Large Language Models |
MLOps | Machine Learning Operations |
NIST | National Institute of Standards and Technology |
PRM-IT | Process Reference Model for IT |
RAG | Retrieval-Augmented Generation |
RPO | Recovery Point Objective |
RTO | Recovery Time Objective |
UPF-IT | Unified Process Framework for IT |
References
- Notaro, P.; Cardoso, J.; Gerndt, M. A Systematic Mapping Study in AIOps. In Service-Oriented Computing – ICSOC 2020 Workshops; Hacid, H., Outay, F., Paik, H., Alloum, A., Petrocchi, M., Bouadjenek, M.R., Beheshti, A., Liu, X., Maaradji, A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; Volume 12632, pp. 110–123. ISBN 978-3-030-76351-0.
- ITIL V3. Available online: https://www.itsm-docs.com/blogs/itil-faq/itil-v3 (accessed on 10 March 2025).
- Dande, F.; Li, X.; Shofoluwe, M.; McLeod, A. Artificial Intelligence Integration in IT Service Management: An ITIL Configuration Management Process Review. In Proceedings of the International Conference on Industrial Engineering and Operations Management, Detroit, MI, USA, 9–11 October 2024; IEOM Society International: Detroit, MI, USA, 2024.
- Iden, J.; Eikebrokk, T.R. Implementing IT Service Management: A systematic literature review. Int. J. Inf. Manag. 2013, 33, 512–523.
- Remil, Y.; Bendimerad, A.; Mathonat, R.; Kaytoue, M. AIOps Solutions for Incident Management: Technical Guidelines and A Comprehensive Literature Review. arXiv 2024, arXiv:2404.01363.
- Shen, S.; Zhang, J.; Huang, D.; Xiao, J. Evolving from Traditional Systems to AIOps: Design, Implementation and Measurements. In Proceedings of the 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, 25–27 August 2020; IEEE: Dalian, China, 2020; pp. 276–280.
- How To Get Started with Aiops. Available online: https://www.gartner.com/smarterwithgartner/how-to-get-started-with-aiops (accessed on 17 April 2025).
- Bogatinovski, J.; Nedelkoski, S.; Acker, A.; Schmidt, F.; Wittkopp, T.; Becker, S.; Cardoso, J.; Kao, O. Artificial Intelligence for IT Operations (AIOPS) Workshop White Paper. arXiv 2021, arXiv:2101.06054.
- Lyu, Y.; Rajbahadur, G.K.; Lin, D.; Chen, B.; Jiang, Z.M. Towards a Consistent Interpretation of AIOps Models. ACM Trans. Softw. Eng. Methodol. 2022, 31, 1–38.
- Dang, Y.; Lin, Q.; Huang, P. AIOps: Real-World Challenges and Research Innovations. In Proceedings of the 2019 IEEE/ACM 41st International Conference on Software Engineering: Companion Proceedings (ICSE-Companion), Montreal, QC, Canada, 25–31 May 2019; IEEE: Montreal, QC, Canada, 2019; pp. 4–5.
- Kumar, S. Data Silos A Roadblock for AIOps. arXiv 2023, arXiv:2312.10039.
- Yang, X.; Palmes, P.; Jha, S.; Turkkan, B.; Vanloo, G.; Bagehorn, F.; Narayanaswami, C.; Shwartz, L.; Abe, N.; Deng, Y.; et al. SAM: Subseries Augmentation-Based Meta-Learning for Generalizing AIOps Models in Multi-Cloud Migration. In Proceedings of the 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), Shenzhen, China, 7–13 July 2024; IEEE: Shenzhen, China, 2024; pp. 291–301.
- Mao, H.; Zhang, T.; Tang, Q. Research Framework for Determining How Artificial Intelligence Enables Information Technology Service Management for Business Model Resilience. Sustainability 2021, 13, 11496.
- Min, S.; Kim, B. AI Technology Adoption in Corporate IT Network Operations Based on the TOE Model. Digital 2024, 4, 947–970.
- Benzaid, C.; Taleb, T. AI-Driven Zero Touch Network and Service Management in 5G and Beyond: Challenges and Research Directions. IEEE Netw. 2020, 34, 186–194.
- de Arcaya, J.D. A Framework for the Operationalization of Analytic Workloads in Complex Distributed Computing Environments. Ph.D. Thesis, Universidad de Deusto, Bilbao, Spain, 2024.
- Dong, W. AIOps Architecture in Data Center Site Infrastructure Monitoring. Comput. Intell. Neurosci. 2022, 2022, 1988990.
- Chen, Y.; Shetty, M.; Somashekar, G.; Ma, M.; Simmhan, Y.; Mace, J.; Bansal, C.; Wang, R.; Rajmohan, S. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. arXiv 2025, arXiv:2501.06706.
- Duan, Y.; Bao, H.; Bai, G.; Wei, Y.; Xue, K.; You, Z.; Zhang, Y.; Liu, B.; Chen, J.; Wang, S.; et al. Learning to Diagnose: Meta-Learning for Efficient Adaptation in Few-Shot AIOps Scenarios. Electronics 2024, 13, 2102.
- Potts, W.C.; Carver, C. Best Practices Implementing AIOps in Large Organizations. In Proceedings of the 2024 International Conference on Smart Applications, Communications and Networking (SmartNets), Harrisonburg, VA, USA, 28–30 May 2024; IEEE: Harrisonburg, VA, USA, 2024; pp. 1–5.
- Liu, Y.; Pei, C.; Xu, L.; Chen, B.; Sun, M.; Zhang, Z.; Sun, Y.; Zhang, S.; Wang, K.; Zhang, H.; et al. OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models. arXiv 2024, arXiv:2310.07637.
- Pounds, E. What Is Agentic AI? NVIDIA Blog 2024. Available online: https://blogs.nvidia.com/blog/what-is-agentic-ai/ (accessed on 17 April 2025).
- IBM Watson AIOps 2.1. Available online: https://www.ibm.com/docs/en/watson-aiops/2.1?topic=overview-component (accessed on 2 April 2025).
- Onkamo, M.; Rahman, T. Artificial Intelligence for IT Operations—Basic Guide to Start with AIOps. 2023. Available online: https://doi.org/10.13140/RG.2.2.20295.16803 (accessed on 19 February 2025).
- Ayub, K.; Alshawa, R. A Novel AI Framework for WBAN Event Correlation in Healthcare: ServiceNow AIOps approach. In Proceedings of the 2024 IEEE Workshop on Microwave Theory and Technology in Wireless Communications (MTTW), Riga, Latvia, 2–4 October 2024; IEEE: Riga, Latvia, 2024; pp. 55–60.
- Ahmed, S.; Singh, M.; Doherty, B.; Ramlan, E.; Harkin, K.; Bucholc, M.; Coyle, D. An Empirical Analysis of State-of-Art Classification Models in an IT Incident Severity Prediction Framework. Appl. Sci. 2023, 13, 3843.
- Shetty, M.; Chen, Y.; Somashekar, G.; Ma, M.; Simmhan, Y.; Zhang, X.; Mace, J.; Vandevoorde, D.; Las-Casas, P.; Gupta, S.M.; et al. Building AI Agents for Autonomous Clouds: Challenges and Design Principles. In Proceedings of the ACM Symposium on Cloud Computing, Redmond, WA, USA, 20–22 November 2024; ACM: Redmond, WA, USA, 2024; pp. 99–110.
- Korada, L. AIOps and MLOps: Redefining Software Engineering Lifecycles and Professional Skills for the Modern Era. J. Eng. Appl. Sci. Technol. 2023, 271, 2–7.
- Díaz-de-Arcaya, J.; Torre-Bastida, A.I.; Miñón, R.; Almeida, A. Orfeon: An AIOps framework for the goal-driven operationalization of distributed analytical pipelines. Future Gener. Comput. Syst. 2023, 140, 18–35.
- Battina, D.S. An intelligent devops platform research and design based on machine learning. Novat. Publ. Int. J. Innov. Eng. Res. Technol. 2019, 6, 68–75.
- Amrit, C.; Narayanappa, A.K. An analysis of the challenges in the adoption of MLOps. J. Innov. Knowl. 2025, 10, 100637.
- Diaz-de-Arcaya, J.; Torre-Bastida, A.I.; Zárate, G.; Miñón, R.; Almeida, A. A Joint Study of the Challenges, Opportunities, and Roadmap of MLOps and AIOps: A Systematic Survey. ACM Comput. Surv. 2024, 56, 1–30.
- Chen, R.; Pu, Y.; Shi, B.; Wu, W. An automatic model management system and its implementation for AIOps on microservice platforms. J. Supercomput. 2023, 79, 11410–11426.
- Feuerriegel, S.; Hartmann, J.; Janiesch, C.; Zschech, P. Generative AI. Bus. Inf. Syst. Eng. 2024, 66, 111–126.
- Gartner Says Algorithmic IT Operations Drives Digital Business. Available online: https://www.gartner.com/en/newsroom/press-releases/2017-04-11-gartner-says-algorithmic-it-operations-drives-digital-business (accessed on 17 April 2025).
- Wu, X.; Zhang, Y.; Shi, M.; Li, P.; Li, R.; Xiong, N.N. An adaptive federated learning scheme with differential privacy preserving. Future Gener. Comput. Syst. 2022, 127, 362–372.
- Ahmed, S.; Singh, M.; Doherty, B.; Ramlan, E.; Harkin, K.; Coyle, D. AI for Information Technology Operation (AIOps): A Review of IT Incident Risk Prediction. In Proceedings of the 2022 9th International Conference on Soft Computing & Machine Intelligence (ISCMI), Toronto, ON, Canada, 26–27 November 2022; IEEE: Toronto, ON, Canada, 2022; pp. 253–257.
- Sivakumar, S. Agentic AI in Predictive AIOps: Enhancing IT Autonomy and Performance. Int. J. Sci. Res. Manag. IJSRM 2024, 12, 1631–1638.
- Zha, J.; Shan, X.; Lu, J.; Zhu, J.; Liu, Z. Leveraging Large Language Models for Efficient Alert Aggregation in AIOPs. Electronics 2024, 13, 4425.
- Gulenko, A.; Acker, A.; Kao, O.; Liu, F. AI-Governance and Levels of Automation for AIOps-supported System Administration. In Proceedings of the 2020 29th International Conference on Computer Communications and Networks (ICCCN), Honolulu, HI, USA, 3–6 August 2020; IEEE: Honolulu, HI, USA, 2020; pp. 1–6.
- Manchana, R. AI-Powered Observability: A Journey from Reactive to Proactive, Predictive, and Automated. Int. J. Sci. Res. IJSR 2024, 13, 1745–1755.
- Menken, I. Virtualization Architecture, Adoption and Monetization of Virtualization Projects Using Best Practice Service Strategy, Service Design, Service Transition,… and Continual Service Improvement Processes; Emereo Pty Ltd.: London, UK, 2008; ISBN 978-1-921523-49-6.
- IBM. IBM Process Reference Model for IT; IBM: New York, NY, USA, 2008.
- Eramo, R.; Said, B.; Oriol, M.; Bruneliere, H.; Morales, S. An architecture for model-based and intelligent automation in DevOps. J. Syst. Softw. 2024, 217, 112180.
- Cheng, Q.; Sahoo, D.; Saha, A.; Yang, W.; Liu, C.; Woo, G.; Singh, M.; Saverese, S.; Hoi, S.C.H. AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges. arXiv 2023, arXiv:2304.04661.
- Mell, P.M.; Grance, T. The NIST Definition of Cloud Computing; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2011; p. NIST SP 800-145.
- Mulongo, N.Y. Key Performance Indicators of Artificial Intelligence For IT Operations (AIOPS). In Proceedings of the 2024 International Symposium on Networks, Computers and Communications (ISNCC), Washington, DC, USA, 22–25 October 2024; IEEE: Washington, DC, USA, 2024; pp. 1–8.
Selection Criteria | Description |
---|---|
Publication Type | Peer-reviewed articles, industry reports, white papers |
Source Databases | IEEE Xplore, ACM Digital Library, Google Scholar, Gartner, Forrester |
Keywords Used | AIOps, intelligent IT automation, agent-based AIOps, ML for IT operations |
Time Frame | Last five years (to ensure relevance and up-to-date findings) |
Evaluation Focus | Empirical studies, case studies, systematic reviews, theoretical frameworks |
Core Themes | Anomaly detection, event correlation, self-healing systems, Agentic decision-making |
Feature | MLOps | AIOps |
---|---|---|
Primary Goal | Deliver and maintain ML models in production | Automate and enhance IT operations using AI |
Scope | ML pipelines, model training, deployment, monitoring | IT observability, anomaly detection, event correlation, incident response |
Data Sources | Structured datasets, feature stores, databases | Logs, metrics, traces, events, alerts from IT systems |
Key Techniques | Model training, hyperparameter tuning, CI/CD for ML, versioning | Anomaly detection, predictive analytics, event correlation, NLP for log analysis |
Automation Level | Partial (mostly focused on model deployment and retraining) | High (self-healing, auto-remediation, predictive issue detection) |
Main Users | Data scientists, ML engineers, software engineers | IT Operations Teams, DevOps engineers, Site Reliability Engineers (SREs) |
Challenges | Model drift, data drift, retraining complexity, explainability | Noisy alerts, false positives, integration with existing IT tools, trust in automation |
Key Tools and Frameworks | MLflow, Kubeflow, TensorFlow Extended (TFX), Airflow | Splunk AIOps, Dynatrace, New Relic, IBM Watson AIOps, Moogsoft |
End Goal | Deliver high-performing ML models that improve business outcomes | Reduce Mean Time to Resolution (MTTR), enhance IT system reliability, automate incident management |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zota, R.D.; Bărbulescu, C.; Constantinescu, R. A Practical Approach to Defining a Framework for Developing an Agentic AIOps System. Electronics 2025, 14, 1775. https://doi.org/10.3390/electronics14091775