1. Introduction
The exponential growth of data in the modern era has made database management systems integral to efficiently storing and retrieving information. However, the increasing complexity and scale of these systems, especially in terms of optimizing query performance for large datasets, present significant challenges. Traditional database management techniques often struggle to cope with these challenges, particularly when dealing with dynamic and heterogeneous workloads. Manual tuning and configuration, while effective in smaller or more stable environments, are becoming increasingly impractical and insufficient for modern, large-scale data environments. Furthermore, as workload patterns evolve rapidly, the need for real-time optimization becomes paramount.
Recent advancements in machine learning (ML) have introduced new opportunities for automating and enhancing various aspects of DBMS operations, including query optimization, indexing, and workload management. Several studies have demonstrated the potential of ML models to predict query execution times, optimize query plans, and automate the selection of indexes and configuration parameters. However, these efforts often focus on isolated aspects of database management, lacking a unified framework that simultaneously addresses multiple optimization tasks and adapts in real time to changing workloads. For instance, Ref. [1] highlights the use of ML in query optimization, and Ref. [2] applies ML to optimize performance, yet both focus on narrow tasks within the broader scope of DBMS operations. Our work builds on these efforts by offering a holistic, real-time approach.
This study is motivated by two primary research questions: (1) How can machine learning models be integrated into a DBMS to improve query optimization and workload management? (2) What performance gains can be achieved through dynamic configuration tuning based on real-time workload patterns? These questions stem from the increasingly complex demands of modern databases, where workload diversity and scale necessitate sophisticated optimization strategies.
In this paper, we propose a comprehensive framework that integrates advanced ML models within the architecture of a DBMS, with a particular focus on PostgreSQL. Our framework is designed to overcome the limitations of existing approaches by offering a tightly coupled platform that enables real-time ML model inference and dynamic adjustment of database configurations based on ongoing workload patterns. We employ a combination of supervised learning techniques, such as linear regression and random forest, and unsupervised methods, including K-means clustering, to accurately predict query execution times, identify patterns in query workloads, and suggest optimization strategies that improve overall system performance. This combination ensures that both historical data patterns and unseen data behaviors are accounted for in the optimization process.
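The supervised component described above can be illustrated with a minimal sketch. The features and synthetic data below are purely hypothetical stand-ins (the paper's actual feature set is described in Section 4); the sketch only shows the shape of the approach: a random forest regressor trained on per-query features to predict execution time.

```python
# Minimal sketch (assumed features and synthetic data, not the paper's actual
# pipeline): a random forest predicting query execution time from query- and
# system-level features.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500

# Illustrative features: join count, log10(rows scanned), predicate count,
# and buffer-cache hit ratio (a system-state signal).
X = np.column_stack([
    rng.integers(0, 8, n),      # join count
    rng.uniform(3, 9, n),       # log10(rows scanned)
    rng.integers(1, 12, n),     # predicate count
    rng.uniform(0.5, 1.0, n),   # cache hit ratio
])
# Synthetic target: execution time grows with joins and scanned rows.
y = 0.4 * X[:, 0] + 1.5 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(0, 0.3, n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print(f"MAE: {mean_absolute_error(y_test, model.predict(X_test)):.3f}")
```

In practice the predicted execution times feed the optimizer's plan selection; the same feature vectors can also be reused by the unsupervised clustering stage.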
A key innovation of our framework is the seamless integration of ML models with the DBMS engine, allowing for real-time data ingestion, automated model retraining, and continuous adaptation to workload changes. This approach not only minimizes the need for manual database administration but also enhances the DBMS’s ability to maintain optimal performance in diverse and evolving environments. By integrating these ML capabilities directly into the query optimization process, our framework achieves significant improvements in both query response times and system throughput.
The contributions of this paper are as follows:
We propose a comprehensive ML-driven framework that addresses multiple aspects of database optimization, including query performance prediction, workload management, and dynamic configuration tuning.
We demonstrate a novel approach for integrating ML models within the PostgreSQL DBMS engine, facilitating real-time inference and optimization.
We apply advanced feature engineering techniques tailored for database environments, enabling more accurate and effective ML model predictions by considering factors such as query structure, data distribution, and system state.
We conduct extensive evaluations by using the Transaction Processing Performance Council Decision Support (TPC-DS) benchmark to showcase the scalability and adaptability of our framework, with significant improvements in query execution times and system throughput.
We address potential limitations, such as conflicts in tuning recommendations and the overhead of ML integration, providing insights for future research directions. Specifically, we explore the trade-offs involved in integrating ML models into a DBMS and identify future directions to enhance model retraining and scalability.
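To make the feature-engineering contribution concrete, the following hedged sketch shows one plausible way to flatten a PostgreSQL `EXPLAIN (FORMAT JSON)` plan into a feature vector. The specific features extracted here (node counts, join counts, estimated rows and cost) are illustrative assumptions, not the exact set used in our framework.

```python
# Hypothetical sketch: turning a PostgreSQL EXPLAIN (FORMAT JSON) plan into a
# flat feature dictionary. The chosen features are illustrative only.
import json

def plan_features(plan_json: str) -> dict:
    plan = json.loads(plan_json)[0]["Plan"]
    features = {"node_count": 0, "join_count": 0, "seq_scans": 0,
                "est_rows": 0.0, "est_cost": plan["Total Cost"]}

    def walk(node):
        features["node_count"] += 1
        if "Join" in node["Node Type"]:
            features["join_count"] += 1
        if node["Node Type"] == "Seq Scan":
            features["seq_scans"] += 1
        features["est_rows"] += node.get("Plan Rows", 0)
        for child in node.get("Plans", []):
            walk(child)

    walk(plan)
    return features

# Minimal example plan (heavily abbreviated):
example = json.dumps([{"Plan": {
    "Node Type": "Hash Join", "Total Cost": 120.5, "Plan Rows": 1000,
    "Plans": [
        {"Node Type": "Seq Scan", "Plan Rows": 5000},
        {"Node Type": "Seq Scan", "Plan Rows": 200},
    ]}}])
print(plan_features(example))
```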
The rest of this paper is organized as follows:
Section 2 provides a comprehensive review of related work.
Section 3 details our system architecture and integration approach.
Section 4 describes our ML methodology, including feature selection, model training, and optimization strategies.
Section 5 presents our experimental setup and evaluation results. Finally,
Section 6 concludes the paper and discusses future research directions.
3. System Architecture
This section presents the detailed architecture of our proposed machine learning (ML)-integrated database management system (DBMS) framework, which is designed to enhance query performance, automate database management tasks, and dynamically adapt to workload variations. The architecture is built on PostgreSQL (version 13.0) and is modular, allowing for the seamless integration of various ML models and supporting real-time adjustments to database configurations. The architecture is structured into three primary layers: the Database Management Layer, the Machine Learning Integration Layer, and the User Interface and Visualization Layer. Each layer is designed to function independently, ensuring flexibility, scalability, and ease of maintenance. We have integrated dynamic clustering mechanisms to manage workload patterns, utilizing methods such as K-means for workload classification.
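The workload-classification mechanism in the Machine Learning Integration Layer can be sketched as follows. The workload descriptors and the two synthetic regimes below are invented for illustration; the point is only that K-means groups workload windows with similar characteristics so that each cluster can receive a tailored tuning profile.

```python
# Illustrative sketch (synthetic data, assumed descriptors): K-means workload
# classification over per-window workload summaries.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Each row summarizes a workload window: queries/sec, avg joins per query,
# read fraction. Two synthetic regimes: OLTP-like and OLAP-like.
oltp = np.column_stack([rng.normal(800, 50, 50), rng.normal(1, 0.3, 50),
                        rng.normal(0.60, 0.05, 50)])
olap = np.column_stack([rng.normal(20, 5, 50), rng.normal(6, 1, 50),
                        rng.normal(0.95, 0.02, 50)])
X = StandardScaler().fit_transform(np.vstack([oltp, olap]))

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
# Windows in the same cluster receive the same tuning profile.
print(km.labels_[:5], km.labels_[-5:])
```

Standardizing the descriptors before clustering matters here: without it, the queries-per-second axis would dominate the distance metric.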
5. Results and Analysis
In this section, we present the results of our experiments and analyze the performance of our ML-integrated DBMS framework. We compare the results against the baseline systems described in the previous section, focusing on key performance metrics such as query execution time, system throughput, resource utilization, model accuracy, and scalability. Additionally, we examine the system’s adaptability to changing workloads and its scalability across different quantities of data, providing a comprehensive evaluation of our approach.
6. Conclusions
In this paper, we have presented a novel and comprehensive framework for integrating machine learning (ML) techniques into the architecture of a database management system (DBMS), specifically targeting PostgreSQL. Our approach addresses the increasing complexity and scale of modern database environments by automating critical tasks such as query optimization, workload management, and dynamic system tuning, which are traditionally manual and time-consuming. We have also provided deeper insights into workload classification, utilizing clustering methods to further enhance system adaptability.
Our extensive evaluation, using the TPC-DS benchmark—a standard for decision support systems—demonstrates the substantial performance improvements achieved by our ML-integrated system. The integration of clustering methods (as determined by the elbow and silhouette analysis) ensured that workload-specific optimization strategies were applied efficiently, contributing to a significant increase in performance. Across various workloads and data scale factors, the system consistently outperformed baseline configurations. Notably, we observed a reduction of up to 55% in query execution times for complex analytical queries, which are typically resource-intensive and time-consuming. This reduction directly translates into faster query responses and improved user experience in environments where complex queries are prevalent.
Additionally, the system achieved a remarkable 74% increase in throughput under OLAP-heavy workloads, demonstrating its ability to handle large volumes of data and queries efficiently. This improvement is crucial for data-intensive environments such as data warehouses and business intelligence systems, where high throughput is essential to timely data processing and reporting.
One of the key contributions of our work is the use of dynamic clustering to classify workloads based on their characteristics, allowing the system to apply tailored optimization strategies in real time. As demonstrated in the Results section, this clustering not only ensured optimal resource allocation but also played a crucial role in the system’s adaptability to workload changes. The elbow and silhouette methods were instrumental in determining the optimal number of clusters, improving system efficiency by reducing redundant computations.
The architecture of our system ensures that ML models are seamlessly integrated with the DBMS, enabling real-time inference and optimization. This integration allows the system to dynamically adjust query execution plans, resource allocations, and system configurations based on real-time predictions, all while maintaining low overhead. The overhead introduced by the ML components was minimal, with the highest observed being only 7.1% in OLAP-heavy workloads—more than offset by the significant performance gains.
Our framework also demonstrated robust scalability, effectively managing data quantities ranging from 100 GB to 10 TB without degradation in performance. The system’s adaptability to changing workloads was evident, with rapid adaptation times of less than 10 min and performance recovery rates exceeding 97% in most scenarios. These results underscore the system’s versatility, making it suitable for a wide range of database environments, from operational systems requiring low-latency transactions to analytical systems dealing with complex, long-running queries. The clustering analysis further demonstrated that our system is capable of classifying workloads into distinct groups, allowing for targeted optimization strategies that enhance the system’s performance under varied conditions.
Limitations and Future Work
Despite the promising results, there are several limitations in our current framework that warrant further investigation. First, the system’s reliance on predefined ML models means that its performance could degrade if workload patterns change significantly over time. While continuous learning is implemented to mitigate this, the retraining process could be improved by automating model updates based on real-time shifts in workload characteristics. Additionally, our clustering approach, though effective, may not capture the full complexity of multi-dimensional workloads, particularly in environments where inter-table relationships and nested queries are prevalent.
Another limitation lies in the framework’s current dependency on PostgreSQL. While this system is widely used, extending the framework to support other DBMS platforms such as MySQL, Oracle, or SQL Server would enhance its applicability and generalizability to a broader range of database environments. Similarly, testing in distributed database environments such as Apache Cassandra or Google BigQuery could offer insights into how the framework scales in cloud-native, distributed settings.
Future work will focus on several key areas of improvement. First, optimizing the clustering mechanism to handle more complex, multi-dimensional workloads is a priority. We aim to investigate the integration of reinforcement learning techniques, which could enable the system to learn optimal clustering strategies dynamically. This would allow for more granular workload classification and further improvements in resource management.
Additionally, automating the retraining of ML models is an area that needs refinement. A key future direction involves developing adaptive learning mechanisms that adjust model retraining frequencies based on workload shifts and system performance. This would minimize the risk of model obsolescence and ensure that the system remains responsive to evolving workloads. Furthermore, extending our framework to integrate support for multiple DBMS platforms and distributed database architectures would allow for more comprehensive evaluation in different operational contexts, enhancing the framework’s applicability and scalability.
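One plausible realization of such an adaptive retraining mechanism is a drift-triggered scheduler: retrain only when a monitored workload metric moves far outside its recent baseline. The z-score statistic, window size, and threshold below are illustrative choices, not part of our current implementation.

```python
# Hypothetical sketch of a drift-triggered retraining signal. When an observed
# workload metric deviates more than z_threshold standard deviations from the
# recent baseline window, the trigger fires and the baseline is reset.
import statistics
from collections import deque

class RetrainTrigger:
    def __init__(self, window=100, z_threshold=3.0):
        self.baseline = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True when `value` drifts far from the baseline window."""
        if len(self.baseline) >= 30:  # require a minimum baseline
            mean = statistics.fmean(self.baseline)
            std = statistics.pstdev(self.baseline) or 1e-9
            if abs(value - mean) / std > self.z_threshold:
                self.baseline.clear()   # reset after a regime change
                return True             # caller should schedule retraining
        self.baseline.append(value)
        return False

trigger = RetrainTrigger()
# Stable workload: no retraining signal.
fired = [trigger.observe(10.0 + 0.1 * (i % 5)) for i in range(50)]
# Sudden shift in the monitored metric: the trigger fires.
print(any(fired), trigger.observe(50.0))
```

Tying retraining to observed drift, rather than a fixed schedule, is what would let retraining frequency track workload shifts as discussed above.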
Finally, future work will also include a deeper analysis of optimal model update intervals and how these intervals can be adjusted in real time to improve system adaptability without compromising performance. We also plan to explore more advanced ML models, including deep learning architectures, to handle increasingly complex query optimization tasks.
In conclusion, our ML-integrated DBMS framework represents a significant advancement in the field of database optimization. By leveraging the predictive power of machine learning, our system offers a robust and adaptable solution to the challenges of modern database management. It not only provides immediate performance improvements—such as reduced query execution times and increased throughput—but also ensures long-term adaptability to evolving workloads and data environments. The clustering mechanism and continuous learning aspects are particularly valuable in modern, dynamic environments, allowing the system to evolve alongside changing workload patterns.