Article

Learned Query Optimization by Constraint-Based Query Plan Augmentation

College of Computer Science and Technology, Hangzhou Dianzi University, Hangzhou 310018, China
* Author to whom correspondence should be addressed.
Mathematics 2024, 12(19), 3102; https://doi.org/10.3390/math12193102
Submission received: 19 July 2024 / Revised: 18 September 2024 / Accepted: 29 September 2024 / Published: 3 October 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract

Over the past decades, various cost-based optimizers have been proposed to generate optimal plans for SQL queries. These optimizers are key to achieving good performance in database systems and can speed up query execution. Still, they may require enormous expert effort and perform poorly on complicated queries. Learning-based optimizers have been shown to achieve high-quality plans by learning from past experience. However, these solutions treat each query separately and neglect the semantic equivalence among different queries. Intuitively, a high-quality plan for a complicated query may be obtained by discovering a simple equivalent query. Motivated by this, in this paper we present Celo, a novel constraint-enhanced learned optimizer that directly integrates the equivalence information of queries into a learning-based model. We apply denial constraints to identify equivalent queries by replacing equivalent predicates. Given a query, we augment the query plans generated by the learning-based model with the high-quality plans of its equivalent queries. A potentially better-performing plan can then be predicted among the augmented query plans. Extensive experiments on real-world datasets demonstrate that Celo outperforms the previous state-of-the-art (SOTA) methods even with few constraints.

1. Introduction

The query optimizer is a core component of database management systems and has been studied for decades [1]. Traditional methods make use of various metadata, including cardinalities, statistics, and data dependencies, to build heuristic cost models for query optimization tasks such as cardinality and cost estimation [2] and join order selection [3]. Due to their heuristic nature, building such a query optimizer still requires enormous human effort and relies on expert experience [4].
Benefiting from the development of machine learning, recent work [5,6,7] focuses on applying neural networks to build learned models that replace the corresponding components of a traditional query optimizer. These studies have demonstrated that models with learned strategies, such as reinforcement learning, outperform heuristic models. However, improving one component of a query optimizer does not guarantee that the optimizer generates the optimal query plan. To deal with this issue, the learning-based query optimizers Neo [8] and Bao [9] learn query plans by adjusting their neural network models using actually executed plans and their latencies. Neo builds a learned optimizer from scratch, while Bao builds on top of existing query optimizers by providing per-query optimization execution strategies (i.e., hints).
Motivations. Although these learning-based methods achieve remarkable results, they only use simple metadata, such as cardinalities and data statistics, that is easy to embed as features in a learning model. They learn the execution plans of each query separately, without considering the semantic equivalence among queries. However, for a user query, especially a long-running one, finding an equivalent query with a lower execution cost is also crucial for query optimization [10]. Specifically, equivalent queries return the same answer at different execution costs by executing different query plans.
Challenges. Although the information provided by equivalent queries may significantly improve query optimization performance, its effect on learned query optimization has been undervalued. To exploit this information comprehensively, we need to identify equivalent queries and make use of the equivalence information in learning-based query optimization models. Achieving this goal is challenging, as it raises two questions: (1) how to identify equivalent queries, and (2) how to learn, for each query, the plan with the best expected performance based on its equivalent queries.
Contributions. To answer these two questions, we develop Celo, a constraint-enhanced learned optimizer that directly integrates the knowledge of equivalent queries into learning-based models. We build our model upon the state-of-the-art learned optimizer Bao, but it can also be combined with other learned optimizers. For challenge (1), we propose an equivalent-SQL discovery method based on data correlations. We apply denial constraints (DCs) [11], a universally quantified first-order logic formalism, to represent diverse correlations among attribute values. We then generate equivalent predicates according to the denial constraints, so that equivalent queries can be identified by replacing equivalent predicates. For challenge (2), we generate candidate plans for equivalent queries under the different hint sets and select among them with a tree convolutional neural network (TCNN) predictive model. When running a query, we augment its query plans with the learned best plans of its equivalent queries, thereby enlarging the set of candidate best plans, over which the best query plan can then be predicted.
We summarize our contributions as follows:
(1) We recognize that the equivalence information among queries is critical in learning-based query optimization models. We propose Celo, a novel learned optimizer that directly integrates the information of equivalent queries into the learning-based model.
(2) We propose identifying equivalent SQL statements using constant DCs. We also address how to use equivalent queries to augment the alternative execution plans of queries for learning-based query optimization.
(3) We conducted extensive experiments on two real-world datasets. The experimental results demonstrate that Celo achieves significant improvements in cost and latency over traditional and learning-based query optimizers.

2. Related Work

In recent years, machine learning (ML) techniques have enabled automatic, fine-grained, and more accurate characterization of various problem spaces, thus benefiting a variety of tasks in query optimization [12]. These tasks can mainly be divided into cardinality and cost estimation [5,7,13,14,15,16,17,18,19,20,21,22,23,24], join order selection [6,25,26,27,28,29,30], and end-to-end optimization [1,31].
Cardinality and cost estimation are essential to selecting an optimal plan among multiple query plans. In [13,21], the authors construct a model for cardinality estimation based on fully connected neural networks. MSCN [5] builds upon the multi-set deep convolutional network to learn the cardinality of joins. In [7], the authors design a tree-structured model to efficiently learn a representation of each plan and estimate the cardinality and cost more accurately. In [14,15,23,24], the authors use an unsupervised density model to fit the underlying distribution of the dataset, ultimately to estimate the cardinality. Join order selection enumerates all candidate plans in the search space to find the near-optimal plan with the minimum estimated cost. Leo [28] proposes learning from its mistakes by adjusting its cardinality estimations over time to optimize join order. Inspired by Leo, DQ [6] and ReJoin [27] generate better join orders based on reinforcement learning combined with traditional cost models, but they cannot learn the latency of the query plan. In [30], the authors deal with this problem by training based on cost and adding latency tuning for continuous training. Although these works demonstrate that learning-based cost and cardinality estimation and join order selection perform better than traditional ones, they do not show how to generate better query plans.
Cardinality, cost estimation, and join order selection are the components of the query optimizer, which converts the relational expression of the query into an execution plan. Neo [8] is the first end-to-end learned optimizer that uses a TCNN to predict the plan latency and then performs a DNN-guided best-first search strategy to replace the plan enumeration. Bao [9] learns to provide several hints to a traditional optimizer for each query and uses TCNN to select the best query plan generated by the optimizer under different hints. Compared with these methods, Celo introduces (1) query rewriting based on constant DCs to generate equivalent SQL statements and (2) query plan augmentation based on equivalent SQL statements.
There also exist several query optimization works using rules. CORDS [32] discovers correlations between pairs of attributes and provides better selectivity estimation for queries involving conjunctive predicates. In [33], the authors use functional dependencies (FDs) for join elimination to reduce query response time. In [34], the authors propose a method to determine data dependencies based on a given workload and use them to generate more efficient query plans. However, these methods do not combine rules with machine learning approaches. In contrast, our method, Celo, uses machine learning together with rules to build a more efficient optimization system. Celo can also enhance the accuracy of cost estimation by learning a cost model directly from query execution history, thereby providing more precise cost predictions. It addresses the limitations of traditional optimizers, which rely on static statistics (such as table size and index selectivity) for cost estimation and struggle to cope with changes in data distribution, query complexity, and multidimensional correlations. This effectively improves the optimizer's decision-making, allowing the selection of better execution plans and enhanced query performance.

3. Preliminaries of Learned Query Optimization

This section briefly describes the query plans generated for queries and the TCNN model for query plan cost prediction in the Bao optimizer, and then defines the problem of constraint-enhanced learned query optimization.

3.1. Query Plan Generation

Traditional query optimizers in relational database management systems (DBMSes) usually transform each query into several physical query execution plans according to predicate reordering, join operator ordering, and the selection of execution strategies for operators with different implementations (e.g., index/sequential scan, loop/merge join, or sort). Generally, a query plan describes a sequence of operations modeled as a directed acyclic graph, where each node describes an operation and each edge indicates a dependency between two nodes. To process such a plan, child nodes are executed first, and the output of each child is fed into its parent. Among a set of physical query plans, query optimization aims to find the optimal plan w.r.t. execution time. To limit and steer the search space of the traditional optimizer, Bao only allows a subset of execution strategies (i.e., hint sets) to be used for the plan generation of each query, as illustrated in Figure 1.
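To make the notion of a hint set concrete, the sketch below expresses Bao-style hint sets as combinations of PostgreSQL's boolean planner flags. The enable_* settings are real PostgreSQL parameters, but the particular combinations shown are illustrative assumptions, not Bao's exact sets.

```python
# A minimal sketch of Bao-style hint sets, assuming plans are fetched
# through psycopg2. The enable_* flags are standard PostgreSQL planner
# settings; the combinations are illustrative, not Bao's exact sets.
import psycopg2

HINT_SETS = [
    {},                                                # full search space
    {"enable_nestloop": "off"},
    {"enable_nestloop": "off", "enable_mergejoin": "off"},
    {"enable_hashjoin": "off"},
    {"enable_seqscan": "off"},
]

def plan_under_hints(conn, sql, hints):
    """Constrain the planner with one hint set, then ask for its plan."""
    with conn.cursor() as cur:
        for flag, value in hints.items():
            cur.execute(f"SET {flag} = {value};")
        cur.execute("EXPLAIN (FORMAT JSON) " + sql)
        return cur.fetchone()[0]   # the plan tree chosen under the hints
```

Each hint set yields one candidate plan per query, so a predictive model can then choose among them.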

3.2. The TCNN Model for Query Plan Cost Prediction

Because the operations of an execution plan are structured as nodes in a tree, the SOTA query optimizers Neo [8] and Bao [9] adopt a TCNN to predict the cost of each physical query plan. Specifically, tree convolution slides a set of shared filters over each part of the query plan, thus capturing a wide variety of local parent–child relations. The architecture of the TCNN prediction model used in Bao is shown in Figure 2. To generate the inputs for the TCNN model, Bao first transforms the query plan tree into a binary tree; for instance, a "null" node is inserted as the right child of the non-binary operation "aggregation". Then, Bao generates a tree of vectors for the binary query plan tree by encoding each operator as a vector containing a one-hot encoding of the operator along with cardinality and cost information. With the vectorized tree as input, three layers of tree convolution are applied to learn complex features of the tree patterns (i.e., chains of operators). Dynamic pooling [35] is used to flatten the tree structure into a single vector. A linear layer, a ReLU [36] activation function, and another linear layer map the pooled vector to the cost prediction.
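The following PyTorch sketch mirrors the architecture just described: shared filters over (node, left child, right child) triples, dynamic pooling, then linear, ReLU, linear. The tensor layout (a node matrix plus child index vectors) and the layer widths are our own assumptions for illustration, not Bao's exact implementation.

```python
import torch
import torch.nn as nn

class TreeConv(nn.Module):
    """One tree-convolution layer over a binarized plan tree."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.filter = nn.Linear(3 * in_dim, out_dim)

    def forward(self, nodes, left, right):
        # nodes: (num_nodes, in_dim); row 0 is an all-zero "null" node,
        # so leaves and unary operators point missing children at it.
        triples = torch.cat([nodes, nodes[left], nodes[right]], dim=-1)
        return torch.relu(self.filter(triples))

class PlanCostModel(nn.Module):
    """Three tree-conv layers -> dynamic pooling -> MLP -> cost."""
    def __init__(self, feat_dim, dims=(256, 128, 64)):
        super().__init__()
        widths = (feat_dim,) + dims
        self.convs = nn.ModuleList(
            [TreeConv(i, o) for i, o in zip(widths, widths[1:])])
        self.head = nn.Sequential(
            nn.Linear(dims[-1], 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, nodes, left, right):
        for conv in self.convs:
            nodes = conv(nodes, left, right)
        pooled = nodes.max(dim=0).values   # flatten the tree to one vector
        return self.head(pooled)           # predicted plan cost
```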
After a plan is selected by the TCNN model, the plan is sent to a query execution engine to obtain its actual performance. Then, the selected query plan and its actual performance are added to experience for TCNN model training. In Bao’s system, the TCNN model is trained with Thompson sampling [37] in order to pick the best set of execution strategies (hints) for each query. The network structure of TCNN can be seen in Table 1.

3.3. Problem Definition

As discussed above, existing learned query optimization methods select the optimal plan for each query by predicting the cost of its query plans individually, which undervalues the correlations among equivalent queries. Therefore, to further explore the effectiveness of equivalent queries for learned query optimization, in this work we use the subset of integrity constraints expressible as DCs over a given database to discover equivalent queries. Note that in this paper we are only interested in constant DCs; we leave discovering equivalent queries through variable DCs for future work. However, existing methods [38] can easily be integrated into our model to discover equivalent queries according to special variable DCs such as functional dependencies (FDs), as demonstrated in the experimental part. In the following, we first define several important concepts and then formally describe the constraint-enhanced learned query optimization problem.
Consider a given database $D$ containing $N$ tables, $D = \{T_1, T_2, \ldots, T_N\}$. Each table $T_n$ is a collection of tuples, where each tuple $t_\alpha \in T_n$ is made up of the values of attributes $\{A_1, A_2, \ldots, A_n\}$. Given a set of operators $B = \{=, <, >, \leq, \geq, \neq\}$, each DC is a first-order formula of the form $\varphi: \forall t_\alpha \in T_n, \neg(C_1 \wedge \cdots \wedge C_Z)$, where each clause $C_z$ is of the form $t_\alpha[A_i]\ \phi\ c$, with $c$ a constant, $\phi \in B$, and $A_i \in \{A_1, A_2, \ldots, A_n\}$. The database $D$ satisfies $\varphi$, denoted $D \models \varphi$, if the data in $D$ meet all the requirements defined in $\varphi$. Given a set $\Sigma$ of DCs, $D \models \Sigma$ if and only if $\forall \varphi \in \Sigma$, $D \models \varphi$. With the help of denial constraints, we define the constraint-based learned query optimization problem as follows:
Input: A given database $D$, a set $\Sigma$ of DCs defined on $D$, a set $Q$ of user queries, and a set $Q'$ of sample queries, both submitted as SQL through a DBMS on $D$.
Output: High-quality query plans for the queries in $Q$, learned from the past experiences of $Q'$ and $Q$.
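As a concrete reading of the semantics just defined, here is a minimal sketch that checks whether a table satisfies a constant DC; the representation of tuples as dictionaries and clauses as (attribute, operator, constant) triples is a simplifying assumption.

```python
import operator

# The operator set B from the definition above.
OPS = {"=": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def satisfies_dc(table, clauses):
    """D |= phi: a constant DC holds iff no tuple satisfies all of its
    clauses C_1 ... C_Z at the same time."""
    return not any(
        all(OPS[op](t[attr], const) for attr, op, const in clauses)
        for t in table)

# DC (1) of Example 2 in Section 4.3: ids strictly between 5,486,631 and
# 7,020,540 force info_type_id = 3.
dc = [("id", ">", 5486631), ("id", "<", 7020540), ("info_type_id", "!=", 3)]
rows = [{"id": 6000000, "info_type_id": 3}, {"id": 100, "info_type_id": 7}]
assert satisfies_dc(rows, dc)
```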

4. Constraint-Enhanced Query Optimizer

In this section, we propose the Celo model. We start by describing the DBMS architecture in Section 4.1 and the overall framework in Section 4.2. We then detail the technical solutions for the key components, equivalent query discovery, query plan augmentation, and TCNN model training, in Section 4.3, Section 4.4, and Section 4.5, respectively.

4.1. The DBMS Architecture

Celo is integrated into the query optimizer module of the DBMS, focusing on optimizing the generation of query plans without interfering with other components of the DBMS. As illustrated in Figure 3, the query processing flow after integrating Celo into PostgreSQL is as follows:
  • Parse. The query is passed to the parser, which analyzes the SQL statement based on database syntax rules. The database retrieves the relevant table definitions, validates the query, and converts it into a parse tree.
  • Rewrite. The parse tree is then passed to the rewriter. The rewriter modifies and standardizes the query based on the rules and views defined in the database tables, generating a rewritten parse tree. This step supports view expansion, rule application, and other semantic-based rewrites.
  • Optimization. The rewritten parse tree is sent to the query optimizer, where Celo plays a key role. At this stage, Celo intervenes by setting query hints in PostgreSQL. The traditional optimizer uses statistics collected from the database tables, such as table size, data distribution, and indexes, to generate an optimized query plan tree. Celo further enhances the plan by extending the query plan set using denial constraints (DCs) and predicting the best query plan through a TCNN model.
  • Execution. The query plan generated by the optimizer is passed to the executor. The executor follows the query plan, accessing and manipulating the actual data in the database tables, and ultimately returning the results to the user.

4.2. The Celo Model Overview

An illustration of the Celo framework is shown in Figure 4. We introduce the Celo model in four parts:
  • Plan expertise collection. This part encompasses two elements: the query optimizer and the executed plans. Our model assumes the existence of a sample workload $Q'$ consisting of queries representative of the user's total workload. For each query in $Q'$, the query optimizer generates multiple query plans and executes the optimal one using Bao. The query, its execution plan, and its latency are added to the experience module for model training, as well as to provide equivalent plans for the user queries in $Q$.
  • Equivalent query discovery. This part encompasses three elements: constant DCs, equivalent predicates, and equivalent queries. We design a constraint-based discovery method built on a set of denial constraints (DCs). We convert each constraint into a pair of equivalent predicates $\{predicate_a, predicate_b\}$. By replacing the predicates in a query in $Q$ with their equivalent counterparts, we obtain equivalent queries.
  • Query plan augmentation. This part encompasses three elements: experience, equivalent plans, and augmented plan sets. For each equivalent query identified in the previous step, we search for its query plan in the experience, which serves as an equivalent plan. We then add these plans to the set of query plans generated by the database query optimizer, expanding it into the augmented plan set. Thus, Celo overcomes the drawback of the limited candidate plans produced by a fixed set of hints, offering more high-quality query plans than Bao. The TCNN model can then predict the optimal query plan among the augmented query plans, which may contain more potentially well-performing plans for execution.
  • TCNN model training. This part encompasses two elements: training and the TCNN model. Once query execution is complete, the combination of the query itself, the selected query plan, and the execution latency is added to Celo's experience as training data to incrementally improve the TCNN model. Celo retrains the TCNN model after executing a fixed number of queries, allowing the model to adapt to the current workload and making the retrained model more reliable in selecting the best set of hints for each query in $Q$. When retraining, a specified amount of training data is retrieved from the experience and used in the training process.

4.3. Equivalent Query Discovery

Example 1.
Table 2 shows four queries generated from the IMDB dataset [7] and the latencies obtained from Bao and our method Celo, respectively. Suppose that queries $\{q_1, q_2\}$ and $\{q_3, q_4\}$ are two equivalence query pairs.
First, we analyze the latencies obtained from Bao, which enhances a traditional optimizer by learning to provide a limited set of hints for the plan generation of each query. The query plan with the best expected performance is then predicted by the trained TCNN. On one hand, by learning limited hints, Bao reduces the search space of the traditional optimizer and can thus improve the latency of each query over the traditional optimizer in most cases. On the other hand, a limited set of hints also leads to a restricted action space, which means Bao cannot always learn the best possible query plan from the traditional optimizer. Considering the equivalence query pair $\{q_1, q_2\}$, the latencies obtained from Bao are 0.39 s and 84.81 s, respectively. The results show that Bao does not learn the best query plan for $q_2$. Similarly, for the equivalence query pair $\{q_3, q_4\}$, Bao fails to learn a better query plan for $q_4$.
In our method, Celo, we consider learning the best query plan by augmenting the query plans of equivalent queries according to the constraints of data correlations. For instance, for the equivalence query pair $\{q_1, q_2\}$, Celo augments the query plans of $q_2$ with those of $q_1$, which leads to the prediction of a better query plan. As a result, Celo improves the latency of $q_2$ to 0.43 s, close to the latency of $q_1$. Similarly, Celo improves the latency of $q_4$ to 50.48 s.
To augment the query plans of a given user query $q \in Q$, we discover its equivalent queries with the help of constant DCs. Recall that a DC $\varphi: \forall t_\alpha \in T_n, \neg(C_1 \wedge \cdots \wedge C_Z)$ is a first-order formula made up of clauses, where each clause $C_z$ has the form $t_\alpha[A_i]\ \phi\ c$, $\phi \in B$. The semantics is that, for each tuple $t_\alpha$, its attribute values must not satisfy all the clauses in the DC simultaneously. In this way, a DC builds dependencies among its clauses: if $(Z-1)$ clauses in $\{C_1, \ldots, C_Z\}$ are satisfied, the remaining clause $C_z$ must not be satisfied, denoted $\{C_1 \wedge \cdots \wedge C_{z-1} \wedge C_{z+1} \wedge \cdots \wedge C_Z\} \Rightarrow \neg C_z$. Intuitively, the tuple set $T_a$ that satisfies $\{C_1 \wedge \cdots \wedge C_{z-1} \wedge C_{z+1} \wedge \cdots \wedge C_Z\}$ must be a subset of the tuple set $T_b$ that satisfies $\neg C_z$; hence $T_a \cap T_b = T_a$. Thus, a DC $\varphi$ can be used to generate a pair of equivalent predicates $\{p_a, p_b\}$ in SQL, where $p_a = \{C_1\ \mathrm{AND} \cdots \mathrm{AND}\ C_{z-1}\ \mathrm{AND}\ \neg C_z\ \mathrm{AND}\ C_{z+1}\ \mathrm{AND} \cdots \mathrm{AND}\ C_Z\}$ and $p_b = \{C_1\ \mathrm{AND} \cdots \mathrm{AND}\ C_{z-1}\ \mathrm{AND}\ C_{z+1}\ \mathrm{AND} \cdots \mathrm{AND}\ C_Z\}$. For a given query $q$, its equivalent queries can then be obtained by replacing the predicate in $q$ with its corresponding equivalent predicate according to $\varphi$.
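A minimal sketch of this predicate rewriting, under the same simplified clause representation as in Section 3.3: given a constant DC, dropping the implied negated clause yields the shorter member of the equivalent pair.

```python
def equivalent_predicates(dc_clauses, z):
    """From a constant DC not(C_1 and ... and C_Z), derive an equivalent
    SQL predicate pair {p_a, p_b}: p_a keeps the negation of C_z
    explicitly, p_b drops it, since the remaining clauses imply it."""
    NEGATE = {"=": "!=", "!=": "=", "<": ">=", ">": "<=",
              "<=": ">", ">=": "<"}
    rest = [c for i, c in enumerate(dc_clauses) if i != z]
    attr, op, const = dc_clauses[z]
    neg_cz = (attr, NEGATE[op], const)

    def to_sql(clauses):
        return " AND ".join(f"{a} {o} {c!r}" for a, o, c in clauses)

    p_a = to_sql(rest + [neg_cz])   # same answer set, with the redundant clause
    p_b = to_sql(rest)              # shorter predicate
    return p_a, p_b

# DC (1) of Example 2 below, taking C_z = (info_type_id != 3):
dc = [("movie_info.id", ">", 5486631), ("movie_info.id", "<", 7020540),
      ("movie_info.info_type_id", "!=", 3)]
p_a, p_b = equivalent_predicates(dc, z=2)
# p_a ends in "... AND movie_info.info_type_id = 3"; p_b drops that clause.
```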
Example 2.
Consider Example 1. Suppose that we have two DCs defined on the tables of the IMDB dataset: (1) $\forall t_\alpha \in T_{\mathrm{movie\_info}}, \neg(t_\alpha[id] > 5{,}486{,}631 \wedge t_\alpha[id] < 7{,}020{,}540 \wedge t_\alpha[info\_type\_id] \neq 3)$, which states that for each tuple in the movie_info table, if its id is greater than 5,486,631 and less than 7,020,540, then its info_type_id must be 3; (2) $\forall t_\alpha \in T_{\mathrm{name}}, \neg(t_\alpha[id] > 1{,}739{,}579 \wedge t_\alpha[id] < 2{,}701{,}135 \wedge t_\alpha[gender] \neq \text{'f'})$, which states that for each tuple in the name table, if its id is greater than 1,739,579 and less than 2,701,135, then its gender must be 'f'. Based on these two DCs, we can infer two pairs of equivalent predicates: {movie_info.id > 5,486,631 AND movie_info.id < 7,020,540 AND movie_info.info_type_id = 3, movie_info.id > 5,486,631 AND movie_info.id < 7,020,540} and {name.id > 1,739,579 AND name.id < 2,701,135 AND name.gender = 'f', name.id > 1,739,579 AND name.id < 2,701,135}, which are highlighted in bold in Table 2. According to these equivalent predicates, we obtain the equivalence query pairs $\{q_1, q_2\}$ and $\{q_3, q_4\}$.

4.4. Query Plan Augmentation

For a query $q \in Q$, if an equivalent query $q' \in Q' \cup Q$ exists, we can augment the candidate query plans $P_q = \{p_q^1, \ldots, p_q^n\}$ of $q$ with the plans $P_{q'} = \{p_{q'}^1, \ldots, p_{q'}^n\}$ of $q'$ to let the TCNN model select a better-performing one. However, generating the query plans of $q'$ may require additional time. To avoid this cost, we only make use of the plan of an equivalent query $q'$ that has been previously executed, i.e., whose execution information can be observed in the Experience module. Note that during the optimization of $q'$, the optimal plan $p_{q'}^{opt} \in P_{q'}$ and its latency have been predicted by the TCNN model, and the plan has been executed to obtain its actual latency; this execution information is used for retraining the TCNN model. Thus, the TCNN model gives a more accurate estimated latency for $p_{q'}^{opt}$ when seeing it again. Intuitively, $p_{q'}^{opt}$ is more likely to provide a well-performing query plan for $q$ than the other plans of $q'$. We therefore augment the candidate query plans $P_q$ of $q$ with $p_{q'}^{opt}$, i.e., $P_q^{aug} = \{p_q^1, \ldots, p_q^n, p_{q'}^{opt}\}$. With the augmented query plan set $P_q^{aug}$, the best-performing plan $p_{q'}^{opt}$ of $q'$ is added without extra plan-generation time.
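A sketch of this augmentation step under assumed data structures: the experience is a mapping from executed queries to their selected plans, and the equivalent plan is appended without triggering new plan generation.

```python
def augment_plans(candidate_plans, equivalent_queries, experience):
    """Append the previously executed optimal plan of an equivalent
    query (if any) to q's candidate plans. `experience` maps a query
    string to its (selected_plan, actual_latency) record; no new plan
    generation is needed, so augmentation is essentially free."""
    for q_eq in equivalent_queries:
        record = experience.get(q_eq)
        if record is not None:
            p_opt, _latency = record
            return candidate_plans + [p_opt]   # P_q^aug
    return candidate_plans   # no executed equivalent query found
```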

4.5. TCNN Model Training

After plan augmentation, for each query $q \in Q$, the TCNN predicts the execution time of each query plan in $P_q^{aug}$. Then, the best plan $p_q^{opt}$ with the shortest estimated execution time $t_q^{opt}$ is selected for execution. After execution, $(q, p_q^{opt}, t_q^{opt*})$ is added to Celo's experience $E$, where $t_q^{opt*}$ denotes the actual execution time of $p_q^{opt}$. As mentioned above, to enhance the prediction ability of the TCNN model, the model is retrained according to past experience. Specifically, we update the parameters of the model by minimizing a joint loss $L$ as follows:
$$L = \sum_{q \in Q} \left(t_q^{opt} - t_q^{opt*}\right)^2 + \sum_{q \in Q,\ q' \in E} \left(t_q^{opt} - t_{q'}^{opt}\right)^2, \quad (1)$$
where $t_{q'}^{opt}$ is the estimated execution time for $p_{q'}^{opt} \in P_q^{aug}$. The joint loss function $L$ consists of two parts. The first term describes the difference between the predicted execution time $t_q^{opt}$ and the actual execution time $t_q^{opt*}$; intuitively, $t_q^{opt}$ should be close to $t_q^{opt*}$. The second term describes the difference between the predicted execution time $t_q^{opt}$ of the optimal plan $p_q^{opt}$ of query $q$ and the predicted execution time $t_{q'}^{opt}$ of the augmented plan $p_{q'}^{opt}$ from the equivalent query $q'$. We define the second term so that the model predicts close times for the best plans of equivalent queries when $p_q^{opt} \neq p_{q'}^{opt}$. Then, for whichever of $q$ and $q'$ has the best plan with the longer estimated time, the model learns to generate better plans guided by the shorter estimate.
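A PyTorch sketch of Equation (1) under assumed tensor inputs: pred and actual hold the predicted and measured latencies of the selected plans for all user queries, while pred_paired and pred_eq hold, for the subset of queries whose equivalent query was found in the experience $E$, the predictions for the query's own best plan and for the augmented equivalent plan.

```python
import torch

def joint_loss(pred, actual, pred_paired, pred_eq):
    """Equation (1): fit predictions to actual latencies, and pull the
    predicted times of equivalent queries' best plans together."""
    fit = ((pred - actual) ** 2).sum()            # first sum over q in Q
    agree = ((pred_paired - pred_eq) ** 2).sum()  # second sum over pairs
    return fit + agree
```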
Following Bao, the training of the TCNN model follows the Thompson sampling mechanism [37]. Traditional Thompson sampling updates the experience pool and retrains the prediction model after each query execution. In the context of query optimization, as training neural networks is time-consuming, Celo retrains the TCNN model after every $n$ queries.

4.6. The Whole Procedure

The whole procedure of Celo is summarized in Algorithm 1. We first collect a set of experiences for a set of sample queries (Lines 1–4). Then, for each user query, we find its equivalent query in the experience and augment its plan set with the selected plan of its equivalent query (Lines 5–11). We use the TCNN model to select a plan from the augmented set and add its actual execution time to the experience (Lines 12–13). Finally, the TCNN model is retrained after every $n$ queries (Lines 16–19).
Algorithm 1 The whole procedure of Celo.
Input: A given database $D$, a set $\Sigma$ of DCs defined on $D$, a set $Q$ of user queries, a set $Q'$ of sample queries, and a parameter $n$.
Output: The high-quality query plans for the queries in $Q$.
1:  for $q' \in Q'$ do
2:      Generate $P_{q'}$ for $q'$ and select $p_{q'}^{opt} \in P_{q'}$ to execute using Bao;
3:      Add $(q', p_{q'}^{opt}, t_{q'}^{opt*})$ to experience $E$;
4:  end for
5:  Initialize $i = 1$;
6:  for $q \in Q$ do
7:      Generate $P_q$ for $q$ using Bao;
8:      Generate a set $Q_q^{equal}$ of equivalent queries for $q$ according to $\varphi \in \Sigma$;
9:      Search for an equivalent query $q' \in Q_q^{equal}$ in experience $E$;
10:     if $q'$ exists then
11:         Augment $P_q$ to $P_q^{aug}$ with $p_{q'}^{opt}$;
12:         Select $p_q^{opt} \in P_q^{aug}$ using the TCNN model and execute it;
13:         Add $(q, p_q^{opt}, t_q^{opt*})$ to experience $E$;
14:     end if
15:     $i = i + 1$;
16:     if $i == n$ then
17:         Retrain the TCNN model according to Equation (1);
18:         $i = 1$;
19:     end if
20: end for
21: return the selected query plans for $q \in Q$.
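The procedure above can be read as the following Python driver. Every callable is an assumed interface: bao_plans/bao_select wrap Bao (Section 3), equivalent_queries applies the DC rewrites of Section 4.3, and tcnn_select/retrain wrap the TCNN of Section 4.5.

```python
def run_celo(user_queries, sample_queries, n, bao_plans, bao_select,
             tcnn_select, equivalent_queries, execute, retrain):
    """A compact rendering of Algorithm 1; all callables are assumed
    interfaces to the components described above."""
    experience = {}                            # E: query -> (plan, latency)
    for q_s in sample_queries:                 # Lines 1-4: warm up with Bao
        plan = bao_select(bao_plans(q_s))
        experience[q_s] = (plan, execute(plan))

    i = 1                                      # Line 5
    for q in user_queries:                     # Lines 6-20
        plans = bao_plans(q)                   # Line 7
        hit = next((e for e in equivalent_queries(q) if e in experience),
                   None)                       # Lines 8-9
        if hit is not None:                    # Lines 10-14
            plans = plans + [experience[hit][0]]    # augment with p'_opt
            plan = tcnn_select(plans)
            experience[q] = (plan, execute(plan))
        else:
            # Not spelled out in Algorithm 1; we assume queries without an
            # executed equivalent fall back to Bao's own selection.
            execute(bao_select(plans))
        i += 1                                 # Line 15
        if i == n:                             # Lines 16-19
            retrain(experience)
            i = 1
    return experience
```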

5. PostgreSQL Integration

In this section, we describe how we build Celo upon Bao. As Bao is integrated into PostgreSQL, we likewise incorporate user-written code into PostgreSQL by loading it dynamically. Using hook functions, Bao enables PostgreSQL to generate query plans corresponding to given hints. Bao uses sockets to communicate between PostgreSQL and the TCNN, feeding the query plans generated by PostgreSQL into the TCNN for prediction and informing PostgreSQL of the predicted optimal plan. Building on Bao's extension and communication mechanisms for PostgreSQL, we detail the implementation of Celo's two critical parts.
First, Celo implements plan augmentation based on the discovered equivalent queries. When Celo executes a query $q \in Q$, PostgreSQL generates the candidate query plans $P_q = \{p_q^1, \ldots, p_q^n\}$ that correspond to the hint sets. Then, Celo searches the experience for the plan of an equivalent query $q' \in Q' \cup Q$. Celo stores the experience set in SQLite via Python's built-in sqlite3 module. If Celo has already executed an equivalent query $q'$, we can obtain the equivalent query plan $p_{q'}^{opt}$ with a SELECT statement. The equivalent plan $p_{q'}^{opt}$ is added to $P_q$ before the candidate query plans enter the TCNN, augmenting them to $P_q^{aug} = \{p_q^1, \ldots, p_q^n, p_{q'}^{opt}\}$.
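A sketch of the experience lookup with Python's built-in sqlite3 module; the table name and schema are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect("experience.db")
# Assumed schema: one row per executed query and its selected plan.
conn.execute("""CREATE TABLE IF NOT EXISTS experience
                (query TEXT PRIMARY KEY, plan TEXT, latency REAL)""")

def lookup_equivalent_plan(equivalent_queries):
    """Return the stored optimal plan of the first equivalent query
    that has already been executed, or None."""
    for q_eq in equivalent_queries:
        row = conn.execute(
            "SELECT plan FROM experience WHERE query = ?", (q_eq,)
        ).fetchone()
        if row is not None:
            return row[0]   # p_{q'}^{opt}, appended to P_q before the TCNN
    return None
```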
Second, Celo addresses how to execute the optimal plan predicted by the TCNN. As the TCNN model is separate from the execution engine, the plan selected by the TCNN cannot be sent directly to PostgreSQL for execution. The execution process is therefore divided into two cases: (1) If the optimal plan $p_q^{opt}$ predicted by the TCNN in $P_q^{aug} = \{p_q^1, \ldots, p_q^n, p_{q'}^{opt}\}$ belongs to $P_q = \{p_q^1, \ldots, p_q^n\}$, the TCNN provides PostgreSQL with the index $i \in \{1, \ldots, n\}$ of $p_q^{opt}$. As PostgreSQL keeps the correspondence between indexes and hint sets, it can regenerate the query plan for that hint set and execute it. (2) If the optimal plan predicted by the TCNN is $p_{q'}^{opt}$, its index is $n+1$. As this index exceeds the total number $n$ of hint sets, PostgreSQL cannot map it to a hint set. To solve this issue, we use pg_cancel_backend(), a cancellation function provided by PostgreSQL, to terminate the current query $q$ and reprocess it so that $p_{q'}^{opt}$ is executed.
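The two cases can be sketched as follows. run_with_hint_set and reprocess_with_plan are hypothetical helpers standing in for Celo's hook-based communication with PostgreSQL; pg_cancel_backend() is the real PostgreSQL cancellation function mentioned above, here invoked from a separate control connection.

```python
def execute_selected(selected_idx, n_hint_sets, query, worker_pid,
                     control_conn, run_with_hint_set, reprocess_with_plan):
    """Dispatch on the plan index the TCNN returns over P_q^aug.
    run_with_hint_set / reprocess_with_plan are hypothetical helpers."""
    if selected_idx <= n_hint_sets:
        # Case 1: the plan maps to hint set i in {1, ..., n}; PostgreSQL
        # regenerates the plan under that hint set and executes it.
        return run_with_hint_set(query, selected_idx)
    # Case 2: index n + 1 is the augmented equivalent plan, which no hint
    # set reproduces. Cancel the backend still holding q, then reprocess
    # the query so that p_{q'}^{opt} runs.
    control_conn.cursor().execute(
        "SELECT pg_cancel_backend(%s);", (worker_pid,))
    return reprocess_with_plan(query)
```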

6. Experiments

In this section, we evaluate the performance of Celo using real-world datasets. The experimental results clearly demonstrate the advantages of integrating plans of equivalent queries in learned query optimization. We first discuss the experimental setup in Section 6.1 and then provide the experimental results for the real-world datasets in Section 6.2.

6.1. Experimental Setup

  • Dataset.
Our experiments use two real-world datasets: IMDB and Hospital.
IMDB [7]. We use the Internet Movie Database (IMDB), an online database of information about movie actors, movies, movie production teams, etc. IMDB contains 21 tables, and we create indexes on the primary keys (PKs) and foreign keys (FKs) of these tables. Table 3 shows the constant DCs we discovered from the tables name (n), movie_info (mi), cast_info (ci), and movie_companies (mc) using the C-FASTDC method [11], together with the generated equivalent predicates. We use this dataset to test how effective Celo is when equivalent queries are discovered by generating equivalent predicates according to constant DCs.
Hospital (https://data.cms.gov/provider-data/search?theme=Hospitals, accessed on 1 July 2024). Hospital is a publicly available dataset that has been used in several studies [39,40]. It contains the tables hospital, state, and nation, and we create indexes on their PKs and FKs. We use the functional dependencies (FDs) provided in [39] and represent these FDs as DCs. We generate equivalent queries using the method of [41] for rewriting queries with FDs. Specifically, if an FD $B, C \rightarrow A$ holds and a grouping operation on the combination $B, C, A$ is to be executed, the grouping attribute $A$ is unnecessary, as all elements that fall into the same group for $B, C$ necessarily also share the same value of $A$. Thus, we can obtain a pair of equivalent grouping operations from the FDs, as shown in Table 4. We use this dataset to test how effective Celo is when equivalent queries are discovered by generating equivalent grouping operations according to variable DCs (FDs).
  • Training data generation.
In learning-based optimizers and methods that assist traditional optimizers [42,43,44,45,46], the training data for the models come from queries executed on databases. Commonly used databases and query sets include IMDB (the JOB queries), Stack, TPC-DS, and TPC-H, for which a desired number of queries can be generated from standard query templates. In our research, we instead generated queries involving denial constraints (DCs) on the IMDB and Hospital databases without using standard templates, specifically to highlight the performance improvements that DCs bring to the query optimizer.
To generate the training SQL statements, we build a join graph based on the PKs and FKs of the tables of each dataset. In the graph, a node represents a table, and an edge represents the PK-FK relationship between two tables, forming a join relation. According to the join graph, we randomly select 4∼5 join relations, one of which involves a table related to at least one DC. To generate predicates for a SQL statement, we first randomly choose 2∼4 columns from the tables. For a numeric column, we randomly select a value between the column's minimum and maximum as the operand and one of '>', '<', '=', '!=' as the operator. For a categorical column, we randomly pick a value, take a substring of it as the operand, and use LIKE or NOT LIKE as the operator. We then aggregate these predicates into complex predicates with AND/OR operators. For the projection, we randomly select columns and apply aggregation functions to them. In this way, we obtain a set of complex SQL statements.
For the IMDB dataset, we use the pairs of equivalent predicates in Table 3 to generate equivalent SQLs: we attach each member of an equivalent predicate pair to a SQL statement with AND, producing a pair of equivalent queries. According to the number of DCs involved, we obtain two workloads: (1) Workload 1 contains 3900 pairs of queries with one DC; (2) Workload 2 contains 6600 pairs of queries with two DCs. For the Hospital dataset, we use the pairs of equivalent grouping operations in Table 4 to generate equivalent SQLs: adding the two members of a pair to a SQL statement yields a pair of equivalent queries. We generate two workloads: (1) Workload 1 contains 3600 pairs of queries with one data constraint; (2) Workload 2 contains 4800 pairs of queries with two data constraints. A condensed sketch of the predicate generation is given below.
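This sketch condenses the predicate generation just described; the column metadata format is an assumption for illustration.

```python
import random

NUM_OPS = [">", "<", "=", "!="]
TEXT_OPS = ["LIKE", "NOT LIKE"]

def random_predicate(column, is_numeric, lo=0, hi=0, values=()):
    """One predicate: a random comparison for a numeric column, or a
    random substring pattern for a categorical column."""
    if is_numeric:
        return f"{column} {random.choice(NUM_OPS)} {random.randint(lo, hi)}"
    v = random.choice(values)
    i = random.randrange(len(v))
    return f"{column} {random.choice(TEXT_OPS)} '%{v[i:i + 3]}%'"

def complex_predicate(columns_meta, k):
    """AND/OR-aggregate k (2-4) randomly generated predicates.
    columns_meta is a list of kwargs dicts for random_predicate."""
    preds = [random_predicate(**random.choice(columns_meta))
             for _ in range(k)]
    clause = preds[0]
    for p in preds[1:]:
        clause += random.choice([" AND ", " OR "]) + p
    return clause
```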
  • Baselines.
The baseline methods are listed as follows:
Bao. Bao (https://github.com/learnedsystems/BaoForPostgreSQL, accessed on 1 July 2024) is a learned optimizer integrated into PostgreSQL. It allows PostgreSQL to generate equivalent query plans with different search spaces by setting limited hint sets. The generated plans are fed into TCNN to predict the best one to execute.
PostgreSQL. PostgreSQL (https://www.postgresql.org/ftp/source/v12.13/, accessed on 1 July 2024) is a powerful, open-source relational database management system that can be extended to run user-supplied code in a separate process. Celo and Bao are built on PostgreSQL's traditional optimizer, leveraging machine learning to improve optimization performance; PostgreSQL is therefore included in the comparison to verify whether incorporating machine learning models enhances optimization effectiveness. For the parameter configuration, we set work_mem to 2 GB, shared_buffers to 4 GB, effective_cache_size to 32 GB, and geqo_threshold to 18.
LEON. LEON (https://github.com/haitianchen/LEON, accessed on 1 July 2024) is also a machine learning framework that integrates with PostgreSQL to complement the traditional optimizer. It enumerates equivalence sets based on dynamic programming (DP) with pruning, then applies learning-to-rank (LTR) techniques from recommender systems to sort the equivalence sets, taking the top-ranked query plan as the final result. Unlike Bao, it does not predict the latency of each query plan but rather the ranking among plans.
  • Implementation details.
Our experiments are performed on a server with an 8-core CPU, 64 GB of RAM, and an NVIDIA GeForce GTX 1080 GPU. Each core has two threads, and the CPU is an Intel(R) Core(TM) i7-10700K @ 3.80 GHz. The operating system is Ubuntu 18.04 LTS. The model implementation is based on PyTorch (Python 3.8). For both Bao and Celo, we set the number of hint sets to 5. The TCNN model is trained with Adam using a batch size of 16 and runs until either 100 epochs elapse or convergence is reached. We retrain the predictive model every n = 200 queries. We execute all the queries in the workloads and record the execution times for analysis. For Celo, we randomly select one query from each pair of equivalent queries in each workload to generate the sample SQLs and treat the remaining queries as user SQLs. For LEON, the model begins with an MLP, a three-layer fully connected network that sequentially maps the input features to 128, 64, and 32 dimensions, each layer followed by layer normalization and a Leaky ReLU activation. This is followed by the tree convolution component, which includes three tree convolution layers mapping the input from 32 to 512, 256, and 128 dimensions, each with normalization and Leaky ReLU activation, ending with max pooling. Finally, a second MLP, a three-layer fully connected network, maps the features from 128 to 64, 32, and ultimately 1, with the middle two layers incorporating layer normalization and Leaky ReLU activation.

6.2. Experimental Results

6.2.1. Execution Time

We first evaluate the effectiveness of Celo and the baseline methods by comparing execution time. Figure 5 shows the time profiles of PostgreSQL, Bao, LEON, and Celo for all workloads on both datasets as the queries progress. Overall, on the IMDB and Hospital datasets, as the number of executed queries increases, the total execution time of Celo gradually becomes smaller than that of the other three baseline methods.
Compared with PostgreSQL and Bao, Celo demonstrates significantly superior performance on both datasets. As the number of executed queries increases, the difference in total execution time between them grows. It is worth noting that, in Figure 5c,d on the Hospital dataset, Bao's latency is greater than that of PostgreSQL. Since the Hospital dataset is relatively simple, PostgreSQL's execution time comes only from the traditional optimizer and executor. In contrast, Bao must also use a model to predict over multiple query plans, and the time required for these predictions may exceed the actual query execution time, leading to worse performance than PostgreSQL. Celo, even while using model predictions, still has lower latency than PostgreSQL. This is because Celo can identify equivalent queries and thereby expand the pool of selectable query plans, allowing it to choose a query plan whose execution time is far less than that of the plan chosen by PostgreSQL; the time spent on model prediction is thus amortized, resulting in a lower total execution time. These results demonstrate the advantage of using DCs in Celo to expand the query plan space and thereby improve the performance of the learning-based optimizer.
Compared with the LEON-based learned query optimizer, Celo also achieves better query latency. On the IMDB dataset, shown in Figure 5a,b, LEON's execution time is less than or equal to that of the other three methods for the first 2000 queries, with its advantage more apparent in Figure 5a, where it is the method with the lowest latency. However, as the number of executed queries increases, Celo gradually surpasses LEON, though LEON still performs better than PostgreSQL and Bao. LEON assists the traditional optimizer by maintaining finer-grained knowledge (such as histograms) and can exploit the traditional query optimizer to obtain more training data, accelerating the training process. In contrast, both Bao and Celo can only learn knowledge at the level of the entire query plan. Before 2000 queries, due to the lack of equivalent plan data in Celo's experience pool, its performance is not as good as LEON's. As more queries are executed, more equivalent plans are stored in Celo's experience pool, gradually reducing its execution time relative to LEON.
In conclusion, in long-term database usage scenarios, Celo exhibits better performance.

6.2.2. Performance Relative to Bao

We normalize the total execution time of Bao to 1.0 and analyze the performance of PostgreSQL and Celo relative to Bao, as shown in Figure 6.
PostgreSQL and Bao show different relative performance on the IMDB and Hospital datasets. On the IMDB dataset, Bao outperforms PostgreSQL, which is not the case on the Hospital dataset. The reason is that the Hospital dataset is much smaller than IMDB and its queries are much more straightforward, whereas Bao is better suited to complex queries that take longer to process.
Celo has a shorter execution time than PostgreSQL and Bao for both datasets’ workloads, demonstrating Celo’s effectiveness in augmenting the query plans of equivalent queries. In the IMDB dataset, Figure 6a shows that compared with Bao, Celo’s total execution time is reduced by 6% in Workload 1 and 8% in Workload 2. In the Hospital dataset, Figure 6b shows that compared with Bao, Celo’s total execution time decreases by 14% in Workload 1 and 21% in Workload 2. We can see that Celo performs better on Workload 2 than Workload 1 in both datasets, indicating that more DCs contribute to discovering equivalent queries and well-performed plans.

6.2.3. Absolute Performance Improvement

In this part, we show the absolute performance improvement of Celo compared with Bao. We first hold out a subset of IMDB's Workload 1, then execute the remaining queries in the workload and retrain Bao and Celo with these queries' experiences. After training, we run the held-out queries, which Bao and Celo have never seen before. In Figure 7, we show each query's latency difference (y-axis) between PostgreSQL and Celo/Bao, where the x-axis represents the query ID in the subset. Compared with PostgreSQL, Celo shortens the latency of several queries by more than 100 s, while Bao does not. Also, Celo achieves noticeable improvements over more queries than Bao.

6.2.4. Latency Distribution Analysis

To explore the effectiveness of Celo at different latency percentiles, we show the 20%, 30%, 50%, 60%, 80%, 90%, 95%, and 99.5% latencies for each workload on both datasets in Figure 8.
For Workloads 1 and 2 of the IMDB dataset, Bao drastically decreases tail latencies compared with PostgreSQL. However, Bao performs worse than PostgreSQL at the 20% to 60% percentiles in most cases, especially in Workload 2. In contrast, in Workload 1, Celo reduces all percentiles compared with both PostgreSQL and Bao. In Workload 2, Celo also achieves the lowest latencies in most cases. The 95% to 99.5% latencies of Celo are higher than those of Bao, which indicates that Celo produces slower, though not disastrous, plans during the learning process. In the 60% to 90% percentile range, LEON performs better; but in the 90% to 99.5% range, its performance is not as good as that of Celo and Bao. This indicates that for more complex and time-consuming queries, enumerating and ranking query plans with a DP search strategy does not predict well at the coarse-grained query plan level.
For the Hospital dataset, Bao performs better than PostgreSQL at all percentiles for all workloads. However, Celo still reduces the latencies at all percentiles compared with PostgreSQL and Bao, indicating the generality of Celo. In the 60% to 80% percentile range, LEON maintains lower latency, but in the 80% to 99.5% range it still does not perform as well as Celo. This further confirms that LEON is better at handling mid-range latencies, while Celo excels at both the head and tail latencies.

6.2.5. Performance Evaluation over Constraint-Independent Queries

In this part, we explore Celo's effectiveness on queries for which no equivalent query can be discovered, using the IMDB dataset.
We generate two constraint-independent test query sets using the complex SQL generation method of Section 6.1; the two sets contain 500 and 360 queries, respectively. For Bao and Celo, we execute one test query set with the model trained on Workload 1 and the other with the model trained on Workload 2. For PostgreSQL, the test sets are executed directly. We analyze the total execution time of the three optimizers on the two sets of constraint-independent queries. As shown in Figure 9, the y-axis is the number of queries executed, and the x-axis is the total time to run these queries. On both test sets, the total execution time of Celo is lower than that of Bao and PostgreSQL. This result shows that, by learning the equivalent plans of equivalent queries, Celo also optimizes constraint-independent queries better than Bao and PostgreSQL.
Figure 10a shows the plans selected by Celo and Bao for a query executed in the middle-to-late part of the workload, where Celo's and Bao's predictive models have been trained with four thousand experiences. Bao still chooses a poorly performing plan among all the alternatives, like the one selected by PostgreSQL. In contrast, Celo selects a more efficient plan: in terms of execution time, the plan Celo chooses is 100 s faster than the plan selected by Bao.
In Figure 10b, we show the execution plans of a query late in the workload; Bao still chooses the same plan as PostgreSQL. In contrast, Celo chooses a plan with a different join order and different physical operators, reducing the query time from 124 s to 3 s. For both queries, Celo selects a better-performing plan than Bao, significantly improving query efficiency.

7. Conclusions

In this paper, we observe that the effect of equivalent queries on learned query optimization has been undervalued. To take advantage of this information, we propose a novel method, Celo, which integrates the well-performing plans of equivalent queries, identified via constant DCs, into a learning-based model to improve query optimization. Our experimental results on two real-world datasets verify the effectiveness of Celo, which achieves better optimization than the baseline methods.
In the future, we plan to extend Celo to variable DCs. This study applies constant denial constraints and functional dependencies to query optimization to improve query performance. However, the types of constraints in databases are far more diverse, including inclusion dependencies, order dependencies, multivalued dependencies, join dependencies, and domain constraints. In future research, combining different types of constraints for joint optimization could enable the optimizer to consider multiple constraint conditions comprehensively, achieving multi-constraint optimization and generating more efficient query plans.

Author Contributions

Conceptualization, C.Y.; Data curation, H.D.; Formal analysis, C.Y.; Funding acquisition, C.Y. and G.D.; Methodology, H.D.; Project administration, C.Y.; Resources, C.Y.; Software, H.D.; Supervision, C.Y. and G.D.; Validation, H.D.; Writing—original draft, C.Y.; Writing—review and editing, H.Z. and Y.W. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was partially supported by the National Natural Science Foundation of China (No. 62202132), the Natural Science Foundation of Zhejiang Province (No. LQ22F020032), and the National Key Research and Development Program of China (No. 2022YFE0199300).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lan, H.; Bao, Z.; Peng, Y. A Survey on Advancing the DBMS Query Optimizer: Cardinality Estimation, Cost Model, and Plan Enumeration. Data Sci. Eng. 2021, 6, 86–101. [Google Scholar] [CrossRef]
  2. Poosala, V.; Ioannidis, Y.E. Selectivity Estimation without the Attribute Value Independence Assumption. In Proceedings of the 23rd International Conference on Very Large Data Bases, Athens, Greece, 25–29 August 1997; pp. 486–495. [Google Scholar]
  3. Wu, Y.; Patel, J.M.; Jagadish, H.V. Structural Join Order Selection for XML Query Optimization. In Proceedings of the 19th International Conference on Data Engineering (Cat. No.03CH37405), Bangalore, India, 5–8 March 2003; pp. 443–454. [Google Scholar]
  4. Weiss, C.; Karras, P.; Bernstein, A. Hexastore: Sextuple indexing for semantic web data management. Proc. VLDB Endow. 2008, 1, 1008–1019. [Google Scholar] [CrossRef]
  5. Kipf, A.; Kipf, T.; Radke, B.; Leis, V.; Boncz, P.A.; Kemper, A. Learned Cardinalities: Estimating Correlated Joins with Deep Learning. In Proceedings of the 9th Biennial Conference on Innovative Data Systems Research (CIDR’19), Asilomar, CA, USA, 13–16 January 2019. [Google Scholar]
  6. Krishnan, S.; Yang, Z.; Goldberg, K.; Hellerstein, J.M.; Stoica, I. Learning to Optimize Join Queries With Deep Reinforcement Learning. CoRR 2018. abs/1808.03196. [Google Scholar]
  7. Sun, J.; Li, G. An End-to-End Learning-based Cost Estimator. Proc. VLDB Endow. 2019, 13, 307–319. [Google Scholar] [CrossRef]
  8. Marcus, R.C.; Negi, P.; Mao, H.; Zhang, C.; Alizadeh, M.; Kraska, T.; Papaemmanouil, O.; Tatbul, N. Neo: A Learned Query Optimizer. Proc. VLDB Endow. 2019, 12, 1705–1718. [Google Scholar] [CrossRef]
  9. Marcus, R.; Negi, P.; Mao, H.; Tatbul, N.; Alizadeh, M.; Kraska, T. Bao: Making Learned Query Optimization Practical. In Proceedings of the 2021 International Conference on Management of Data, Xi’an, China, 20–25 June 2021; pp. 1275–1288. [Google Scholar]
  10. Zhou, X.; Li, G.; Chai, C.; Feng, J. A learned query rewrite system using monte carlo tree search. Proc. VLDB Endow. 2021, 15, 46–58. [Google Scholar] [CrossRef]
  11. Chu, X.; Ilyas, I.F.; Papotti, P. Discovering Denial Constraints. Proc. VLDB Endow. 2013, 6, 1498–1509. [Google Scholar] [CrossRef]
  12. Zhou, X.; Chai, C.; Li, G.; Sun, J. Database Meets Artificial Intelligence: A Survey. IEEE Trans. Knowl. Data Eng. 2022, 34, 1096–1116. [Google Scholar] [CrossRef]
  13. Dutt, A.; Wang, C.; Nazi, A.; Kandula, S.; Narasayya, V.R.; Chaudhuri, S. Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 2019, 12, 1044–1057. [Google Scholar] [CrossRef]
  14. Hasan, S.; Thirumuruganathan, S.; Augustine, J.; Koudas, N.; Das, G. Multi-Attribute Selectivity Estimation Using Deep Learning. CoRR 2019. abs/1903.09999. [Google Scholar]
  15. Heimel, M.; Kiefer, M.; Markl, V. Self-Tuning, GPU-Accelerated Kernel Density Models for Multidimensional Selectivity Estimation. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, Melbourne, Australia, 31 May–4 June 2015; pp. 1477–1492. [Google Scholar]
  16. Heule, S.; Nunkesser, M.; Hall, A. Hyperloglog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology, Genoa, Italy, 18–22 March 2013; pp. 683–692. [Google Scholar]
  17. Ioannidis, Y.E. The History of Histograms (abridged). In Proceedings of the 29th International Conference on Very Large Data Bases, Berlin, Germany, 9–12 September 2003; pp. 19–30. [Google Scholar]
  18. Leis, V.; Radke, B.; Gubichev, A.; Kemper, A.; Neumann, T. Cardinality Estimation Done Right: Index-Based Join Sampling. In Proceedings of the 8th Biennial Conference on Innovative Data Systems Research (CIDR’17), Chaminade, CA, USA, 8–11 January 2017. [Google Scholar]
  19. Lu, X.; Guan, J. A new approach to building histogram for selectivity estimation in query processing optimization. Comput. Math. Appl. 2009, 57, 1037–1047. [Google Scholar] [CrossRef]
  20. Müller, M.; Moerkotte, G.; Kolb, O. Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses. Proc. VLDB Endow. 2018, 11, 1016–1028. [Google Scholar] [CrossRef]
  21. Ortiz, J.; Balazinska, M.; Gehrke, J.; Keerthi, S.S. An Empirical Analysis of Deep Learning for Cardinality Estimation. CoRR 2019. abs/1905.06425. [Google Scholar]
  22. Wu, W.; Naughton, J.F.; Singh, H. Sampling-Based Query Re-Optimization. In Proceedings of the 2016 International Conference on Management of Data, San Francisco, CA, USA, 26 June–1 July 2016; pp. 1721–1736. [Google Scholar]
  23. Yang, Z.; Kamsetty, A.; Luan, S.; Liang, E.; Duan, Y.; Chen, X.; Stoica, I. NeuroCard: One Cardinality Estimator for All Tables. Proc. VLDB Endow. 2020, 14, 61–73. [Google Scholar] [CrossRef]
  24. Yang, Z.; Liang, E.; Kamsetty, A.; Wu, C.; Duan, Y.; Chen, X.; Abbeel, P.; Hellerstein, J.M.; Krishnan, S.; Stoica, I. Selectivity Estimation with Deep Likelihood Models. CoRR 2019. abs/1905.04278. [Google Scholar]
  25. Bennett, K.P.; Ferris, M.C.; Ioannidis, Y.E. A Genetic Algorithm for Database Query Optimization. In Proceedings of the 4th International Conference on Genetic Algorithms, San Diego, CA, USA, 13–16 July 1991; pp. 400–407. [Google Scholar]
  26. Fegaras, L. A New Heuristic for Optimizing Large Queries. In Proceedings of the 9th International Conference, DEXA’98, Vienna, Austria, 24–28 August 1998; Volume 1460, pp. 726–735. [Google Scholar]
  27. Marcus, R.; Papaemmanouil, O. Deep Reinforcement Learning for Join Order Enumeration. In Proceedings of the First International Workshop on Exploiting Artificial Intelligence Techniques for Data Management, Houston, TX, USA, 10 June 2018; pp. 3:1–3:4. [Google Scholar]
  28. Stillger, M.; Lohman, G.M.; Markl, V.; Kandil, M. LEO—DB2’s LEarning Optimizer. In Proceedings of the 27th International Conference on Very Large Data Bases, Roma, Italy, 11–14 September 2001; pp. 19–28. [Google Scholar]
  29. Waas, F.; Pellenkoft, A. Join Order Selection—Good Enough Is Easy. In Proceedings of the 17th British National Conference on Databases, BNCOD 17, Exeter, UK, 3–5 July 2000; Volume 1832, pp. 51–67. [Google Scholar]
  30. Yu, X.; Li, G.; Chai, C.; Tang, N. Reinforcement Learning with Tree-LSTM for Join Order Selection. In Proceedings of the 2020 IEEE 36th International Conference on Data Engineering (ICDE), Dallas, TX, USA, 20–24 April 2020; pp. 1297–1308. [Google Scholar]
  31. Zhu, R.; Wu, Z.; Chai, C.; Pfadler, A.; Ding, B.; Li, G.; Zhou, J. Learned Query Optimizer: At the Forefront of AI-Driven Databases. In Proceedings of the 25th International Conference on Extending Database Technology (EDBT), Edinburgh, UK, 29 March–1 April 2022; pp. 1–4. [Google Scholar]
  32. Ilyas, I.F.; Markl, V.; Haas, P.J.; Brown, P.; Aboulnaga, A. CORDS: Automatic Discovery of Correlations and Soft Functional Dependencies. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, Paris, France, 13–18 June 2004; pp. 647–658. [Google Scholar]
  33. Pena, E.H.M.; Falk, E.; Meira, J.A.; de Almeida, E.C. Mind Your Dependencies for Semantic Query Optimization. J. Inf. Data Manag. 2018, 9, 3–19. [Google Scholar]
  34. Kossmann, J.; Naumann, F.; Lindner, D.; Papenbrock, T. Workload-driven, Lazy Discovery of Data Dependencies for Query Optimization. In Proceedings of the 12th Annual Conference on Innovative Data Systems Research (CIDR’22), Chaminade, CA, USA, 9–12 January 2022. [Google Scholar]
  35. Mou, L.; Li, G.; Zhang, L.; Wang, T.; Jin, Z. Convolutional Neural Networks over Tree Structures for Programming Language Processing. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–17 February 2016; pp. 1287–1293. [Google Scholar]
  36. Xu, L.; Choy, C.; Li, Y. Deep sparse rectifier neural networks for speech denoising. In Proceedings of the 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, 13–16 September 2016; pp. 1–5. [Google Scholar]
  37. Thompson, W.R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 1933, 25, 285–294. [Google Scholar] [CrossRef]
  38. Kossmann, J.; Papenbrock, T.; Naumann, F. Data dependencies for query optimization: A survey. VLDB J. 2022, 31, 1–22. [Google Scholar] [CrossRef]
  39. Mahdavi, M.; Abedjan, Z. Baran: Effective Error Correction via a Unified Context Representation and Transfer Learning. Proc. VLDB Endow. 2020, 13, 1948–1961. [Google Scholar] [CrossRef]
  40. Rekatsinas, T.; Chu, X.; Ilyas, I.F.; Ré, C. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 2017, 10, 1190–1201. [Google Scholar] [CrossRef]
  41. Szlichta, J.; Godfrey, P.; Gryz, J. Fundamentals of Order Dependencies. Proc. VLDB Endow. 2012, 5, 1220–1231. [Google Scholar] [CrossRef]
  42. Chen, X.; Wang, Z.; Liu, S.; Li, Y.; Zeng, K.; Ding, B.; Zhou, J.; Su, H.; Zheng, K. Base: Bridging the gap between cost and latency for query optimization. Proc. VLDB Endow. 2023, 16, 1958–1966. [Google Scholar] [CrossRef]
  43. Chen, T.; Gao, J.; Chen, H.; Tu, Y. LOGER: A learned optimizer towards generating efficient and robust query execution plans. Proc. VLDB Endow. 2023, 16, 1777–1789. [Google Scholar] [CrossRef]
  44. Chen, X.; Chen, H.; Liang, Z.; Liu, S.; Wang, J.; Zeng, K.; Su, H.; Zheng, K. Leon: A new framework for ml-aided query optimization. Proc. VLDB Endow. 2023, 16, 2261–2273. [Google Scholar] [CrossRef]
  45. Xu, X.; Zhao, Z.; Zhang, T.; Kang, R.; Sun, L.; Chen, J. COOOL: A Learning-To-Rank Approach for SQL Hint Recommendations. arXiv 2023, arXiv:2304.04407. [Google Scholar]
  46. Woltmann, L.; Thiessat, J.; Hartmann, C.; Habich, D.; Lehner, W. Fastgres: Making learned query optimizer hinting effective. Proc. VLDB Endow. 2023, 16, 3310–3322. [Google Scholar] [CrossRef]
Figure 1. Example of query optimizer hint set.
Figure 2. TCNN model architecture.
Figure 3. DBMS architecture.
Figure 4. Framework overview.
Figure 5. Total execution time as the query proceeds. (a) Workload 1 of IMDB; (b) Workload 2 of IMDB; (c) Workload 1 of Hospital; (d) Workload 2 of Hospital.
Figure 6. The performance of PostgreSQL and Celo relative to Bao. (a) IMDB; (b) Hospital.
Figure 7. Absolute difference in query latency between the selected plans of Celo and Bao for the subset of queries from IMDB’s Workload 1 (lower is better).
Figure 8. Percentile latency for queries. (a) Workload 1 of IMDB; (b) Workload 2 of IMDB; (c) Workload 1 of Hospital; (d) Workload 2 of Hospital.
Figure 9. Total execution time as the test query proceeds. (a) Workload 1 of IMDB dataset; (b) Workload 2 of IMDB dataset.
Figure 10. The plans selected by Bao (left) and Celo (right) of two queries in Workload 1 of the IMDB dataset. (a) A query in the middle-late workload; (b) A query in the late workload.
Table 1. Layer-wise description of TCNN model.

Layer | Description
Input Layer | $X \in \mathbb{R}^{1 \times 256}$.
First TreeConv Layer | $H_1 = \mathrm{TreeConv}_1(X; W_1) \in \mathbb{R}^{1 \times 128}$, where $W_1$ is the weight matrix of the first convolution layer.
Second TreeConv Layer | $H_2 = \mathrm{TreeConv}_2(H_1; W_2) \in \mathbb{R}^{1 \times 64}$, where $W_2$ is the weight matrix of the second convolution layer.
Third TreeConv Layer | $H_3 = \mathrm{TreeConv}_3(H_2; W_3) \in \mathbb{R}^{1 \times 32}$, where $W_3$ is the weight matrix of the third convolution layer.
Dynamic Pooling | $H_{\mathrm{pool}} = \mathrm{DynamicPooling}(H_3) \in \mathbb{R}^{1 \times 32}$.
First Linear Layer | $H_4 = H_{\mathrm{pool}} W_4 + b_4 \in \mathbb{R}^{1 \times 16}$, where $W_4 \in \mathbb{R}^{32 \times 16}$ is the weight matrix and $b_4$ is the bias term.
ReLU Layer | $H_5 = \sigma(H_4) \in \mathbb{R}^{1 \times 16}$, where $\sigma$ is the ReLU activation function.
Second Linear Layer (Output Layer) | $y = H_5 W_6 + b_6 \in \mathbb{R}^{1 \times 1}$, where $W_6 \in \mathbb{R}^{16 \times 1}$ is the weight matrix and $b_6$ is the bias term.
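To make the shapes in Table 1 concrete, the following is a minimal PyTorch-style sketch of the value network, assuming one 256-dimensional feature vector per plan node. The tree convolutions are approximated here by per-node linear maps (a real tree convolution also aggregates each node's children, as in Mou et al. [35]); the class, layer, and variable names are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class TCNNValueModel(nn.Module):
    """Sketch of the Table 1 stack: three tree convolutions (256 -> 128 -> 64 -> 32),
    dynamic pooling over plan nodes, then two linear layers (32 -> 16 -> 1)."""

    def __init__(self):
        super().__init__()
        # Stand-in for TreeConv: a per-node linear map; the real operator
        # also convolves over each node's children in the plan tree.
        self.conv1 = nn.Linear(256, 128)
        self.conv2 = nn.Linear(128, 64)
        self.conv3 = nn.Linear(64, 32)
        self.fc1 = nn.Linear(32, 16)
        self.out = nn.Linear(16, 1)

    def forward(self, node_features: torch.Tensor) -> torch.Tensor:
        # node_features: (num_plan_nodes, 256), one vector per plan-tree node.
        h = self.conv3(self.conv2(self.conv1(node_features)))
        # Dynamic pooling: an element-wise max over nodes collapses a plan of
        # any size to a fixed 1 x 32 vector, matching H_pool in Table 1.
        h_pool, _ = torch.max(h, dim=0, keepdim=True)
        h = torch.relu(self.fc1(h_pool))    # H_5 = sigma(H_4), shape 1 x 16
        return self.out(h)                  # y, shape 1 x 1: predicted latency

model = TCNNValueModel()
plan = torch.randn(7, 256)    # a hypothetical 7-node query plan
print(model(plan).shape)      # torch.Size([1, 1])

The element-wise max is one common realization of dynamic pooling; it keeps the network size-invariant, so plans with different numbers of operators map to the same fixed-length representation before the linear layers.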
Table 2. An example of query optimization latency (seconds) for equivalent queries using the SOTA method Bao and our method Celo.

Query | Bao | Celo
q_1: SELECT MAX(cast_info.note),MIN(title.title) FROM movie_link,title,kind_type,cast_info,movie_info WHERE movie_link.movie_id=title.id AND title.kind_id=kind_type.id AND cast_info.movie_id=title.id AND movie_info.movie_id=title.id AND title.title NOT LIKE ‘%(#1.33)’ AND cast_info.note NOT LIKE ‘%nan%’ AND movie_info.id > 5,486,631 and movie_info.id < 7,020,540 and movie_info.info_type_id = 3; | 0.39 | 0.37
q_2: SELECT MAX(cast_info.note),MIN(title.title) FROM movie_link,title,kind_type,cast_info,movie_info WHERE movie_link.movie_id=title.id AND title.kind_id=kind_type.id AND cast_info.movie_id=title.id AND movie_info.movie_id=title.id AND title.title NOT LIKE ‘%(#1.33)’ AND cast_info.note NOT LIKE ‘%nan%’ AND movie_info.id > 5,486,631 and movie_info.id < 7,020,540; | 84.81 | 0.43
q_3: SELECT MAX(movie_info.info),MAX(name.name_pcode_nf) FROM cast_info,name,movie_info,info_type,person_info WHERE cast_info.person_id=name.id AND movie_info.info_type_id=info_type.id AND person_info.person_id=name.id AND person_info.info_type_id=info_type.id AND movie_info.info NOT LIKE ‘th H%’ AND name.name_pcode_nf NOT LIKE ‘%P1453%’ AND name.surname_pcode NOT LIKE ‘%S35’ AND name.name_pcode_cf LIKE ‘A6265%’ AND name.id > 1,739,579 and name.id < 2,701,135 and name.gender = ‘f’; | 51.71 | 50.38
q_4: SELECT MAX(movie_info.info),MAX(name.name_pcode_nf) FROM cast_info,name,movie_info,info_type,person_info WHERE cast_info.person_id=name.id AND movie_info.info_type_id=info_type.id AND person_info.person_id=name.id AND person_info.info_type_id=info_type.id AND movie_info.info NOT LIKE ‘th H%’ AND name.name_pcode_nf NOT LIKE ‘%P1453%’ AND name.surname_pcode NOT LIKE ‘%S35’ AND name.name_pcode_cf LIKE ‘A6265%’ AND name.id > 1,739,579 and name.id < 2,701,135; | 141.45 | 50.48
Table 3. DCs and equivalent predicates in IMDB dataset.

Table | DCs | Predicates a | Predicates b
n | ¬(t_α[id] < 1,739,580 ∧ t_α[gender] ≠ ‘m’) | n.id < 1,739,580 AND n.gender = ‘m’ | n.id < 1,739,580
n | ¬(t_α[id] > 1,739,579 ∧ t_α[id] < 2,701,135 ∧ t_α[gender] ≠ ‘f’) | n.id > 1,739,579 AND n.id < 2,701,135 AND n.gender = ‘f’ | n.id > 1,739,579 AND n.id < 2,701,135
m_i | ¬(t_α[id] > 11,658,936 ∧ t_α[id] < 14,695,656 ∧ t_α[info_type_id] ≠ 16) | m_i.id > 11,658,936 AND m_i.id < 14,695,656 AND m_i.info_type_id = 16 | m_i.id > 11,658,936 AND m_i.id < 14,695,656
m_i | ¬(t_α[id] > 5,486,631 ∧ t_α[id] < 7,020,540 ∧ t_α[info_type_id] ≠ 3) | m_i.id > 5,486,631 AND m_i.id < 7,020,540 AND m_i.info_type_id = 3 | m_i.id > 5,486,631 AND m_i.id < 7,020,540
c_i | ¬(t_α[id] < 12,670,689 ∧ t_α[role_id] ≠ 1) | c_i.id < 12,670,689 AND c_i.role_id = 1 | c_i.id < 12,670,689
c_i | ¬(t_α[id] > 12,670,688 ∧ t_α[id] < 20,122,662 ∧ t_α[role_id] ≠ 2) | c_i.id > 12,670,688 AND c_i.id < 20,122,662 AND c_i.role_id = 2 | c_i.id > 12,670,688 AND c_i.id < 20,122,662
m_c | ¬(t_α[id] < 1,274,247 ∧ t_α[company_type_id] ≠ 1) | m_c.id < 1,274,247 AND m_c.company_type_id = 1 | m_c.id < 1,274,247
m_c | ¬(t_α[id] > 1,274,246 ∧ t_α[id] < 2,609,130 ∧ t_α[company_type_id] ≠ 2) | m_c.id > 1,274,246 AND m_c.id < 2,609,130 AND m_c.company_type_id = 2 | m_c.id > 1,274,246 AND m_c.id < 2,609,130
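The predicate pairs above can be applied as direct rewrites: whenever one side of a pair appears in a query's WHERE clause, substituting the other side yields a semantically equivalent query whose plan can be added to the candidate set. The following is a minimal string-level sketch; the single pair shown, the normalization (no thousands separators), and the function name are simplifying assumptions, not Celo's actual rewriting component.

# Equivalent predicate pair derived from a DC (fourth row of Table 3).
# The DC ¬(t_α[id] > 5,486,631 ∧ t_α[id] < 7,020,540 ∧ t_α[info_type_id] ≠ 3)
# guarantees every movie_info row in that id range has info_type_id = 3, so
# the equality predicate can be added or dropped without changing the answer.
EQUIVALENT_PREDICATES = [
    ("movie_info.id > 5486631 AND movie_info.id < 7020540 AND movie_info.info_type_id = 3",
     "movie_info.id > 5486631 AND movie_info.id < 7020540"),
]

def equivalent_queries(sql: str) -> list[str]:
    """Generate equivalent formulations of sql by swapping predicate pairs."""
    rewrites = []
    for pred_a, pred_b in EQUIVALENT_PREDICATES:
        if pred_a in sql:
            rewrites.append(sql.replace(pred_a, pred_b))
        elif pred_b in sql:
            rewrites.append(sql.replace(pred_b, pred_a))
    return rewrites

q2 = ("SELECT MAX(cast_info.note) FROM movie_info, title "
      "WHERE movie_info.movie_id = title.id "
      "AND movie_info.id > 5486631 AND movie_info.id < 7020540")
print(equivalent_queries(q2))   # injects the implied info_type_id = 3 predicate

This mirrors the q_1/q_2 pair in Table 2: q_2 lacks the info_type_id predicate and runs for 84.81 s under Bao, while augmenting its candidate plans with those of the equivalent q_1 form lets Celo select a 0.43 s plan.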
Table 4. DCs and equivalent group operations in Hospital dataset.

Table | DCs | Group Operation a | Group Operation b
hospital | ¬(t_α[city] = t_β[city] ∧ t_α[zip] ≠ t_β[zip]) | GROUP BY hospital.city, hospital.zip | GROUP BY hospital.city
hospital | ¬(t_α[city] = t_β[city] ∧ t_α[country] ≠ t_β[country]) | GROUP BY hospital.city, hospital.country | GROUP BY hospital.city
hospital | ¬(t_α[zip] = t_β[zip] ∧ t_α[city] ≠ t_β[city]) | GROUP BY hospital.city, hospital.zip | GROUP BY hospital.zip
hospital | ¬(t_α[zip] = t_β[zip] ∧ t_α[state] ≠ t_β[state]) | GROUP BY hospital.zip, hospital.state | GROUP BY hospital.zip
hospital | ¬(t_α[zip] = t_β[zip] ∧ t_α[country] ≠ t_β[country]) | GROUP BY hospital.zip, hospital.country | GROUP BY hospital.zip
hospital | ¬(t_α[country] = t_β[country] ∧ t_α[state] ≠ t_β[state]) | GROUP BY hospital.country, hospital.state | GROUP BY hospital.country
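Each DC in Table 4 encodes a functional dependency on hospital (the first row, for instance, states that city determines zip), so a GROUP BY column that is determined by another grouping column can be dropped without changing the grouping. Below is a minimal sketch under that reading; the FD dictionary and function name are illustrative assumptions.

# Functional dependencies implied by the DCs in Table 4 (determinant -> determined).
FDS = {
    "hospital.city": {"hospital.zip", "hospital.country"},
    "hospital.zip": {"hospital.city", "hospital.state", "hospital.country"},
    "hospital.country": {"hospital.state"},
}

def simplify_group_by(columns: list[str]) -> list[str]:
    """Drop grouping columns functionally determined by a remaining column."""
    kept = list(columns)
    for col in columns:
        others = [c for c in kept if c != col]
        if any(col in FDS.get(other, set()) for other in others):
            kept.remove(col)
    return kept

# city and zip determine each other (rows 1 and 3 of Table 4), so either
# single-column grouping is equivalent; this greedy pass keeps one of them.
print(simplify_group_by(["hospital.city", "hospital.zip"]))       # ['hospital.zip']
print(simplify_group_by(["hospital.country", "hospital.state"]))  # ['hospital.country']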
