Matching and Rewriting Rules in Object-Oriented Databases

Giacomo Bergami; Oliver Robert Fox; Graham Morgan

doi:10.3390/math12172677

Abstract

Graph query languages such as Cypher are widely adopted to match and retrieve data in a graph representation, due to their ability to retrieve and transform information. Even though the most natural way to match and transform information is through rewriting rules, those are scarcely or partially adopted in graph query languages. Their inability to do so has a major impact on the subsequent way the information is structured, as it might then appear more natural to provide major constraints over the data representation to fix the way the information should be represented. On the other hand, recent works are starting to move towards the opposite direction, as the provision of a truly general semistructured model (GSM) allows to both represent all the available data formats (Network-Based, Relational, and Semistructured) as well as support a holistic query language expressing all major queries in such languages. In this paper, we show that the usage of GSM enables the definition of a general rewriting mechanism which can be expressed in current graph query languages only at the cost of adhering the query to the specificity of the underlying data representation. We formalise the proposed query language in terms declarative graph rewriting mechanisms described as a set of production rules

L \to R

while both providing restriction to the characterisation of L, and extending it to support structural graph nesting operations, useful to aggregate similar information around an entry-point of interest. We further achieve our declarative requirements by determining the order in which the data should be rewritten and multiple rules should be applied while ensuring the application of such updates on the GSM database is persisted in subsequent rewriting calls. We discuss how GSM, by fully supporting index-based data representation, allows for a better physical model implementation leveraging the benefits of columnar database storage. Preliminary benchmarks show the scalability of this proposed implementation in comparison with state-of-the-art implementations.

Keywords:

direct acyclic graphs; generalised semistructured model; graph grammars; graph query languages; algorithms; operator algebras

MSC:

68W40; 68P15; 68Q42; 68Q55; 68R10; 68U35

1. Introduction

Query languages [1] fulfill the aim of retrieving and manipulating data after being adequately processed according to a physical model requiring preliminary data loading and indexing operations. Good declarative languages for manipulating information, such as SQL [2], are agnostic from the underlying physical model while expressing the information need of combining (JOIN), filtering (WHERE) and grouping (e.g., COUNT(*) over GROUP BY-s) data over its logical representation. This language can also be truly declarative as it does not require a user to explicitly convey this in terms of operations to be performed over the data rather than instructing the machine which information should be used and which types of transformations should be applied. Due to the intrinsic nature of graph data representation, graph query languages such as Gremlin [3], Cypher [4], or SPARQL [5] are procedural (i.e., navigational) due to the navigational structure of the data, thus heavily requiring the user to inform the language how to match and transform the data. These considerations extend to how edge and vertex data should be treated, thus adding further complexity [6,7].

On the other hand, graph grammars [8] provide a high-level declarative way to match specific sub-graphs of interest from vertex-labelled graphs for then rewriting them into another subgraph of interest, thus updating the original graph of interest. These consist in a set of rules

{L_{i} \to R_{i}}_{1 \leq i \leq N}

, where each rule

L_{i} \to R_{i}

consists of two graphs

L_{i}

and

R_{i}

possibly sharing a common subgraph K: while

K \subseteq L

specifies a potential removal of either vertices or edges,

K \subseteq R

specifies their addition. Automating the application of such rules requires explicitly following a graph navigational order for applying each matched subgraph to the structure [9] to guarantee the generation of a unique acceptable graph representation resulting from the querying data. As a by-product of exploiting topological sorts for scheduling visiting and rewriting operations, we then require to precisely identify an entry match vertex for each

L_{i}

from a rule, thus precisely determining from which position within the graph the rewriting rule should be applied, and in which order.

At the time of writing, the aforementioned graph query languages are not able to express graph grammars natively without resorting to explicitly instructing the query language how to restructure the underlying graph data representation (Lemma 7); furthermore, the replacement of previously matched vertices with new ones invalidates previous matches, thus forcing the user to pipeline different queries until convergence is reached. On the other hand, we would expect any graph query language to directly express the rewriting mechanism by directly expressing the set of rules, without necessarily instructing the querying language in which order such operations shall be performed. Furthermore, when a property-graph model is chosen, and morphisms are expressed through the properties associated with the vertices and edges rather than their associated IDs as per the current Cypher and Neo4j v5.20 specifications, this makes the deletion of vertices and their update through the resulting morphisms quite impractical, as such data model provides no clear way to refer to both vertices and edges via unique identifiers. As a result of these considerations, we then derive that, at the time of writing, both the graph query languages and their underlying representational model are insufficient to adequately express rewriting rules as general as the ones postulated by graph grammars over graphs containing property-value associations, independently from their representation of choice [10].

To overcome the aforementioned limitations, we propose a query language, for the first time directly expressing such rewriting desiderata: we restrict the set of all the possible matching graphs

L_{i}

into ego-nets containing one entry-point while incorporating nesting operations (Section 6.1); as updating graph vertices’ properties breaks the declarative assumption as the user should specify the order in which the operations are performed, we relax the language declarativity for the sole rewritings. To better support this query language, we completely shift the data model of interest to object-oriented databases, where relationships across objects are expressed through object “containment” relationships, and where both objects and such containments are uniquely identified. By also requiring that any object shall never contain itself at any nesting level [6], we obtain cycle-free containment relationships.

The paper is structured as follows: after providing some preliminary notation used throughout the paper (Section 2), we introduce the relational and the graph data models from current literature, while focusing on both their data representation and associated query languages (Section 3). Their juxtaposition motivates the proposal of our Generalised Semistructured Model (Section 4), for which we outline both the logical and physical data model, where the latter leverages state-of-the-art columnar relational database representations [11]; we also introduce the concept of a view for a generalised semistructured model, as well as introduce some morphism notation used throughout the paper. After introducing the designated operator for instantiating the morphisms by joining matched containments stored in distinct sets of tables (Section 5), we finally propose our query language for matching and rewriting object-oriented databases expressed in the aforementioned GSM model (Section 6). We characterise the formal semantics of such novel graph query language in pseudocode notation (Algorithm from Section 6) characterised in terms of both algorithmic and algebraic notation (Section 6.3 and Section 6.4) for the matching part, as well as in terms of Structured Operational Semantics (SOS) [12] for describing the rewriting steps (Section 6.5). The remaining sections discuss some of the expressiveness properties of such query language if compared to Cypher (Section 7), discuss its time complexity (Section 8), and benchmark it against Cypher running over Neo4j v5.20, showing the extreme inefficiency of property graph computational model (Section 9), which are fairly restricted due to the impossibility of conveying a novel single query for any possible graph schema (Lemmas in Section 7). Scalability tests show that our solution outperforms Cypher and Neo4j v5.20, providing the query language standard nearer to the recently proposed GQL, by two orders of magnitude (Section 9.1) while providing a computational throughput being 600 times faster than Neo4j by solving more queries in the same comparable amount of time (Section 9.1). Last, we draw our final conclusions where we propose some future works (Section 10). To improve the paper’s readability, we move some definitions (Appendix A) and the full set of proofs for the Lemmas and Corollaries (Appendix B) to the Appendix.

These main contributions are then obtained in terms of the following ancillary results:

Definition of a novel nested relational algebra natural equi-join operator for composing nested morphisms, also supporting left outer joins by parameter tuning (Section 5).
Definition of an object-oriented database view for updating the databases on the fly without the need for heavy restructuring of the loaded and indexed physical model (Section 4.3).
As the latter view relies on the definition of a logical model extending GSM (Section 4.1), we show that the physical model is isomorphic to an indexed set of GSM databases expressed within the logical model (Lemma 1).

2. Preliminary Notation

We denote sets

S = {x_{1}, \dots, x_{n}}

of cardinality

| S | = n

as standard. The power set

℘ (S)

of any set S is the set of all subsets of S, including the empty set ∅ and S itself. Formally,

℘ (S) = {S^{'} | S^{'} \subseteq S}

.

We define a finite function

f : A ⇀ B

via its graph

[(x_{1}, f (x_{1})), \dots, (x_{n}, f (x_{n}))]

explicating the association between a value from the domain of f (

{x_{i}, \dots, x_{i}} = dom (f)

) and a non-NULL codomain value. Using an abuse of notation, we denote the restriction

{f |}_{X}

as the evaluation of f over a domain

X \cap dom (f)

, i.e.,

{f |}_{X} = [(x, f (x)) | x \in X \cap dom (f)]

. We can also denote

f (x) : = C

where C is the definition of the function over x as

x \mapsto C

. With an abuse of notation, we denote

| f |

as the cardinality of its domain, i.e.,

| f | = | dom (f) |

. We say that two functions f and

f^{'}

are equivalent,

f \dot{=} f^{'}

, if and only if they both share the same domain and, for each element of their domain, both functions return the same codomain value, i.e.,

f \dot{=} f^{'} \Leftrightarrow dom (f) = dom (f^{'}) \land \forall x \in d o m (f) . f (x) = f^{'} (x)

.

A tuple or indexed set

t = ⟨ t_{1}, \dots, t_{n} ⟩

of length

| t | = n

is defined as a finite function in

N ⇀ V

where

t_{i}

has i as a natural-valued independent variable representing the index of the tuple, and the dependent variable corresponds to the element

t (i)

in the tuple.

A binary relation ℜ on a set A is said to be an equivalence relation if and only if it is reflexive (

\forall x \in A . x ℜ x

), symmetric (

\forall x, y . x ℜ y \Leftrightarrow y ℜ x

), and transitive (

\forall x, y, z . x ℜ y \land y ℜ z \Rightarrow x ℜ z

). Given a set A and an equivalence relationship ℜ, an equivalence class

{[x]}_{ℜ} \subseteq A

for

x \in A

is the set of all the elements in A that are equivalent to x, i.e.,

{[x]}_{ℜ} = {y \in A | x ℜ y}

. Given a set A and an equivalence relationship ℜ, the quotient set

\frac{A}{ℜ}

is the set of all the equivalence classes of A, i.e.,

\frac{A}{ℜ} = {{[x]}_{ℜ} | x \in A}

. We denote

{\dot{=}}_{X}

as the equivalence relationship denoting two functions as equivalent if they are equivalent after the same restriction over X, i.e.,

f {\dot{=}}_{X} f^{'} \Leftrightarrow f {|_{X} \dot{=} f^{'} |}_{X}

.

Given a set of all the possible string attributes

Σ^{*}

and a set of all the possible values

V

, a record is defined as a finite function

f : A ⇀ V

with

A \subseteq Σ^{*}

[13]. Given two records represented as finite functions f and

f^{'}

, we denote

f \oplus f^{'}

as the overriding of f by

f^{'}

returning

f^{'} (x)

for each

x \in dom (f^{'})

and

f (x)

otherwise. Given this, we can also denote the graph of a function as

\oplus_{x_{i} \in dom (f)} [(x_{i}, f (x_{i}))] \equiv [(x_{i}, f (x_{i})) | x_{i} \in dom (f)]

. We denote as NULL a void value not representing a sensible value in

V

: using the usual notation in the database area at the cost of committing an abuse of notation, we say that

f (x)

returns NULL if and only if f is not defined over x (i.e.,

x \notin dom (f)

). We say that a record

ϵ

is empty if no attribute is associated with a value in

V

(i.e.,

dom (ϵ) = \emptyset

and

\forall x \in V . ϵ (x) = NULL

).

Higher-Order Functions

Higher-Order Functions (HOFs) are functions that either take one or more functions as arguments or return a function as a result. We provide the definition of some HOFs used in this paper:

The zipping operator maps n tuples (or records) $t^{1}, \dots, t^{n}$ to a record of tuples (or records) r defined as $r (i) = ⟨ t_{i}^{1}, \dots, t_{i}^{n} ⟩$ if and only if all n tuples are defined over i:

$ζ (t^{1}, \dots, t^{n}) = [⟨ t_{i}^{1}, \dots, t_{i}^{n} ⟩ | i \leq i \leq min (| t^{1} |, \dots, | t^{n} |), \forall 1 \leq j \leq n . i \in dom (t^{j})]$
Given a function $f : A \to B$ and a generic collection C, the mapping operator returns a new collection by applying f to each component of C:

$μ (f, C) = \{\begin{matrix} {f (x) | x \in C} & C i s a set \\ [f (C) (i) | i \in dom (C)] & C i s a record \\ ⟨ f (C_{1}), \dots, f (C_{n}) ⟩ & C i s a tuple \end{matrix}$
Given a binary predicate p and a collection C, the filter function trims C by restricting it to its values satisfying p:

$ϝ (p, C) = \{\begin{matrix} {x \in C | p (x)} & C is a set \\ [x (C) (x) | x \in dom (C), p (C (x))] & C is a record \\ x \mapsto {arg min}_{\begin{matrix} | C | \geq i \geq x \\ s . t . p (C_{i}) \end{matrix}} C_{i} & C is a tuple \end{matrix}$
Given a binary function $f : A \times V \to A$ , an initial value $α \in A$ (accumulator), and a tuple C, the (left) fold operator is a tail-recursive function returning either $α$ for an empty tuple, or $f (\dots f (α, t_{1}), t_{n})$ for a tuple $t = ⟨ t_{1}, \dots, t_{n} ⟩$ :

$Λ (f, α, C) = \{\begin{matrix} α & | C | = 0 \\ Λ (f, f (α, C (m)), C |_{dom (C) ∖ {m}}) & | C | \neq 0 \land m : = min dom (C) \end{matrix}$
Given a collection of strings C and a separator s, collapse (also referred to as join in programming languages such as Javascript or Python ) returns a single string where all the strings in C are separated by s. Given “^” the usual string concatenation operator, this can be expressed in terms of $Λ$ as follows:

$λ (C, s) : = Λ (^, s, C)$

When s is a space “␣”, then we can use $λ (C)$ as a shorthand.
Given a function $f : A ⇀ B$ and two values $a \in A$ and $b \in B$ , the update of f so that it will return b for a and will return any other previous value for $f (x)$ otherwise is defined as follows:

${P UT}_{f} (a, b) : = f \oplus [(a, b)]$

(1)
Given a (finite) function f and an input value y, the HOF optionally obtaining the value of $f (y)$ if $y \in dom (f)$ and returning z otherwise is defined as:

${O PT G ET}_{f} (y, z) : = \{\begin{matrix} f (y) & y \in dom (f) \\ z & oth . \end{matrix}$

Please observe that we can use this function in combination with Put to set multiple nested functions:

${P UT}_{f}^{2} (⟨ i, j ⟩, v) : = {P UT}_{f} (i, {P UT}_{{O PT G ET}_{f} (i, \emptyset)} (j, v))$

(2)

4. Generalised Semistructured Model v2.0

We continue the discussion of this paper’s methodology by discussing the logical model (Section 4.1) as a further extension of the Generalised Semistructured Model [34] to explicitly support property-value associations for vertices via

π

. Although we propose no property-value extension for containments, we argue that this is minor and does not substantially change the theoretical and implementation results discussed in this paper for Generalised Graph Grammar. We also introduce a novel columnar-oriented physical model (Section 4.2 on the next page) being a direct application of our previous work on temporal databases [11] already supporting collections of databases (log of traces): this will be revised for loading and indexing collection of GSM databases defining our overall physical storage. As the external data to be loaded into the physical model directly represents the logical model, we show that these two representations are isomorphic (Lemma 1) by defining explicitly the loading and serialisation operations (Algorithm from Section 4.2).

As directly updating the physical model with the changes specified in the rewriting steps of the graph grammar rules might require massive restructuring costs, we instead keep track of such changes in a separate non-indexed representation acting as an incremental view to the entire database. We then refer to this additional structure as a GSM view

Δ (g)

for each GSM database g loaded in the physical model (Section 4.3). We then characterise the semantics of

Δ (g)

by defining the updating function for g as a materialisation function considering the incremental changes recorded in

Δ (g)

(Section 4.3.2 on page 19).

Last, as we consider the execution of the rewriting steps for each graph as a transformation of the vertices and edges as referenced within each morphism, modulo the updates tracked in

Δ (g)

, it becomes necessary to introduce some preliminary notation for resolving vertex and edge variables from such morphism (Section 4.4).

4.1. Logical Model

The logical model for GSM describes a single object-oriented database g as a tuple

⟨ O, ℓ, ξ, ϵ, π, ϕ, t_{ϕ} ⟩

, where

O \subseteq N

is a collection of objects (references). Each object is associated with a tuple of possible types

ℓ (o)

and to a tuple of string-representations

ξ (o)

providing its human-readable descriptions. As an object might be the result of an automated entity-relationship extraction process, each object is associated with a list of confidence values

ϵ : O ⇀ ℘ (R)

describing the trustworthiness of the provided information. Differently from the previous definition [34], we extend the previous model to also associate each object to an explicit property-value association for each object

o \in O

through a finite function

π : O \times Σ^{*} ⇀ V

.

Differently from our previous definition of our GSM model, we express object relationships through uniquely identified vertex containments similar to edges as in Figure 1a, to explicitly reference such containments in morphism tables similarly to Figure 1c. We associate each object o with a containment attribute

κ

referring to multiple containment IDs via

ϕ (o, κ) \in ℘ (N)

. An explicit index

t_{ϕ}

maps each of these IDs to a concrete containment

⟨ o_{j}, w ⟩

denoting that

o_{j}

is contained in o through the containment relationship

κ

with a confidence score of w. This separation between indices and values is necessary to allow the removal of containment values when computing queries efficiently. This also guarantees the correct correspondence between each value

ι

in the domain of

t_{ϕ}

to the label associated with the

ι

-th containment requiring each

ι

will be associated with one solve containment relationship (e.g.,

{arg min}_{κ \in Σ^{*}} \exists j \in g . ι \in ϕ (j, κ)

).

We avoid objects containing themselves at any nesting level by imposing a recursion constraint [6] to be checked at indexing time: this allows both the definition of sound structural aggregations, which can be then conveniently used to represent multi-dimensional data-warehouses [34]. Thus, we freely assume that no object shall store the same containment reference across containment attributes [6]:

\forall o, κ . \forall o^{'}, κ^{'} . ((o = o^{'} \land κ \neq κ^{'}) \lor o \neq o^{'}) \Rightarrow ϕ (o, κ) \cap ϕ (o^{'}, κ^{'}) = \emptyset

. This property can also be adapted to conveniently represent Direct Acyclic Graphs, by representing each vertex

v \in V

of a weighted property-graph

G = (V, E)

as a GSM object, and each edge

u \to v \in E

with weight w and label

κ

reflects a containment

⟨ v, w ⟩ \in t_{ϕ} (ϕ (i, κ))

. Given this isomorphism

g ≃ G_{g}

, we can also apply a layered and reverse topological ordering of the vertices of g and denote it as

O_{rtopo} (g)

.

The Python code provided in Appendix A.5 showcases the possibility of instantiating Python objects within the GSM model via the implementation of an adequate transformation function, mapping all native types as

π

key-value properties of a GSM object, while associating the others to

ϕ

and

t_{ϕ}

containment relationships (Line 222). Containments also enable the representation of other object-oriented data structures, such as dictionaries/maps (Line 144) and linear data structures (Line 164), thus enabling a direct representation of JSON and nested relational data (Line 53). As per our previous work [9], GSM also supports the representation of XML (Line 117) and property graph data (Line 65). This also achieves the representation aim of EPGM by natively supporting unique identifiers for both vertices and edges, to better operate on those by directly referring to their IDs without the need to necessarily carry out all of its associated payload information [46]. As such, this model leverages all the pros and cons of the previous graph data models by framing them within an object-oriented semistructured model.

4.2. Physical Model

We now describe how the former model can be represented in primary memory for fastening up the matching and rewriting mechanism.

First, we would like to support the loading of multiple GSM databases while being able to operate over them simultaneously similar to the named graphs in the RDF model. Differently from Neo4j v5.20 and similarly to RDF’s named graphs, we ensure a unique and progressive ID across different databases.

We directly exploit our ActivityTable for listing all the objects appearing within each loaded GSM: as such, the id-th record

⟨ i d, a, g, i, p, x ⟩

will refer to the i-th object in the g-th GSM database, while a will refer to the first label of

ℓ_{j} (i)

, that is

ℓ_{j} (i) [1]

.

We extend KnoBAB’s ActivityTable to represent the

ϕ

containment relationship; the result is a PhiTable^κ for each containment attribute

κ

: each record

⟨ ℓ_{0}, g, o_{src}, w, o_{dst}, ι ⟩

refers to a specific GSM database g through which object

o_{i}

associated with the first label

ℓ_{0} = ℓ_{g} (o_{i}) [0]

contains

o_{j}

with an uncertainty score of w and is associated with an index

ι

: this expresses the idea that

ι \in ϕ (o_{src}, κ)

with

t_{ϕ} (ι) = ⟨ o_{dst}, w ⟩

. At the indexing phase, the table is sorted by lexicographical order over the record’s constituents. We extend this table with two indices: a primary index

P_{I}^{1}

mapping each first occurring ℓ value to the first and the last object within the collection, and a secondary index

P_{I}^{2}

mapping such each database ID g and object ID

o_{i}

to a collection of containment records expressing

μ (t_{ϕ}, ϕ (o_{i}, κ))

for each

o_{i}

and

κ

such that

ϕ (o_{i}, κ) \neq \emptyset

.

We retain the AttributeTable^κ for expressing the properties associated with the GSM vertices, for which we keep the same interpretation from Section 3.2.3 on page 12: thus, each record

⟨ a, v, id ⟩

refers to

ϕ (o_{i}, κ) = v

where

o_{i}

appears as the id-th record within the

{ActivityTable}_{L}

, now used to list all the objects occurring across GSM models.

Last, ℓ and

ξ

properties are stored in a one-column table and a secondary index mapping each database ID and object ID to a list of records referring to the strings associated with the object ID. The same approach is also used to store the confidence values

ϵ

associated with each GSM object.

Algorithm 2 shows the algorithms used for loading all of the GSM databases

g_{i}

to be loaded of interest to a single columnar database representation

d b

, as well as providing the algorithm used to serialize back the stored data into a GSM model for data visualisation and materialisation purposes. Given this, we can easily prove the isomorphism between the two shared data structures, thus also providing proof of the correctness of the two transformations which, as a result, explains the formal characterisation of such an algorithm.

Algorithm 2 Loading a Logical Model into the Physical Model and serialising it back

1:: function LoadingAndIndexing( $G = {g_{1}, \dots, g_{n}}$ )
2:: for all $g_{i} = ⟨ O_{i}, ℓ_{i}, ξ_{i}, ϵ_{i}, π_{i}, ϕ_{i}, t_{i, ϕ} ⟩ \in G$ do
3:: for all $j \in O_{i}$ do
4:: $L_{i} (j) : = ℓ_{i} (j)$ ; $X_{i} (j) : = ξ_{i} (j)$ ; $C_{i} (j) : = ϵ_{i} (j)$
5:: ActivityTable.add( $⟨ ℓ (j) [0], i, j, NULL, NULL ⟩$ )
6:: end for
7:: end for
8:: Index(ActivityTable) ▹ Also sorting the table, [11]
9:: for all $κ \in Σ^{*}, j \in O_{i}$ s.t. ActivityTable [r] = $⟨ ℓ_{i} (j) [0], i, j, p, x ⟩$ do
10:: for all $ι \in ϕ_{i} (j, κ)$ s.t. $t_{i, ϕ} (ι) = ⟨ t, w ⟩$ do
11:: PhiTable^κ.add( $⟨ ℓ (j) [0), i, j, w, t, ι ⟩$ )
12:: end for
13:: if $π_{i} (j, κ) \neq NULL$ then, AttributeTable^κ.add( $⟨ ℓ_{i} (j) [0], π (j, κ), r ⟩$ )
14:: end for
15:: Index(ActivityTable^κ, PhiTable^κ| $κ \in Σ^{*}$ ) ▹ As in [11]
16:: return $d b : = ⟨ L, X, C, ActivityTable, [(κ, {AttributeTable}^{κ}) | κ \in Σ^{*}], [(κ, {PhiTable}^{κ}) | κ \in Σ^{*}] ⟩$
17:: end function

18:: function Serialisation( $d b$ )
19:: $n : = {max}_{r \in AttributeTable} r (1)$ Determining the maximum number of GSM databases
20:: for all $i : = 0$ to n do
21:: $O_{i} : = \{j | \exists l, p, x . ⟨ l, i, j, p, x ⟩ \in ActivityTable\}$
22:: $ℓ_{i} : = [(i, L_{i} (j)) | j \in O_{i}]$ ; $ξ_{i} : = [(i, X_{i} (j)) | j \in O_{i}]$ ; $ϵ_{i} : = [(i, C_{i} (j)) | j \in O_{i}]$
23:: $π_{i} : = [((j, κ), v) | ⟨ l, v, r ⟩ \in {AttributeTable}^{κ}, \exists p, x . ⟨ l, i, j, p, x ⟩ \in ActivityTable, κ \in Σ^{*}]$
24:: $ϕ_{i} : = [((s, κ), {ι^{'} | \exists l^{'}, w^{'}, d^{'}, ⟨ l^{'}, i, j, w^{'}, d^{'}, ι^{'} ⟩ \in {PhiTable}^{k}}) | ⟨ l, i, s, w, d, ι ⟩ \in {PhiTable}^{κ} κ \in Σ^{*}]$
25:: $t_{i, ϕ} : = (ι, {d, w}) | ⟨ l, i, s, w, d, ι ⟩ \in {PhiTable}^{κ}, κ \in Σ^{*}]$
26:: yield $⟨ O_{i}, ℓ_{i}, ξ_{i}, ϵ_{i}, π_{i}, ϕ_{i}, t_{i, ϕ} ⟩$
27:: end for
28:: end function

Lemma 1.

A collection of logical model GSM databases

{⟨ g_{i} ⟩}_{i \leq n}

is isomorphic to the ones loaded and indexed physical model.

The proof is given in Appendix B.1. We refer to Section 8 for proofs related to the time complexity for loading, indexing, and serialising operations.

4.3. GSM View $Δ (g)$

To avoid massive restructuring costs while updating the information indexed in the physical model, we use a direct extension of the logical model to keep track of which objects were newly generated, removed, or updated during the application of the rule rewriting mechanisms. At query time, we instantiate a view

Δ (g_{i})

for each

g_{i}

being loaded within the physical model (Section 6.5). We want our view to support the following operations: (i) creation of new objects, (ii) update of the type/labelling information ℓ, (iii) updating the human-readable value characterisation

ξ

, (iv) update of the containment values, (v) removal of specific objects, and (vi) substitution of previously matched vertices with newly-created or other previously matched ones. While we are not explicitly supporting the removal of specific properties or values, these can be easily simulated by setting specific fields to empty strings or values. A view for

g_{i}

is defined as follows:

Δ (g_{i}) = ⟨ g_{i}^{Δ}, Γ_{i}, Γ_{i}^{v}, O_{i}^{+}, O_{i}^{-}, E_{i}^{-}, ρ_{i} ⟩

(8)

where

g_{i}^{Δ} = ⟨ O_{i}^{Δ}, ℓ_{i}^{Δ}, ξ_{i}^{Δ}, ϵ_{i}^{Δ}, π_{i}^{Δ}, ϕ_{i}^{Δ}, t_{ϕ, i}^{Δ} ⟩

is a GSM database holding all the objects being newly inserted alongside with the properties as well as the updated properties for the objects within the graph g (i–iv),

Γ

refers to the nested morphism being considered while evaluating the query,

Γ^{v}

denotes the extension of such morphism with the newly inserted objects through variable declaration, which resulting objects are then collected in

O^{+}

(i).

O^{-}

(and

E^{-}

) tracks all the removed objects (and specific containment objects, resp.) through their ID (v). Last,

ρ

is a replacement map

⨁_{i} [(o_{i}, o_{α (i)})]

to be used when evaluating the transformations over morphisms occurring at a higher topological sort layer, stating to refer to an object

o_{α (i)}

when

o_{i}

occurs (vi).

Γ^{v}

and

O^{+}

are used to retain the updated information locally to each evaluated morphism, while the rest are shared across the evaluation of each distinct morphism.

We equip

Δ (g)

with update operations reflecting the insertion, deletion, update, and replacement operations as per rewriting semantics associated with each graph grammar production rule. Such operations are the following:

Start:: re-initialises the view to evaluating a new morphism $Γ^{'}$ by discarding any information being local to each specific morphism:

${S TART}_{Δ (g)} (Γ^{'}) = ⟨ g^{Δ}, Γ^{'}, \emptyset, \emptyset, O^{-}, E^{-}, ρ ⟩$

(9)
DelCont:: We remove the i-th containment relationship from the database:

${D EL C ONT}_{Δ (g)} (i) : = ⟨ g^{Δ}, Γ, Γ^{v}, O^{+}, O^{-}, E^{-} \cup {i}, ρ ⟩$
NewObj:: Generates a new empty object associated with a variable j and with a new unique object ID $| g | + | g^{Δ} | + 1$ :

\begin{matrix} {N EW O BJ}_{Δ (g)} (j) : = & let fresh : = | g | + | g^{Δ} | + 1 in \\ ⟨ g^{Δ}, Γ, {P UT}_{Γ^{v}} (j, Γ^{v} (j) \cup {fresh}), O^{+} \cup {fresh}, O^{-}, E^{-}, ρ ⟩ \end{matrix}

ReplObj:: replaces $o_{i}$ with $o_{j}$ if and only if $o_{j}$ was not removed ( $o_{j} \notin O^{-}$ ):

${R EPL O BJ}_{Δ (g)} ((o_{i}, o_{j})) : = \{\begin{matrix} Δ (g) & o_{j} \in O^{-} \\ ⟨ g^{Δ}, Γ, Γ^{v}, O^{+}, O^{-}, E^{-}, ρ \circ [(o_{i}, o_{j})] ⟩ & oth . \end{matrix}$
DelObj:: We remove an object $o_{i}$ only if this was already in g or if this was inserted in a previous evaluation of a morphism and, within the evaluation of the current morphism, we remove the original object $o_{i}$ being replaced by $\tilde{o} = ρ (o_{i})$ :

$\begin{matrix} {D EL O BJ}_{Δ (g)} (o_{i}) : = & let \tilde{o} : = {O PT G ET}_{ρ} (o_{i}) in \\ \{\begin{matrix} ⟨ g^{Δ}, Γ^{'}, \emptyset, O^{+}, O^{-} \cup {o_{i}}, E^{-}, ρ ⟩ & \tilde{o} \in O^{+} \\ ⟨ g^{Δ}, Γ^{'}, \emptyset, O^{+}, O^{-} \cup {\tilde{o}}, E^{-}, ρ ⟩ & oth . \end{matrix} \end{matrix}$
Update:: updates one of the object property functions by specifying which of those should be updated. In other words, this is the extension of Put² (Equation (2)) as a higher function for updating the view alongside one of these components:

${U PDATE}_{Δ (g_{i})}^{f} (⟨ i, j ⟩, u) : = \{\begin{matrix} ⟨ O_{i}^{Δ}, P u t_{ℓ_{i}^{Δ}}^{2} (⟨ i, j ⟩, u), ξ_{i}^{Δ}, ϵ_{i}^{Δ}, π_{i}^{Δ}, ϕ_{i}^{Δ}, t_{ϕ, i}^{Δ} ⟩ & f \equiv ℓ \\ ⟨ O_{i}^{Δ}, ℓ_{i}^{Δ}, {P UT}_{ξ_{i}^{Δ}}^{2} (⟨ i, j ⟩, u), ϵ_{i}^{Δ}, π_{i}^{Δ}, ϕ_{i}^{Δ}, t_{ϕ, i}^{Δ} ⟩ & f \equiv ξ \\ ⟨ O_{i}^{Δ}, ℓ_{i}^{Δ}, ξ_{i}^{Δ}, ϵ_{i}^{Δ}, P u t_{π_{i}^{Δ}} (⟨ i, j ⟩, u), ϕ_{i}^{Δ}, t_{ϕ, i}^{Δ} ⟩ & f \equiv π \\ let n : = | dom (t_{i, ϕ}^{Δ}) | in & f \equiv ϕ \\ let \tilde{ϕ} : = P u t_{ϕ_{i}^{Δ}} (⟨ i, j ⟩, {n, \dots, n + | u |}) in \\ let \tilde{t} : = t_{ϕ, i}^{Δ} \oplus ⨁_{0 \leq j < | u |} [(j + n, u (j))] in \\ ⟨ O_{i}^{Δ}, ℓ_{i}^{Δ}, ξ_{i}^{Δ}, ϵ_{i}^{Δ}, π_{i}^{Δ}, \tilde{ϕ}, \tilde{t} ⟩ \end{matrix}$

Concerning the time complexity, we can easily see that all operations take

O (1)

time to compute by assuming efficient hash-based sets and dictionaries. Concerning

{U PDATE}^{ϕ}

, this operation takes

O (1)

time, as we are merely inserting one pair at a time.

4.3.1. Object Replacement and Resolution

As our query language will have variables to be resolved via matched morphisms and view updates (Appendix A.1), we focus on specific variable resolution operations. Replacement operations should be interpreted as a reflexive and transitive closure over the step-wise replacement operations performed while running the rewriting query (Section 6.5 on page 33).

Definition 4

(Active Replacement). The active replacement function resolves any object ID x into its final replacement vertex following the chain of subsequent unique substitutions of each single vertex in ρ, or otherwise returns x:

ρ_{Δ (g)}^{⋆} (x) : = \{\begin{matrix} ρ^{n} (x) & n^{'} = arg max_{n \in N} . x \in dom (ρ^{n}) \land ρ^{n} (x) \neq ρ^{n + 1} (x) \\ x & oth . \end{matrix}

During an evaluation of a morphism to be rewritten, such replacements and changes should be effective from the next morphism while we would like to preserve the original information while evaluating the current morphism.

Definition 5

(Local Replacament). The local replacement function blocks any notion of replacement while evaluating the original data matched by the current morphism while activating the changes from the evaluation of any subsequent morphism where such newly-created vertices from the current morphism will not be considered:

↓ ρ_{Δ (g)} (x) : = \{\begin{matrix} ρ^{⋆} (x) & ρ^{⋆} (x) \notin O^{+} \\ x & oth . \end{matrix}

We consider objects as removed if they have no effective replacements to be enacted in any subsequent morphism evaluation:

x \in dom (ρ) \land x \in O^{-}

. Thus, we also need to resolve objects’ properties (such as ℓ,

ξ

,

π

, and

ϕ

) by considering the changes registered in

Δ (g_{i})

. We want to define a HOF property extraction that is independent of the specific function of choice. By exploiting the notion of local replacement (Definition 5), we obtain the following definition:

Definition 6

(Property Resolution). Given any property access function

ℓ, ξ, ϕ, π

, a GSM database

g_{i}

and a corresponding view

Δ (g_{i})

we define the following property resolution high-order function:

↓ ρ_{Δ (g)}^{f} (o) = \{\begin{matrix} \emptyset & ↓ ρ_{Δ (g)} (o) \in O_{i}^{-} \\ f_{Δ (g_{i})} (↓ ρ_{Δ (g)} (o)) & ↓ ρ_{Δ (g)} (o) \in O_{i}^{Δ} \land f_{Δ (g_{i})} (↓ ρ_{Δ (g)} (o)) \neq \emptyset \\ f_{g_{i}} (↓ ρ_{g_{i}} (o)) & oth . \end{matrix}

where we ignore any value associated with a removed vertex in

O_{i}^{-}

(first case), we consider any value stored in

Δ (g_{i})

as overriding any other value originally in the loaded graph (second case), while returning the original value if the object underwent no updates (last case).

4.3.2. View Materialisation

Last, we define a materialisation function as a function updating a GSM database

g_{i}

with the updates stored in the incremental view

Δ (g_{i})

. We consider all the objects being inserted (implicitly associated with a

1.0

ε

score) and removed, as well as extending all the properties as per the view, thus removing containment relationships originating from or arriving at GSM objects.

\begin{matrix} {MATERIALISE}^{'} (g_{i}, Δ (g_{i})) = & ⟨ O_{i} \cup O_{i}^{+} ∖ O_{i}^{-}, \\ ℓ \oplus ℓ_{i}^{Δ}, \\ ξ \oplus ξ_{i}^{Δ}, \\ ϵ \oplus (⨁_{o \in O_{i}^{Δ} ∖ O_{i}^{-}} [(o, 1.0)]), \\ π \oplus π_{i}^{Δ}, \\ ⨁_{⟨ p, k ⟩ \in dom (ϕ)} [(⟨ p, k ⟩, ϝ (y \mapsto y \notin E_{i}^{-}, ϕ (p, k)))] \oplus ϕ_{i}^{Δ}, \\ (t_{ϕ} \oplus t_{ϕ, i}^{Δ}) ⟩ \end{matrix}

As a rewriting mechanism might add edges violating the recursion constraint, we prune the containments loading to its violation by adopting the following heuristic: after approximating the topological sort by prioritising the object ID, we remove all the containments generating a cycle where the contained object has an ID with a lower value than its container ID. From this definition, we then derive the definition of the update of all the GSM databases loaded in the physical model G with their corresponding updates in

Δ

via the following expression:

MATERIALISE (G, Δ) = μ ({MATERIALISE}^{'}, ζ (G, Δ))

(10)

4.4. Morphism Notation

We consider nested relationships mapping attributes to either basic data types or to not nested schemas, as our query language will syntactically avoid the possibility of arbitrary nesting depths. Given this, any attribute

A_{i}

can nest at a maximum level 1 of depth. This will then motivate a similar requirement for the envisioned operator for composing matched containments (collected in relational tables) into nested morphisms (Section 5).

As our query language requires resolving variables by associating each variable

A_{i}

to the values stored in a specific morphism

Γ

, we need a dedicated function enabling this. We can define a value extraction function for each morphism

Γ

and attribute

A_{i} \in dom (Γ)

, returning directly the value associated with

A_{i}

in

Γ

if

A_{i}

directly appears in the schema of

Γ

(

dom (Γ)

), and otherwise returns the list of all the possible values associated with it within a nested relationship

A_{j}

having

A_{i}

in its domain:

{I DX}_{Γ} (A_{i}) : = let S : = S (Γ) in \{\begin{matrix} ⟨ Γ (A_{i}) ⟩ & S (A_{i}) \in B \\ ⟨ γ_{i} (A_{i}) | γ_{i} \in Γ (A_{j}) ⟩ & \exists! A_{j} . A_{i} \in dom (S (A_{j})) \\ \emptyset & oth . \end{matrix}

(11)

When resolving a variable, we need to determine whether this refers to a containment or to an object, thus selecting to remove the most appropriate type of constituent indicated within a morphism. So, we can define a function similar to the former for extracting the basic datatypes associated with a given attribute:

{TI DX}_{Γ} (A_{i}) : = let S : = S (Γ) in \{\begin{matrix} S (A_{i}) & S (A_{i}) \in B \\ (S (A_{j})) (A_{i}) & \exists! A_{j} . A_{i} \in dom (S (A_{j})) \land S (A_{j}) (A_{i}) \in B \\ \emptyset & oth . \end{matrix}

(12)

We also need a function determining the occurrence of an attribute x nested in one of the attributes of S. This will be used for both automating the discovery of the path

\vec{L}

for joining nested tables from our recently designed operator (Section 5) or for determining whether two variables belong to the same nested cell of the morphism while updating the GSM view. This boils down to defining a function returning

A_{j}

if

A_{i}

is an attribute of a table nested in

A_{j}

, and NULL otherwise.

{I D N EST}_{S} (A_{i}) : = {arg min}_{\begin{matrix} A_{j} \in dom (S) \\ s . t . S (A_{j}) \neq B \end{matrix}} A_{i} \in dom (S (A_{j}))

(13)

Last, we need a function for returning all the object and containment IDs under the circumstance that these contribute to the satisfaction of a boolean expression. We then define such a function returning such IDs at any level of depth of a nested morphism:

S E (Γ) = let S = S (Γ) in {x \in dom (Γ) | S (x) \in B} \cup ⋃_{x \in dom (S), S (x) \notin B} S E (Γ (x))

(14)

5. Nested Natural Equi-Join

Although previous literature defines nested natural join, no known algorithmic implementation is available. As our query language will return nested morphisms by gradually composing intermediate tables through natural or left joins is, therefore, important to provide an implementation for such an operator. This will be required to combine tables derived from the containment matching (Section 6.3) into nested morphisms, where it is required to join via attributes appearing within nested tables (Section 6.4). Our lemmas show the necessity of this operator by demonstrating the impossibility of expressing it via Equation (6) directly, while capturing the desired features for generating nested morphisms.

We propose for the first time Algorithm 3 for computing the nested (left outer) equi-join with a path

\vec{L}

of depth at most 1. The only parameter provided to the algorithm is whether we want a left outer equi-join or a natural one otherwise (isLeft) and, given that the determination of the nesting path will depend on the schema of both the left and right operand, we automate (Line 9) the determination of the

\vec{L} = ⟨ N ⟩

path along which compute the nested join for which, we freely assume that we navigate on the nested schema of the left operand similarly to Equation (6): this assumption comes from our practical use case scenario that we are gradually composing the morphisms provided as a left operand argument with the containment relationships provided as the right operand. Furthermore, to apply the definition from Equation (6) while automating the discovery of the path to navigate to nest the relationship, we require that each attribute appearing from the right table schema might appear as nested in one single attribute from the left table or, otherwise, we cannot automatically determine which left attribute to choose to nest the graph visit (Line 8). Otherwise, we determine a unique attribute from the left table alongside which apply the path descent (Line 9).

Algorithm 3 Nested Natural Equi-Join

1:: function ${N ESTED N ATURAL E QUIJOIN}_{isLeft}$ ( $L, R$ )
2:: $S_{L} : = S (L)$ ; $S_{R} : = S (R)$
3:: $IR : = (dom (S_{L}) ∖ {x \in dom (S_{L}) | S_{L} (x) \notin B}) \cap dom (S_{R})$
4:: if $IR = \emptyset$ then return $L \times R$ ▹ Cross Product
5:: if $⋃ {dom (A_{i}) | A_{i} \in dom (S_{L}) \land S_{L} (A_{i}) \notin B} \cap dom (S_{R}) = \emptyset$ then
6:: if isLeft then return L R else return $L ⋈ R$
7:: end if

8:: assert $|⋃_{x \in dom (S_{R})} {I D N EST}_{S_{L}} (x)| = 1$ ▹ Equation (13)
9:: $N : = min ⋃_{x \in dom (S_{R})} {I D N EST}_{S_{L}} (x)$
10:: $L M : = ⨁_{c \in \frac{L}{{\dot{=}}_{I R}}} [(c (I R), π_{S_{L} ∖ I R} (c))]$
11:: $R M : = ⨁_{c \in \frac{R}{{\dot{=}}_{I R}}} [(c (I R), π_{S_{R} ∖ I R} (c))]$
12:: for all $k \in dom (L M) \cup dom (R M)$ do
13:: if $k \notin dom (R M)$ and isLeft then
14:: for $y \in L M (k)$ yield $k \oplus y$
15:: else if $k \in dom (L M)$ then
16:: for all $y \in L M (k), z \in R M (k)$ do
17:: $y^{'} : =$ copyof y
18:: $y^{'} (N) : =$ if isLeft then return $y^{'} (N)$ Z else return $y^{'} (N) ⋈ z$
19:: yield $k \oplus y^{'}$
20:: end for
21:: end if
22:: end for
23:: end function

The algorithm also takes into account whether no nesting path

\vec{L} = ⟨ N ⟩

is derivable, thus resorting to traditional relational algebra operations: if there are no shared attributes, we boil down to the execution of a cross product (Line 4) and, if none of the attributes from the right table appear within a nested attribute from the left table, then we boil down to a classical left-outer or natural equijoin depending on the isLeft parameter (Line 6).

Otherwise, we know that some attributes from the right table appear as nested within the same attribute N of the left table and that the two tables share the same non-nested attributes. Then, we initialize the join across the two tables by first identifying the nested attribute N from the left (Line 9). Given

IR

, the attributes appearing as nonnested attributes from the left table also appear in the right one, we partition the tables by

{\dot{=}}_{IR}

, thus identifying the records having the same values for the same attributes in

IR

(Lines 10–11). Then, we start producing the results for the nested join by iterating over the values k appearing in either of the two tables (Line 12): if k appears only over the left table and we want to compute a left nested join (Line 13), we reconstruct the original rows appearing from such table and return them (Line 14). Last, we only consider values k for IR appearing on both tables and, for each row y from the left table having values in k, we compute the left (or natural equi-)join of

y (N)

with each row z from the right table and combine the results with k (Line 18).

Properties

We prove that

{}^{\vec{L}}⋈

cannot trivially boil down to ⋈ unless

\vec{L} = \emptyset

; otherwise, if

A_{i}

is in

\vec{L}

not appearing as an attribute for the to-be-joined table schemas, we will be left out with the left table and not a classic un-nested natural join. Proofs are postponed to Appendix B.2.

Lemma 2.

Given

S (t) = S

and

\vec{L} = ⟨ A_{1}, A_{2}, \dots, A_{n} ⟩

, if

A_{1} \notin dom (S)

, then

t {}^{\vec{L}}⋈ s = t

As this is not a desired feature for an operator whose application should be automated, this justifies the need for a novel nested algebra operator for composing nested morphisms, which should shift to left joins [37] for composing optional patterns provided within the right operand (isLeft), while also backtracking to natural joins or cross products if no nested attribute is shared across matched containments. The following lemma discards the possibility of the aforementioned limitation to occur from our operator, by instead capturing the notion of cross-products when tables are not sharing non-nested attributes.

Lemma 3.

Given tables L and R, respectively, with schema S and U with non-shared attributes (

dom (S) \cap dom (U) = \emptyset

), either NestedNaturalEquijoin_False(

L, R

) or NestedNaturalEquijoin_True(

L, R

) compute

L \times R

.

We also demonstrate that the proposed operator falls back to the natural join when no attribute nested in the left operand appears in the right one, while also capturing the notion of left join by changing the isLeft

Lemma 4.

Given tables L and R, respectively, with schema S and U where no nested attribute appearing in the left table appears in the schema of the second, thenNestedNaturalEquijoin_false(

L, R

)

= L ⋈ R

andNestedNaturalEquijoin_true(

L, R

)

= L

R.

The next lemma observes that our proposed operator not only nests the computation of the join operator within a table, but also implements an equijoin doing a value match across the table fields that are shared within the shallowest level. This is a desideratum to guarantee the composition of nested morphisms within the same GSM database ID, thus requiring sharing at least the same dbid field (Section 6.3). Still, these operations cannot be expressed through the nested join operator available from the current literature (Equation (6)).

Lemma 5.

Given tables L and R, respectively, with schema S and U, that is

S (L) = S

and

S (R) = U

, where the left table has a column N (

N \in dom (S)

) being nested (

S (N) \notin B

) and also appearing in the right table (

N \in dom (U)

), NestedNaturalEquijoin_false(

L, R

) cannot be expressed in terms of

L {}^{⟨ N ⟩}⋈ R

for

N \in dom (S) \cap dom (U)

,

N \in dom (S (N))

, and

dom (S (N)) \cap dom (U) \neq \emptyset

.

6. Generalised Graph Grammar

After a preliminary and example-driven representation of the proposed query language (Section 6.1), we characterise the semantics of the proposed query language in terms of procedural semantics being subsumed in Algorithm 4. This is defined by the following phases: after determining the order of application of the matching and rewriting rules (Line 2), we match and cache the traversal of each containment relationship to reduce the number of accesses to the physical model (Line 3), from which we then proceed to the instantiation of the morphisms, to produce the

M T [\cdot, \cdot]

table (Line 4). This fulfills the matching phase. Finally, by visiting the objects from each GSM database in reverse topological order, we then access each morphism stored in

M T [\cdot, \cdot]

for then applying (Line 5) the rewriting rules according to the sorting in Line 2. As this last phase produces the views for each

g_{i}

GSM database we then materialise this view and store the resulting logical model on disk (Line 6). Each of the forthcoming sections discusses each of the aforementioned phases.

Algorithm 4 Generalised Graph Grammar (gg) evaluation

1:: function GeneralisedGraphGrammars(gg, $G = {g_{1}, \dots, g_{n}}$ ) ▹ $G ≃ d b$ (Lemma 1)
2:: SortRules(gg) ▹ Algorithm 5
3:: CacheIntermediateResults(gg, $d b$ ) ▹ Algorithm 6
4:: InstantiateMorphisms(gg, $d b$ ) ▹ Algorithm 7
5:: $Δ : =$ GenerateGraphViews(gg, G) ▹ Algorithm 8
6:: return materialise( $G, Δ$ ) ▹ Equation (10)
7:: end function

6.1. Syntax and Informal Semantics

We now discuss our novel’s proposed matching and rewriting language by taking inspiration from graph grammar. To achieve language declarativeness, we do not force the user to specify the order of application of the rules as in graph rewriting.

Figure 4 provides the Backus-Naur Form (BNF) [48] for the proposed query language for matching and rewriting object-oriented databases by extending the original definition of graph grammars. Each query (gg) is a list of semi-colon-separated rules (rule), where each of those describes a matching and rewriting rule,

L_{i} \to_{Θ} R_{i}

. For each uniquely identified rule (pt), we identify a match

L_{i}

(obj cont+joining*) and an optional (?) rewrite

R_{i}

(op*obj). Those are separated by an optional condition predicate

Θ

(Appendix A.2 on page 43), providing the conditions upon which the rewriting needs to be applied to the database view, and ↪.

L_{i}

is characterised by one single entry-point similarly to GraphQL [49] as well as other navigational approaches to visiting graph-based data [50], thus defining the main GSM object through which we relativise the graphs’ structural changes or update around its neighbouring vertices, as defined by its ego-network cont of objects being either contained (- - clabel -> obj) or containing (<- clabel - - obj) the entry-point obj. While objects should be always referred to through variables, containment relationships might be optionally referred to through a variable. Edge traversal beyond the ego-net is expressed through additional edges (joining). We require that at least one of the edges should be a mandatory one. Differently from Cypher, we can match containments by providing more possible alternatives for the containment label rather than just considering one single alternative: this acts as the SPARQL’s union operator across differently matched edges, each for a distinct edge label. Please observe that this boils down to a potentially polynomial sub-problem of the usual subgraph isomorphism problem, being still in NP despite the results in Section 8 proposed for the present query language.

Up to this point, all these features are shared with graph query languages. We now start discussing features extending those: we generate nested embeddings by structurally grouping entry-point matches sharing the same containing vertex: this is achieved by specifying the need to group an object (≫) along one of its containment vertices via a containment relationship remarked with ∀. Last, we use return statements in the rewritings to specify which entry-point objects should be replaced by any other matched or newly created objects.

Example 1.

Listing 1 expresses the graph grammar rules from Figure 2 in our proposed language with minor extensions: we achieve declarativeness when associating multiple string values coming from nested vertices (and therefore, associated with a single variable) to one single vertex, as strings will be normally space-joined (Line 12) with a syntax equivalent to setting such properties where no nestings are in a morphism (Line 2). A return statement at Line 22 guarantees that, while considering matching a GSM database from Figure 3a, objects 2 and 3 for Alice and Bob in X will be replaced by the newly instantiated “Alice Bob” object h when considering the subsequent creation of the edge “plays” (Line 31). This remarks the need for visiting the GSM database in topological order to minimise the rewriting costs when the updates are applied. This is also guaranteed by the matching assumption only considering objects within the entry-point’s ego-net, as we ensure to pass on the information by layers via return statements. Figure 5a shows the result of this rewrite.

Figure 4. Proposed language for Graph Grammars over the GSM expressed in ANTLR4-flavoured BNF notation [51]. Terminal symbols are expressed in green.

s c r i p t

refers to a double-quoted string representing a program in the Script v2.0 language [34]. Similarly to Regular Expression syntax [48], ? refer to optional sub-expressions, + (or *) indicate one (or zero) or more occurrences of a given sub-expression.

Figure 4. Proposed language for Graph Grammars over the GSM expressed in ANTLR4-flavoured BNF notation [51]. Terminal symbols are expressed in green.

s c r i p t

refers to a double-quoted string representing a program in the Script v2.0 language [34]. Similarly to Regular Expression syntax [48], ? refer to optional sub-expressions, + (or *) indicate one (or zero) or more occurrences of a given sub-expression.

Figure 5. Applying the rewriting rules expressed in Figure 2 to the graph originally presented in Figure 3a: different colours refer to different matching rules. Filled vertices in the left (and right) graph refer to the distinct vertex entry-points (and newly generated components), while uparrows ↑ are used to differentiate containment IDs from the ones for the objects. (a) Generating a binary relationship between the subject as a single entity and the direct object. (b) Morphisms

M [p_{3}, g_{0}]

. (c) Morphisms

M [p_{2}, g_{0}]

, where * refers to sub-matches nested over the entry point (See Algorithm from Section 6.4).

Figure 5. Applying the rewriting rules expressed in Figure 2 to the graph originally presented in Figure 3a: different colours refer to different matching rules. Filled vertices in the left (and right) graph refer to the distinct vertex entry-points (and newly generated components), while uparrows ↑ are used to differentiate containment IDs from the ones for the objects. (a) Generating a binary relationship between the subject as a single entity and the direct object. (b) Morphisms

M [p_{3}, g_{0}]

. (c) Morphisms

M [p_{2}, g_{0}]

, where * refers to sub-matches nested over the entry point (See Algorithm from Section 6.4).

Listing 1.

Expressing the graph grammar rules represented visually in Figure 2 in our proposed language (file: paper_simple.txt).

When grouping entry-points, we require those to be grouped over one same containing object, to unambiguously refer the nested entry-points to one single morphism. This allows the query language to coalesce morphisms.

Example 2.

The usefulness of a nested morphism representation can be promptly shown with the example in Figure 5 while focusing on the morphism tables referring to the matching of the subject-verb.object structure of a sentence (Figure 5b). Each morphism can contain two distinct nested relationships, one referring to the subject (S) and one to the object (Z). The possibility of representing such values in a nested morphism allows us to better group vertices to be considered while referring to those with the same variable while keeping unique entry-point instances.

Example 3.

With reference to the morphism resulting from matching (con-)/dis-junctions with a sentence (Figure 5c), entry-point grouping allows the creation of one single vertex matching as a single subject for the sentence, thus ensuring the creation of one final single vertex per group of matched entry-points.

We also show how the language allows us to break the declarativeness assumption when we want to specifically compose values according to a specific value composition function:

Example 4.

The user is free to break this declarative assumption by directly specifying the order of combination when it is required to combine the values from different variables. This can be observed in a longer query considering more morphosyntactic language features, which is provided online (https://github.com/datagram-db/datagram-db/blob/v2.0/data/test/einstein/einstein_query.txt, accessed on 18 July 2024). This can be used to fully rewrite the database as per Figure 6. As the creation of the will-not-have containment in this will require combining values from vertices 3, 8, and 9 and, respectively, associated with variables V,A, and N, we can use a scripting language as a direct extension of Script v2.0 [34] for determining the order of strings’ composition. Please observe that this formulation, contrary to Neo4j’s APOC in v5.20, also supports optional object matches, where the values associated with non-existingNULLobjects are resolved into empty strings (see Proof for Lemma 7 in Appendix B.3).

Figure 6. Applying the rewriting rules expressed in Figure 2: different colours refer to different graph grammar rules (b and c), filled vertices in the left (and right) graph refer to the distinct vertex entry-points (and newly generated components). (a) Dependency graph for “Matt and Tray believe that either Alice and Bob and Carl play cricket or Carl and Dan will not have a way to amuse themselves”. While object IDs are presented as numbers, containment IDs are omitted. (b) Generating a binary relationship between the subject as a single entity and the direct object.

By explicitly expressing a containment relationship across the nested entry-point X via so-called hook containments defining equivalence classes, we split the nested morphism

Γ

into as many new morphisms as the equivalence classes in

\frac{{Idx}_{Γ} (X)}{ℜ}

(see Section 6.3).

Example 5.

We appreciate the usefulness of such morphism splitting while looking at a more convoluted example, as the one considering the rewriting in the sentence depicted in Figure 6a: vertices Matt and Trayand play and havefrom conjunctions but at different branches of the sentence structure. Furthermore, all four constituent vertices have the same containing vertex believe for which, if no hook relationship was considered, they would have been added within the same morphism as per the previous example.

Still, given that those vertices are associated with different conjas they appear in different coordinating conjunctions, we can use this as a hook relationship to distinguish those, for then obtaining two separate morphisms as illustrated by the first two rows in Figure 7b. Thus, hooks help in splitting nested entry-points structurally by identifying similar elements through structural information.

Figure 7. Resulting morphisms from the application of the graph grammar rules from Listing 1 over the GSM database in Figure 6a, from which the resulting rewritten database Figure 6b is then obtained. (a) Morphisms

M [p_{1}, g_{1}]

. (b) Morphisms

M [p_{2}, g_{1}]

, where * refers to sub-matches nested over the entry point (See Algorithm from Section 6.4). (c) Morphisms

M [p_{3}, g_{1}]

.

Last, these examples provide an intuitive motivation for why the matching within our query language can express distances of at most one containment relationship from the entry-point match. We want to guarantee that, given two objects matching the query entry-points, located at different distances from the vertex appearing last within the reverse topological order, these are still reachable within the distance of crossing a single containment. Similarly to semistructured data literature, we refer to one as its (direct) ancestor and to the other as its (direct) descendant. If the entry point considering in its match the aforementioned direct descendant object is replaced with another object, be it recently created during the application of the rewriting rules or already existing within the database, we want this information to be passed directly during the execution of the rewriting associated with the match of the object of the direct descendant. For this, we need both an explicit return mechanism, which allows the possibility of explicitly telling the objects appearing at the higher layers induced by the topological order that the previous object has been replaced, and to keep the match size compact, so that we can guarantee that any entry-point value updated at a lower level is retrieved immediately.

6.2. Determining the Order of Application of the Rules

We determine the application order of our language rules for each entry-point vertex of interest (Algorithm 5). This boils down to solving a scheduling problem, which requires first determining the interdependencies across the graph grammar rules. All the matching constructs

L_{i}

for each rule

L_{i} \to_{Θ} R_{i}

in our query gg have variables that might be shared across morphisms.

Algorithm 5 Sorting the Graph Grammar rules by application order

1:: procedure SortRules(gg)
2:: $V : = {p_{i} | p_{i} = L_{i} \to_{Θ} R_{i} \in gg}$
3:: $E : = {p_{i} \to p_{j} | p_{i} . res \neq NULL \land p_{i} . res = p_{j} . entrypoint \lor$
3:: $p_{i} . entrypoint = p_{j} . entrypoint, p_{i} \in V, p_{j} \in V}$
4:: $G : = (V, E)$
5:: time $: =$ LayerFromTopologicalSort(G, ${ApproxV}_{topo}$ (G))▹ Algorithm 1
6:: sort each x in gg by time $[x]$ in ascending order
7:: end procedure

As per Example 1, each rewriting

R_{i}

might replace the entry-points with a single new object, or we preserve the previously matched ones otherwise. These are then input to any later morphism being considered while applying the rewritings. For this, we might consider the variables across patterns as hints to the query language on how the updated or matched objects are going to influence their updates, thus declaring their interdependencies. By reducing this to a dependency ordering, we consider the dependency graphs for the matching and rewriting rules, (Line 4), where each vertex represents a rule (Line 2). Each edge connecting two vertices (or patterns) represents a connection between the entry-point or returned variable from the source pattern and any other non-entrypoint variable occurring in the target pattern (Line 3). As the resulting graph might contain loops as some patterns might exhibit mutual dependencies, we are then forced to run an approximated topological sorting algorithm (Line 5) to determine an approximated scheduling order across such tasks through a DFS visit of the graph [52]: we start traversing the graph from the first rule appearing in the graph grammar while avoiding visiting edges leading to already-visited vertices; if we visited all the vertices reachable from such initial vertex while still having unvisited ones, we recommence the same visit starting from the earliest vertex that was not already visited. We add each visited vertex inside a stack after each of its children has been fully visited. By doing so, we prioritise the rules’ declaration order which then acts as a heuristic for guiding the algorithm to decide upon a specific visiting order.

6.3. Containment Matching

With Algorithm 6, we define the steps realising containment matching for each

L_{i}

from a rule

p_{i}

to later on generate a morphism table

M T [L_{i}, g_{j}]

per GSM database

g_{j}

, as discussed in Section 6.4. The algorithm works as follows: (i) after caching the PhiTable^κ referenced by the containments in the matching patterns to minimize the tables’ access in a uniform schema specification, (ii) we specialise such tables to the specific schema induced by the variables’ names and nesting occurring in the matching

L_{i}

. Last, (iii) we collect the matching containments by separating them between required or optional ones.

Algorithm 6 Intermediate Edge Result Caching

1:: procedure UniqueCacheId( $c = (q, isOut), L$ )
2:: global queryCache, queryMap, emptySet
3:: if $L = \emptyset$ then emptySet = emptySet $\cup {c}$ else
4:: for all $l \in L$ do
5:: queryCache = queryCache $\cup {l}$
6:: end for
7:: end procedure
8:: function FromDB( $κ, d b$ )
9:: $t : = ⟨ ⟩; S (t) : = [(dbid, N), (src, ni), (edge, ci), (edgeLabel, Σ^{*}), (dst, ni)]$
10:: if ${PhiTable}^{κ} \in d b$ then
11:: $t : = ⟨ [(dbid, i), (src, j), (edge, r), (edgeLabel, κ), (dst, d)] ∣$
11:: $r < | {PhiTable}^{κ} |, {PhiTable}^{κ} (r) = ⟨ l, i, j, w, d, ι ⟩ ⟩$
12:: end if
13:: return t
14:: end function
15:: function ToTable( $t, x, y, l_{x}, n e s t C o n t$ )
16:: if $l_{x} = NULL$ then $t : = R_{src, dst \to x, y} (π_{dbid, src, dst} (t))$
17:: else $t : = R_{src, edge, edgeLabel, dst \to x, l_{x}, l_{x} + “_{label}^{”}, y} (t)$
18:: if nestCont then $t : = ν_{l_{x}, y \to y} (t)$
19:: return t
20:: end function

21:: procedure CacheIntermediateResults(gg, $d b$ )
22:: global queryCache, queryMap, emptySet
23:: for all $p_{i} = L \to_{Θ} R \in gg$ s.t. $L \equiv ⟨ e p, i n, o u t, j o i n, hook ⟩$ do
24:: for all $q \in i n$ s.t. $q = ⟨ u, l, l_{x}, all, opt ⟩$ do UniqueCacheId( $(q, false), l$ )
25:: for all $q \in o u t$ s.t. $q = ⟨ u, l, l_{x}, all, opt ⟩$ do UniqueCacheId( $(q, true), l$ )
26:: for all $q \in j o i n$ s.t. $q = ⟨ u, l, l_{x}, all, opt, v ⟩$ do UniqueCacheId( $(q, true), l$ )
27:: queryCache = queryCache $\cup hook$
28:: end for
29:: if emptySet $\neq \emptyset$ then queryCache $= {κ | {PhiTable}^{κ} \in d b}$
30:: cache $: = ⨁_{x \in q u e r y C a c h e} [(x, F R O M D B (x, d b))]$
31:: for all $p_{i} = L \to_{Θ} R \in gg$ s.t. $L \equiv ⟨ ⟨ e p, {agg}_{e p} ⟩, i n, o u t, j o i n, hook ⟩$ do
32:: ${hook}_{i} : = π_{dbid, src, dst} (⊔_{j \in hook} T O T A B L E (cache (i), src, dst, NULL, false))$
33:: for all $q \in o u t, ι \in q u e r y C a c h e (q, true)$ s.t. $q = ⟨ ⟨ u, {agg}_{u} ⟩, l, l_{x}, all, opt ⟩$ do
34:: $t : = T O T ABLE (c a c h e (ι), e p, u, l_{x}, all and {agg}_{u})$
35:: if opt then $o p t_{i} : = o p t_{i} \cup {t}$ else $r e q_{i} : = r e q_{i} \cup {t}$
36:: end for
37:: for all $q \in i n, ι \in q u e r y C a c h e (q, false)$ s.t. $q = ⟨ ⟨ u, {agg}_{u} ⟩, l, l_{x}, all, opt ⟩$ do
38:: $t : = T O T ABLE (c a c h e (ι), u, e p, l_{x}, false)$
39:: if opt then $o p t_{i} : = o p t_{i} \cup {t}$ else $r e q_{i} : = r e q_{i} \cup {t}$
40:: end for
41:: for all $q \in j o i n$ s.t. $q = ⟨ ⟨ u, false ⟩, l, l_{x}, all, opt, ⟨ v, {agg}_{v} ⟩ ⟩$ do
42:: $t : = T O T ABLE (c a c h e (ι), u, v, l_{x}, all and {agg}_{u})$
43:: if opt then $o p t_{i} : = o p t_{i} \cup {t}$ else $r e q_{i} : = r e q_{i} \cup {t}$
44:: end for
45:: end for
46:: end procedure

6.3.1. Pseudocode Notation for $L_{i}$

We describe the notation used in our pseudocode being the machine readable representation of the query syntax discussed in Section 6.1.

We define each object variable as a pair

⟨ x, agg ⟩ \in N = Σ^{*} \times {0, 1}

, where x is the variable name and agg denotes whether the vertex should be aggregated (1) or not (0). In our pseudocode, each matching

L_{i}

is represented as the tuple

⟨ e p, i n, o u t, j o i n, hook ⟩

, where

e p \in N

specifies the pattern entry-point and each ingoing (or outgoing) edge is represented by a pair

⟨ u, l, l_{x}, all, opt, v ⟩

, where

u \in N

remarks the variable associated with the container (or contained) object alongside the containment relationship through

ϕ

and

t_{ϕ}

,

l \in ℘ (Σ^{*})

provides a potentially empty-set of containment relationships,

l_{x}

is an optional containment variable, all

\in {0, 1}

determines whether the edges should be considered in the aggregation or not, and opt

\in {0, 1}

determines whether the match should be considered as optional or not. The join edges extend such records as

⟨ u, l, l_{x}, all, opt ⟩

by specifying both the containment (

v \in N

) and container (

u \in N

) variable explicitly, and hook

\in ℘ (Σ^{*})

determines whether the aggregated entry-points over the single incoming container should be subdivided in equivalence classes according to the containment labels in hook, and we perform no clustering otherwise.

6.3.2. Procedural Semantics for Matching and Caching Containments

We now narrate the operations required to match each containment occurring across patterns while representing those as relational tables expressing either required (req_i), optional (opt_i), or hook-driven equivalence relationships (hook_i) per pattern

p_{i}

.

First, we define the semantics associated with the matching of each object ego-net as described in ToTable from Algorithm 6 (Line 15): either containment (src)- -[ $l_{x} : L$ ]->(dst) or (dst)<-[ $l_{x} : L$ ]- -(src) are represented as records with fixed schema (Line 9) where each record refers to a single containment

ι

(edge) in

ϕ (src, ℓ)

for a container src, where

⟨ dst, w ⟩ = t_{ϕ} (ι)

and dst refers to the contained object; in this, we also retain the containing db as dbid and the containment

ℓ \in L

(edgeLabel). At this stage, all containments are associated with the same schema and are not specialised to abide by a specific schema induced by a matching specification. This allows us to easily cache PhiTable^κ containments (Line 30).

Next, we discuss how we specialise the results from the cache according to the schema induced by the variables occurring in each matching

L_{i}

. This is carried out by renaming generic containing/containment/contained object labels (src/edge/dst) with the variable names in

L_{i}

associated with them; if the patterns also contain references to the edge variable (

l_{x}

), we also retain each ID in

ϕ_{x} (u, L)

as

l_{x}

, and

l_{x}

’s label (Line 17) and we discard such information otherwise (Line 16). If the edge expresses an aggregation from the container to the content (e.g., (src)–[ $\forall L$ ]–(≫ dst)), we nest

l_{x}

(if available) and dst from the table obtained at the previous step (Line 18). This makes the major difference with existing graph query languages, as we consider containment identifiers as a separate class from object ones (thus differently from SPARQL) and we also produce nested morphisms according to the query annotations.

Last, we collect the tables while differentiating where those are associated with a required or optional pattern (Lines 35, 39, and 43), acting as the basic building step for defining nested morphisms as in the subsequent section.

6.3.3. Algorithmic Choices for Optimisation

After discussing the procedural semantics of the matching of containments occurring across all

L_{i}

-s, we describe the algorithmic choices for optimisation purposes. First, to minimize the access to a relational database, we ensure that each

{PhiTable}^{κ}

is accessed at most once across all the edges by collecting all the labels of interest across table-matched containments in queryCache (Line 5). If some containments have no containment attribute specified, we remember the existence of such containment (Line 3) for which we will then require considering all the

{PhiTable}^{κ}

records, as a containment for which no label is specified forces to consider all the containments (Line 29). Only after this, we access the physical model to transform each PhiTable^κ per

κ \in

queryCache to a containment table to be then composed into the final nested morphism table, thus ensuring minimal memory access (Line 30).

6.4. Morphism Instantiation and Indexing

Algorithm 7 combines each table produced from the previous phase to generate the morphisms describing the result of matching

L_{i}

through the generation of morphisms being recorded within an

M T [L_{i}, g_{i}]

table for each GSM database

g_{i}

. Similarly to SPARQL’s triple navigation semantics, we generate the morphisms

P_{i}

for all GSM databases by natural joining all the tables associated with the mandatory containments (Line 10), while left-joining

P_{i}

(the resulting table from the natural join of the required containments) against the optional patterns set and updating

P_{i}

with such a result (Line 13).

Algorithm 7 Morphism instantiation and indexing

1:: procedure InstantiateMorphisms(gg, $G = {g_{1}, \dots, g_{n}}$ ) ▹global $M T$
2:: for all $p_{i} = L_{i} \to_{Θ} R_{i} \in g g$ do
3:: global req_i, opt_i, hook_i ▹ Algorithm 6
4:: $\bar{e} : = L . entrypoint$
5:: if $| r e q_{i} | = 0$ then continue
6:: sort each t in req_i by $| t |$ in ascending order
7:: $P_{i} : = r e q_{i} (0)$
8:: for $j = 1$ to $| r e q_{i} | - 1$ do
9:: if $| P_{i} | = 0$ then break
10:: $P_{i} : = {N ESTED N ATURAL E QUIJOIN}_{false} (P_{i}, r e q_{i} (j))$
11:: end for
12:: if $| P_{i} | = 0$ then continue
13:: $P_{i} : = Λ ({N ESTED N ATURAL E QUIJOIN}_{true}, P_{i}, o p t_{i})$
14:: if L.entrypoint = $(≫ z)$ and $\exists x, obj . < - [\forall x : \dots] - - obj \in cont$ then
15:: $\bar{e} : = obj$
16:: $P_{i} : = ν_{S (P_{i}) ∖ {dbid, x, x_label} \to ⋆} (P_{i})$
17:: hook $: =$ Trans(Symm(Refl(hook_i)))
18:: $ℜ : = {ℜ_{i} | g_{i} \in G F \land t_{ℜ_{i}} . s \Leftrightarrow (i, t (z), s (z)) \in hook}}$
19:: for all $Γ \in P_{i}$ and $γ_{i} \in \frac{Γ (⋆)}{ℜ_{Γ (dbid)}}$ do
20:: $M T [L_{i}, Γ (graph)]$ .add( ${Γ |}_{{graph, x, x_label}} \oplus [(⋆, γ_{i})]$ )
21:: end for
22:: else
23:: for all $Γ \in P_{i}$ do $M T [L_{i}, Γ (graph)] . add (Γ)$
24:: end if
25:: for all $g_{j} \in G$ do
26:: sort each $Γ$ in $M T [L_{i}, g_{j}]$ by $O_{rtopo} (Γ (\bar{e})))$ in ascending order
27:: end for
28:: end for
29:: end procedure

We further nest the morphism if and only if the entry-point is aggregated via a single containment object obj (Line 14), for which we then nest in a fresh attribute ⋆ all the attributes except the database where obj is contained, obj itself, and optionally the edge labels for the containment if the pattern exhibits its variable (Line 16).

Hooks derive an equivalence relationship

ℜ_{i}

per GSM database

g_{i}

having a ⋆-nested morphism through which to optionally split the morphisms. We retain only the container and containment relationship and their containing database ID (Line 32 from Algorithm 6), for then obtaining a suitable equivalence relationship

ℜ_{i}

by computing the reflextive, symmetric, and transitive closure of that relationship (Line 17). Then, we potentially split the table nested in ⋆according to the equivalence classes associated with each equivalence relationship

ℜ_{i}

obtained from hooks (Line 19) and update the morphism table accordingly.

Concerning the composition of cached tables, to reduce the equi-join time we first sort the tables by increasing size (Line 6) for then stopping the joins as soon as we find or compute an empty morphism, thus entailing that no collection of objects across all dbs can match

L_{i}

(Lines 5 and 9). As a last optimisation, we populate

M T

collecting the morphisms by database ID and rule ID (Lines 20 and 23). Last, we sort each morphism in

M T [L_{i}, g_{j}]

by entry-point (Line 4) reverse topological order (Line 26) or, if these were nested in ⋆, by their container object (Line 15). This induces a primary block indexing mapping each elected vertex

\bar{e}

to a set of morphisms containing it.

6.5. Graph Rewriting Operations (`op` from $R_{i}$ )

Finally, we apply the transformation operations required by the rewriting side of the rule for each instantiated morphism in

M T

across all GSM databases. This works by keeping track of the desired changes within a GSM view per loaded GSM database.

We now discuss Algorithm 8. For updating GSM views, we apply the rewriting query for each database

g_{j}

as described by the rewriting patterns

R_{i}

in gg (Line 2): we visit per GSM database its objects according to the reverse topological order from

O_{rtopo} (g_{i})

(Line 4) while retaining the objects v appearing in the aforementioned primary block index of a non-empty morphism table

M [L_{i}, g_{j}]

for each production rule

p_{i} = L_{i} \to_{Θ} R_{i} \in gg

(Line 5): we skip the morphisms

Γ

associated with v if either a previously matched vertex was deleted and not replaced with a new one (Line 6), or if the current morphism

Γ

does not satisfy a possible WHERE condition

Θ

associated with

L_{Θ}

(Line 8). For the remaining morphisms, we run the operations listed in

R_{i}

in order of appearance (Line 9).

Algorithm 8 Rewriting phase

1:: function GenerateGraphViews(gg, $G = g_{1}, \dots, g_{n}$ )
2:: for all $g_{i} \in G$ do
3:: $Δ (g_{i}) : =$ new GraphView( $| V (g_{i}) |$ )
4:: for all $v \in O_{rtopo} (g_{i})$ s.t. $v \notin O_{i}^{-}$ do ▹ Section 4.1
5:: for all $p_{i} = L_{i} \to_{Θ} R_{i} \in gg$ s.t. $M T [L, g_{i}] \neq \emptyset$ do
6:: for all $Γ \in M T [L, g_{i}]$ s.t. $\forall col . \neg {Optional}_{L} (col) \Rightarrow Γ (col) \notin O_{i}^{-}$ do
7:: $Δ (g_{i}) : = {S TART}_{Δ (g_{i})} (Γ)$ ▹ Equation (9)
8:: if $Θ \neq true$ and $|{[[Θ]]}_{Γ, g_{i}}^{1, M T}| = 0$ then continue ▹Figure A2
9:: $“^{”}, T, p, Δ (g_{i}), M T : = ν (R, T, p, Δ (g_{i}), M T)$ ▹ Figure 8
10:: end for
11:: end for
12:: end for
13:: yield $Δ (g_{i})$
14:: end for
15:: end function

We now discuss the definition of

ν

through SOS semantics enclosed within the evaluation of Line 9 in detail. Figure 8 describes the interpretation of all rewriting rules

R_{j}

updating the GSM view

Δ (g_{i})

, where the first three updates and exploit the functions from Section 4.3. All the y-s are interpreted as evaluated expressions without intermediate assignments. Rule NewObject creates a new object and refers it to the variable j. Rule DelObject deletes a pre-existing or a newly-created object, and Rule DelObject deletes a single container-containment relationship that was defined at loading time. We can easily distinguish between variables referring to objects or containments by the simple type associated with the attribute. For now, we do not allow the explicit removal of containments at run-time unless the containment is explicitly specified via containment update.

We discuss the set update for the vertices’ label values; we distinguish the following cases: if both the number of variable and values are in the same number, we assign for each i-th variable the u-th occurring resolved value (labelZip); otherwise, we assign to each object associated with the resolved variable the collapsed string values (LabelValFlat). We can obtain similar rules for

ξ

which are omitted here for conciseness, but that can be retrieved in the appendix (Figure A4).

Figure 8. Graph Rewriting SOS for view update.

We treat the property

π

and the containment

ϕ

updates differently, as they deal with the resolution of three variables. We are also interested in whether different resolved variables belong to the same nested table within the morphism or not, with the rationale being that we can freely associate each value within the same nesting row-by-row (P2Zip, P2’Zip, and P2″Zip), while we need to compute the cross-product between the assignments if the value belongs to distinct nestings (Clearly in P3Zip). This is determined via the second parameter of expression evaluation

\to_{E}

(Appendix A.3) by transferring the attribute information originated by the variable resolution

\to_{ρ}

(Appendix A.1). In all the remaining occasions, we arbitrarily decide to flatten the associated expressions. Please observe that, if the querying user wants more control over the precise value to be associated, they can always refer to SCRIPT expressions, thus breaking the declarative assumptions for a more controlled output.

Even in this case, we can formulate the rules for setting the

ϕ

-s similarly to our previous discussion for the

π

-s. Therefore, we defer to Figure A5 provided in the Appendix A.4.

We conclude by analysing the semantics associated with the replacement via

R_{i}

’s return statement, the last and mandatory operation for declared rewriting operations. We apply no rewriting if the returned variable is the variable entry-point (NoRewr). Otherwise, if the entry-point variable is not aggregated, we resolve the replacement and entry-point (Repl and Orig, respectively) and we replace any object in Orig associated with the entry-point variable

p^{'}

with an object in Repl associated with the replacing variable

p^{'}

(NoAggrRewr). Otherwise (AggrRewr), the rewriting occurs by replacing the objects in Orig, associated with the entry-point’s container and the objects in Repl, associated with the returned variable p′. Furthermore, given C the containment labels for which the entry-point is contained by its aggregating object in cont, we also update the containing for the cont objects to also contain via C the replacing objects in Repl. As this provides the final update, we then consider this last resulting GSM view of the resulting view for our rewriting step.

As no SOS rule matches the empty string, no further operation is conducted, and the rewriting program terminates after considering the final rewriting statement.

7. Language Properties

Given the previous language description, we want to characterise its properties by juxtaposing them with Cypher’s. Full proofs are provided in Appendix B.3. We start by showing that, unlike current graph query languages, we propose a rewriting language framed as generalised graph grammars: we relate our proposed language to the graph grammars as, similarly to these, the absence of a matched pattern leads to no view updates. Still, we claim that such language provides a generalisation of the former by supporting explicit data-aware update operations over the objects and containments, while also defining explicit semantics determining the order of the application of such rules, both across rules and within each GSM database.

Lemma 6.

If either we query all the GSM databases with no rules, or all the rules have no rewritings, or none of the matches returned a morphism, or none of the ones being matched pass the associated Θ condition, then we return the same GSM databases as the ones being originally loaded.

Next, we ensure that the rules are applied in reverse topological order, thus minimising the restructuring cost of the GSM database while achieving declarativeness for rule application, as the user does not specify this within the query formulation, as no matched pattern leads to no view updates.

Property 1.

The rules are applied in (reversed) topological order while visiting each GSM.

On the other hand, we can show that Cypher forces the user to specify in which order the updates shall be applied, thus breaking the declarative assumption for a good query language.

Lemma 7.

The application of the rewriting in Cypher in lexicographical order requires explicitly determining the order of the morphisms’ application and the order of the graph’s visit.

Next, we state the usefulness of an explicit return statement within the interpretation of a rule as it allows us to propagate the updates on the subsequently evaluated morphisms.

Lemma 8.

If an entry-point match is deleted and a new object is re-created inheriting all of its properties, this new object will not be connected to the entry point’s containers unless the newly-returned object was applied.

To a minor extent, we also show a greater amount of declarativeness if compared to current graph query languages by automatically generating a new object if an update operation occurs with reference to an optionally matched variable that was not matched to an existing object.

Lemma 9.

The setting of properties to variables not associated with a newly-created object variable and not associated with an actual GSM object due to an optional match reduces to the creation of a new object associated with the variable, which then always receives this and any following update associated with the same variable.

At this point, it could be argued that, although our proposed rewriting language performs queries that cannot be expressed in other graph query languages, this does not return the matched subgraphs as in such other languages, similar to Cypher’s return statement due to the considerations from Lemma 6. The following Lemma shows otherwise. Thanks to this, this language is considered more expressive than Cypher.

Lemma 10.

The proposed graph grammar query language can express traversal queries retaining only the objects and containments being matched and traversed.

Corollary 1.

The proposed graph query language is more expressive than current graph query languages.

8. Time Complexity

We now study the computational complexity associated with the algorithms discussed in the previous section and infer this from the implementation details discussed while reasoning on the SOS discussion. Proofs are postponed to Appendix B.4. Please observe that, as previously noted from previous graph query language literature (Section 3.1.2), the following results do not intend to prove P vs NP, as we are deliberately expressing a sub-problem of the more generic subgraph isomorphism problem that can be easily captured through algebraic operations.

Lemma 11.

The time complexity of sorting the rules within the query is quadratic over the number of query rules.

Lemma 12.

The intermediate edge result caching time complexity has a polynomial cost being linear in both the loaded databases and the number of available objects.

Corollary 2.

The cost for generating the nested morphisms is polynomial with the size of the entire data loaded within the physical model.

Lemma 13.

The rewriting cost is polynomial and linear with the number of rewriting operations and the number of the morphisms.

Corollary 3.

The time complexity of the whole Generalised Graph Grammars is polynomial with the size of the loaded physical model.

9. Empirical Evaluation

For our empirical evaluation, we study the use case of graph grammar in the context of rewriting graphs representing the grammatical structure of a natural-language sentence. Universal dependencies [53] capture these syntactical features across languages by exploiting a shared annotation scheme. In this context, the usual approach to graph rewriting boils down to rewriting a PROLOG-like logical program by applying declarative rewriting rules (https://github.com/opencog/relex accessed on 22 April 2024) via a unification algorithm [54], where compound terms are equivalently expressing binary relationship and properties associated with specific vertices. Given the general interest for such an approach within the domain of Natural Language Processing (NLP), the present paper is going to specifically focus on use case scenarios within such domain. This will also give us the opportunity to re-use freely available datasets for sentences for our experiments (https://osf.io/btjqw/?view_only=f31eda86e7b04ac886734a26cd2ce43d and https://osf.io/rpu37/, accessed on 21 April 2024), which can be then deemed as repeatable.

Notwithstanding the previous approaches, we want to achieve a more data-driven approach to sentence rewriting, where atoms can also be associated with properties and labels, thus motivating the definition of the proposed query language. Furthermore, the extension of the graph grammar language with a script can be used to compose data and define boolean conditions allowing us to break the declarativeness assumption only when the user wants more control over how data are processed. Thus, we postulate that our proposed query language extends the current literature in several aspects way beyond graph database literature while postulating the possibility of applying concurrently multiple rewriting rules over disparate sentences via a sole declarative query. The proposed approach shares some similarities with programming language research where, after representing a program in terms of its operational semantics, we can apply graph rewriting rules over abstract semantic graphs [55], which are usually represented as trees, for which similar considerations like the former can be applied.

We test the implementation of our physical model and associated query language as implemented in our novel object-oriented database, named DatagramDB, which source code associated with the current paper is freely available online (https://github.com/datagram-db/datagram-db/releases/tag/v2.0, accessed on the 22 April 2024). We conducted all our benchmarks on a Dell Precision mobile workstation 5760 running Ubuntu 22.04 LTS. The specifications of the machine include an Intel^® Xeon(R) W-11955M CPU @ 2.60 GHz x16, 64 GB DDR4 3200 MHz RAM.

9.1. Comparing Cypher and Neo4j with Our Proposed Implementation

Given the problems being evidenced by the Cypher query language from Lemma 7, we cannot automate the benchmarking of Cypher for all the possible sentences coming from an online available dataset. By recalling the results of this lemma, we observe that, when the query is not able to capture a pattern albeit optionally, this will have a cascade effect on the entire query for which none of the following rewriting operations will be applied. Given this, the same query might not necessarily provide the same output across different sentences having a different sentence structure. Thus, we cannot use one single query for dealing with unpredictable sentence structure that, in the worst-case scenario, would require us to write one single query per input sentence. We then preferred to limit our comparison to two sentences, for which we designed the minimal Cypher query exactly capturing the syntactical features expressed by these sentences while using the same query for DatagramDB across all the sentences.

We considered two distinct dependency graphs: “Alice and Bob play cricket”, the one in Figure 3a, and the one resulting from the dependency parsing of the “Matt and Tray...” sentence from Figure 6a. We loaded them in both Neo4j v5.20 and our proposed GSM database. In Cypher, we then run the query as formulated in the previous section, while we construct a fully declarative query in our proposed graph query language syntax directly representing an extension of the patterns in Figure 2. From now on, when referring to Neo4j, we will always refer to version 5.20.

Examining Table 1, we can see our solution consistently outperforms the Neo4j solution by two orders of magnitude. Furthermore, the data materialisation phase does not significantly impact the overall running time, as its running times are always negligible compared to the other ones. Additionally, Neo4j does not consider a materialisation phase, as the graph resulting from the graph rewriting pattern is immediately returned and stored as a distinct connected component of the previously loaded graph. This then clearly remarks the benefit of the proposed approach for rewriting complex sentences into a more compact machine representation of the dependency graphs.

Table 1. Table displaying results from rewriting the aforementioned sentences.

9.2. Scalability for the Proposed Implementation

For testing the scalability of our implemented system, we used a corpora of sentences used for natural-language common-sense question answering [56] which we rewrote into dependency graphs using Stanford NLP [57]. As its output is failing systematically at correctly recognising verbs in passive form when at the present time and at recognising negations due to its training-based nature [58], we provide a piece of software amending its output so that all the syntactic and the semantic information retained within the sentence could pertain in a graph structure. This server also transforms the internal Stanford NLP dependency graph into a collection of GSM databases serialised in textual format. The resulting server is available online (https://github.com/datagram-db/stanfordnlp_dg_server accessed on 19 April 2024).

Given the resulting GSM representation of the two sentences, we provide two distinct scalability tests: in the first one, we sample the dataset into sentences to determine the algorithm’s scalability in terms of both the number of GSM databases and the number of vertices, while in the second we will only consider scalability regarding the sole number of GSM databases while maintaining the sample distribution of the sentence’s lengths, thus reflecting the number of GSM objects within the database.

Concerning the first scenario, we choose the sentences containing

| O_{i} | \in {5, 10, 15, 18}

words, and for each of them we choose 300 sentences each, thus obtaining sample sets

S_{300}^{| O_{i} |}

. Then, we further sample

S_{4}^{| O_{i} |}

into three distinct subsets

S_{i}^{| O_{i} |}

having cardinality of

S_{i}^{| O_{i} |} = 75 \cdot i

for which

S_{i}^{| O_{i} |} \subset S_{i + 1}^{| O_{i} |}

for

n > 0

and

1 \leq i < i + n \leq 4

. This will be useful to then plot the rewriting running times using for the x-axis either the number of sentences (or GSM databases) or the sequence length, to better analyze the overall time complexity. The results for these experiments are reported in Figure 9.

Figure 9. Analysing DatagramDB. (a) Results grouped by number of words per sentence. (b) Results grouped by number of databases.

From these plots, we can clearly remark that querying and materialisation times are clearly linear over the size of objects being loaded or GSM databases being matched and rewritten, thus remarking the efficiency for the overall envisioned approach: please observe that we cannot achieve a better time complexity that a linear scan without additional heuristics, as we still have to perform the visit over each GSM database by also performing a linear scan of the database objects in reverse topological (layered) order. We can also motivate these results by having graphs in which the branching factor is relatively small compared to the overall number of available vertices, thus

β ≪ | O_{i} |

for each GSM database

g_{i}

. We also observe that, for these smaller datasets, the resulting materialisation time is almost negligible compared to the query time, which, across the board, dominates the loading and indexing time.

By comparing such running times with the ones from Neo4j, we can easily observe that, while we were able to process 300 GSM databases in 100 milliseconds, Neo4j could rewrite just one single graph in double time. Given this, our approach has a throughput of almost 600 times over Neo4j. This further remarks the impracticality of using the competing solution for analysing more sentences in the future, e.g., while considering sentences crawled from the web.

While it might initially seem that all phases (loading, querying, and materialisation) exhibit linear time complexity, we will try to consider a larger set of data to better outline the time complexity associated with our implementation.

Concerning the second scenario, we decided to sample the internet dataset in a subset of sentences

S_{1}, \dots, S_{4}, S

where

| S_{i} | = 10^{i}

. S represents the entire dataset while ensuring that each dataset of a lesser size is always contained in the larger dataset. Furthermore, we ensure that the number of words, and therefore, of objects on each sampled database reflects a similar frequency distribution of the number of objects per resulting GSM database (Figure 10). By doing so, for our final scalability tests in which we consider more data, we make up for the lack of long sentences with the number of sentences reflected in the number of the GSM databases to be processed.

Figure 10. Sampled probability density function associated with the number of words within the sentences for each subset of traces.

Figure 11 provides the benchmarks for these experiments: a non-linear but polynomial loading time might be possibly related to the data parsing time and the time to store such data in primary memory, while the remaining running times keep a linear time complexity concerning the increase in the number of the GSM databases to be rewritten, similarly to the previous experiments. Querying time always dominates in indexing time by at least one order of magnitude, thus showing that most of the significant computations occur while matching and rewriting. Materialisation times are on the same order of magnitude as indexing time, thus also showing that this cost does not exceed the actual querying phase. Overall, the efficiency of our system is also reflected by a linear querying time for the datasets being considered.

Figure 11. Running time of each algorithm with different sentence samples.

10. Conclusions and Future Works

This paper offers the definition of a matching and rewriting mechanism similar to the one provided by graph grammar implemented over object-oriented databases. As our definition supports both data matches and complex data update operations over the objects of interest, whose features were not considered in previous graph grammar formulations, we name our proposed language Generalised Graph Grammars. Our theoretical results prove the impossibility of expressing the same query with the same degree of generality of Cypher, which requires specifying a different query for any potential graph to be queried, thus posing a major limitation to automating rewriting operations over graphs. Empirical results prove that our query language offers an implementation faster than the ad-hoc query formalised in Cypher and run over Neo4j v5.20, in terms of both running time and throughput expressed in the number of queries runnable per comparable amount of time. These results aim to observe the inadequacy of graph-centric implementations of data representations since a large amount of literature now agrees in stating that more traditional and relational representations often offer better performances with respect to queries both natively supported by graph languages [59] and in representing new operations on graphs that require their rewriting [6,7].

At this stage, we considered nested morphisms of at most depth of 1, thus requiring that each cell of a morphism table should contain at most one non-nested table. Future works will investigate whether there might be any benefit for further generalising this nesting to arbitrary depth, thus requiring further extending the definition of the Nested Natural Equi-Join operator to arbitrary depths.

Notwithstanding the possibility of the current model expressing uncertainty within the data, the rewriting operations always assume to deal with perfect information, thus generating objects or containments containing no uncertainty. Future works will address this gap by further extending the SOS semantics of these rewriting steps while considering provenance information [44], thus relieving the user from explicitly defining the uncertainty of the generated data by adding further declarativeness to the query language here proposed.

Although Lemma 10 showed that the proposed graph query language is also able to remove the unmatched objects and contents, our current algorithm is not tailored for effectively matching this case, as the removing pattern will be forced to return all the objects and containments for then removing them. Future works will extend this by optimising testing conditions for our general language while matching the objects and containments. As a direct consequence of this, our returned morphisms are not pruned after testing the

Θ

condition, which is just evaluated in the rewriting phase due to the fact that any updates to the GSM views will also alter the outcome of

Θ

. Future works will use static analysis approaches to determine when

Θ

can be effectively used for pruning morphisms before being returned in the matching phase, and condition rewriting strategies to push condition evaluations towards the generation of the morphism as discussed in Section 6.3 and Section 6.4.

Although the current language supports the update of objects and containments within GSM objects, the provided query semantics do not consider the possibility that such updates can be still matched by the query, thus triggering additional rewriting operations. Future works will also consider further generalising the database matching and rewriting approach by considering the fix-point over the loaded database until convergence is met, thus meeting the former desiderata. We will also consider extending our containments with property-values associations similarly to the property graph model, and considering updating the objects’ and containments’ costs while performing the rewriting operations.

Last, preliminary experiments [34] conducted on a physical model being the direct mapping of the logical model in memory provide promising results showcasing the possibility of expressing not only JSON files but also representing indexing data structures in GSM that eventually lead to scalable query processing for JSON data. Future works will also compare the efficiency of DatagramDB if compared to other databases supporting object-oriented representations such as MongoDB or MySQL [60].

Author Contributions

Conceptualisation, G.B. and O.R.F.; methodology, G.B.; software, G.B. and O.R.F.; validation, G.B. and O.R.F.; formal analysis, G.B.; investigation, O.R.F.; resources, G.B.; data curation, O.R.F.; writing—original draft preparation, G.B.; writing—review and editing, O.R.F. and G.M.; visualisation, G.B. and O.R.F.; supervision, G.B. and G.M.; project administration, G.B.; funding acquisition, G.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets are available at the following repositories: https://osf.io/btjqw/?view_only=f31eda86e7b04ac886734a26cd2ce43d and https://osf.io/rpu37/. The codebase associated with the implementation of the data model and query language interpretation are available on GitHub: https://github.com/datagram-db/datagram-db/releases/tag/v2.0. All the URLs were accessed on the 21 April 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

BNF	Backus-Naur Form
DAG	Direct Acyclic Graph
GSM	Generalised Semistructured Model
HOF	Higher-Order Function
NLP	Natural Language Processing
SOS	Structured Operational Semantics

Appendix A. Full Definitions

Appendix A.1. Variable Resolution

Before discussing value or boolean expression evaluation, we need to consider their base case first, which is the evaluation of variables occurring in such expressions. We, therefore, discuss these first before introducing the other steps. We consider variables referring to both objects and containments while supporting the update operations only for the latter. We, therefore, resolve the variables while returning the IDs for either matched objects or containments.

We, furthermore, want to achieve declarativeness while considering variables: if associated with NULL values as a result of a missed optimal match, we want to return empty values, while if we set values to them, we want to create implicitly new vertices. We need to then provide a new parameter to explicitly consider the creation of new objects when the expression context allows us to do so over unreferenced and unmatched variables. Furthermore, the single variable might be associated with multiple objects if referring to a morphism attribute within a nested relationship. Since the resolution of variables within our language may require to access previously replaced or substituted values such as from

Δ (g)

, we also have to refer to

Δ (g)

and g to carry out the variable resolution.

Since these operations can also involve updating the environment, we express these operations via SOS rather than using algebraic notation. Figure A1 shows the SOS for the sole variable resolution, returning a tuple containing (in order of appearance) the resolved variables, the possibly updated view due to object insertion, and the potentially nested attribute containing a nested table expressing the variable as an attribute.

If the variable belongs to the morphism’s schema and is associated with a non NULL value within the morphism while being associated with a basic type, we return the value stored in it (ExRes). If the variable refers to a nested attribute, we resolve the variable (

I d x_{Γ} (x)

, Equation (11)) and return all the associated values via

↓ ρ_{Δ (g)}^{f}

(Definition 6, ExResNest). If the variable was declared within the rewriting pattern alongside the creation of a new object, we return the ID associated with the newly created object (NewRes). If the variable is neither newly declared nor in the morphism’s schema, we return no result, as there os no binding and the query language is not expected to return a value (NoRes). Last, if we require the object to be created (new=true) as we set values associated with an object and the morphism did not return an object ID associated with x due to an optional match, we then create a new object in the GSM view

\tilde{o}

associated with a fresh value, while returning the updated view (ForceRes). This behaviour is completely disabled and no view is updated if the original expression does not force the creation of a new object due to an expression evaluation (NoFRes). In all the other circumstances, we resolve no reference IDs (NoFCRes).

Figure A1. Variable resolution with potential view updates (

ϱ

).

Figure A1. Variable resolution with potential view updates (

ϱ

).

Appendix A.2. Predicate Evaluation (Θ)

We now discuss boolean predicate evaluation: given that the variables might refer to nested attributes referring to multiple objects or containment, for which we are interested in comparing the labels, values, or properties associated with them, we might be interested in returning not only one single boolean value but one single boolean value per comparison outcome. Given this, we need a notion of sets explicitly remarking that some elements were not inserted. A maximum cardinality set is a pair of a set S and a natural number n denoted as

⟅ S, n ⟆

such that

| S | \leq n

. This type of set can be used to efficiently represent how many elements satisfy a given condition if the number of elements is previously known. So, if

| S | < n

, we know that not all the n elements of interest satisfy a given condition. We also constrain such sets to contain at most n items. We can easily override traditional set operations over such sets as follows

⟅ S, n ⟆ \cap ⟅ S^{'}, m ⟆ = ⟅ S \cap S^{'}, max (n, m) ⟆

⟅ S, n ⟆ \cup ⟅ S^{'}, m ⟆ = ⟅ S \cup S^{'}, max (n, m) ⟆

⟅ S, n ⟆ ∖ ⟅ S^{'}, m ⟆ = ⟅ S ∖ S^{'}, max (n, m) ⟆

We say a maximum cardinality set

⟅ S, n ⟆

is full if the cardinality of S is equal to n:

F ULL (⟅ S, n ⟆) \Leftrightarrow | S | = n \land n \neq 0

. We say that such maximum cardinality set is empty if S is empty:

⟅ S, n ⟆ ≃ \emptyset \Leftrightarrow S = \emptyset

. The cardinality of the maximum cardinality set is the cardinality of its constituent set:

|⟅ S, n ⟆| = | S |

.

Figure A2 provides the definition providing the considered semantics. For this, we require that two original variables refer to the same cardinality of values, or that at least one of them is associated with one single value (TestComp). For this, we can then return a maximal cardinality set

⟅ S, M ⟆

where S indicates the resulting tuple indices associated with a true value, and M refers to the maximum size of the morphisms.

Figure A2. Predicate Evaluation semantics

{[[Θ]]}_{Γ, g, Δ (g)}^{M T}

.

Figure A2. Predicate Evaluation semantics

{[[Θ]]}_{Γ, g, Δ (g)}^{M T}

.

Furthermore, we might also be interested in determining whether objects being matched in the current morphism also appear (TestMatch) or not (TestUmatch) in another morphism L associated with a variable y. Given that we are interested in the objects actually being matched and not in later changes provided by subsequent transformations, we can simply refer to

I d x_{Γ} (X)

listing the reference IDs associated with an attribute X in

Γ

. This can be easily represented as a maximum cardinality set

⟅ {x \in I d x_{Γ} (X) | x \neq NULL}, | I d x_{Γ} (X) | ⟆

by stripping the NULL values, while returning the intersection (TestMatch) or the difference (TestUmatch) for those IDs.

As the tests in the second last paragraph will return no sets whose indices are released to neither object nor containment ID while the former will return such indices, we need an intermediate predicate where these two set-boolean results are computed, to return for the former a full maximum cardinality set containing all the ID references from the morphism

Γ

if the underlying expression will not return an empty maximum cardinality set (TestFill).

We associate a similar behaviour to the typecasting of a script evaluation to a boolean value, for which we return an empty set if this is false and tall the occurring references within the morphism otherwise (TestScript).

Last, we interpret the conjunction and the disjunction of such multivalued boolean semantics as the intersection (TestConj) or the union (TestDisj) at the set of reference ID satisfying the base case candidates.

Appendix A.3. Expression Evaluation (expr)

Last, we are interested in evaluating expressions exploiting the variables withheld by each morphism. Please also observe that these expressions will have different operational semantics if compared to the ones associated with assignments, as the former will only retrieve values associated with the expressions, and the latter describe how to update the view with the newly assigned value. For this, these expressions will only return the final value associated with them. As the morphisms might be also nested, their final representation will be a one-column table with an arbitrary attribute, for which time is one of the basic types.

Figure A3 shows the SOS associated with the expression evaluation. From this, we provide

η

as a shorthand for the above relationship:

η (x, δ, Γ, N T) : = T s . t . ⟨ x, δ, false, Γ, M T ⟩ \to_{E} ⟨ T, I ⟩

In other words,

η

computes the first projection of

\to_{E}

, being the evaluation of an expression x.

Figure A3. Expression evaluation (E) with no view update, where

ifte (x, y, z)

is an expression returning y if x holds and z otherwise.

Figure A3. Expression evaluation (E) with no view update, where

ifte (x, y, z)

is an expression returning y if x holds and z otherwise.

For the evaluation of the attribute associated with a containment ID (LabelE), we directly apply

λ

to all the non-NULL matches, as containment IDs are never updated in this version of the language. For extracting values (XiE) or labels (EllE) associated with the object x at the numerical index idx not associated with an evaluated expression, we resort instead to the Property Resolution function also encompassing the changes in the GSM view (Definition 6). The interpretation of

ϕ (x, str)

considers both x and str as expressions, where only the former is forced to be interpreted as a variable if is a string: if x and str are associated with a tuple of values V and P of the same cardinality, we then return

ϕ (y, z)

for

(y, z) \in ζ (V, P)

(ContZipE), and if otherwise

| P | = 1

we return

ϕ (y, P (0))

for

y \in V

(ContE).

Last, for conditional expressions, we first evaluate a condition

Θ

which, as per the forthcoming discussion, will return a set of object or containment IDs for which the entire expression holds. If such a set is full, we return the evaluation of the expression associated with the “then” branch (AllTrueE), if it is empty, we return the evaluation of the expression associated with the “else” branch (AllFalseE), and otherwise, we first interpret

Θ

over the IDs referenced by a variable x, and then return the i-th value from the left expression if the MES associated with

Θ

contains the i-th object reference after x, and the i-th value from the else branch otherwise.

If we need to evaluate the string as a variable as it appears as the leftmost argument of a label,

ξ

, ℓ, or

ϕ

, then we use variable resolution

ρ

(Appendix A.1 on page 42) to evaluate the values associated with the variable (VarE), and we consider this as a simple string otherwise (StrE). Please observe that this invented value is then associated with an unexisting nested morphism

- 1

.

Appendix A.4. Full SOS Rewriting Specifications

This section describes through Figure A4 and Figure A5 the remaining set-update operations that were not reported in Figure 8 for conciseness. These refer to

ξ

updates (Figure A5), similar to the ones for ℓ, and

ϕ

(Figure A5), similar to the ones for

π

.

Figure A4. Remaining setting functions for

ξ

-s being the extension of Figure 8.

Figure A4. Remaining setting functions for

ξ

-s being the extension of Figure 8.

Figure A5. Remaining setting functions for

ϕ

-s being the extension of Figure 8.

Figure A5. Remaining setting functions for

ϕ

-s being the extension of Figure 8.

Appendix A.5. Converting GSM from Any Data Representation

Listing A1.

Python code showing the conversion of an arbitrary Python object representation of some data to a GSM representation. In particular, we showcase the conversion of (possibly nested) Pandas dataframes, XML data, NetworkX graphs, and XML trees. As we also showcase the representation of arbitrary Python objects, thus including linear collections and dictionaries, we also support JSON conversion.

Appendix B. Proofs

Appendix B.1. Transformation Isomorphism

Proof (for Lemma 1).

Proving this reduces to prove that

{⟨ g_{i} ⟩}_{i \leq n} = S ERIALISATION (L OADING ({⟨ g_{i} ⟩}_{i \leq n}))

and

d b = L OADING (S ERIALISATION (d b))

. We can prove this by extending the definition of the transformations being given in Algorithm 2.

${⟨ {\tilde{g}}_{i} ⟩}_{i \leq n} = S ERIALISATION (L OADING ({⟨ g_{i} ⟩}_{i \leq n}))$ : We can consider one GSM database ${\tilde{g}}_{i}$ being returned at a time, for which we consider its constituents $⟨ {\tilde{O}}_{i}, {\tilde{ℓ}}_{i}, {\tilde{ξ}}_{i}, {\tilde{ϵ}}_{i}, {\tilde{π}}_{i}, {\tilde{ϕ}}_{i}, {\tilde{t}}_{i, ϕ} ⟩$ . To do this, we want to show that such a returned database is equivalent to the originally loaded $g_{i}$ . To conduct this, we need to show that each of the constituents is equivalent.
We can prove that ${\tilde{O}}_{i} = O_{i}$ as follows:

$\begin{matrix} j \in {\tilde{O}}_{i} \Leftrightarrow \exists l, p, x . ⟨ l, i, j, p, x ⟩ \in ActivityTable \Leftrightarrow j \in O_{i} \end{matrix}$

We can prove that ${\tilde{ℓ}}_{i} \dot{=} ℓ_{i}$ , ${\tilde{ξ}}_{i} \dot{=} ξ_{i}$ , and ${\tilde{ϵ}}_{i} \dot{=} ϵ_{i}$ as follows:

${\tilde{ℓ}}_{i} (j) = L_{i} (j) = ℓ_{i} (j) {\tilde{ξ}}_{i} (j) = X_{i} (j) = ξ_{i} (j) {\tilde{ϵ}}_{i} (j) = C_{i} (j) = ϵ_{i} (j)$

We can also prove that the properties are preserved:

$\begin{matrix} {\tilde{π}}_{i} (j, κ) = v & \Leftrightarrow ⟨ ℓ_{i} (j) [0], v, r ⟩ \in Attribute Table \land \exists p, x . Attribute Table [r] = ⟨ ℓ_{i} (j) [0], i, j, p, x ⟩ \\ \Leftrightarrow ⟨ ℓ_{i} (j) [0], v, r ⟩ \in Attribute Table \land Attribute Table [r] (2) = j \land j \in O_{i} \\ \Leftrightarrow π (j, κ) = v \land j \in {\tilde{O}}_{i} \end{matrix}$

For ${\tilde{ϕ}}_{i} (j, κ)$ , we can easily show that this is associated with no value only if there are no records referring to a containment for j in the loaded database, which we can easily show that this derives from an originally empty contained from the loaded database $g_{i}$ . On the other hand, this function returns a set S of identifiers only if there exists at least one containment record in PhiTable^κ, for which we can derive the following:

$\begin{matrix} {\tilde{ϕ}}_{i} (j, κ) = S & \Leftrightarrow S = \{ι | \exists l, w, d . ⟨ l, i, j, w, d, ι ⟩ \in {PhiTable}^{κ}\} \\ \Leftrightarrow S = \{ι | ι \in ϕ_{i} (j, κ)\} \\ \Leftrightarrow S = ϕ_{i} (j, κ) \end{matrix}$

Given that ${\tilde{ϕ}}_{i} (j, κ)$ and $S = ϕ_{i} (j, κ)$ , this sub-goal is closed by transitivity closure over the equality predicate.
We can prove similarly the equivalence ${\tilde{t}}_{i, ϕ} \dot{=} t_{i, ϕ}$ as a corollary of the previous sub-cases:

$\begin{matrix} {\tilde{t}}_{i, ϕ} (ι) = ⟨ d, w ⟩ & \Leftrightarrow \exists l, j . ⟨ l, i, j, w, d, ι ⟩ \in {PhiTable}^{κ} \\ \Leftrightarrow ι \in ϕ_{i} (j, κ) \land t_{i, ϕ} (ι) = ⟨ d, w ⟩ \\ \Leftrightarrow ι \in {\tilde{ϕ}}_{i} (j, κ) \land t_{i, ϕ} (ι) = ⟨ d, w ⟩ \\ \Leftrightarrow t_{i, ϕ} (ι) = ⟨ d, w ⟩ \end{matrix}$
$\bar{d b} = L OADING (S ERIALISATION (d b))$ : In this case, we need to show that each table in $\bar{d b}$ is equivalent to those in $d b$ . We can prove similarly to the previous step that $\bar{L} = L$ , $\bar{X} = X$ , and $\bar{C} = C$ .
Next, we need to prove that the ActivityTables being returned are equivalent. We can achieve this by guaranteeing that both tables should contain the same records. After remembering that the values of p and x are determined through the indexing phase, from which we determine the offset where the objects with immediately preceding and following IDs are stored, we can then guarantee that these values will be always the same under the condition that the two tables will always contain the same records, for which these values will be the same. After remembering from the previous proof that ${\tilde{l}}_{i} (j) \in) \equiv L_{i} (j) \in d b$ , for ${\tilde{l}}_{i} \in S ERIALISATION (d b)$ and $L \in d b$ , we can also have that ${\tilde{l}}_{i} (j) (0) \equiv L_{i} (j) (0)$ :

$\begin{matrix} ⟨ l, i, j, p, x ⟩ \in \bar{Activity Table} & \Leftrightarrow {\tilde{g}}_{i} \in d b \land {\tilde{l}}_{i} (j) [0] = l \land j \in {\tilde{O}}_{i} \\ \Leftrightarrow {\tilde{g}}_{i} \in d b \land {\tilde{l}}_{i} (j) [0] = l \land \exists l^{'}, p^{'}, x^{'} . ⟨ l^{'}, i, j, p^{'}, x^{'} ⟩ \in Activity Table \\ \land L_{i} (j) [0] = l^{'} \\ \Leftrightarrow \exists p^{'}, x^{'} . ⟨ l, i, j, p^{'}, x^{'} ⟩ \in ActivityTable \end{matrix}$

Given that the first three components of the record always correspond, this entails that we will preserve the number of records across the board, hence preserving the same record ordering, thus also guaranteeing that $x = x^{'}$ and $x = x^{'}$ .
By adopting the equivalence of the previous and next offset for the AttributeTable from the previous sub-proof, we can then also prove that each record of the $\bar{AttributeTable}$ always corresponds to an equivalent one in the AttributeTable.

$\begin{matrix} ⟨ l, v, r ⟩ \in {\bar{AttributeTable}}^{κ} & \Leftrightarrow \bar{ActivityTable} (r) = ⟨ l, i, j, p, x ⟩ \land {\tilde{π}}_{i} (j, κ) = v \\ \exists l^{'}, r^{'} . ⟨ l^{'}, v, r^{'} ⟩ \in {AttributeTable}^{κ} \land ActivityTable [r^{'}] = ⟨ l^{'}, i, j, p, x ⟩ \end{matrix}$

Given that we can prove that $\tilde{l} = l$ also by the previous sub-case and $\tilde{r} = r$ as the records’ index will always correspond after sorting and indexing, we can close this sub-goal.
Last, we need to prove that the ${PhiTable}^{κ}$ tables are equivalent, which can be closed as follows:

$\begin{matrix} ⟨ l, i, j, w, d, ι ⟩ \in {\bar{PhiTable}}^{κ} & \Leftrightarrow ℓ_{i} (j) (0) = l \land ι \in ϕ_{i} (j, κ) \land t_{i, ϕ} (ι) = ⟨ d, w ⟩ \\ \Leftrightarrow ⟨ l, i, j, w, d, ι ⟩ \in {PhiTable}^{κ} \end{matrix}$

□

Appendix B.2. Nested Equi-Join Properties

Proof (for Lemma 2).

If

L \neq \emptyset

with

\vec{L} = ⟨ A_{1}, A_{2}, \dots, A_{n} ⟩

, then we can rewrite the definition of

{}^{\vec{L}}⋈

as follows:

t {}^{\vec{L}}⋈ s = \{\tilde{t} |_{S ∖ A_{1}} \oplus (\tilde{t} (A_{1}) {}^{⟨ A_{2}, \dots, A_{n} ⟩}⋈ s) | \tilde{t} \in t\}

If

A_{1} \notin dom (\tilde{t})

, then

\tilde{t} (A_{1}) = \emptyset

by assuming to represent a NULL value as an empty table by default. While doing so and by

\tilde{t} |_{S ∖ {A_{1}}} = \tilde{t}

by former assumption, the former result can be rewritten as:

= \{\tilde{t} |_{S ∖ {A_{1}}} | \tilde{t} \in t\} = t

□

Proof (for Lemma 3).

Given the hypothesis and with reference to Algorithm 3, we have

I R = \emptyset

, which then yields

L \times R

(Line 4). □

Proof (for Lemma 4).

Given the hypothesis and with reference to Algorithm 3, this satisfies the condition at Line 5, for which we can then immediately close our goal. □

Proof (for Lemma 5).

As

dom (S) \cap dom (U) = {N}

, this violates the condition for the rewriting Lemma 3, which is not then rewritten accordingly. Furthermore, the condition

dom (S (N)) \cap dom (U) \neq \emptyset

violates the condition for the rewriting Lemma 4, which is then not applied. Given

I R

as defined in Line 3 of Algorithm 3, we show that this algorithm computes the following:

⟨\tilde{t} |_{I R} \oplus (\tilde{t} (N) \oplus \tilde{s} |_{U ∖ I R}) ∣ \tilde{t} \in L, \tilde{s} \in R, \tilde{t} {\dot{=}}_{I R} \tilde{s}⟩

This equates to equi-joining the tables over the

I R

records where

N \notin I R

by construction, and by nesting all the records from the right table sharing the same

I R

values with the records from the left table. Last, we can easily observe that this cannot be easily expressed in terms of

L {}^{⟨ N ⟩}⋈ R

, as the rewriting of the former expression in the following way, for which it is evident that the former operator did not consider the recursive natural joining of the records by considering the commonly-shared attributes during its descent:

\begin{matrix} L {}^{⟨ N ⟩}⋈ R & = ⟨ \tilde{t} |_{S ∖ {N}}, \oplus (\tilde{t} (N) {}^{⟨ ⟩}⋈ s) ∣ \tilde{t} \in t ⟩ \\ = ⟨ \tilde{t} |_{I R} \oplus (\tilde{t} (N) \oplus \tilde{s}) ∣ \tilde{t} \in t, \tilde{s} \in R ⟩ \end{matrix}

This can be easily observed by the lack of

{\dot{=}}_{I R}

component that captures the essence of such a natural join condition. □

Appendix B.3. Language Properties

Proof (for Lemma 6).

As by the conditions stated in the current Lemma we either generate an empty morphism table

M T [,]

(no rules, no rewritings, no matches) or all the generated morphisms are ignored as all the

Θ

values are falsified (Line 8 from Algorithm 8), then we will always have an empty view

Δ (g_{i})

for all the GSM databases

g_{i}

loaded in the physical model. Considering this, the proof is an application of Equation (10), by which the materialisation of a GSM database

g_{i}

with a view

Δ (g_{i})

containing no changes simply returns

g_{i}

. As

g_{i}

is stored in a physical model, this result is also validated via Lemma 1. □

Proof (for Property 1).

This condition is generated by Line 4 from Algorithm 8, as we use the reverse topological order index

O_{rtopo} (g_{i})

associated with each GSM DB

g_{i}

to visit the entry-point objects or, when nested, their containing object associated with the reported morphism in

M T [L_{i}, G_{i}]

. □

Proof (for Lemma 7).

As per previous discussions, the proof for this lemma will be given intuitively by analysing the Cypher representation of the graph grammar represented visually in Figure 2 and previously represented by our proposed Generalised Graph Grammar language in Listing 1. We then provide the query required for rewriting the sentence expressed in Figure 6a:

The current Neo4j v5.20 implementation does not support the theorised graph incremental views for Cypher [32]. As we require to update the graph while querying, it is not possible to entirely create a new graph without restructuring or expanding a previously loaded one as per graph joins [7] or nesting [6] by simply returning some newly-created relationships or vertices; returning a new graph and rewriting a previous match will come at the cost of either restructuring the previously loaded graph, thus requiring additional overhead costs for re-indexing and updating the database while querying, or by creating a new distinct connected component within the loaded graph ( Mathematics 12 02677 i013

statements from Listing A2). As it is impossible to refer by the vertices and edges through their ID, thus exploiting graph provenance techniques for mapping the newly created vertices to the ones from the previously loaded graph [45], we are forced to join the loaded vertices (e.g., Lines 35–37, 50–52, and 67) with the newly created ones (e.g., Lines 38, 53, and 68, respectively) by their property values (e.g., Lines 39, 54, and 70, respectively). Our proposed approach avoids such cost via the aforementioned objects’ and containments’ ID-based morphism representation while keeping track of the restructuring operations (property Update, insertion NewObj, deletion DelObj and DelCont, and substitution ReplObj) over a graph g within an incremental view

Δ (g)

(Section 4.3).

Listing A2.

Cypher representation for Figure 2 to rewrite the sentence from Figure 6a.

Cypher does not ensure to apply the graph rewriting rules as intended in our scenarios: let us consider the dependency graph generated from the recursive sentence “Matt and Tray believe that either Alice and Bob and Carl play cricket or Carl and Dan will not have a way to amuse themselves” and let us try to express patterns in Figure 2b,c as two distinct Mathematics 12 02677 i014

-es with their respective update operations as per the following Listing:

Instead of generating one single connected component representing the result, we will generate as many distinct connected components as subgraphs being identified as matching the patterns, while this does not occur with a simple sentence structure (Figure 3a) where we achieve the correct result as in Figure 5. We must Mathematics 12 02677 i014

elements of the graph multiple times, constantly rejoining on data previously Mathematics 12 02677 i014

-ed in earlier stages of the query for establishing relationships over previously grouped vertices (Lines 108 and 118 from Listing A2). This then postulates the inability of such language to automatically apply an order of visit for restructuring the loaded graph (e.g., we need to tell the query language to first group-by the vertices—Lines 2–15—and then establish the orig relationships—Lines 18–24) while not expressing an automated way to merge each distinct transformed graph into one cohesive, connected component. This then forces the expression of a generic graph matching and rewriting mechanism to be dependent on the specific recursive structure of the data. Thus, requiring the creation of a broader query, where we need to explicitly instruct the query language on the correct way to visit the data while instructing how to reconcile each generated subgraph from each morphism within one final graph.

During the delineation of the final Cypher query succeeding in obtaining the correct rewritten graph (also Listing A2), we also highlighted the impossibility of Cypher propagating the temporary result generated by a rewriting rule and propagating it to another rule to be applied upstream: this requires carrying out intermediate sub-queries establishing connections across patterns sharing intermediate vertices, and the re-computation of the same intermediate solutions, such as vertex grouping (cfr. Line 4 and Line 116). Since Cypher also does not support the explicit grouping of vertices based on a pattern as in [46], this required us to identify the vertices satisfying each specific pattern, label them appropriately in a unique way (e.g. Line 9), and then compare the result obtained (e.g., Line 20 for generating orig relationships). This limitation can be overcome by providing two innovations: first, using nested relational tables for representing morphisms, where each nest will contain the sub-pattern of interest possibly to be grouped. Second, we track any vertex substitution for entry-point vertex matches via incremental views. This substitution can be easily propagated at any level by considering the transitive closure of the substitution function (Definition 5), while the order of visit in the graph guarantees the correctness of the application of such substitution (Algorithm 8).

Listing A2, constructed for the specific matches referring to the sentence “Matt and Tray...”, will not fully execute on a different sentence without the given dependencies, as no match is found, and therefore, no rewriting can occur. Current graph query languages are meant to return a subgraph from the given patterns. In Cypher, you must abide by what is contained within the data, if the data are not there we need to remove the match from the query, which we cannot forecast in advance. This results in constant analysis of the data. For us, the intention is to have graph grammar rewriting rules whereby if a match is not made, no rewriting occurs.

By leveraging such limitations of Cypher while juxtaposing the desired behaviour of the language, we derive a declarative graph query language where patterns can be expressed similarly to Figure 5. □

Proof (for Lemma 8).

This is a direct application of the SOS rules from Figure 8: any removed vertex will not be replaced by a newly inserted vertex within the matched entry-point ego-net if not explicitly updating the containment to also add the newly created object. If an entry-point was removed, the only way to preserve the connectivity of the GSM objects is to exploit the replacement, through which we will explicitly state that, for any explicitly declared container within the matched pattern, we will insert the created object or any other previously matched object of choice within the container’s containment relationships also containing the object. □

Proof (for Lemma 9).

This requires discussing the SOS rules from Figure 8, from which we set or update values, labels, containments, and properties associated with objects. Concerning label updates, such updates occur over variable x, in which variable resolution

ρ

is always in the form

⟨ x, Δ (g), true ⟩

: if the variable does not appear in the morphism, we expand the first two cases from Figure A1. We need to exclude the case it was declared with a new statement from Figure 8, as we will have otherwise x in

Γ^{ν}

from

Δ (g)

. As we have the parameter true, this also excludes the rule NoFRes (Figure A1). We can then see that we do not create an optimally matched containment, as expected by the intended semantics. Thus, we restrict our case to ForceRes (Figure A1), on which we see that a new object is created, thus updating

Δ (g)

, and that we know the ID of this object as it will be naturally the next incrementally number being available. Then, the label update will always occur, which will then preserve the update in

Δ (g)

. These choices are reflected in the materialisation phase by extending each database g and prioritising the changes described in the view

Δ (g)

. □

Proof.

Given the possibility of visiting several patterns

L_{1}, \dots, L_{n}

, we can express the matching of those in our proposed query language as rules

p_{i} = L_{i} \to \emptyset

for

1 \leq i \leq n

, where both objects and containments must be explicitly referenced with a variable. Still, this formulation will not explicitly remove any object or containment not being visited. Enabling this requires the extension of the former query with two additional rules, one for removing all the vertices not visited in the different pattern (om), and the other for explicitly removing unmatched containments (cm). Given the variables

A^{o}, \dots, W^{o}

referring to matched objects and

A^{c}, \dots, W^{c}

to matched containments in

L_{1}, \dots, L_{n}

, we can then express om and cm as the following rules being defined immediately after

p_{n}

:

As we rewrite the same matching object, no replacement will occur and given that the matching (Z) and (X)–[Y:]->(Z) will return all the objects and containments across the databases, we have to further test those to delete only the ones being not matched in

L_{1}, \dots, L_{n}

. □

Proof (for Corollary 1).

This follows from our previous proof, for which we clearly showed that our proposed language can match and rewrite graphs declaratively while considering optional rewrites. Cypher has some limitations in this regard, as it forces the user to specify the order in which the matching and rewriting rules should be applied. Furthermore, our language can return the matched morphisms similarly to SPARQL while allowing the generation of multiple morphism tables rather than just one, and selecting just the objects and containments being matched while removing the remaining ones similarly to Cypher. Therefore, the proposed language over GSM generalizes over current graph query languages over a novel generalised semistructured model enabling this. □

Appendix B.4. Time Complexity

Proof (for Lemma 11).

Regarding Algorithm 5, as we defined a graph connecting each rule appearing in the query, which will be then represented as a vertex, in the worst-case scenario, we will have a fully connected graph with

E = V \times V

. Thus, the cost of creating this graph is quadratic, as

O (| V | + | V |^{2})

is in

O (| V |^{2})

. Given that the approximate topological sort uses a DFA visit of the resulting graph and that the layering is linear over the size of the vertices, and given that

| V | = | g g |

by construction we, therefore, obtain an overall global quadratic time complexity over the size of the query when expressed via the number of available rules. □

Proof (for Lemma 12).

Suppose that each rule has at most c containment relationships, which will be provided by disjunction reference to all the k containment labels recorded in the physical model. Thus, the caching phase will take at most

c k | g g | + k

time, as we might still consider all of these labels if we have containments for which the containment relationship is not specified.

Thus, the caching mechanism will guarantee sole access to each

{PhiTable}^{k}

once. By estimating an average branching factor

β

across the loaded GSM in the physical model and by assuming that, in the worst case scenario, all the objects will contain containments for all the k labels, then the cost of caching the tables to make them ready to the morphism representation takes

k | d b | \bar{O} β

time, where

\bar{O}

is the average number of objects stored across GSM databases.

Now, we consider the cost associated with a table of size

| d b | \bar{O} β

. We can freely assume that rewriting operations are in

O (1)

, as in our implementation, morphisms are striped by schema information and are merely represented as tuples while associating the schema to the sole table. Similarly, the projection costs are linear over the size of the table, while the nesting operation can be performed in linear time while reducing the size of the table to

| d b | \bar{O}

due to the ego-net assumptions enforced within the structure of the matching pattern. Overall, this comes at

O (| d b | \bar{O} β)

time.

In the worst-case scenario, the association of containment to a table will take cost

k | d b | \bar{O} β

, thus totalling an overall cost in

O (c k | d b | \bar{O} β)

, which also dominates the time complexity of the other phases. □

Proof (for Corollary 2).

This proof refers to the time complexity of Algorithm 7, which can be seen as a corollary of the previous lemma, for which we already derived the estimated table size

m = | d b | \bar{O} β

for each required table composing the final morphism. Let us denote r (and o) as the maximum number of required (and optional) matches appearing across all

L_{i}

from

g g

-s. As the nested relational algebra can be always expressed in terms of “flat” relational algebra, we can upper-bound the time complexity of Algorithm 3 by

O (m^{2})

. This gives us a worst-case scenario time complexity of

O (m^{r + o})

for computing

P_{i}

for each rule

p_{i} \in g g

, which is a linear time complexity over the size of the worst-case scenario yielded table.

The nesting operator

ν_{B \to A} (t)

from Equation (5) being optionally used to nest morphisms if entry-point vertices are grouped by a direct ancestor can be easily implemented with a linear time complexity over the size of the table that we want to nest: this boils down to computing the equivalence class of

{\dot{=}}_{R}

over the fields

R = dom (S (t)) ∖ {A, B}

and holding such information as a map from the values

\tilde{t} (R)

for each

\tilde{t} \in t

to the collection of rows

\bar{t} |_{B}

for which

\tilde{t} {|_{R} = \bar{t} |}_{R}

. Thus, Line 16 comes with a linear cost over the size of the table

P_{i}

.

Given that the time complexity of computing the symmetric closure of a relationship is trivially linear while the time complexity of computing the transitive closure for a relationship is upper-bounded by the Floyd–Warshall algorithm with a cubic time, this leads to a worst-case time complexity of

O (| d b | {\bar{O}}^{3})

time for computing each

ℜ_{i}

(Lines 17 and 18).

We can freely assume that the insertion cost of each morphism within the

M T [\cdot, \cdot]

table comes at a linear cost, while the sorting of each

M T [L_{i}, g_{j}]

comes with a cost of

O ((r + o) m^{r + o} log (m))

with

m = | d b | \bar{O} β

. This phase clearly dominates over all the previous ones, and thus we can freely assume that the time complexity of computing each morphism is in

O ((r + o) m^{r + o} log (m))

. This leads to an overall time complexity of

O (| g g | (r + o) m^{r + o} log (m))

for generating all the morphisms, which can be still upper-bounded by a polynomial time complexity. □

Proof (for Lemma 13).

In Section 4.3, we observed that all the functions that were there introduced can be computed in

O (1)

time via the GSM view update; these operations also occur with the definition of

\to_{ν}

at the basis of the graph rewriting operations outlined in Section 6.5 and Figure 8: the worst case scenario for the evaluation of such expressions refers to the evaluation of variables associated with nested relationships, thus referring to at most

β

object per morphism

Γ

. Given that each rewriting operation considers one single morphism at a time and that, within the worst-case scenario, we consider the cross-product of the objects associated with both variables, the computation of each operation shall take at most

O (β^{2})

time.

Given

| g g | m^{r + o}

the number of possible nested morphisms as determined from the previous corollary, this overall leads to an overall time complexity of

O (| g g | m^{r + o} β^{2})

for overall computing Algorithm 8. □

Proof (for Corollary 3).

This is a corollary for all the previous lemmas, as the composition of polynomial-time algorithms leads to an overall algorithm of polynomial-time complexity. □

References

Schmitt, I. QQL: A DB&IR Query Language. VLDB J. 2008, 17, 39–56. [Google Scholar] [CrossRef]
Chamberlin, D.D.; Boyce, R.F. SEQUEL: A Structured English Query Language. In Proceedings of the 1974 ACM-SIGMOD Workshop on Data Description, Access and Control, Ann Arbor, MI, USA, 1–3 May 1974; Altshuler, G., Rustin, R., Plagman, B.D., Eds.; ACM: New York, NY, USA, 1974; Volume 2, pp. 249–264. [Google Scholar] [CrossRef]
Rodriguez, M.A. The Gremlin graph traversal machine and language (invited talk). In Proceedings of the 15th Symposium on Database Programming Languages, New York, NY, USA, 27 October 2015; pp. 1–10. [Google Scholar] [CrossRef]
Robinson, I.; Webber, J.; Eifrem, E. Graph Databases; O’Reilly Media, Inc.: Sebastopol, CA, USA, 2013. [Google Scholar]
Angles, R.; Gutierrez, C. The Expressive Power of SPARQL; Springer: Berlin/Heidelberg, Germany, 2008; pp. 114–129. [Google Scholar]
Bergami, G.; Petermann, A.; Montesi, D. THoSP: An algorithm for nesting property graphs. In Proceedings of the 1st ACM SIGMOD Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), New York, NY, USA, 10–15 June 2018. GRADES-NDA ’18. [Google Scholar] [CrossRef]
Bergami, G. On Efficiently Equi-Joining Graphs. In Proceedings of the 25th International Database Engineering & Applications Symposium, New York, NY, USA, 14–16 July 2021; IDEAS ’21. pp. 222–231. [Google Scholar] [CrossRef]
Ehrig, H.; Habel, A.; Kreowski, H.J. Introduction to graph grammars with applications to semantic networks. Comput. Math. Appl. 1992, 23, 557–572. [Google Scholar] [CrossRef][Green Version]
Bergami, G. A New Nested Graph Model for Data Integration. Ph.D. Thesis, University of Bologna, Bologna, Italy, 2018. [Google Scholar]
Das, S.; Srinivasan, J.; Perry, M.; Chong, E.I.; Banerjee, J. A Tale of Two Graphs: Property Graphs as RDF in Oracle. In Proceedings of the 17th International Conference on Extending Database Technology, EDBT 2014, Athens, Greece, 24–28 March 2014; pp. 762–773. [Google Scholar] [CrossRef]
Bergami, G.; Appleby, S.; Morgan, G. Quickening Data-Aware Conformance Checking through Temporal Algebras. Information 2023, 14, 173. [Google Scholar] [CrossRef]
Turi, D.; Plotkin, G.D. Towards a Mathematical Operational Semantics. In Proceedings of the 12th Annual IEEE Symposium on Logic in Computer Science, Warsaw, Poland, 29 June–2 July 1997; IEEE Computer Society. pp. 280–291. [Google Scholar] [CrossRef]
Codd, E.F. A relational model of data for large shared data banks. Commun. ACM 1970, 13, 377–387. [Google Scholar] [CrossRef]
Kahn, A.B. Topological sorting of large networks. Commun. ACM 1962, 5, 558–562. [Google Scholar] [CrossRef]
Sugiyama, K.; Tagawa, S.; Toda, M. Methods for Visual Understanding of Hierarchical System Structures. IEEE Trans. Syst. Man Cybern. 1981, 11, 109–125. [Google Scholar] [CrossRef]
Hölsch, J.; Grossniklaus, M. An Algebra and Equivalences to Transform Graph Patterns in Neo4j. In Proceedings of the Workshops of the EDBT/ICDT 2016 Joint Conference, EDBT/ICDT Workshops 2016, Bordeaux, France, 15 March 2016; Volume 1558. [Google Scholar]
Gutierrez, C.; Hurtado, C.A.; Mendelzon, A.O.; Pérez, J. Foundations of Semantic Web databases. J. Comput. Syst. Sci. 2011, 77, 520–541. [Google Scholar] [CrossRef]
Fionda, V.; Pirrò, G.; Gutierrez, C. NautiLOD: A Formal Language for the Web of Data Graph. ACM Trans. Web TWEB 2015, 9, 1–43. [Google Scholar] [CrossRef]
Hartig, O.; Pérez, J. Chapter LDQL: A Query Language for the Web of Linked Data. In Proceedings of the Semantic Web-ISWC 2015 14th International Semantic Web Conference, Bethlehem, PA, USA, 11–15 October 2015; Proceedings, Part I. Springer International Publishing: Cham, Switzerland, 2015; pp. 73–91. [Google Scholar] [CrossRef]
Carroll, J.J.; Dickinson, I.; Dollin, C.; Reynolds, D.; Seaborne, A.; Wilkinson, K. Jena: Implementing the Semantic Web Recommendations. In Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, New York, NY, USA, 17–20 May 2004; WWW Alt. ’04. pp. 74–83. [Google Scholar] [CrossRef]
Sirin, E.; Parsia, B.; Grau, B.C.; Kalyanpur, A.; Katz, Y. Pellet: A practical OWL-DL reasoner. Web Semant. Sci. Serv. Agents World Wide Web 2007, 5, 51–53. [Google Scholar] [CrossRef]
Angles, R.; Arenas, M.; Barceló, P.; Hogan, A.; Reutter, J.L.; Vrgoc, D. Foundations of Modern Query Languages for Graph Databases. ACM Comput. Surv. 2017, 50, 1–40. [Google Scholar] [CrossRef]
Fan, W.; Li, J.; Ma, S.; Tang, N.; Wu, Y. Adding regular expressions to graph reachability and pattern queries. Front. Comput. Sci. 2012, 6, 313–338. [Google Scholar] [CrossRef]
Barceló, P.; Fontaine, G.; Lin, A.W. Expressive Path Queries on Graphs with Data. In Logic for Programming, Artificial Intelligence, and Reasoning, Proceedings of the 19th International Conference, LPAR-19, Stellenbosch, South Africa, 14–19 December 2013; McMillan, K., Middeldorp, A., Voronkov, A., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 71–85. [Google Scholar] [CrossRef]
Junghanns, M.; Kießling, M.; Averbuch, A.; Petermann, A.; Rahm, E. Cypher-based Graph Pattern Matching in Gradoop. In Proceedings of the Fifth International Workshop on Graph Data-management Experiences & Systems, GRADES@SIGMOD/PODS 2017, Chicago, IL, USA, 14–19 May 2017; pp. 1–8. [Google Scholar] [CrossRef]
Ghrab, A.; Romero, O.; Skhiri, S.; Vaisman, A.A.; Zimányi, E. GRAD: On Graph Database Modeling. arXiv 2016, arXiv:1602.00503. [Google Scholar]
Ghrab, A.; Romero, O.; Skhiri, S.; Vaisman, A.; Zimányi, E. Advances in Databases and Information Systems. In Proceedings of the 19th East European Conference, ADBIS 2015, Poitiers, France, 8–11 September 2015; Chapter A Framework for Building OLAP Cubes on Graphs. Springer International Publishing: Cham, Switzerland, 2015; pp. 92–105. [Google Scholar] [CrossRef]
Junghanns, M.; Petermann, A.; Teichmann, N.; Gomez, K.; Rahm, E. Analyzing Extended Property Graphs with Apache Flink. In Proceedings of the SIGMOD Workshop on Network Data Analytics (NDA), San Francisco, CA, USA, 1 July 2016. [Google Scholar]
Wolter, U.; Truong, T.T. Graph Algebras and Derived Graph Operations. Logics 2023, 1, 10. [Google Scholar] [CrossRef]
Rozenberg, G. (Ed.) Handbook of Graph Grammars and Computing by Graph Transformation: Volume I. Foundations; WSP: Anchorage, AK, USA, 1997. [Google Scholar]
Pérez, J.; Arenas, M.; Gutierrez, C. Semantics and Complexity of SPARQL. ACM Trans. Database Syst. (TODS) 2009, 34, 16:1–16:45. [Google Scholar] [CrossRef]
Szárnyas, G. Incremental View Maintenance for Property Graph Queries. In Proceedings of the 2018 International Conference on Management of Data, SIGMOD, Houston, TX, USA, 10–15 June 2018; ACM: New York, NY, USA, 2018; pp. 1843–1845. [Google Scholar]
Consens, M.P.; Mendelzon, A.O. GraphLog: A Visual Formalism for Real Life Recursion. In Proceedings of the Ninth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, PODS, Nashville, TN, USA, 2–4 April 1990; ACM: New York, NY, USA, 1990; pp. 404–416. [Google Scholar]
Bergami, G.; Zegadło, W. Towards a Generalised Semistructured Data Model and Query Language. SIGWEB Newsl. 2023, 2023, 1–22. [Google Scholar] [CrossRef]
Shmedding, F. Incremental SPARQL Evaluation for Query Answering on Linked Data. In Proceedings of the Second International Workshop on Consuming Linked Data, Bonn, Germany, 23 October 2011. COLD2011. [Google Scholar]
Huang, J.; Abadi, D.J.; Ren, K. Scalable SPARQL Querying of Large RDF Graphs. Proc. VLDB Endow. PVLDB 2011, 4, 1123–1134. [Google Scholar] [CrossRef]
Atre, M. Left Bit Right: For SPARQL Join Queries with OPTIONAL Patterns (Left-outer-joins). In Proceedings of the SIGMOD Conference, Melbourne, Australia, 31 May–4 June 2015; ACM: New York, NY, USA, 2015; pp. 1793–1808. [Google Scholar]
Colby, L.S. A recursive algebra and query optimization for nested relations. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, OR, USA, 31 May–2 June 1989; SIGMOD ’89. pp. 273–283. [Google Scholar] [CrossRef]
Liu, H.C.; Ramamohanarao, K. Algebraic equivalences among nested relational expressions. In Proceedings of the Third International Conference on Information and Knowledge Management, Gaithersburg, MD, USA, 29 November–1 December 1994; CIKM ’94. pp. 234–243. [Google Scholar] [CrossRef]
Leser, U.; Naumann, F. Informationsintegration: Architekturen und Methoden zur Integration Verteilter und Heterogener Datenquellen; dpunkt.verlag: Heidelberg, Germany, 2006. [Google Scholar]
Elmasri, R.A.; Navathe, S.B. Fundamentals of Database Systems, 7th ed.; Pearson: New York, NY, USA, 2016. [Google Scholar]
Atzeni, P.; Ceri, S.; Paraboschi, S.; Torlone, R. Database Systems—Concepts, Languages and Architectures; McGraw-Hill Book Company: New York, NY, USA, 1999. [Google Scholar]
den Bussche, J.V. Simulation of the nested relational algebra by the flat relational algebra, with an application to the complexity of evaluating powerset algebra expressions. Theor. Comput. Sci. 2001, 254, 363–377. [Google Scholar] [CrossRef][Green Version]
Green, T.J.; Karvounarakis, G.; Tannen, V. Provenance semirings. In Proceedings of the Twenty-Sixth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, Beijing, China, 11–13 June 2007; PODS ’07. pp. 31–40. [Google Scholar] [CrossRef]
Chapman, A.; Missier, P.; Simonelli, G.; Torlone, R. Capturing and querying fine-grained provenance of preprocessing pipelines in data science. Proc. VLDB Endow. 2020, 14, 507–520. [Google Scholar] [CrossRef]
Junghanns, M.; Petermann, A.; Rahm, E. Distributed Grouping of Property Graphs with GRADOOP. In Datenbanksysteme für Business, Technologie und Web (BTW 2017); Gesellschaft für Informatik: Bonn, Germany, 2017; pp. 103–122. [Google Scholar]
Bellatreche, L.; Kechar, M.; Bahloul, S.N. Bringing Common Subexpression Problem from the Dark to Light: Towards Large-Scale Workload Optimizations. In Proceedings of the 25th International Database Engineering & Applications Symposium, IDEAS, Montreal, QC, Canada, 14–16 July 2021; ACM: New York, NY, USA, 2021. [Google Scholar]
Aho, A.V.; Lam, M.S.; Sethi, R.; Ullman, J.D. Compilers: Principles, Techniques, and Tools, 2nd ed.; Addison-Wesley Longman Publishing Co., Inc.: San Francisco, CA, USA, 2006. [Google Scholar]
Ulrich, H.; Kern, J.; Tas, D.; Kock-Schoppenhauer, A.; Ückert, F.; Ingenerf, J.; Lablans, M. QL4MDR: A GraphQL query language for ISO 11179-based metadata repositories. BMC Med. Inform. Decis. Mak. 2019, 19, 1–7. [Google Scholar] [CrossRef]
Zhang, T.; Subburathinam, A.; Shi, G.; Huang, L.; Lu, D.; Pan, X.; Li, M.; Zhang, B.; Wang, Q.; Whitehead, S.; et al. GAIA—A Multi-media Multi-lingual Knowledge Extraction and Hypothesis Generation System. In Proceedings of the 2018 Text Analysis Conference, TAC 2018, Gaithersburg, MD, USA, 13–14 November 2018. [Google Scholar]
Parr, T. The Definitive ANTLR 4 Reference, 2nd ed.; Pragmatic Bookshelf: Raleigh, NC, USA, 2013. [Google Scholar]
Tarjan, R.E. Edge-Disjoint Spanning Trees and Depth-First Search. Acta Inform. 1976, 6, 171–185. [Google Scholar] [CrossRef]
de Marneffe, M.C.; Manning, C.D.; Nivre, J.; Zeman, D. Universal Dependencies. Comput. Linguist. 2021, 47, 255–308. [Google Scholar] [CrossRef]
Martelli, A.; Montanari, U. Unification in Linear Time and Space: A Structured Presentation; Technical Report Vol. IEI-B76-16; Consiglio Nazionale delle Ricerche, Pisa: Pisa, Italy, 1976. [Google Scholar]
Rozenberg, G. (Ed.) Handbook of Graph Grammars and Computing by Graph Transformation: Volume II; WSP: Anchorage, AK, USA, 1999. [Google Scholar]
Talmor, A.; Herzig, J.; Lourie, N.; Berant, J. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (Long and Short, Papers). Burstein, J., Doran, C., Solorio, T., Eds.; Association for Computational Linguistics: Stroudsburg, PA, USA, 2019; pp. 4149–4158. [Google Scholar] [CrossRef]
de Marneffe, M.; MacCartney, B.; Manning, C.D. Generating Typed Dependency Parses from Phrase Structure Parses. In Proceedings of the Fifth International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy, 22–28 May 2006; Calzolari, N., Choukri, K., Gangemi, A., Maegaard, B., Mariani, J., Odijk, J., Tapias, D., Eds.; European Language Resources Association (ELRA): Paris, France, 2006; pp. 449–454. [Google Scholar]
Chen, D.; Manning, C.D. A Fast and Accurate Dependency Parser using Neural Networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, Doha, Qatar, 25–29 October 2014; A Meeting of SIGDAT, a Special Interest Group of the ACL. Moschitti, A., Pang, B., Daelemans, W., Eds.; ACL: Doha, Qatar, 2014; pp. 740–750. [Google Scholar] [CrossRef]
Kotiranta, P.; Junkkari, M.; Nummenmaa, J. Performance of Graph and Relational Databases in Complex Queries. Appl. Sci. 2022, 12, 6490. [Google Scholar] [CrossRef]
Győrödi, C.; Győrödi, R.; Pecherle, G.; Olah, A. A comparative study: MongoDB vs. MySQL. In Proceedings of the 2015 13th International Conference on Engineering of Modern Electric Systems (EMES), Oradea, Romania, 11–12 June 2015; pp. 1–6. [Google Scholar] [CrossRef]

Figure 1. Listing all the subgraphs of g being a solution of the subgraph isomorphism problem of g over L. (a) Graph g to be mathed; (b) Graph pattern L; (c) Morphism table

M T [L, g]

where each row describes a morphism

μ_{i}

between the graph matching L and the graph g.

Figure 1. Listing all the subgraphs of g being a solution of the subgraph isomorphism problem of g over L. (a) Graph g to be mathed; (b) Graph pattern L; (c) Morphism table

M T [L, g]

where each row describes a morphism

μ_{i}

between the graph matching L and the graph g.

Figure 2. Graph grammar production rules à la GraphLog [33] in this paper’s use case scenario: thick denotes insertions, crosses deletions, and optional matches are dashed. We extended it with multiple optional edge label matches (‖), key-value association

π (λ, X)

for property

λ

and vertex X, and multiple vertex values

ξ (X)

. (a) Injecting the articles/possessive pronouns (

λ

) in Y for an entity X as its own properties, while deleting

λ

and Y; (b) Expressing the verb as a binary relationship between subject and direct object; (c) Generating a new entity

H^{'}

coalescing the ones

\vec{H}

under the same conjunction Z, while referring to its original constituents via orig.

Figure 2. Graph grammar production rules à la GraphLog [33] in this paper’s use case scenario: thick denotes insertions, crosses deletions, and optional matches are dashed. We extended it with multiple optional edge label matches (‖), key-value association

π (λ, X)

for property

λ

and vertex X, and multiple vertex values

ξ (X)

. (a) Injecting the articles/possessive pronouns (

λ

) in Y for an entity X as its own properties, while deleting

λ

and Y; (b) Expressing the verb as a binary relationship between subject and direct object; (c) Generating a new entity

H^{'}

coalescing the ones

\vec{H}

under the same conjunction Z, while referring to its original constituents via orig.

Figure 3. Framing Figure 1 in the context of Neo4j’s implementation of the Property Graph model. (a) Dependency graph for “Alice and Bob play cricket”; (b) Neo4j’s property graph morphism table.

Figure 6. Applying the rewriting rules expressed in Figure 2: different colours refer to different graph grammar rules (b and c), filled vertices in the left (and right) graph refer to the distinct vertex entry-points (and newly generated components). (a) Dependency graph for “Matt and Tray believe that either Alice and Bob and Carl play cricket or Carl and Dan will not have a way to amuse themselves”. While object IDs are presented as numbers, containment IDs are omitted. (b) Generating a binary relationship between the subject as a single entity and the direct object.

Figure 7. Resulting morphisms from the application of the graph grammar rules from Listing 1 over the GSM database in Figure 6a, from which the resulting rewritten database Figure 6b is then obtained. (a) Morphisms

M [p_{1}, g_{1}]

. (b) Morphisms

M [p_{2}, g_{1}]

, where * refers to sub-matches nested over the entry point (See Algorithm from Section 6.4). (c) Morphisms

M [p_{3}, g_{1}]

.

Figure 7. Resulting morphisms from the application of the graph grammar rules from Listing 1 over the GSM database in Figure 6a, from which the resulting rewritten database Figure 6b is then obtained. (a) Morphisms

M [p_{1}, g_{1}]

. (b) Morphisms

M [p_{2}, g_{1}]

, where * refers to sub-matches nested over the entry point (See Algorithm from Section 6.4). (c) Morphisms

M [p_{3}, g_{1}]

.

Figure 9. Analysing DatagramDB. (a) Results grouped by number of words per sentence. (b) Results grouped by number of databases.

Figure 10. Sampled probability density function associated with the number of words within the sentences for each subset of traces.

Figure 11. Running time of each algorithm with different sentence samples.

Table 1. Table displaying results from rewriting the aforementioned sentences.

Data Model		Loading/Indexing (avg. ms)	Querying (avg. ms)	Materialisation (avg. ms)	Total (ms)
Neo4j {	Simple	$1.83 \times 10^{0}$	$7.03 \times 10^{0}$	N/A	$8.86 \times 10^{0}$
Neo4j {	Complex	$9.32 \times 10^{0}$	$1.93 \times 10^{2}$	N/A	$2.02 \times 10^{2}$
GSM {	Simple	$9.63 \times 10^{- 2}$	$4.82 \times 10^{- 1}$	$2.40 \times 10^{- 2}$	$6.02 \times 10^{- 1}$
GSM {	Complex	$6.91 \times 10^{- 1}$	$9.00 \times 10^{- 1}$	$6.67 \times 10^{- 1}$	$2.26 \times 10^{0}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article Metrics

Citations

Article Access Statistics

Journal Statistics

Multiple requests from the same IP address are counted as one view.

Matching and Rewriting Rules in Object-Oriented Databases

Abstract

1. Introduction

2. Preliminary Notation

Higher-Order Functions

3. Related Works

3.1. Graph Data

3.1.1. Logical Model

Direct Acyclic Graphs (DAGs) and Topological Sort

Property Graphs

RDF

3.1.2. Query Languages

Graph Grammars

Proper Graph Query Languages

3.2. Nested Relational Model

3.2.1. Logical Model

3.2.2. Query Languages

3.2.3. Columnar Physical Model

4. Generalised Semistructured Model v2.0

4.1. Logical Model

4.2. Physical Model

4.3. GSM View Δ ( g )

4.3.1. Object Replacement and Resolution

4.3.2. View Materialisation

4.4. Morphism Notation

5. Nested Natural Equi-Join

Properties

6. Generalised Graph Grammar

6.1. Syntax and Informal Semantics

6.2. Determining the Order of Application of the Rules

6.3. Containment Matching

6.3.1. Pseudocode Notation for L i

6.3.2. Procedural Semantics for Matching and Caching Containments

6.3.3. Algorithmic Choices for Optimisation

6.4. Morphism Instantiation and Indexing

6.5. Graph Rewriting Operations (op from R i )

7. Language Properties

8. Time Complexity

9. Empirical Evaluation

9.1. Comparing Cypher and Neo4j with Our Proposed Implementation

9.2. Scalability for the Proposed Implementation

10. Conclusions and Future Works

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

Appendix A. Full Definitions

Appendix A.1. Variable Resolution

Appendix A.2. Predicate Evaluation (Θ)

Appendix A.3. Expression Evaluation (expr)

Appendix A.4. Full SOS Rewriting Specifications

Appendix A.5. Converting GSM from Any Data Representation

Appendix B. Proofs

Appendix B.1. Transformation Isomorphism

Appendix B.2. Nested Equi-Join Properties

Appendix B.3. Language Properties

Appendix B.4. Time Complexity

References

Article Metrics

Article Access Statistics

4.3. GSM View $Δ (g)$

6.3.1. Pseudocode Notation for $L_{i}$

6.5. Graph Rewriting Operations (`op` from $R_{i}$ )