Article

RNGSGLR: Generalization of the Context-Aware Scanning Architecture for All Character-Level Context-Free Languages

Faculty of Electrical Engineering and Computer Science, University of Maribor, Koroška cesta 46, 2000 Maribor, Slovenia
*
Author to whom correspondence should be addressed.
Mathematics 2022, 10(14), 2436; https://doi.org/10.3390/math10142436
Submission received: 19 May 2022 / Revised: 30 June 2022 / Accepted: 4 July 2022 / Published: 13 July 2022
(This article belongs to the Special Issue Mathematics: 10th Anniversary)

Abstract

The limitations of the traditional parsing architecture are well known. Even when paired with parsing methods that accept all context-free grammars (CFGs), the resulting combination, for any given CFG, accepts only a limited subset of the corresponding character-level context-free languages (CFLs). We present a novel scanner-based architecture that, for any given CFG, accepts all corresponding character-level CFLs. It can directly parse all possible specifications consisting of a grammar and regular definitions. The architecture is based on right-nulled generalized LR (RNGLR) parsing and is a generalization of the context-aware scanning architecture. Our architecture does not require any disambiguation rules to resolve lexical conflicts, it conceptually has an unbounded parser and scanner lookahead, and it is streaming. The added robustness and flexibility allow for easier grammar development and modification.

1. Introduction

The traditional parsing architecture is based on the assumption that programming languages are akin to natural languages. The input string of characters is decomposed into a string of words, which then form sentences. The analogy is sound, which justifies the division of the process into two phases, scanning and parsing [1]. Traditionally, scanning is expected to be deterministic and context-independent, which contrasts with the fact that scanning is heavily non-deterministic for virtually all programming languages. The non-determinism is resolved using ad hoc disambiguation rules, such as the longest match, priority, scanner lookahead, disambiguation functions, symbol tables, and/or modifications to the grammar [2,3]. The crux of the problem is that, even if the parsing method accepts all context-free grammars (CFGs), when it is paired with the traditional parsing architecture, for any given CFG the corresponding set of accepted character-level context-free languages (CFLs) is severely restricted [3,4,5,6,7,8,9,10]. The character-level CFLs are languages defined down to the character level [4,5]. This prompted the development of the following approaches:
(1)
Scannerless architectures, which solve the problem by abandoning the use of the scanner [4,5,11].
(2)
Context-dependent scanner-based architectures, where the scanner receives the contextual information from the parser [6,7,8,9]. Most are deterministic, which means they are still dependent on the ad hoc disambiguation rules.
(3)
Nondeterministic scanner-based architectures, where the problem of resolving the non-determinism in the scanner is offloaded to a non-deterministic parser [8,12,13,14].
The contribution of this work is a novel scanner-based architecture, which is a fusion of approaches (2) and (3), and which, for any given CFG, accepts all corresponding character-level CFLs. That means it does not require any disambiguation rules to resolve lexical conflicts. However, as our architecture captures all lexical ambiguities, for certain applications, disambiguation rules are still useful to limit the number of interpretations. It can directly parse all possible specifications consisting of a grammar and regular definitions. Our architecture is based on the right-nulled generalized LR (RNGLR) parser [15,16,17], and is a direct generalization of the context-aware scanning architecture by Van Wyk [6] and Schwerdfeger [6,18]. To our knowledge, this is the first practical architecture using a scanner with such power. The only other practical approach that for any given CFG accepts all corresponding character-level CFLs is the scannerless GLR (SGLR) [5,11]. However, it is based on the approach in (1) and, therefore, does not use a scanner.
We share the sentiment of Tomita [15], that is, focus should be placed on the nearly deterministic languages. Virtually all programming languages fall into this category. Our architecture is geared for this use case. It degenerates to context-aware scanning for the deterministic segments, otherwise the performance degrades gracefully. For the nearly deterministic languages, it performs in predominantly linear time.
Conceptually, our architecture has an unbounded parser and scanner lookahead. This is a direct consequence of the fact that the parser and the scanner are both non-deterministic. The tables for our architecture are almost the same as the ones for the context-aware scanning architecture. The difference is that the parsing tables can have a few additional entries for  ε -shift actions. Unlike other architectures based on approach (3), our architecture is streaming; thus, the characters are processed one by one. As a result, it is also usable in online applications [15]. That is, the characters can be processed as they are typed without buffering.
The CFLs are closed under union and concatenation, and a neutral element exists for both operations. Our architecture for any given CFG accepts all character-level CFLs and, thus, preserves these properties and the related identities, and, by extension, all other operators that are based on them, such as the Kleene star. For the lexical part of the specification, we used the extended regular expressions [19,20]; that means, the complement, intersection, and difference are supported there as well.
These benefits mean that the grammar writers have more freedom since they are not constrained to the grammars where the lexical syntax can be disambiguated deterministically. The grammars are also less fragile because they do not depend on a careful selection of disambiguation rules to be parsed. That way, the modifications to the existing grammar can be performed more easily.
Our architecture can be utilized for:
  • Legacy programming languages, such as COBOL and Fortran. Additionally, POSIX shell [21] could be supported as well (with a few extensions for the syntax that is not context-free). These languages were developed before the advent of modern parsing methods. Therefore, there was no inclination for the language to be parsed using deterministic methods, such as LR. However, even modern programming languages most commonly cannot be parsed using deterministic methods without ad hoc disambiguation rules and/or modifications to the grammar [11,14].
  • Composite programming languages. These are programming languages with embedded sublanguages. The sublanguages usually have a different lexical syntax than the host language. The problem is solved traditionally by surrounding the sublanguage with sentinels. The scanner then recognizes that sublanguage as a single lexeme, which is then processed separately, or it switches to a different scanner mode when it encounters a sentinel. Simple omnipresent examples are the escape sequences in strings and string interpolation [7,14,22]. Further examples where sublanguages are pervasive are:
    (a)
    HTML, which allows the embedding of JavaScript and CSS.
    (b)
    PHP, which is itself embedded in HTML [7].
    (c)
    Parser and scanner specifications, which include the sublanguages for specifying actions [7].
    (d)
    TeX, where each environment can be its own sublanguage.
  • Extensible languages. These are languages that allow extensions of the syntax to be specified modularly. This allows embedding of arbitrary domain-specific languages [23]. That way, even non-experts can extend the language [6,7,22,24,25,26].
  • Language workbenches. These allow for the automatic generation of interactive environments and related language processing and manipulation tools for domain-specific languages [23]. Traditionally these have utilized scannerless architectures [27,28,29].
  • Natural language applications. GLR parsing was originally developed for this domain [15,16,30].
All of these use cases are subsumed by supporting all possible specifications, although some require a more sophisticated metalanguage than the one we use. Such metalanguages have been studied extensively in other works [5,6,7,14,18,27,28,29].
The paper is organized as follows. In Section 2, we introduce the basic notational conventions used in this work. In Section 3, we introduce the preliminaries. In Section 3.1, we introduce RNGLR parsing. In Section 3.2, we introduce the traditional parsing architecture briefly to establish the terminology and the related notation. In Section 3.3, we introduce context-aware scanning, which presents the groundwork for our architecture. In Section 4, we present our architecture. The construction of the parse forest is discussed in Section 5. In Section 6, we present the addition of the scanner lookahead as an optimization. In Section 7, we provide a formal proof of correctness. In Section 8, we describe our experience using our architecture. Finally, the discussion of the related work follows in Section 9, and the conclusion is in Section 10.

2. Basic Notation

We use the notation introduced in the Dragon book [2] with the following modifications/additions: Lowercase letters from the end of the Greek alphabet, υ, ψ, ω, are used for strings of terminal symbols. The letters v, u, w are used for the graph-structured stack (GSS) vertices. The letter z is used for the shared packed parse forest (SPPF) nodes. The letters t, l are additionally used for the terminal symbols. The letters M, Ṁ, M̊ are used for automata. Bold uppercase letters from the Latin or the Greek alphabets, S, Ψ, are used for sets. Fraktur lowercase letters from the Latin alphabet, i, j, k, are used for the positions in the input string. Cursive uppercase letters from the Latin alphabet, S, are used for mutable references to sets. That means that the set itself cannot change while its elements are iterated; however, S will point to different sets during the execution of the algorithm. A distinguished subscript marker is used to denote initial elements. For example, we write the start state of the automaton as q∗, instead of q_0.
We use the following operators: P(S) is the power set of a set. S* is the Kleene closure of a set. S⁺ is the positive closure of a set. R* is the transitive reflexive closure of a relation R. |ω| is the length of a string. |S| is the size of a set. ↛ is a partial function, which means it only maps a subset of its domain. S × P is the cross-product between two sets. ω(i, j) is a substring of ω starting at i and ending at j. The character at i is not included, and j is included in the substring.
An indexed family is a collection of elements from a set S indexed by a set I , denoted as  ( s i ) i I . It is equivalent to a function, ι : I S , such that ι ( i ) = s i . A tuple ( s 1 , , s n ) is a family indexed by a set { 1 , , n } , and can be written as  ( s i ) i { 1 , , n } . Appending to a tuple will be denoted as ( s 1 , , s n ) s = ( s 1 , , s n , s n + 1 ) .

3. Preliminaries

A CFG defined over a finite set of symbols Σ is a 4-tuple G = (N, Σ, R, S), where N is a finite set of non-terminal symbols, Σ is a finite set of terminal symbols, such that N ∩ Σ = ∅, R ∈ P(N × (N ∪ Σ)*) is a relation that contains production rules of the form A ::= β, and S ∈ N is the start symbol. A regular grammar (RG) is a CFG with the production rules restricted to A ::= ψ and A ::= ψB, where A, B ∈ N and ψ ∈ Σ*. A character-level CFG is a CFG over a finite set of characters Ω = {a, b, c, …} [4,5]. A string of characters is any ω ∈ Ω*.
A derivation step for the production rule A ::= β is γAη ⇒ γβη, where γ, η ∈ (N ∪ Σ)*. A derivation of τ from σ is a sequence of derivation steps σ ⇒ ⋯ ⇒ τ, denoted as σ ⇒* τ, or σ ⇒⁺ τ if the sequence is non-empty. A string σ ∈ (N ∪ Σ)* where S ⇒* σ is called a sentential form. A sentential form ω ∈ Σ* is called a sentence. The FIRST_k set of a string σ ∈ (N ∪ Σ)* is defined as:
FIRST_k(σ) = { ω ∈ Σ* ∣ σ ⇒* ωψ ∧ |ω| ≤ k }
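As a concrete illustration of the FIRST_k sets, the sketch below computes FIRST₁ with a standard fixed-point iteration. It is our own sketch, not code from this paper; the dict-based grammar encoding, the helper names, and the use of '' for ε are assumptions made for the example.

```python
def first_of_seq(seq, first):
    """FIRST_1 of a sentential form; '' plays the role of epsilon."""
    out = set()
    for sym in seq:
        out |= first[sym] - {''}
        if '' not in first[sym]:
            return out
    out.add('')          # every symbol in the sequence was nullable
    return out

def first_sets(grammar, terminals):
    """Fixed-point iteration over all production rules."""
    first = {t: {t} for t in terminals}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:
        changed = False
        for nt, rhss in grammar.items():
            for rhs in rhss:
                f = first_of_seq(rhs, first)
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

# Grammar from the worked example in Section 3.2: S ::= A c | b c, A ::= b
fs = first_sets({'S': [['A', 'c'], ['b', 'c']], 'A': [['b']]}, {'b', 'c'})
```

Running it on the grammar of the later worked example gives FIRST₁(S) = FIRST₁(A) = {b}.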
A CFL generated by a context-free grammar G is the set of all strings that can be produced by G and is defined as  L ( G ) = { ω Σ * S * ω } . The CFGs have an independent production property, which means that the derivation steps do not depend on the surrounding context. Therefore, we can extend the definition of the language to any non-terminal symbol A in the grammar, L ( A ) = { ω Σ * A * ω } [1]. The language of a terminal symbol a Σ is L ( a ) = { a } . Regular language (RL) is a language generated by an RG.
For ω, ψ ∈ Σ*, ψ is a prefix of ω if ∃ω′ ∈ Σ*, ω = ψω′. If ω′ ≠ ε, then ψ is a proper prefix of ω. For ω, ψ ∈ Σ*, ψ is a suffix of ω if ∃ω′ ∈ Σ*, ω = ω′ψ.
A language L is prefix-free if, for all non-empty ψ₁, ψ₂ ∈ L, neither ψ₁ is a proper prefix of ψ₂ nor ψ₂ is a proper prefix of ψ₁ [31]. The languages L₁ and L₂ are prefix-disjoint, denoted L₁ ⊥_pref L₂, if for all non-empty ψ₁ ∈ L₁ and ψ₂ ∈ L₂, neither ψ₁ is a proper prefix of ψ₂ nor ψ₂ is a proper prefix of ψ₁ [32]. The languages L₁ and L₂ are disjoint, denoted L₁ ⊥ L₂, if L₁ ∩ L₂ = ∅.
The (extended) regular expressions (RE) over a finite alphabet Σ are defined recursively as follows:
  • ∅, ε and a Σ are RE.
  • If r and s are RE, then their concatenation ( r s ) , union ( r | s ) , Kleene star ( r ) * , Kleene plus ( r ) + , and as an extension complement ¬ ( r ) , intersection ( r & s ) , and difference ( r s ) are RE.
They denote the following languages:
L(∅) = ∅        L(ε) = {ε}        L(a) = {a}
L(rs) = { ωψ ∣ ω ∈ L(r) ∧ ψ ∈ L(s) }        L(r|s) = L(r) ∪ L(s)
L(r*) = {ε} ∪ L(r r*)        L(r⁺) = L(r r⁺)
L(¬r) = Σ* ∖ L(r)        L(r&s) = L(r) ∩ L(s)        L(r−s) = L(r) ∖ L(s)
A deterministic finite automaton (DFA) is a 5-tuple M = (Q, Σ, δ, q∗, F), where Q is a finite set of states, Σ is a finite alphabet, δ : Q × Σ ↛ Q is a partial transition function, q∗ ∈ Q is the start state, and F ⊆ Q is a set of final states. A DFA is complete if δ(q, a) is defined for all q ∈ Q and a ∈ Σ. The transition function of a DFA can be extended to δ* : Q × Σ* ↛ Q:
δ*(q, ε) = q        δ*(q, aω) = δ*(q′, ω) when q′ = δ(q, a) is defined
The language of a DFA is defined as L(M) = { ω ∈ Σ* ∣ δ*(q∗, ω) ∈ F } [33]. A DFA corresponding to an empty language will be denoted as ⌀, such that L(⌀) = ∅. A path in the DFA, q₀, …, q_k, such that for all i ∈ {0, …, k−1} there exists a ∈ Σ with q_{i+1} = δ(q_i, a), will be denoted as q₀ → q_k, or as q₀ →ψ q_k, where q_k = δ*(q₀, ψ). For the following constructions, the DFAs need to be complete.
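The definitions above can be made concrete with a minimal sketch (ours, not the paper's; the class name and the dict-based encoding of δ are assumptions) that implements the extended transition function δ* for a complete DFA:

```python
class DFA:
    """A complete DFA; delta is a dict keyed by (state, character)."""
    def __init__(self, states, alphabet, delta, start, finals):
        self.states, self.alphabet = states, alphabet
        self.delta, self.start, self.finals = delta, start, finals

    def delta_star(self, q, word):
        # delta*(q, eps) = q; delta*(q, a.w) = delta*(delta(q, a), w)
        for a in word:
            q = self.delta[(q, a)]
        return q

    def accepts(self, word):
        return self.delta_star(self.start, word) in self.finals

# A DFA for the regular definition r_b = xy from the worked example in
# Section 3.2, completed over {x, y} with a sink state 3.
delta = {(0, 'x'): 1, (0, 'y'): 3, (1, 'x'): 3, (1, 'y'): 2,
         (2, 'x'): 3, (2, 'y'): 3, (3, 'x'): 3, (3, 'y'): 3}
m_b = DFA({0, 1, 2, 3}, {'x', 'y'}, delta, 0, {2})
```

The automaton accepts exactly the string "xy"; every other input ends outside the final state 2.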
The union DFA of an indexed family of DFAs M₁, …, Mₙ is M₁|⋯|Mₙ = (Q₁ × ⋯ × Qₙ, Σ, δ, (q∗,₁, …, q∗,ₙ), F), where δ((q₁, …, qₙ), a) = (δ₁(q₁, a), …, δₙ(qₙ, a)) and F = { (q₁, …, qₙ) ∣ q₁ ∈ F₁ ∨ ⋯ ∨ qₙ ∈ Fₙ }. The reflexive transitive closure of δ, δ̂ : Q → P(Q), is the least solution of the following set equation:
δ̂(q) = {q} ∪ ⋃_{a ∈ Σ} δ̂(δ(q, a))
A set of dead states Q_∅ contains the non-final states from which no final state can be reached, Q_∅ = { q ∈ Q ∣ δ̂(q) ∩ F = ∅ }. The FIRST set of an automaton M, such that FIRST(M) = FIRST(L(M)), can be defined as FIRST(M) = { a ∣ a ∈ Σ ∧ δ(q∗, a) ∉ Q_∅ }.
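Dead states and FIRST(M) are both simple reachability computations over the closure of δ. A sketch, with our own dict-based encoding and names:

```python
def closure(delta, alphabet, q):
    """Reflexive transitive closure of delta: all states reachable from q."""
    seen, stack = set(), [q]
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        stack.extend(delta[(s, a)] for a in alphabet)
    return seen

def dead_states(states, alphabet, delta, finals):
    # non-final states from which no final state can be reached
    return {q for q in states if not (closure(delta, alphabet, q) & finals)}

def first_of_automaton(alphabet, delta, start, dead):
    # characters whose first transition avoids the dead states
    return {a for a in alphabet if delta[(start, a)] not in dead}

# Complete DFA for r_b = xy over {x, y}; state 3 is the sink.
delta = {(0, 'x'): 1, (0, 'y'): 3, (1, 'x'): 3, (1, 'y'): 2,
         (2, 'x'): 3, (2, 'y'): 3, (3, 'x'): 3, (3, 'y'): 3}
dead = dead_states({0, 1, 2, 3}, {'x', 'y'}, delta, {2})
```

Here the only dead state is the sink, and FIRST(M) = {x}, since only x leads off the start state toward a final state.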
A complete DFA with a variable start state is defined as M_q = (Q, Σ, δ, q ∈ Q, F). Recognizing ψ using any M_q, δ*(q, ψ) ∈ F, can be split into two steps. Let ψ = ψ₁ψ₂, where ψ₁, ψ₂ ∈ Σ*; first, transition to the state q′ = δ*(q, ψ₁) using M_q, and second, recognize the suffix ψ₂ using M_{q′}, δ*(q′, ψ₂) ∈ F. Notice that the second step can be split again. If we iterate this process k times, we can recognize ψ = ψ₁⋯ψ_k, where ψ₁, …, ψ_k ∈ Σ*, in k steps.
For each RL, there exists an RG G, a RE r, and a DFA M, such that L ( G ) = L ( r ) = L ( M ) . That is, all of the representations have the same expressive power, and can be converted into one another [1,2].

3.1. RNGLR Parsing

RNGLR [17] parsing for a grammar G over Σ is an extension of GLR parsing [15,16], which in turn is a generalization of LR parsing; it can parse all CFGs while remaining efficient on (near-)deterministic ones. It uses the handle-finding automaton for LR parsing extended with short-circuiting reduce actions for the right-nulled productions of the form A ::= γη, where η ⇒* ε. Refer to [17] for the discussion on why that is required. The handle-finding automaton is defined as (S, N ∪ Σ, T_G, s∗, H). The transition function is called the goto table, T_G : S × (N ∪ Σ) ↛ S. Each state s ∈ S of the automaton is labeled with a set of items. An item (A ::= γ · η, a) is a production with a dot in the right-hand side, which delimits the symbols that have already been seen from those that have not. The a ∈ Σ_$ is the lookahead symbol, where Σ_$ = Σ ∪ {$} is the set of all lookahead symbols (the $ is the end-of-input marker). The sets of items are used to construct the action table, T_A : S × Σ_$ → P(Δ), where Δ is the set of all actions. For each state s ∈ S, the shift items (A ::= γ · aη, a′) ∈ s are converted into {(s, a) ↦ SHIFT}, and the reduce items (A ::= γ · η, a) ∈ s, where η ⇒* ε, are converted into {(s, a) ↦ REDUCE(A, |γ|)}. The grammar needs to be augmented with the production rule S′ ::= S. The augmentation results in items (S′ ::= S ·, $), and possibly (S′ ::= · S, $), which are converted into {(s, $) ↦ ACCEPT}. The state whose label contains such an item is the final state s_H of the automaton [1,2]. To construct the automaton, multiple methods exist, such as LR, SLR, LALR [1], and IELR [7]. In this paper, LALR is used; however, the algorithms discussed apply to all of them.
The handle-finding automaton by itself is deterministic; however, more than one action may be possible in each state. If more than one reduce action is possible, there is a reduce/reduce conflict, and if a shift and a reduce action are possible, there is a shift/reduce conflict. We call these states inadequate. The presence of such states results in non-determinism, which is resolved in RNGLR parsing using a breadth-first search. Conceptually, when there are multiple actions possible in a given state the process splits, and each action is performed in its own branch on its own copy of the stack. Thus, the algorithm performs all possible traversals of the handle-finding automaton. The processes are synchronized, they perform all reduce actions at the given position, shift the next input symbol, and then all move to the next position at the same time. If there are no inadequate states, the process remains deterministic; in that case, the algorithm preserves the linear worst-case complexity of LR parsing [1,15,16].
Copying the stack would result in an exponential blow-up; thus, instead, a directed acyclic graph is used to keep track of the states, called the graph-structured stack (GSS), Γ = (V, E). It allows combining the common parts of the stacks and, as a result, prevents repeating the same work multiple times. The vertices V are labeled with the automaton states S. The function STATE : V → S retrieves the label s ∈ S corresponding to the vertex v ∈ V. For a subset of vertices, W ⊆ V, the function STATE : P(V) → P(S) is defined as STATE(W) = { STATE(w) ∣ w ∈ W }. The edges E ⊆ V × V are unlabeled. We denote edges (w, v) ∈ E as v ← w. In the diagrams, each edge (w, v) is labeled redundantly with the symbol X ∈ (N ∪ Σ) that was used to make the transition, where STATE(w) = T_G(STATE(v), X), to make the presentation clearer. The graph paths w₁, …, w_k, where, for all i ∈ {1, …, k−1}, (w_i, w_{i+1}) ∈ E, are denoted as w_k ←α w₁, where STATE(w₁) = T_G*(STATE(w_k), α).
The vertices are partitioned into subclasses ( U i ) i { 0 , , | ω | } . Each U i V is a set of vertices created when parsing a i , where i represents the position in the input string ω Σ * . The special case is U 0 , which is created at the start, before processing ω . It contains a vertex v , labeled with the start state of the handle-finding automaton s , called the bottom. The states that label paths v α w in  Γ correspond directly to stacks α in LR parsing. The tops of the stacks at position i are states in  STATE ( U i ) .
The algorithm works by traversing ω$, where ω = a₁ ⋯ a_|ω|, from left to right. At each position i, there is a lookahead symbol a_j, where j represents the next position, j = i + 1. First, the vertices w ∈ U_i, such that REDUCE(A, |γ|) ∈ T_A(STATE(w), a_j), are identified. The reduction step is performed by finding all paths v ←γ w in Γ of length |γ|. Then, for a vertex v ∈ U_h, h ∈ {0, …, i}, a vertex u ∈ U_i labeled with T_G(STATE(v), A) is either found or created, and the vertices are connected with an edge v ← u (if one does not already exist). In the case of ε-reduce actions, only a single path can be found, v = w, since |γ| = 0. For the vertex w ∈ U_i, a vertex u ∈ U_i is either found or created, and the vertices are connected with an edge w ← u. Since there can be no more than one such edge, the ε-reduce actions only need to be considered when a new vertex is created. When a new vertex u ∈ U_i is created as a result of an ε-reduce, the reduce actions where |γ| > 0 must not be considered since they are already applied from one of its descendants w ∈ U_i [17]. The process is repeated as new vertices and/or edges are added, until there are no possible reduce actions left. Then, the vertices w ∈ U_i such that SHIFT ∈ T_A(STATE(w), a_j) are identified. For each vertex w, a vertex u ∈ U_j labeled with T_G(STATE(w), a_j) is either found or created, and the vertices are connected with an edge w ← u (if one does not already exist). Then the process is repeated for U_j. If there exists w ∈ U_|ω|, such that ACCEPT ∈ T_A(STATE(w), $), the parsing completed successfully [15,16].
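The GSS bookkeeping described above (find-or-create at most one vertex per automaton state within each U_i, with deduplicated edges whose addition signals that pending reductions must be re-examined) can be sketched as follows. This is a structural sketch with our own names, not the full RNGLR reducer:

```python
class GSS:
    """Levels U_i hold at most one vertex per automaton state; edges
    are deduplicated so repeated reductions cause no extra work."""
    def __init__(self, start_state):
        self._next = 0
        self.state_of = {}      # vertex id -> automaton state label
        self.levels = {}        # i -> {state: vertex id}  (the U_i)
        self.edges = set()      # (u, v): u points back toward the bottom v
        self.bottom = self.find_or_create(0, start_state)

    def find_or_create(self, i, state):
        level = self.levels.setdefault(i, {})
        if state not in level:
            level[state] = self._next
            self.state_of[self._next] = state
            self._next += 1
        return level[state]

    def add_edge(self, u, v):
        """Return True if the edge is new; a new edge means reduce
        actions must be re-applied across it, as described above."""
        before = len(self.edges)
        self.edges.add((u, v))
        return len(self.edges) != before
```

Sharing one vertex per state within a level is exactly what lets the GSS combine common stack parts instead of copying whole stacks.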

3.2. Traditional Parsing Architecture

Traditional parsing architecture for parsing character strings ω Ω * is a two-phase process. The first phase, which is based on character-level RE, is called the lexical analysis and is performed by the scanner. The second phase, which is based on CFG, is called the syntactic analysis, and is performed by the parser. For directional parsers the processes of both phases can be interleaved, which is the case we will focus on.
The parser and the scanner are traditionally constructed using a specification, defined as a pair:
Ξ = (G, (r_t)_{t ∈ T})
It is composed of a CFG G = (N, T, R, S) over a finite set of symbols T and character-level REs (r_t)_{t ∈ T} over the characters Ω, one for each terminal symbol t ∈ T in G, which are called regular definitions [2]. A derivation step for A ∈ N is γAη ⇒ γβη, where γ, η ∈ (N ∪ T)* and there exists a production rule A ::= β, and a derivation step for t ∈ T is σtκ ⇒ σψκ, where σ, κ ∈ (T ∪ Ω)* and ψ ∈ L(r_t). The language generated by Ξ is defined as L(Ξ) = { ω ∈ Ω* ∣ S ⇒* ω }. It is easy to see that Ξ generates a character-level CFL. For each Ξ, there exists a corresponding character-level CFG Ĝ = (N ∪ T ∪ Y, Ω, R′, S), such that L(Ĝ) = L(Ξ). There is a production rule A ::= β ∈ R′ for each A ::= β ∈ R. The REs (r_t)_{t ∈ T} are converted into RGs, and the resultant productions are added to R′. The non-terminal symbols created during the conversion correspond to Y. Due to the independent production property, a conversion in the reverse direction is possible as well.
The scanner automaton is constructed by converting ( r t ) t T into DFA ( M t ) t T , where M t = ( Q t , Ω , δ t , q , t , F t ) , such that L ( M t ) = L ( r t ) . To preserve the association between the terminal symbols and the DFA a function m : Q t P ( T ) is introduced:
m(q_t) = {t}  if q_t ∈ F_t
m(q_t) = ∅   otherwise
It maps a state q_t ∈ Q_t to the corresponding terminal symbol t ∈ T, if q_t ∈ F_t. The scanner uses an automaton Ṁ_T = |_{t ∈ T} M_t, where:
Ṁ_T = (Q̇ = ×_{t ∈ T} Q_t, Ω, δ̇, q̇∗ = (q∗,t)_{t ∈ T}, Ḟ)
Ṁ_T is renumbered for execution; thus, the connection between each q̇ ∈ Q̇ and the corresponding (q_t)_{t ∈ T} is not preserved. Therefore, the states are labeled with m(q̇) = ⋃_{t ∈ T} m(q_t), as the corresponding terminal symbols can no longer be determined after renumbering.
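The construction of Ṁ_T as a product of the per-terminal DFAs, with each product state labeled by m, can be sketched as follows. The encoding is our own assumption: each DFA is a (δ, start, finals) triple with a complete dict-based δ, and only reachable product states are materialized.

```python
def union_dfa(dfas, alphabet):
    """dfas: {terminal: (delta, start, finals)}, all deltas complete.
    Returns the product transition table, its start state, and the
    label m(q) = set of terminals whose component state is final."""
    order = sorted(dfas)
    start = tuple(dfas[t][1] for t in order)
    delta, label = {}, {}
    work, seen = [start], {start}
    while work:
        q = work.pop()
        label[q] = {t for t, qi in zip(order, q) if qi in dfas[t][2]}
        for a in alphabet:
            nq = tuple(dfas[t][0][(qi, a)] for t, qi in zip(order, q))
            delta[(q, a)] = nq
            if nq not in seen:
                seen.add(nq)
                work.append(nq)
    return delta, start, label

# Complete DFAs for r_b = xy and r_c = z over {x, y, z} (sinks: 3 and 2).
d_b = {(0, 'x'): 1, (0, 'y'): 3, (0, 'z'): 3, (1, 'x'): 3, (1, 'y'): 2,
       (1, 'z'): 3, (2, 'x'): 3, (2, 'y'): 3, (2, 'z'): 3,
       (3, 'x'): 3, (3, 'y'): 3, (3, 'z'): 3}
d_c = {(0, 'x'): 2, (0, 'y'): 2, (0, 'z'): 1, (1, 'x'): 2, (1, 'y'): 2,
       (1, 'z'): 2, (2, 'x'): 2, (2, 'y'): 2, (2, 'z'): 2}
delta, start, label = union_dfa({'b': (d_b, 0, {2}), 'c': (d_c, 0, {1})},
                                {'x', 'y', 'z'})
```

After consuming "xy" the product state is labeled {b}; after "z" it is labeled {c}; the start state carries the empty label.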
To combine RNGLR parsing with the traditional parsing architecture, the scanner automaton is constructed for T_$, where r_$ = $. The $ is a character not appearing in Ω. To parse the input string ω ∈ Ω*, the algorithm works by traversing ω$, from left to right. At each position i, the scanner recognizes a lexeme υ(i, j) ∈ Ω_$⁺ in the rest of the input string ω′ ∈ Ω_$⁺, such that ω′ = υ(i, j) ω″ and q̇∗ →υ(i,j) q̇ ∈ Ḟ. The next position is j = i + |υ(i, j)|, and ω″ thus starts at position j. The m(q̇) are the lookahead symbols that are matched.
  • If m(q̇) = ∅, no lexeme is recognized and an error is raised.
  • If m ( q ˙ ) = { t j } , the lookahead symbol t j is passed to the parser.
  • If | m ( q ˙ ) | > 1 , then multiple ( M t ) match the same lexeme and an error is raised.
The process is repeated for  j and ω . Note that for all j { 1 , , | ω | } , U j are non-empty if and only if t j is matched and a shift action is performed for  t j . The U 0 cannot be empty as it contains at least v .
During scanning, two types of lexical conflicts can occur [9]:
(I)
Multiple automata can match a single lexeme. This can occur if the languages of the automata are not pairwise disjoint, ∃t, l ∈ T, t ≠ l ∧ L(M_t) ∩ L(M_l) ≠ ∅. This type of lexical conflict corresponds to a reduce/reduce conflict in LR parsing for Ĝ [8].
(II)
Multiple lexemes of different lengths can be recognized [7,8,18]. This can occur if the automata are not prefix-free or pairwise prefix-disjoint, that is, if there exist t, l ∈ T such that L(M_t) and L(M_l) are not prefix-disjoint (for t = l, this means L(M_t) is not prefix-free). This type of lexical conflict corresponds to a shift/reduce conflict in LR parsing for Ĝ [7,8].
To keep the process deterministic, the automata ( M t ) t T must be pairwise disjoint, prefix-free, and pairwise-prefix disjoint. Since this condition is too restrictive, instead, in circumstances where the scanner is faced with multiple possible choices, one is chosen based on predefined disambiguation rules in the case of deterministic methods [6,7,8,18]. These restrictions are not included in the original definition of  Ξ ; therefore, not all Ξ are accepted by the traditional parsing architecture. In this paper, we present an architecture that accepts all Ξ directly and, as a result, for any given G , accepts all corresponding character-level CFL, instead of parsing G ^ , as is the case with character-level parsers [4,5].
Let us demonstrate the combination with the following specification Ξ :
S ::= A c | b c
A ::= b
r_b = xy        r_c = z        r_$ = $
The handle-finding automaton, constructed using G in (1), is presented in Figure 1. The handle-finding automaton state s S is presented next to each vertex in the diagram and each vertex is labeled with the corresponding set of items. The scanner automaton is presented in Figure 2. Each vertex in the diagram is labeled by a scanner automaton state q ˙ Q ˙ . Above each vertex in the diagram is the lookahead symbol that is matched m ( q ˙ ) . The edges are labeled with sets of characters, a single character denotes a singleton set.
Parsing the input string ω = x y z results in the GSS and the scanner trace presented in Figure 3 and Figure 4, respectively. The scanner trace outlines the paths through the scanner automaton for each recognized lexeme. The number above each start state is the position i at which the automaton starts, and corresponds to  U i from which the shift action will be performed. The terminal symbol above each final state is the lookahead symbol that is passed to the parser. When the automaton enters a dead state, it can no longer match anything; thus, the paths end at that point. Additionally, note that there is no U 1 , as there are no matches at position 1.

3.3. Context-Aware Scanning Architecture

Context-aware scanning [6,18] is an extension of the traditional parsing architecture. Likewise, the scanner automaton is constructed for the lookahead symbols T $ . The difference is that the scanner is provided with the contextual information it lacks from the parser. The information is passed in the form of a valid lookahead set, which is a set of all lookahead symbols that are valid at a given point in the parse. The contextual information is summarized in the state of the handle-finding automaton; thus, for states s S , such a set can be constructed by collecting the lookahead symbols that have actions associated with them:
ν(s) = { t ∈ T_$ ∣ T_A(s, t) ≠ ∅ }
This limits the selection of automata that need to be considered at a given point; thus, for all s ∈ S, only a subset must be pairwise disjoint and pairwise prefix-disjoint: for all t, l ∈ ν(s), t ≠ l, the languages L(M_t) and L(M_l) are disjoint and prefix-disjoint. As a consequence, this allows more specifications Ξ to be parsed deterministically [6,18].
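Deriving the valid lookahead set from the action table is a one-line filter. In the sketch below, the sparse dict encoding of T_A and all names are our own assumptions; both an absent entry and an empty entry stand for T_A(s, t) = ∅.

```python
def valid_lookahead(action_table, s, lookaheads):
    """nu(s): the lookahead symbols with a non-empty action set in state s."""
    return {t for t in lookaheads if action_table.get((s, t))}

# Hypothetical fragment of an action table, keyed by (state, lookahead).
T_A = {(0, 'b'): {'shift'}, (3, 'c'): {'shift'}, (3, '$'): set()}
```

For this fragment, ν(0) = {b} and ν(3) = {c}; the empty entry at (3, $) contributes nothing.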
Let us define a function p : Q t P ( T $ ) for  M t as a reflexive transitive closure of m:
p(q_t) = ⋃_{q′_t ∈ δ̂(q_t)} m(q′_t)
It returns the lookahead symbols that could possibly be matched by continuing from q_t ∈ Q_t. The function can easily be defined for Ṁ_{T_$} as p(q̇) = ⋃_{t ∈ T_$} p(q_t). Note that, for a path q̇₀ → ⋯ → q̇_k, where q̇₀ = q̇∗, in the automaton, the following holds: p(q̇₀) ⊇ p(q̇_k). In other words, the possibilities narrow as more of the input string is seen. When the automaton enters a dead state, q̇ ∈ Q̇_∅, p(q̇) = ∅.
Conceptually, a context-aware scanner uses a distinct scanner automaton M ˙ s for each state in the handle-finding automaton s S , specialized for recognizing lookahead symbols in  ν ( s ) . There exist many different ways to construct such a scanner [6,7,8,9,18]. We use an approach proposed by Van Wyk [6] and Schwerdfeger [6,18].
Instead of constructing an automaton for each handle-finding automaton state, the same automaton is used as in the traditional parsing architecture, M ˙ T $ , and the scanning process is modified in a way that only lookahead symbols from  ν ( s ) can be recognized. The m : S × Q ˙ P ( T $ ) and p : S × Q ˙ P ( T $ ) are defined as:
m(s, q̇) = ν(s) ∩ m(q̇)        p(s, q̇) = ν(s) ∩ p(q̇)
In a traditional scanner, we know that an automaton can no longer match anything when it enters a dead state. Here, this is no longer the case, since we are interested in recognizing only a subset of possible lookahead symbols. Instead, we can use the alternative definition: For some s S , no more matches can be made from the state q ˙ Q ˙ when p ( s , q ˙ ) = . In other words, no lookahead symbols from  ν ( s ) could possibly be matched by continuing from  q ˙ . At that point, the automaton can be stopped. The automaton finds a match when q ˙ F ˙ , then m ( s , q ˙ ) is passed to the parser.
Note that, since possibilities narrow as more of the input string is seen, for a path q ˙ 0 q ˙ k , where q ˙ 0 = q ˙ we can use the following recursive definition as an optimization:
p(s, q̇₀) = ν(s) ∩ p(q̇₀)        p(s, q̇_j) = p(s, q̇_{j−1}) ∩ p(q̇_j)
Since m ( s , q ˙ ) p ( q ˙ ) , we can instead use:
m(s, q̇) = p(s, q̇) ∩ m(q̇)
This way, the sets contain fewer lookahead symbols and the intersection is more efficient to compute [6,18].
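As a concrete illustration, the stop condition and the running intersection can be sketched as follows. This is a minimal Python sketch; the toy automaton, the data layout, and all names are our own assumptions, not the paper's implementation.

```python
# Sketch of context-aware scanning with the running intersection
# optimization (toy automaton and names are illustrative only).

def scan(delta, m, p, valid, text):
    """Scan `text` with a combined token DFA `delta`, keeping a running
    prediction set pred = p(s, q). Returns the last nonempty match
    m(s, q) = pred ∩ m(q) together with its end position."""
    state, pred = 0, frozenset(valid)   # p(s, q0) = ν(s); assumes p(q0) ⊇ ν(s)
    match, end = None, None
    for i, ch in enumerate(text):
        state = delta.get((state, ch))
        if state is None:               # dead state of the combined DFA
            break
        pred = pred & p[state]          # p(s, q_j) = p(s, q_{j-1}) ∩ p(q_j)
        if not pred:                    # nothing in ν(s) can still be matched
            break
        found = pred & m[state]         # m(s, q) = p(s, q) ∩ m(q)
        if found:
            match, end = found, i + 1
    return match, end

# Toy combined automaton for tokens c = "x" and g = "xx":
delta = {(0, "x"): 1, (1, "x"): 2}
m = {0: frozenset(), 1: frozenset({"c"}), 2: frozenset({"g"})}
p = {0: frozenset({"c", "g"}), 1: frozenset({"c", "g"}), 2: frozenset({"g"})}

# With valid lookahead ν(s) = {c}, scanning "xx" stops after one character:
assert scan(delta, m, p, {"c"}, "xx") == (frozenset({"c"}), 1)
```

Note how the same automaton serves every handle-finding state: only the initial prediction set ν(s) changes between calls.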
The worst-case time complexity is O(log|T_$| · |ω|), due to the fact that the intersection needs to be computed at each step. For any s ∈ S, initially |p(s, q̇_s)| = |T_$|; however, the size of the sets decreases as possibilities narrow. The worst-case space complexity remains the same as for the traditional parsing architecture, O(|T_$| · |Q_t̂|^{|T_$|}), where |Q_t̂| is the number of states in the largest automaton; however, the size of the automata is increased by a constant factor due to the additional label on the states for p [18].
Let us combine RNGLR parsing and context-aware scanning. The combination was already suggested by Van Wyk and Schwerdfeger [6,18], and is used by Tree-sitter [34]. In RNGLR parsing, at each point in the parse, the handle-finding automaton may be in multiple states at once. The set of all possible combinations of these states, Π ⊆ P(S), for a given grammar G can be defined as:
Π = { ⋃_{α∈A} { T_G*(s_0, α) } | ψ ∈ T* }    A = { α | α ∈ (N ∪ T)* ∧ S ⇒* αω′ ∧ α ⇒* ψω″ }
The valid lookahead set needs to be extended to each P ∈ Π:
ν(P) = ⋃_{s∈P} ν(s)
Conceptually, the context-aware scanner uses a distinct scanner automaton Ṁ_P for each possible combination of handle-finding automaton states P ∈ Π, specialized for recognizing lookahead symbols in ν(P).
The same idea is utilized, except that now ν(P) is used instead. The functions m : Π × Q̇ → P(T_$) and p : Π × Q̇ → P(T_$) are defined as:
m(P, q̇) = ν(P) ∩ m(q̇)    p(P, q̇) = ν(P) ∩ p(q̇)
The worst-case time and space complexity remain unchanged.
There are no lexical conflicts if, for all P ∈ Π, the automata corresponding to the lookahead symbols in ν(P) are pairwise disjoint and pairwise prefix-disjoint; that is, ∀t, l ∈ ν(P), t ≠ l : L(M_t) ∩ L(M_l) = ∅ ∧ L(M_t) ∩ pref(L(M_l)) = ∅, where pref(L) denotes the set of proper prefixes of the strings in L.
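For finite token languages, this condition can be checked directly. The following Python sketch makes the simplifying assumption that each L(M_t) is given as a finite set of lexemes; the paper's automata recognize arbitrary regular languages, for which a product-automaton check would be needed instead.

```python
from itertools import combinations

def is_proper_prefix(u, v):
    return v.startswith(u) and u != v

def conflict_free(languages):
    """Check the no-lexical-conflict condition for one valid lookahead set.
    `languages` maps each lookahead symbol to a finite set of lexemes
    (a simplification of the regular languages L(M_t) in the text)."""
    for (t, Lt), (l, Ll) in combinations(languages.items(), 2):
        if Lt & Ll:                                   # not disjoint
            return False
        if any(is_proper_prefix(u, v) or is_proper_prefix(v, u)
               for u in Lt for v in Ll):              # not prefix-disjoint
            return False
    return True

# c = "x" and g = "xx": disjoint, but x is a prefix of xx
assert not conflict_free({"c": {"x"}, "g": {"xx"}})
# c = "x" and e = "x": not even disjoint
assert not conflict_free({"c": {"x"}, "e": {"x"}})
# d = "y" and f = "z": no conflict
assert conflict_free({"d": {"y"}, "f": {"z"}})
```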
Let us define subsets (U_{i,j})_{j∈{0,…,n}} of each subclass U_i, where U_{i,0} represents the vertices created as the shift actions were processed at the previous position, each of the following (U_{i,j})_{j∈{1,…,n}} represents the vertices created after j reduce actions were processed, and n is the number of reduce actions processed in total. The corresponding sets of handle-finding automaton states will be denoted as (P_{i,j})_{j∈{0,…,n}}, where P_{i,j} ∈ Π and P_{i,j} = STATE(U_{i,j}). As identified by Van Wyk and Schwerdfeger [6,18], the valid lookahead set for each state contains all lookahead symbols that can possibly follow at that position. That means the vertices created as a result of processing reduce actions cannot introduce any additional lookahead symbols; that is, ν(P_{i,0}) = … = ν(P_{i,n}). As a result, ν(P_{i,0}) = ν(P_i), where P_i ∈ Π and P_i = STATE(U_i). This allows us to construct the automaton Ṁ_{P_{i,0}} and perform the scanning just once at each position i in the parse. For this reason, we introduce the automata used at each position in the parse, (Ṁ_i)_{i∈{0,…,|ω|}}, where:
Ṁ_i = Ṁ_{P_{i,0}}
The scanning process remains the same as in the traditional parsing architecture; only the scanner automaton is modified. As a demonstration, let us use the following specification Ξ as an example:
S ::= A e d | B e f | c e g
A ::= c
B ::= c
r_c = x,  r_d = y,  r_e = x,  r_f = z,  r_g = x x,  r_$ = $
The handle-finding automaton, constructed using G in (2), is presented in Figure 5, and the scanner automaton is presented in Figure 6. Recognizing the input string ω = x x y results in the GSS and the scanner trace in Figure 7 and Figure 8, respectively.
The valid lookahead sets for each handle-finding automaton state are ν(0) = {c}, ν(1) = ν(4) = ν(7) = ν(10) = {$}, ν(2) = ν(5) = ν(8) = {e}, ν(3) = {d}, ν(6) = {f}, and ν(9) = {g}. They are used to determine the valid lookahead sets ν(P_{i,0}) at each position i based on the sets of states P_{i,0}, both of which are given in Table 1.
   Each row in the scanner trace represents the path traced by the automaton Ṁ_i = Ṁ_{P_{i,0}}. At position 0, the scanning is performed using the automaton Ṁ_0 = Ṁ_{{0}}. The automaton starts in state 0, and then the transition 0 —x→ 1 is performed.
m({0}, 1) = ν({0}) ∩ m(1) = {c} ∩ {c, e} = {c}    p({0}, 1) = ν({0}) ∩ p(1) = {c} ∩ {c, e} = {c}
Then the transition 1 —x→ 5 is performed.
m({0}, 5) = p({0}, 1) ∩ m(5) = {c} ∩ {g} = ∅    p({0}, 5) = p({0}, 1) ∩ p(5) = {c} ∩ {g} = ∅
At position 1, the scanning is performed using the automaton Ṁ_1 = Ṁ_{{8}}.
The automaton starts in state 0, and then the transition 0 —x→ 1 is performed.
m({8}, 1) = ν({8}) ∩ m(1) = {e} ∩ {c, e} = {e}    p({8}, 1) = ν({8}) ∩ p(1) = {e} ∩ {c, e} = {e}
Then the transition 1 —y→ 6 is performed.
m({8}, 6) = p({8}, 1) ∩ m(6) = {e} ∩ ∅ = ∅    p({8}, 6) = p({8}, 1) ∩ p(6) = {e} ∩ ∅ = ∅
At position 2, the scanning is performed using the automaton Ṁ_2 = Ṁ_{{3,6,9}}. The automaton starts in state 0, and then the transition 0 —y→ 2 is performed.
m({3, 6, 9}, 2) = ν({3, 6, 9}) ∩ m(2) = {d, f, g} ∩ {d} = {d}    p({3, 6, 9}, 2) = ν({3, 6, 9}) ∩ p(2) = {d, f, g} ∩ {d} = {d}
Then the transition 2 —$→ 6 is performed.
m({3, 6, 9}, 6) = p({3, 6, 9}, 2) ∩ m(6) = {d} ∩ ∅ = ∅    p({3, 6, 9}, 6) = p({3, 6, 9}, 2) ∩ p(6) = {d} ∩ ∅ = ∅
As already mentioned, context-aware scanning allows some of the specifications where the scanner automata are not pairwise disjoint or pairwise prefix-disjoint to be parsed deterministically. In this example, the languages of M_c and M_e are not disjoint, since they both match x, which results in |m(1)| > 1. However, there is no P ∈ Π and q̇ ∈ Q̇ such that m(P, q̇) = {c, e}. This is easy to verify: Π = {{0}, {8}, {8, 5}, {8, 5, 2}, {3, 6, 9}, {4}, {4, 1}, {7}, {7, 1}, {10}, {10, 1}}. If that were not the case, there would be a lexical conflict. The possibility of such conflicts occurring was already identified by Tomita [15,16]; only a minor modification to the RNGLR parsing algorithm is required. The idea was later termed Schrödinger's token by Aycock and Horspool [3]. In short, the algorithm performs the actions for all t ∈ m(P, q̇).
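The Schrödinger's-token treatment can be sketched as follows. This is an illustrative Python fragment; the action table, its entries, and the function name are our own assumptions, not part of the paper's algorithm listing.

```python
# Sketch of the Schrödinger's-token treatment: when the scanner reports
# several lookahead symbols for one lexeme, the parser simply performs
# the actions for all of them (invalid combinations contribute nothing).

def actions_for(action_table, states, match):
    """Collect the parser actions for every combination of a current
    handle-finding automaton state and a matched lookahead symbol."""
    todo = []
    for s in states:
        for t in match:                     # one entry per t ∈ m(P, q̇)
            todo.extend(action_table.get((s, t), []))
    return todo

# Hypothetical table fragment: c can be shifted in state 0, e cannot.
table = {(0, "c"): ["shift 8"], (0, "e"): []}
assert actions_for(table, {0}, {"c", "e"}) == ["shift 8"]
```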
The languages of the automata M_c and M_g are disjoint; however, they are not prefix-disjoint, since x is a prefix of x x. There is again no P ∈ Π and q̇ ∈ Q̇ such that m(P, q̇) = {c, g}. If that were not the case, there would be a lexical conflict. This type of lexical conflict is much harder to resolve, since the recognized lexemes are of different lengths. The scanner would end up in two positions after matching M_c | M_g. Subsequently, the scanner and the parser would need to split to handle both choices, as proposed by Begel and Graham [14]. The same lexical conflict occurs if the languages of the automata (M_t)_{t∈T} are not prefix-free. For example, the language {ε, x, x x, …}, corresponding to r = x*, exhibits this issue. This is usually resolved using the longest-match disambiguation rule; however, this solution is not general. For example, the following specification cannot be parsed:
S ::= a b
r_a = x*,  r_b = x,  r_$ = $
If we take a zoomed-out view of the GSS for Ξ in (2), focusing only on the shift actions, as presented in Figure 9, we notice that there is a single path between any two subclasses. This is because scanning has been completely deterministic up to this point. At each position i, at most one lexeme could be recognized, which corresponds to at most one lookahead symbol.
In general, scanning is non-deterministic. Multiple automata ( M t ) t T $ can match the same lexeme, and/or multiple different lexemes can be recognized. Next, we present our extension of the context-aware scanning architecture and RNGLR parsing, which can handle the non-determinism efficiently.

4. RNGSGLR (0,1]

The algorithm presented in this section will be a recognizer. The lexical conflict resulting from multiple automata (M_t)_{t∈T_$} matching the same lexeme can already be resolved efficiently using Schrödinger's token [3,15,16]. The problem we tackle is the lexical conflict resulting from the recognition of multiple lexemes of different lengths.
Our architecture builds on context-aware scanning; likewise, we have the automata (Ṁ_i)_{i∈{0,…,|ω|}} for each position in the parse. The difference is that we consider all possible lexemes matched by each Ṁ_i. Let q̇_{i/i} → … → q̇_{i/l} be a path traced by the automaton Ṁ_i. We use i/k to denote that the state q̇_{i/k} belongs to Ṁ_i and that it was entered at position k. The state entered at position i is the initial state; the state entered at position l is the last state: a state from which no more matches can be made, if the automaton reaches one, or l = |ω| + 1 otherwise. Some of these states are final; the set of positions at which the automaton enters a final state is:
φ(q̇_{i/i} → … → q̇_{i/l}) = { k | k ∈ {i + 1, …, l} ∧ q̇_{i/k} ∈ Ḟ }
We consider all lexemes υ(i, k), for all k ∈ φ(q̇_{i/i} → … → q̇_{i/l}), where each is a substring of ω that starts at i and ends at k, such that q̇_{i/i} —υ(i,k)→ q̇_{i/k} ∈ Ḟ. Note that the start state is not considered, as there is no need to recognize the lexeme υ(i, i) = ε; it is trivially present at any position.
As a result, the automaton Ṁ_i that started scanning at position i must scan alongside the automata (Ṁ_h)_{h∈{0,…,i−1}} that started scanning at previous positions. For this reason, the scanner automaton at position i in the parse, M̊_i^{q̊_i} = (Q̊_i, Ω_$, δ̊_i, q̊_i ∈ Q̊_i, F̊_i), is defined as:
M̊_i^{q̊_i} = |_{k∈{0,…,i}} Ṁ_k^{q̇_{k/i}}    q̊_i = (q̇_{k/i})_{k∈{0,…,i}}
It is a runtime-simulated union of the automata (Ṁ_k^{q̇_{k/i}})_{k∈{0,…,i}}. These started scanning at position k and are now in state q̇_{k/i}, having already traced a path q̇_{k/k} —ψ(k,i)→ q̇_{k/i}. The string ψ(k, i) is a prefix of the lexeme that the automaton is expecting to recognize in the future. Each automaton Ṁ_k^{q̇_{k/i}} is thus scanning for the not-yet-recognized suffix. This description is also trivially valid for the automaton Ṁ_i^{q̇_{i/i}} added at position i, which is in the start state. The scanner automaton state q̊_i is composed of the states of the automata in the union at position i.
The scanner automaton at position i can also be defined recursively:
M̊_0^{q̊_0} = Ṁ_0    M̊_i^{q̊_i} = M̊_{i−1}^{q̊′_i} | Ṁ_i    q̊′_i = (q̇_{h/i})_{h∈{0,…,i−1}}    q̊_i = q̊′_i q̇_{i/i}
The scanner automaton at position 0 is initially just the automaton Ṁ_0. At each position i ≥ 1, the scanner automaton M̊_{i−1}^{q̊_{i−1}} performs the transition q̊_{i−1} —a_i→ q̊′_i on the character a_i ∈ Ω_$, becoming the automaton M̊_{i−1}^{q̊′_i}. The scanner automaton M̊_i^{q̊_i} is constructed by adding the automaton Ṁ_i to the runtime-simulated union. The scanner automaton state q̊_i contains the states of q̊′_i and, additionally, the start state q̇_{i/i} of Ṁ_i.
As the input string ω$ is traversed, the scanner automaton traces a path q̊_0 → … → q̊_{|ω|+1}, where for all i ∈ {1, …, |ω| + 1}, q̊_i = δ̊_{i−1}(q̊_{i−1}, a_i). The set of positions where the scanner automaton enters a final state is:
φ(q̊_0 → … → q̊_{|ω|+1}) = ⋃_{i∈{0,…,|ω|+1}} φ(q̇_{i/i} → … → q̇_{i/l})
Therefore, all possible lexemes υ(i, k), for all i, k ∈ φ(q̊_0 → … → q̊_{|ω|+1}), i < k, are considered. The lexemes can overlap arbitrarily. If they do not, our architecture degenerates to context-aware scanning.
The description given so far may seem infeasible, as, defined this way, the number of states grows with the position in the parse. However, basic algebraic properties of the union, ∅ | M = M | ∅ = M and M | M = M, can be used as optimizations [2].
  • Once no more matches can be made by an automaton, Ṁ_k^{q̇_{k/i}} = ∅, and the automaton can be removed from the union. Therefore, at each position i, the automata where p(P_{k,0}, q̇_{k/i}) = ∅ are filtered out.
  • If no matches were found at position i, it immediately means P_{i,0} = ∅, as no shift actions can be performed by the parser. As a result, Ṁ_i = ∅; therefore, M̊_i^{q̊_i} = M̊_{i−1}^{q̊′_i} | ∅. This allows the scanner to move uninterruptedly to the next match.
  • It is possible that automata are duplicates, k ≠ l ∧ Ṁ_k^{q̇_{k/i}} = Ṁ_l^{q̇_{l/i}}. This happens if they are in the same state, q̇_{k/i} = q̇_{l/i}, and if they can match the same set of symbols, p(P_{k,0}, q̇_{k/i}) = p(P_{l,0}, q̇_{l/i}). Such automata are merged, resulting in an automaton Ṁ_{k,l}^{q̇_{k,l/i}}. It is crucial that both starting positions are preserved. An example of a specification that exhibits such duplication is:
    S ::= a S | ε
    r_a = x*,  r_$ = $
These optimizations are not strictly required; even the most naive implementation always results in a correct parse, albeit one that invariably exhibits the worst-case behavior.
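To make the runtime-simulated union concrete, here is a minimal Python sketch; it is our own illustration, not the paper's implementation. A single shared token DFA is stepped once per character; live entries are keyed by DFA state and carry their start positions, so dead entries are dropped and duplicates are merged with both starting positions preserved. Valid-lookahead filtering is omitted for brevity. The token definitions b = x, c = y, d = z, e = y z follow a later example in this section.

```python
# Runtime-simulated union of scanner automata over one shared token DFA.

def union_scan(delta, m, text):
    live = {0: {0}}                 # DFA state -> start positions; M0 added at 0
    matches = set()
    for i, ch in enumerate(text, start=1):
        nxt = {}
        for state, starts in live.items():
            s2 = delta.get((state, ch))
            if s2 is not None:       # dead automata are dropped from the union
                nxt.setdefault(s2, set()).update(starts)  # duplicates merged
        live = nxt
        for state, starts in live.items():
            for t in m.get(state, ()):
                matches |= {(k, t, i) for k in starts}    # lexeme υ(k, i)
        live.setdefault(0, set()).add(i)  # add the automaton started at i
    return matches

# Combined token DFA for b = "x", c = "y", d = "z", e = "yz":
delta = {(0, "x"): 1, (0, "y"): 2, (0, "z"): 3, (2, "z"): 4}
m = {1: {"b"}, 2: {"c"}, 3: {"d"}, 4: {"e"}}

# On "xyz" the lexemes overlap: c and e both start at position 1.
assert union_scan(delta, m, "xyz") == {(0, "b", 1), (1, "c", 2),
                                       (1, "e", 3), (2, "d", 3)}
```

Each returned triple (k, t, i) pairs a lookahead symbol with its starting position, which is exactly the information the parser needs to shift from the correct subclass.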
The functions m : Q̊_i → P({0, …, |ω|} × T_$) and p : Q̊_i → P({0, …, |ω|} × T_$) are defined as:
m(q̊_i) = ⋃_{h∈{0,…,i−1}} { (h, t) | t ∈ m(P_{h,0}, q̇_{h/i}) }    p(q̊_i) = ⋃_{h∈{0,…,i−1}} { (h, t) | t ∈ p(P_{h,0}, q̇_{h/i}) }
The position h is added because the recognized lexeme υ(h, i) starts at position h. That means the shift actions need to be performed from U_h.
The scanning process remains very similar to the (traditional) context-aware scanning process. At each position i in the parse, the scanner recognizes a string ψ(i, j) ∈ Ω_$^+ in the rest of the input string ω′ ∈ Ω_$^+ using the automaton M̊_i^{q̊_i}, such that ω′ = ψ(i, j) ω″ and q̊_i —ψ(i,j)→ q̊_j ∈ F̊_i. Note that the scanner stops only when a match is found at position j. This is the same as in (traditional) context-aware scanning, except that multiple lexemes υ(k, j) are recognized, one for each automaton (Ṁ_k)_{k∈{0,…,i}} that enters a final state. Moreover, ψ(i, j) might not even be a lexeme. If the only lexeme that is recognized is υ(i, j), then our architecture degenerates to context-aware scanning for that portion of the input string. Each lexeme υ(k, j) is recognized in |φ(q̊_k → … → q̊_j)| steps. That means that the scanning for each lexeme is suspended and resumed for each match between the positions k and j. The recognizer is passed m(q̊_j) and p(q̊_j). The process is then repeated for j and ω″ using M̊_j^{q̊_j} = M̊_i^{q̊′_j} | Ṁ_j.
Thus far, we have only covered our extension of context-aware scanning, so let us motivate the needed modifications to RNGLR parsing with an example:
S ::= b c d | A e
A ::= b
r_b = x,  r_c = y,  r_d = z,  r_e = y z,  r_$ = $
The handle-finding automaton, constructed using G in (3), is presented in Figure 10, and the scanner automaton is presented in Figure 11. Notice that, although all (M_t)_{t∈T} are prefix-free, M_c and M_e are not prefix-disjoint. Recognizing the input string ω = x y z results in the GSS and the scanner trace in Figure 12 and Figure 13, respectively.
The valid lookahead sets for each handle-finding automaton state are ν(0) = {b}, ν(1) = ν(3) = ν(6) = {$}, ν(2) = {e}, ν(4) = {c, e}, and ν(5) = {d}. They are used to determine the valid lookahead sets ν(P_{i,0}) at each position i based on the sets of states P_{i,0}, both of which are given in Table 2.
We start with M̊_0^0 = Ṁ_0^0 and scan until the first match, m((2)) = {(0, b)}, tracing a path (0) —x→ (2). The scanning continues with M̊_1^{(2,0)} = Ṁ_0^2 | Ṁ_1^0 to the next match, m((5,1)) = {(1, c)}, tracing a path (2, 0) —y→ (5, 1). Now a reduce action can be performed; however, notice that the item (A ::= b •, e) has e as the lookahead symbol, which will be recognized in two steps. At this point, we have already covered the first step; however, we still need to cover the second for e to be matched. We could scan ahead; however, this would mean the scanner and the parser would need to split. Instead, we choose to work with the information we have at this point. We can use p((5,1)) = {(1, e)}, which tells us that we can still expect to match e, and make the reduction based on this information. At worst, this means a superfluous reduction will be made, which will be ignored in the future. Since 1/2 of the lexeme has been recognized, we call this approach the fractional lookahead. The automaton Ṁ_0^5 enters a dead state, so it can be removed. We continue scanning with M̊_2^{(1,0)} = Ṁ_1^1 | Ṁ_2^0 to the next match, m((3,3)) = {(1, e), (2, d)}, tracing a path (1, 0) —z→ (3, 3). The first thing to note is that |m((3,3))| > 1. This lexical conflict occurs because L(Ṁ_1^1) ∩ L(Ṁ_2^0) = ∅ does not hold. As already mentioned, this type of lexical conflict can be resolved by performing the actions for both e and d. In this case, a shift action is performed for both symbols. The second thing to note is that a shift action using e needs to be performed from U_1. In traditional RNGLR parsing, a shift action can only be performed from the previous subclass, which is U_2; therefore, the parsing algorithm needs to be modified.
We continue scanning with M̊_3^{(3,3,0)} = Ṁ_1^3 | Ṁ_2^3 | Ṁ_3^0 to the next match, m((5,5,4)) = {(3, $)}, tracing a path (3, 3, 0) —$→ (5, 5, 4). The reduce action is performed for S, and parsing concludes successfully.
The scanner trace is a visual presentation of the path q̊_0 → … → q̊_{|ω|+1}. The path is, in this case, concretely:
(0) —x→ (2, 0) —y→ (5, 1, 0) —z→ (3, 3, 0) —$→ (5, 5, 4)
The scanner trace is not a data structure; at each point in the parse, only a single column is present in memory. At each position i ∈ {0, 1, 2, 3}, the column represents the state q̊_i. At each position j ∈ {1, 2, 3, 4}, the column without the added start state represents the state q̊′_j. The rows represent the paths q̇_{i/i} → … → q̇_{i/l} traced by each (Ṁ_i)_{i∈{0,1,2,3}}. Note that Ṁ_1 traced the path 0 → 1 → 3 → 5, where φ(0 → 1 → 3 → 5) = {2, 3}. That is, the automaton entered a final state twice, and two lexemes were recognized as a result. This occurred because the lookahead symbols ν(P_{1,0}) = {c, e} correspond to M_c and M_e, the languages of which are not prefix-disjoint, since y is a prefix of y z.
If we take a zoomed-out view of the GSS for Ξ in (3), focusing only on the shift actions, as presented in Figure 14, we can see that there can be more than one path between two subclasses. The substring y z can be interpreted in two ways: either as e, or as c followed by d. In general, the shift actions alone now form a directed acyclic graph. The graph has a single path between any two subclasses exactly when our architecture degenerates to context-aware scanning.
The required modifications to the RNGLR parsing algorithm are:
  • When more than one lookahead symbol is matched, |m(q̊_j)| > 1, the actions are performed for all of them [3,15,16].
  • At position i, it must be possible to perform a shift action from any (U_k)_{k∈{0,…,i}}.
  • The reduce actions are selected based on a prediction p(q̊_j) if the entire lexeme has not been recognized at position j, which we call fractional lookahead.
The algorithm works as follows. At each position i in the parse, a string ψ(i, j) ∈ Ω_$^+ is recognized using M̊_i^{q̊_i}. Instead of a single lookahead symbol t_j, there are now two sets of lookahead symbols, B_i and Λ_i. The set B_i = m(q̊_j) contains the lookahead symbols and the accompanying starting positions for shift actions. The set Λ_i = { t | (k, t) ∈ p(q̊_j) ∧ k = i } contains the lookahead symbols for the reduce actions. The lexemes for reduce actions can only start at position i; thus, the others are removed. Instead of just the lookahead symbols that were matched, the set also contains those that can still be matched by continuing from q̊_j. This is essentially how the fractional lookahead is implemented. The vertices w ∈ U_i such that ∃t ∈ Λ_i, REDUCE(A, |γ|) ∈ T_A(STATE(w), t) are identified. The reduce actions are processed exactly as in RNGLR parsing. Then, for each (k, t) ∈ B_i, the vertices w ∈ U_k such that SHIFT ∈ T_A(STATE(w), t) are identified. The shift actions are processed exactly as in RNGLR parsing. Then the process is repeated for ω″ and U_j and M̊_j^{q̊_j} = M̊_i^{q̊′_j} | Ṁ_j. If there exists w ∈ U_{|ω|} such that ∃t ∈ Λ_i, ACCEPT ∈ T_A(STATE(w), t), the parsing is completed successfully.
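The derivation of the two lookahead sets can be sketched as follows. This is a Python illustration; the names B and Λ follow the text, while the function name and data layout are our own assumptions.

```python
# Sketch of splitting the scanner's report at position i into the two
# lookahead sets: B drives shift actions (symbol + start position),
# Lam drives reduce actions via the fractional lookahead.

def split_lookahead(matched, predicted, i):
    """matched = m(q̊_j) and predicted = p(q̊_j), both sets of
    (start, symbol) pairs; only predictions starting at i feed reduces."""
    B = set(matched)
    Lam = {t for (k, t) in predicted if k == i}
    return B, Lam

# Position 1 of the worked example: c was matched, e is still predicted.
B, Lam = split_lookahead({(1, "c")}, {(1, "e")}, 1)
assert B == {(1, "c")} and Lam == {"e"}
```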

4.1. Support for Nullable RE

To support all possible Ξ, the case when ε ∈ L(r_t), where t ∈ T, needs to be considered as well. That means the symbols t ∈ T can be nullable as well, t ⇒ ε. The solution is to use FIRST(t) = {ε, t} if ε ∈ L(r_t) when computing the FIRST sets. In the case of shift actions, the lookahead traditionally always coincides with the symbol to be shifted; thus, no special consideration is needed [1]. When the terminal symbols can be nullable, this is no longer the case. For each state s ∈ S, the shift items (A ::= γ • t η, l) ∈ s are converted into {(q, t) ↦ SHIFT} ∪ {(q, l′) ↦ ε-SHIFT(t) | t ⇒ ε ∧ l′ ∈ FIRST(t η l)}. The ε-shift actions need to be differentiated because they are performed within U_i. The definition of valid lookahead sets remains the same. That means ε-shift actions must be considered as well.
The scanning process remains exactly the same. There is no need to recognize the lexeme υ(i, i) = ε, as it is trivially present at any position i ∈ {0, …, |ω|}. The parsing process needs to be modified as follows: the reduce actions are processed in the same way as before. Additionally, at each position i in the parse, the vertices w ∈ U_i such that ∃l ∈ Λ_i, ε-SHIFT(t) ∈ T_A(STATE(w), l) are identified. In the case of ε-shift actions, for the vertex w ∈ U_i, a vertex u ∈ U_i is either found or created, and the vertices are connected with an edge u → w. The ε-shift actions are processed in exactly the same way as ε-reduce actions. Since ε-shift actions are performed within U_i, new vertices can be created that need to be processed. For this reason, reduce and ε-shift actions are processed in a common loop until there are no possible actions left. The ε-shift actions can connect w ∈ U_i to an existing vertex w′ ∈ U_i, which was created as a result of a shift action starting from U_{h∈{0,…,i−1}}. Therefore, for all such w′, the reduce actions where |γ| > 0 must be processed before processing ε-shift actions. Otherwise, the reduction step will be applied twice: once using the short-circuiting reduce action from w or one of its descendants, and once more using a reduce action from w′. The shift actions are processed in the same way as before.
To demonstrate the required modifications, consider the following specification:
S ::= d A | c
A ::= d e B e
B ::= ε
r_c = x*,  r_d = x,  r_e = x | ε,  r_$ = $
The handle-finding automaton, constructed using G in (4), is presented in Figure 15, and the scanner automaton is presented in Figure 16. The handle-finding automaton results in the T_A and T_G presented in Table 3 and Table 4, respectively. The right-nullable productions are S′ ::= S, S ::= c, and A ::= d e B e. In contrast to RNGLR parsing [17], the nullable part can contain both terminals and non-terminals. The T_G remains unmodified; however, it is still presented for completeness. Recognizing the input string ω = x x x results in the GSS and the scanner trace in Figure 17 and Figure 18, respectively.
The valid lookahead sets for each handle-finding automaton state are ν(0) = {c, d, $}, ν(3) = {d}, ν(5) = ν(6) = ν(7) = {e, $}, and ν(1) = ν(2) = ν(4) = ν(8) = {$}. The valid lookahead set for state 5 is a result of FIRST(e B e $) = {e, $}, since e ⇒ ε and B ⇒ ε. They are used to determine the valid lookahead sets ν(P_{i,0}) at each position i based on the sets of states P_{i,0}, which, in turn, are used to determine the sets Λ_i and B_i. These are given in Table 5.
In U_3, a short-circuiting reduce action REDUCE(A, 2) is applied from state 6, and a reduce action REDUCE(A, 4) is applied from state 8. An ε-shift action is applied from state 7 to state 8, adding an additional descendant. The reduce action from state 8 must be applied before the edge 7 → 8 is created; otherwise, both REDUCE(A, 2) and REDUCE(A, 4) are applied down the path 3 → 5 → 6 → 7 → 8. This formation occurs in the GSS because there are two possible interpretations for e B e: either the first e corresponds to x and the second to ε, or vice versa.
If we take a zoomed-out view of the GSS for Ξ in (4), focusing only on the shift actions, as presented in Figure 19, then, in addition to there being multiple paths between two subclasses, there are loops due to ε-shift actions.

4.2. Discussion

The solution is to perform all possible traversals of the scanner automata in addition to all possible traversals of the handle-finding automaton. In our architecture, the processes remain conceptually synchronized: all reduce actions are performed before shifting the next input symbols, and all the processes move to the next position at the same time. On top of that, the scanner automata remain synchronized as well, as they all scan the next character at the same time.
The fractional lookahead arises as a direct consequence of the requirement to preserve the synchronization. At position i in the parse, when scanning using M̊_i^{q̊_i}, it is likely that multiple matches can be found by Ṁ_i. Moreover, it is entirely possible that the next match will instead be found by any of (Ṁ_h^{q̇_{h/i}})_{h∈{0,…,i−1}} and not Ṁ_i. We are aware of the following alternative solutions:
  • Split the scanner and the parser; in that way, one part can remain at the current position and the other can continue with the scanning. This was proposed by Begel and Graham [14].
  • Continue scanning, and when the matches are found, backtrack to the current position. This was conjectured by Keynes [8]. It would be inefficient due to the repeated scanning and backtracking.
When either is used, the synchronization is lost. As already mentioned, in our case, no additional scanning is performed, which preserves the synchronization; instead, the reduce actions are selected based on the predictions made from the recognized prefix ψ(i, j) of the lexemes. In other words, only a fraction of the lexemes is recognized at this point. In the worst case, |ψ(i, j)| = 1, since at least one character must be scanned before a match is found. In the best case, ψ(i, j) is a lexeme. That happens when our architecture degenerates to context-aware scanning, and, in that case, the length of the lookahead is 1. In general, the length of the lookahead is a fraction |ψ(i, j)| / |υ(i, k)|, where k ∈ φ(q̇_{i/i} → … → q̇_{i/l}), and is, therefore, on the interval (0, 1]. Note that the fraction varies based on which lexeme is considered; however, the value is always on the said interval.
Our architecture is flexible enough that the automaton Ṁ_{T_$} from the traditional parsing architecture could be used directly. The problem is that, in the case of lexical conflicts, matches would be found at position j for which no shift actions could be performed using the corresponding lookahead symbols. As a result, the efficiency of the algorithm would be worse for lexically ambiguous specifications. Additionally, superfluous matches would result in a shorter ψ(i, j) and, thus, less lookahead.
A similar issue arises when using weaker handle-finding automaton construction algorithms, such as LALR or SLR [8]. In those cases, the valid lookahead sets (ν(s))_{s∈S} can contain additional symbols that cannot result in a sentential form. That means an invalid match can be found in certain cases. An example of a specification that exhibits this issue when used in combination with LALR is:
S ::= A b
A ::= c A d | e
r_b = y*,  r_c = x,  r_d = y,  r_e = z,  r_$ = $
If an invalid match is found at position i, then U_i = ∅. In RNGLR parsing, this is an indication that the parsing concluded unsuccessfully; however, in the case of our architecture, M̊_h^{q̊_i} can still find a match. Therefore, parsing concludes unsuccessfully if and only if M̊_h^{q̊_i} = ∅.
Lexical conflicts can occur at each position i for the same reasons as in context-aware scanning, and, additionally, if the languages of the automata (Ṁ_k^{q̇_{k/i}})_{k∈{0,…,i}} are not pairwise disjoint and pairwise prefix-disjoint.
The worst-case time complexity of our architecture is O(|S| · |ω|^{|γ̂|+1}), where |γ̂| is the length of the longest right-hand side of a production rule in the grammar. The reasons for this are as follows:
  • There can be at most | S | vertices in any U i , since each vertex is labeled with a distinct handle-finding automaton state.
  • Each vertex can have at most i successors. This is due to the fact that, in the worst case, the handle-finding automaton can start searching for γ of some production rule A ::= γ η, where η ⇒* ε, in each (U_h)_{h∈{0,…,i−1}}, and then perform i reduction steps to the same vertex in U_i.
  • The number of steps required to find the paths v —γ→ w of length |γ|, where w ∈ U_i and v ∈ U_h, is limited by the number of ways the intervening input symbols can be divided among |γ| symbols, C(i − h − 1, |γ| − 1). Summing over all possible final positions h ∈ {0, …, i − 1} results in O(i^{|γ|}) steps.
  • The shift actions at position i can be performed from any subclass (U_k)_{k∈{0,…,i}}. Each U_k can contain at most |S| vertices, and i actions can be performed for each vertex. Therefore, the shift actions are processed in O(|S| · i) steps.
  • The automaton M̊_i^{q̊_i} can be composed of at most i automata (Ṁ_k)_{k∈{0,…,i}}. In the worst case, a match is found after scanning a single character a ∈ Ω_$, |ψ(i, j)| = 1. Thus, i steps are needed to perform the transitions on a for all i of them. The worst-case time complexity of each Ṁ_i is O(log|T_$| · |ω|). Summing over all positions i ∈ {0, …, |ω| + 1} gives the worst-case time complexity of the scanner alone, O(log|T_$| · |ω|²).
  • Each reduce action at position i, in addition to the steps required to find the paths, requires at most i steps, as the paths can end in at most i vertices. The processing is, therefore, completely dominated by the search for the paths. The worst case is when |γ| = |γ̂|.
  • The reduce actions at position i are processed in O(|S| · i^{|γ̂|}) steps. For each newly created vertex w ∈ U_i, in the worst case, O(i^{|γ̂|}) steps are required. For each edge w′ → w, where w′ ∈ U_{h∈{0,…,i−1}}, connected to an existing vertex w, in the worst case, O(i^{|γ̂|−1}) steps are required, as there is only one way to reach w′ from w. In the worst case, i such connections can be made; therefore, O(|S| · i^{|γ̂|}) steps are required in total.
  • Summing over all positions i ∈ {0, …, |ω|} gives O(|S| · |ω|^{|γ̂|+1}).
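The binomial bound on the number of reduction paths can be sanity-checked numerically. This is an illustrative computation only; the function name is ours.

```python
import math

# Number of ways the i - h - 1 intervening input symbols can be divided
# among |γ| grammar symbols: C(i - h - 1, |γ| - 1). Summed over h, this
# stays within the O(i^{|γ|}) bound quoted in the analysis.

def path_bound(i, h, gamma_len):
    return math.comb(i - h - 1, gamma_len - 1)

assert path_bound(10, 3, 3) == 15                      # C(6, 2)
assert sum(path_bound(10, h, 2) for h in range(10)) <= 10 ** 2
```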
However, usually ζ ≪ |ω|, where ζ = |φ(q̊_0 → … → q̊_{|ω|+1})|. The recognizer performs O(|S| · ζ^{|γ̂|+1}) steps and the scanner O(log|T_$| · ζ · |ω|) steps. Depending on the grammar, large segments of the input string can be processed by the scanner alone, and the recognizer is only involved when a match is found. The worst-case space complexity is O(|S| · |ω|²), due to the size of the GSS. At each position i, there can be at most |S| vertices, and each vertex can have i successors (edges). Summing over all positions i ∈ {0, …, |ω|} gives O(|S| · |ω|²) [16,17,35]. The analysis is based on the analysis of GLR by Kipps; refer to [35] for additional rationale.

4.3. Algorithm

The scanning algorithm is very similar to the (traditional) context-aware scanning algorithm, except that at each position i in the parse, M ˚ i q ˚ i is used as the scanner automaton, and after it traces a path q ˚ i → q ˚ j , both m ( q ˚ j ) and p ( q ˚ j ) are passed to the recognizer. We implement M ˚ i q ˚ i as a multi-map using q ˙ k / i and p ( P k , 0 , q ˙ k / i ) as keys:
⋃ k ∈ { 0 , … , i } { M ˙ k q ˙ k / i ↦ k }
That way, the automata that can no longer find a match can be filtered out and the duplicates can be merged while preserving the start positions.
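A minimal sketch of this bookkeeping, under assumed names (`advance`, `step`, and integer stand-ins for automaton states and start positions; none of these come from the paper's implementation):

```python
# Sketch: the composite scanner automaton as a multi-map. Each live
# component automaton is keyed by its current state, so duplicates merge
# while the earliest start position is preserved, and components that
# can no longer match are filtered out.
def advance(live: dict, ch: str, step) -> dict:
    """live maps key -> start position k; step(key, ch) -> new key or None."""
    nxt = {}
    for key, k in live.items():
        new_key = step(key, ch)
        if new_key is None:          # dead automaton: filter it out
            continue
        # merge duplicates, keeping the smallest (earliest) start position
        nxt[new_key] = min(k, nxt.get(new_key, k))
    return nxt

# toy step function over states 0..2 recognizing runs of 'a'
step = lambda state, ch: state + 1 if ch == 'a' and state < 2 else None
live = {0: 0, 1: 0}              # two components that happen to share states
print(advance(live, 'a', step))  # -> {1: 0, 2: 0}
```

A component in state 2 dies on the next `'a'`, so `advance({2: 0}, 'a', step)` returns an empty map, which corresponds to the composite automaton becoming empty.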
The parsing algorithm will be presented in imperative form. For this reason, we use mutable references to sets for the GSS Γ = ( V , E ) and the subclasses ( U i ) i ∈ { 0 , … , | ω | } . The presentation of the algorithm mainly follows that by Kipps [16,35] and Scott [17]. The actor is split into three functions: actor, reduce_actor, and shift_actor. Each one processes the actions directly by calling the corresponding processing functions reduce and shift. This allows us to specify precisely which additional actions need to be considered when a new vertex is added. In the parse function, we must first process the reduce actions with | γ | > 0 for vertices that are in  U i at the start. Only then can ε -reduce and ε -shift actions be processed.
The multi-map B i : T $ → P ( V ) maps the lookahead symbols t to vertices w , such that SHIFT ∈ T A ( STATE ( w ) , t ) . There is one for each U i , i ∈ { 0 , … , | ω | } . They allow quick identification of the vertices in each subclass for which a shift action needs to be performed. These data structures are not strictly needed, since the vertices can also be identified using U i alone; however, that requires a search. The worst-case space complexity of these data structures is O ( | T $ | · | S | · | ω | ) , as there can be at most | T $ | · | S | entries in each, and there is one at each position in the parse.
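The per-position shift maps can be sketched with nested dictionaries (hypothetical vertex ids; `B[i][t]` plays the role of B i ( t ) ):

```python
from collections import defaultdict

# Sketch: one shift map per position, B_i : lookahead symbol -> vertices
# awaiting a shift on that symbol. Registration happens in actor(); the
# shift phase then reads the pending vertices without searching U_i.
B = defaultdict(lambda: defaultdict(set))   # B[i][t] = set of vertex ids

def register_shift(i: int, t: str, vertex: int) -> None:
    B[i][t].add(vertex)

register_shift(0, "d", 1)
register_shift(0, "d", 3)
register_shift(0, "e", 1)
print(sorted(B[0]["d"]))   # -> [1, 3]
```

Using sets makes duplicate registrations harmless, matching the at-most-|T $ |·|S| entries per position noted above.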
The algorithm listing is given in Algorithm 1. The input automata ( M ˙ i ) i ∈ { 0 , … , | ω | } are just placeholders, as they are constructed at runtime. The real inputs are the automaton M ˙ T $ and a precomputed map { s ↦ ν ( s ) ∣ s ∈ S } .
Algorithm 1. RNGSGLR(0,1] recognizer for character-level CFL.
input T A action table, T G goto table, ( M ˙ i ) i ∈ { 0 , … , | ω | } automata at each position in the parse,
ω character input string
output accept or reject the character input string
 
let rngsglr1 T A   T G   ( M ˙ i ) i ∈ { 0 , … , | ω | }   ω =
let rec actor w =
  if ∃ l ∈ Λ i , ε - SHIFT t ∈ T A ( STATE ( w ) , l )
  then shift i  w t
  if ∃ l ∈ Λ i , SHIFT ∈ T A ( STATE ( w ) , l )
  then B i : = B i ∪ { l ↦ w }
  for each l ∈ Λ i , REDUCE ( A , 0 ) ∈ T A ( STATE ( w ) , l ) do
   reduce w A 0
  done
and reduce_actor w =
  for each l ∈ Λ i , REDUCE ( A , | γ | ) ∈ T A ( STATE ( w ) , l ) ∧ | γ | ≠ 0 do
   for each successor node w ′ of w do
    reduce w ′  A  | γ |
   done
  done
and shift_actor () =
  for each match ( k , l ) ∈ 𝓑 i do
   for each w ∈ B k ( l ) , SHIFT ∈ T A ( STATE ( w ) , l ) do
    shift j  w l
   done
  done
and reduce w  A  | γ | =
  for each vertex v that can be reached from  w along a path of length | γ | − 1 , or of length 0 if | γ | = 0 do
   let s = T G ( STATE ( v ) , A ) in
   if ∄ u ∈ U i , STATE ( u ) = s then
    add vertex u to  V labeled s and edge ( u , v ) to  E
     U i : = U i ∪ { u }
    actor u
    if | γ | ≠ 0
    then reduce_actor u
   else if ( u , v ) ∉ E then
    add edge ( u , v ) to  E
    if | γ | ≠ 0 then
     for each t ∈ Λ i , REDUCE ( B , | θ | ) ∈ T A ( STATE ( u ) , t ) ∧ | θ | ≠ 0 do
      reduce v B  | θ |
     done
and shift l  w t =
  let s = T G ( STATE ( w ) , t ) in
  if ∄ u ∈ U l , STATE ( u ) = s then
   add vertex u to V labeled s and edge ( u , w ) to  E
    U l : = U l ∪ { u }
   if i = l
   then actor u
  else if ( u , w ) ∉ E
  then add edge ( u , w ) to  E
in
let parse () =
  scan at position i in  ω $ using M ˚ i q ˚ i , tracing q ˚ i → q ˚ j
  get matches 𝓑 i = m ( q ˚ j )
  get predictions Λ i = { t ∣ ( k , t ) ∈ p ( q ˚ j ) ∧ k = i }
 
   B i : = ∅ , U j : = ∅
  for each vertex w ∈ U i do
   reduce_actor w
  done
  for each vertex w ∈ U i do
   actor w
  done
  shift_actor ()
 
  construct M ˚ j q ˚ j = M ˚ i q ˚ j | M ˙ j
   i : = j
in
 add the start vertex v to  V labeled s
i : = 0 , U 0 : = { v }
 
 construct M ˚ 0 q ˚ 0 = M ˙ 0
while i ≤ | ω | ∧ M ˚ i q ˚ i ≠ ∅ do
  parse ()
done
if ∃ w ∈ U | ω | , ACCEPT ∈ T A ( STATE ( w ) , $ )
then accept
else reject
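The GSS that the listing manipulates can be sketched as a successor map; `paths` enumerates the vertices reachable along paths of a fixed length, which is the primitive the reduce function relies on (a simplified sketch: vertices are (position, state) pairs, edges are unlabeled, and all names are ours):

```python
# Sketch of the GSS used by the recognizer: vertices are (position, state)
# pairs, and edges point from newer vertices back toward the start vertex.
class GSS:
    def __init__(self):
        self.succ = {}                     # vertex -> set of successor vertices

    def add_vertex(self, v):
        self.succ.setdefault(v, set())

    def add_edge(self, u, v):              # u was created on top of v
        self.add_vertex(u); self.add_vertex(v)
        self.succ[u].add(v)

    def paths(self, w, length):
        """All vertices reachable from w along paths of exactly `length` edges."""
        frontier = {w}
        for _ in range(length):
            frontier = {v for u in frontier for v in self.succ[u]}
        return frontier

g = GSS()
g.add_edge((1, 2), (0, 0))
g.add_edge((1, 3), (0, 0))
g.add_edge((2, 4), (1, 2))
g.add_edge((2, 4), (1, 3))                 # two stacks share a vertex
print(sorted(g.paths((2, 4), 2)))          # -> [(0, 0)]
```

Because the shared vertex (2, 4) has two successors, a reduce of length 2 from it finds both paths, which is exactly why reduce enumerates path endpoints rather than popping a single stack.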

5. Construction of the Parse Forest

In this section, we extend the recognizer to a parser. To construct the parse forest, the following considerations are needed:
  • The missing nullable parts must be added to the nodes produced by short-circuiting reduce actions [17].
  • Multiple leaf nodes can be created at each point in the parse, since multiple lexemes can be recognized and each can correspond to multiple lookahead symbols.
We use an approach reminiscent of that of Scott [17], which, in turn, is based on that of Rekers [36].
The parsing algorithm finds all possible right-most derivations; therefore, multiple derivation trees can be constructed for a given sentence if the grammar is ambiguous. The construction of all individual derivation trees requires an exponential amount of work, and in certain cases might not terminate, even though the worst-case time complexity of the algorithm is polynomial [15,16]. The solution is to combine the common parts; that is, the subtrees are shared if multiple trees have a common subtree. If multiple nodes that correspond to the same non-terminal symbol have subtrees that derive the same substring, then these nodes are merged into a single packed node. The resultant data structure is called a shared packed parse forest (SPPF) [16]. The SPPF is a variant of a directed graph, where each node can have multiple distinct sets of children (in the case of packed nodes). Each set of children is represented as a dot in the diagrams. Each node is labeled by a triple ( X , k , l ) , such that X ∈ ( N ∪ T ) and X ⇒ * ψ ( k , l ) , where ψ ( k , l ) is a substring of the input string ω , starting at  k and ending at  l . For example, the SPPF for  Ξ in (3) and ω = x y z is presented in Figure 20.
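The sharing and packing described above can be sketched as follows (our names; nodes are keyed by their ( X , k , l ) label, and each label owns a list of packed families of children):

```python
# Sketch: an SPPF node labeled (X, k, l) with a list of "packed" families,
# each family being one alternative sequence of children. Sharing falls
# out of keying nodes by their label.
class SPPF:
    def __init__(self):
        self.nodes = {}                      # (X, k, l) -> list of families

    def node(self, label):
        return self.nodes.setdefault(label, [])

    def add_family(self, label, children):
        fams = self.node(label)
        if children not in fams:             # packed: keep distinct families only
            fams.append(children)

f = SPPF()
f.add_family(("S", 0, 3), [("A", 0, 3)])               # one derivation of S
f.add_family(("S", 0, 3), [("B", 0, 1), ("C", 1, 3)])  # an ambiguous alternative
f.add_family(("S", 0, 3), [("A", 0, 3)])               # duplicate is not re-added
print(len(f.nodes[("S", 0, 3)]))                       # -> 2
```

The two families correspond to the two "dots" a diagram would draw under the single shared node ( S , 0 , 3 ) .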
The presence of nullable productions in the grammar may result in certain sentences having infinitely many derivations. In those cases, ε -shift and ε -reduce actions create cycles in the GSS, which, in turn, result in cycles in the SPPF. The derivation trees can be recovered by unwinding the cycles [16,17]. An example of such a specification is Ξ in (5), which is presented further below.
There is a one-to-one correspondence between the edges in the GSS and the nodes in the SPPF. The construction is performed as follows: As a shift or an  ε -shift action corresponding to t is applied from the vertex w ∈ U k , a vertex u ∈ U l is either found or created. We search the SPPF for a node z labeled ( t , k , l ) . If it does not exist, it is created. The vertices are connected with an edge w → u , labeled z . The node corresponds to a lexeme υ ( k , l ) . The lexemes should not be included in the nodes directly: since they can overlap, their total size can far exceed the size of the input string. As a reduce action corresponding to  A : : = β is applied from the vertex w ∈ U i , the paths from v to w labeled β , of length | β | , are found. As each path is traversed, we record the labels on the edges, z 1 , … , z | β | . For each v ∈ U k , a vertex u ∈ U i is either found or created. We search the SPPF for a node z labeled ( A , i , i ) in the case of  ε -reduce actions, or  ( A , k , i ) in the case of reduce actions, where the node z 1 is labeled ( B , k , f ) . If it does not exist, it is created. The nodes z 1 , … , z | β | are then added as children to the node z if they are not already included. The vertices are connected with an edge v → u , labeled z [17,36]. In the case of right nullable productions A : : = γ η , where η ⇒ * ε , the subtrees corresponding to  η are missing from the SPPF, since the reduction steps are short-circuited [17]. To remedy this, Scott [17] pre-constructed the SPPF for  η of each right nullable production A : : = γ η in the grammar, called the  ε -SPPF. The nodes of the  ε -SPPF are indexed, and the index is included in the reduce actions. As a short-circuiting reduce action is applied, a node z ′ from the  ε -SPPF corresponding to the index is appended to the nodes of each path: z 1 , … , z | γ | , z ′ . The ε -SPPF consists of nodes created using productions A : : = β , where β ⇒ * ε , and terminal symbols t that can match ε . 
For this reason, the ε -SPPF needs to be used when such nodes are created using ε -reduce or  ε -shift actions; otherwise, the nodes would be duplicated. Thus, the index of the  ε -SPPF node needs to be included in each of these actions as well. The ε -SPPF nodes are just placeholders, and the appropriate SPPF nodes need to be inserted in their places after the GSS is constructed [17].
In our approach, we construct these SPPF nodes immediately. We include η directly in the reduce actions, REDUCE ( A , | γ | , η ) . To generate a subtree for all derivations of  X ⇒ * ε , where X ∈ ( N ∪ T ) , we search the SPPF for a node z labeled ( X , i , i ) . If it does not exist, it is created. For each κ , such that X : : = κ , where κ ⇒ + ε , we repeat the same process for all symbols κ = X 1 , … , X | κ | , and add the resultant root nodes z 1 , … , z | κ | as children to node z . This requires us to pass the grammar G to the parser; however, only productions A : : = β , where β ⇒ * ε , are needed. Therefore, we can construct a trimmed-down version G ε with only those productions. As a short-circuiting reduce action is applied, we generate the subtrees for all symbols η = X 1 , … , X | η | corresponding to all derivations of  η ⇒ * ε , and the resultant root nodes z 1 ′ , … , z | η | ′ are appended to the nodes of each path: z 1 , … , z | γ | , z 1 ′ , … , z | η | ′ . No modifications are required to the processing of  ε -shift and ε -reduce actions, as the nodes from the SPPF are reused when generating the subtrees. The benefit of our approach is that the SPPF is constructed to completion at each point during parsing. The downside is that the tables are slightly larger, since the reduce actions additionally include a string instead of a single index. However, since the worst-case time complexity of RNGLR parsing depends on the lengths of the right-hand sides, we can expect these strings to be reasonably short. Despite the fact that the construction process is not exactly the same as that of Scott [17], the resulting SPPF is structurally equivalent. Therefore, our architecture can be used as a drop-in replacement for RNGLR in combination with (traditional) context-aware scanning.
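Generating the subtrees for all derivations X ⇒ * ε can be sketched as a memoized recursion over the trimmed-down grammar (our names; `G_eps` mirrors the example grammar G ε used below, where the terminal d can match ε ; reusing an existing node is also what closes the cycles):

```python
# Sketch of generating the SPPF subtrees for all derivations X =>* epsilon
# at position i, driven by the trimmed-down grammar G_eps. Nodes are
# memoized per label, so shared and cyclic structure is reused.
G_eps = {"S": [["A"]], "A": [["d", "A"], []], "B": [["C", "C"]], "C": [[]]}

def generate(X, i, nodes):
    label = (X, i, i)
    if label in nodes:
        return label                     # reuse existing node (may close a cycle)
    nodes[label] = families = []
    for kappa in G_eps.get(X, []):       # terminals have no productions here
        if kappa:                        # skip X ::= epsilon itself
            families.append([generate(Y, i, nodes) for Y in kappa])
    return label

nodes = {}
generate("S", 2, nodes)
print(sorted(nodes))   # every nullable symbol reachable from S gets one node
```

Note that the recursion on A ::= d A terminates because the node for ( A , i , i ) is registered before its children are generated; the second visit simply returns the existing label, producing the self-loop seen in the diagrams.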
Let us demonstrate the SPPF construction with the following example:
S : : = A | e B
A : : = d A | ε
B : : = C C
C : : = ε
r d = x | ε
r e = x
r $ = $
which corresponds to the following trimmed-down grammar G ε of nullable productions:
S : : = A
A : : = d A | ε
B : : = C C
C : : = ε
and the  T A and T G given in Table 6 and Table 7, respectively.
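The trimmed-down grammar G ε can be computed mechanically from the specification: find the nullable symbols by fixed-point iteration, then keep only the productions whose right-hand sides are entirely nullable (a sketch; a terminal counts as nullable here when its regular definition accepts ε , as r d does above):

```python
# Sketch: derive G_eps from G by a fixed-point nullability computation.
def trim_nullable(productions, eps_terminals):
    """productions: list of (lhs, rhs-tuple); eps_terminals: terminals matching eps."""
    nullable = set(eps_terminals)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return [(lhs, rhs) for lhs, rhs in productions
            if lhs in nullable and all(s in nullable for s in rhs)]

G = [("S", ("A",)), ("S", ("e", "B")), ("A", ("d", "A")), ("A", ()),
     ("B", ("C", "C")), ("C", ())]
print(trim_nullable(G, {"d"}))
# keeps S ::= A, A ::= d A, A ::= eps, B ::= C C, C ::= eps
```

On the example specification this yields exactly the four rules of G ε listed above; S ::= e B is dropped because e cannot match ε .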
Parsing the input string ω = x results in the GSS and the SPPF in Figure 21 and Figure 22, respectively.
The edges in the GSS are labeled with the indices of the SPPF nodes. All productions in the grammar are right-nullable. The short-circuiting reduce actions are applied in  U 1 . REDUCE ( A , 1 , A ) and REDUCE ( A , 0 , d A ) are applied in state 3. The subtrees corresponding to  A ⇒ * ε and d A ⇒ * ε are generated, resulting in the SPPF nodes 7 and 8. REDUCE ( S , 1 , B ) and REDUCE ( B , 0 , C C ) are applied in state 5. The subtrees corresponding to  B ⇒ * ε and C C ⇒ * ε are generated, resulting in the SPPF nodes 1 and 4.
The sets of states P i , 0 , the valid lookahead sets ν ( P i , 0 ) , and the sets Λ i and B i for each position i are given in Table 8.
Two lookahead symbols, d and e , are matched at position 0. An ε -shift action is performed from states 0 and 3. The resultant SPPF node is, in both cases, labeled ( d , 0 , 0 ) ; thus, the edges share the index 5. For d , a shift action is performed from states 0 and 3, resulting in nodes labeled ( d , 0 , 1 ) ; thus, the edges again share the index 6. For e , a shift action is performed from state 0, resulting in a node labeled ( e , 0 , 1 ) and an edge labeled with index 3. There are infinitely many derivations for  A ⇒ * ε , since A ⇒ * d ⋯ d and d can match ε . In our example, one d corresponds to  x and results in the SPPF node 6, while all others correspond to  ε and result in SPPF nodes 5 and 8, and there can be infinitely many of them; hence, the loops in nodes 2 and 7.

Algorithm

The modifications required to generate the SPPF are purely additive. The algorithm listing is given in Algorithm 2. The algorithm now additionally accepts the trimmed-down grammar G ε as an argument. The action-processing functions shift and reduce are augmented to construct the SPPF nodes and label the edges. The function shift now additionally accepts the start position of the lexeme, which is used as the start position of a leaf SPPF node. The function reduce now accepts the first edge of the path instead of just the successor vertex, because the SPPF node that labels this edge needs to be recorded as well. The nullable part of the right nullable productions is passed as well; it is then processed by generate, which generates the missing subtrees. As in the algorithm proposed by Scott [17], the set N is introduced, which holds the SPPF nodes constructed at each position in the parse. Only the nodes in  N need to be considered when searching for existing SPPF nodes. The labels of the SPPF nodes in  N share the end position; thus, it could have been omitted [17]. Note that, in our case, the SPPF nodes resulting from  ε -shift and shift actions need to be added to  N as well, as more than one node can be created. If the parse succeeds, the root SPPF node, the one that corresponds to the input string ω , labels the edge between the start vertex v and a vertex w labeled with the final state of the handle-finding automaton. The nodes resulting from superfluously applied actions cannot be reached from the root SPPF node and can be removed.
Algorithm 2. RNGSGLR (0,1] parser for character-level CFL.
input T A action table, T G goto table, ( M ˙ i ) i ∈ { 0 , … , | ω | } automata at each position in the parse,
G ε trimmed-down grammar of nullable productions, ω character input string
output accept or reject the character input string and return the constructed SPPF
 
let rngsglr2 T A   T G   ( M ˙ i ) i ∈ { 0 , … , | ω | }   G ε   ω =
let rec actor w =
  if ∃ l ∈ Λ i , ε - SHIFT t ∈ T A ( STATE ( w ) , l )
  then shift i i  w t
  if ∃ l ∈ Λ i , SHIFT ∈ T A ( STATE ( w ) , l )
  then B i : = B i ∪ { l ↦ w }
  for each l ∈ Λ i , REDUCE ( A , 0 , η ) ∈ T A ( STATE ( w ) , l ) do
   reduce ( w , w )  A 0 η
  done
and reduce_actor w =
  for each l ∈ Λ i , REDUCE ( A , | γ | , η ) ∈ T A ( STATE ( w ) , l ) ∧ | γ | ≠ 0 do
   for each successor node w ′ of w do
    reduce ( w , w ′ )  A  | γ |   η
   done
  done
and shift_actor () =
  for each match ( k , l ) ∈ 𝓑 i do
   for each w ∈ B k ( l ) , SHIFT ∈ T A ( STATE ( w ) , l ) do
    shift k j  w l
   done
  done
and reduce ( w , w ′ )  A  | γ |   η =
  let rec generate X =
   if ∃ z ∈ N , LABEL ( z ) = ( X , i , i )
   then z
   else
    create node z labeled ( X , i , i )
     N : = N ∪ { z }
    for each X : : = κ do
     let z 1 , … , z | κ | = map generate over κ = X 1 ⋯ X | κ | in
     add children z 1 , … , z | κ | to z
    done
    z
   done
  in
  let z 1 ′ , … , z | η | ′ = map generate over η = X 1 ⋯ X | η | in
  for each v that can be reached via ( w , w ′ ) along a path of length | γ | and the corresponding labels z 1 , … , z | γ | do
   if ∄ z ∈ N , LABEL ( z ) = ( A , k , i ) , where LABEL ( z 1 ) = ( B , k , f ) , then
    create node z labeled ( A , k , i )
     N : = N ∪ { z }
   add children z 1 , … , z | γ | , z 1 ′ , … , z | η | ′ to node z
   let s = T G ( STATE ( v ) , A ) in
   if  ∄ u ∈ U i , STATE ( u ) = s then
    add vertex u to  V , labeled s , and edge ( u , v ) to  E , labeled z
     U i : = U i ∪ { u }
    if | γ | ≠ 0
    then reduce_actor u
   else if ( u , v ) ∉ E then
    add edge ( u , v ) to  E , labeled z
    if | γ | ≠ 0 then
     for each t ∈ Λ i , REDUCE ( B , | θ | , κ ) ∈ T A ( STATE ( u ) , t ) ∧ | θ | ≠ 0 do
      reduce ( u , v )  B  | θ |   κ
     done
and shift k   l  w t =
  if ∄ z ∈ N , LABEL ( z ) = ( t , k , l ) then
   create node z labeled ( t , k , l )
    N : = N ∪ { z }
  let s = T G ( STATE ( w ) , t ) in
  if ∄ u ∈ U l , STATE ( u ) = s then
   add vertex u to  V , labeled s , and edge ( u , w ) to  E , labeled z
    U l : = U l ∪ { u }
  else if ( u , w ) ∉ E
  then add edge ( u , w ) to  E , labeled z
in
let parse () =
  scan at position i in  ω $ using M ˚ i q ˚ i , tracing q ˚ i → q ˚ j
  get matches 𝓑 i = m ( q ˚ j )
  get predictions Λ i = { t ∣ ( k , t ) ∈ p ( q ˚ j ) ∧ k = i }
 
   B i : = ∅ , U j : = ∅
  for each vertex w ∈ U i do
   reduce_actor w
  done
  for each vertex w ∈ U i do
   actor w
  done
   N : = ∅
  shift_actor ()
 
  construct M ˚ j q ˚ j = M ˚ i q ˚ j | M ˙ j
   i : = j
in
 add the start vertex v to  V labeled s
i : = 0 , U 0 : = { v }
 
 construct M ˚ 0 q ˚ 0 = M ˙ 0
while i ≤ | ω | ∧ M ˚ i q ˚ i ≠ ∅ do
  parse ()
done
if ∃ w ∈ U | ω | , ACCEPT ∈ T A ( STATE ( w ) , $ ) then
  the node z labeling the edge ( w , v ) is the root of the forest
  remove the nodes not reachable from z
  accept
else reject

6. Scanner Lookahead

By default, our architecture does not require a buffer. All characters are processed one by one. While all matched lookahead symbols result in a shift action due to the contextual information passed to the scanner by the parser (when using the canonical LR or an equivalent handle-finding automaton construction algorithm), some are still performed superfluously. In the example for specification Ξ in (4), the shift actions are performed for the symbol c in U 1 , U 2 , and U 3 ; however, all but the last are superfluous. The issue is that $ could potentially have followed at positions 2 or 3, and there was no way to tell whether that was the case, since only one character was processed at a time. In practice, this arises when parsing embedded sub-languages. A simple example is the following specification:
S : : = a b a
r a = "
r b = ( Ω ∖ " ) *
When parsing the input string ω = " x y z " , the symbol b will be matched at positions 1, 2, and 3, because there is no way to tell whether the next character is " .
A modest extension to our architecture, which solves this problem, is the addition of the scanner lookahead [2]. Usually, when a multitude of matches is found, all except the last result in a superfluous shift action. Therefore, the use of the scanner lookahead is functionally similar to the longest-match disambiguation rule. However, the longest match in certain circumstances results in an incorrect parse. As identified by Nawrocki [9], there is a longest-match conflict between not necessarily distinct t , l , l ′ ∈ T $ , where γ , η ∈ ( N ∪ T ) * and σ ∈ Ω + and ρ , τ , π ∈ Ω * , when:
S ⇒ * γ t η ∧ S ⇒ * γ l l ′ η ∧ ρ σ τ ∈ L ( r t ) ∧ ρ ∈ L ( r l ) ∧ σ π ∈ L ( r l ′ )
When parsing ρ σ π , if the longest match is used, ρ σ is recognized by M t , and then an error is raised because π does not match τ [9]. To detect whether the longest-match conflict can occur, it is enough to consider only the first character of σ , as the parse is bound to be incorrect as soon as M t scans a single character past ρ without returning a match. There is a special case, where γ , η ∈ ( N ∪ T ) * and σ ∈ Ω + and ρ , π ∈ Ω * and i , j ≥ 0 :
S ⇒ * γ t l η ∧ ρ σ i ∈ L ( r t ) ∧ σ j π ∈ L ( r l )
When parsing ρ σ k π = ρ σ i σ j π , all possible interpretations, such that i + j = k , need to be considered. When using the longest match, the only possible interpretation is i = k , j = 0 . The longest match results in an incorrect parse if M l can find a match after M t enters a final state.
In the case of our architecture, this occurs if M ˚ i q ˚ i can find a match after M ˚ h q ˚ i has entered a final state.
  • When using our architecture, this is, by default, always assumed to be true. Every matched symbol is passed to the parser.
  • When using the longest match, this is always assumed to be false. The scanning continues with M ˚ h q ˚ i until it is about to enter a state from which no more matches can be made on the next transition.
  • When using the scanner lookahead, a test is performed to determine whether that is the case. We could scan ahead using the automaton M ˚ i q ˚ i to see if it enters a state from which no more matches can be made, and continue scanning in that case. Since an incorrect parse can only result from continuing the scanning, we can always stop early and return the match to the parser. We present a variant that is equivalent to scanning a single character: we use a simple set-membership test instead of scanning ahead. The match is passed to the parser only if the rest of the input string starts with a character a ∈ Ω $ , such that a ∈ FIRST ( M ˚ i q ˚ i ) . This is equivalent to one character of lookahead in a character-level parser.
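The set-membership variant of the test reduces to a single check against the FIRST set of the continuation automaton (a sketch with hypothetical names; the FIRST set itself is assumed to be precomputed):

```python
# Sketch: decide whether to pass a match to the parser. The match is
# forwarded only if the peeked character could start a new lexeme, i.e.,
# it lies in FIRST of the composite automaton that would continue at j.
def pass_match_to_parser(next_char: str, first_of_continuation: set) -> bool:
    return next_char in first_of_continuation

# e.g., inside the quoted-string example above, the match for b is only
# reported once the closing quote is seen:
first = {'"'}                     # FIRST of the automaton expecting r_a = "
assert pass_match_to_parser('"', first)
assert not pass_match_to_parser('y', first)
```

Stopping the scanner on this test is what removes the superfluous shift actions at positions 1 and 2 in the example.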
To determine the lookahead characters automatically using G , we introduce a valid-lookahead-lookahead multi-map, ξ : S × T $ → P ( T $ ) , which gives the set of lookahead symbols for each lookahead symbol in the valid lookahead set. We identified the following approaches to compute ξ :
  • The simplest involves using the lookahead sets in a handle-finding automaton constructed using two symbols of lookahead, as proposed by Keynes [8]. For each state s ∈ S , the shift items ( A : : = γ · a η , t l ) ∈ s are converted into { ( s , t ) ↦ l ∣ t l ∈ FIRST 2 ( a η t l ) } , and the reduce items ( A : : = β · , t l ) ∈ s are converted to { ( s , t ) ↦ l } . The issue with this approach is that the construction of automata using two symbols of lookahead is inefficient, as also noted by Keynes [8]. Additionally, it computes ξ indirectly: it needs to compute two tokens of lookahead per item, which we have no use for, and then collapse them. We did not manage to make this approach reasonably efficient for large grammars.
  • An approach based on exploring the right context of each LALR(1) handle-finding automaton state manually is given by Nawrocki [9].
  • We found an efficient approach that allows us to compute ξ directly. We construct the LALR(1) handle-finding automaton by converting the LR(0) handle-finding automaton to SLR [1]. The method works by constructing a grammar in which each symbol is enhanced with the state of the LR(0) handle-finding automaton. The lookahead is computed as the FOLLOW sets of the enhanced non-terminal symbols, and ξ is computed as the FOLLOW sets of the enhanced terminal symbols. In a sense, ξ corresponds to the lookahead that would be used when performing the reduce action for the symbol t ∈ T . The downside of the method is that it is tied closely to the handle-finding automaton construction method.
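The third approach reduces to an ordinary FOLLOW-set computation over the enhanced grammar. A generic fixed-point FOLLOW computation can be sketched as follows (standard algorithm; the enhanced grammar itself is not reproduced, and FOLLOW is kept for terminal symbols as well, since ξ is read off the enhanced terminals):

```python
def follow_sets(productions, start, first, nullable):
    """FOLLOW for every grammar symbol; terminals are included because
    xi is read off the FOLLOW sets of the enhanced *terminal* symbols."""
    symbols = {s for lhs, rhs in productions for s in (lhs, *rhs)}
    follow = {s: set() for s in symbols}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            trailer = set(follow[lhs])          # what can follow the suffix so far
            for sym in reversed(rhs):
                if not trailer <= follow[sym]:
                    follow[sym] |= trailer
                    changed = True
                if sym in nullable:
                    trailer = trailer | first.get(sym, {sym})
                else:
                    trailer = set(first.get(sym, {sym}))
    return follow

# S ::= A b ; A ::= a | eps  --  FOLLOW of the terminal a is {b}
f = follow_sets([("S", ("A", "b")), ("A", ("a",)), ("A", ())],
                "S", {"S": {"a", "b"}, "A": {"a"}}, {"A"})
print(f["a"])   # -> {'b'}
```

In this toy grammar, FOLLOW(a) = {b} is exactly the ξ -style information: the symbols that may appear after the terminal a.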
The scanner lookahead multi-map ξ ′ : S × T $ → P ( Ω $ ) is then computed as
ξ ′ ( s , t ) = ⋃ l ∈ ξ ( s , t ) FIRST ( M l )
For a set of handle-finding automaton states, ξ ′ : Π × T $ → P ( Ω $ ) is defined as
ξ ′ ( P , t ) = ⋃ s ∈ P ξ ′ ( s , t )
The maps m : Q ˚ i × Ω $ → P ( { 0 , … , | ω | } × T $ ) and p : Q ˚ i × Ω $ → P ( { 0 , … , | ω | } × T $ ) are redefined as:
m ( q ˚ j , a ) = ⋃ k ∈ { 0 , … , j } { ( k , t ) ∣ t ∈ m ( P k , 0 , q ˙ k / j ) ∧ a ∈ ξ ′ ( P k , 0 , t ) } p ( q ˚ j , a ) = ⋃ k ∈ { 0 , … , j } { ( k , t ) ∣ t ∈ p ( P k , 0 , q ˙ k / j ) ∧ a ∈ ξ ′ ( P k , 0 , t ) }
The scanning process is then modified as follows: The input string ω is extended with $ $ . At each position i in the parse, a string ψ ( i , j ) ∈ Ω $ + is recognized in the rest of the input string ω ′ using M ˚ i q ˚ i as usual, such that ω ′ = ψ ( i , j ) ω ′′ . Then we peek at the next character a , such that ω ′′ = a ω ′′′ . If none of the matched symbols pass the test, then m ( q ˚ j , a ) = ∅ , and as a result, M ˙ j = ∅ ; therefore, the scanning immediately continues with M ˚ i q ˚ j . Otherwise, m ( q ˚ j , a ) and p ( q ˚ j , a ) are passed to the parser, and the scanning continues with M ˚ j q ˚ j .
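The redefined maps amount to filtering each candidate pair ( k , t ) by whether the peeked character is allowed after t (a sketch; `xi_prime` stands for the precomputed ξ ′ , keyed here by the start position for brevity):

```python
# Sketch: filter scanner matches with one character of lookahead.
def filter_matches(candidates, a, xi_prime):
    """candidates: set of (k, t) pairs; a: peeked character;
    xi_prime: maps (k, t) -> the set of allowed next characters."""
    return {(k, t) for (k, t) in candidates if a in xi_prime.get((k, t), set())}

xi_prime = {(1, "b"): {'"'}, (0, "ident"): {" ", "="}}
matches = {(1, "b"), (0, "ident")}
print(filter_matches(matches, '"', xi_prime))   # -> {(1, 'b')}
```

An empty result corresponds to m ( q ˚ j , a ) = ∅ , in which case the scanning continues without involving the parser.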

7. Proof of Correctness

A recognizer is correct for a given specification Ξ if it accepts an input string ω ∈ Ω * if and only if ω ∈ L ( Ξ ) , and otherwise rejects it. First, we will prove that if our architecture accepts ω , then ω ∈ L ( Ξ ) .
Lemma 1.
For each vertex u ∈ U i , where ( A : : = γ X · η , l ) ∈ STATE ( u ) , there must also exist a vertex w ∈ U h , such that ( A : : = γ · X η , l ) ∈ STATE ( w ) , where γ , η ∈ ( N ∪ T ) * , and there must exist an edge w → u , labeled X , in the GSS.
Proof. 
This follows trivially from the definition of the handle-finding automaton and the basic operation of the RNGLR parser [15]. □
Lemma 2.
For each edge w → u , labeled t , such that w ∈ U h and u ∈ U i , created using a shift action, ψ ( h , i ) ∈ L ( t ) .
Proof. 
The shift action is performed if ( h , t ) ∈ m ( q ˚ i ) after M ˚ h q ˚ h traces a path q ˚ h → q ˚ i while scanning ψ ( h , i ) . That means the match was found by the automaton M ˙ h , such that q ˙ h / h → q ˙ h / i while scanning ψ ( h , i ) , and that t ∈ m ( P h , 0 , q ˙ h / i ) . Since m ( P h , 0 , q ˙ h / i ) = ν ( P h , 0 ) ∩ m ( q ˙ h / i ) , it follows that t ∈ ν ( P h , 0 ) and t ∈ m ( q ˙ h / i ) . The automaton M ˙ T $ is a union automaton, which means it finds a match if any of the constituent automata find a match. The only automaton where t ∈ m ( q t ) is M t , which means it must necessarily be the one that recognizes ψ ( h , i ) . Therefore, ψ ( h , i ) ∈ L ( M t ) , and by the definition of M t , ψ ( h , i ) ∈ L ( t ) . □
Lemma 3.
For each edge w → u , labeled X , such that w ∈ U i and u ∈ U i , created using an ε-shift action or an ε-reduce action, ψ ( i , i ) ∈ L ( X ) .
Proof. 
An ε -shift action or ε -reduce action is only performed if X ⇒ * ε . Since ψ ( i , i ) = ε at any position, ψ ( i , i ) ∈ L ( X ) . □
Lemma 4.
For each edge w → u , labeled A , such that w ∈ U h and u ∈ U i , created using a reduce action, where | γ | > 0 , ψ ( h , i ) ∈ L ( A ) .
Proof. 
We will prove the statement by induction. If the GSS does not contain any edges, the statement is trivially satisfied. If the statement holds for all of the edges in the GSS, then it must hold after w → u , labeled A , has been added. The reduce action REDUCE ( A , | γ | , η ) ∈ T A ( STATE ( v ) , l ) , where v ∈ U i , is performed if there exists a path v 0 → v 1 → ⋯ → v | γ | labeled X 1 , … , X | γ | , where γ = X 1 ⋯ X | γ | and v 0 = w and v | γ | = v . For each X i , i ∈ { 1 , … , | γ | } , where v i − 1 ∈ U e and v i ∈ U f , ψ ( e , f ) ∈ L ( X i ) . This statement holds for the edge v i − 1 → v i , labeled X i :
  • Due to Lemma 2, if it was created as a result of a shift action;
  • Due to Lemma 3, if it was created as a result of an ε -shift or ε -reduce action; or
  • Due to the induction hypothesis, if it was created as a result of a reduce action.
This exhausts all possible ways an edge can be created. The reduce action entry was added to the action table because ( A : : = X 1 ⋯ X | γ | · η , l ) ∈ STATE ( v ) . That means A ⇒ * X 1 ⋯ X | γ | η , where η ⇒ * ε , and as a result, ψ ( h , i ) ∈ L ( A ) . □
Theorem 1.
For each edge w → u , labeled X , such that w ∈ U h and u ∈ U i , in the GSS, ψ ( h , i ) ∈ L ( X ) .
Proof. 
An edge can only be created as a result of a shift, an ε -shift, an ε -reduce, or a reduce action, where | γ | > 0 ; therefore, the statement holds due to Lemmas 2–4. □
Theorem 2.
If our architecture accepts ω , then ω ∈ L ( Ξ ) .
Proof. 
If our architecture accepts ω , there must exist a vertex w ∈ U | ω | , such that ACCEPT ∈ T A ( STATE ( w ) , $ ) . This implies ( S ′ : : = S · $ ) ∈ STATE ( w ) . By Lemma 1, there must exist a vertex v , such that ( S ′ : : = · S $ ) ∈ STATE ( v ) and an edge v → w , labeled S . By the definition of the handle-finding automaton, the only such vertex is the start vertex v . Therefore, there exists an edge v → w , such that v ∈ U 0 and w ∈ U | ω | . By Theorem 1, ω ∈ L ( S ) ; therefore, ω ∈ L ( Ξ ) [15]. □
Second, we will prove that if ω ∈ L ( Ξ ) , then our architecture accepts ω ∈ Ω * . We will do so by showing that our architecture accepts ω exactly when a character-level RNGLR parser accepts ω using G ^ . The correctness of RNGLR was proven by Scott [17].
We will formulate the lookahead as the right context. The item right context is the grammar that describes the language of all strings that can follow an item in state s ∈ S . For an item of the form ( A : : = · β ) ∈ s , which was added as a result of the ε -closure on an item of the form ( B : : = γ · A η ) ∈ s , a production F s , ( A : : = · β ) : : = η F s , ( B : : = γ · A η ) is added to the grammar. Shifting over a symbol does not change the item right context; thus, for the items of the form ( B : : = γ · X η ) ∈ s and ( B : : = γ X · η ) ∈ s ′ , a production of the form F s ′ , ( B : : = γ X · η ) : : = F s , ( B : : = γ · X η ) is added. The dot right context is the grammar that describes the language of all strings that can follow the dot. The grammar is initially constructed in the same way as for the item right context. Then, for each item of the form ( B : : = γ · X η ) ∈ s , a production D s , ( B : : = γ · X η ) : : = X η F s , ( B : : = γ · X η ) is added [1]. We assume the character-level RNGLR uses the whole right context as the lookahead; in that way, it cannot perform superfluous actions. Our architecture performs strictly worse in that regard.
To avoid confusion, the symbols that belong to RNGLR will be annotated with ( 1 ) and those that belong to our architecture will be annotated with ( 2 ) .
Lemma 5.
For all states s ( 1 ) ∈ S ( 1 ) and s ( 2 ) ∈ S ( 2 ) , if B ∈ N , where γ , η ∈ ( N ∪ T ) * :
( B : : = γ · η ) ∈ s ( 1 ) ⟺ ( B : : = γ · η ) ∈ s ( 2 )
Proof. 
We will prove the statement by induction. The base cases are the start states s ( 1 ) ∈ S ( 1 ) and s ( 2 ) ∈ S ( 2 ) . We will prove the statement holds for the base case, again by induction. Initially, ( S ′ : : = · S ) ∈ s ( 1 ) and ( S ′ : : = · S ) ∈ s ( 2 ) . If we assume S ∈ N , the statement holds trivially, as the initial items exist for both G ^ and G by the definition of the handle-finding automaton. We will prove that, as the items are expanded using ε -closure, the statement still holds: By the induction hypothesis, there exist items ( B : : = · X η ′ ) ∈ s ( 1 ) and ( B : : = · X η ′ ) ∈ s ( 2 ) , where η = X η ′ and η ′ ∈ ( N ∪ T ) * , or there are no such items in s ( 1 ) and s ( 2 ) . If X ∈ N , the items ( X : : = · β ) are added for both G ^ and G . That is because the productions X : : = β , such that X ∈ N , exist in both grammars, by the definition of G ^ . Moreover, β ∈ ( N ∪ T ) * ; therefore, the statement continues to hold.
Therefore, the statement holds for the base case. We will prove that the statement still holds after new states s ′ ( 1 ) ∈ S ( 1 ) and s ′ ( 2 ) ∈ S ( 2 ) are added. Each state is created from an existing state s ( 1 ) ∈ S ( 1 ) and s ( 2 ) ∈ S ( 2 ) . By the induction hypothesis, there exist items ( B : : = γ · X η ′ ) ∈ s ( 1 ) and ( B : : = γ · X η ′ ) ∈ s ( 2 ) , where η = X η ′ and γ , η ′ ∈ ( N ∪ T ) * , or there are no such items in s ( 1 ) and s ( 2 ) . If they exist, then a new state s ′ ( 1 ) and an edge s ( 1 ) → s ′ ( 1 ) , labeled X , are created for G ^ , and a new state s ′ ( 2 ) and an edge s ( 2 ) → s ′ ( 2 ) , labeled X , are created for G . Initially, there exist items ( B : : = γ X · η ′ ) ∈ s ′ ( 1 ) and ( B : : = γ X · η ′ ) ∈ s ′ ( 2 ) . If η ′ = ε , no more items are added, and the statement continues to hold. Otherwise, if η ′ = X ′ η ′′ , the items are expanded using ε -closure. We can use the same reasoning for s ′ ( 1 ) and s ′ ( 2 ) to show that the statement still holds for the created items. Therefore, the statement holds for all s ( 1 ) ∈ S ( 1 ) and s ( 2 ) ∈ S ( 2 ) . □
Corollary 1.
For all states s ( 1 ) , s ′ ( 1 ) ∈ S ( 1 ) and s ( 2 ) , s ′ ( 2 ) ∈ S ( 2 ) , when B ∈ N , the item right context and the dot right context coincide for both G ^ and G :
F s ( 1 ) , ( A : : = · β ) : : = η F s ( 1 ) , ( B : : = γ · A η ) ⟺ F s ( 2 ) , ( A : : = · β ) : : = η F s ( 2 ) , ( B : : = γ · A η ) F s ′ ( 1 ) , ( B : : = γ X · η ) : : = F s ( 1 ) , ( B : : = γ · X η ) ⟺ F s ′ ( 2 ) , ( B : : = γ X · η ) : : = F s ( 2 ) , ( B : : = γ · X η ) D s ( 1 ) , ( B : : = γ · X η ) : : = X η F s ( 1 ) , ( B : : = γ · X η ) ⟺ D s ( 2 ) , ( B : : = γ · X η ) : : = X η F s ( 2 ) , ( B : : = γ · X η )
This follows trivially from Lemma 5.
Lemma 6.
Γ ( 2 ) contains all paths labeled ι ∈ ( N ∪ T ) * in Γ ( 1 ) , for all derivations S ⇒ * ι ω ′ .
Proof. 
The ι are the configurations of the stack in LR parsing [1]. We will prove the statement by induction. The base cases are the empty paths starting at v ( 1 ) and v ( 2 ) , which exist by the definition of both recognizers. By the induction hypothesis, there are paths:
v ( 1 ) → ⋯ → w ( 1 ) labeled ι and v ( 2 ) → ⋯ → w ( 2 ) labeled ι
We will prove that if an edge w ( 1 ) → u ( 1 ) , labeled X , where X ∈ ( N ∪ T ) , is added to Γ ( 1 ) , an edge w ( 2 ) → u ( 2 ) , labeled X , is added to Γ ( 2 ) . In either algorithm, an edge can only be added by processing an action. We will show that, if an action is performed by the character-level RNGLR, it is also performed by our architecture.
1.
A reduce action REDUCE ( A , | γ | , η ) T A ( 1 ) ( STATE ( v ( 1 ) ) , ω ) for production A : : = γ η , such that A N and | γ | > 0 and η * ε , is performed using a character-level RNGLR if there exists a path v ( 1 ) α w ( 1 ) γ v ( 1 ) , where w ( 1 ) U h ( 1 ) and v ( 1 ) U i ( 1 ) , such that ( B : : = σ A τ ) STATE ( w ( 1 ) ) and ( A : : = γ η ) STATE ( v ( 1 ) ) , and if the rest of the input matches ω , such that F STATE ( v ( 1 ) ) , ( A : : = γ η ) * ω .
By the induction hypothesis, there exists a path v ( 2 ) α w ( 2 ) γ v ( 2 ) , where w ( 2 ) U h ( 2 ) and v ( 2 ) U i ( 2 ) , as well. By Lemma 5, ( B : : = σ A τ ) STATE ( w ( 2 ) ) and ( A : : = γ η ) STATE ( v ( 2 ) ) , therefore REDUCE ( A , | γ | , η ) T A ( 2 ) ( STATE ( v ( 2 ) ) , l ) . That means l ν ( STATE ( v ( 2 ) ) ) . The reduce action is performed if l p ( q ˚ j ) after M ˚ i q ˚ i traces a path q ˚ i ψ ( i , j ) q ˚ j . That will happen if l ψ ( i , j ) ψ . Since F STATE ( v ( 1 ) ) , ( A : : = γ η ) * κ * ω , then by Corollary 1, F STATE ( v ( 2 ) ) , ( A : : = γ η ) * l κ , where κ = l κ and κ , κ T * .
Therefore, it is guaranteed that ω = ψ ( i , j ) ψ ω , where l κ * ψ ( i , j ) ψ ω . Thus, the action is also performed by our architecture. The relevant parts of the resulting Γ ( 1 ) and Γ ( 2 ) are presented in Figure 23.
Figure 23. The relevant part of Γ ( 1 ) after the reduce action is performed by the character-level RNGLR and the corresponding part of Γ ( 2 ) after the reduce action is performed by our architecture.
2.
The conditions for performing an ε -reduce action by the character-level RNGLR and our architecture are the same as in 1, except in this case, v ( 1 ) = w ( 1 ) and v ( 2 ) = w ( 2 ) , as | γ | = 0 . The relevant parts of the resulting Γ ( 1 ) and Γ ( 2 ) are presented in Figure 24.
Figure 24. The relevant part of Γ ( 1 ) after the ε -reduce action is performed by the character-level RNGLR and the corresponding part of Γ ( 2 ) after the ε -reduce action is performed by our architecture.
3.
The conditions for performing a reduce action REDUCE ( t , | γ | , η ) T A ( 1 ) ( STATE ( v ( 1 ) ) , ω ) for production t : : = γ η , such that | γ | > 0 and η * ε and t T , are the same as in 1.
By the induction hypothesis, there exists a path v ( 2 ) α w ( 2 ) , where w ( 2 ) U h ( 2 ) . The path does not exist for γ , because t N ; therefore, γ ( N T ) * . By Lemma 5, ( B : : = σ t τ ) STATE ( w ( 2 ) ) . Since t is a terminal symbol in G , there exists a shift action SHIFT T A ( 2 ) ( STATE ( w ( 2 ) ) , t ) . This means t ν ( STATE ( w ( 2 ) ) ) . The shift action is performed if ( h , t ) m ( q ˚ i ) after M ˚ q ˚ h traces a path q ˚ h ψ ( h , i ) q ˚ i . That will happen if t ψ ( h , i ) . The reduce action creates an edge w ( 1 ) X u ( 1 ) , such that w ( 1 ) U h ( 1 ) and u ( 1 ) U i ( 1 ) , in Γ ( 1 ) , which means t * ψ ( h , i ) [15,17]. As a result, a match will necessarily be found for t and the action will also be performed by our architecture. Since D STATE ( w ( 1 ) ) , ( B : : = σ t τ ) * t κ * ψ ( h , i ) ω , then by Corollary 1, D STATE ( w ( 2 ) ) , ( B : : = σ t τ ) * t l κ , where t κ = t l κ and κ , κ T * . Furthermore, ψ ( h , i ) ω = ψ ( h , i ) a ψ ω , where l a ψ . That means l ξ ( STATE ( w ( 2 ) ) , t ) and a ξ ( STATE ( w ( 2 ) ) , t ) , and it is guaranteed that a will be found in the rest of the input string. Therefore, the action will also be performed if the scanner lookahead is used. The relevant part of the resulting Γ ( 1 ) and Γ ( 2 ) are presented in Figure 25.
Figure 25. The relevant part of Γ ( 1 ) after the reduce action is performed by the character-level RNGLR and the corresponding part of Γ ( 2 ) after the shift action is performed by our architecture.
4.
The conditions for performing an ε -reduce action REDUCE ( t , | γ | , η ) T A ( 1 ) ( STATE ( w ( 1 ) ) , ω ) , such that | γ | = 0 and η * ε and t T , are the same as in 1. By the induction hypothesis, there exists a path v ( 2 ) α w ( 2 ) , where w ( 2 ) U h ( 2 ) . The path does not exist for γ , because | γ | = 0 . By Lemma 5, ( B : : = σ t τ ) STATE ( w ( 2 ) ) . Since | γ | = 0 , there exists an ε -shift action ε - SHIFT t T A ( 2 ) ( STATE ( w ( 2 ) ) , l ) . This means l ν ( STATE ( w ( 2 ) ) ) . The ε -shift action is performed if l p ( q ˚ j ) after M ˚ q ˚ i traces a path q ˚ i ψ ( i , j ) q ˚ j . That will happen if l ψ ( i , j ) ψ . Since D STATE ( w ( 1 ) ) , ( B : : = σ t τ ) * t κ * ω , where t * ε , then by Corollary 1, D STATE ( w ( 2 ) ) , ( B : : = σ t τ ) * t l κ , where t κ = t l κ and t ε and κ , κ T * . Therefore, it is guaranteed that ω = ψ ( i , j ) ψ ω , where t l κ * ψ ( i , j ) ψ ω . Thus, the action is also performed by our architecture. The relevant part of the resulting Γ ( 1 ) and Γ ( 2 ) are presented in Figure 26.
This exhausts all possible combinations of actions, where X ( N T ) ; therefore, the statement holds for all paths ι . Γ ( 2 ) indeed contains all paths labeled ι ( N T ) * in Γ ( 1 ) . □
Theorem 3.
If ω L ( Ξ ) , then our architecture accepts ω.
Proof. 
The (character-level) RNGLR algorithm accepts a string ω if there exists an edge v ( 1 ) S w ( 1 ) in Γ ( 1 ) , such that v ( 1 ) U 0 ( 1 ) and w ( 1 ) U | ω | ( 1 ) [15,17]. By Lemma 6, since S N is a path of length one, it exists in Γ ( 2 ) as well. By Theorem 2, our architecture accepts ω if such an edge exists. If ω L ( G ^ ) , then the (character-level) RNGLR algorithm accepts ω as proven by Scott [17]. Our architecture accepts ω , if the character-level RNGLR algorithm accepts ω ; therefore, if ω L ( Ξ ) then our architecture accepts ω . □
Traditionally, the parser is correct for a given specification Ξ if the constructed SPPF is a representation of all possible derivations, such that S * σ , where σ T * is a string of terminal symbols passed to the parser by the scanner. As a generalization, in our case, all possible σ are considered. Therefore, our architecture captures all lexical ambiguities.
Theorem 4.
The parser constructs the SPPF for all possible derivations S * σ * ω , where σ T * and ω Ω * .
Proof. 
This follows from Lemma 6: our architecture constructs the GSS using Ξ containing all paths labeled ι ( N T ) * in the GSS constructed by the character-level RNGLR using G ^ . The SPPF nodes are stored in the GSS edges and the construction of the SPPF nodes is equivalent to that in the (character-level) RNGLR. The SPPF nodes ( A , k , l ) , where A N , constructed by our architecture, are the same as those constructed by the character-level RNGLR. The SPPF nodes ( t , k , l ) , where t T , constructed by our architecture, are leaf nodes because t was recognized using the scanner. The same nodes constructed by the character-level RNGLR contain a subtree, created as a result of parsing t. Therefore, the SPPF constructed by our architecture is not exactly the same as the one for the character-level RNGLR; the difference is that the subtrees for derivations σ * ω are missing. □

8. Experience

As our architecture utilizes context-aware scanning [6,7] and Schrödinger’s token [3,15,16] directly, it naturally supports all of the use cases of these techniques, and the interested reader should refer to [3,6,7]. It can also trivially resolve the conflicts that are reported to be difficult to resolve using these techniques alone. These difficulties arise due to conflicts of type (II), and are exactly the ones we aimed to resolve. We will assume the scanner lookahead is not utilized; otherwise, some of these conflicts are resolved even sooner. Reference [6] mentions the problem of scanning keywords (or operators) that are not pairwise-prefix disjoint, for example, for and foreach , appearing in C#. In the case of our architecture, the automaton M ˙ 0 recognizes both keywords. After the shorter keyword is recognized and a shift action is performed, that automaton continues scanning and recognizes the longer keyword if possible. In that case, the first shift action is performed superfluously. Reference [9] mentions the problem of recognizing 1 . . 5 as two integer numbers delimited by the range operator, instead of a floating-point number followed by a dot and a cardinal number, appearing in Pascal and Ada. In the case of our architecture, the automaton M ˙ 0 recognizes both an integer and a floating-point number. After the integer number is recognized and a shift action is performed, an automaton M ˙ 1 is added, recognizing the range operator. After the floating-point number is recognized and a shift action is performed, no automaton would recognize the dot; therefore, this shift action was performed superfluously. After the range operator is recognized and a shift action is performed, M ˙ 3 is added, again recognizing the integer or floating-point number. Note that our architecture can also handle the case where floating-point numbers are used in the range 1 . . 5 . 0 , which are not valid in Pascal or Ada.
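The parallel operation of the automata described above can be sketched in a few lines. The symbol names and regular expressions below are illustrative stand-ins, not the automata M ˙ i generated by our architecture: every candidate recognizer examines the same characters, and every match is reported to the parser.

```python
import re

# Illustrative token definitions (assumptions, not the paper's tables):
# all recognizers run "in parallel" over the same input position.
PATTERNS = {
    "integer": re.compile(r"[0-9]+"),
    "float":   re.compile(r"[0-9]+\.[0-9]+"),
    "range":   re.compile(r"\.\."),
}

def all_matches(text, pos):
    """Report every (symbol, lexeme) that matches at position pos."""
    found = []
    for symbol, pattern in PATTERNS.items():
        m = pattern.match(text, pos)
        if m:
            found.append((symbol, m.group()))
    return found

# For "1..5" only the integer matches at position 0: the float
# recognizer dies on the second '.', so the range operator follows.
print(all_matches("1..5", 0))   # [('integer', '1')]
print(all_matches("1..5", 1))   # [('range', '..')]
# For "1.25" both interpretations survive and are reported.
print(all_matches("1.25", 0))   # [('integer', '1'), ('float', '1.25')]
```

With this picture in mind, the superfluous shift actions mentioned above correspond to matches that are reported but later invalidated when a longer overlapping match arrives.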
A similar case also occurs in COBOL, where the sentences conclude with a dot. Therefore, at the end of the sentence, 1 . should be regarded as an integer number, not a floating-point number. References [12,13] mention a synthetic case where 1 . 1 can be recognized either as a floating-point number or two integer numbers separated by a dot. In the case of our architecture, the automaton M ˙ 0 first recognizes an integer number. After that, an automaton M ˙ 1 that recognizes a dot is added, which is followed by an automaton M ˙ 2 that recognizes an integer number. The automaton M ˙ 0 continues scanning and recognizes the floating-point number as well. The use cases in most existing programming languages are all fairly trivial to resolve using our architecture and do not even come close to utilizing its full potential. The reason for this is that grammar writers have intentionally avoided introducing these conflicts due to the limitations of the traditional parsing architecture. Our architecture opens up new possibilities:
  • Lexical syntax can vary arbitrarily for different parts of the programming language without carefully constructed scanner modes. All of the embedded sublanguages can be described in a single specification.
  • During the development, thoughtless additions of regular expressions never result in an error, as our architecture accepts all possible specifications.
  • The union, concatenation, and Kleene star of arbitrary specifications are invariably accepted due to the closure properties of CFLs, which allows specifications to be composed.
  • The transitions between different parts of the lexical syntax can be fluid; there is no need to surround them with sentinels. That means that multiple sublanguages can be possible in a certain context, and it can be left to the parser to determine which one it is parsing. For example, if we construct a specification that is a union of Pascal and C, the parser can determine if the input string contains Pascal or C.
  • Capturing subpatterns in regular expressions are not required [37]. There is no need to treat parts of the lexical syntax as an opaque block and then extract certain parts out of it; in the case of our architecture, they can be treated as an embedded sublanguage.
  • A simple error recovery resembling error tokens is supported without extensions, for example, where ; acts as a resynchronization token:
    S : : = A b | c b
    A : : = d
    r b = ;   r c = ¬ ( Ω * ; )   r d = x   r $ = $
  • When writing a specification with inserted white space, multiple types of white-space can be used, for example:
    r w = * r v = +
    Even if they both occur in the same context, it does not pose a problem.
This list is of course not exhaustive. The downside of our architecture is that while it can accept any possible specification, parsing can become slow when the number of possible interpretations becomes too large. The techniques developed to resolve this issue in the scannerless architectures [4,5,27,28,29] can be applied readily and we will consider adding them in future work.
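The error-recovery idiom from the list above can be simulated in isolation. In that specification, r c matches input containing no ; , so after an error the scanner effectively skips to the next resynchronization token. The following standalone sketch only illustrates this skipping effect; it is not part of the architecture itself.

```python
def skip_to_resync(text, pos, resync=";"):
    """Skip characters until the resynchronization token is found.

    Mirrors the effect of a regular definition that matches any run of
    characters containing no ';' (an assumption made for illustration).
    """
    while pos < len(text) and text[pos] != resync:
        pos += 1
    return pos  # position of ';' (or end of input)

text = "x y z ; x ;"
# Suppose parsing fails at position 2 ('y'); recovery skips to the ';'.
print(skip_to_resync(text, 2))  # 6
```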

9. Related Work

The related work can be grouped into scannerless and scanner-based methods. Our architecture can be categorized into the second group. The scanner-based methods can be deterministic or non-deterministic. The former use a variety of disambiguation rules to resolve lexical conflicts in the scanner, while the latter offload the problem to the parser. How the methods are interrelated is shown in Figure 27.
The earliest deterministic scanner-based method we are aware of is the work by Nawrocki [9]. Nawrocki categorized the types of lexical conflicts:
  • The longest match conflicts, which are caused by the use of the longest match disambiguation rule.
  • Identity conflicts, which are caused when the languages of the automata are not pairwise disjoint.
Additionally, lexical conflicts can occur when the languages of the automata are not prefix-free or pairwise prefix-disjoint, without causing the longest match conflicts. In that case, the longest match results in a correct parse. This was also identified by Denny [7]. Nawrocki presented two methods of resolving the lexical conflicts. The first one is called the left context LC : T $ P ( S ) , which is equivalent to an inverse of the valid lookahead ν . The second one is called the extended left context ELC : T $ × T $ P ( S ) , which is equivalent to an inverse of the valid lookahead–lookahead ξ . Because the functions are inverted, the tests s LC ( t ) and s ELC ( l , l ) are performed instead of t ν ( s ) and l ξ ( s , l ) , which is inconsequential. The LC is used in the same way as in the subsequent works. When the handle-finding automaton is in state s, the symbol t T $ is matched if s LC ( t ) . The method is used only if a single match is found in every state; that is when l T $ , LC ( t ) LC ( l ) = . Otherwise, ELC is used to disambiguate the matches further. The conflicts between t , l T $ for all l T $ are found as in (6). The method is used only if LC ( X ) l T $ ELC ( l , l ) . When the handle-finding automaton is in state s, the symbol l T $ is matched if s l T $ ELC ( l , l ) .
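The relationship between the valid lookahead ν and Nawrocki’s left context LC can be illustrated with a toy example; the states and terminals below are invented for illustration only. Because LC is the inverse map of ν , the tests s LC ( t ) and t ν ( s ) are interchangeable, as stated above.

```python
from collections import defaultdict

# Hypothetical valid-lookahead table: state -> set of valid terminals.
nu = {
    "s0": {"id", "num"},
    "s1": {"+", "$"},
}

# The left context LC is simply the inverse map: terminal -> states.
LC = defaultdict(set)
for state, terminals in nu.items():
    for t in terminals:
        LC[t].add(state)

# The two membership tests are equivalent, as noted in the text.
assert ("s0" in LC["id"]) == ("id" in nu["s0"])
print(sorted(LC["+"]))  # ['s1']
```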
As identified by Denny [7] and to some extent by Keynes [8], the methods are limited. It is required that a single match is found in all handle-finding automaton states; otherwise, they cannot be used. This limitation can be lifted by further disambiguating the matched symbols using priority, as is done in context-aware scanning [6,18] and PSLR [7]. In the case of our architecture, no further disambiguation is needed.
They also cannot be used to decide when to pass a match to the parser. Let us demonstrate the problem with an example:
S : : = a b c
r a = / ∗   r b = ¬ ( Ω * ∗ / Ω * )   r c = ∗ /   r $ = $
The longest match conflict only really occurs when the rest of the input string is ω = ∗ / . If the longest match disambiguation rule is used, M b would also recognize ∗ and an error would be raised. Therefore, a match must be passed to the parser beforehand. Just the information that a conflict is possible is useless in deciding when to pass the match to the parser. This was also identified by Denny [7], although the solution of additionally providing the shortest match disambiguation rule does not solve the problem.
We used ξ as an optimization, which is equivalent to ELC as already mentioned. We used it to construct the scanner lookahead ξ . Because we used one character of the scanner lookahead, our method required a buffer of one character, exactly the same as the longest match disambiguation rule. Nonetheless, our method is able to resolve the problem in the specification (8). When ∗ follows in the rest of the input, a match is passed to the parser. Note that a match is also passed to the parser for each Ω . This is not an issue, because, without the use of this optimization, the match is always passed.
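The failure of the longest match rule on specification (8) can be reproduced with plain string operations. This standalone sketch stands in for the comment-body automaton M b , whose language (as specified) contains no occurrence of the closing delimiter:

```python
def longest_body(text):
    """Longest prefix of text that does not contain the closing '*/'.

    A simplified stand-in for the automaton M_b from specification (8);
    the quadratic prefix scan is for clarity, not efficiency.
    """
    best = ""
    for i in range(len(text) + 1):
        if "*/" not in text[:i]:
            best = text[:i]
    return best

rest = "abc*/"
munched = longest_body(rest)
print(repr(munched))        # 'abc*' -- the '*' is wrongly consumed
print(rest[len(munched):])  # '/'    -- no token matches a lone '/'
```

Passing the match "abc" to the parser as soon as the lookahead shows ∗ , as the scanner lookahead optimization does, avoids consuming the ∗ and lets the closing delimiter be recognized.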
Keynes [8] presents a deterministic architecture, which uses a scanner construction approach, where the labels m and p are stored in the edges, instead of the states. The construction is equivalent to first directly constructing the automata ( M ˙ s ) s S for each handle-finding automaton state. Additionally, each edge q s a q s is labeled with the symbols that can be matched by starting again from the start state l p ( q s , a q s ) . Then, the lookahead symbols t m ( q s a q s ) are removed if l ξ ( s , t ) . That way, the scanner lookahead is incorporated into the automaton. That is possible because the labels are stored in the edges. The automata are then collapsed into a single automaton using minimization and a scanner mode is created for each of them. The scanner mode is then chosen based on the handle-finding automaton state. As noted by Keynes, the downside of the approach is that the construction method is inefficient [8]. Additionally, if we follow the disambiguation rules as given by Keynes, they cannot resolve the problem in (8). Keynes concludes that whether the matched symbol is passed to the parser or not is inconsequential as the conflict is unresolvable.
The PSLR by Denny [7] presents a deterministic architecture, which uses a scanner construction approach similar to the one used in context-aware scanning. Denny proposes a modified IELR handle-finding automaton construction method, which resolves the lexical conflicts that arise when weaker handle-finding automaton construction methods are used. The resulting handle-finding automaton has fewer states than the one constructed using LR(1). This would be the optimal method to use in combination with our architecture. However, it is not simple to implement; therefore, we used LALR instead.
Context-aware scanning by Van Wyk [6] and Schwerdfeger [6,18] presents the groundwork for our architecture. As already mentioned, our architecture degenerates to it when it is used for deterministic languages. We used the scanner construction approach described by Van Wyk [6] and Schwerdfeger [6,18]. Van Wyk and Schwerdfeger also suggest the architecture can be used in combination with GLR. Tree-sitter [34] is an incremental parser generator based on this idea. However, simply combining the approaches is not fully general.
The earliest non-deterministic method we are aware of is mentioned in the presentation of the GLR algorithm by Tomita [15,16]. As already mentioned, the lexical conflicts are treated as a parser conflict; therefore, the actions are performed for all matched symbols. The method was further discussed and termed Schrödinger’s token by Aycock and Horspool [3]. Our architecture can be viewed as a direct extension that additionally handles overlapping lexemes of different lengths. To maximize the time spent in the scanner and the length of lookahead, our architecture requires lexical feedback. As identified by Aycock and Horspool, this means that the phases cannot be parallelized. We agree that this is indeed a downside.
Keynes [8] conjectures a non-deterministic method termed generalized scanning. A generalized scanner would append the tokens to a list every time it enters a final state, and continue scanning until it enters a dead state. Then, the parser would process the tokens in the list, and the generalized scanner would move to the location of the match for the first token in the list and repeat the process. The generalized scanner would necessarily process characters multiple times in the case of lexical conflicts, and would require the entire string to be available in memory. Keynes noted that such an architecture would most likely be inefficient, and we agree with that assessment. Our architecture is a practical realization of the broader idea; for this reason, the GS in RNGSGLR stands for generalized scanning. It should be clear that our architecture does not actually work as conjectured by Keynes, as the parser “moves” with the scanner, and in that way the list is not required.
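Keynes’s conjectured generalized scanner, as paraphrased above, can be rendered almost literally. The recognizers are simplified here to regular expressions tried over every prefix, which also makes the repeated processing of characters visible; the token set is invented for illustration.

```python
import re

# Illustrative recognizers, one per terminal symbol (assumed names).
TOKENS = [("int", re.compile(r"[0-9]+")),
          ("float", re.compile(r"[0-9]+\.[0-9]+"))]

def generalized_scan(text, pos):
    """Collect every (symbol, end) match starting at pos.

    Each recognizer re-examines the characters from pos for every
    candidate end position -- the inefficiency Keynes anticipated.
    """
    matches = []
    for name, pattern in TOKENS:
        for end in range(pos + 1, len(text) + 1):
            if pattern.fullmatch(text, pos, end):
                matches.append((name, end))
    return matches

# '1.1' yields both the integer '1' and the floating-point number '1.1'.
print(generalized_scan("1.1", 0))  # [('int', 1), ('float', 3)]
```

Our architecture reports the same set of interpretations but, because the parser “moves” with the scanner, each character is examined exactly once and no list of pending tokens is needed.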
Character-level parsing was first proposed by Salomon and Cormack [4]. For this purpose, they extended the NSLR parser with exclusion and adjacency restriction disambiguation rules. NSLR is a non-canonical parser, which means that, in addition to terminal symbols, non-terminal symbols can be used as the lookahead when processing the reduce actions. Our architecture is similar in the sense that symbols in T , which are used as lookahead, are non-terminal symbols in G ^ . The difference is that they are recognized by the scanner instead of the parser, and that in the case of lexical conflicts, they are only recognized partially, which we termed fractional lookahead. The exclusion rules implement the differences between the languages. In the case of our architecture, the difference is only available to the regular portion of the grammar in the form of extended regular expressions. The worst-case time complexity of the parser is | ω | , which is better than our architecture; however, it is not fully general, and for any given CFG, it cannot accept all corresponding character-level CFLs.
SGLR by Visser [5] is a character-level parser based on GLR. It uses the disambiguation rules introduced in character-level NSLR [4,5,38]. The parser is not non-canonical. Consequently, the reduction order is more predictable, although the worst-case complexity of the parser is the same as GLR | ω | γ ^ + 1 and our architecture. This is the only other practical approach we are aware of that is fully general, and for any given CFG can accept all corresponding character-level CFLs. To parse Ξ , it instead parses the corresponding character-level CFG G ^ . Due to the use of exclusion rules in the context-free part of the grammar, it can also parse some of the context-sensitive languages. The downside of SGLR is that it uses a parser for the lexical portion of the grammar. As a result, the characters in the input string are processed multiple times—once when the shift action is performed and then again when (possibly) multiple reduce actions are performed and the parser searches for the paths passing the edge created by the shift action. In the case of our architecture, each character is processed exactly once. Since a parser is used as a scanner, the worst-case time complexity is | ω | γ ^ + 1 , while, in the case of our architecture, it is | ω | 2 . The lookahead in SGLR is a single character, while, in our case, it can be multiple characters. In the case of deterministic languages, the lookahead is the whole lexeme; therefore, the same as in deterministic parsing. We only used a single character of lookahead as the scanner lookahead. The original description of the SGLR parser uses the solution by Farshi [16,39] to parse ε -grammars. Since this solution is less efficient, the SRNGLR was developed by Economopoulos [11], which is instead based on RNGLR.
XGLR (by Begel and Graham [14]) is a non-deterministic architecture based on GLR that is the most closely related to ours. The scanner needs to be constructed manually. The grammar writer can provide multiple scanner modes. These are required to be lexically unambiguous. The scanner mode is then chosen based on the handle-finding automaton state. Whenever a lexical conflict is encountered, the parser process is split, and each is associated with a scanner mode. That way, the parser processes end up in different positions when the lexemes are of different lengths [14]. Consequently, each character is processed multiple times, and the entire input string must be available in memory. For example, if a regular expression Ω + is used in one of the modes, the rest of the input string will be traversed immediately by that parser. The others will then process the same characters again, as they follow it. In contrast, our scanner processes the characters one by one, and each is processed exactly once because the synchronization is preserved. The automaton recognizing Ω + runs alongside other automata until the end of the input string. Because each parser process is associated with a scanner mode, this poses additional difficulties when they are later merged. The parsers also cannot share some of the data structures once they split [14]. Our architecture avoids this, as each character is processed exactly once and the synchronization is preserved. Because the parser processes are synchronized and the handle-finding automaton and the scanner states are not directly associated, no special considerations are needed when they are merged. The benefit of XGLR is that a whole lexeme is used as the lookahead, even when the grammar is heavily non-deterministic. Since XGLR uses the longest match disambiguation rule and scanner modes need to be provided manually, it is not fully general and for any given CFG cannot accept all corresponding character-level CFLs.
LAMB and Fence by Quesada [12,13] form a non-deterministic architecture based on a chart parser that is able to resolve lexical conflicts in a similar way to a generalized scanner, as conjectured by Keynes [8]. The scanner is based on top-down regular expression recognizers, one for each terminal symbol. The recognizers are run one by one at each position i I , where I { 0 , , | ω | } . This means that, in the case of lexical conflicts, the characters are necessarily processed multiple times, and the entire input string must be available in memory. For example, if a regular expression Ω + is used, the rest of the input string will be traversed from each position i I . The scanner has two variants. When using the first one, at each position i I the recognizers search for the longest match, the positions of the matches are recorded, and the scanner moves to the closest one. When using the second one, at each position i I , all possible matches are considered, and the recognizers are run at each position in the input string I = { 0 , , | ω | } . To prevent duplicates, the scanner needs to keep track of tokens that are already found at each position in the input string. The worst-case time complexity of both variants is | ω | 2 , the same as our architecture. However, the first one is not fully general and the second one always considers all interpretations as in (7), even when they are not possible; therefore, the complexity is bound to be | ω | 2 even in the best case. The scanner generates a graph of all transitions, which is passed to the parser, where it is used to initialize the chart and the parsing commences. The input string is traversed fully by the scanner before the parsing takes place. That means the passes are completely disjoint; therefore, contextual information cannot be passed from the parser to the scanner. As a result, some of the traversals must be discarded immediately by the parser. That only occasionally happens when our architecture is used in combination with weaker handle-finding automaton construction methods, such as LALR and SLR.
The parser by Borsotti [40] is an extension of GLR that can parse the extended CFGs directly, without first translating them to CFGs. It would seem that such a parser could parse the lexical part of the grammar as well; however, this is not the case. This was also identified by the authors who stated that the parser should be used in combination with the traditional parsing architecture [40]. The parser works by constructing an automaton for each right-hand-side of the production, which is then combined into the handle-finding automaton. The handle-finding automaton also includes links in the reverse direction. When performing the reduce actions the search for the paths in the GSS is performed by following these links. Consequently, each character would be processed multiple times, once when performing the shift action, and then again when performing (possibly) multiple reduce actions.
The faster GLR parsing by Aycock and Horspool [1,41] and the generalized regular (GR) parsing by Johnstone and Scott [42] are similar to our architecture in that they improve the performance of GLR parsing using an automaton. The parsers can be used at the character level, and for any given CFG, accept all corresponding character-level CFLs (the same as our architecture). These parsers do not perform shift and reduce actions as is customary with LR parsing. Instead, the parsing process is driven mostly by the automaton and the stack is used similar to the call stack for subroutines. When searching for a non-terminal symbol, the next state is pushed on the stack, and when the non-terminal symbol is recognized, and the parser continues from the popped state. As the use of the stack is reduced, these parsers are efficient; however, as a trade-off, the automaton is prohibitively large. Our architecture completely avoids the use of the stack for the lexical syntax, otherwise it functions similar to the RNGLR parser. That way, the tables for the parser are almost the same as in RNGLR parsing and the tables for the scanner are the same as in context-aware scanning.

10. Conclusions

In this paper, we presented a novel scanner-based architecture, which, for any given CFG, accepts all corresponding character-level CFLs. Our architecture lifts the restrictions of the existing scanner-based methods. It does not require any disambiguation rules for resolving lexical conflicts. However, as our architecture captures all lexical ambiguities, for certain applications disambiguation rules are still useful to limit the number of interpretations. It can directly parse all possible specifications consisting of a grammar and regular definitions. As a result, the grammar writers are not restricted to grammars where the lexical syntax can be resolved deterministically. Grammars are also easier to modify because they are not dependent on a careful selection of disambiguation rules to be parsed. Conceptually, our architecture has an unbounded parser and scanner lookahead. Our architecture is streaming, which means the characters are processed one by one without requiring a buffer.
In future work, we will attempt to incorporate the scanner lookahead into the automata, following Keynes [8], as we expect it to be more efficient. To limit the number of possible interpretations and ease the development of specifications, we will consider adding the adjacency restriction disambiguation rule [4,5,38]. We might consider other extensions, such as conjunctive and Boolean grammars, as well [43]. Our architecture can currently process symbols as they are typed at the end of the input string. Inserting them in the middle or at the beginning is currently not supported. We will attempt to create an incremental version of our architecture, which will support insertions anywhere in the input string. Our architecture could be adapted to other parsing methods that have the same power as GLR, such as GLL [44]. Since our architecture can parse any specification, it can be useful for applications where the specifications are generated procedurally, such as grammar synthesis [45]. We will also extend our metalanguage to support some of the use cases of our architecture better. That is, following [5,6,22], we will enable composing, embedding, and extending specifications.

Author Contributions

Conceptualization, Ž.L.; methodology, Ž.L., M.Č., M.M. and T.K.; software, Ž.L.; validation, M.Č., M.M. and T.K.; formal analysis, Ž.L.; writing—original draft preparation, Ž.L., M.Č., M.M. and T.K.; writing—review and editing, Ž.L., M.Č., M.M. and T.K.; visualization, Ž.L.; supervision, M.M. and T.K. All authors have read and agreed to the published version of the manuscript.

Funding

The authors acknowledge the financial support from the Slovenian Research Agency (Research Core Funding No. P2-0041).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Grune, D.; Jacobs, C.J.H. Parsing Techniques; Springer: Berlin/Heidelberg, Germany, 2008. [Google Scholar]
  2. Aho, A.V.; Lam, M.S.; Sethi, R.; Ullman, J.D. Compilers: Principles, Techniques, and Tools, 2nd ed.; Pearson Education: London, UK, 2006. [Google Scholar]
  3. Aycock, J.; Horspool, N.R. Schrödinger’s token. Softw. Pract. Exp. 2001, 31, 803–814. [Google Scholar] [CrossRef]
  4. Salomon, D.J.; Cormack, G.V. Scannerless NSLR(1) parsing of programming languages. In Proceedings of the ACM SIGPLAN 1989 Conference on Programming Language Design and Implementation, Portland, OR, USA, 19–23 June 1989; Wexelblat, R.L., Ed.; ACM: New York, NY, USA, 1989; Volume 24, pp. 170–178. [Google Scholar]
  5. Visser, E. Scannerless Generalized-LR Parsing; Technical Report; University of Amsterdam, Programming Research Group: Amsterdam, The Netherlands, 1997. [Google Scholar]
  6. Van Wyk, E.R.; Schwerdfeger, A.C. Context-aware scanning for parsing extensible languages. In Proceedings of the GPCE’07—Proceedings of the Sixth International Conference on Generative Programming and Component Engineering, Salzburg, Austria, 1–3 October 2007; Consel, C., Lawall, J., Eds.; ACM: New York, NY, USA, 2007; pp. 63–72. [Google Scholar]
  7. Denny, J.E. PSLR(1): Pseudo-Scannerless Minimal LR(1) for the Deterministic Parsing of Composite Languages. Ph.D. Thesis, Clemson University, Clemson, SC, USA, 2010. [Google Scholar]
  8. Keynes, N. Better Parsing Through Lexical Conflict Resolution. Unpublished Thesis. Available online: http://www.deadcoderemoval.net/files/honours-thesis.pdf (accessed on 4 March 2022).
  9. Nawrocki, J.R. Conflict detection and resolution in a lexical analyzer generator. Inf. Process. Lett. 1991, 38, 323–328. [Google Scholar] [CrossRef]
  10. Dejanović, I. Parglare: A LR/GLR parser for Python. Sci. Comput. Program. 2022, 214, 102734. [Google Scholar] [CrossRef]
  11. Economopoulos, G.; Klint, P.; Vinju, J. Faster scannerless GLR parsing. In Proceedings of the International Conference on Compiler Construction, York, UK, 22–29 March 2009; Franke, B., Ed.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 24, pp. 126–141. [Google Scholar]
  12. Quesada, L.; Berzal, F.; Cortijo, F.J. A lexical analyzer with ambiguity support. In Proceedings of the International Conference on Software and Data Technologies, Seville, Spain, 18–21 July 2011; pp. 297–300. [Google Scholar]
  13. Quesada, L.; Berzal, F.; Cortijo, F.J. Fencing the lamb: A chart parser for modelCC. In Proceedings of the Software and Data Technologies, Rome, Italy, 24–27 July 2012; Cordeiro, J., Hammoudi, S., van Sinderen, M., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 27, pp. 21–35. [Google Scholar]
  14. Begel, A.; Graham, S.L. XGLR: An algorithm for ambiguity in programming languages. Sci. Comput. Program. 2006, 61, 211–227. [Google Scholar] [CrossRef] [Green Version]
  15. Tomita, M. Efficient Parsing for Natural Language; Springer: Berlin/Heidelberg, Germany, 1986. [Google Scholar]
  16. Tomita, M. (Ed.) Generalized LR Parsing; Springer: Berlin/Heidelberg, Germany, 1991. [Google Scholar]
  17. Scott, E.; Johnstone, A. Right nulled GLR parsers. ACM Trans. Program. Lang. Syst. 2006, 28, 577–618. [Google Scholar] [CrossRef]
  18. Schwerdfeger, A.C. Context-Aware Scanning and Determinism-Preserving Grammar Composition, in Theory and Practice. Ph.D. Thesis, University of Minnesota, Minneapolis, MN, USA, 2010. [Google Scholar]
  19. Baxter, I.D.; Pidgeon, C.; Mehlich, M. DMS: Program transformations for practical scalable software evolution. In Proceedings of the 26th International Conference on Software Engineering, Edinburgh, UK, 23–28 May 2004; Titsworth, F., Ed.; IEEE: Los Alamitos, CA, USA, 2004; pp. 625–634. [Google Scholar]
  20. Owens, S.; Reppy, J.; Turon, A. Regular-expression derivatives re-examined. J. Funct. Program. 2009, 19, 173–190. [Google Scholar] [CrossRef] [Green Version]
  21. Régis-Gianas, Y.; Jeannerod, N.; Treinen, R. Morbig: A Static parser for POSIX shell. J. Comput. Lang. 2020, 57, 100944. [Google Scholar] [CrossRef] [Green Version]
  22. Schwerdfeger, A.C.; Wyk, E.R.V. Verifiable composition of deterministic grammars. In Proceedings of the 30th ACM SIGPLAN Conference on Programming Language Design and Implementation, Dublin, Ireland, 15–21 June 2009; Hind, M., Diwan, A., Eds.; ACM: New York, NY, USA, 2009; Volume 44, pp. 199–210. [Google Scholar]
  23. Mernik, M.; Heering, J.; Sloane, A.M. When and how to develop domain-specific languages. ACM Comput. Surv. (CSUR) 2005, 37, 316–344. [Google Scholar] [CrossRef] [Green Version]
  24. Dageförde, J.C.; Kuchen, H. A compiler and virtual machine for constraint-logic object-oriented programming with Muli. J. Comput. Lang. 2019, 53, 63–78. [Google Scholar] [CrossRef]
  25. Mernik, M. An object-oriented approach to language compositions for software language engineering. J. Syst. Softw. 2013, 86, 2451–2464. [Google Scholar] [CrossRef]
  26. Fister, I.; Kosar, T.; Fister, I.; Mernik, M. EasyTime++: A case study on incremental domain-specific language development. Inf. Technol. Control. 2013, 42, 77–85. [Google Scholar] [CrossRef] [Green Version]
  27. Van den Brand, M.; van Deursen, A.; Heering, J.; de Jong, H.; de Jonge, M.; Kuipers, T.; Klint, P.; Moonen, L.; Olivier, P.; Scheerder, J.; et al. The Asf+Sdf Meta-Environment: A Component-Based Language Development Environment. Electron. Notes Theor. Comput. Sci. 2001, 44, 3–8. [Google Scholar] [CrossRef] [Green Version]
  28. Kats, L.C.; Visser, E. The spoofax language workbench: Rules for declarative specification of Languages and IDEs. In Proceedings of the ACM International Conference on Object Oriented Programming Systems Languages and Applications, Reno/Tahoe, NV, USA, 17–21 October 2010; Association for Computing Machinery: New York, NY, USA, 2010; pp. 444–463. [Google Scholar]
  29. Klint, P.; van der Storm, T.; Vinju, J. RASCAL: A domain specific language for source code analysis and manipulation. In Proceedings of the 2009 Ninth IEEE International Working Conference on Source Code Analysis and Manipulation, Edmonton, AB, Canada, 20–21 September 2009; pp. 168–177. [Google Scholar]
  30. Ma, L.; Ren, H.; Zhang, X. Effective Cascade Dual-Decoder Model for Joint Entity and Relation Extraction. Unpublished Paper. Available online: https://arxiv.org/pdf/2106.14163 (accessed on 16 June 2022).
  31. Krausová, M. Prefix-free regular languages: Closure properties, difference, and left quotient. In Proceedings of the Mathematical and Engineering Methods in Computer Science, Znojmo, Czech Republic, 25–28 October 2012; Kotásek, Z., Bouda, J., Černá, I., Sekanina, L., Vojnar, T., Antoš, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 7119. [Google Scholar]
  32. Redziejowski, R.R. From EBNF to PEG. Fundam. Inform. 2013, 128, 177–191. [Google Scholar] [CrossRef]
  33. Watson, B.W. A Taxonomy of Finite Automata Construction Algorithms; Technical Report; Eindhoven University of Technology: Eindhoven, The Netherlands, 1993. [Google Scholar]
  34. Brunsfeld, M. Tree-Sitter. Available online: https://tree-sitter.github.io/tree-sitter/ (accessed on 4 March 2022).
  35. Kipps, J.R. GLR parsing in time O(n3). In Generalized LR Parsing; Tomita, M., Ed.; Springer: Berlin/Heidelberg, Germany, 1991; pp. 43–59. [Google Scholar]
  36. Rekers, J.G. Parser Generation for Interactive Environments. Ph.D. Thesis, University of Amsterdam, Amsterdam, The Netherlands, 1992. [Google Scholar]
  37. Laurikari, V. NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions. In Proceedings of the Seventh International Symposium on String Processing and Information Retrieval, A Coruña, Spain, 27–29 September 2000; pp. 181–187. [Google Scholar]
  38. Brand, M.G.J.V.; Scheerder, J.; Vinju, J.J.; Visser, E. Disambiguation filters for scannerless generalized LR parsers. In Proceedings of the International Conference on Compiler Construction, Grenoble, France, 8–12 April 2002; Horspool, N.R., Ed.; Springer: Berlin/Heidelberg, Germany, 2002; pp. 143–158. [Google Scholar]
  39. Nozohoor-Farshi, R. GLR Parsing for ε-Grammars. In Generalized LR Parsing; Tomita, M., Ed.; Springer: Berlin/Heidelberg, Germany, 1991; pp. 61–75. [Google Scholar]
  40. Borsotti, A.; Breveglieri, L.; Crespi Reghizzi, S.; Morzenti, A. Fast GLR parsers for extended BNF grammars and transition networks. J. Comput. Lang. 2021, 64, 101035. [Google Scholar] [CrossRef]
  41. Aycock, J.; Horspool, N.R. Faster generalized LR parsing. In Proceedings of the 8th International Conference, Amsterdam, The Netherlands, 22–28 March 1999; Jähnichen, S., Ed.; Springer: Berlin/Heidelberg, Germany, 1999; pp. 32–46. [Google Scholar]
  42. Johnstone, A.; Scott, E. Generalised regular parsers. In Proceedings of the 12th International Conference, Warsaw, Poland, 7–11 April 2003; Hedin, G., Ed.; Springer: Berlin/Heidelberg, Germany, 2003; pp. 232–246. [Google Scholar]
  43. Okhotin, A. Describing the Syntax of Programming Languages Using Conjunctive and Boolean Grammars. Unpublished Paper. Available online: https://arxiv.org/abs/2012.03538 (accessed on 10 March 2022).
  44. Van Binsbergen, L.T.; Scott, E.; Johnstone, A. Purely functional GLL parsing. J. Comput. Lang. 2020, 58, 100945. [Google Scholar] [CrossRef]
  45. Saffran, J.; Barbosa, H.; Pereira, F.M.Q.; Vladamani, S. On-line synthesis of parsers for string events. J. Comput. Lang. 2021, 62, 101022. [Google Scholar] [CrossRef]
Figure 1. The handle-finding automaton for the specification Ξ in (1), constructed using G.
Figure 2. The scanner automaton Ṁ_T$ = M_b | M_c | M_$ for the specification Ξ in (1).
Figure 3. The GSS constructed by the traditional parsing architecture using the specification Ξ in (1).
Figure 4. The scanner trace outlining the paths taken through the scanner automaton using the specification Ξ in (1).
Figure 5. The handle-finding automaton for the specification Ξ in (2), constructed using G.
Figure 6. The scanner automaton Ṁ_T$ = M_c | M_d | M_e | M_f | M_g | M_$ for the specification Ξ in (2).
Figure 7. The GSS constructed by the context-aware scanning architecture using the specification Ξ in (2).
Figure 8. The scanner trace outlining the paths taken through the scanner automaton using the specification Ξ in (2).
Figure 9. Zoomed-out view of the GSS for the specification Ξ in (2) only displaying the edges resulting from shift actions.
Figure 10. The handle-finding automaton for the specification Ξ in (3), constructed using G.
Figure 11. The scanner automaton Ṁ_T$ = M_b | M_c | M_d | M_e | M_$ for the specification Ξ in (3).
Figure 12. The GSS constructed by our architecture using the specification Ξ in (3).
Figure 13. The scanner trace outlining the paths taken through the scanner automaton using the specification Ξ in (3).
Figure 14. Zoomed-out view of the GSS for the specification Ξ in (3) only displaying the edges resulting from shift actions.
Figure 15. The handle-finding automaton for the specification Ξ in (4), constructed using G.
Figure 16. The scanner automaton Ṁ_T$ = M_c | M_d | M_e | M_$ for the specification Ξ in (4).
Figure 17. The GSS constructed by our architecture using the specification Ξ in (4).
Figure 18. The scanner trace outlining the paths taken through the scanner automaton using the specification Ξ in (4).
Figure 19. Zoomed-out view of the GSS for the specification Ξ in (4) only displaying the edges resulting from shift actions.
Figure 20. The SPPF for the specification Ξ in (3) and ω = xyz.
Figure 21. The GSS constructed by our architecture using the specification Ξ in (5).
Figure 22. The SPPF for the specification Ξ in (5) and ω = x.
Figure 26. The relevant part of Γ(1) after the ε-reduce action is performed by the character-level RNGLR and the corresponding part of Γ(2) after the ε-shift action is performed by our architecture.
Figure 27. The dependency graph of the related work. Each work is accompanied by its year of publication.
Table 1. The sets of states P_{i,0} and the valid lookahead sets ν(P_{i,0}) for the specification Ξ in (2).

i | P_{i,0} | ν(P_{i,0})
0 | 0 | c
1 | 8 | e
2 | 3, 6, 9 | d, f, g
3 | 4 | $
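The valid lookahead sets in Table 1 are what make the scanning context-aware: for the current set of live parser states, only the scanner automata for terminals in ν(P_{i,0}) need to be run. A minimal sketch, in which the function name and dictionary encoding are our own illustration while the data is taken from Table 1:

```python
# Valid lookahead sets ν(P_i,0) from Table 1 (specification Ξ in (2)),
# keyed by the frozenset of live parser states P_i,0.
VALID_LOOKAHEAD = {
    frozenset({0}): {"c"},
    frozenset({8}): {"e"},
    frozenset({3, 6, 9}): {"d", "f", "g"},
    frozenset({4}): {"$"},
}


def candidate_terminals(live_states):
    """Terminals the scanner is allowed to recognize in this parse context."""
    return VALID_LOOKAHEAD.get(frozenset(live_states), set())


print(candidate_terminals({3, 6, 9}))  # only M_d, M_f, M_g are consulted
```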
Table 2. The sets of states P_{i,0} and the valid lookahead sets ν(P_{i,0}) for the specification Ξ in (3).

i | P_{i,0} | ν(P_{i,0})
0 | 0 | b
1 | 4 | c, e
2 | 5 | d
3 | 3, 6 | $
Table 3. The action table T_A for the specification Ξ in (4). Columns: c, d, e, $.

0: s; s; ε-S_c, R(S, 0), A
1: a
2: R(S, 1)
3: s
4: R(S, 2)
5: ε-S_e, S; ε-S_e, R(A, 1)
6: R(B, 0); R(B, 0), R(A, 2)
7: S; ε-S_e, R(A, 3)
8: R(A, 4)
Table 4. The goto table T_G for the specification Ξ in (4).

S  A  B  c  d  e  $
01 23
1
2
3 4 5
4
5 6
6 7
7 8
8
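Action and goto tables such as Tables 3 and 4 drive the parser's main loop. The following sketch shows the classical deterministic loop on hypothetical toy tables for the grammar S → c S | d (not the entries of Tables 3 and 4); a GLR parser generalizes this loop by following every applicable action on a graph-structured stack instead of failing on conflicts.

```python
# Hypothetical toy LR tables for S -> c S | d, illustrating how an action
# table (shift/reduce/accept) and a goto table drive a parse loop.
ACTION = {
    (0, "c"): ("s", 1), (0, "d"): ("s", 2),
    (1, "c"): ("s", 1), (1, "d"): ("s", 2),
    (2, "$"): ("r", "S", 1),   # reduce by S -> d
    (3, "$"): ("r", "S", 2),   # reduce by S -> c S
    (4, "$"): ("acc",),
}
GOTO = {(0, "S"): 4, (1, "S"): 3}


def parse(tokens):
    """Table-driven shift/reduce loop; returns True iff the input is accepted."""
    stack = [0]                    # stack of parser states
    stream = list(tokens) + ["$"]  # end-of-input marker
    i = 0
    while True:
        act = ACTION.get((stack[-1], stream[i]))
        if act is None:            # no table entry: syntax error
            return False
        if act[0] == "s":          # shift: push next state, consume token
            stack.append(act[1])
            i += 1
        elif act[0] == "r":        # reduce: pop |rhs| states, then goto
            _, lhs, rhs_len = act
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                      # accept
            return True


print(parse("ccd"))  # True
```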
Table 5. The sets of states P_{i,0}, the valid lookahead sets ν(P_{i,0}), and the sets Λ_i and B_i for the specification Ξ in (4).

i | P_{i,0} | ν(P_{i,0}) | Λ_i | B_i
0 | 0 | c, d, $ | c, d | c, d
1 | 2, 3 | d, $ | d | c, d
2 | 2, 5 | e, $ | e | c, e
3 | 2, 6, 8 | e, $ | $ | $
Table 6. The action table T_A for the specification Ξ in (5). Columns: d, e, $.

0: ε-S_d, S; s; ε-S_d, R(A, 0, dA), R(A, 0, ε), R(S, 0, A), A
1: a
2: R(S, 1, ε)
3: ε-S_d, S; ε-S_d, R(A, 1, A), R(A, 0, dA), R(A, 0, ε)
4: R(A, 2, ε)
5: R(C, 0, ε), R(B, 0, CC), R(S, 1, B)
6: R(S, 2, ε)
7: R(C, 0, ε), R(B, 1, C)
8: R(B, 2, ε)
Table 7. The goto table T_G for the specification Ξ in (5).

S  A  B  C  d  e  $
012 35
1
2
3 4 3
4
5 67
6
7 8
8
Table 8. The sets of states P_{i,0}, the valid lookahead sets ν(P_{i,0}), and the sets Λ_i and B_i for the specification Ξ in (5).

i | P_{i,0} | ν(P_{i,0}) | Λ_i | B_i
0 | 0 | d, e, $ | d, e | d, e
1 | 3, 5 | d, $ | $ | $
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Leber, Ž.; Črepinšek, M.; Mernik, M.; Kosar, T. RNGSGLR: Generalization of the Context-Aware Scanning Architecture for All Character-Level Context-Free Languages. Mathematics 2022, 10, 2436. https://doi.org/10.3390/math10142436
