Article
Peer-Review Record

Parsing Expression Grammars and Their Induction Algorithm

Appl. Sci. 2020, 10(23), 8747; https://doi.org/10.3390/app10238747
by Wojciech Wieczorek 1,*, Olgierd Unold 2 and Łukasz Strąk 3
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 27 October 2020 / Revised: 1 December 2020 / Accepted: 3 December 2020 / Published: 7 December 2020
(This article belongs to the Special Issue Applied Machine Learning)

Round 1

Reviewer 1 Report

The authors present a work on grammatical inference (GI), intended as the task of finding the rules of a (formal) grammar given a set of observations (i.e., words and phrases). Interestingly, this has several applications, such as the analysis of amyloidogenic sequence fragments that are of interest in the study of neurodegenerative diseases, as the authors suggest.

The work is based on parsing expression grammars (PEGs); the main contributions are an algorithm efficient enough to deal with real biological data, a proper comparison with selected GI algorithms and a machine learning approach (SVM), and an actual Python library for handling PEGs to be made available to the community.

The topic is not new, and the ideas do not appear to be novel; nevertheless, the work is of interest. Furthermore, the paper is well written and presented; proper background is given, along with fair coverage of related work.

Results look sound, and experiments show the viability of the approach.

We have some minor issues:

1) The authors should perform a more thorough examination of related work (see, e.g., Moss A., LATA 2020).

2) The authors should assess the viability of the approach by conducting a more detailed experimental analysis and by better motivating the choice of the compared methods.

Author Response

Please see the attachment.

Author Response File: Author Response.pdf

Reviewer 2 Report

This paper discusses grammatical inference for a class of grammars called PEGs, introduced by Ford in 2004. The main idea is to use genetic programming to learn a grammar for a finite set of strings.

The claim in the introduction that PEGs are as fast as FSAs is questionable and should be substantiated or toned down.

There are some issues with clarity in the presentation. The choice of notation is unfortunate, and I would recommend being consistent with the literature (e.g., the book by Hopcroft et al., which the authors reference) and clearer. For instance:

  • Generally, the arrow in a rule points from the nonterminal to its replacement, as in S -> aSA.
  • Generally, the single arrow -> is used for writing rules, whereas => is used for derivation steps.
  • The change in font in the rules of the context-free grammar at the end of p. 2 should be rectified.

The definition of the language of a PEG is given on line 109, but it differs fundamentally from that of the much better-known CFGs: in particular, the CFG analogue would be consume(s,x)=|x|. I think this is worth noting and emphasizing in the paper – PEGs accept any word x for which the rules consume some prefix of x. This is especially helpful in understanding Fig. 3 – the resulting expression e consumes only prefixes of length 2, yet all the words in A are of length 4.
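To make the prefix semantics concrete, here is a minimal sketch of a PEG recognizer in Python (illustrative only – apart from the name consume, which follows line 109, the tuple encoding and everything else are my own, not the authors' library):

    # Minimal sketch of PEG recognition semantics. consume(e, x) returns
    # the length of the prefix of x consumed by expression e, or -1 on failure.
    def consume(e, x):
        kind = e[0]
        if kind == "lit":                      # terminal symbol
            return 1 if x[:1] == e[1] else -1
        if kind == "seq":                      # concatenation (the paper's >>)
            n = consume(e[1], x)
            if n < 0:
                return -1
            m = consume(e[2], x[n:])
            return -1 if m < 0 else n + m
        if kind == "alt":                      # ordered choice |
            n = consume(e[1], x)
            return n if n >= 0 else consume(e[2], x)
        if kind == "not":                      # negative lookahead ~
            return -1 if consume(e[1], x) >= 0 else 0
        raise ValueError(kind)

    # The expression for 'ab' consumes the length-2 prefix of 'abcd', so
    # 'abcd' is accepted under prefix semantics, whereas the CFG-style
    # analogue would additionally require consume(s, x) == len(x).
    ab = ("seq", ("lit", "a"), ("lit", "b"))
    print(consume(ab, "abcd"))   # 2  -> accepted (a prefix was consumed)
    print(consume(ab, "ba"))     # -1 -> rejected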

The organization of the paper must be improved. Section 2.1, for instance, seems out of place in Section 2 – it is not part of the introduction to PEGs. Further, it is unclear what is being evaluated: the implementation of the grammar or the inference tools? Line 119 states “In order to assess this property of the algorithm” – but this is the first sentence of the section, and it is completely unclear which property, and which algorithm, is being referred to.

In line 125, the authors state “As for PEGs, the symbol >> has not been calculated, since it is redundant in an expression.” I believe by redundant the authors mean that “a>>b” is effectively “ab”, but this is not 100% clear. This likely struck the reader much earlier: if the semantics of >> is concatenation, why is new notation used? It detracts from the readability, prevents readers from leveraging their knowledge of CFGs, and keeps them from focusing on the real differences (| and ~). I think the authors should spend some time discussing this when introducing PEGs. (Further, what “not calculated” means in this section is unclear.)
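One plausible explanation, which the authors could state explicitly, is that the grammar is written as an embedded Python DSL, where juxtaposition is not available syntax, so an overloadable operator must stand in for concatenation. A hypothetical sketch (these classes are mine, not the authors' API):

    # Why an embedded Python DSL needs an explicit concatenation operator:
    # 'ab' cannot be written as juxtaposition, so >> stands in for it.
    class Expr:
        def __rshift__(self, other):   # a >> b: sequence (concatenation)
            return Seq(self, other)
        def __or__(self, other):       # a | b: ordered choice
            return Alt(self, other)

    class Lit(Expr):
        def __init__(self, ch): self.ch = ch
        def __repr__(self): return self.ch

    class Seq(Expr):
        def __init__(self, a, b): self.a, self.b = a, b
        def __repr__(self): return f"{self.a!r}{self.b!r}"   # prints as plain concatenation

    class Alt(Expr):
        def __init__(self, a, b): self.a, self.b = a, b
        def __repr__(self): return f"({self.a!r}|{self.b!r})"

    a, b = Lit("a"), Lit("b")
    print(a >> b)       # ab     -- semantically just concatenation
    print(a >> b | a)   # (ab|a) -- >> binds tighter than |, as in Python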

The presentation of Figure 2 is not good: there is no consistency in when the left or the right branch is used with |. For instance, if |a means that a goes in the right branch, then why does |b show b in the left branch?

Algorithm 1 has a few problems:

  • In line 4, I believe it should say “as an empty expression”
  • Line 7 only works if e has some expression in it already
  • The term “addend” is not defined here; I believe the intended condition is “if e is an empty expression” (see the sketch after this list).
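A minimal sketch of this reading (hypothetical – I am reconstructing what lines 4–7 appear to intend, not quoting Algorithm 1):

    # Reading of Algorithm 1 suggested above: e starts as an *empty*
    # expression and addends are attached with ordered choice; the empty
    # case needs its own branch.
    EMPTY = None  # stands for "e is an empty expression" (line 4)

    def attach_addend(e, addend):
        if e is EMPTY:              # line 7 only works past this point
            return addend
        return ("alt", e, addend)   # append the addend via |

    e = EMPTY
    for addend in [("lit", "a"), ("lit", "b")]:
        e = attach_addend(e, addend)
    print(e)   # ('alt', ('lit', 'a'), ('lit', 'b'))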


While not as applied as the current work, the authors should note both conjunctive and Boolean grammars, defined by Okhotin in a series of papers; the inclusion of ~ in PEGs motivates the comparison. There is some work on learning grammars in this category; see Yoshinaka (LATA 2015). In particular, I do not believe the priority of | is used at any point in the genetic algorithm. As such, it seems the main difference over CFGs for this application is the use of ~ – and that is covered by Boolean grammars.

In fact, I think readers deserve a description of when (if at all) the power of PEGs is used in Alg. 1 and of how the crossover operation introduces situations where | or ~ are used in “non-CFG” ways. An example of a crossover that achieves this would be particularly illustrative. Further, an illustration of the gains from the GA would also benefit the reader – what is the change over the 0–6000 iterations in the fitness of the population vs. the lengths of the PEGs in the population? Are the lengths increasing or decreasing over time?
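The diagnostic I have in mind amounts to a few lines per generation; a sketch with stand-ins (population, fitness, and size are assumptions, not the paper's structures):

    # Hypothetical per-generation diagnostic: mean fitness vs. mean PEG length.
    from statistics import mean

    def generation_stats(population, fitness, size):
        return mean(map(fitness, population)), mean(map(size, population))

    # Toy stand-ins reusing the tuple encoding from the earlier sketch;
    # any scoring works for the purposes of the plot.
    size = lambda e: 1 if e[0] == "lit" else 1 + size(e[1]) + size(e[2])
    fitness = lambda e: 1.0 / size(e)   # placeholder score, not the paper's fitness
    toy_population = [("lit", "a"), ("alt", ("lit", "a"), ("lit", "b"))]
    print(generation_stats(toy_population, fitness, size))  # (0.666..., 2.0)

Logged over the 0–6000 iterations, these two curves would show directly whether the PEGs bloat as fitness improves.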

Lines 303 and 314 need a bit more explanation. At a minimum, it should say that “in the case of binary prediction, AUC is equivalent to balanced accuracy.” (Binary classification could also include cases where there is a confidence score and a threshold; there, AUC is calculated from the actual ROC curve and is not equal to balanced accuracy.)
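The equality for hard predictions is easy to verify, e.g. with scikit-learn (assumed available; not the paper's code): with 0/1 predictions the ROC curve has a single operating point, so AUC = (TPR + TNR)/2, i.e., balanced accuracy.

    # AUC equals balanced accuracy for hard binary predictions...
    from sklearn.metrics import roc_auc_score, balanced_accuracy_score

    y_true = [0, 0, 0, 1, 1, 1, 1, 0]
    y_hard = [0, 1, 0, 1, 1, 0, 1, 0]                # hard 0/1 predictions
    print(roc_auc_score(y_true, y_hard))             # 0.75
    print(balanced_accuracy_score(y_true, y_hard))   # 0.75

    # ...but not once confidence scores and a threshold are involved,
    # since AUC then integrates over all possible thresholds:
    y_score = [0.1, 0.6, 0.2, 0.9, 0.7, 0.4, 0.8, 0.3]
    print(roc_auc_score(y_true, y_score))            # 0.9375 != 0.75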

The English grammar could definitely use help. For example: “derivative” (51); the phrasing from line 67 to the end of the page (lack of sentence structure); a comma splice (149); “as not so significant” (204); “let say r_i” (216).


Author Response

Please see the attachment.

Author Response File: Author Response.pdf
