Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers

Kari, Lila; Konstantinidis, Stavros; Kopecki, Steffen; Yang, Meng

doi:10.3390/a11110165


by Lila Kari ¹, Stavros Konstantinidis ²,*, Steffen Kopecki ²,³ and Meng Yang ²

¹ School of Computer Science, University of Waterloo, Waterloo, ON N2L 3G1, Canada

² Department of Mathematics and Computing Science, Saint Mary’s University, Halifax, NS B3H 3C3, Canada

³ Department of Computer Science, University of Western Ontario, London, ON N6A 5B7, Canada

* Author to whom correspondence should be addressed.

Algorithms 2018, 11(11), 165; https://doi.org/10.3390/a11110165

Submission received: 6 October 2018 / Revised: 16 October 2018 / Accepted: 18 October 2018 / Published: 23 October 2018

(This article belongs to the Special Issue String Matching and Its Applications)


Abstract:

The concept of edit distance and its variants has applications in many areas such as computational linguistics, bioinformatics, and synchronization error detection in data communications. Here, we revisit the problem of computing the inner edit distance of a regular language given via a Nondeterministic Finite Automaton (NFA). This problem relates to the inherent maximal error-detecting capability of the language in question. We present two efficient algorithms for solving this problem, both of which execute in time

O (r^{2} n^{2} d)

, where r is the cardinality of the alphabet involved, n is the number of transitions in the given NFA, and d is the computed edit distance. We have implemented one of the two algorithms and present here a set of performance tests. The correctness of the algorithms is based on the connection between word distances and error detection and the fact that nondeterministic transducers can be used to represent the errors (resp., edit operations) involved in error-detection (resp., in word distances).

Keywords:

algorithms; automata; complexity; edit distance; implementation; transducers; regular language

1. Introduction

The concept of edit distance and its variants has applications in many areas such as computational linguistics [1], bioinformatics [2], and synchronization error detection in data communications [3]. The edit distance of a language L with at least two words—also referred to as inner edit distance of L—is the minimum edit distance between any two different words in L. In [4], the author considers the problem of computing the edit distance of a regular language, which is given via a Nondeterministic Finite Automaton (NFA), or a Deterministic Finite Automaton (DFA). For a given automaton

a

with n transitions and an alphabet of r symbols, the algorithm proposed in [4] has worst-case time complexity

O (r^{2} n^{2} q^{2} (q + r)),

(1)

where q is either the number of states in

a

(if

a

is a DFA), or the square of the number of states in

a

(if

a

is an NFA). If the size of the alphabet is ignored and the automaton in question has only states that can be reached from the start state, then the number of states is

O (n)

and the worst-case time complexity shown in (1) can be written as

O (n^{5}) for DFAs, and O (n^{8}) for NFAs .

(2)

In this paper, we present two efficient algorithms to compute the inner edit distance of a regular language given via an NFA with n transitions—see Theorems 1 and 3. Both algorithms, which are called DistErrDetect and DistInpAlter, have the same worst-case time complexity

O (n^{2} d r^{2}),

(3)

where d is the computed distance, which is a significant improvement over the original algorithm in [4].

Our first algorithm, DistErrDetect, is based on the general method of [5] for computing distances via the error-detection property. Now, however, we have an efficient way of realizing algorithmically that general method using an incremental construction of a (nondeterministic) transducer and the test of [6] for partial identity for transducers. In our second algorithm, DistInpAlter, the idea is to model the edit operations of the desired distance using an efficient, in terms of size, input-altering transducer (a transducer whose output is always different from the input used). Please see subsequent sections for definitions of terms. For clarity of presentation, we give in detail not only the new algorithms, but also their preliminary versions PrelimDistErrDetect and PrelimDistInpAlter that could possibly be applied to other types of distances. We have implemented the preliminary and final versions of the second algorithm (PrelimDistInpAlter and DistInpAlter) in Python using the well maintained, open source package FAdo for automata [7]. We have also tested our implementation experimentally, and we present in this paper the outcomes of the tests.

We note that some related problems involving distances between words and languages can be found in [8,9] (edit distance between a word and a language), and in [10,11,12,13,14] (various distances between two languages). Also in [15], the newer concept of edit distance with moves is investigated. The problem considered here is technically different, however, as the desired distance involves different words within the same language. More specifically, if we used directly the tools of [10,11], for instance, to compute an edit string with minimal number of errors between the given language and itself, then that string would simply be an edit string of zero errors, as the edit distance between any word and itself is zero. We also note that the inner prefix distance of a regular language, which is quite different from the inner edit distance, is considered in [16] and computed in time

O (n^{2} \log n)

.

The paper is organized as follows. The next section contains basic notions on languages, word relations, finite-state machines and edit-strings. Section 3 describes the approach of computing the desired edit distance via the concept of error-detection and presents the preliminary version PrelimDistErrDetect of the first algorithm. Then, Section 4 explains the improved and final version DistErrDetect of the first algorithm. In Section 5, it is shown that the edit distance is definable via an efficient input-altering transducer—see Theorem 2—and then the second algorithm DistInpAlter is presented. Section 6 discusses the implementation and testing of the second algorithm and its preliminary version. The last section contains a few concluding remarks. The appendix contains the proofs of two technical lemmata.

2. Notation, Background and Preliminary Results

This section contains basic terminology about formal languages, automata, transducers, and edit strings. Most of the basic notions presented here can be found in various texts such as [17,18,19,20,21].

2.1. Sets, Words, Languages, Channels

The set of positive integers is denoted by

N

. Then,

N_{0} = N \cup {0}

. If S is any set, the expression

| S |

denotes the cardinality of S. We use standard basic notation and terminology for alphabets, words and languages—see [22], for instance. For example,

Σ

denotes an alphabet,

Σ^{+}

the set of nonempty words,

λ

the empty word,

Σ^{*} = Σ^{+} \cup {λ}

,

| w |

the length of the word w. We write

u \leq_{p} w

to indicate that the word u is a prefix of w, that is,

w = u v

for some word v. Then,

u <_{p} w

means that u is a proper prefix, that is,

u \leq_{p} w

and

u \neq w

. We use the concepts of (formal) language and concatenation between words, or languages, in the usual way. We say that w is an L-word if

w \in L

and L is a language.

A binary word relation

ρ

on

Σ^{*}

is any subset of

Σ^{*} \times Σ^{*}

. The domain of

ρ

is

{u ∣ (u, v) \in ρ for some v \in Σ^{*}}

. A channel

γ

is a binary relation on

Σ^{*}

that is input-preserving; that is,

γ \subseteq Σ^{*} \times Σ^{*}

and

(w, w) \in γ

for all words w in the domain of

γ

. When

(u, v) \in γ

, we say that u can be received as v via the channel

γ

, or v is a possible output of

γ

when u is used as input. If

v \neq u

, then we say that u can be received (via

γ

) with errors. Here, we only consider the channel

sid (k)

, for some

k \in N

, such that

(u, v) \in sid (k)

if and only if v can be obtained by applying at most k errors in u, where an error could be a deletion of a symbol in u, a substitution of a symbol in u with another symbol, or an insertion of a symbol in u—see further below for a more rigorous definition via edit-strings.

2.2. NFAs and Transducers

A Nondeterministic Finite Automaton with empty transitions, λ-NFA for short, or just automaton, is a quintuple

a = (Q, Σ, T, s, F)

such that Q is the finite set of states,

Σ

is the alphabet,

s \in Q

is the start (or initial) state,

F \subseteq Q

is the set of final states, and

T \subseteq Q \times (Σ \cup {λ}) \times Q

is the finite set of transitions or edges. Let

(p, x, q)

be a transition of

a

. Then, x is called the label of the transition, and we say that the transition goes out of p. We also use the notation

p \overset{x}{\to} q

for a transition

(p, x, q)

. The

λ

-NFA

a

is called an NFA, if no transition label is empty, that is,

T \subseteq Q \times Σ \times Q

. A Deterministic Finite Automaton, DFA for short, is a special type of NFA in which, for each state p, there are no two transitions with equal labels going out of p.

A path of

a

is a finite sequence of consecutive transitions:

(p_{0}, x_{1}, p_{1}) (p_{1}, x_{2}, p_{2}) \dots (p_{ℓ - 1}, x_{ℓ}, p_{ℓ}),

(4)

for some nonnegative integer ℓ, where we use concatenation of these transitions to denote the path. Then, if

P_{1}

and

P_{2}

are two paths such that the last state of

P_{1}

is equal to the first state of

P_{2}

,

P_{1} P_{2}

denotes the path resulting by concatenating the transitions of

P_{1}

and

P_{2}

.

The word

x_{1} \dots x_{ℓ}

is called the label of the path in (4). We write

p_{0} \overset{x}{\to}^{*} p_{ℓ}

to indicate that there is a path with label x from

p_{0}

to

p_{ℓ}

. A path as above is called a computation of

a

if

p_{0}

is the start state. It is called an accepting path/computation if

p_{0}

is the start state and

p_{ℓ}

is a final state. The language accepted by

a

, denoted as

L (a)

, is the set of labels of all the accepting paths of

a

. The automaton

a

is called trim, if every state appears in some accepting path of

a

.
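The definitions of path, λ-transition, and accepted language can be made concrete with a small simulation. The following is only a sketch under assumed conventions: transitions are triples `(p, x, q)`, the empty string `""` stands for λ, and the function names `lambda_closure` and `accepts` are hypothetical (they are not taken from FAdo or any other package).

```python
from collections import defaultdict


def lambda_closure(states, delta):
    """Return all states reachable from `states` via empty-label transitions."""
    closure, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for q in delta[p].get("", ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return closure


def accepts(transitions, start, finals, word):
    """Decide whether some accepting path of the automaton has label `word`."""
    delta = defaultdict(dict)
    for p, x, q in transitions:
        delta[p].setdefault(x, set()).add(q)
    current = lambda_closure({start}, delta)
    for symbol in word:
        step = set()
        for p in current:
            step |= delta[p].get(symbol, set())
        current = lambda_closure(step, delta)
    return bool(current & set(finals))


# Assumed example: accepts "ab" and, through the λ-transition 0 -> 2, the empty word.
T = [(0, "a", 1), (1, "b", 2), (0, "", 2)]
print(accepts(T, 0, {2}, "ab"), accepts(T, 0, {2}, "a"))
```

The simulation tracks the set of states reachable by the prefix read so far, which is the usual on-the-fly subset construction.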

A (finite nondeterministic) transducer [17,20] is a quintuple

t = (Q, Σ, T, s, F)

(in the literature, a transducer also has an output alphabet $Γ$, but we consider here that $Γ$ is the same as the input alphabet $Σ$; without further mention, all transducers considered here are nondeterministic)

such that

Q, s, F

are exactly the same as those in

λ

-NFAs,

Σ

is the alphabet, and

T \subseteq Q \times (Σ \cup {λ}) \times (Σ \cup {λ}) \times Q

is the finite set of transitions or edges. We write

(p, x / y, q)

, or

p \overset{x / y}{\to} q

for a transition—the label here is

(x / y)

, with x being the input and y being the output label of the transition. The concepts of path, computation, accepting path, and trim transducer are similar to those in

λ

-NFAs. However, the label of a transducer path

(p_{0}, x_{1} / y_{1}, p_{1}) \dots (p_{ℓ - 1}, x_{ℓ} / y_{ℓ}, p_{ℓ})

is the pair

(x_{1} \dots x_{ℓ}, y_{1} \dots y_{ℓ})

of the two words consisting of the concatenations of the input and output labels in the path, respectively. The relation realized by the transducer

t

, denoted by

R (t)

, is the set of labels in all the accepting paths of

t

. We write

t (u)

for the set of possible outputs of

t

on input u, that is,

v \in t (u) if and only if (u, v) \in R (t) .

The transducer

t

is called functional, if the relation

R (t)

is a function, that is,

t (u)

consists of at most one word, for all input words u. We say that

t

realizes a partial identity, if

v \in t (u)

implies that

v = u

.

If

m

is an automaton or a transducer, then the size of

m

, denoted by

| m |

, is the number of states plus the number of transitions in

m

. We shall write

Q_{m}, T_{m} for the sets of states and transitions of m, respectively .

If

m

is trim then

| Q_{m} | \leq | T_{m} | + 1

; thus,

if m is trim then | m | = O (| T_{m} |) .

We recall that making an automaton or transducer

m

trim can be done in linear time

O (| m |)

.
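One way to see why trimming is linear is that it needs only two graph searches: one forward from the start state and one backward from the final states. The sketch below assumes transitions are tuples whose first and last components are the source and target states, so it applies to both automaton triples and transducer 4-tuples; the names `reachable` and `trim` are hypothetical.

```python
from collections import defaultdict


def reachable(sources, edges):
    """States reachable from `sources` in the directed graph `edges`."""
    seen, stack = set(sources), list(sources)
    while stack:
        p = stack.pop()
        for q in edges.get(p, ()):
            if q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def trim(transitions, start, finals):
    """Keep only states (and transitions) that lie on some accepting path."""
    forward, backward = defaultdict(list), defaultdict(list)
    for t in transitions:
        forward[t[0]].append(t[-1])
        backward[t[-1]].append(t[0])
    useful = reachable({start}, forward) & reachable(set(finals), backward)
    kept = [t for t in transitions if t[0] in useful and t[-1] in useful]
    return kept, useful, useful & set(finals)


# Assumed example: state 3 cannot reach a final state, state 4 is unreachable.
kept, states, finals = trim(
    [(0, "a", 1), (1, "b", 2), (0, "a", 3), (4, "b", 2)], 0, {2}
)
print(kept, states)
```

In the trimmed result of this example there are three states and two transitions, consistent with the bound $| Q_{m} | \leq | T_{m} | + 1$.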

2.3. Edit Strings and Edit Distance

The alphabet

E_{Σ}

of the (basic) edit operations, which depends on the alphabet

Σ

of ordinary symbols, consists of all symbols

(x / y)

such that

x, y \in Σ \cup {λ}

and at least one of x and y is in

Σ

. If

(x / y) \in E_{Σ}

and x is not equal to y, then

(x / y)

is called an error [23]. The edit operations

(a / b)

,

(λ / a)

,

(a / λ)

, where

a, b \in Σ - {λ}

and

a \neq b

, are called substitution, insertion, deletion, respectively. We write

(λ / λ)

for the empty word over the alphabet

E_{Σ}

. We note that

λ

is used as a formal symbol in the elements of

E_{Σ}

. For example, if

a, b \in Σ

, then

(λ / a) (b / b) \neq (b / a) (λ / b)

. The elements of

E_{Σ}^{*}

are called edit strings. The weight of an edit string h, denoted by

weight (h)

, is the number of errors occurring in h. For example, for

g = (a / a) (a / λ) (b / b) (b / a) (b / b),

(5)

weight (g) = 2

. The input and output parts of an edit string

h = (x_{1} / y_{1}) \dots (x_{n} / y_{n})

are the words (over

Σ

)

x_{1} \dots x_{n}

and

y_{1} \dots y_{n}

, respectively. We write

inp (h)

for the input part and

out (h)

for the output part of h. For example, for the g shown above,

inp (g) = a a b b b

and

out (g) = a b a b

. The inverse of an edit string h is the edit string resulting by inverting the order of the input and output parts in every edit operation in h. For example, the inverse of g shown above is

(a / a) (λ / a) (b / b) (a / b) (b / b) .
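The notions of weight, input part, output part, and inverse can be checked mechanically on the example g. The following sketch assumes an edit string is represented as a list of pairs `(x, y)` with the empty string `""` standing for λ; the function names are chosen here for illustration.

```python
LAMBDA = ""  # assumed encoding of the empty word


def weight(h):
    """Number of errors, that is, edit operations (x / y) with x != y."""
    return sum(1 for x, y in h if x != y)


def inp(h):
    """Input part: concatenation of the input symbols."""
    return "".join(x for x, _ in h)


def out(h):
    """Output part: concatenation of the output symbols."""
    return "".join(y for _, y in h)


def inverse(h):
    """Swap input and output in every edit operation."""
    return [(y, x) for x, y in h]


# The edit string g of (5).
g = [("a", "a"), ("a", LAMBDA), ("b", "b"), ("b", "a"), ("b", "b")]
print(weight(g), inp(g), out(g))
```

Inverting an edit string swaps its input and output parts and leaves its weight unchanged.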

The channel

sid (k)

can be defined more rigorously via edit strings:

sid (k) = {(u, v) ∣ u = inp (h), v = out (h), for some h \in E_{Σ}^{*} with weight (h) \leq k} .
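As an illustration of how a transducer can realize $sid (k)$, the sketch below builds a transducer whose states count the errors made so far and then checks membership of a pair of words by searching over configurations. The state layout (counters 0 to k, all final) and the names `build_sid` and `relates` are assumptions made for this example; they are not necessarily the construction used in the paper's figures.

```python
def build_sid(k, alphabet):
    """Transducer with states 0..k; state i means 'i errors used so far'."""
    transitions = []
    for i in range(k + 1):
        for s in alphabet:
            transitions.append((i, s, s, i))              # no error
            if i < k:
                transitions.append((i, s, "", i + 1))     # deletion
                transitions.append((i, "", s, i + 1))     # insertion
                for t in alphabet:
                    if s != t:
                        transitions.append((i, s, t, i + 1))  # substitution
    return transitions, 0, set(range(k + 1))


def relates(transducer, u, v):
    """Decide whether (u, v) is the label of some accepting path."""
    transitions, start, finals = transducer
    seen, stack = {(start, 0, 0)}, [(start, 0, 0)]
    while stack:
        p, i, j = stack.pop()
        if p in finals and i == len(u) and j == len(v):
            return True
        for q, x, y, r in transitions:
            if q == p and u.startswith(x, i) and v.startswith(y, j):
                nxt = (r, i + len(x), j + len(y))
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
    return False


sid1 = build_sid(1, "ab")
sid2 = build_sid(2, "ab")
print(relates(sid1, "ab", "abb"), relates(sid1, "ab", "ba"), relates(sid2, "ab", "ba"))
```

Every accepting path spells out an edit string, and the counter guarantees that its weight is at most k.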

The edit (or Levenshtein) distance [24] between two words u and v, denoted by

δ (u, v)

, is the smallest number of errors (substitutions, insertions and deletions) that can be used to transform u to v. More formally,

δ (u, v) = \min {weight (h) ∣ h \in E_{Σ}^{*}, inp (h) = u, out (h) = v} .

We say that an edit string h realizes the edit distance between two words u and v, if

weight (h) = δ (u, v)

and, either

inp (h) = u

and

out (h) = v

, or

inp (h) = v

and

out (h) = u

. For example, for

Σ = {a, b}

, we have that

δ (a b a b a, b a b b b) = 3

and the edit string

h = (a / λ) (b / b) (a / a) (b / b) (a / b) (λ / b)

realizes

δ (a b a b a, b a b b b)

. Note that several edit strings can realize the distance

δ (u, v)

. If L is a language containing at least two words, then the edit distance of L is

δ (L) = \min {δ (u, v) ∣ u, v \in L and u \neq v} .

Testing whether a given NFA accepts at least two words is not a concern in this paper, but we note that this can be done efficiently (in linear time via a breadth first search type algorithm) [25].
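For finite languages, the definitions above can be evaluated directly. The sketch below uses the textbook dynamic program for the Levenshtein distance and a brute-force minimum over all pairs; it only illustrates the definitions and is not one of the algorithms of this paper, which work on NFAs. The function names are chosen here for illustration.

```python
from itertools import combinations


def edit_distance(u, v):
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


def inner_distance(words):
    """Minimum distance between two different words of a finite language."""
    words = set(words)
    if len(words) < 2:
        raise ValueError("the language must contain at least two words")
    return min(edit_distance(u, v) for u, v in combinations(words, 2))


print(edit_distance("ababa", "babbb"))
print(inner_distance(["aab", "abb", "bbb"]))
```

The brute-force minimum is quadratic in the number of words, so it cannot be applied to infinite regular languages.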

The next lemma comes from [4]. The bound

D_{a}

is always less than or equal to the number of states in the NFA

a

. Moreover, there are NFAs for which this bound is tight—see Section 6.

Lemma 1.

For every NFA

a

accepting at least two words, we have that

δ (L (a)) \leq D_{a},

where

D_{a}

is the number of states in the longest path in

a

from the start state having no repeated state.

However, the bound

D_{a}

is of no use in our context, as the problem of determining the length of a longest path in a given automaton, or a graph in general, is NP-complete since an algorithm solving this problem can be used to decide the existence of a Hamiltonian path; see for example [26]. There are many ways to obtain an efficiently computable upper bound on the edit distance of

L (a)

that is always at most equal to the number of states in

a

. For example, that distance is always less than or equal to the distance of the two shortest accepted words. We agree to use this as a working upper bound:

Lemma 2.

For every NFA

a

accepting at least two words, we have that

δ (L (a)) \leq B_{a},

where

B_{a}

is the edit distance of two shortest words in

L (a)

.

3. Edit Distance via Error-Detection

In [5], the authors discuss a conceptual method for computing integral distances of regular languages—integral means that all distance values are nonnegative integers—via the property of error-detection. In this section, we review that method and produce a concrete preliminary algorithm for computing the edit distance of a regular language.

A language L is error-detecting for a channel $γ$ [27] if no L-word can be received as a different L-word via $γ$; that is, for any words u and v,

u, v \in L and (u, v) \in γ \to u = v .

(6)

(The definition of error-detection in [27] uses $L \cup \{λ\}$ instead of L in Formula (6). This slight change makes the presentation here simpler and has no bearing on any existing results regarding error-detecting languages.)

Remark 1.

The error-detection method of [5] for computing inner distances of regular languages is based on the following observations, where

a

is an NFA and

t

is an input-preserving transducer.

1. A language L is error-detecting for $sid (m)$ if and only if $δ (L) > m$.
2. $δ (L)$ is equal to the positive integer k such that L is error-detecting for $sid (k - 1)$ and L is not error-detecting for $sid (k)$.
3. We have the following facts from [27]. A language L is error-detecting for a channel γ if and only if the following relation is a function:

$γ \cap (L \times Σ^{*}) \cap (Σ^{*} \times L) .$

(7)

Moreover, if $a$ accepts L and $t$ realizes γ, then a transducer $(t ↓ a)$ realizing $γ \cap (L \times Σ^{*})$ can be constructed in time $O (| t | | a |)$ and, analogously, a transducer $(t ↑ a)$ realizing $γ \cap (Σ^{*} \times L)$ can be constructed in time $O (| t | | a |)$. Both constructions are cross-product constructions. In each case, the resulting transducer has $O (| Q_{a} | | Q_{t} |)$ states and $O (| T_{a} | | T_{t} |)$ transitions. Thus, the transducer

$(t ↓ a ↑ a)$

realizes relation (7) and can be constructed in time $O (| t | | a |^{2})$.
4. There is an $O (| T_{s} |^{2} + r | Q_{s} |^{2})$ time algorithm that decides whether a given transducer $s$ is functional [6,28], where r is the size of the alphabet.

Using the above observations, we present first a preliminary error-detection-based algorithm for computing the desired edit distance.

Algorithm PrelimDistErrDetect

0. Input: NFA $a$

1. Let $B_{a}$ be the edit distance bound in Lemma 2

2. Let $\min \leftarrow 1$ and $\max \leftarrow B_{a} - 1$

3. Perform binary search to find the largest k in $\{\min, \dots, \max\}$ for which $L (a)$ is error-detecting for $sid (k)$, as follows:

while ($\min \leq \max$)

a) Let $k \leftarrow ⌊ (\min + \max) / 2 ⌋$
b) Construct transducer ${sid}_{k}$ realizing the channel $sid (k)$ (see Figure 1)
c) Construct the transducer $t_{k}^{'} \leftarrow ({sid}_{k} ↓ a ↑ a)$
d) If $L (a)$ is error-detecting for $sid (k)$, let $\min \leftarrow k + 1$; else let $\max \leftarrow k - 1$

4. return $\min$
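The control flow of the binary search can be tried out on finite languages. In the sketch below, the error-detection test of step (3d) is replaced by a brute-force check over all pairs of words, which is an assumption made purely for illustration (the algorithm itself tests a transducer built from the NFA). The bound of Lemma 2 is computed from the two shortest words, and all function names are hypothetical.

```python
from itertools import combinations


def edit_distance(u, v):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def is_error_detecting(words, k):
    """Brute-force stand-in for step (3d): no two different words within distance k."""
    return all(edit_distance(u, v) > k for u, v in combinations(set(words), 2))


def prelim_distance(words):
    """Binary search for the inner edit distance of a finite language."""
    shortest = sorted(set(words), key=lambda w: (len(w), w))[:2]
    bound = edit_distance(*shortest)  # plays the role of B_a in Lemma 2
    lo, hi = 1, bound - 1
    while lo <= hi:
        k = (lo + hi) // 2
        if is_error_detecting(words, k):
            lo = k + 1
        else:
            hi = k - 1
    return lo


print(prelim_distance(["aaaa", "bbbb", "aabb"]))
```

The search is valid because error-detection is monotone: a language that is error-detecting for $sid (k)$ is also error-detecting for every smaller value.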

Remark 2.

Step (3d) of the above algorithm can be computed using the transducer functionality algorithm on

t_{k}^{'}

, which leads again to a polynomial but expensive algorithm. It turns out, however, using standard logical arguments, that

Condition (6) is equivalent to whether (t ↓ a ↑ a) realizes a partial identity,

when

t

realizes γ—in the above algorithm,

t

is

{sid}_{k}

. Moreover, [6], there is an

O (| T_{t} | + r | Q_{t} |)

time algorithm that tests whether a given transducer

t

realizes a partial identity, where r is the size of the alphabet.

Corollary 1.

Consider the algorithm PrelimDistErrDetect. Using the partial identity test for

t_{k}^{'}

in step 3d, the algorithm computes the edit distance of a language given via a trim NFA

a

in time

O (n^{2} r^{2} B_{a} \log B_{a}),

where r is the cardinality of the alphabet used in

T_{a}

, and

n = | T_{a} |

.

Proof.

The correctness of the algorithm follows from Remarks 1 and 2. For the time complexity, the whole loop will perform

O (\log B_{a})

iterations. In each iteration, the value k is used to construct the transducer

{sid}_{k}

shown in Figure 1 with alphabet being the set of alphabet symbols appearing in the definition of

a

. Then, the transducer

t_{k}^{'}

is constructed having

O (k | Q_{a} |^{2})

states and

O (k r^{2} | T_{a} |^{2})

transitions. Then, the partial identity of

t_{k}^{'}

is tested in time

O (| T_{a} |^{2} k r^{2})

. As

k < B_{a}

, it follows that the total time complexity is as required. □

We note that, in the worst case,

B_{a}

is of order

O (n)

and, assuming a fixed alphabet, the above algorithm operates in time

O (n^{3} \log n),

which is asymptotically better than the time complexity of the algorithm in [4], even when the given automaton is a DFA.

4. An $O (n^{2} d)$ Algorithm for Edit Distance via Error-Detection

In this section, we observe that the algorithm of the previous section repeats a lot of computations, and we eliminate those repeated computations to arrive at an improved algorithm that computes the edit distance d of a trim NFA

a

in time

O (n^{2} d r^{2})

, where r is the cardinality of the alphabet used in

T_{a}

, and

n = | T_{a} |

. The improved algorithm is based on the following two observations:

The previous algorithm starts the binary search loop by constructing the transducer $t_{⌊ B_{a} / 2 ⌋}^{'}$, but the edit distance might be much smaller than $⌊ B_{a} / 2 ⌋$. It turns out to be more efficient to construct in turn $t_{1}^{'}, t_{2}^{'}, \dots$ until the first $t_{d}^{'}$ that does not realize a partial identity.
If $t_{k}^{'}$ has been constructed and found to realize a partial identity, then building the transducer $t_{k + 1}^{'}$ from scratch repeats the partial identity test for the part of $t_{k + 1}^{'}$ that corresponds to $t_{k}^{'}$. We shall define the transducer $t_{k + 1}^{″}$ to be the part that is added to $t_{k}^{'}$ in order to obtain $t_{k + 1}^{'}$, plus some initial state. Moreover, we shall show that, if $t_{k}^{'}$ realizes a partial identity, then $t_{k + 1}^{'}$ realizes a partial identity if and only if $t_{k + 1}^{″}$ does so. Thus, the partial identity test in each step will apply only to the new part that is added to the transducer of the previous step.

We proceed with details based on the above observations.

Product construction of trim $t^{'} = t ↓ a ↑ a$ , given transducer

t

and NFA

a

. As usual in cross product constructions, the states of

t^{'}

are triples of the form

(φ, q, q^{'})

, where

φ

is a state of

t

, and q,

q^{'}

are states of

a

. The initial state of

t^{'}

is

(φ_{0}, q_{0}, q_{0})

, where

φ_{0}

is the initial state of

t

and

q_{0}

is the initial state of

a

. The construction is incremental, starting with the creation of

(φ_{0}, q_{0}, q_{0})

; then:

If state $(φ, q, q^{'})$ has been created, and there are transitions $φ \overset{x / y}{\to} ψ, q \overset{x}{\to} r, q^{'} \overset{y}{\to} r^{'}$ of $t$ , $a^{λ}$ and $a^{λ}$ , respectively, then the transition

$(φ, q, q^{'}) \overset{x / y}{\to} (ψ, r, r^{'})$

is added in $t^{'}$ . Here, $a^{λ}$ is the $λ$ -NFA that results if we add in $a$ the loop transitions $(q, λ, q)$ to all states q of $a$ .

The final states of

t^{'}

are those constructed triples consisting of final states in

t

and

a

. In the end, we also make

t^{'}

trim.
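The incremental product construction described above can be sketched as a worklist algorithm. The code below assumes transducer transitions are 4-tuples `(p, x, y, q)`, NFA transitions are triples `(p, x, q)`, and `""` stands for λ; the implicit λ-loops of $a^{λ}$ are handled by the helper `moves`. All names are hypothetical, and the sketch makes no attempt to match the data structures of any existing package.

```python
from collections import defaultdict, deque


def product(t_trans, t_init, t_finals, a_trans, a_init, a_finals):
    """Build the trim transducer t ↓ a ↑ a, creating only reachable triples."""
    t_out = defaultdict(list)
    for p, x, y, q in t_trans:
        t_out[p].append((x, y, q))
    a_out = defaultdict(list)
    for p, x, q in a_trans:
        a_out[(p, x)].append(q)

    def moves(q, label):
        # In a^λ every state has a λ-loop, so a λ-move stays in place.
        return [q] if label == "" else a_out.get((q, label), [])

    start = (t_init, a_init, a_init)
    seen, queue, trans = {start}, deque([start]), []
    while queue:
        phi, q, q2 = queue.popleft()
        for x, y, psi in t_out.get(phi, []):
            for r in moves(q, x):
                for r2 in moves(q2, y):
                    target = (psi, r, r2)
                    trans.append(((phi, q, q2), x, y, target))
                    if target not in seen:
                        seen.add(target)
                        queue.append(target)
    finals = {s for s in seen
              if s[0] in t_finals and s[1] in a_finals and s[2] in a_finals}
    # Trim: drop states that cannot reach a final state.
    back = defaultdict(list)
    for e in trans:
        back[e[3]].append(e[0])
    useful, stack = set(finals), list(finals)
    while stack:
        s = stack.pop()
        for p in back.get(s, []):
            if p not in useful:
                useful.add(p)
                stack.append(p)
    trans = [e for e in trans if e[0] in useful and e[3] in useful]
    return trans, start, finals


# Assumed example: identity transducer on {a, b} and an NFA with a dead branch.
ident = [("i", "a", "a", "i"), ("i", "b", "b", "i")]
nfa = [(0, "a", 1), (1, "b", 2), (0, "b", 3)]
trans, start, finals = product(ident, "i", {"i"}, nfa, 0, {2})
print(len(trans), finals)
```

Because triples are created only when reached from the initial triple, every created state is reachable, and the backward pass removes the states that cannot reach a final state.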

Optimized construction of $t_{k + 1}^{″}$ and $t_{k + 1}^{'}$ from the trim $t_{k}^{'}$ . Suppose that

t_{k}^{'}

has been constructed, where initially

t_{1}^{'} = {sid}_{1} ↓ a ↑ a

. Constructing

t_{k + 1}^{'}

using

t_{k}^{'}

will be done again incrementally. The first phase of the incremental construction is to add the new transitions

([k], q, q^{'}) \overset{x / y}{\to} ([k + 1], r, r^{'}),

(8)

where

x / y

is an error and

q \overset{x}{\to} r, q^{'} \overset{y}{\to} r^{'}

are transitions in

a^{λ}

. There will be no new transitions of the form

([i], q, q^{'}) \overset{x / y}{\to} ([k + 1], r, r^{'})

for

i < k

, because the transducer

{sid}_{k + 1}

has no transitions from any state

[i]

with

i < k

to state

[k + 1]

. Note that the numbers of new transitions and new states created as in (8) are

O (| T_{a} |^{2} r^{2})

and

O (| Q_{a} |^{2})

, respectively.

After the first phase, the incremental construction proceeds from the new states

([k + 1], r, r^{'})

in (8). Any new transition must be of the form

([k + 1], q, q^{'}) \overset{σ / σ}{\to} ([k + 1], r, r^{'}),

(9)

where

σ \in Σ

. This is because the transducer

{sid}_{k + 1}

has only transitions of the form

[k + 1] \overset{σ / σ}{\to} [k + 1]

going out of the state

[k + 1]

. The process ends when no new states are created. The transitions and final states of the transducer

t_{k + 1}^{'}

are those in

t_{k}^{'}

plus the newly created ones, after removing any new states that cannot reach a final state (thus, also

t_{k + 1}^{'}

is trim). The transducer

t_{k + 1}^{″}

has as transitions and final states only the newly created ones, and has as initial state a new state

[- 1]

with transitions

[- 1] \overset{λ / λ}{\to} ([k], q, q^{'})

, for all states of the form

([k], q, q^{'})

. □

Lemma 3.

Suppose the trim transducer

t_{k}^{'}

realizes a partial identity.

1. If $C_{1}$ is a computation of $t_{k}^{'}$ ending at a state of the form $([k], p, p^{'})$, then the label of $C_{1}$ is of the form $(w_{1}, w_{1})$.
2. $t_{k + 1}^{'}$ realizes a partial identity if and only if $t_{k + 1}^{″}$ does so.

Proof.

For the first statement, consider any computation

C_{1}

of

t_{k}^{'}

having some label

(w_{1}, w_{1}^{'})

and ending at a state of the form

([k], p, p^{'})

. We show that

w_{1} = w_{1}^{'}

. If the state

([k], p, p^{'})

is final, then

C_{1}

is an accepting computation, which implies

w_{1} = w_{1}^{'}

, as

t_{k}^{'}

realizes a partial identity. If

([k], p, p^{'})

is not a final state, then, as

t_{k}^{'}

is trim, there is a path

C_{1}^{'}

from

([k], p, p^{'})

to a final state of

t_{k}^{'}

, where all states of that path are of the form

([k], r, r^{'})

and all labels of that path are of the form

σ / σ

—this is because any transition of

{sid}_{k}

from state

[k]

can only go to state

[k]

and can only have a label of the form

σ / σ

. Thus, there is an accepting path of

t_{k}^{'}

of the form

C_{1} C_{1}^{'}

with label

(w_{1} z, w_{1}^{'} z)

for some nonempty word z. Then, as

t_{k}^{'}

realizes a partial identity, we have that

w_{1} z = w_{1}^{'} z

, which implies

w_{1} = w_{1}^{'}

, as required.

For the ‘only if’ part of the second statement, assume that

t_{k + 1}^{'}

realizes a partial identity. Consider any accepting computation

C_{2}

of

t_{k + 1}^{″}

with some label

(w_{2}, w_{2}^{'})

. We show that

w_{2} = w_{2}^{'}

. Let

[- 1] \overset{λ / λ}{\to} ([k], p, p^{'})

be the first transition of

C_{2}

. Let

C_{2}^{'}

be the path that results when we remove the first transition of

C_{2}

. By the construction of

t_{k + 1}^{'}

, there is a computation

C_{1}

of

t_{k}^{'}

that ends at state

([k], p, p^{'})

. Let

(w_{1}, w_{1})

be the label of

C_{1}

. Then,

C_{1} C_{2}^{'}

is an accepting computation of

t_{k + 1}^{'}

with label

(w_{1} w_{2}, w_{1} w_{2}^{'})

. As

t_{k + 1}^{'}

realizes a partial identity,

w_{1} w_{2} = w_{1} w_{2}^{'}

, which implies

w_{2} = w_{2}^{'}

, as required.

For the ‘if’ part of the second statement, assume that

t_{k + 1}^{″}

realizes a partial identity. Consider any accepting computation C of

t_{k + 1}^{'}

. We show that the label of C must be of the form

(w, w)

. If C is already a computation of

t_{k}^{'}

, then this holds, as

t_{k}^{'}

realizes a partial identity. Now suppose that

C = C_{1} C_{2}

such that

C_{1}

is a computation of

t_{k}^{'}

and

C_{2}

is a path in

t_{k + 1}^{'}

that starts with a transition as in (8) and then uses transitions as in (9). Let

([k], p, p^{'})

be the last state of

C_{1}

, which is also the first state of

C_{2}

. Then,

C_{1}

has some label

(w_{1}, w_{1})

. In addition, the path

([- 1] \overset{λ / λ}{\to} ([k], p, p^{'})) C_{2}

is an accepting computation of

t_{k + 1}^{″}

, which implies that it has some label

(w_{2}, w_{2})

. Hence, the label of C is

(w_{1} w_{2}, w_{1} w_{2})

and, therefore,

t_{k + 1}^{'}

realizes a partial identity.

The improved algorithm is shown next:

Algorithm DistErrDetect

0. Input: NFA $a$

1. Construct the transducer ${sid}_{1}$ realizing the channel $sid (1)$ (see Figure 1)

2. Construct the trim transducer $t_{1}^{'} = {sid}_{1} ↓ a ↑ a$

3. Let $k \leftarrow 1$

4. Let $s \leftarrow t_{1}^{'}$

5. while ($s$ realizes a partial identity)

a) Construct $t_{k + 1}^{″}$ and $t_{k + 1}^{'}$ from $t_{k}^{'}$ using the optimized construction
b) Let $s \leftarrow t_{k + 1}^{″}$
c) Let $k \leftarrow k + 1$

6. return k

Theorem 1.

Algorithm DistErrDetect computes the edit distance of a language given via a trim NFA

a

in time

O (n^{2} d r^{2}),

where d is the computed edit distance,

n = | T_{a} |

, and r is the cardinality of the alphabet used in

T_{a}

.

Proof.

The correctness of the algorithm follows from the optimized construction and the above lemma. For the time complexity of the algorithm, we note the following. First,

t_{1}^{'}

is constructed in time

O (| a |^{2} r^{2})

. Then,

t_{2}^{″}, \dots, t_{d}^{″}

are constructed according to the optimized construction. Each of these is constructed in time

O (| a |^{2} r^{2})

and has

O (| Q_{a} |^{2})

states and

O (| T_{a} |^{2} r^{2})

transitions. In addition, each

t_{k}^{″}

is tested for partial identity in time

O (| T_{a} |^{2} r^{2} + {| Q_{a} |}^{2} r)

, which is

O (| a |^{2} r^{2})

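. Since d transducers are constructed and tested in total, and $| a | = O (n)$ for a trim $a$, the overall time is $O (n^{2} d r^{2})$, as claimed.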
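The linear search that replaces the binary search can also be illustrated on finite languages. As before, the partial identity test is replaced by a brute-force pairwise check, which is an assumption made for this sketch only; the point is that the loop performs exactly d tests, where d is the returned distance. The function names are hypothetical.

```python
from itertools import combinations


def edit_distance(u, v):
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(v) + 1))
    for i, x in enumerate(u, 1):
        cur = [i]
        for j, y in enumerate(v, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]


def is_error_detecting(words, k):
    """Brute-force stand-in for the partial identity test."""
    return all(edit_distance(u, v) > k for u, v in combinations(set(words), 2))


def linear_distance(words):
    """Try k = 1, 2, ... and stop at the first k that is not error-detecting."""
    if len(set(words)) < 2:
        raise ValueError("the language must contain at least two words")
    k, tests = 1, 1
    while is_error_detecting(words, k):
        k += 1
        tests += 1
    return k, tests


print(linear_distance(["aaaa", "bbbb", "aabb"]))
```

The loop terminates because a language with two different words at distance d is not error-detecting for $sid (d)$.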

5. An $O (n^{2} d)$ Algorithm for Edit Distance via Input-Altering Transducers

In this section, we present another algorithm for computing the desired edit distance via input-altering transducers—see Theorem 3 and the associated algorithm. A transducer

t

is called input-altering, if

w \notin t (w), for all words w,

that is, the output of

t

is never equal to the input used.

We explain now how input-altering transducers are related to edit-distance and error-detection. Let

t

be a transducer. A language L is

t

-independent [29,30], if

u, v \in L and u \in t (v) \to u = v .

(10)

Of course, when

R (t)

is input-preserving, then

t

-independence is the same as error-detection for the channel

R (t)

, and condition (10) can be tested as explained in Remark 2. On the other hand, if the transducer

t

is input-altering, then [30], condition (10) is equivalent to

t (L) \cap L = \emptyset .

(11)

If L is accepted by some NFA

a

, then the above condition can be tested using two product constructions: first, construct an NFA

b

accepting

t (L)

, then construct an NFA

c

by intersecting

b

with

a

, and then test whether there is a path from the start to a final state of

c

. Thus, condition (11) can be tested in time

O (| a |^{2} | t |) .

(12)

Certain types of input-altering transducers are useful in constructing maximal

t

-independent languages [30]. In Theorem 2, we show how an input-altering transducer can be used to model the edit operations used in the definition of the edit distance.

5.1. An Input-Altering Transducer for Edit-Distance

We shall define the input-altering transducer

{ia}_{k}

, which is partially shown in Figure 2. The value i in a state

[i]

or

[i, a]

is called the error counter, meaning that any path from

[0]

to a state with error counter i has to be labeled

(u, v)

such that

δ (u, v) \leq i

. More precisely, we will define the edges such that a state

[i, a]

can be reached from

[0]

via a path with label

(u, v)

if and only if

u = v a x

for some word x and

i = | a x |

, thus, v is a proper prefix of u and state

[i, a]

remembers the left-most letter of u that occurs after its prefix v. A state

[i]

with

i \geq 1

can only be reached via a path labeled

(u, v)

from

[0]

if

1 \leq δ (u, v) \leq i

, thus,

u \neq v

. Furthermore, we make sure that for

u \neq v

such that neither

u \leq_{p} v

nor

v \leq_{p} u

there is a path from

[0]

to

[δ (u, v)],

which is labeled by

(u, v)

or

(v, u)

.

Definition 1.

The transducer

{ia}_{k} = (Q, Σ, E, [0], F)

is defined as follows. The set of states is

Q = \{[i] | 0 \leq i \leq k\} \cup \{[i, a] | 1 \leq i \leq k, a \in Σ\}

with all but the initial state

[0]

being final states:

F = Q \setminus \{[0]\} .

The transitions in

{ia}_{k}

can be divided into the four sets of edges

E = E_{0} \cup E_{s} \cup E_{i} \cup E_{d}

. The transitions from

E_{0}

do not introduce any error, edges from the other sets model one substitution (

E_{s}

), insertion (

E_{i}

), or deletion (

E_{d}

):

E_{0} = \{[i] \overset{σ / σ}{\to} [i] \mid σ \in Σ, 0 \leq i \leq k\} \cup

(13)

\{[i, a] \overset{σ / σ}{\to} [i] \mid a, σ \in Σ, a \neq σ, 1 \leq i \leq k\},

(14)

E_{s} = \{[i] \overset{σ / τ}{\to} [i + 1] \mid σ, τ \in Σ, σ \neq τ, 0 \leq i < k\},

(15)

E_{i} = \{[i] \overset{λ / σ}{\to} [i + 1] \mid σ \in Σ, 1 \leq i < k\},

(16)

E_{d} = \{[0] \overset{a / λ}{\to} [1, a] \mid a \in Σ\} \cup

(17)

\{[i] \overset{σ / λ}{\to} [i + 1] \mid σ \in Σ, 1 \leq i < k\} \cup

(18)

\{[i, a] \overset{σ / λ}{\to} [i + 1, a] \mid a, σ \in Σ, 1 \leq i < k\} .

(19)
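Definition 1 can be turned into code almost line by line. The sketch below is an illustration under assumed encodings, not the FAdo-based implementation: a state [i] is the tuple `(i,)`, a state [i, a] is the tuple `(i, a)`, an edge is `(p, x, y, q)` with the empty string standing for λ, and `relates` is a hypothetical helper that decides whether v is in ia_k(u) by a search over (state, position in u, position in v).

```python
def build_ia(k, alphabet):
    """Build ia_k (k >= 1) as (states, edges, start, finals), following (13)-(19)."""
    S = sorted(alphabet)
    edges = set()
    for i in range(k + 1):
        for s in S:
            edges.add(((i,), s, s, (i,)))                # (13)
    for i in range(1, k + 1):
        for a in S:
            for s in S:
                if s != a:
                    edges.add(((i, a), s, s, (i,)))      # (14)
    for i in range(k):
        for s in S:
            for t in S:
                if s != t:
                    edges.add(((i,), s, t, (i + 1,)))    # (15) substitution
    for a in S:
        edges.add(((0,), a, "", (1, a)))                 # (17) first deletion
    for i in range(1, k):
        for s in S:
            edges.add(((i,), "", s, (i + 1,)))           # (16) insertion
            edges.add(((i,), s, "", (i + 1,)))           # (18) deletion
            for a in S:
                edges.add(((i, a), s, "", (i + 1, a)))   # (19) deletion
    states = {(i,) for i in range(k + 1)}
    states |= {(i, a) for i in range(1, k + 1) for a in S}
    return states, edges, (0,), states - {(0,)}

def relates(ia, u, v):
    """Return True iff v is an output of the transducer on input u."""
    states, edges, start, finals = ia
    seen = {(start, 0, 0)}
    todo = [(start, 0, 0)]
    while todo:
        p, i, j = todo.pop()
        if p in finals and i == len(u) and j == len(v):
            return True
        for (p1, x, y, q) in edges:
            # An edge applies if x is next in u and y is next in v ('' always matches).
            if p1 == p and u.startswith(x, i) and v.startswith(y, j):
                nxt = (q, i + len(x), j + len(y))
                if nxt not in seen:
                    seen.add(nxt)
                    todo.append(nxt)
    return False

ia1 = build_ia(1, "ab")
ia2 = build_ia(2, "ab")
```

For the alphabet {a, b} and k = 2 this yields (k + 1) + 2k = 7 states. The pair (ab, a) is related by ia_2 but (a, ab) is not, which reflects the rule that the first error of an accepted edit string is never an insertion.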

Terminology. If

t = (Q, Σ, T, q_{0}, F)

is a transducer in standard form, then, we write

t^{e}

for the NFA

t^{e} = (Q, E_{Σ}, T, q_{0}, F)

over the edit alphabet

E_{Σ}

, where the labels of the transitions in

t

are viewed as elements of

E_{Σ}

. Note that, the label of a path P in

t

is a pair of words

(u, v)

, whereas the label of the corresponding path in

t^{e}

, which we denote as

P^{e}

, is an edit string h such that

inp (h) = u

and

out (h) = v

. This type of NFA is called an eNFA in [23].

Definition 2.

An edit string h of nonzero weight is called reduced, if (a) the first error in h is not an insertion, and (b) if the first error in h is a deletion of the form

(a / λ)

, then the first non-deletion edit operation that follows

(a / λ)

in h (if any) is of the form

σ / σ

with

σ \in Σ \setminus \{a\}

.

Example. The edit string

(a / a) (a / b) (a / λ) (λ / a)

is reduced as its first error is a substitution. The edit string

(a / a) (a / λ) (b / b) (b / a)

is reduced as well. The edit string

(λ / a) (a / a)

is not reduced as it starts with an insertion, and the edit string

(a / λ) (b / a) (b / b)

is not reduced either.
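Definition 2 is easy to check mechanically. The following Python sketch encodes an edit string as a list of pairs `(x, y)`, with the empty string standing for λ; the encoding and the function name `is_reduced` are assumptions made for illustration.

```python
def is_reduced(h):
    """Check Definition 2 for an edit string h given as a list of pairs (x, y)."""
    errors = [i for i, (x, y) in enumerate(h) if x != y]
    if not errors:
        return False              # weight zero: the definition does not apply
    first = errors[0]
    x, y = h[first]
    if x == "":                   # (a) the first error is an insertion
        return False
    if y == "":                   # (b) the first error is a deletion (a / lambda)
        a = x
        for (s, t) in h[first + 1:]:
            if s != "" and t == "":
                continue          # skip the deletions that follow
            # First non-deletion operation: must be (s / s) with s != a.
            return s == t and s != a
    return True

examples = [
    [("a", "a"), ("a", "b"), ("a", ""), ("", "a")],   # reduced
    [("a", "a"), ("a", ""), ("b", "b"), ("b", "a")],  # reduced
    [("", "a"), ("a", "a")],                          # not reduced
    [("a", ""), ("b", "a"), ("b", "b")],              # not reduced
]
```

Applied to the four edit strings of the example above, the function reproduces the stated classification.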

The proofs of the next two lemmata are given in the appendix.

Lemma 4.

Let

x, y, u, v

be words. The following statements hold true:

1. $δ (x u y, x v y) = δ (u, v)$.
2. If $v <_{p} u$, then $δ (u, v) = | u | - | v |$.
3. If $u \neq v$, then there is a reduced edit string h realizing $δ (u, v)$.
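The first two statements of Lemma 4 can be sanity-checked by brute force on short words. The sketch below uses the standard dynamic-programming computation of the edit distance (unit cost for substitution, insertion, and deletion); the helper names are illustrative and the check is an experiment on small inputs, not a proof.

```python
from itertools import product

def edit_distance(u, v):
    """Edit (Levenshtein) distance via the usual row-by-row dynamic program."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j - 1] + (a != b),  # match or substitution
                           prev[j] + 1,             # deletion of a
                           cur[j - 1] + 1))         # insertion of b
        prev = cur
    return prev[-1]

def words(alphabet, max_len):
    """All words over the alphabet of length at most max_len."""
    for n in range(max_len + 1):
        for w in product(alphabet, repeat=n):
            yield "".join(w)

W = list(words("ab", 2))
# Statement 1: a common prefix x and a common suffix y do not change the distance.
stmt1 = all(edit_distance(x + u + y, x + v + y) == edit_distance(u, v)
            for x in W for y in W for u in W for v in W)
# Statement 2: if v is a proper prefix of u, the distance is |u| - |v|.
stmt2 = all(edit_distance(u, u[:m]) == len(u) - m
            for u in W for m in range(len(u)))
```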

Lemma 5.

Let

k \in N

and let

u, v

be words. The following statements hold true with respect to the transducer

{ia}_{k}

.

1. In ${ia}_{k}^{e}$, every path from the start state $[0]$ to any state $[i]$ or $[i, a]$ has as label a reduced edit string whose weight is equal to i.
2. If $1 \leq δ (u, v) \leq k$ and h is a reduced edit string realizing $δ (u, v)$, then h is accepted by ${ia}_{k}^{e}$.
3. If $v \in {ia}_{k} (u)$, then $1 \leq δ (u, v) \leq k$.

Theorem 2.

For each

k \in N

, the transducer

{ia}_{k}

is input-altering and of size

O (k r^{2})

, where r is the cardinality of the alphabet, and satisfies the following condition, for any language L containing at least two words

{ia}_{k} (L) \cap L = \emptyset \text{ if and only if } δ (L) > k .

(20)

Proof.

By construction, it follows that

{ia}_{k}

is trim and has

O (r k)

states and

O (k r^{2})

transitions. Hence, it is indeed of size

O (k r^{2})

. The third statement of Lemma 5 implies that the transducer is input-altering. Next, we show that (20) is true for all languages L containing at least two words.

First, for the ‘if’ part, assume

δ (L) > k

and consider any words

u, v \in L

. We need to prove

v \notin {ia}_{k} (u)

. If

u = v

, then this holds as

{ia}_{k}

is input-altering. Else, it follows from the third statement of Lemma 5. Now, for the ‘only if’ part, assume

{ia}_{k} (L) \cap L = \emptyset,

(21)

but, for the sake of contradiction, suppose there are different words

u, v \in L

such that

1 \leq δ (u, v) \leq k

. Let h be a reduced edit string realizing

δ (u, v)

. By the second statement of Lemma 5, h is accepted by

{ia}_{k}^{e}

via some path

P^{e}

and, therefore, either of

(u, v)

and

(v, u)

is the label of the path P of

{ia}_{k}

, that is, we have

u \in {ia}_{k} (v)

or

v \in {ia}_{k} (u)

, which contradicts (21). □

Corollary 2.

For each NFA

a

accepting at least two words and for each transducer

{ia}_{k}

, with

k \in N

, the following condition is satisfied:

R ({ia}_{k} ↓ a ↑ a) = \emptyset \text{ if and only if } δ (L (a)) > k .

(22)

Proof.

The statement follows from the above theorem and the fact (based on standard logic arguments) that

R ({ia}_{k} ↓ a ↑ a) = \emptyset

is equivalent to

{ia}_{k} (L (a)) \cap L (a) = \emptyset

. □

The reason why condition

R ({ia}_{k} ↓ a ↑ a) = \emptyset

is preferred to the equivalent one in Theorem 2 is explained further below in the remark that follows Theorem 3.

5.2. The Second $O (n^{2} d)$ Algorithm for Edit Distance

Here, we use the results of the previous subsection to arrive at the second algorithm for computing the desired edit distance. Corollary 2 implies that the preliminary algorithm PrelimDistInpAlter shown below correctly computes the desired edit distance. Moreover, by reasoning as in the proof of Corollary 1, it follows that this algorithm also executes in time

O (n^{2} r^{2} B_{a} \log B_{a})

, where r is the cardinality of the alphabet used in

T_{a}

, and

n = | T_{a} |

.

Algorithm PrelimDistInpAlter

0.

Input: NFA

a

1.

Let

B_{a}

be the bound in Lemma 2

2.

Let

\min \leftarrow 1

and

\max \leftarrow B_{a} - 1

3.

Perform binary search to find the largest k in

\{\min, \dots, \max\}

for which

L (a)

is error-detecting for

sid (k)

as follows:

while (

\min \leq \max

)

(a) Let $k \leftarrow ⌊ (\min + \max) / 2 ⌋$
(b) Construct the transducer ${ia}_{k}$ (see Figure 2)
(c) Construct the transducer $t_{k}^{'} \leftarrow {ia}_{k} ↓ a ↑ a$
(d) If $R (t_{k}^{'}) = \emptyset$, let $\min \leftarrow k + 1$;
    else let $\max \leftarrow k - 1$

4.

return min
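The control flow of PrelimDistInpAlter is an ordinary binary search over k. In the Python sketch below, the emptiness test of step (d) is abstracted into a caller-supplied predicate `realizes_empty(k)`, which is assumed to return True exactly when δ(L(a)) > k, and `bound` plays the role of B_a. Both names are hypothetical, and the construction of the transducer itself is deliberately left out.

```python
def prelim_dist(bound, realizes_empty):
    """Binary search for the edit distance, assuming 1 <= distance <= bound.

    realizes_empty(k) stands in for the test R(ia_k ↓ a ↑ a) = ∅ and must be
    monotone: True for all k below the distance, False from the distance on.
    """
    lo, hi = 1, bound - 1
    while lo <= hi:
        k = (lo + hi) // 2
        if realizes_empty(k):   # distance > k: search the upper half
            lo = k + 1
        else:                   # distance <= k: search the lower half
            hi = k - 1
    return lo
```

For instance, with a bound of 100 and a predicate that behaves as if the distance were 7, the loop tests k = 50, 25, 12, 6, 9, 7 and returns 7. The number of constructed transducers is logarithmic in the bound, but each one is built from scratch, which is what the optimized construction avoids.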

We discuss now how to improve the above algorithm. The two observations we made at the beginning of Section 4 apply here as well if, instead of partial identity of

t_{k}^{'}

, we talk about the emptiness of

t_{k}^{'}

. Thus, we want the improved algorithm to construct in turn

t_{1}^{'}, t_{2}^{'}, \dots

until the first

t_{d}^{'}

with

R (t_{d}^{'}) \neq \emptyset

. Moreover, when

t_{k}^{'}

has been constructed and realizes ∅, we continue in the next step with new transitions added to

t_{k}^{'}

in order to get

t_{k + 1}^{'}

.

Optimized construction of $t_{k + 1}^{'}$ from the trim $t_{k}^{'}$ . Suppose that the trim

t_{k}^{'}

has been constructed, where initially

t_{1}^{'} \leftarrow {ia}_{1} ↓ a ↑ a

. Constructing

t_{k + 1}^{'}

using

t_{k}^{'}

will be done again incrementally. The first phase of the incremental construction is to add two sets of new transitions: the new transitions

([k], q, q^{'}) \overset{x / y}{\to} ([k + 1], r, r^{'}),

(23)

where

x / y

is an error and

q \overset{x}{\to} r, q^{'} \overset{y}{\to} r^{'}

are transitions in

a^{λ}

; and the new transitions

([k, a], q, q^{'}) \overset{σ / λ}{\to} ([k + 1, a], r, r^{'}),

(24)

where

a, σ \in Σ

, and

q \overset{σ}{\to} r, q^{'} \overset{λ}{\to} r^{'}

are transitions in

a^{λ}

. Note that the total numbers of new transitions and states created in the first phase are

O (| T_{a} |^{2} r^{2})

and

O (| Q_{a} |^{2} r)

, respectively.

After the first phase, the incremental construction proceeds from the new states

([k + 1], r, r^{'})

and

([k + 1, a], r, r^{'})

. Any new transition must be of the form

([k + 1], r, r^{'}) \overset{σ / σ}{\to} ([k + 1], t, t^{'}) or ([k + 1, a], r, r^{'}) \overset{σ / σ}{\to} ([k + 1], t, t^{'}),

(25)

where

σ \in Σ

and, in the second case above,

σ \neq a

. This is because the transducer

{ia}_{k + 1}

has only transitions of the form

[k + 1] \overset{σ / σ}{\to} [k + 1]

and

[k + 1, a] \overset{σ / σ}{\to} [k + 1]

, with

σ \neq a

, going out of the states

[k + 1] and [k + 1, a]

. The incremental process ends when no new states are created. The transitions and final states of the transducer

t_{k + 1}^{'}

are those in

t_{k}^{'}

plus the newly created ones, after removing any new states that cannot reach a final state (thus, also

t_{k + 1}^{'}

is trim). □

Remark 3.

If the trim transducer

t_{k}^{'}

has no final states, then

t_{k + 1}^{'}

has no final states if and only if none of the newly created states in the optimized construction is a final state.

Algorithm DistInpAlter

0.

Input: NFA

a

1.

Construct the transducer

{ia}_{1}

—see Figure 2

2.

Construct the trim transducer

t_{1}^{'} \leftarrow {ia}_{1} ↓ a ↑ a

3.

Let

k \leftarrow 1

4.

Let

s \leftarrow

the set of states of

t_{1}^{'}

5.

while (

s

contains no final states)

(a) Construct $t_{k + 1}^{'}$ from $t_{k}^{'}$ using the optimized construction
(b) Let $s$ be the set of new states in the optimized construction
(c) Let $k \leftarrow k + 1$

6.

return k
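The incremental idea behind DistInpAlter can be sketched as a level-by-level search in which level k holds the product states whose error counter is k. The Python code below is a simplified illustration under stated assumptions, not the implemented algorithm: the NFA is λ-free and encoded as `(edges, start, finals)` with edges `(q, symbol, r)`; a state [i] is the tuple `(i,)` and [i, a] is `(i, a)`; no trimming is performed; and the guard `max_k` is an added assumption so that the loop also stops on languages with fewer than two words.

```python
def dist_inp_alter_sketch(nfa, alphabet, max_k=64):
    """Smallest k whose level contains a final product state, or None."""
    a_edges, a_start, a_finals = nfa
    succ = {}
    for (q, s, r) in a_edges:
        succ.setdefault((q, s), []).append(r)

    def saturate(states, k):
        # Close a level under the error-free transitions, cf. (25).
        seen, todo = set(states), list(states)
        while todo:
            st, q, q2 = todo.pop()
            for s in alphabet:
                if len(st) == 2 and st[1] == s:
                    continue  # [k, a] has no a/a transition
                for r in succ.get((q, s), []):
                    for r2 in succ.get((q2, s), []):
                        nxt = ((k,), r, r2)
                        if nxt not in seen:
                            seen.add(nxt)
                            todo.append(nxt)
        return seen

    def one_error(states, k):
        # Product states with counter k + 1 reached by one error, cf. (23), (24).
        new = set()
        for st, q, q2 in states:
            for s in alphabet:
                for r in succ.get((q, s), []):
                    if len(st) == 2:          # [k, a] --s/λ--> [k+1, a]
                        new.add(((k + 1, st[1]), r, q2))
                    elif k == 0:              # [0] --s/λ--> [1, s]
                        new.add(((1, s), r, q2))
                    else:                     # [k] --s/λ--> [k+1]
                        new.add(((k + 1,), r, q2))
                    if len(st) == 1:          # substitution s/t with t != s
                        for t in alphabet:
                            if t != s:
                                for r2 in succ.get((q2, t), []):
                                    new.add(((k + 1,), r, r2))
                if len(st) == 1 and k >= 1:   # insertion λ/s, only from [k], k >= 1
                    for r2 in succ.get((q2, s), []):
                        new.add(((k + 1,), q, r2))
        return new

    level = saturate({((0,), a_start, a_start)}, 0)
    for k in range(1, max_k + 1):
        level = saturate(one_error(level, k - 1), k)
        if any(q in a_finals and q2 in a_finals for _, q, q2 in level):
            return k
    return None

# The DFA a_3 of Figure 3, accepting 00(100)^*; encoding chosen for this sketch.
a3 = ({(0, "0", 1), (1, "0", 2), (2, "1", 0)}, 0, {2})
```

On `a3` the first final product state appears at level 3, in agreement with the statement in Section 6 that the distance of the language of a_n equals n. Only the states of the previous level are needed to build the next one, which mirrors the reuse of transitions in the optimized construction.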

Theorem 3.

Algorithm DistInpAlter computes the edit distance of the language given via a trim NFA

a

in time

O (n^{2} d r^{2}),

where d is the computed edit distance,

n = | T_{a} |

, and r is the cardinality of the alphabet used in

T_{a}

.

Proof.

The correctness of the algorithm follows from the above optimized construction and Corollary 2. For the time complexity of the algorithm, we note the following: first,

t_{1}^{'}

is constructed in time

O (| a |^{2} r^{2})

. Then,

t_{2}^{'}, \dots, t_{d}^{'}

are constructed according to the optimized construction. Each of these is constructed in time

O (| a |^{2} r^{2})

and has

O (| Q_{a} |^{2} r)

states and

O (| T_{a} |^{2} r^{2})

transitions. In addition, each

t_{k}^{'}

is tested for final states in linear time. □

Remark 4.

When a final state f, say, of

t_{k}^{'}

is created, then we know that there is an accepting path of

{ia}_{k} ↓ a ↑ a

ending at f. The label of that path is a word pair

(u, v)

such that

δ (u, v) = k

. Thus, the above algorithm can be modified to return not only the edit distance of

L (a)

, but also a witness pair for that distance.

6. Implementation and Testing

As both algorithms DistErrDetect and DistInpAlter have the same theoretical complexity, we chose to implement one of the two. We chose to implement DistInpAlter because it requires a simpler test for each constructed transducer (although

{ia}_{k}

is slightly more complex than

{sid}_{k}

, the test for partial identity, [6], is more sophisticated than testing merely for existence of final states). We have also implemented the preliminary algorithm PrelimDistInpAlter.

Our implementation uses version 1.3.5.1 of the FAdo package for automata [7], which is well maintained and provides several useful tools for manipulating automata. We have performed several tests for the correctness of these algorithms, as well as two sets of tests for the time complexity, which confirm the theoretical result that DistInpAlter is indeed faster than PrelimDistInpAlter. All tests were performed on a laptop with the following specification: Apple MacBook Pro, 2.5 GHz Intel Core i7 processor, 16.00 GB of memory (RAM), macOS High Sierra version 10.13.6. The two sets of tests correspond to two lists of DFAs,

(a_{n})

and

(b_{n})

, shown in Figure 3 and Figure 4. The first test set is such that the desired distance is equal to n, for each DFA

a_{n}

, that is, the distance grows with n and, in fact, it is a worst-case scenario where the distance is equal to the number of states of the automaton. The second test set is such that the desired distance is fixed, equal to 2, for all n.
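The automata of the second test set can be generated directly from the description in the caption of Figure 4. The Python sketch below builds the untrimmed DFA b_n with states (i, s); the caption states only the successor [i + 1, (s + i + 1) % (n + 1)], which is taken here to be the transition on bit 1, while the transition on bit 0 to [i + 1, s] is an assumption derived from the meaning of the states. The function names are illustrative.

```python
from itertools import product

def build_b(n):
    """Untrimmed DFA for the Levenshtein code of length n: (states, delta, start, finals)."""
    states = {(i, s) for i in range(n) for s in range(n + 1)} | {(n, 0)}
    delta = {}
    for i in range(n):
        for s in range(n + 1):
            for bit in "01":
                # Reading bit b_{i+1} adds (i + 1) * b_{i+1} to the checksum.
                s2 = (s + (i + 1) * int(bit)) % (n + 1)
                if (i + 1, s2) in states:   # at i = n - 1 only [n, 0] exists
                    delta[((i, s), bit)] = (i + 1, s2)
    return states, delta, (0, 0), {(n, 0)}

def accepts(dfa, w):
    """Run the DFA on w; a missing transition rejects."""
    states, delta, q, finals = dfa
    for c in w:
        q = delta.get((q, c))
        if q is None:
            return False
    return q in finals

def code_words(n):
    """All words of length n accepted by b_n."""
    dfa = build_b(n)
    return {"".join(w) for w in product("01", repeat=n) if accepts(dfa, "".join(w))}
```

For n = 4 the automaton has 4^2 + 4 + 1 = 21 states and accepts exactly 0000, 1001, 0110, and 1111, the binary words of length 4 whose weighted checksum is divisible by 5.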

Table 1 shows the actual running times (in seconds) of the two algorithms on the DFAs

a_{28}, a_{41}, a_{56}, a_{76}, a_{100}, a_{124}, a_{152}, a_{184}

. The number in parentheses next to each

a_{n}

indicates the number of transitions in

a_{n}

. The column d shows the computed edit distance, and the column

B_{a_{n}}

shows the computed upper bound on the edit distance (used in algorithm PrelimDistInpAlter).

Table 2 shows the actual running times (in seconds) of the two algorithms on the DFAs

b_{6}, \dots, b_{13}

. Again, the number in parentheses next to each

b_{n}

indicates the number of transitions in

b_{n}

.

In both test sets, the empirical outcomes confirm the asymptotic outcome that the improved algorithm based on the optimized construction is faster than the preliminary one.

7. Conclusions

This paper represents a significant improvement in the time complexity of computing the inner edit distance of a given regular language. The performance tests of the implemented algorithm show that in practice the algorithm is reasonably fast for moderate size automata. As discussed in [4], this problem is related to the inherent capability of a language to detect substitution, insertion, and deletion errors.

The two preliminary algorithms can be applied to different distances as long as these distances can be related to appropriate transducers. For some of those distances, the idea in the optimized algorithms can also be used. For example, one can construct a transducer similar to

{sid}_{k}

for insertion/deletion only errors. A direction for future research is to investigate to what extent the methods used here can be extended to compute inner weighted distances or the inner edit distance with moves.

Author Contributions

All the authors contributed equally to this research. The outcome would not be possible without the contributions of all authors.

Funding

This research was supported by the Natural Science and Engineering Research Council of Canada (NSERC) Discovery Grants R2824A01 to Lila Kari and 220259 to Stavros Konstantinidis.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

In this appendix, we present the proofs of Lemmas 4 and 5.

Proof.

(Of Lemma 4) The first statement already appears in [24]. The second statement is rather folklore, but we provide a proof here for the sake of completeness. Let

u = σ_{1} \dots σ_{n}

and

v = σ_{1} \dots σ_{m}

, where

m, n \in N_{0}

and

m < n

and all

σ_{i}

’s are in

Σ

. Then, the edit string

h = (σ_{1} / σ_{1}) \dots (σ_{m} / σ_{m}) (σ_{m + 1} / λ) \dots (σ_{n} / λ)

has weight

n - m

and

inp (h) = u

and

out (h) = v

. We show that h realizes

δ (u, v)

by proving that, for any edit string g realizing

δ (u, v)

,

weight (g) = n - m

. Indeed, first note that

weight (g) \leq weight (h) = n - m

. Let i and d be the number of insertions and deletions in g. Then,

| v | = | u | + i - d

, which implies

n - m = d - i

. Now,

weight (g) \geq d + i \geq d - i = n - m

, as required.

For the third statement, let

g_{0}

be any edit string realizing

δ (u, v)

. The following process can be used to obtain the required reduced edit string h:

1. If the first error in $g_{0}$ is a substitution, then $h = g_{0}$.
2. If the first error in $g_{0}$ is an insertion, then set $g_{0}$ to the inverse of $g_{0}$ and continue with the next step.
3. If the first error in $g_{0}$ is a deletion $(a / λ)$, then $g_{0}$ is of the form

$g_{0} = (e_{1} \dots e_{r}) (a / λ) (a_{1} / λ) \dots (a_{d} / λ) g_{0}^{'},$

where the $e_{i}$'s are non-errors, $d \in N_{0}$ and each $(a_{j} / λ)$ is a deletion, and $g_{0}^{'}$ does not start with a deletion. We have the following subcases:
- If $g_{0}^{'}$ is empty or starts with an edit operation $(σ / σ)$ in which $σ \in Σ \setminus \{a\}$, then the required h is $g_{0}$.
- If $g_{0}^{'}$ starts with an edit operation $(x / τ)$ in which $τ \in Σ \setminus \{a\}$ and $x \in Σ \cup \{λ\}$, then $g_{0}^{'}$ is of the form $g_{0}^{'} = (x / τ) g_{1}^{'}$, and the required h is
  
  $h = (e_{1} \dots e_{r}) (a / τ) (a_{1} / λ) \dots (a_{d} / λ) (x / λ) g_{1}^{'} .$
- If $g_{0}^{'}$ starts with an edit operation $(x / a)$ in which $x \in Σ \cup \{λ\}$, then it is of the form $g_{0}^{'} = (x / a) g_{1}^{'}$, and the edit string
  
  $g_{1} = (e_{1} \dots e_{r}) (a / a) (a_{1} / λ) \dots (a_{d} / λ) (x / λ) g_{1}^{'},$
  
  realizes $δ (u, v)$, as $weight (g_{1}) = weight (g_{0})$. The process now continues from the first step using $g_{1}$ for $g_{0}$.

As the edit string

g_{0}

is finite, the above process terminates with a reduced edit string h, as required. □

Proof.

(Of Lemma 5) The first statement follows when we note that the definition of

{ia}_{k}

and

{ia}_{k}^{e}

implies the following facts: (a) an edge exists between a state with error counter i to one with error counter

i + 1

, if and only if the label of that edge is an error; thus, in any path from

[0]

to

[i]

or

[i, a]

, the label of that path consists of exactly i errors; (b) any edit string accepted by

{ia}_{k}^{e}

is indeed reduced.

For the second statement, consider any reduced edit string h realizing

δ (u, v)

. We have two cases for the first error of h. If the first error in h is a deletion, then h is of the form

h = (e_{1} \dots e_{r}) (a / λ) (b_{1} / λ) \dots (b_{d} / λ) h^{'},

where each

e_{i}

is a non-error edit operation of the form

(σ_{i} / σ_{i})

,

(a / λ)

is a deletion error,

d \in N_{0}

and each

(b_{j} / λ)

is a deletion error, and

h^{'}

is an edit string that is either empty or starts with a non-error

(σ / σ)

such that

σ \neq a

. Consider the following path of

{ia}_{k}^{e}

P_{1} = [0] \overset{(e_{1} \dots e_{r})}{\to}^{*} [0] \overset{(a / λ) (b_{1} / λ) \dots (b_{d} / λ)}{\to}^{*} [1 + d, a] .

If

h^{'}

is empty, then

P_{1}

is an accepting path of

{ia}_{k}^{e}

. If

h^{'}

is nonempty, then it is of the form

h^{'} = (σ / σ) h^{″}

, for some

σ \in Σ \setminus \{a\}

. Then, by definition of

{ia}_{k}^{e}

, the following is a path of

{ia}_{k}^{e}

P_{1} ([1 + d, a] \overset{σ / σ}{\to} [1 + d] \overset{h^{″}}{\to}^{*} [1 + d + weight (h^{″})])

accepting h. For the case where the first error in h is a substitution, one verifies that again h is accepted by

{ia}_{k}^{e}

.

For the third statement, if

v \in {ia}_{k} (u)

, then

(u, v)

is the label of a path P from

[0]

to a final state

[i]

or

[i, a]

, with

0 < i \leq k

. As the label of the path

P^{e}

has exactly i errors, it follows that

δ (u, v) \leq i \leq k

.

We also need to show that

δ (u, v) \geq 1

, that is,

u \neq v

. First, consider the case where the path P ends at

[i, a]

, with

1 \leq i \leq k

. Then, the label of

P^{e}

is an edit string of the form

h = (σ_{1} / σ_{1}) \dots (σ_{r} / σ_{r}) (a / λ) (b_{1} / λ) \dots (b_{d} / λ)

and

u = inp (h) = σ_{1} \dots σ_{r} a b_{1} \dots b_{d}

and

v = out (h) = σ_{1} \dots σ_{r}

. Hence,

u \neq v

. Now, consider the case where the path P ends at state

[i]

. There are two cases: (a) the states used in the path are

[0], [1], \dots, [i]

; (b) the states used in P are

[0], [1, a], \dots, [r, a], [r], \dots, [i]

, for some appropriate

[r]

. In both cases, one verifies that

u \neq v

. For example, in case (b), u must be of the form

x a σ_{1} \dots σ_{r - 1} σ y

and v of the form

x σ z

, where the

σ_{j}

’s are symbols,

x, y, z

are words, and

σ

is a symbol other than a; hence,

u \neq v

. □

References

1. Sankoff, D.; Kruskal, J.B. (Eds.) Time Warps, String Edits, and Macromolecules: The Theory and Practice of Sequence Comparison; CSLI Publications: Stanford, CA, USA, 1999.
2. Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology; Cambridge University Press: New York, NY, USA, 1997.
3. Paluncic, F.; Abdel-Ghaffar, K.; Ferreira, H. Insertion/deletion detecting codes and the boundary problem. IEEE Trans. Inf. Theory 2013, 59, 5935–5943.
4. Konstantinidis, S. Computing the edit distance of a regular language. Inf. Comput. 2007, 205, 1307–1316.
5. Konstantinidis, S.; Silva, P. Computing maximal error-detecting capabilities and distances of regular languages. Fundam. Inform. 2010, 101, 257–270.
6. Allauzen, C.; Mohri, M. Efficient algorithms for testing the twins property. J. Autom. Lang. Comb. 2003, 8, 117–144.
7. FAdo. Tools for Formal Languages Manipulation. Available online: http://fado.dcc.fc.up.pt/ (accessed on 20 October 2018).
8. Wagner, R. Order-n correction for regular languages. Commun. ACM 1974, 17, 265–268.
9. Pighizzini, G. How hard is computing the edit distance? Inf. Comput. 2001, 165, 1–13.
10. Mohri, M. Edit-distance of weighted automata: General definitions and algorithms. Intern. J. Found. Comput. Sci. 2003, 14, 957–982.
11. Kari, L.; Konstantinidis, S.; Perron, S.; Wozniak, G.; Xu, J. Finite-State Error/Edit-Systems and Difference-Measures for Languages and Words; Report 2003-01; Mathematics and Computing Science, Saint Mary’s University: Halifax, NS, Canada, 2003.
12. Benedikt, M.; Puppis, G.; Riveros, C. The cost of traveling between languages. In ICALP 2011, Part II. LNCS 6756; Aceto, L., Henzinger, M., Sgall, J., Eds.; Springer: Heidelberg, Germany, 2011; pp. 234–245.
13. Han, Y.S.; Ko, S.K.; Salomaa, K. Computing the edit-distance between a regular language and a context-free language. In DLT 2012. LNCS 7410; Yen, H.C., Ibarra, O., Eds.; Springer: Heidelberg, Germany, 2012; pp. 85–96.
14. Han, Y.S.; Ko, S.K.; Salomaa, K. Approximate matching between a context-free grammar and a finite-state automaton. In CIAA 2013. LNCS 7982; Konstantinidis, S., Ed.; Springer: Heidelberg, Germany, 2013; pp. 146–157.
15. Takabatake, Y.; Nakashima, K.; Kuboyama, T.; Tabei, Y.; Sakamoto, H. siEDM: An Efficient String Index and Search Algorithm for Edit Distance with Moves. Algorithms 2016, 9, 26.
16. Ng, T. Prefix Distance Between Regular Languages. In Proceedings of the 21st CIAA, Seoul, South Korea, 19–22 July 2016; Volume 9705, pp. 224–235.
17. Berstel, J. Transductions and Context-Free Languages; B.G. Teubner: Stuttgart, Germany, 1979.
18. Wood, D. Theory of Computation; John Wiley & Sons: New York, NY, USA, 1987.
19. Rozenberg, G.; Salomaa, A. (Eds.) Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997.
20. Yu, S. Regular Languages. Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997; pp. 41–110.
21. Sakarovitch, J. Elements of Automata Theory; Cambridge University Press: Cambridge, UK, 2009.
22. Mateescu, A.; Salomaa, A. Formal Languages: an Introduction and a Synopsis. Handbook of Formal Languages, Vol. I; Springer: Berlin, Germany, 1997; pp. 1–39.
23. Kari, L.; Konstantinidis, S. Descriptional Complexity of Error/Edit Systems. J. Autom. Lang. Comb. 2004, 9, 293–309.
24. Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Sov. Phys. Dokl. 1966, 10, 707–710.
25. Yang, M. Application and Implementation of Transducer Tools in Answering Certain Questions about Regular Languages. Master’s Thesis, Department Mathematics and Computing Science, Saint Mary’s University, Halifax, NS, Canada, 2012.
26. Schrijver, A. Combinatorial Optimization: Polyhedra and Efficiency; Springer Science & Business Media: Berlin, Germany, 2003.
27. Konstantinidis, S. Transducers and the Properties of Error-Detection, Error-Correction and Finite-Delay Decodability. J. Univ. Comput. Sci. 2002, 8, 278–291.
28. Béal, M.; Carton, O.; Prieur, C.; Sakarovitch, J. Squaring transducers: An efficient procedure for deciding functionality and sequentiality. Theor. Comput. Sci. 2003, 292, 45–63.
29. Shyr, H.; Thierrin, G. Codes and Binary Relations. In Séminaire d’Algèbre Paul Dubreil, Paris 1975–1976 (29ème Année); Lecture Notes in Mathematics; Springer: Berlin/Heidelberg, Germany, 1975; pp. 180–188.
30. Konstantinidis, S. Applications of Transducers in Independent Languages, Word Distances, Codes. In Proceedings of the DCFS 2017, Milano, Italy, 3–5 July 2017; Volume 10316, pp. 45–62.

Figure 1. The input-preserving transducer

{sid}_{k}

realizing the channel

sid (k)

. Each edge label

σ / σ

represents many transitions, one for each symbol

σ

of the alphabet, and similarly for

σ / λ

and

λ / σ

. Each edge label

σ / τ

represents many transitions, one for each pair of distinct symbols

σ

and

τ

from the alphabet. Thus, if the alphabet size is r, then the transducer has

O (k)

states and

O (r^{2} k)

transitions.

Figure 2. A segment of the input-altering transducer

{ia}_{k}

: for each

a \in Σ

the complete transducer has k states of the form

[i, a]

. The labels

σ

and

τ

on an edge mean: one edge for each

σ, τ \in Σ

with

σ \neq τ

; for some edge sets, additional restrictions apply denoted, for example, by

∣_{σ \neq a}

.

Figure 3. The automaton

a_{n}

accepting the language

0^{n - 1} {(10^{n - 1})}^{*}

.

Figure 4. The automaton

b_{n}

accepting the Levenshtein code, which consists of all binary words

b_{1} \dots b_{n}

of length n such that

(\sum_{i = 1}^{n} i \cdot b_{i}) % (n + 1) = 0

, where ‘%’ is the integer division remainder operation. This code has edit distance equal to 2. On the other hand, its distance for insertion/deletion errors only is 3. The automaton (before making it trim) has

n^{2} + n + 1

states:

[n, 0]

and

[i, s]

, with

0 \leq i \leq n - 1

and

0 \leq s \leq n

. The meaning of state

[i, s]

is that the automaton has read i bits

b_{1} \dots b_{i}

and

s = 1 \cdot b_{1} + \dots + i \cdot b_{i}

. We have that

f (i, s) = [i + 1, (s + i + 1) % (n + 1)]

.

Table 1. Outcomes of performance tests on the automata

(a_{n})

.

DFA	d	$B_{a_{n}}$	PrelimDistInpAlter	DistInpAlter
$a_{28} (28)$	28	28	0.696s	0.078s
$a_{41} (41)$	41	41	3.977s	0.267s
$a_{56} (56)$	56	56	12.811s	0.691s
$a_{76} (76)$	76	76	52.086s	1.885s
$a_{100} (100)$	100	100	159.370s	4.841s
$a_{124} (124)$	124	124	354.306s	10.643s
$a_{152} (152)$	152	152	998.438s	21.991s
$a_{184} (184)$	184	184	2294.636s	43.484s

Table 2. Outcomes of performance tests on the automata

(b_{n})

.

DFA	d	$B_{b_{n}}$	PrelimDistInpAlter	DistInpAlter
$b_{6} (28)$	2	3	0.112s	0.076s
$b_{7} (41)$	2	3	0.302s	0.196s
$b_{8} (56)$	2	4	0.731s	0.436s
$b_{9} (76)$	2	3	1.621s	0.927s
$b_{10} (100)$	2	3	3.223s	1.844s
$b_{11} (124)$	2	4	5.673s	3.238s
$b_{12} (152)$	2	5	61.416s	5.892s
$b_{13} (184)$	2	4	16.624s	9.272s

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Kari, L.; Konstantinidis, S.; Kopecki, S.; Yang, M. Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers. Algorithms 2018, 11, 165. https://doi.org/10.3390/a11110165

AMA Style

Kari L, Konstantinidis S, Kopecki S, Yang M. Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers. Algorithms. 2018; 11(11):165. https://doi.org/10.3390/a11110165

Chicago/Turabian Style

Kari, Lila, Stavros Konstantinidis, Steffen Kopecki, and Meng Yang. 2018. "Efficient Algorithms for Computing the Inner Edit Distance of a Regular Language via Transducers" Algorithms 11, no. 11: 165. https://doi.org/10.3390/a11110165
