*2.4. The DNA Pattern Matching Algorithm*

One of the most recurrent and widely studied problems in computer science is pattern matching—this problem has several real-world applications such as fast sub-string searching for network intrusion detection, mail spam filters, protein motif search and DNA/RNA sequence alignments [22]. Given a text string *T* of length *n* and a pattern string *P* of length *m* ≤ *n*, the pattern matching problem can be stated as retrieving all positions *i* where pattern *P* occurs in text *T*, such that 0 ≤ *i* ≤ *n* − *m*.

A straightforward solution for the pattern matching problem consists of looking for the pattern sequence in the text position by position until every occurrence is found. Unfortunately, such an approach leads to a *O*(*m* · *n*) asymptotic complexity, which is not acceptable for large sets of data.

Given the practical relevance of this problem, many approaches were proposed in the literature for improving the naïve way. One of these is the *Boyer-Moore* algorithm [23,24], which trades space usage for time efficiency, defining rules for pruning the search space avoiding the exploration of all text positions. A C++ implementation is available in Reference [25].

Figure 2 provides an intuition for this approach; given the text in the picture, the first attempt looks for pattern "GTA" in position 0 , which is not correct.

**Figure 2.** Intuition of the Boyer-Moore search procedure.

The naïve approach would perform the next search from position 1 , but this is not ideal since the first instance of the letter "G" in the pattern occurs at position 3 in the text, meaning that searching any position in the middle is useless. Implementing this optimization requires pre-processing of the pattern to be matched; a *shift table* is computed, storing the number of text positions that can be safely skipped for each symbol in the target alphabet. Whenever a mismatch is found, given the next symbol to be searched, the *shift table* is accessed and the next position to be considered is computed.

We used a refined version of the *Boyer-Moore* algorithm, also known as *Fast string matching method for Encoded DNA sequences (FED)* [26], which takes advantage of the low-cardinality of the DNA alphabet. In the *FED* version, each of the four symbols composing the DNA alphabet is assigned a unique 2-bit code, packing four elements into a single byte, padding last bits with zeros in the case of sequences where the length is not a multiple of 4. Additionally, a bit-mask is used to distinguish valid bits from padding in the last encoded byte.

The procedure consists of two successive steps:


Figure 3 summarizes the string matching procedure flow. As long as the customised *Boyer-Moore* procedure can perform a matching operation on encoded sequences, the encoding step can be considered not part of the algorithm as it can be done offline by storing the encoded sequences in custom binary files which constitute the actual source of data for the pattern matching engine.

**Figure 3.** Flowchart of the string matching algorithm.
