Article

Detection of Unknown Polymorphic Patterns Using Feature-Extracting Part of a Convolutional Autoencoder

by
Przemysław Kucharski
* and
Krzysztof Ślot
Institute of Applied Computer Science, Lodz University of Technology, Stefanowskiego 18, 90-537 Lodz, Poland
*
Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(19), 10842; https://doi.org/10.3390/app131910842
Submission received: 28 July 2023 / Revised: 16 September 2023 / Accepted: 19 September 2023 / Published: 29 September 2023
(This article belongs to the Special Issue Applications of Artificial Intelligence in Biomedical Data Analysis)

Abstract

Background: The present paper proposes a novel approach for detecting the presence of unknown polymorphic patterns in random symbol sequences that also comprise already known polymorphic patterns. Methods: We propose to represent rules that define the considered patterns as regular expressions and show how these expressions can be modeled using filter cascades of neural convolutional layers. We adopted a convolutional autoencoder (CAE) as a pattern detection framework. To detect unknown patterns, we first incorporated knowledge of known rules into the CAE’s convolutional feature extractor by fixing weights in some of its filter cascades. Then, we executed the learning procedure, where the weights of the remaining filters were driven by two different objectives. The first was to ensure correct sequence reconstruction, whereas the second was to prevent weights from learning the already known patterns. Results: The proposed methodology was tested on sample sequences derived from the human genome. The analysis of the experimental results provided statistically significant information on the presence or absence of polymorphic patterns that were not known in advance. Conclusions: The proposed method was able to detect the existence of unknown polymorphic patterns.

1. Introduction

Recent trends in machine learning indicate an emerging need for methods and models capable of incorporating explicitly formulated knowledge. This is especially important when training data are scarce and a shortage of available information needs to be compensated for with other means. The integration of knowledge and learning is crucial for enabling progress in several challenging domains, such as in the field of medical research and, in particular, in motif detection [1]. Motif detection is one of the most challenging tasks in data analysis due to the polymorphic nature of patterns that can encode genomic information and the scarcity of data, which severely impairs learning feasibility.
The present paper is concerned with research on the detection of polymorphic patterns in sequences by utilizing prior knowledge to facilitate this difficult task. The polymorphic sequences considered in this paper are sequences generated by regular expressions with flexible rules, resulting in a high diversity of valid, alternative sequence instances. A particular research question that we attempted to address was the following: having a set of sequences, each comprising several polymorphic patterns, where some of these match known flexible rules (defined via regular expressions), is it possible to determine the presence of previously unknown ones?
In what follows, we present a method that enables polymorphic pattern detection in sequences. The key elements of the proposed method are the integration of learning with prior knowledge and the introduction of mechanisms that discourage the learning of concepts that are already known. We assumed that polymorphic pattern-describing rules can be, at least partially, represented as filter cascades of a convolutional neural network. To discourage the learning of known patterns, we defined an appropriate loss component that evaluates the similarity of filter-represented pattern detectors and used this information to guide training. We evaluated the method on a dataset extracted from human genome data and showed that it enabled the statistically significant detection of unknown polymorphic patterns.
A significant amount of research has addressed the problem of detecting patterns (motifs) in sequences, including methods that aim to disentangle the patterns recognized by convolutional neural networks. Liang et al. [2] proposed a training method for classification models that makes convolutional filters class-specific. They used an additional mask that assigned filters to classes and then optimized that mask in two-phase training. Although this was an interesting approach, its main contribution was the disentanglement of knowledge by minimizing the number of filters sharing class-specific knowledge, a step towards better interpretability; however, it did not expand upon the internal representation of that knowledge.
Koo et al. [3] examined representations of genomic motifs in CNNs. They searched for motifs in first-layer convolutional filters, transforming them into position–weight matrices based on each filter's response to specific samples. This work exemplifies a common approach to representing genomic patterns in the filters of convolutional networks. It should be emphasized that this approach focuses on a one-layer representation, which constrains both the complexity of the discovered patterns and the analysis of interactions between them.
Zhang et al. [4] proposed a method for updating a specific subfilter cascade chosen dynamically during training to produce more diverse convolutional filters and reduce overlap in representation. More general treatments of this problem include structuring a network to resemble a knowledge base, which can be achieved either manually [5] or through a training process [6]. In the case of relational neural machines (RNMs) [7], knowledge is provided through optimization constraints, thus becoming part of the parametric regularization rather than the learner structure.
The use of convolutional neural networks has also emerged in omic studies, for example, in protein design [8]. Ingraham et al. [9] proposed a novel model named Structured Transformer based on a self-attention mechanism. They introduced prior knowledge into the model formulated as a graph of relations between amino acids. Sabban and Markovsky [10] used an LSTM to generate novel proteins, with the addition of a GAN-like classifier scheme to determine acceptable designs. Bepler and Berger [11] used a BiLSTM to learn protein sequence embeddings based on structural similarity; they proposed a mechanism based on soft alignment and attention and successfully predicted the structural embeddings. Hie et al. [12] proposed a way of quantifying prediction uncertainty and used it in the training loop, thus reducing the volume of data needed. Ding et al. [13] proposed an interesting approach to transforming convolutional kernels into position–weight matrices.
The two widely used motif-detection methods that do not resort to deep learning are DREME (discriminative regular expression motif elicitation) [14] and MEME (multiple EMs for motif elicitation) [15]. The MEME method uses a probabilistic approach to generate a position-specific weight matrix of a recurring, ungapped motif, while DREME searches for consensus sequences using an enumeration approach [16]. Due to their frequent use, these methods are natural candidates for a baseline; however, unlike our approach, they do not target the problem of finding polymorphic patterns.
Among the methods that focus on finding motifs in sequences, algorithms such as ListMotif [17] and YMF [18] should be mentioned, as they use an enumerative approach. In the family of probabilistic methods, the approaches worth noting are EXTREME [19] and STEME [20], both of which use the same principle as the MEME method. Soft computing approaches, such as ant colony optimization, have also been considered for handling the problem [21].
The main contribution of the proposed approach is that it addresses two limitations of existing methods: poor handling of the mutability of the patterns to be detected and poor use of knowledge about known, possibly polymorphic, patterns that may exist in the sequences to be analyzed. To cope with the first problem, we propose a flexible pattern representation scheme; to solve the second, we inject knowledge into the network (by means of preset convolutional filter cascades) and introduce a learning mechanism that enables the efficient capture of novelties (by discouraging the learning of what is already known). The ideas proposed in this paper provide the first, to our knowledge, effective way of detecting the presence of unknown, latent data generation mechanisms that are difficult to discover due to the flexibility of their underlying data-generation rules.

2. Materials and Methods

2.1. The Proposed Polymorphic-Pattern Detection Methodology

As patterns to be detected are unknown, an unsupervised learning strategy was adopted to solve the problem. The framework of the proposed concept is provided by the convolutional autoencoder (CAE), and the core module involved in pattern detection is CAE’s cascade of convolutional filters. A part of this structure is preset to encode knowledge on known polymorphic sequences (we refer to it as the Fixed Convolutional Module—FCM), and the remaining part is expected to learn any previously unknown regularities that exist in data (we refer to it as the Learnable Convolutional Module—LCM).
The overall computational architecture used for polymorphic pattern detection, together with the adopted two-layer convolutional pattern-detecting filter cascade, is depicted in Figure 1.

2.2. Input Sequences

Sequences to be analyzed are short samples of fixed length L extracted from the human genome, composed of symbols from the four-element alphabet {a, c, g, t}. For the purpose of subsequent processing, each symbol is represented using one-hot encoding. The dataset contains sequences that match polymorphic patterns defined by regular expressions. Sample regular expressions, along with the adopted encoding convention, where the expression-matching strings are highlighted, are shown in Figure 2. The regular expressions are formulated in accordance with the specification of the Python re library (https://docs.python.org/3/library/re.html, accessed on 1 June 2023). The motif-detection scenario considered in the reported work assumes that the sequences to be analyzed comprise patterns generated by two known, flexible rules and, in addition, can also contain an unknown, polymorphic pattern. We transfer knowledge of the known patterns by fixing the FCM parameters of the proposed architecture.
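As an illustration of this setup, the minimal Python sketch below one-hot encodes a short fragment over {a, c, g, t} and scans it with a sample flexible rule using the standard re module. The specific expression a.gt.g and the helper one_hot are illustrative only and are not taken from the paper's actual pattern set.

```python
import re
import numpy as np

ALPHABET = "acgt"

def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as an L x 4 one-hot matrix over {a, c, g, t}."""
    enc = np.zeros((len(seq), len(ALPHABET)), dtype=np.float32)
    for i, s in enumerate(seq):
        enc[i, ALPHABET.index(s)] = 1.0
    return enc

# Illustrative polymorphic rule (not one of the paper's actual expressions):
# 'a', any symbol, 'g', 't', any symbol, 'g' -> e.g. "acgttg", "aagtag", ...
rule = re.compile(r"a.gt.g")

sample = "ttacgttgcaaagtagcc"
print([m.span() for m in rule.finditer(sample)])  # positions of matching chunks
print(one_hot(sample).shape)                      # (18, 4)
```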
To quantify the level of pattern polymorphism, we propose the following metric:

$$
m = \frac{|P|_0}{|A|_0^{\,l}}
$$

where $|P|_0$ is the zero-norm of the set of patterns that can be produced by the considered, underlying pattern-generating rule; $A$ is the alphabet over which the pattern is constructed; and $l$ is the length of the pattern. This metric is closely correlated with the difficulty of the problem at hand; as can be observed in Table 1, its minimum value is $\frac{1}{|A|_0^{\,l}}$ for a non-mutable pattern and its maximum value is one for random noise.
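The metric can be computed directly from a position-wise description of a rule. The short sketch below, built around the hypothetical helper mutability, reproduces the first three values of Table 1 under the assumption that $|P|_0$ simply counts all distinct strings the rule can generate.

```python
ALPHABET = "acgt"

def mutability(pattern_positions, alphabet=ALPHABET):
    """m = |P|_0 / |A|_0^l, each position given as the set of admissible symbols."""
    l = len(pattern_positions)
    n_instances = 1
    for allowed in pattern_positions:
        n_instances *= len(allowed)
    return n_instances / len(alphabet) ** l

# Patterns from Table 1, written position-by-position ('.' = any symbol).
any_sym = set(ALPHABET)
print(mutability([{"a"}] * 6))                                               # ~0.0002
print(mutability([{"a"}, {"a"}, any_sym, any_sym, {"a"}, {"a"}]))            # ~0.004
print(mutability([{"c"}, {"c"}, any_sym, {"a", "c", "t"}, any_sym, {"c"}]))  # ~0.012
```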

2.3. Polymorphic Pattern Representation with Convolutional Filters

To introduce knowledge into the convolutional modules, we developed a method for representing patterns defined via regular expressions using convolutional filter parameters. Convolutional filters can be seen as pattern-detecting operators that produce the maximum output for inputs that match the filters' weights. It is reasonable to use a cascade of simple filters, arranged in conventional convolutional layers, to provide a flexible representation of polymorphic patterns of arbitrary length. The first-layer filters can be designed to capture different short patterns that comply with local rules defined by a given regular expression, whereas subsequent layers combine these short chunks into longer strings in accordance with the considered rules.
The simplicity of defining filter cascades that specialize in detecting specific input patterns enables the straightforward incorporation of initial knowledge into the structure of convolutional neural networks. For example, to represent patterns generated by the sample regular expression provided in Figure 3, the first-layer filters need to have a receptive field of size $w \times h \times d = 3 \times 1 \times 4$, where the width ($w$) enables capturing three-element strings, the depth ($d$) reflects the one-hot encoding used for the considered four-letter alphabet, and the height ($h$) is set to unity because one-dimensional input is considered. To match the symbol-generating rules used by a regex, three different values can be assigned to each weight of a filter: if some symbol is encouraged at a given position of a string, the corresponding filter weight is set to '1'; if it is forbidden, the corresponding weight is set to '−1'; and if it is neither encouraged nor forbidden, the weight is set to '0'.
The second-layer filters, which merge short segments detected by the first-layer filters into longer chunks, require a depth that enables integrating all relevant first-layer detectors. For the example presented in Figure 3, the required depth d = 2 allows for combining the results produced by the two three-symbol string detectors (two first-layer filters) into a six-element string detector. The span of the receptive field in the second-layer filters must enable seamless string stitching, which can be achieved by ensuring sufficient filter width (as shown in Figure 3) or by employing appropriate weight dilation. The rules for setting the second-layer filters' weights, which transform the outputs of the first-layer filters, are the same as those presented for the first-layer filters. The method can be scaled both in terms of the number and size of patterns and in terms of rule complexity by adding new layers of filters.
It should be emphasized that the size of filters does not define the exact length of the pattern but only constrains its maximum length, as the proposed methodology allows for patterns that contain any character at each position, including the ends. Therefore, shorter patterns can be injected by filling the trailing positions with ‘zeros’ (any symbol is allowed).
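To make the presetting rule concrete, the following PyTorch sketch builds a two-layer cascade for the illustrative pattern A.GT.G (cf. the AxGTxG pattern of Figure 3), split into the chunks 'a.g' and 't.g': encouraged symbols receive weight +1, 'any symbol' positions stay at 0, and the second layer stitches the chunks three positions apart. The kernel sizes and the helper chunk_filter are assumptions made for illustration; forbidden symbols would analogously receive −1.

```python
import torch
import torch.nn as nn

ALPHABET = "acgt"  # one-hot channel order

def chunk_filter(chunk: str) -> torch.Tensor:
    """Map a 3-symbol chunk (e.g. 'a.g') to a 4x3 weight matrix: +1 encouraged, 0 for 'any'."""
    w = torch.zeros(len(ALPHABET), len(chunk))
    for pos, sym in enumerate(chunk):
        if sym != ".":                  # '.' means any symbol -> leave the column at 0
            w[ALPHABET.index(sym), pos] = 1.0
    return w

# Layer 1: two preset 3-wide filters over the 4 one-hot channels ('a.g' and 't.g').
layer1 = nn.Conv1d(in_channels=4, out_channels=2, kernel_size=3, bias=False)
with torch.no_grad():
    layer1.weight[0] = chunk_filter("a.g")
    layer1.weight[1] = chunk_filter("t.g")

# Layer 2: one preset filter that stitches the two chunks into 'a.gt.g'
# (the second chunk must start 3 positions after the first).
layer2 = nn.Conv1d(in_channels=2, out_channels=1, kernel_size=4, bias=False)
with torch.no_grad():
    layer2.weight.zero_()
    layer2.weight[0, 0, 0] = 1.0        # first-layer filter 0 at offset 0
    layer2.weight[0, 1, 3] = 1.0        # first-layer filter 1 at offset 3

# Freeze the preset (FCM-like) cascade so training cannot overwrite the injected knowledge.
for p in (*layer1.parameters(), *layer2.parameters()):
    p.requires_grad = False
```

Feeding a one-hot encoded sequence (shaped batch x 4 x length) through this cascade yields its maximum response at positions where a string matching the illustrative rule begins.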

2.4. Regular Expression Reconstruction

To search for unknown polymorphic patterns that are embedded in sequences comprising runs of random symbols together with known, possibly polymorphic strings, we initialize our algorithm with knowledge provided in the form of a cascade of appropriately preset filters, as explained in the previous subsection.
Several additional filters, with parameters to be learned from data, are also placed in the convolutional layers of the proposed algorithm to detect unknown patterns. The structure of the convolutional pattern detectors can be bundled in the form of the position–weight tensor $R$ [22] of the following form:
$$
R = \begin{bmatrix}
\big[\, \theta_1 \;\; \theta_2 \;\; \cdots \;\; \theta_n \,\big] \odot \big[\, \gamma_{1,1} \;\; \gamma_{1,2} \;\; \cdots \;\; \gamma_{1,n} \,\big] \\
\vdots \\
\big[\, \theta_1 \;\; \theta_2 \;\; \cdots \;\; \theta_n \,\big] \odot \big[\, \gamma_{m,1} \;\; \gamma_{m,2} \;\; \cdots \;\; \gamma_{m,n} \,\big]
\end{bmatrix},
\qquad
\theta_i = \begin{bmatrix}
\theta_{i1}^{A} & \theta_{i2}^{A} & \theta_{i3}^{A} \\
\theta_{i1}^{C} & \theta_{i2}^{C} & \theta_{i3}^{C} \\
\theta_{i1}^{G} & \theta_{i2}^{G} & \theta_{i3}^{G} \\
\theta_{i1}^{T} & \theta_{i2}^{T} & \theta_{i3}^{T}
\end{bmatrix}
$$

where $\theta_{xy}^{z}$ is the weight at the $y$th position and channel $z$ of the $x$th first-layer filter, $\gamma_{x,y}$ is the weight at the $y$th position (input channel) of the $x$th second-layer filter, and the symbol $\odot$ denotes component-wise multiplication, i.e.,

$$
\begin{bmatrix} \theta_1 & \theta_2 \end{bmatrix} \odot \begin{bmatrix} \gamma_1 & \gamma_2 \end{bmatrix}
= \begin{bmatrix} \gamma_1 \theta_1 & \gamma_2 \theta_2 \end{bmatrix}
$$
The first-layer filters, $i = 1, \dots, n$, with weights arranged into matrices $\theta_i$, are combined by the second-layer filters (from 1 through $m$) with entries $\gamma_{i,j}$. The constructed PWM provides a basis for identifying a regular expression that emerges from the weights of a trained network.
The reconstruction follows the scheme provided in Equation (4), which involves weight normalization, aimed at converting learned weights into probabilities (transformation of R into R″) followed by thresholding, which is expected to produce an unambiguous basis for the reconstruction of the detected regular expression. Here, the symbols U and W denote the position-wise minimum of R and the position-wise sum of R′, respectively.
$$
\begin{aligned}
R'   &= R + 2|U|, & U_i &= \min_{j \in J} R_{ij}, & i &= 1, 2, \dots, n \\
R''  &= R' / W,   & W_i &= \sum_{j \in J} R'_{ij}, & i &= 1, 2, \dots, n \\
R''' &= f(R''),   & f(r''_{ij}) &= \begin{cases} 1 & r''_{ij} > th_{upper} \\ -1 & r''_{ij} < th_{lower} \\ 0 & \text{otherwise} \end{cases}
\end{aligned}
$$
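A direct NumPy rendering of this normalization-and-thresholding step might look as follows; the threshold values th_upper and th_lower are illustrative, as the paper does not report the exact settings.

```python
import numpy as np

def reconstruct_discrete(R: np.ndarray, th_upper: float = 0.4, th_lower: float = 0.1):
    """Convert a learned position-weight block into a -1/0/+1 pattern description.

    R has one row per pattern position and one column per alphabet symbol.
    Threshold values are illustrative; the paper does not state them.
    """
    U = R.min(axis=1, keepdims=True)      # position-wise minimum
    R1 = R + 2.0 * np.abs(U)              # shift weights to be non-negative
    W = R1.sum(axis=1, keepdims=True)     # position-wise sum
    R2 = R1 / W                           # normalize rows into pseudo-probabilities
    out = np.zeros_like(R2)
    out[R2 > th_upper] = 1.0              # encouraged symbol
    out[R2 < th_lower] = -1.0             # forbidden symbol
    return out
```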
The reconstructed regular expression $RE$ is defined as a tree, built of shorter chunks $R_S$ encoded by the first-layer filters:

$$
\forall\, s_i \in S:\; s_i \in R_i \subset R \subset RE, \qquad RE = \{R_S\} \diamond \{\diamond, \diamond\}
$$

where $\diamond$ denotes a recursive tree operator, and $s_i$ is a symbol at the $i$th position of a string $S$ that is admissible at this position of the regular expression $R$ (i.e., $R_i$), which is a specific instance of an expression $RE$.

2.5. Learning Outcome Evaluation

We expect that throughout learning, any new polymorphic, unknown patterns present in input sequences will be learned by learnable weights of the convolutional module. As described in the previous section, these patterns can be subsequently reconstructed from the learned weights.
To measure the similarity of rules that generate patterns, we use the mean Levenshtein distance [23] applied to pairs of sequences produced by the considered regular expressions, i.e., target sequences $S^T = \{s_1^T, \dots, s_n^T\}$ generated by the rule $R^T$ to be learned and sequences $S^L = \{s_1^L, \dots, s_m^L\}$ generated by the rule $R^L$ that has actually been learned, reconstructed from the learned filter cascade weights:

$$
d(R^T \!\to\! S^T,\; R^L \!\to\! S^L) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} d_{i,j}(s_i^T, s_j^L)
$$
The Levenshtein (edit) distance $L_{i,j}$ between sequences $s_i^T$ and $s_j^L$ is defined as

$$
L_{i,j} = Lev_{M,N}, \qquad
\begin{aligned}
Lev_{m,0} &= m \\
Lev_{0,n} &= n \\
Lev_{m,n} &= \min \begin{cases}
Lev_{m,\,n-1} + c_{ins}(i_m) \\
Lev_{m-1,\,n} + c_{del}(j_n) \\
Lev_{m-1,\,n-1} + c_{sub}(i_m, j_n)
\end{cases}
\end{aligned}
$$

where $M = |s_i^T|_0$ and $N = |s_j^L|_0$ are the dimensions of the distance matrix $Lev$, and $c_{ins}$, $c_{del}$, and $c_{sub}$ are the insertion, deletion, and substitution costs, respectively. The distance between sequences $i$ and $j$ is found at position $(M, N)$ of the matrix $Lev$ and represents the minimum number of insertion, deletion, and substitution operations necessary to match one string to the other.
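A minimal Python implementation of this evaluation, assuming unit insertion, deletion, and substitution costs (the paper does not state its cost values), could look as follows.

```python
import numpy as np

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance with unit costs."""
    M, N = len(s), len(t)
    lev = np.zeros((M + 1, N + 1), dtype=int)
    lev[:, 0] = np.arange(M + 1)
    lev[0, :] = np.arange(N + 1)
    for m in range(1, M + 1):
        for n in range(1, N + 1):
            c_sub = 0 if s[m - 1] == t[n - 1] else 1
            lev[m, n] = min(lev[m, n - 1] + 1,          # insertion
                            lev[m - 1, n] + 1,          # deletion
                            lev[m - 1, n - 1] + c_sub)  # substitution
    return int(lev[M, N])

def mean_rule_distance(seqs_target, seqs_learned) -> float:
    """Mean pairwise edit distance between sequences generated by two rules."""
    d = [levenshtein(s, t) for s in seqs_target for t in seqs_learned]
    return float(np.mean(d))

# e.g. sequences sampled from the target rule vs. from the reconstructed (learned) rule
print(mean_rule_distance(["acgttg", "aagtag"], ["acgtcg", "tcgttg"]))
```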

2.6. Disentanglement of Novel Rules from Preset Knowledge

Learning of rules that underlie new polymorphic patterns might be impaired by the influence of patterns already known to the network. Therefore, we consider an additional learning scenario, where learnable convolutional filters are discouraged from discovering knowledge already injected into the network via FCM filters.
To focus on learning new patterns, we use the adopted way of evaluating the proximity of regular expressions, which can also be seen as a way of measuring the similarity between filter cascades that encode regular expressions. Adopting this perspective, one can easily introduce additional mechanisms to control the training of filter cascades that are to learn unknown polymorphic patterns. In particular, one may hypothesize that learning entirely new concepts would be more efficient if learned weight representations were uncorrelated with existing knowledge, embedded in preset filter cascades. Therefore, we propose to augment the training procedure with an additional step aimed at maximizing dissimilarity between cascades that are to be learned and knowledge-encoding cascades.
To implement this objective, at each training iteration, we evaluate the proximity of the filter cascades preset with prior knowledge and the filter cascades that are to learn unknown patterns. Given a known regular expression $R_k^F$, encoded by the $k$th filter cascade of the FCM, and a regular expression $R_l^L$, decoded from the $l$th filter cascade of the LCM, we generate the two corresponding sets of sequences, $S_k$ and $S_l$. The similarity of the regular expressions $R_k^F$ and $R_l^L$ is then quantified using the mean edit distance (6), which, for the considered case, adopts the form

$$
d(R_k^F, R_l^L) = \frac{1}{|S|} \sum_{(s_i, s_j) \in S_k \times S_l} d(s_i^F, s_j^L)
$$

where $\times$ denotes the Cartesian product of the sets $S_k$ and $S_l$.
The evaluated similarity is then used to modify filter weight updates so as to favor learning regular expressions that are dissimilar to the known ones, as presented in the listing of Algorithm 1. Training of convolutional filters proceeds differently for FCM and LCM neurons. The weights of neurons that make up the knowledge-representing filter cascades are fixed. For learnable module filters, weight updates are driven by scaled gradients, where the scaling favors increasing dissimilarity between the known and learned regular expressions. This is implemented using an iterative two-step procedure. The objective of the first, 'trial' step is to calculate new, candidate filter weights using basic backpropagation. A 'virtual' pattern $V_k^L$ reconstructed from the updated candidate filters is then confronted with all patterns generated by the preset rules $R_l^F$. If the between-pattern distances $d(V_k^L, R_l^F)$ increase, all gradients for the learnable cascade $k$ are amplified, i.e.,
$$
\forall\, g_{i,j} \in G: \quad g_{i,j} = g_{i,j} \cdot d(V_i^L, R_j^F)
$$
and the final weight updates are made using the old weights and the modified gradients. However, if the virtual update increases the similarity between the learned and preset expressions (the distance $d(V_k^L, R_l^F)$ decreases), the gradients are attenuated, i.e.,
$$
\forall\, g_{i,j} \in G: \quad g_{i,j} = \frac{g_{i,j}}{d(V_i^L, R_j^F)}
$$
where G is the matrix of gradients backpropagated to the considered filter in the network, and i, j are specific positions in this filter.
Algorithm 1 Modification of the backpropagation algorithm

    ts ← weights − lr · grad                ▷ ts: trial optimizer step
    V^L ← reconstruct(ts)
    for V_k^L in V^L do
        for R_l^F in R^F do
            if d(V_k^L, R_l^F) > d(R_k^L, R_l^F) then
                grad[V_k^L] ← amplify(grad[V_k^L])
            else
                grad[V_k^L] ← attenuate(grad[V_k^L])
            end if
        end for
    end for
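A PyTorch-flavored sketch of this modified update for a single learnable cascade is given below. The helpers reconstruct_rule and rule_distance stand in for the reconstruction procedure of Section 2.4 and the mean edit distance of Section 2.5, and the attenuation is rendered as division by the distance; these are assumptions made for illustration, not the authors' exact implementation.

```python
import torch

def scaled_backprop_step(cascade_params, lr, reconstruct_rule, rule_distance,
                         learned_rule, preset_rules):
    """One modified update for a learnable filter cascade (sketch of Algorithm 1).

    cascade_params : tensors with .grad already populated by backpropagation
    reconstruct_rule, rule_distance : placeholders for the regex reconstruction
        (Section 2.4) and mean edit distance (Section 2.5) procedures
    learned_rule   : rule currently encoded by the cascade
    preset_rules   : rules injected into the fixed (FCM) cascades
    """
    with torch.no_grad():
        # 'Trial' step: candidate weights obtained by plain gradient descent.
        trial = [p - lr * p.grad for p in cascade_params]
        virtual_rule = reconstruct_rule(trial)

        for preset in preset_rules:
            d_virtual = rule_distance(virtual_rule, preset)
            d_current = rule_distance(learned_rule, preset)
            # Moving away from the preset rule -> amplify; moving closer -> attenuate.
            scale = d_virtual if d_virtual > d_current else 1.0 / max(d_virtual, 1e-8)
            for p in cascade_params:
                p.grad.mul_(scale)

        # Final update uses the old weights and the rescaled gradients.
        for p in cascade_params:
            p.sub_(lr * p.grad)
```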

2.7. Experiments

The proposed polymorphic pattern detection computational architecture was trained on a dataset made up of one hundred 40-element-long sequences selected from human genomic data (the small number of samples was motivated by the scarcity of genomic data representing rare patterns). These samples were used as input for training the proposed CAE model, where the objective was to tune CAE’s weights (including LCM parameters) to maximize sequence reconstruction accuracy.
The datasets used in this work were extracted from the GRCh38.p14 assembly by searching for sequences that contained two or three patterns exactly matching the regular-expression-encoded rules. An additional analysis was performed on the extracted sequences to ensure that no extra motifs were present in the data. This was done using DREME with the word length set to 3; if an additional motif was detected, the corresponding sample was removed from the dataset.
Different scenarios considered in experiments involved different combinations of regular expressions that are known to the network (two out of three were assumed to be known), the presence or absence of patterns generated by an unknown rule, and optional use of the proposed backpropagation algorithm modification.

2.8. Network Architecture and Training

The convolutional autoencoder of the architecture depicted in Figure 4 was used as the computational framework for polymorphic pattern detection experiments. The encoder consists of two Conv1D layers, with 9 and 6 filters, respectively, followed by ReLU activation. The latent representation is obtained using a 15-neuron dense layer.
The decoder mirrors the encoder's architecture; however, it has a different number of neurons in its dense and transposed convolution layers. The activation function for the dense layers is ReLU, with Softmax at the network's output. The size of the output sequence is the same as that of the input.
Every network was trained for 500 epochs using the Mean Squared Error loss function and the RMSProp optimizer with a learning rate of 0.001.
The developed model was implemented using the PyTorch library [24] and Python 3.8. Experiments were run on a machine with an AMD Ryzen 9 3950X processor, 64 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti graphics card.
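A hedged PyTorch sketch of an autoencoder matching the description above is given below; the kernel sizes, the flattened feature length, and the decoder widths are not reported in the paper and are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

SEQ_LEN, N_SYMBOLS = 40, 4                  # 40-symbol one-hot encoded inputs
K1, K2 = 3, 4                               # assumed kernel sizes (not given in the paper)
FEAT_LEN = SEQ_LEN - (K1 - 1) - (K2 - 1)    # length after two valid convolutions

class CAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(N_SYMBOLS, 9, K1), nn.ReLU(),   # layer hosting FCM + LCM filters
            nn.Conv1d(9, 6, K2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(6 * FEAT_LEN, 15), nn.ReLU(),   # 15-neuron latent representation
        )
        self.decoder = nn.Sequential(
            nn.Linear(15, 6 * FEAT_LEN), nn.ReLU(),
            nn.Unflatten(1, (6, FEAT_LEN)),
            nn.ConvTranspose1d(6, 9, K2), nn.ReLU(),
            nn.ConvTranspose1d(9, N_SYMBOLS, K1),
            nn.Softmax(dim=1),                        # per-position symbol distribution
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CAE()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(100, N_SYMBOLS, SEQ_LEN)     # stand-in for the one-hot encoded dataset
for epoch in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), x)
    loss.backward()
    optimizer.step()
```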

3. Results

The target characteristic to be quantified at the completion of training was a mean distance between the learned filter cascades and the fixed, knowledge-representing cascades.
We were interested in learning whether statistically significant differences exist among results produced for four different experimental scenarios:
  • An unknown pattern present and filter similarity discouragement turned on;
  • An unknown pattern not present and filter similarity discouragement turned on;
  • An unknown pattern present and no filter similarity discouragement;
  • An unknown pattern not present and no filter similarity discouragement.
For each scenario (repeated 100 times), the resulting mean distances between pairs of regular expressions represented by LCM filters and FCM filters were evaluated. The Levene [25] and Fligner [26] tests (widely accepted for testing equality of variances [27]), for which the null hypothesis is that the samples are drawn from the same distribution, yield significant results when group 2 is tested against group 4. For groups 1 and 3, the test results give no basis for rejecting the null hypothesis; in both training regimes, the learned rules are similarly distant from the preset ones (Table 2). In other words, the statistical tests are significant when the dataset does not contain unknown patterns; if the test result is insignificant, we can conclude that a new, unknown pattern is present in the data. The flowchart of this method is presented in Figure 5. It can also be observed that the patterns learned by the trainable filter cascade are most similar to those actually present in the data (Table 3).
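For reference, both variance-equality tests are available in scipy.stats. The sketch below runs them on hypothetical arrays of per-run mean FCM-LCM distances obtained with the similarity discouragement turned on and off; the synthetic numbers are placeholders, not the study's results.

```python
import numpy as np
from scipy import stats

# Hypothetical per-run mean distances between LCM and FCM filter cascades
# (100 repetitions with the backprop modification on / off).
rng = np.random.default_rng(0)
distances_on = rng.normal(loc=3.4, scale=0.20, size=100)
distances_off = rng.normal(loc=3.4, scale=0.35, size=100)

# Equal-variance tests; a significant result (small p-value) indicates that
# no unknown pattern is present, per the decision rule described above.
lev_stat, lev_p = stats.levene(distances_on, distances_off)
fli_stat, fli_p = stats.fligner(distances_on, distances_off)
print(f"Levene:  stat={lev_stat:.2f}, p={lev_p:.3f}")
print(f"Fligner: stat={fli_stat:.2f}, p={fli_p:.3f}")
```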
The datasets used in our study were also analyzed using the two previously indicated baseline methods, DREME and MEME. The analysis was performed in two scenarios, with either two or three patterns present in the data. Both methods were set to search for six-element-long motifs, the same length as used in our method. Table 4 and Table 5 show the results of the experiments for DREME and MEME, respectively. As can be observed, both methods perform very well in discovering significant motifs; this conclusion can be drawn from the E-value [14], which quantifies motif significance. On the other hand, the distances between the patterns discovered by these methods and the patterns considered in our experiments clearly show that neither of the two baselines can exploit prior knowledge, as the patterns they discover are unrelated to the target motifs. Moreover, both MEME and DREME tolerate only very low pattern mutability, reporting multiple separate motifs that can be described with a single regular expression using our methodology.

4. Discussion

The proposed unknown polymorphic pattern detection method introduces several novel elements. Firstly, we show how prior knowledge of rules, which generate polymorphic patterns, can be incorporated into a network and preserved during training. Another contribution concerns the proposal of measuring a distance between pattern-describing rules by evaluating Levenshtein distances between sequences generated by these rules.
The proposed network architecture is designed to enable the injection of complex knowledge. Furthermore, several constraints introduced to facilitate measurements (e.g., keeping unused parts of filters) can be removed for more general utilization of prior knowledge. It is also worth noting that the presented problem is complex and difficult to solve with traditional approaches.
As can be seen from the results, the proposed data processing pipeline built upon the introduced methodology can answer whether new, unknown patterns are present in the data. The proposed model is not yet capable of reconstructing the exact pattern. However, it learns a pattern that is more similar to the original unknown one than to the fixed cascades present in the network.
The introduced method’s main contribution and advantage over existing solutions is the ability to utilize prior knowledge about complex polymorphic patterns.
As can be seen from the results, the classical methods, although capable of motif finding, have two severe limitations. They cannot benefit from prior knowledge of regular expressions 1 and 2, so they waste resources learning them on the same basis as the unknown pattern 3. Detecting an additional pattern can only be achieved by further analysis of the motifs extracted from the data, which is further hindered by their inability to handle pattern mutability; again, as the comparison shows, the classical methods find multiple motifs, whereas the proposed methodology correctly discovers a single mutable motif. In addition, the baseline methods cannot indicate that a discovered pattern is unknown.
The combination of prior knowledge and gradient-based learning is, at its current stage, capable of unknown pattern detection. An obvious direction for future development is modifying the method so that it is also capable of reconstructing the unknown pattern.
The proposed approach still has several limitations that need to be addressed. A matter for future work is generalizing the results beyond regular expressions of the structure considered in this study; new structures in the convolutional modules would have to be introduced to support other regex operators. Also, patterns are currently not allowed to share common knowledge, a limitation introduced for experimental purposes; the behavior of the network with several of these constraints relaxed remains unknown.
Despite several limitations of the method, it must be emphasized that it achieves the contribution declared at the beginning: it allows for both injection and extraction of knowledge on polymorphic patterns, which is difficult or impossible by means of other methods.

Author Contributions

Conceptualization, P.K. and K.Ś.; methodology, P.K. and K.Ś.; software, P.K.; validation, P.K. and K.Ś.; formal analysis, K.Ś.; investigation, P.K. and K.Ś.; resources, P.K. and K.Ś.; data curation, P.K.; writing—original draft preparation, P.K.; writing—review and editing, K.Ś.; visualization, P.K.; supervision, K.Ś.; project administration, K.Ś.; funding acquisition, K.Ś. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CAE    Convolutional Autoencoder
FCM    Fixed Convolutional Module
LCM    Learnable Convolutional Module

References

  1. He, Y.; Shen, Z.; Zhang, Q.; Wang, S.; Huang, D.S. A Survey on Deep Learning in DNA/RNA Motif Mining. Brief. Bioinform. 2020, 22, bbaa229. [Google Scholar] [CrossRef] [PubMed]
  2. Liang, H.; Ouyang, Z.; Zeng, Y.; Su, H.; He, Z.; Xia, S.T.; Zhu, J.; Zhang, B. Training Interpretable Convolutional Neural Networks by Differentiating Class-Specific Filters. Lect. Notes Comput. Sci. (Incl. Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinform.) 2020, 12347 LNCS, 622–638. [Google Scholar] [CrossRef]
  3. Koo, P.K.; Eddy, S.R. Representation learning of genomic sequence motifs with convolutional neural networks. PLoS Comput. Biol. 2019, 15, 1–17. [Google Scholar] [CrossRef] [PubMed]
  4. Zhang, D.; He, L.; Luo, M.; Xu, Z.; He, F. Weight asynchronous update: Improving the diversity of filters in a deep convolutional network. Comput. Vis. Media 2020, 6, 455–466. [Google Scholar] [CrossRef]
  5. Towell, G.G.; Shavlik, J.W. Knowledge-Based Artificial Neural Networks. Artif. Intell. 1994, 70, 119–165. [Google Scholar] [CrossRef]
  6. Gaier, A.; Ha, D. Weight Agnostic Neural Networks. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; MIT Press: Cambridge, MA, USA, 2019; pp. 5364–5378. [Google Scholar]
  7. Marra, G.; Diligenti, M.; Giannini, F.; Gori, M.; Maggini, M. Relational Neural Machines. arXiv 2020, arXiv:cs/2002.02193. [Google Scholar]
  8. Wu, Z.; Johnston, K.E.; Arnold, F.H.; Yang, K.K. Protein sequence design with deep generative models. Curr. Opin. Chem. Biol. 2021, 65, 18–27. [Google Scholar] [CrossRef] [PubMed]
  9. Ingraham, J.; Garg, V.K.; Barzilay, R.; Jaakkola, T. Generative models for graph-based protein design. Adv. Neural Inf. Process. Syst. 2019, 32, 15820–15831. [Google Scholar]
  10. Sabban, S.; Markovsky, M. RamaNet: Computational de novo helical protein backbone design using a long short-term memory generative neural network. F1000Research 2020, 9, 298. [Google Scholar] [CrossRef]
  11. Bepler, T.; Berger, B. Learning protein sequence embeddings using information from structure. arXiv 2019, arXiv:1902.08661. [Google Scholar]
  12. Hie, B.; Bryson, B.D.; Berger, B. Leveraging Uncertainty in Machine Learning Accelerates Biological Discovery and Design. Cell Syst. 2020, 11, 461–477.e9. [Google Scholar] [CrossRef] [PubMed]
  13. Ding, Y.; Li, J.Y.; Wang, M.; Tu, X.; Gao, G. An exact transformation for CNN kernel enables accurate sequence motif identification and leads to a potentially full probabilistic interpretation of CNN. bioRxiv 2017, 163220. [Google Scholar] [CrossRef]
  14. Bailey, T.L. DREME: Motif discovery in transcription factor ChIP-seq data. Bioinformatics 2011, 27, 1653–1659. [Google Scholar] [CrossRef] [PubMed]
  15. Bailey, T.L.; Elkan, C. Fitting a Mixture Model by Expectation Maximization to Discover Motifs in Bipolymers; University of California: La Jolla, CA, USA, 1994. [Google Scholar]
  16. Hashim, F.A.; Mabrouk, M.S.; Al-Atabany, W. Review of different sequence motif finding algorithms. Avicenna J. Med. Biotechnol. 2019, 11, 130. [Google Scholar]
  17. Sun, H.Q.; Low, M.Y.H.; Hsu, W.J.; Rajapakse, J.C. ListMotif: A time and memory efficient algorithm for weak motif discovery. In Proceedings of the 2010 IEEE International Conference On Intelligent Systems and Knowledge Engineering, Hangzhou, China, 15–16 November 2010; pp. 254–260. [Google Scholar]
  18. Sinha, S.; Tompa, M. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresentation. Nucleic Acids Res. 2003, 31, 3586–3588. [Google Scholar] [CrossRef] [PubMed]
  19. Quang, D.; Xie, X. EXTREME: An online EM algorithm for motif discovery. Bioinformatics 2014, 30, 1667–1673. [Google Scholar] [CrossRef] [PubMed]
  20. Reid, J.E.; Wernisch, L. STEME: Efficient EM to find motifs in large data sets. Nucleic Acids Res. 2011, 39, e126. [Google Scholar] [CrossRef] [PubMed]
  21. Bouamama, S.; Boukerram, A.; Al-Badarneh, A.F. Motif finding using ant colony optimization. In Proceedings of the Swarm Intelligence: 7th International Conference, ANTS 2010, Brussels, Belgium, 8–10 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 464–471. [Google Scholar]
  22. Staden, R. Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 1984, 12, 505–519. [Google Scholar] [CrossRef] [PubMed]
  23. Levenshtein, V.I. Binary codes capable of correcting deletions, insertions, and reversals. Sov. Phys. Dokl. 1966, 10, 707–710. [Google Scholar]
  24. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
  25. Levene, H. Contributions to probability and statistics. In Essays in Honor of Harold Hotelling; Springer: Berlin/Heidelberg, Germany, 1960; pp. 278–292. [Google Scholar]
  26. Fligner, M.A.; Killeen, T.J. Distribution-Free Two-Sample Tests for Scale. J. Am. Stat. Assoc. 1976, 71, 210–213. [Google Scholar] [CrossRef]
  27. Lim, T.S.; Loh, W.Y. A comparison of tests of equality of variances. Comput. Stat. Data Anal. 1996, 22, 287–301. [Google Scholar] [CrossRef]
Figure 1. The adopted sequence-processing pipeline. Input sequences (bottom) are first converted to one-hot encoded vectors and then passed to convolutional filters of the encoder's first layer (entries of the two preset filters $F_1^1$ and $F_2^1$ that represent preset knowledge are highlighted in blue and green, whereas entries of the remaining filters, denoted with 'x' symbols, are to be learned). Filtering results are next combined in the second-layer filters (again, two are preset and the remaining ones are to be learned) and passed to subsequent layers of the CAE.
Figure 2. Sample regular expressions (top) and a string of symbols comprising the produced polymorphic patterns (bottom) with matches highlighted in the corresponding colors. The symbols ‘ ( ) ’, ‘ [ ] ’, ‘.’, and ‘|’ denote subsequence repetitions, a set of characters, any character, and an alternative, respectively. The only pattern overlap is highlighted in brown.
Figure 3. Representation of knowledge on the polymorphic sequence through filters of two convolutional layers. The filters are preset to maximize response to the four presented patterns (the processing outcome at the filters' outputs, y(…), is shown for one of these patterns, AxGTxG).
Figure 4. The network architecture.
Figure 5. The flowchart of the proposed testing method. Datasets with samples containing two known regexes are fed into two sets of networks of identical architecture; however, one has backprop modification turned on and the other has this mechanism turned off. After training, both scenarios’ distances between FCM and LCM are analyzed for variance equality. A lack of statistically significant differences indicates the presence of an unknown pattern C.
Table 1. Examples of six-element patterns generated from a four-element alphabet and their corresponding m measures: '.' stands for any alphabet element, and [xy] denotes any element of the subset {x, y}.

Pattern          m
AAAAAA           0.0002
AA..AA           0.004
CC.[ACT].C       0.012
[ACT]....T       0.19
Table 2. Results of statistical tests for GRCh38 data. Statistically significant results support the conclusion that samples are drawn from different distributions.

♢         Levene Test                Fligner Test
          Statistic    p-Value       Statistic    p-Value
True      1.89         0.09          3.74         0.24
False     3.56         0.02          9.57         0.01

♢ Unknown regex present in data.
Table 3. Averaged distances between regular expressions retrieved from filters and the unknown pattern. Columns d1 and d2 represent distances from the FCMs with injected knowledge, and d3 represents the LCM, which is supposed to reconstruct the pattern.

♢         Backprop Modification    d1      d2      d3
False     True                     3.43    3.46    2.91
False     False                    3.40    3.43    2.87
True      True                     3.52    3.49    2.91
True      False                    3.44    3.45    2.94

♢ Unknown regex present in data.
Table 4. DREME results. Three motifs with the lowest E-value are presented for each dataset. Columns d1, d2, and d3 represent distances from the patterns considered in the study, with d3 being the unknown pattern.

♢         Motif No.    E-Value          d1      d2      d3
False     1.           1.4 × 10⁻⁵⁶      3.65    2.43    2.54
False     2.           3.8 × 10⁻⁴⁴      4.76    2.54    4.85
False     3.           8.4 × 10⁻³⁸      2.87    5.43    4.03
True      1.           7.6 × 10⁻⁴⁵      3.63    2.52    2.91
True      2.           4.7 × 10⁻³⁴      3.45    1.93    4.72
True      3.           2.7 × 10⁻³²      2.97    2.16    2.59

♢ Unknown regex present in data.
Table 5. MEME results. Three motifs with the lowest E-value are presented for each dataset. Columns d1, d2, and d3 represent distances from the patterns considered in the study, with d3 being the unknown pattern.

♢         Motif No.    E-Value          d1      d2      d3
False     1.           3.2 × 10⁻³⁰      1.53    4.50    3.91
False     2.           3.5 × 10⁻²⁷      1.76    2.93    3.34
False     3.           4.7 × 10⁻²⁰      2.40    4.43    4.87
True      1.           3.2 × 10⁻³⁰      3.43    3.75    3.91
True      2.           6.0 × 10⁻²⁰      3.33    2.40    3.45
True      3.           5.1 × 10⁻¹⁸      4.40    2.80    2.92

♢ Unknown regex present in data.
