
Hidden Abstract Stack Markov Models with Learning Process

Department of Computer Engineering, Faculty of Engineering and Architecture, Erzurum Technical University, 25050 Erzurum, Türkiye
Mathematics 2024, 12(13), 2144; https://doi.org/10.3390/math12132144
Submission received: 10 May 2024 / Revised: 4 July 2024 / Accepted: 5 July 2024 / Published: 8 July 2024
(This article belongs to the Section Mathematics and Computer Science)

Abstract
We present hidden abstract stack Markov models (HASMMs) with their learning process. The HASMMs we offer carry the more expressive nature of probabilistic context-free grammars (PCFGs) while allowing the faster parameter fitting of hidden Markov models (HMMs). Both HMMs and PCFGs are widely utilized structured models, offering an effective formalism capable of describing diverse phenomena. PCFGs are better suited than HMMs to applications such as natural language processing; however, HMMs outperform PCFGs in parameter fitting. We extend HMMs towards PCFGs for such applications by associating each state of an HMM with an abstract stack, which can be thought of as the single-symbol stack alphabet of a pushdown automaton (PDA). As a result, we leverage the expressive capabilities of PCFGs for such applications while mitigating the parameter-learning complexity of PCFGs, which is cubic in the observation sequence length, by adopting the bilinear complexity of HMMs.

1. Introduction

Long-term dependencies with a nested hierarchical structure, as observed in various domains such as natural language and queuing systems, pose a significant challenge in the modeling of sequential data. HMMs and PCFGs are two widely utilized structured models for sequential data. HMMs are known for their simplicity, fast learning, and efficient parameter fitting. PCFGs, on the other hand, are well suited to expressing complex structures and offer a more effective formalism than HMMs; however, they learn more slowly. In this study, we present our model, the HASMM, which incorporates the advantageous features of both HMMs and PCFGs, aiming to balance expressiveness and computational complexity.
Probabilistic models incorporating a counter are natural models for systems such as queuing systems, where the counter tracks the number of clients waiting [1,2]. Counter systems are frequently employed as abstractions, with the counter value typically bounded by constraints.
One-counter automata, utilized for system verification, recognize one-counter languages, which encompass regular languages. They are a subset of context-free languages which are identified from PDAs [3,4]. By choosing one-counter languages, i.e., assuming a PDA-associated singleton stack, significant speedup can be achieved compared to the cubic time complexity of algorithms like CYK for determining the membership of context-free languages. Probabilistic one-counter automata (P1CAs) [5,6] can be considered as a special case of probabilistic pushdown automata (PPDAs) [7,8] with just a single stack symbol, where the Markov chains (MCs) can be perceived as maintaining constant counter values. The Inside-Outside algorithm enables parameter re-estimation and probability computation for a PCFG from observations, with both tasks having a computational complexity of cubic time [9,10]. MCs offer the advantage of faster linear time parsing and analysis; however, they exhibit limited expressivity when compared to one-counter automata [11].
The HMM was introduced by [12] for various applications; later, PCFGs were proposed [13] in place of HMMs, which have difficulty expressing phenomena such as natural language. Since the complexity of the HMM learning process is relatively low, attempts have been made to improve HMMs' expressive capabilities. Some HMM extensions that have emerged are lexicon tree-HMMs [14], factorial HMMs [15], hierarchical HMMs [16], and hidden semi-Markov models [17]. In recent studies, the hidden one-counter Markov model (H1MM) has established an effective position between PCFGs and HMMs [18]. Here, we present an effective new counter machine that combines performance and accuracy in finite cases by abstracting the counter used in the H1MM.
Based on our comprehensive literature review, we have observed models that struggle with the learning step due to learning algorithms with cubic complexity, as well as models that, despite faster learning algorithms, have limited capacity and therefore fall short in more complex tasks. To address this gap in the literature, we propose a new modeling approach that incorporates the strengths of both types of models. Our motivation is to propose a model that strikes a balance: one that is deemed suitable in terms of learning complexity and capable enough to handle complex tasks. By doing so, we aim to harness the expressive power of PCFGs while preserving and even enhancing the parameter fitting advantages of HMMs. This enhancement can be achieved by augmenting the HMM structure with an abstract stack, making it more suitable for handling complex tasks.
Our main contribution is to introduce a novel effective model in the literature that is more complex than HMMs (applicable in domains such as natural language processing, speech recognition, and bioinformatics) and faster in terms of training than PCFGs (using algorithms like CYK learning). Therefore, our contribution aims to advance the field by introducing a model that integrates the rapid learning capabilities of HMMs with an extension via an abstract stack. This extension makes HMMs more complex, leading to broader applicability across various domains.
Markov models can be designed to be abstracted, can be learned from observations, and can then be used for analysis. We present the learning capabilities of Markov models in this work by transferring them to our own model. During the learning process, there is no direct access to the probabilities of hidden state transitions in the model itself. However, based on observations over a finite alphabet, we can make state predictions. Just as an MC learned from observations is called an HMM, we refer to the model learned from observations as the HASMM, a customized P1CA with an abstracted counter. In the case of HASMMs, we demonstrate that, despite their being specialized pushdown systems capable of generating stochastic languages that may not be regular, the learning process can be adapted from the algorithms employed for HMMs. As a result, the learning complexity for the HASMM is comparable to that of the HMM, with a bilinear dependency on the length of the input.
We adapt commonly used expectation-maximization algorithms, including the forward, backward, and Baum–Welch algorithms, to HASMM analysis as in HMMs. This adaptation considerably reduces the complexity of parameter estimation compared to models such as PPDAs and PCFGs. The modified algorithms for the HASMM achieve a bilinear time complexity, offering a significant improvement over the cubic complexity of full pushdown models.
The layout of the remaining sections is as follows. The next subsection provides a comprehensive literature review. In Section 2, an outline of the HMM is provided. In Section 3, we introduce our HASMM with a formal definition. Section 4 explains the learning processes of HASMMs by adapting the forward, backward, and Baum–Welch algorithms. Section 5 details our experimental evaluation, showing the effectiveness of our proposed HASMM compared to the H1MM, HMM, and PCFG. Finally, Section 6 concludes this paper.

Related Work

Mor et al. [19] examined several papers on HMMs from various fields, particularly in speech recognition and bioinformatics. The survey paper categorized HMM variants into several types, including First-Order HMM (FO-HMM), Higher-Order HMM (HO-HMM), Hidden Semi-Markov Model (HSMM), Factorial HMM (FHMM), Second-Order HMM (SO-HMM), Layered HMM (LHMM), Autoregressive HMM (AR-HMM), Non-Stationary HMM (NS-HMM), and Hierarchical HMM (HHMM). It highlighted the need to consolidate research efforts on HMMs across different domains. The paper provided an extensive discussion of different types of HMMs and their implementation areas. Despite significant advances, it emphasized that there were still undiscovered models for HMMs in emerging domains.
Kumar and Kumar [20] examined hardware and software failures in automatic ticket vending machines (ATVMs) using a Markov process-based model to evaluate reliability, mean time to failure, and component sensitivity. Kumar and Ram [21] applied Markov’s birth–death process to study failure factors in the decomposition unit of a urea fertilizer plant (UFP), analyzing metrics like availability, reliability, and mean time to failure, and identifying critical components for enhanced maintenance strategies in industrial contexts. Kumar and Kumar [22] assessed the reliability of components in wireless communication systems through Markov processes, investigating various failure scenarios, repair strategies, and mean time to failure to enhance system performance and reliability.
Yao et al. [23] extended HMMs as Nonhomogeneous Hidden Markov Models (NHMMs) by varying their transition matrices over time, to enhance the understanding and practical application of entropy rates. Wei et al. [24] applied a Semi-Markov Decision Process (SMDP) in order to develop a comprehensive group maintenance strategy for optimizing operational availability and reducing downtime in complex industrial systems. Wang et al. [25] applied the Markov process imbedding method (MPIM) for developing optimized maintenance and spare-part strategies that prioritized varying degrees of urgency in service demands. Lee et al. [26] extended HMMs with covariate-dependent transitions to uncover factors influencing phenomena that were challenging to observe directly, such as Matsutake mushroom growth.
Alvaro et al. [27] introduced a formal model that uses 2D-PCFGs and HMMs to recognize handwritten mathematical expressions online. HMMs detected mathematical symbols, while the PCFG showed how these symbols related, making it easier to apply traditional parsing and estimation techniques. In a mathematical expression recognition contest, the model performed exceptionally well at different levels by accurately capturing variations and making parsing decisions based solely on statistical information, without relying on heuristic judgments. Dyrka et al. [28] introduced a machine learning model for detecting amyloid signaling motifs, overcoming challenges in their detection due to diversity and short length. Results showed that PCFGs outperformed traditional profile HMMs in sensitivity and false positive rates. PCFGs demonstrated the ability to generalize across diverse motif collections, including identifying related motifs in fungi. Experimental validation confirmed the structural and functional relationship of identified peptides. Overall, PCFGs offered the potential for improving specificity in motif detection beyond traditional methods like profile HMMs. Oseki and Marantz [29] addressed the debate surrounding human linguistic competence, particularly focusing on morphological competence, often overshadowed in cognitive science discussions. Through a crowdsourced experiment and computational modeling, five morphological competency frameworks underwent evaluation in comparison to assessments made by humans. The results indicated that computational models incorporating abstract hierarchical structures, especially PCFGs, more accurately explained human judgments, suggesting that human morphological competence involved internally generated hierarchical structures rather than being solely based on surface linear strings. 
Carravetta and White [30] addressed the challenge of constructing statistical frameworks for languages using PCFGs, while taking into consideration potential noise in observed strings. They introduced the stochastic syntactic process (SSP) for addressing inference problems like estimating parse trees for noisy processes, relevant in scenarios such as metalevel target tracking. By utilizing probability theory, the article illustrated the integration of an SSP into a Markov random field (MRF), facilitating sophisticated machine learning techniques for addressing such complex estimation. Moreover, it provided an elementary illustration demonstrating the integration of a basic CFG into an MRF, with potential expansions explored for context-sensitive grammars. Lopes and Freitas [31] offered valuable models for the creation of algorithms, style modeling, and analysis of musical theories. The paper stated that HMMs and PCFGs exhibited similar characteristics and employed a Gibbs Sampling algorithm to infer a PCFG model and applied them to model chord sequence generation. Findings demonstrated that the model could enhance PCFGs and the existing approaches, achieving improved results.
Bourlard and Bengio [32] presented and compared variants of Stochastic finite state automata (SFSA) and HMM models utilized in sequence processing. Finite state automata (FSA), including SFSA and HMMs, excelled in addressing sequential pattern recognition tasks by decomposing patterns into stationary segments. Dupont et al. [33] elucidated the relationships between probabilistic automata (PA) and discrete HMMs. The paper outlined the requirements to describe a probabilistic language and confirmed that probabilistic deterministic automata (PDFA) was a subset of probabilistic nondeterministic automata (PNFA). Furthermore, it delineated a pair of comparable frameworks: HMMs and PNFA without final probabilities, and HMMs with final probabilities and probabilistic automata. The paper additionally presented diverse learning methods for PA and HMMs, formalizing challenges such as parameter prediction, encompassing frameworks like PAC learning, and including discussions on Bayesian learning. Adhikary et al. [34] demonstrated how common quantum tensor network models, in their stationary or uniform forms, related to ideas in stochastic processes and in the weighted automata area, especially for indefinite-length sequences. Various equivalences were identified among frameworks employed in these fields, encompassing uniform versions of matrix product states and Born machines, predictive state representations, and stochastic weighted automata. Bhattacharya and Ray [35] addressed pattern classification and anomaly detection in dynamical systems, primarily through time series analysis. While conventional methods like HMMs and neural networks (NNs) often demand significant resources, an alternative, probabilistic finite state automata (PFSA), offers quicker learning and evaluation. However, the standard PFSA have limitations the study aimed to overcome. 
The proposed p-PFSA method’s efficacy was demonstrated across four chaotic dynamical system models, compared with s-PFSA, HMM, and various NN setups, and its performance was further evaluated in a real-life scenario, assessing accuracy, robustness, and computational complexity in simulating industrial combustion systems.
Almutiri and Nadeem [36] explored the application of Markov models in three core aspects of NLP: natural language generation, named-entity recognition, and parts-of-speech tagging. The study reviewed recent research on methods, advantages, and limitations of employing Markov models in NLP, aiming to reduce reliance on lexicons and manual annotation tasks. Through an extensive literature review, it examined the utilization of Markov models in NLP, discussing both benefits and drawbacks. The survey paper emphasized the prevalence of supervised models in NLP research and how Markov models were integrated to reduce dependency on annotation tasks. Pande et al. [37] combined NLP and machine learning models to assess text data quality, using probability states. They also integrated HMM with the named-entity model for classifying network variables. Their proposed HMMNE model achieved high precision rates, notably with strong accuracy in categorizing and identifying variables in text documents. Zhang et al. [38] addressed the challenges in integrating computer technology with language education methods, focusing on improving student engagement and language skills. They introduced a novel method, the Heuristic Hidden Markov Model and Statistical Language model (HHMM-SLM), to recognize large unconstrained handwritten text vocabulary, leveraging NLP and HMM. The approach aimed to enhance system performance through statistical language models and trials with single and multiple writer data. Li et al. [39] proposed a Conditional Hidden Markov Model (CHMM) for tackling the challenge of learning a named-entity recognition (NER) tagger with noisy labels from multiple weak supervision sources in NLP. The CHMM leveraged contextual representation power from pretrained language models to effectively infer true labels from noisy observations, enhancing the classic HMM. 
Chiu and Rush [40] tackled the task of adapting HMMs to accommodate large language modeling datasets in NLP, drawing inspiration from recent developments in neural modeling approaches. The proposed techniques aimed to address the scalability of HMMs to extensive state spaces while preserving efficient exact inference, concise parameterization, and robust regularization.

2. Background of Learning Algorithms in Hidden Systems

The HMM is an effective instrument used to map the probability distribution of discrete time series data. In reality, this tool is an enhanced Markov chain. In brief, a chain makes future predictions for a discrete time series sequence. To make these predictions more accurate for solving complex problems, the HMM hides the state transition process. This tool is used for NLP [41], stock market forecasting [42], and bioinformatics [43].
HMM involves two events, which are called observed and hidden. In the observed event, an observation symbol $o_t$ at time $t$ emerges from a latent variable $S_t$ in the hidden event. This statistical model is used to explain the process of an observable event by an internal process which is hidden. Parameters in an HMM come in pairs. The state transition matrix, denoted as $A(S_t, S_{t+1})$, is the first parameter. It specifies the likelihood of moving from one state to the next. The Markov property, which states that the next latent state relies solely on the current latent state, is satisfied by this procedure. $B(S_t, o_t)$ holds the current hidden state's probability of emitting output symbols. Additionally, this procedure meets the output-independence condition, whereby the symbol emitted by each latent state depends only on that state and is independent of the other states and symbols.
The simple structure of an HMM is shown in Figure 1. The circles represent the latent variables of the model. At each time step, a hidden state emits an observation symbol at a particular time $t$, shown as a square above the circles. In this study, the initial distribution of the HMM (represented as $\pi$) is not used as a parameter of the model. Instead, an initial state is used in place of the initial distribution.
The representation of the model is defined with $O$ and $\theta = (S, A, B)$:
  • $O$ represents the set of observable symbols. At each time interval $t$, hidden variables generate an observation symbol, represented as a square in Figure 1. For instance, $o_1$ at $t = 1$ retains an observation symbol emitted by the hidden variable $S_1$.
  • $S$ denotes the latent variables. As depicted in Figure 1, the state value can vary over time intervals. For instance, at $t = 0$, it reflects the initial state $S_0$, while for $t > 0$, it holds the state value corresponding to $t$ as $S_t$.
  • $A$ represents the probability values for the state transitions, and $A(q, q')$ is the specific probability value:
    $$A(q, q') = \Pr(S_{t+1} = q' \mid S_t = q), \quad \forall q, q' \in S; \qquad \sum_{q' \in S} A(q, q') = 1, \quad \forall q \in S$$
  • $B$ represents the distributions of the observations, and $B(q, o)$ is the specific probability value:
    $$B(q, o) = \Pr(o_t = o \mid S_t = q), \quad \forall o \in O,\ q \in S; \qquad \sum_{o \in O} B(q, o) = 1, \quad \forall q \in S$$
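As a concrete illustration of these two parameter matrices, the sketch below builds a small random HMM with NumPy (the state/symbol counts and random initialization are illustrative, not from the paper) and checks that each row of A and B is a proper probability distribution:

```python
import numpy as np

rng = np.random.default_rng(0)

N, M = 3, 4  # illustrative counts of hidden states and observation symbols

# Transition matrix: A[q, q2] = Pr(S_{t+1} = q2 | S_t = q)
A = rng.random((N, N))
A /= A.sum(axis=1, keepdims=True)   # normalize so each row sums to 1

# Emission matrix: B[q, o] = Pr(o_t = o | S_t = q)
B = rng.random((N, M))
B /= B.sum(axis=1, keepdims=True)

# Each row of A and B is a probability distribution.
assert np.allclose(A.sum(axis=1), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Row-stochasticity is exactly the pair of normalization constraints stated above for A and B.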
The configuration of the elements is approached by following the methodology outlined in [44], which proposed the iterative Baum–Welch algorithm. The Baum–Welch algorithm is a special case of the expectation-maximization algorithm for parameter estimation in HMMs. It consists of two steps: expectation and maximization. The state transition and emission matrices are computed in the expectation phase. Initially, $\xi_t(q, q')$ is calculated. For a given output sequence $O$ and model $\theta$, it is defined as follows:
$$\xi_t(q, q') = \Pr(S_t = q, S_{t+1} = q' \mid O, \theta)$$
In line with Bayes’ theorem,
$$\xi_t(q, q') = \frac{\alpha_t(q)\, A(q, q')\, B(q', o_{t+1})\, \beta_{t+1}(q')}{\Pr(O \mid \theta)}$$
where $1 \le t \le T$ ($T$ is the length of the output sequence), the forward computation is denoted by $\alpha_t(q)$, and $\beta_t(q)$ denotes the backward computation. Then, the probability of being in a state at time $t$ given the observations can be calculated with $\gamma_t(q)$ as follows:
$$\gamma_t(q) = \Pr(S_t = q \mid O, \theta)$$
Upon applying Bayes’ theorem, Equation (5) can be reformulated as follows:
$$\gamma_t(q) = \frac{\alpha_t(q)\, \beta_t(q)}{\Pr(O \mid \theta)}$$
Then, from the above equations, $\alpha_t(q)$ is used for the forward calculation, and $\beta_t(q)$ is used for the backward calculation.
Forward calculation: We show the forward calculation in  Figure 2. The computation that yields the total probability of an observed sequence in an HMM involves summing up the products of previous values depending on the matrices A and B at each instant.
$\alpha_t(q)$ can be calculated as follows:
$$\alpha_t(q) = \Pr(o_1 o_2 \cdots o_t, S_t = q)$$
The iterative steps for $\alpha_t(q)$ can be expressed as follows:
$$\alpha_t(q) = \left[ \sum_{i=1}^{N} \alpha_{t-1}(S_i)\, A(S_i, q) \right] B(q, o_t),$$
where $\alpha_{t-1}$ represents the previous forward calculation. At each time step, it equals the sum of the products of the previous forward value, the state transition probability ($A$), and the probability of generating an observation symbol by the hidden state ($B$).
There is only one starting state at the beginning; that is, the model does not have an initial vector like $\pi$, and the starting state is denoted by $q_0$. The value of the forward algorithm for each hidden state at the initial point is calculated as follows:
$$\alpha_1(q) = A(q_0, q)\, B(q, o_1), \quad \forall q \in S,\ o_1 \in O$$
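The initialization from the single start state q0 and the recursion above can be sketched as a minimal NumPy routine (the array layout and function name are our own, not the paper's):

```python
import numpy as np

def forward(A, B, obs, q0=0):
    """Forward pass for an HMM with a fixed start state q0 (no initial vector pi).

    A[q, q2] = Pr(S_{t+1} = q2 | S_t = q); B[q, o] = Pr(o_t = o | S_t = q).
    Returns alpha where alpha[t - 1, q] = Pr(o_1..o_t, S_t = q).
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = A[q0] * B[:, obs[0]]            # alpha_1(q) = A(q0, q) B(q, o1)
    for t in range(1, T):
        # alpha_t(q) = [sum_i alpha_{t-1}(S_i) A(S_i, q)] B(q, o_t)
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha
```

Summing the last row of `alpha` over all states yields Pr(O | θ), the total likelihood of the observation sequence.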
Backward calculation: The computation that yields the probability of observing a sequence in an HMM is performed by calculating the probabilities of observing the remaining sequence from a particular state at a specific time, as shown in  Figure 3.
The state estimation is calculated as follows:
$$\beta_t(q) = \Pr(o_{t+1} o_{t+2} \cdots o_T \mid S_t = q)$$
The prediction involves two phases as follows:
  • Initialization:
    $$\beta_T(q) = 1, \quad \forall q \in S$$
  • Update:
    $$\beta_t(q) = \sum_{i=1}^{N} \beta_{t+1}(S_i)\, A(q, S_i)\, B(S_i, o_{t+1})$$
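The backward pass mirrors the forward one, running from the end of the sequence towards its start. A minimal sketch (our own layout, initialized with ones at the final step as in the base case):

```python
import numpy as np

def backward(A, B, obs):
    """Backward pass: beta[t - 1, q] = Pr(o_{t+1}..o_T | S_t = q).

    Initialized with beta_T(q) = 1 for all q, then updated from T-1 down to 1:
    beta_t(q) = sum_i beta_{t+1}(S_i) A(q, S_i) B(S_i, o_{t+1}).
    """
    T, N = len(obs), A.shape[0]
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta
```

Together with the forward values, these quantities yield ξ and γ in the E-step of Baum–Welch.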
In the M-step, estimating the state transition matrix requires considering two factors: the sum of all state transitions and all transitions from the current state. The resulting formulation of the transition matrix A ^ ( q , q ) is as follows:
$$\hat{A}(q, q') = \frac{\sum_{t=1}^{T-1} \xi_t(q, q')}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \xi_t(q, S_k)}, \quad S_k \in S$$
The updated formulation of the emission matrix comprises two components: the accumulation of $\gamma_t(q)$ for all instants where the output variable $o_t = o$, and the summation of $\gamma_t(q)$ across all instants. The resulting $\hat{B}(q, o)$ is as follows:
$$\hat{B}(q, o) = \frac{\sum_{t=1}^{T} I(o_t)\, \gamma_t(q)}{\sum_{t=1}^{T} \gamma_t(q)}$$
$I(o_t)$ is the indicator of the output variable at time $t$ and is given as follows:
$$I(o_t) = \begin{cases} 1 & \text{if } o_t = o \\ 0 & \text{otherwise} \end{cases}$$
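Putting the E-step (ξ and γ) and the M-step re-estimates together, one iteration of Baum–Welch for this fixed-start-state HMM might look like the sketch below (illustrative only; a full implementation would iterate to convergence and guard against zero denominators):

```python
import numpy as np

def baum_welch_step(A, B, obs, q0=0):
    """One EM iteration for an HMM with a fixed start state q0 (sketch)."""
    obs = np.asarray(obs)
    T, N = len(obs), A.shape[0]

    # Forward: alpha[t, q] = Pr(o_1..o_{t+1}, S_{t+1} = q)
    alpha = np.zeros((T, N))
    alpha[0] = A[q0] * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward: beta[t, q] = Pr(o_{t+2}..o_T | S_{t+1} = q)
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    p_obs = alpha[-1].sum()                        # Pr(O | theta)

    # E-step: gamma_t(q) and xi_t(q, q')
    gamma = alpha * beta / p_obs
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / p_obs

    # M-step: A_hat and B_hat exactly as in the update formulas above
    A_new = xi.sum(axis=0) / xi.sum(axis=(0, 2))[:, None]
    B_new = np.stack([gamma[obs == o].sum(axis=0) for o in range(B.shape[1])],
                     axis=1)
    B_new /= gamma.sum(axis=0)[:, None]
    return A_new, B_new
```

By construction, every row of the re-estimated matrices remains a probability distribution, so the updated model is again a valid HMM.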

3. Hidden Abstract Stack Markov Models

A hidden abstract stack Markov model is a tuple $\Upsilon = (Q, \Sigma, A, B, q_0, q_F)$ where
  • $Q$ is the set of hidden control states;
  • $\Sigma$ is the space of observation symbols;
  • $A$ is a probability state transition matrix, where $A : Q \times \Psi \times Q \to [0, 1]$. $\Psi = \{E, L, M, H, F\}$ represents the content of the stack (empty, low, middle, high, and full, respectively, i.e., the abstract stack) with possible changes;
  • $B$ is a probability emission matrix, where $B : Q \times \Sigma \to [0, 1]$;
  • $q_0 \in Q$ is an initial state;
  • $q_F \in Q$ is a final state.
The probability distributions for the transition matrix A and the emission matrix B are as follows for each state:
$$\sum_{q' \in Q} \sum_{\psi \in \Psi} A(q, \psi, q') = 1, \quad \forall q \in Q$$
$$\sum_{\sigma \in \Sigma} B(q, \sigma) = 1, \quad \forall q \in Q$$
We associate each state $q \in Q$ with an abstract stack value (i.e., the content of the stack) $\psi \in \Psi$ in our configuration for the HASMM, which is represented as a pair $(q, \psi)$. Our initial configuration is denoted as $(q_0, \psi_0)$, and the final configuration as $(q_F, \psi_F)$.
The matrix $A$ specifies the transition probabilities between states by modifying the stack value from one state to another. The matrix $B$, on the other hand, represents the observation probabilities associated with each state. The expression $A(q, L, q') = 0.4$ signifies that the probability of transitioning from state $q$ to $q'$ when the stack level is low is 40%. Likewise, when the matrix $B$ is defined such that $B(q, \sigma) = 0.2$, it indicates that, irrespective of the stack state, there is a 20% chance of generating symbol $\sigma$ in state $q$.
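The two HASMM parameter matrices can be stored as a rank-3 and a rank-2 array. The sketch below (sizes and random values are illustrative) encodes one axis of A per stack level in Ψ and checks the normalization constraints stated above:

```python
import numpy as np

PSI = ["E", "L", "M", "H", "F"]        # abstract stack levels: empty .. full
N_Q, N_PSI, N_SIGMA = 3, len(PSI), 4   # illustrative sizes

rng = np.random.default_rng(1)

# A[q, psi, q2] = probability of moving q -> q2 when the stack level is psi;
# for each state q, the probabilities over all (psi, q2) pairs sum to 1.
A = rng.random((N_Q, N_PSI, N_Q))
A /= A.sum(axis=(1, 2), keepdims=True)

# B[q, sigma] = probability of emitting symbol sigma in state q,
# irrespective of the stack level.
B = rng.random((N_Q, N_SIGMA))
B /= B.sum(axis=1, keepdims=True)

assert np.allclose(A.sum(axis=(1, 2)), 1.0)
assert np.allclose(B.sum(axis=1), 1.0)
```

Note that A gains one extra axis relative to a plain HMM, while B is unchanged; this is the structural cost of the abstract stack.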
The model operates as a stochastic process, where latent variables, including stack changes for each transition, are governed by a conditional probability distribution. At each time step $t$, all states emit an output letter $\sigma \in \Sigma$. The prediction of a state transition follows a Markovian property, indicating that the current hidden control state $S_t = q$ depends solely on the previous hidden control state $S_{t-1} = q'$. If the stack value is $\psi'$ at time $t-1$ for state $q'$, then $A$ calculates the probabilistic variables as follows:
$$\Pr(S_t = (q, \psi) \mid S_{t-1} = (q', \psi')) = A(q', \Delta, q)$$
where $\Delta : \psi' \to \psi$.
For a $T$-time HASMM, $M$ is a $(T+1)$-length sequence $(q_0, \psi_0)(q_1, \psi_1)(q_2, \psi_2) \cdots (q_T, \psi_T)$, where the last state accepted by $A$ is our final state $(q_T, \psi_T) = (q_F, \psi_F)$.
In this case, it is important to emphasize that the finite state trace highlights the non-observable nature of the states and stack values. However, a hidden state $q$ generates an observable symbol $\sigma$ through the emission matrix $B$. If we denote the observation at step $t$ by $O_t$, the probability values generated by the emission matrix $B$ can be represented as follows:
$$\Pr(O_t = \sigma \mid S_t = q) = B(q, \sigma).$$
Thus, the observation sequence is $\sigma_0 \sigma_1 \sigma_2 \cdots \sigma_T$ with probability $\prod_{t=0}^{T} \Pr(O_t = \sigma_t \mid S_t = q_t)$.
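Assuming the transition matrix is indexed by the stack change taken at each step (an indexing convention we adopt purely for illustration), the joint probability of one concrete hidden trajectory and its emissions is just the running product of A and B entries:

```python
import numpy as np

def trajectory_probability(A, B, states, deltas, obs):
    """Joint probability of one hidden trajectory and its emissions (sketch).

    states[t] is the hidden state at step t (states[0] = q_0); deltas[t]
    indexes the stack change taken on the transition into states[t+1];
    obs[t] is the symbol emitted at step t+1. These alignment conventions
    are assumptions made for this illustration.
    """
    p = 1.0
    for t in range(len(states) - 1):
        p *= A[states[t], deltas[t], states[t + 1]]   # transition with stack change
        p *= B[states[t + 1], obs[t]]                 # emission in the new state
    return p
```

Summing this quantity over all hidden trajectories would give Pr(O), which motivates the dynamic programming algorithms that follow.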
While the transition matrix $A$ calculates probabilities by considering the stack value, the probability of a given observation sequence is obtained by marginalizing over both the state and stack sequences:
$$\Pr(O) = \sum_{Q} \sum_{\Psi} \Pr(O, Q, \Psi) = \sum_{Q} \sum_{\Psi} \Pr(O \mid Q)\, \Pr(Q, \Psi)$$
Due to the exponential complexity, direct computation of the probability for the given observation sequence is infeasible, necessitating more efficient calculation methods. Dynamic programming algorithms, as described in the literature [45], offer a solution to reduce the computational requirements for such problems. Adapted versions of likelihood and learning algorithms for HMMs provide an efficient approach to calculating the probability of an observation sequence and estimating the parameters in an HASMM. In the following section, we present a dynamic programming solution that addresses this issue. Notation details of HASMM are presented in Table 1.

4. Learning Process

We adapted the expectation-maximization algorithms from HMMs to our HASMM model by incorporating the stack value into the configuration. These algorithms were the forward, backward, and Baum–Welch algorithms. Calculations were made according to dynamic programming paradigms. As a result, they updated the values of the elements in the model to take into account the occupancy of the stacks. These three adapted algorithms were calculated as in HMMs, except that the additional stack values were taken into account during the calculation. The Baum–Welch algorithm has two computational functions: $\xi$ and $\gamma$. The $\xi$ function calculates the probabilities of specific consecutive configuration pairs based on the observation sequence. The $\gamma$ function calculates the probability of each configuration in the HASMM according to the observation sequence.

4.1. Adapted Forward Calculation

The forward algorithm finds the emission probability by adding up all the values in the trajectory for the observed outputs. The trellis structure depicted in Figure 4 illustrates how the forward algorithm computes the partial probabilities $\alpha_t(q, \psi)$ for each state. These partial probabilities indicate the likelihood of being in a hidden state after observing the sequence of observations up to time $t$. In our adapted version of the forward algorithm, we considered the stack value and incorporated it into the probability calculations. Thus, we considered the sum of all possibilities in order to calculate $\Pr(\sigma_{1:T})$.
The resulting $\alpha_t(q, \psi)$ represents the probability of $\sigma_{1:t}$ and $S_t = (q, \psi)$ at step $t$, where the initial configuration is $S_0 = (q_0, \psi_0 = E)$ and the final configuration is $S_T = (q_F, \psi_F = E)$. In other words, $\alpha_t(q, \psi)$ quantifies the probability of observing the sequence of observations up to time $t$ while being in the specific hidden state $q$ and having the stack value $\psi$. At the base step (i.e., $t = 0$) of this calculation, we know that we are at the initial state. The base step was defined as follows:
  • Base step:
    $$\alpha_0(q, \psi) = \begin{cases} 1 & \text{if } q = q_0 \text{ and } \psi = E \\ 0 & \text{otherwise} \end{cases}$$
For any other step, we calculated the recursion step as follows:
  • Recursion step:
    $$\alpha_t(q, \psi) = \sum_{q' \in Q} \sum_{\Delta} \alpha_{t-1}(q', \psi - \Delta)\, A(q', \Delta, q)\, B(q, \sigma_t)$$
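A direct translation of the base and recursion steps might look like the sketch below. We assume that the abstract stack change Δ ranges over {−1, 0, +1} (one level down, unchanged, one level up) and that A is indexed by that change, as in A(q′, Δ, q); both are modeling assumptions made for illustration, not details fixed by the text:

```python
import numpy as np

DELTAS = (-1, 0, +1)   # assumed one-step stack changes: down, stay, up
N_PSI = 5              # abstract stack levels E, L, M, H, F -> indices 0..4

def hasmm_forward(A, B, obs, q0=0):
    """Adapted forward pass (sketch).

    alpha[t, q, psi] = Pr(sigma_1..sigma_t, S_t = (q, psi)).
    A[q, d, q2] is the probability of moving q -> q2 with stack change
    DELTAS[d]; B[q, s] is the emission probability.
    The initial configuration is (q0, psi = E).
    """
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T + 1, N, N_PSI))
    alpha[0, q0, 0] = 1.0                      # base step: q = q0 and psi = E
    for t in range(1, T + 1):
        for q in range(N):
            for psi in range(N_PSI):
                total = 0.0
                for d, delta in enumerate(DELTAS):
                    prev = psi - delta         # alpha_{t-1}(q', psi - Delta)
                    if 0 <= prev < N_PSI:
                        total += alpha[t - 1, :, prev] @ A[:, d, q]
                alpha[t, q, psi] = total * B[q, obs[t - 1]]
    return alpha
```

Each time step touches every (state, stack level, change) combination once, which is what keeps the overall cost bilinear in the sequence length rather than cubic.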
Theorem 1.
$\alpha_t(q, \psi)$ computes $\Pr(\sigma_{1:t}, S_t \mid S_0)$.
Proof. 
At the base step of the induction proof, at $t = 0$, $\alpha_0(q, \psi) = \Pr(S_0)$ is true if $q = q_0$ and $\psi = E$. In the induction step, let us assume the calculation holds true at time $t = k$, where $k \ge 1$. We need to prove the calculation $\alpha_{k+1}(q, \psi) = \Pr(\sigma_{1:k+1}, S_{k+1} \mid S_0)$. Here, it is $\alpha_{k+1}(q, \psi) = \alpha_k(q', \psi') + \Pr(\sigma_{k+1}, S_{k+1} = q \mid S_k = q')$. We already know that $\alpha_k(q', \psi') = \Pr(\sigma_{1:k}, S_k \mid S_0)$ is correct. Additionally, we can apply the sum rule of probability; thus, we obtain the following: $\alpha_{k+1}(q, \psi) = \Pr(\sigma_{1:k+1}, S_k, S_{k+1} = q \mid S_0)$. According to the head-to-tail rule of d-separation, $S_k$ is conditionally independent of $S_0$. Thus, the calculation of the adapted forward algorithm at $t = k + 1$ is $\alpha_{k+1}(q, \psi) = \Pr(\sigma_{1:k+1}, S_{k+1} = q \mid S_0)$.    □

4.2. Adapted Backward Calculation

The backward algorithm is a dynamic programming approach commonly used in HMMs. We updated the backward algorithm of the HMM in order to calculate the conditional probability of the future observation sequence $\sigma_{t+1:T}$ for the given current hidden state $S_t$. In the adapted version of the algorithm for the HASMM, it was crucial to consider the stack values and specific terminal configurations. By incorporating these factors, the adapted backward algorithm provided an efficient and accurate estimation of the conditional probability for future observations.
In our calculation, we determined $\beta_t(q, \psi)$, which indicates the likelihood of reaching the end state at the last stage $T$ starting from a specific configuration $(q, \psi)$ at step $t$, as in Figure 5.
  • Base step:
    $$\beta_T(q, \psi) = \begin{cases} 1 & \text{if } q = q_F \text{ and } \psi = E \\ 0 & \text{otherwise} \end{cases}$$
For any other step, we calculated the recursion step as follows:
  • Recursion step:
    $$\beta_t(q, \psi) = \sum_{q' \in Q} \sum_{\Delta} \beta_{t+1}(q', \psi + \Delta)\, A(q, \Delta, q')\, B(q', \sigma_{t+1})$$
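Under the same illustrative assumptions as before (Δ ∈ {−1, 0, +1} and A indexed by the stack change), the adapted backward pass can be sketched as follows; only the accepting final configuration (q_F, ψ = E) is given probability 1 at the base step:

```python
import numpy as np

DELTAS = (-1, 0, +1)   # assumed one-step stack changes: down, stay, up
N_PSI = 5              # abstract stack levels E, L, M, H, F -> indices 0..4

def hasmm_backward(A, B, obs, qF):
    """Adapted backward pass (sketch).

    beta[t, q, psi] = Pr(sigma_{t+1}..sigma_T | S_t = (q, psi)), with the
    accepting final configuration (qF, psi = E).
    """
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T + 1, N, N_PSI))
    beta[T, qF, 0] = 1.0                       # base step: q = qF and psi = E
    for t in range(T - 1, -1, -1):
        for q in range(N):
            for psi in range(N_PSI):
                total = 0.0
                for d, delta in enumerate(DELTAS):
                    nxt = psi + delta          # beta_{t+1}(q', psi + Delta)
                    if 0 <= nxt < N_PSI:
                        total += (A[q, d, :] * B[:, obs[t]]
                                  * beta[t + 1, :, nxt]).sum()
                beta[t, q, psi] = total
    return beta
```

Note the sign symmetry with the forward pass: the forward recursion looks back to ψ − Δ, while the backward recursion looks ahead to ψ + Δ.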
Theorem 2.
β t ( q , ψ ) computes P r ( σ t + 1 : T S t = ( q , ψ ) , S T ) .
Proof. 
At the base step, t = T, we know that an optimal path starts from the initial terminal (i.e., S_0) and ends at the final terminal (i.e., S_T). If the final terminal is the accepting state q_F and the abstract stack is E, the probability equals 1. At the induction step, we assume the calculation of this algorithm is true at time step t = k. We must show that β_{k−1}(q, ψ) = Pr(σ_{k:T} | S_{k−1} = (q, ψ), S_T) at t = k − 1. By the recursion, β_{k−1}(q, ψ) combines β_k(q′, ψ′) with Pr(σ_k | S_{k−1} = (q, ψ), S_T). Since the calculation is assumed true at t = k, substituting gives β_{k−1}(q, ψ) in terms of Pr(σ_{k+1:T} | S_k = (q′, ψ′), S_T) and Pr(σ_k | S_{k−1} = (q, ψ), S_T). After applying the sum rule, we obtain β_{k−1}(q, ψ) = Pr(σ_{k:T} | S_{k−1} = (q, ψ), S_k = (q′, ψ′), S_T). According to d-separation, σ_{k:T} is conditionally independent of the latent state S_k. Thus, we obtain β_{k−1}(q, ψ) = Pr(σ_{k:T} | S_{k−1} = (q, ψ), S_T).    □
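The adapted backward recursion admits a symmetric sketch, under the same illustrative encoding as the forward sketch (integer abstract-stack symbols, stack change d in {−1, 0, +1}, dict-encoded A and B; these are assumptions, not the paper's implementation):

```python
from collections import defaultdict

def backward(obs, Q, Psi, A, B, qF, psiF):
    """Adapted backward pass: beta[t][(q, psi)] = Pr(sigma_{t+1:T} | S_t).

    Base step: probability 1 only at the accepting pair (qF, psiF),
    matching the terminal configuration (q_F, E) of the HASMM.
    """
    T = len(obs)
    beta = [defaultdict(float) for _ in range(T + 1)]
    beta[T][(qF, psiF)] = 1.0                       # base step
    for t in range(T - 1, -1, -1):
        sym = obs[t]                                # sigma_{t+1} in 1-based terms
        for (qn, psin), b in beta[t + 1].items():
            if b == 0.0:
                continue
            for d in (-1, 0, +1):
                psi = psin - d                      # predecessor stack value
                if psi not in Psi:
                    continue
                for q in Q:
                    p = A.get((q, d, qn), 0.0) * B.get((qn, sym), 0.0)
                    if p:
                        beta[t][(q, psi)] += p * b
    return beta
```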

4.3. Adapted Baum–Welch Calculation

In the Baum–Welch algorithm [46], a specific instance of maximum-likelihood optimization, we proceeded as when adapting our forward and backward algorithms: we estimated the probability matrices, moving closer to the original model by taking the stack into account. The iterative steps were as follows:
  • E-Step
This algorithm comprises two computational components, the functions γ and ξ :
Theorem 3.
γ t ( q , ψ ) computes P r ( S t O , S 0 , S T ) .
    γ_t(q, ψ) = α_t(q, ψ) β_t(q, ψ) / ∑_{q′ ∈ Q} ∑_{ψ′ ∈ Ψ} α_t(q′, ψ′) β_t(q′, ψ′)
Proof. 
We can start this proof by applying the product rule to Pr(S_t | O, S_0, S_T), which gives Pr(O, S_t, S_T | S_0) / Pr(O, S_T | S_0). We can separate the whole observation sequence into two parts, σ_{0:t} and σ_{t+1:T}, and apply the product rule to obtain Pr(σ_{0:t}, σ_{t+1:T}, S_T | S_t, S_0) Pr(S_t | S_0) / Pr(O, S_T | S_0). According to d-separation, σ_{0:t} is independent of σ_{t+1:T} and of the final state S_T given S_t. We now have Pr(σ_{0:t} | S_t, S_0) Pr(S_t | S_0) Pr(σ_{t+1:T}, S_T | S_t, S_0) / Pr(O, S_T | S_0). Applying the product rule again yields Pr(σ_{0:t}, S_t | S_0) Pr(σ_{t+1:T} | S_t, S_T, S_0) Pr(S_T | S_t, S_0) / Pr(O, S_T | S_0). Here, σ_{t+1:T} is conditionally independent of S_0 according to the d-separation rule. Thus, Pr(σ_{0:t}, S_t | S_0) represents the forward calculation, Pr(σ_{t+1:T} | S_t, S_T) is the backward calculation, and Pr(S_T | S_t, S_0) equals 1, since a path from S_0 must end at S_T with certainty. The denominator is the sum of the probability of the whole observation sequence from the S_0 to S_T terminals.    □
Theorem 4.
ξ t ( ( q , ψ ) , ( q , ψ ) ) computes P r ( S t 1 , S t O , S 0 , S T ) .
    ξ_t((q, ψ), (q′, ψ′)) = α_t(q, ψ) · A(q, ψ, q′) · β_{t+1}(q′, ψ′) · B(q′, σ_{t+1}) / ∑_{q′ ∈ Q} ∑_{ψ′ ∈ Ψ} α_t(q′, ψ′) β_t(q′, ψ′)
Proof. 
We can start the proof by applying the product rule to Pr(S_{t−1}, S_t | O, S_0, S_T), which gives Pr(S_{t−1}, S_t, S_T, O | S_0) / Pr(O, S_T | S_0). We can divide the given observation sequence into two parts, σ_{0:t} and σ_{t+1:T}, obtaining Pr(σ_{0:t}, σ_{t+1:T}, S_{t−1}, S_t, S_T | S_0) / Pr(O, S_T | S_0). Applying the product rule again gives Pr(σ_{0:t}, σ_{t+1:T}, S_T | S_{t−1}, S_t, S_0) Pr(S_{t−1}, S_t | S_0) / Pr(O, S_T | S_0). By the d-separation rule, the single observation symbol σ_t is independent of all other observation symbols and states conditioned on S_t; likewise, the prior observation part σ_{0:t−1} is independent of the other observation parts and states conditioned on S_{t−1}. We thus obtain Pr(σ_{0:t−1} | S_{t−1}, S_0) Pr(σ_{t+1:T}, S_T | S_{t−1}, S_t, S_0) Pr(S_{t−1}, S_t | S_0) Pr(σ_t | S_t) / Pr(O, S_T | S_0). In Pr(σ_{t+1:T}, S_T | S_{t−1}, S_t, S_0), the term S_{t−1} is irrelevant in the condition, since the future observation sequence and the final terminal are conditionally independent of this hidden variable given S_t. Applying the product rule once more gives Pr(σ_{0:t−1} | S_{t−1}, S_0) Pr(σ_{t+1:T}, S_T | S_t, S_0) Pr(S_t | S_{t−1}, S_0) Pr(S_{t−1} | S_0) Pr(σ_t | S_t) / Pr(O, S_T | S_0). In Pr(S_t | S_{t−1}, S_0), the condition S_0 is redundant by the Markov property. Also, Pr(σ_{0:t−1} | S_{t−1}, S_0) and Pr(S_{t−1} | S_0) can be merged using the product rule. Applying the product rule to Pr(σ_{t+1:T}, S_T | S_t, S_0) gives Pr(σ_{t+1:T} | S_t, S_T, S_0) Pr(S_T | S_t, S_0). By the d-separation rule, σ_{t+1:T} is independent of S_0, and Pr(S_T | S_t, S_0) equals 1, as mentioned in Theorem 3.
Finally, we have Pr(σ_{0:t−1}, S_{t−1} | S_0) Pr(S_t | S_{t−1}) Pr(σ_{t+1:T} | S_t, S_T) Pr(σ_t | S_t) / Pr(O, S_T | S_0). Here, Pr(σ_{0:t−1}, S_{t−1} | S_0) is the forward calculation, Pr(S_t | S_{t−1}) is the state transition from S_{t−1} to S_t, Pr(σ_{t+1:T} | S_t, S_T) is the backward calculation, and Pr(σ_t | S_t) is the emission probability of observation symbol σ_t.    □
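Given forward and backward tables, the two E-step quantities can be sketched as small normalizations. This is an illustrative sketch, not the paper's code: the tables are dicts keyed by (q, ψ) pairs, and A is keyed by (q, d, q′) with d the stack change, as in the earlier sketches.

```python
def gamma_step(alpha, beta, t):
    """Posterior gamma_t(q, psi): normalized product of forward and
    backward values at time t (the normalization of Theorem 3)."""
    num = {k: a * beta[t].get(k, 0.0) for k, a in alpha[t].items()}
    z = sum(num.values())
    return {k: v / z for k, v in num.items()} if z else {}

def xi_step(alpha, beta, t, A, B, sym):
    """Posterior xi_t((q, psi), (q', psi')): forward value, transition,
    emission of the next symbol, and backward value, then normalized."""
    num = {}
    for (q, psi), a in alpha[t].items():
        for (qn, psin), b in beta[t + 1].items():
            # psin - psi is the stack change d implied by the pair
            p = a * A.get((q, psin - psi, qn), 0.0) * B.get((qn, sym), 0.0) * b
            if p:
                num[((q, psi), (qn, psin))] = p
    z = sum(num.values())
    return {k: v / z for k, v in num.items()} if z else {}
```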
  • M-Step
During the M-step, the objective is to determine new model parameters which maximize the expected likelihood, as predicted in functions γ and ξ . This involves expressing the probabilities of the matrices A and B over hidden control variables, stack variables, and output variables in the following manner:
The updated emission transition matrix B ^ is calculated as:
    B̂(q, σ) = ∑_{t=1}^{T} Π(σ_t) γ_t(q, ψ) / ∑_{t=1}^{T} γ_t(q, ψ),
where Π(σ_t) is an indicator selecting the time steps at which the observation symbol at time t equals σ.
The updated state transition matrix A ^ is calculated as:
    Â(q, ψ, q′) = ∑_{t=1}^{T−1} ξ_t((q, ψ), (q′, ψ′)) / ∑_{t=1}^{T−1} ∑_{q′ ∈ Q} ξ_t((q, ψ), (q′, ψ′))
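The emission re-estimation can be sketched as expected counts over the γ posteriors; the transition update follows the same pattern from the ξ posteriors. The sketch below is illustrative (hypothetical names; `gammas[t]` is the dict returned for time t, with index 0 unused):

```python
from collections import defaultdict

def m_step_emission(gammas, obs):
    """Re-estimate B(q, sigma): expected count of emitting sigma in state q,
    divided by the expected occupancy of q (gamma summed over psi)."""
    occ, emit = defaultdict(float), defaultdict(float)
    for t, sym in enumerate(obs, start=1):
        for (q, _psi), g in gammas[t].items():
            occ[q] += g                 # time spent in q (any stack value)
            emit[(q, sym)] += g         # time spent in q while emitting sym
    return {(q, s): c / occ[q] for (q, s), c in emit.items() if occ[q] > 0}
```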

5. Experimental Evaluations

The pseudocode of our algorithm is given in Algorithm 1. Our algorithm consists of three phases. In the first phase, the length of the observation sequence, denoted by |O| or the value T, and the probability matrix Pr(T, Q, Ψ) are specified. The probability matrix is a three-dimensional matrix that holds a probability value for each discrete time step (t ∈ T), state (q ∈ Q), and abstract stack (ψ ∈ Ψ); its initial value is given as Π. In the second phase, to find the most probable path, the Viterbi algorithm is used with nested iterations (t ∈ T; q, q′ ∈ |Q|; ψ, ψ′ ∈ |Ψ|, respectively, where |L| denotes the number of elements in the list L) to compute the probability value using the function Max, which holds the most probable path to a state and abstract-stack pair at a given time step. In the final phase, the probability matrix holding the most probable path is traced back to provide the most probable state and abstract-stack pair ((q, ψ)).
We implemented the forward, backward, and Baum–Welch algorithms for the HASMM, H1MM, and HMM for their evaluations. The results were obtained on a machine with an Intel i7 3.3 GHz CPU and 16 GB RAM using the Python (version 3.8.5) programming language.
The H1MM [18] is an adapted version of HMMs, structured and algorithmically tailored from probabilistic one-counter automata, a subclass of PCFGs. The purpose of creating this model was to reduce the cubic complexity associated with the learning algorithms used in PCFGs. Successful results in this regard involve adapting the algorithms while preserving the counter value, thereby achieving a quadratic complexity for the model. The model is H = (Q, Σ, A_0, A_+, B, q_0, q_F), where Q, Σ, B, q_0, and q_F denote the set of hidden variables, the set of observation symbols, the emission matrix, the initial state, and the final state, respectively, following notation similar to that of our proposed model. The main distinction lies in having two different state transition matrices due to the counter value. For instance, if the counter value is zero, in the next transition the counter value may either remain zero or be incremented; thus, only the A_0 matrix is active. If the counter value is greater than zero, the counter value may decrease, remain constant, or increase in the next transition, activating the A_+ matrix. In our model, we employed a single state transition matrix and additionally incorporated a stack. Unlike the H1MM, our stack did not contain a single symbol but multiple symbols. This allowed a faster learning process on a given task while exhibiting performance similar to the counter model. Consequently, we propose a model closer to the H1MM in complexity and closer to the HMM in learning time.
Algorithm 1: The most probable derivation
1:  Starting Step:
2:  T = |O|
3:  Pr(T, Q, Ψ)
4:  Pr(0, Q_Π, Ψ_Π) = Pr_Π
5:
6:  The most probable path:
7:  for each step t = 1, …, T do
8:      for each q, q′ = 1, 2, 3, …, |Q| do
9:          for each ψ, ψ′ = 1, …, |Ψ| do
10:             Pr(t, q, ψ) = Max(A(q′, ψ′, q) · B(q, o) · Pr(t − 1, q′, ψ′))
11:         end for
12:     end for
13: end for
14:
15: Sequence traceback:
16: for each step t = 1, …, T do
17:     (q, ψ) ← Pr(t, Q, Ψ)
18: end for
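The three phases of Algorithm 1 can be sketched in Python under the same illustrative encoding as the forward-pass sketch (integer abstract-stack symbols, stack change d in {−1, 0, +1}, dict-encoded A and B; hypothetical names, not the paper's implementation):

```python
def viterbi(obs, Q, Psi, A, B, q0, psi0):
    """Most probable derivation over (state, abstract-stack) pairs.

    Phase 1: initialize the probability table at (q0, psi0).
    Phase 2: nested maximization over states and stack values.
    Phase 3: trace back the stored backpointers.
    """
    V = [{(q0, psi0): (1.0, None)}]                 # (prob, backpointer)
    for sym in obs:
        cur = {}
        for (qp, psip), (p, _) in V[-1].items():
            for d in (-1, 0, +1):                   # stack change
                psi = psip + d
                if psi not in Psi:
                    continue
                for q in Q:
                    np_ = p * A.get((qp, d, q), 0.0) * B.get((q, sym), 0.0)
                    if np_ > cur.get((q, psi), (0.0,))[0]:
                        cur[(q, psi)] = (np_, (qp, psip))
        V.append(cur)
    # traceback from the best final configuration
    best = max(V[-1], key=lambda k: V[-1][k][0])
    path = [best]
    for t in range(len(obs), 0, -1):
        best = V[t][best][1]
        path.append(best)
    return list(reversed(path))
```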
Firstly, we randomly generated six different M i models (where each M i had i + 1 states for the HASMM) as inputs for the HASMM, H1MM, and HMM, where each M i consisted of sets of states, initial and final state, observations, transitions, and emissions. Subsequently, for the HASMM, H1MM, and HMM, we generated datasets using each M i model. Our datasets consisted of 20,000 observation sequences randomly generated using the corresponding M i models, with each sequence having a fixed length of 16.
To ensure a fair comparison, we used 1/√5 times the number of states in the H1MM and HASMM, respectively, for each state count in the HMM. The reason for this choice was that while the HMM required n² values for n states in its transition matrix, the H1MM required 5n² values, and the HASMM required 5n² values, as can be observed in the transition matrices. In a sense, we made a fair comparison by using the same number of parameters for each model, so no model gained an unfair advantage from having more parameters, in the spirit of the Akaike and Bayesian information criteria.
During the training process, we observed that all the HMM, H1MM, and HASMM algorithms often encountered local minima, which hindered their convergence. To address this issue, we generated 100 random initial models M i for all HMMs, H1MMs, and HASMMs. In each learning iteration, we removed the models with the lowest 25% likelihood values. Eventually, only one model remained, which we continued training until convergence was achieved.
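The population-style training scheme described above (start from many random models, prune the weakest quarter each iteration) can be sketched generically. The interface below is illustrative: `likelihood` and `step` are caller-supplied callables standing in for the model's likelihood evaluation and one EM iteration, neither of which is specified here.

```python
def prune_train(models, likelihood, step, keep_frac=0.75):
    """Run one training step on every candidate model, then drop the
    lowest 25% by likelihood, until a single model remains."""
    while len(models) > 1:
        models = [step(m) for m in models]              # one EM iteration each
        models.sort(key=likelihood, reverse=True)       # best first
        models = models[:max(1, int(len(models) * keep_frac))]
    return models[0]                                    # survivor to train to convergence
```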
After training the HMM, H1MM, and HASMM, we used new observation sequences in the testing process in order to find the maximum likelihood estimate of the parameters. This estimation was performed by calculating the Kullback–Leibler (KL) divergence as follows:
    KL(P ∥ Q) = ∑_x P(x) log( P(x) / Q(x) )
P ( x ) denotes the initial probability matrices chosen, while Q ( x ) signifies the matrices acquired after training for each variable x. The KL score measures the separation between these matrices. A smaller value indicates a more similar match for the matrices, with a score of zero indicating a perfect match. Therefore, the nearer the value gets to zero, the better the accuracy of the model.
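The KL score above is straightforward to compute for discrete distributions. A minimal sketch (the `eps` guard against zero-probability entries in Q is an illustrative smoothing choice, not from the paper):

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """KL(P || Q) for discrete distributions given as dicts; 0 means
    a perfect match, and larger values mean the matrices diverge more."""
    return sum(p * math.log(p / max(Q.get(x, 0.0), eps))
               for x, p in P.items() if p > 0)
```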
As shown in Table 2, the H1MM and HASMM achieved almost perfect estimation of the original model, while the HMM lagged significantly behind in prediction. The parameters of the original model were randomly generated, and the original observation sequences were then created based on that model. These sequences, totaling 20,000, comprised the dataset for this study. The dataset was used to evaluate the performance of six different randomly generated models on the same task, and the data inputs were then used to compute the values in the table via the KL divergence. Furthermore, both the HASMM and HMM exhibited superior training performance in terms of computational expense compared to the H1MM, and a comparison with PCFGs was unnecessary due to their cubic complexity. We leveraged the expressive capabilities of PCFGs while mitigating the cubic complexity of parameter learning by adopting the bilinear complexity of HMMs.
To achieve a more effective comparison between models, observation sequences of varying lengths were generated. New datasets of different lengths for the six different models were produced by the original model. These datasets were used to retrain each model individually, and their KL divergence scores were evaluated as shown in Figure 6. From the observed results, we inferred that, across observation-sequence lengths, the HASMM and H1MM most closely resembled the original model, whereas the HMM exhibited a KL score further from the original model. This confirmed our expectation that the proposed model aligns closely with the original for tasks involving more complex systems with stacks. Additionally, when comparing training times, the HASMM and HMM completed model training in a shorter duration for longer observation sequences compared to the H1MM. Although the HASMM, like the H1MM, is a system that includes a stack, its computational cost was closer to that of the HMM due to a more favorable cost function, as observed in the learning time.
For the validity of the results, various conditions were applied: the number of hidden states varied between models from M_1 to M_6, and the emission matrices' output symbols and probability values were randomized. Comparative results using the KL divergence are presented in Table 2. To create models closer to the true original model, each model initially started from 100 different fully random models, discarding the 25% with the lowest likelihood values at each training step. This process was repeated until one model remained and converged. The aim was to approximate the original model as closely as possible with these six distinct models.

6. Conclusions

We introduced a novel counter machine, a subclass of the PPDAs with a single-stack alphabet where the control states are hidden, i.e., a generalization of P1CAs using latent variables. The proposed HASMMs are an effective midpoint between HMMs and PCFGs for applications such as natural language processing. We also adapted the expectation-maximization algorithms employed for training HMMs in order to find the maximum likelihood estimate of the parameters of our HASMMs, and we evaluated them experimentally on the datasets we produced. HASMMs, which are bilinear in the observation sequence length, demonstrated superior performance compared to PCFGs, which are cubic for the same process. Additionally, our results showed that HASMMs outperformed HMMs in terms of accuracy while maintaining the same complexity. As a result, HASMMs come close to the expressivity of PCFGs while retaining the fast parameter fitting of HMMs, making them a strong candidate in terms of accuracy and performance for such applications.
As is known, in fields such as natural language processing, speech recognition, and bioinformatics, solutions to domain problems have been sought using systems incorporating counters or stacks. While these models have individually shown success, there have been instances where combining two models has been advantageous in addressing such problems. Our proposed model aims to reduce the computational complexity of solving complex problems in these domains, particularly by leveraging the advantages of two different models. The proposed computations, including the forward, backward, and Baum–Welch algorithms, enhance the learning capabilities of the model. The theoretical validation of these computations enables the generation of high-accuracy results in probabilistic and PCFG-related domains.
Since the model presented in this study exhibits bilinear complexity, it may encounter challenges in modeling non-linear relationships; a careful evaluation based on dataset characteristics and problem requirements is therefore necessary. This study could be extended by integrating different stack models to address more complex tasks effectively. Furthermore, in this study, the initial state was set to zero and the abstract stack was initialized as empty, with runs ending in the same pair. In future work, the starting and ending values could be modeled to accommodate any value. Additionally, the abstract stack was conceptualized as a finite set of five symbols. The categorization of the abstract stack could be adjusted to allow higher accuracy in different specific applications, and investigating the optimal number of units for the abstract stack in future studies will be highly beneficial for applications in specific fields. Lastly, studies addressing the presence of multiple stacks in the system, or involving two-headed probabilistic automata and the mutual exclusions between these automata, will be pivotal.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Kwiatkowska, M.; Norman, G.; Parker, D. PRISM: Probabilistic model checking for performance and reliability analysis. ACM SIGMETRICS Perform. Eval. Rev. 2009, 36, 40–45. [Google Scholar] [CrossRef]
  2. Carpenter, B.; Gelman, A.; Hoffman, M.D.; Lee, D.; Goodrich, B.; Betancourt, M.; Brubaker, M.; Guo, J.; Li, P.; Riddell, A. Stan: A probabilistic programming language. J. Stat. Softw. 2017, 76, 1. [Google Scholar] [CrossRef]
  3. Fitch, W.T.; Friederici, A.D. Artificial grammar learning meets formal language theory: An overview. Philos. Trans. R. Soc. B Biol. Sci. 2012, 367, 1933–1955. [Google Scholar] [CrossRef]
  4. Berwick, R.C.; Okanoya, K.; Beckers, G.J.; Bolhuis, J.J. Songs to syntax: The linguistics of birdsong. Trends Cogn. Sci. 2011, 15, 113–121. [Google Scholar] [CrossRef]
  5. Nakanishi, M.; Yakaryılmaz, A. Classical and quantum counter automata on promise problems. In Proceedings of the Implementation and Application of Automata: 20th International Conference, CIAA 2015, Umeå, Sweden, 18–21 August 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 224–237. [Google Scholar]
  6. Stewart, A.; Etessami, K.; Yannakakis, M. Upper bounds for Newton’s method on monotone polynomial systems, and P-time model checking of probabilistic one-counter automata. J. ACM 2015, 62, 1–33. [Google Scholar] [CrossRef]
  7. Brázdil, T.; Esparza, J.; Kiefer, S.; Kučera, A. Analyzing probabilistic pushdown automata. Form. Methods Syst. Des. 2013, 43, 124–163. [Google Scholar] [CrossRef]
  8. Forejt, V.; Jancar, P.; Kiefer, S.; Worrell, J. Bisimilarity of probabilistic pushdown automata. arXiv 2012, arXiv:1210.2273. [Google Scholar]
  9. Eisner, J. Inside-Outside and Forward-Backward Algorithms Are Just Backprop (tutorial paper). In Proceedings of the Workshop on Structured Prediction for NLP, Austin, TX, USA, 5 November 2016; pp. 1–17. [Google Scholar] [CrossRef]
  10. Wang, S.; Wang, S.; Cheng, L.; Greiner, R.; Schuurmans, D. Exploiting syntactic, semantic, and lexical regularities in language modeling via directed markov random fields. Comput. Intell. 2013, 29, 649–679. [Google Scholar] [CrossRef]
  11. Valiant, L.G.; Paterson, M.S. Deterministic one-counter automata. J. Comput. Syst. Sci. 1975, 10, 340–350. [Google Scholar] [CrossRef]
  12. Baum, L.E.; Eagon, J.A. An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology. Bull. Amer. Math. Soc. 1967, 73, 360–363. [Google Scholar] [CrossRef]
  13. Johnson, M. PCFG models of linguistic tree representations. Comput. Linguist. 1998, 24, 613–632. [Google Scholar]
  14. Lee, H.; Ng, A.Y. Spam Deobfuscation using a Hidden Markov Model. In Proceedings of the Second Conference on Email and Anti-Spam (CEAS), Stanford, CA, USA, 21–22 July 2005. [Google Scholar]
  15. Kolter, J.Z.; Jaakkola, T. Approximate inference in additive factorial hmms with application to energy disaggregation. In Proceedings of the Artificial Intelligence and Statistics, La Palma, Spain, 21–23 April 2012; PMLR 2012. pp. 1472–1482. [Google Scholar]
  16. Raman, N.; Maybank, S.J. Activity recognition using a supervised non-parametric hierarchical HMM. Neurocomputing 2016, 199, 163–177. [Google Scholar] [CrossRef]
  17. Yu, S.Z. Hidden semi-Markov models. Artif. Intell. 2010, 174, 215–243. [Google Scholar] [CrossRef]
  18. Kurucan, M.; Özbaltan, M.; Schewe, S.; Wojtczak, D. Hidden 1-Counter Markov Models and How to Learn Them. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, International Joint Conferences on Artificial Intelligence Organization, Vienna, Austria, 23–29 July 2022; pp. 4857–4863. [Google Scholar]
  19. Mor, B.; Garhwal, S.; Kumar, A. A systematic review of hidden Markov models and their applications. Arch. Comput. Methods Eng. 2021, 28, 1429–1448. [Google Scholar] [CrossRef]
  20. Kumar, A.; Kumar, P. Reliability assessment for multi-state automatic ticket vending machine (ATVM) through software and hardware failures. J. Qual. Maint. Eng. 2022, 28, 448–473. [Google Scholar] [CrossRef]
  21. Kumar, A.; Ram, M. Process modeling for decomposition unit of a UFP for reliability indices subject to fail-back mode and degradation. J. Qual. Maint. Eng. 2023, 29, 606–621. [Google Scholar] [CrossRef]
  22. Kumar, A.; Kumar, P. Application of Markov process/mathematical modelling in analysing communication system reliability. Int. J. Qual. Reliab. Manag. 2020, 37, 354–371. [Google Scholar] [CrossRef]
  23. Yao, Q.; Cheng, L.; Chen, W.; Mao, T. Some Generalized Entropy Ergodic Theorems for Nonhomogeneous Hidden Markov Models. Mathematics 2024, 12, 605. [Google Scholar] [CrossRef]
  24. Wei, F.; Wang, J.; Ma, X.; Yang, L.; Qiu, Q. An Optimal Opportunistic Maintenance Planning Integrating Discrete-and Continuous-State Information. Mathematics 2023, 11, 3322. [Google Scholar] [CrossRef]
  25. Wang, X.; Wang, J.; Ning, R.; Chen, X. Joint optimization of maintenance and spare parts inventory strategies for emergency engineering equipment considering demand priorities. Mathematics 2023, 11, 3688. [Google Scholar] [CrossRef]
  26. Lee, B.; Park, J.; Kim, Y. Hidden Markov Model Based on Logistic Regression. Mathematics 2023, 11, 4396. [Google Scholar] [CrossRef]
  27. Alvaro, F.; Sánchez, J.A.; Benedí, J.M. Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Pattern Recognit. Lett. 2014, 35, 58–67. [Google Scholar] [CrossRef]
  28. Dyrka, W.; Gąsior-Głogowska, M.; Szefczyk, M.; Szulc, N. Searching for universal model of amyloid signaling motifs using probabilistic context-free grammars. BMC Bioinform. 2021, 22, 222. [Google Scholar] [CrossRef] [PubMed]
  29. Oseki, Y.; Marantz, A. Modeling human morphological competence. Front. Psychol. 2020, 11, 513740. [Google Scholar] [CrossRef]
  30. Carravetta, F.; White, L.B. Embedded stochastic syntactic processes: A class of stochastic grammars equivalent by embedding to a Markov Process. IEEE Trans. Aerosp. Electron. Syst. 2021, 57, 1996–2005. [Google Scholar] [CrossRef]
  31. Lopes, H.B.; de Freitas, A.R. Probabilistic (k, l)-Context-Sensitive Grammar Inference with Gibbs Sampling Applied to Chord Sequences. In Proceedings of the ICAART, Online Streaming, 4–6 February 2021; Volume 2, pp. 572–579. [Google Scholar]
  32. Bourlard, H.; Bengio, S. Hidden Markov Models and Other Finite State Automata for Sequence Processing; Technical Report; IDIAP: Martigny, Switzerland, 2001. [Google Scholar]
  33. Dupont, P.; Denis, F.; Esposito, Y. Links between probabilistic automata and hidden Markov models: Probability distributions, learning models and induction algorithms. Pattern Recognit. 2005, 38, 1349–1371. [Google Scholar] [CrossRef]
  34. Adhikary, S.; Srinivasan, S.; Miller, J.; Rabusseau, G.; Boots, B. Quantum tensor networks, stochastic processes, and weighted automata. In Proceedings of the International Conference on Artificial Intelligence and Statistics, Virtual, 13–15 April 2021; PMLR2021. pp. 2080–2088. [Google Scholar]
  35. Bhattacharya, C.; Ray, A. Thresholdless Classification of chaotic dynamics and combustion instability via probabilistic finite state automata. Mech. Syst. Signal Process. 2022, 164, 108213. [Google Scholar] [CrossRef]
  36. Almutiri, T.; Nadeem, F. Markov models applications in natural language processing: A survey. Int. J. Inf. Technol. Comput. Sci 2022, 2, 1–16. [Google Scholar] [CrossRef]
  37. Pande, S.D.; Kanna, R.K.; Qureshi, I. Natural language processing based on name entity with n-gram classifier machine learning process through ge-based hidden markov model. Mach. Learn. Appl. Eng. Educ. Manag. 2022, 2, 30–39. [Google Scholar]
  38. Zhang, J.; Wang, C.; Muthu, A.; Varatharaju, V. Computer multimedia assisted language and literature teaching using Heuristic hidden Markov model and statistical language model. Comput. Electr. Eng. 2022, 98, 107715. [Google Scholar] [CrossRef]
  39. Li, Y.; Shetty, P.; Liu, L.; Zhang, C.; Song, L. Bertifying the hidden markov model for multi-source weakly supervised named entity recognition. arXiv 2021, arXiv:2105.12848. [Google Scholar]
  40. Chiu, J.T.; Rush, A.M. Scaling hidden Markov language models. arXiv 2020, arXiv:2011.04640. [Google Scholar]
  41. Nefian, A.V.; Liang, L.; Pi, X.; Xiaoxiang, L.; Mao, C.; Murphy, K. A coupled HMM for audio-visual speech recognition. In Proceedings of the 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, USA, 13–17 May 2002; IEEE: Piscataway, NJ, USA, 2002; Volume 2. pp. II–2013. [Google Scholar]
  42. Hassan, M.R.; Nath, B. Stock market forecasting using hidden Markov model: A new approach. In Proceedings of the 5th international conference on intelligent systems design and applications (ISDA’05), Wroclaw, Poland, 8–10 September 2005; IEEE: Piscataway, NJ, USA, 2005; pp. 192–196. [Google Scholar]
  43. De Fonzo, V.; Aluffi-Pentini, F.; Parisi, V. Hidden Markov models in bioinformatics. Curr. Bioinform. 2007, 2, 49–61. [Google Scholar] [CrossRef]
  44. Juang, B.H.; Rabiner, L.R. Hidden Markov models for speech recognition. Technometrics 1991, 33, 251–272. [Google Scholar] [CrossRef]
  45. Sitinjak, A.; Pasaribu, E.; Simarmata, J.; Putra, T.; Mawengkang, H. The Analysis of Forward and Backward Dynamic Programming for Multistage Graph. IOP Conf. Ser. Mater. Sci. Eng. 2018, 300, 012010. [Google Scholar] [CrossRef]
  46. Lindberg, D.V.; Omre, H. Inference of the transition matrix in convolved hidden Markov models and the generalized Baum–Welch algorithm. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6443–6456. [Google Scholar] [CrossRef]
Figure 1. Diagram of a hidden Markov model.
Figure 2. Algorithm for the forward computation.
Figure 3. Algorithm for the backward computation.
Figure 4. Trellis diagram of the forward calculation.
Figure 5. Trellis diagram of the backward calculation.
Figure 6. Comparison of KL scores based on the different lengths of observation sequence.
Table 1. Notation details of the HASMM.
Notation: Description
α_t(q, ψ): Forward calculation at time t over state q and stack symbol ψ
α_{t−1}(q′, ψ′): Previous forward calculation at time t − 1 over state q′ and stack symbol ψ′
β_t(q, ψ): Backward calculation at time t over state q and stack symbol ψ
β_{t+1}(q′, ψ′): Future backward calculation at time t + 1 over state q′ and stack symbol ψ′
γ_t(q, ψ): The likelihood of being in state q with stack symbol ψ at time t
ξ_t((q, ψ), (q′, ψ′)): The likelihood of a transition from state q and stack symbol ψ to state q′ and stack symbol ψ′ at time t
Δ: The change of the stack symbol while proceeding through a run, ψ → ψ′
Ψ: The content of the stack
σ_t: An emission symbol produced by a latent variable at time t
σ_{t+1}: An emission symbol produced by a latent variable at time t + 1
A(q, ψ, q′): State transition matrix; the run is performed from state q to q′ with stack symbol ψ
B(q, σ): The emission matrix, holding the probability of emitting symbol σ in latent state q
S_t: Hidden state at time t
S_{t+1}: Next hidden state at time t + 1
S_{t−1}: Previous hidden state at time t − 1
Table 2. Performance of the HASMM, H1MM, and HMM: KL score values and training times of the 6 individual models.
Model | HASMM KL | HASMM Time (h) | H1MM KL | H1MM Time (h) | HMM KL | HMM Time (h)
M_1   | 0.078    | 0.2            | 0.0080  | 1             | 4.94   | 0.2
M_2   | 0.064    | 0.7            | 0.0022  | 6             | 5.44   | 0.5
M_3   | 0.066    | 1.5            | 0.1100  | 22            | 5.24   | 1
M_4   | 0.098    | 5              | 0.0027  | 79            | 5.26   | 4
M_5   | 0.081    | 9              | 0.0045  | 218           | 6.80   | 6
M_6   | 0.077    | 14             | 0.0064  | 552           | 5.11   | 10
Özbaltan, M. Hidden Abstract Stack Markov Models with Learning Process. Mathematics 2024, 12, 2144. https://doi.org/10.3390/math12132144