In this study, the convolution subsampling network shown in Figure 1 was replaced by a capsule network that downsamples the speech signal and extracts its shallow features for the conformer blocks, thereby improving the accuracy of speech feature extraction. Based on this new architecture, which combines a capsule network with a conformer encoder, an end-to-end speech recognition model was proposed, as shown in Figure 3. The model includes three core modules: the CapsNet blocks, the conformer blocks, and the bi-transformer decoder. The following describes the model according to its workflow.
3.1. Encoder
The speech features are extracted as input for our model. After data augmentation with SpecAugment [26], shallow features are extracted by the basic CNN blocks and fed into the CapsNet blocks of the capsule network. In this study, the CapsNet blocks consist of two internal layers: a primary capsule layer and a digital capsule layer. The primary capsule layer spreads the speech data into multiple capsules and integrates them into a matrix, which is sent to the digital capsule layer to acquire high-level features. The digital capsule layer follows the dynamic routing algorithm in Equations (1)–(5), and the magnitude of its output vector gives the probability that the speech signal belongs to a given category.
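For concreteness, the sketch below implements the squash nonlinearity and the dynamic routing loop between lower- and higher-level capsules in PyTorch, in the spirit of Equations (1)–(5); the tensor shapes and helper names are illustrative rather than taken from the paper's implementation.

```python
import torch
import torch.nn.functional as F

def squash(s, dim=-1, eps=1e-8):
    # Squash nonlinearity: shrinks vector length into [0, 1) while
    # preserving direction, so length can act as a class probability.
    sq_norm = (s ** 2).sum(dim=dim, keepdim=True)
    return (sq_norm / (1.0 + sq_norm)) * s / torch.sqrt(sq_norm + eps)

def dynamic_routing(u_hat, num_iters=3):
    # u_hat: predictions from lower capsules for each higher capsule,
    # shape (batch, n_lower, n_higher, dim_higher).
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)  # routing logits
    for _ in range(num_iters):
        c = F.softmax(b, dim=2)                    # coupling coefficients
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)   # weighted sum over lower capsules
        v = squash(s)                              # (batch, n_higher, dim_higher)
        # Agreement between predictions and outputs updates the logits.
        b = b + (u_hat * v.unsqueeze(1)).sum(dim=-1)
    return v

# Toy usage: 32 primary capsules predicting 10 higher capsules of dim 16.
u_hat = torch.randn(2, 32, 10, 16)
v = dynamic_routing(u_hat)
print(v.shape)        # torch.Size([2, 10, 16])
print(v.norm(dim=-1)) # vector lengths in [0, 1), read as probabilities
```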
First, to improve the training speed of the capsule network, we added a residual connection to the CapsNet blocks, as shown in Figure 4, because a residual network mitigates the degradation problem caused by increasing the number of network layers [26]. After introducing a shortcut connection into the original network, our model allows data to flow across layers; thus, the initial learning goal of the network changes and the learning difficulty is reduced.
Assuming the input and output dimensions of a nonlinear unit in the neural network are the same, the function Z to be fitted within the unit can be decomposed into two parts, given as:

Z(x) = F(x) + x,

where F represents the model function of the CapsNet blocks and x represents the system input.
Inside the network, the identity mapping Z(x) → x is learned; that is, the residual part F(x) converges to 0. This is equivalent to replacing the learning target with Z(x) − x, leading the whole structure to converge toward an identity mapping. This transformation speeds up the training of the capsule network and simplifies parameter optimization. Then, Z is passed through a linear layer to integrate the features and change the dimension, and the resulting vector x is processed by a dropout layer to avoid overfitting:

x = Dropout(Linear(Z(x))).
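A minimal sketch of this residual wiring, assuming the CapsNet block F preserves the feature dimension so the shortcut can be added directly; the module and parameter names are our own, not the paper's.

```python
import torch
import torch.nn as nn

class ResidualCapsBlock(nn.Module):
    # Wraps a capsule sub-network F with a shortcut connection, so the
    # block learns the residual F(x) and outputs Z(x) = F(x) + x.
    def __init__(self, caps_net: nn.Module, dim_in: int, dim_out: int, p_drop: float = 0.1):
        super().__init__()
        self.caps_net = caps_net                # the CapsNet block F
        self.proj = nn.Linear(dim_in, dim_out)  # integrate features / change dimension
        self.dropout = nn.Dropout(p_drop)       # regularization against overfitting

    def forward(self, x):
        z = self.caps_net(x) + x                # Z(x) = F(x) + x
        return self.dropout(self.proj(z))

# Toy usage with a shape-preserving stand-in for the capsule network.
block = ResidualCapsBlock(nn.Sequential(nn.Linear(80, 80), nn.ReLU()), 80, 256)
y = block(torch.randn(4, 100, 80))
print(y.shape)  # torch.Size([4, 100, 256])
```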
Finally, the resulting features x are input to the conformer module to enhance the performance of the encoder. The conformer blocks contain four modules: the feedforward, multi-head self-attention [27], convolution, and layer normalization modules; the specific structure is shown in Figure 5.
After feeding x into the conformer blocks, the output y is obtained by passing x through the feedforward module FFN, the multi-head self-attention module MHSA, the convolution module Conv, and the layer normalization module Layernorm. The specific process is as follows:

x̃ = x + (1/2) FFN(x),
x′ = x̃ + MHSA(x̃),
x″ = x′ + Conv(x′),
y = Layernorm(x″ + (1/2) FFN(x″)).
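The sketch below expresses this data flow as a PyTorch module; the macaron-style half-step feedforward residuals, kernel size, and activation choices follow the standard conformer design and are assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    # Conv module: pointwise conv + GLU, depthwise conv, BatchNorm, pointwise conv.
    def __init__(self, d_model, kernel):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.pw1 = nn.Conv1d(d_model, 2 * d_model, 1)
        self.dw = nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model)
        self.bn = nn.BatchNorm1d(d_model)
        self.pw2 = nn.Conv1d(d_model, d_model, 1)

    def forward(self, x):                         # x: (batch, time, d_model)
        y = self.norm(x).transpose(1, 2)          # -> (batch, d_model, time)
        y = F.glu(self.pw1(y), dim=1)             # gated linear unit
        y = self.pw2(F.silu(self.bn(self.dw(y))))
        return y.transpose(1, 2)

def ffn(d_model, mult=4):
    return nn.Sequential(nn.LayerNorm(d_model),
                         nn.Linear(d_model, mult * d_model), nn.SiLU(),
                         nn.Linear(mult * d_model, d_model))

class ConformerBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, kernel=15):
        super().__init__()
        self.ffn1, self.ffn2 = ffn(d_model), ffn(d_model)
        self.attn_norm = nn.LayerNorm(d_model)
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = ConvModule(d_model, kernel)
        self.out_norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = x + 0.5 * self.ffn1(x)                         # half-step FFN
        a = self.attn_norm(x)
        x = x + self.mhsa(a, a, a, need_weights=False)[0]  # MHSA with residual
        x = x + self.conv(x)                               # convolution module
        return self.out_norm(x + 0.5 * self.ffn2(x))       # y = Layernorm(...)

y = ConformerBlock()(torch.randn(2, 100, 256))
print(y.shape)  # torch.Size([2, 100, 256])
```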
3.2. Decoder
Because a bi-transformer model can output both a left-to-right and a right-to-left target sequence, with the advantages of better attention and more effective use of contextual information, our system uses the bi-transformer model as the decoder [28,29]. The decoder stacks N sublayers, each comprising a multi-head attention model and a feedforward network, and adds residual connections between the inputs and outputs of the sublayers. The bi-transformer decoder comprises two weight-sharing single decoders: one decodes from left to right to generate the forward context, and the other decodes from right to left to generate the backward context. The input of the bi-transformer is obtained by mapping the output of the encoder. Suppose the output of the conformer is Y; through matrix multiplication, it is mapped into three matrices in different spaces: K, Q, and V. The forward and backward vectors are then spliced to obtain a two-way dot-product attention layer, and the attention scores H are calculated as:

H = softmax(QK^T / √d_k) V,

where d_k is the dimension of the key vectors.
Similarly, the multi-head attention value is obtained by splicing the individual heads:

MultiHead(Q, K, V) = Concat(head_1, …, head_h) W^O,

where W^O is the trainable weight matrix, Concat is the concatenation operation, and h is the number of heads.
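As a worked illustration of these two equations, the following sketch computes scaled dot-product attention per head and splices the heads back together; the projection matrices and dimensions are illustrative, not the paper's settings.

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V):
    # H = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ V

def multi_head(Q, K, V, W_o, h):
    # Split the model dimension into h heads, attend per head, then
    # splice the heads back: MultiHead = Concat(head_1..head_h) W^O.
    B, T, d_model = Q.shape
    d_head = d_model // h
    def split(x):  # (B, T, d_model) -> (B, h, T, d_head)
        return x.view(B, -1, h, d_head).transpose(1, 2)
    heads = scaled_dot_product_attention(split(Q), split(K), split(V))
    concat = heads.transpose(1, 2).reshape(B, T, d_model)
    return concat @ W_o

# Toy usage: encoder output Y mapped to Q, K, V by three projections.
Y = torch.randn(2, 50, 256)
Wq, Wk, Wv, Wo = (torch.randn(256, 256) * 0.02 for _ in range(4))
H = multi_head(Y @ Wq, Y @ Wk, Y @ Wv, Wo, h=4)
print(H.shape)  # torch.Size([2, 50, 256])
```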
Because the decoder is bidirectional, decoding proceeds in both directions according to the beam search width, and stops when a termination symbol is predicted. The search process is as follows: at each time step, the set of alive hypotheses consists of candidate statements guided by the special character <L2R> and candidate statements guided by the special character <R2L>, and the score of each completed sentence determines the best generated sequence. If the highest-scoring hypothesis is guided by the right-to-left sequence, the final predicted sequence must be reversed. This search method effectively solves the problem that a traditional transformer model can only decode in one direction; therefore, the consistency between hypotheses in different directions is improved. The loss function of the bi-transformer is defined as:

L = α L_l2r + (1 − α) L_r2l,
where α is the inversion coefficient that balances the weight of the two decoding orders, and L_l2r and L_r2l represent the scores of the left-to-right and right-to-left decoding, respectively. In our system, α = 0.5.
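A one-line illustration of this weighted combination, assuming the loss is the α-weighted sum of the two directional scores as the definition above suggests:

```python
def bitransformer_loss(l_l2r: float, l_r2l: float, alpha: float = 0.5) -> float:
    # L = alpha * L_l2r + (1 - alpha) * L_r2l balances the two decoding orders.
    return alpha * l_l2r + (1.0 - alpha) * l_r2l

print(bitransformer_loss(2.3, 2.7))  # 2.5 with the paper's alpha = 0.5
```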
3.3. Decoding
The purpose of this study was to design an offline speech recognition architecture that does not contain a language model; therefore, there are high requirements for accuracy and stability, as well as trainability and decoding performance. The decoding framework in this study uses the U2 model with two-pass joint connectionist temporal classification/attention-based encoder–decoder (CTC/AED) decoding [30]; its structure is shown in Figure 6. The framework is derived from the traditional joint CTC-attention framework and includes three modules: the CTC decoder, the attention decoder, and a shared encoder, where the connection between the CTC decoder and the attention decoder is strengthened by a rescoring method. The CTC decoder consists of a linear layer that transforms the intermediate output of the encoder into the final output and completes decoding using the attention rescoring method. Specifically, the CTC prefix beam search algorithm generates the N-best intermediate candidates, and the attention decoder then jointly rescores these candidates with the shared encoder to obtain a more accurate final output, as shown in Figure 7. In addition, the U2 model uses the chunk training method to limit attention modeling: the chunk size is varied dynamically during training, so the attention model captures information at various chunk sizes and learns to make accurate predictions in different limited contexts, achieving a trade-off between decoding speed and accuracy.
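The rescoring pass can be sketched as follows; `attention_decoder`, the candidate format, and the 0.5 interpolation weight are illustrative assumptions, not the U2 implementation.

```python
def attention_rescore(ctc_nbest, attention_decoder, enc_out, ctc_weight=0.5):
    # Second pass: the CTC prefix beam search proposes N-best candidates
    # with scores; the attention decoder rescores each one against the
    # shared encoder output, and the combined score picks the winner.
    best, best_score = None, float("-inf")
    for hyp, ctc_score in ctc_nbest:
        att_score = attention_decoder(enc_out, hyp)  # log p_att(hyp | X)
        score = ctc_weight * ctc_score + (1 - ctc_weight) * att_score
        if score > best_score:
            best, best_score = hyp, score
    return best, best_score

# Toy usage with a stand-in decoder that scores by hypothesis length.
fake_decoder = lambda enc, hyp: -0.1 * len(hyp)
nbest = [([3, 7, 7, 2], -4.2), ([3, 7, 2], -4.5)]
print(attention_rescore(nbest, fake_decoder, enc_out=None))
```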
The basic problem of speech recognition is to map a speech feature sequence to a word sequence, which can be solved by Bayesian decision theory and probability function estimation theory [31,32]. First, the word sequence with the highest probability is estimated as follows:

Z* = argmax_Z p(Z|X),
where Z* = {z_n ∈ V | n = 1, …, N} is the estimated word sequence with the highest probability among all possible word sequences, X = {x_t ∈ R^D | t = 1, …, T} is the input speech feature sequence, x_t is the D-dimensional speech feature vector of the t-th frame, and z_n is the word at position n of the vocabulary V.
According to Bayesian theory, p(Z|X) is transformed into:

p(Z|X) = p(X|Z) p(Z) / p(X).    (16)
Because the input X does not change during the decoding process, p(X) can be neglected. Therefore, Equation (16) is simplified as follows:

Z* = argmax_Z p(X|Z) p(Z),
where p(Z) is the prior probability of the output word sequence, described by the language model, and p(X|Z) is the likelihood probability, characterized by the acoustic model.
Because an input audio frame covers only about 20–30 ms, its information cannot span a whole word or even a character, so the general output modeling unit is the phoneme. The CTC method defines a path sequence C = {c_1, …, c_T}, in which both non-blank and blank labels exist, and the probability of the whole path sequence is the product of the label probabilities of each frame:

p(C|X) = ∏_{t=1}^{T} q_t(c_t),
where q_t(c_t) denotes the posterior probability of path label c_t in the t-th frame.
Defining L as the label element table, the augmented letter sequence with the blank symbol b is defined as follows:

L′ = L ∪ {b}.
Define Φ: L^{≤T} → L′^T as the mapping from a label sequence Z to the path sequences C that collapse to it. Then, the probability of the label sequence Z is:

p(Z|X) = Σ_{C ∈ Φ(Z)} p(C|X).
The posterior probability p(C|X) is factorized as:

p(C|X) ≈ ∏_{t=1}^{T} p(c_t|X).

In the CTC method, p(C|X) is thus obtained under the conditional independence assumption, which simplifies the dependence between the acoustic model and the alphabetic model.
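The brute-force sketch below makes the collapse map and the marginalization concrete for a toy alphabet; `phi` here implements the collapse (path-to-label) direction, i.e., the inverse orientation of Φ as defined above, and enumerates all |L′|^T paths, so it is for illustration only.

```python
from itertools import groupby, product

def phi(path, blank=0):
    # Collapse a path: merge adjacent repeats, then delete blanks,
    # e.g. (a, a, -, a) -> (a, a) with '-' the blank symbol.
    merged = [k for k, _ in groupby(path)]
    return tuple(k for k in merged if k != blank)

def p_label(Z, frame_probs, blank=0):
    # p(Z | X) = sum over all paths C collapsing to Z of prod_t q_t(c_t).
    T = len(frame_probs)
    labels = range(len(frame_probs[0]))
    total = 0.0
    for C in product(labels, repeat=T):
        if phi(C, blank) == tuple(Z):
            p = 1.0
            for t, c in enumerate(C):
                p *= frame_probs[t][c]  # conditional independence across frames
            total += p
    return total

# Toy usage: 3 frames, alphabet {blank, 'a'}; probability of the label ('a',).
q = [[0.4, 0.6], [0.5, 0.5], [0.7, 0.3]]
print(p_label([1], q))  # 0.77
```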
Compared with CTC, the attention method estimates p(C|X) using the probability chain rule, without any independence assumption [33]:

p_att(C|X) = ∏_{l} p_att(c_l | c_1, …, c_{l−1}, X),
where p_att(C|X) is the attention-based objective function. The term p_att(c_l | c_1, …, c_{l−1}, X) is calculated in the following steps.
First, the input speech feature sequence is encoded as:

h_t = Encoder(X).
The attention weight is given as:

a_lt = ContentAttention(q_{l−1}, h_t)  or  a_lt = LocationAttention({a_{l−1}}, q_{l−1}, h_t),

where a_lt is the attention weight, which acts as a soft alignment over the hidden vectors h_t; ContentAttention(·) and LocationAttention(·) denote the attention mechanisms without and with convolution features, respectively.
For each output, the hidden vector of the lexical sequence is calculated as:

r_l = Σ_{t=1}^{T} a_lt h_t.
Then, the decoder network, which includes a recurrent network conditioned on the previous output c_{l−1} and the previous hidden state q_{l−1} in addition to the hidden vector r_l, is used to calculate p_att(c_l | c_1, …, c_{l−1}, X) as follows:

p_att(c_l | c_1, …, c_{l−1}, X) = Decoder(r_l, q_{l−1}, c_{l−1}).
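A minimal content-based attention step, assuming a dot-product score between the decoder state and each encoder hidden vector; the location-aware variant would additionally convolve the previous attention weights, which is omitted here.

```python
import torch

def content_attention(q_prev, h):
    # a_{lt} = softmax_t( score(q_{l-1}, h_t) ): content-based scoring.
    scores = h @ q_prev                   # (T,)
    a = torch.softmax(scores, dim=0)      # attention weights over frames
    r = (a.unsqueeze(1) * h).sum(dim=0)   # r_l = sum_t a_{lt} h_t
    return a, r

# Toy usage: 50 encoder frames of dim 256 and one decoder state.
h = torch.randn(50, 256)
q_prev = torch.randn(256)
a, r = content_attention(q_prev, h)
print(a.shape, r.shape, float(a.sum()))  # torch.Size([50]) torch.Size([256]) ~1.0
```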
In our training process, CTC is combined with attention-based cross entropy [34,35], and a multi-objective learning framework is adopted to improve robustness. The loss function over the training collection S is denoted as:

L = λ L_CTC + (1 − λ) L_AED,

where L_CTC and L_AED are the loss functions of the CTC and attention decoders, respectively, and λ is an adjustable parameter satisfying 0 ≤ λ ≤ 1.
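Numerically, the combination is just a convex mixture of the two branch losses; λ = 0.3 below is an illustrative value, not the paper's setting.

```python
def joint_loss(l_ctc: float, l_aed: float, lam: float = 0.3) -> float:
    # Multi-objective training: L = lambda * L_CTC + (1 - lambda) * L_AED,
    # with 0 <= lambda <= 1 trading off the two decoder branches.
    assert 0.0 <= lam <= 1.0
    return lam * l_ctc + (1.0 - lam) * l_aed

print(joint_loss(12.5, 3.8))  # 6.41
```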
The L_CTC is calculated as:

L_CTC = −ln p(Z*|X),

where Z* represents the correct label sequence and p(Z*|X) denotes the probability of obtaining the correct label for a given X, which can be calculated using the forward–backward algorithm as follows:

p(Z*|X) = Σ_u α_t(u) β_t(u) / q_t(c′_u),
where c′_u represents the possible prefix paths ending with the u-th label, α_t(u) represents the total probability of all possible prefix paths ending with the u-th label, β_t(u) is the total probability of all possible suffix paths starting from the u-th label, and q_t(c′_u) represents the output value for c′_u at time t after activation by the softmax function.
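The forward variables α_t(u) admit the standard dynamic-programming recursion over the blank-augmented label sequence; the sketch below computes p(Z*|X) with the forward pass alone and reproduces the brute-force value from the earlier toy example. Names and the toy data are illustrative.

```python
import numpy as np

def ctc_forward(frame_probs, label, blank=0):
    # Forward variables alpha_t(u) over the blank-augmented label sequence
    # l' = (b, z_1, b, z_2, ..., z_N, b); alpha_t(u) sums the probability of
    # all prefix paths that end in the u-th augmented label at frame t.
    T = len(frame_probs)
    ext = [blank]
    for z in label:
        ext += [z, blank]
    U = len(ext)
    alpha = np.zeros((T, U))
    alpha[0, 0] = frame_probs[0][ext[0]]
    if U > 1:
        alpha[0, 1] = frame_probs[0][ext[1]]
    for t in range(1, T):
        for u in range(U):
            s = alpha[t - 1, u]
            if u > 0:
                s += alpha[t - 1, u - 1]
            # Skip transition allowed only between distinct non-blank labels.
            if u > 1 and ext[u] != blank and ext[u] != ext[u - 2]:
                s += alpha[t - 1, u - 2]
            alpha[t, u] = s * frame_probs[t][ext[u]]
    # p(Z* | X): valid paths end in the final label or the trailing blank.
    return alpha[T - 1, U - 1] + (alpha[T - 1, U - 2] if U > 1 else 0.0)

# Toy check against the brute-force example: alphabet {blank, 'a'}.
q = [[0.4, 0.6], [0.5, 0.5], [0.7, 0.3]]
print(ctc_forward(q, [1]))  # 0.77, matching p_label above
```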
The L_AED is calculated as:

L_AED = −Σ_u ln p(c*_u | c*_{1:u−1}, X),

where c*_{1:u−1} represents the correct labels before the current character c*_u.
In summary, the flexible alignment of CTC accelerates the training process of the network, and fusing the CTC and attention decoder loss functions significantly simplifies the training pipeline and speeds up model training; therefore, training of the multiple branches of the encoder can be realized in speech recognition.