*3.4. Segmentation and Recognition Layer*

After text regions are detected by the detection layer, text segmentation and word recognition are performed. Text instance regions are segmented using four consecutive convolution layers with 3 × 3 filters followed by deconvolution layers with 2 × 2 filters and strides, applied to the RoI align features of the previous layer together with the predicted bounding boxes. Finally, the segmented text instance features *x* = (*x*1, *x*2, ... , *xT*) are fed to a time-restricted self-attention encoder-decoder module.
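For concreteness, the following is a minimal PyTorch-style sketch of such a segmentation branch; the channel width (256), the ReLU placement, and the single-channel mask output are illustrative assumptions rather than values specified above.

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Sketch of the segmentation head: four consecutive 3x3 convolutions
    followed by a 2x2 stride-2 deconvolution applied to RoI-aligned features.
    Channel counts and the single-channel mask output are assumptions."""

    def __init__(self, in_channels=256, mid_channels=256):
        super().__init__()
        layers = []
        for i in range(4):  # four consecutive 3x3 convolution layers
            layers.append(nn.Conv2d(in_channels if i == 0 else mid_channels,
                                    mid_channels, kernel_size=3, padding=1))
            layers.append(nn.ReLU(inplace=True))
        self.convs = nn.Sequential(*layers)
        # 2x2 deconvolution with stride 2 upsamples the RoI feature map
        self.deconv = nn.ConvTranspose2d(mid_channels, mid_channels,
                                         kernel_size=2, stride=2)
        self.mask = nn.Conv2d(mid_channels, 1, kernel_size=1)  # per-pixel text mask

    def forward(self, roi_features):      # (N, C, H, W) RoI-aligned features
        x = self.convs(roi_features)
        x = torch.relu(self.deconv(x))    # (N, C, 2H, 2W)
        return self.mask(x)               # (N, 1, 2H, 2W) mask logits
```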

In [43], a time-restricted (attention window) self-attention encoder-decoder module is presented for automatic speech recognition; it achieves state-of-the-art results by addressing the limitations of CTC (i.e., the hard alignment problem and the conditional independence assumption) and of the attention encoder-decoder module. Unlike [9], we use a time-restricted self-attention module with a bidirectional Gated Recurrent Unit (GRU) as the encoder and a GRU as the decoder. From the extracted and segmented features, the bidirectional encoder computes the hidden feature vector *ht* as follows:

$$z_t = \sigma(W_{xz} x_t + U_{hz} h_{t-1} + b_z) \tag{5}$$

$$r_t = \sigma(W_{xr} x_t + U_{hr} h_{t-1} + b_r) \tag{6}$$

$$\tilde{h}_t = \tanh\left(W_{xh} x_t + U_{rh} \left(r_t \otimes h_{t-1}\right) + b_h\right) \tag{7}$$

$$h_t = (1 - z_t) \otimes h_{t-1} + z_t \otimes \tilde{h}_t \tag{8}$$

where *zt*, *rt*, *h̃t*, and *ht* are the update gate, reset gate, current memory, and final memory at the current time step, respectively. *W*, *U*, and *b* are parameter matrices and bias vectors; σ and tanh stand for the sigmoid and hyperbolic tangent functions, respectively.
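As a sanity check on the notation, Eqs. (5)–(8) can be transcribed directly into NumPy; the dimensions below are illustrative, and a bidirectional encoder would run this step once forward and once backward over the sequence, concatenating the two hidden states.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step implementing Eqs. (5)-(8).
    p holds the parameter matrices W_*, U_* and bias vectors b_*."""
    z_t = sigmoid(p['W_xz'] @ x_t + p['U_hz'] @ h_prev + p['b_z'])    # update gate, Eq. (5)
    r_t = sigmoid(p['W_xr'] @ x_t + p['U_hr'] @ h_prev + p['b_r'])    # reset gate, Eq. (6)
    h_tilde = np.tanh(p['W_xh'] @ x_t
                      + p['U_rh'] @ (r_t * h_prev) + p['b_h'])        # current memory, Eq. (7)
    return (1.0 - z_t) * h_prev + z_t * h_tilde                       # final memory, Eq. (8)

# Illustrative dimensions: 32-d input feature, 64-d hidden state.
rng = np.random.default_rng(0)
d_x, d_h = 32, 64
p = {'W_xz': rng.standard_normal((d_h, d_x)), 'U_hz': rng.standard_normal((d_h, d_h)),
     'W_xr': rng.standard_normal((d_h, d_x)), 'U_hr': rng.standard_normal((d_h, d_h)),
     'W_xh': rng.standard_normal((d_h, d_x)), 'U_rh': rng.standard_normal((d_h, d_h)),
     'b_z': np.zeros(d_h), 'b_r': np.zeros(d_h), 'b_h': np.zeros(d_h)}
h_t = gru_step(rng.standard_normal(d_x), np.zeros(d_h), p)
```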

Using the embedding matrix *Wemb*, the hidden vector *ht* is converted to an embedded vector *bt* as follows:

$$b_t = W_{emb} h_t, \quad t = u - \tau, \dots, u + \tau \tag{9}$$

By applying linear projections to the embedded vectors *bt*, the query (*qt*), key (*kt*), and value (*vt*) vectors are computed as follows:

$$q_t = Qb_t, \quad t = u \tag{10}$$

$$k_t = Kb_t, \quad t = u - \tau, \dots, u + \tau \tag{11}$$

$$v_t = Vb_t, \quad t = u - \tau, \dots, u + \tau \tag{12}$$

where *Q*, *K*, and *V* are query, key and value matrices, respectively.

Based on these vectors, the attention score *cut*, attention weight *aut*, and attention context *cu* are derived as follows:

$$c_{ut} = \frac{q_u^T k_t}{\sqrt{d_k}} \tag{13}$$

$$a_{ut} = \frac{\exp(c_{ut})}{\sum_{t'=u-\tau}^{u+\tau} \exp(c_{ut'})} \tag{14}$$

$$c_u = \sum_{t=u-\tau}^{u+\tau} a_{ut} v_t \tag{15}$$
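Taken together, Eqs. (9)–(15) reduce to a scaled dot-product attention restricted to the window [*u* − τ, *u* + τ]. The sketch below is a minimal NumPy rendering under that reading; the matrix sizes are illustrative, and the window is clipped at the sequence boundaries, which the equations leave implicit.

```python
import numpy as np

def restricted_attention(H, u, tau, W_emb, Q, K, V):
    """Time-restricted self-attention for query position u, Eqs. (9)-(15).
    H: (T, d_h) encoder hidden states; tau: half-width of the attention window."""
    lo, hi = max(0, u - tau), min(H.shape[0], u + tau + 1)  # clip window to sequence
    B = H[lo:hi] @ W_emb.T              # embedded vectors b_t over the window, Eq. (9)
    q_u = Q @ (W_emb @ H[u])            # single query at t = u, Eq. (10)
    Ks = B @ K.T                        # keys over the window, Eq. (11)
    Vs = B @ V.T                        # values over the window, Eq. (12)
    d_k = q_u.shape[0]
    scores = Ks @ q_u / np.sqrt(d_k)    # scaled dot products c_ut, Eq. (13)
    a = np.exp(scores - scores.max())
    a /= a.sum()                        # softmax weights a_ut, Eq. (14)
    return a @ Vs                       # attention context c_u, Eq. (15)

# Illustrative sizes: 20 time steps, 64-d hidden states, 32-d embeddings.
rng = np.random.default_rng(0)
T, d_h, d_e = 20, 64, 32
H = rng.standard_normal((T, d_h))
W_emb = rng.standard_normal((d_e, d_h))
Q, K, V = (rng.standard_normal((d_e, d_e)) for _ in range(3))
c_u = restricted_attention(H, u=10, tau=3, W_emb=W_emb, Q=Q, K=K, V=V)
```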

To address the conditional independence assumption of CTC, an attention layer is placed before the CTC projection layer, which transforms its input to a dimension equal to the number of CTC output labels. The attention layer output, which carries context information, then serves as the input of the CTC projection layer at the current time *u*:

$$ph_u = W_{proj} c_u + b \tag{16}$$

where *Wproj* and *b* are the weight matrix and bias of the CTC projection layer, respectively.

Finally, the network is optimized with the CTC loss over the projected output:

$$L_{CTC} = -\log \sum_{\pi \in B^{-1}(y)} p(\pi \mid ph_u) \tag{17}$$

where *y* denotes the output label sequence and *B* is a many-to-one mapping that defines the correspondence between the set of alignment paths and the output label sequences. The self-attention layer connects all positions with a constant number of sequentially executed operations.
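As a sketch of Eqs. (16) and (17), the projection and the CTC objective can be composed with torch.nn.CTCLoss, which performs the sum over the alignment paths in *B*−1(*y*) internally; the label-set size and all dimensions below are assumptions for illustration.

```python
import torch
import torch.nn as nn

num_labels = 37          # assumed label set: 26 letters + 10 digits + CTC blank
d_c = 32                 # dimension of the attention contexts c_u (assumed)
T, N, S = 20, 4, 5       # time steps, batch size, target length (illustrative)

proj = nn.Linear(d_c, num_labels)             # CTC projection layer, Eq. (16)
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

c = torch.randn(T, N, d_c)                    # attention contexts c_u per time step
ph = proj(c)                                  # ph_u = W_proj c_u + b
log_probs = ph.log_softmax(dim=-1)            # log distributions over CTC labels

targets = torch.randint(1, num_labels, (N, S))
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)
# L_CTC = -log sum over paths pi in B^{-1}(y) of p(pi | ph_u), Eq. (17)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
```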
