Article

An End-to-End Formula Recognition Method Integrated Attention Mechanism

Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(1), 177; https://doi.org/10.3390/math11010177
Submission received: 5 November 2022 / Revised: 18 December 2022 / Accepted: 22 December 2022 / Published: 29 December 2022

Abstract

Formula recognition is widely used in intelligent document processing and can significantly shorten the time needed to input mathematical formulas, but the accuracy of traditional methods is limited. To reduce the complexity of formula input, an end-to-end encoder-decoder framework with an attention mechanism is proposed that converts formulas in pictures into LaTeX sequences. The Vision Transformer (VIT) is employed as the encoder to convert the original input picture into a set of semantic vectors. Because mathematical formulas are two-dimensional, positional embedding is introduced to ensure the uniqueness of each character position, so that the relative positions and spatial characteristics of the formula characters can be captured accurately. The decoder adopts an attention-based Transformer, which translates the input vectors into the target LaTeX characters. The encoder and decoder are trained jointly with Cross-Entropy as the loss function, and the model is evaluated on the im2latex-100k dataset and CROHME 2014. The experiments show that BLEU reaches 92.11, MED is 0.90, and Exact Match (EM) is 0.62 on the im2latex-100k dataset. This paper's contribution is to introduce machine translation into formula recognition and to realize the end-to-end transformation from the trajectory point sequence of a formula to a LaTeX sequence, providing a new approach to formula recognition based on deep learning.

1. Introduction

In scientific and technical documents, scientific formulas are essential for expressing the relationships between variables concisely. However, formula input is complicated and prone to error, because a formula is essentially a tree structure: not only the characters of the formula but also the relationships between the characters must be recognized. Traditional methods cannot deal with variable substitution and lack good generalization, which has prompted researchers to investigate vision-based automatic formula recognition. Traditional research, mainly built on a two-step process (character segmentation and relationship description), focuses on the recognition of superscript and subscript relations, special symbols, and fractions. For example, the INFTY system [1] aims to convert printed scientific formulas into LaTeX characters [2]. However, some disadvantages remain unsolved, such as the inability to generate complicated structures and the accumulation of errors.
With the development of deep learning, the original methods are gradually being replaced by end-to-end frameworks [3,4,5,6,7]. In recent years, optical character recognition (OCR) [8,9,10,11] based on deep learning has developed rapidly. However, employing existing OCR technology for scientific formula recognition is challenging because the two-dimensional structure of scientific formulas is very complicated: it is necessary to identify not only the characters accurately but especially the logical relationships among them. In other words, the formula recognition task can be regarded as transforming an image into mark-up. In Ref. [12], the authors constructed WYGIWYS to transform an image into mark-up language. Karpathy et al. [13] proposed an image description framework that learns a semantic encoding of the image; the encoded semantic vector is then input to a decoder to generate each character.
This paper proposes an attention-based end-to-end encoder-decoder framework to realize the formula transformation from image to mark-up. For an input image, a set of feature maps is first extracted by the feature extractor and encoded into a context semantic vector C [14,15,16]. The proposed model combines the two steps into an end-to-end structure. Since formula recognition outputs a sequence of marks, this paper uses BLEU, Minimum Edit Distance (MED), and Exact Match (EM) as evaluation indicators. The training and validation of the model were carried out on the im2latex-100k dataset.
The goals of this paper are as follows: firstly, the idea of machine translation is introduced into formula recognition to explore a new approach to formula OCR; secondly, a YOLO model is used to detect multi-line formulas and separate single-line from multi-line formulas to improve the model's accuracy. The specific contributions are as follows: 1. The formula's trajectory points are regarded as a particular language to be translated into a LaTeX sequence. 2. A preprocessing method for multi-line formulas is proposed: YOLOv4 detects the type of multi-line formula, the multi-line formula is segmented, the segments are recognized separately, and the results are then combined to increase recognition accuracy.
Following the introduction, Section 2 reviews the work in the field of formula recognition; Section 3 introduces the models used in this paper; Section 4 introduces the experimental environment, experimental methods, and evaluation indicators; Section 5 discusses the results of this method and puts forward the authors' own views; Section 6 summarizes the methods proposed in this paper; the back matter states the availability of the data used in this article.

2. Related Work

Recovering mathematical formulas from images has always been considered a challenging task. The first step of formula recognition is accurately identifying the characters in the picture. According to each character's type, location, and size, the formula structure is then analyzed, and the formula is finally converted into a LaTeX sequence. Traditional methods usually include three independent stages: character segmentation, recognition, and structure generation. With the development of artificial intelligence, the great application potential of deep learning in various fields has become apparent, and researchers have begun to apply deep learning to formula recognition.

2.1. Traditional Methods

Okamoto et al. [17] proposed a three-stage formula recognition method, which uses horizontal and vertical projection to locate individual characters and then uses regular expressions to recognize special symbols and numbers. In the structural analysis stage, the logical structure recognizes superscript, subscript, and radical expressions. This method works well on simple structural formulas, but poorly on complex structures such as matrices and nested structures. Berman et al. [18] proposed a method based on the principle of image-connected regions. Álvaro et al. [19] compared four types of formula symbol recognizers and found that the classification errors mainly involve overlines, fractions, and minus signs. In the work of Zanibbi et al. [20], a method based on a baseline structure tree was proposed to establish an operator tree describing the structure of a scientific formula. Lee et al. [21] proposed a method to identify formula areas in images, so that formula and ordinary text areas can be processed separately. Twaakyondo et al. [22] proposed a method that divides a formula into several sub-formulas and then merges the sub-formulas into an overall tree structure. Suzuki et al. [23] proposed locating the formula characters and using a minimum cost spanning tree algorithm to obtain the formula structure. This work made commercial formula recognition a reality and is the core principle of the formula recognition software INFTY reader.

2.2. Neural Methods for Formula Recognition

With the development of artificial intelligence, researchers began to apply deep learning to formula recognition. Gao et al. [24] proposed a deep neural network based on PDF character information combined with visual feature training to recognize familiar characters and formula areas in documents and process them separately. In Ref. [25], the encoder-decoder framework of seq2seq is implemented to realize formula recognition.
End-to-end formula recognition methods based on deep learning combine the encoder and decoder into one step. Deng et al. [26] proposed an encoder-decoder framework with coarse attention to realize end-to-end image-to-markup generation: a convolutional neural network extracts the feature information of the formula image, and a coarse-to-fine scaling attention mechanism is applied to each extracted feature vector in the decoder. Zhang et al. [27] proposed a gated recurrent unit (GRU) based encoder-decoder framework to realize handwritten formula recognition; on top of the GRU, an attention mechanism is added so that the output of the encoder is no longer a fixed-length context vector but is dynamically computed at each decoding time. In Ref. [28], the authors proposed the TAP model, which takes stroke information as input and uses a GRU with an attention mechanism as the decoder to generate the LaTeX character sequence. The literature [29] replaced the CNN with DenseNet [30] and enhanced the attention using a joint attention mechanism. In Ref. [31], Zhang et al. first enlarged the input image to twice its original size and then applied double attention to improve the model performance.
Peng et al. [32] proposed a large-scale pre-training model named MathBERT based on BERT to improve accuracy while paying attention to itself and its context, which is jointly trained with mathematical formulas and their corresponding contexts. In addition, in order to further capture the semantic-level structural features of formulas, a new pre-training task is designed to predict the masked formula substructures extracted from the Operator Tree (OPT), which is the semantic structural representation of formulas.
Wu et al. [33] proposed a graph-to-graph (G2G) codec framework and tested it on handwritten mathematical formula datasets. In their work, the formula's embedding and the LaTeX tags are used as input and output, respectively, with Graph Neural Networks (GNN) exploring the structural information and a novel sub-graph attention mechanism matching the primitives in the input and output graphs. The experimental results show that the model reaches a new SOTA on the CROHME dataset.
Wang et al. [34] proposed a deep neural network model named MI2LS to convert pictures into LaTeX markers. The model consists of an encoder and a decoder. In the encoding stage, a convolutional neural network extracts formula features to generate a feature map, which is then fed into an encoding network composed of LSTMs to generate the semantic vector C. In the decoding phase, a bi-directional LSTM decodes the semantic vector C step by step to generate the LaTeX tags.

3. Methods

The model inference process can be regarded as a mapping from a picture to a character sequence; Figure 1 shows the recognition process. The mapping is expressed as $f(\chi) \rightarrow Y$, and the model's input is a picture in $H \times W \times C$ format, denoted by $\chi$. The training pictures themselves do not need to be collected manually: only the LaTeX sequences need to be obtained, and the formula pictures can then be rendered dynamically using a LaTeX library such as KaTeX.
The output of the model is the LaTeX character sequence $Y = \{y_1, \dots, y_T\}$, where $y_i$ denotes the i-th decoded character and T denotes the total length of the output LaTeX character sequence. The proposed model consists of an encoder and a decoder. The encoder adopts the Vision Transformer (VIT) model, which encodes the formula's point trajectory sequence into an abstract semantic vector. VIT divides the original input picture into a series of 16 × 16 image blocks [35] and adds positional embedding, because the relative position between different blocks significantly impacts the results. The decoder sequentially decodes the semantic vector and outputs the LaTeX sequence. The initial input of the decoder is the special mark <START>; at each decoding time step t, the decoder receives the context vector and the decoded output at time t−1 as inputs to obtain the next character, until the decoder outputs the end character <EOS>. The overall architecture of the proposed model is shown in Figure 2, and the encoder architecture is shown in Figure 3.
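As an illustration of this decoding loop, the following minimal Python sketch shows greedy inference. The helpers encode and decode_step and the token ids are hypothetical stand-ins for the model components described above, not the authors' released code.

# Greedy decoding sketch: start from <START>, emit one LaTeX token per
# step, and stop at <EOS> or a length limit. Token ids are illustrative.
START, EOS, MAX_LEN = 1, 2, 512

def recognize(image, encode, decode_step):
    memory = encode(image)                        # semantic vectors from the VIT encoder
    tokens = [START]
    for _ in range(MAX_LEN):
        next_token = decode_step(memory, tokens)  # argmax over the vocabulary
        if next_token == EOS:
            break
        tokens.append(next_token)
    return tokens[1:]                             # LaTeX token ids without <START>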

3.1. Encoder

The input of the standard Transformer is a sequence of token embeddings, whereas an image is two-dimensional. To process the image with a Transformer, the two-dimensional data $\chi \in \mathbb{R}^{H \times W \times C}$ must first be flattened into a 2D sequence of blocks $\chi_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$, where P denotes the side length of each image block, C denotes the number of channels of the image, (H, W) denotes the height and width of the image, respectively, and $N = H \cdot W / P^2$. The hidden vector used by the Transformer has fixed length D. Keeping N constant, each $P^2 \cdot C$-dimensional picture block is mapped to a D-dimensional vector by a mapping layer, i.e.,

$$C_0 = \left[ X_{class};\; X_P^1 \omega;\; X_P^2 \omega;\; \dots;\; X_P^N \omega \right]$$

where $X_{class}$ is a learnable class token prepended to the sequence, $\omega \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is the image block embedding matrix, N denotes the number of image embedding blocks, and $X_P^i$ denotes the i-th flattened image block. In the model proposed in this paper, the encoder scales the image to 224 × 224 × C with C = 3 and divides it into image blocks of 16 × 16 pixels. Each block is expanded into a one-dimensional linear sequence, the class token is attached, and a position vector is added before the sequence enters the Transformer encoder; the calculation process of the position vector is shown in Figure 4. After adding the position vector, the input at the initial time is
$$C_0' = C_0 + \omega_{pos}$$

where $\omega_{pos}$ is the position vector and satisfies $\omega_{pos} \in \mathbb{R}^{(N+1) \times D}$, with

$$\omega_{pos}(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad \omega_{pos}(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where $i \in \{0, 1, \dots, N/2\}$. Here $C_0'$ is the encoder input at the initial time, $y_t$ is the decoder input at time t, and Y is the LaTeX character sequence of the final output. Once the input vector is obtained, the data passes through a stack of L identical layers. Each layer includes two sub-layers: Multi-Head Self-Attention (MSA) and a Feed-Forward Network (FFN). After the data enters the two sub-layers, it is normalized by Layer Normalization (LN), i.e.,

$$\mu = \frac{1}{H}\sum_{i=1}^{H} x_i, \qquad \sigma^2 = \frac{1}{H}\sum_{i=1}^{H} (x_i - \mu)^2, \qquad \bar{x} = a \cdot \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + b$$

where $\mu$ and $\sigma^2$ denote the mean and variance, respectively, a and b are learnable parameters, $\varepsilon$ is a small constant added for numerical stability, and H is the length of the vector.
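For concreteness, a minimal NumPy sketch of the patch flattening, the sinusoidal position vector, and the layer normalization defined above is given below; the function names are ours, and the learnable mapping layer $\omega$ and the class token are omitted for brevity.

import numpy as np

def patchify(img, P=16):
    """Flatten an H x W x C image into N = (H/P)*(W/P) blocks of length P*P*C."""
    H, W, C = img.shape
    blocks = img.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return blocks.reshape(-1, P * P * C)            # shape (N, P^2 * C)

def sinusoidal_position(n_pos, d_model):
    """omega_pos above: sine on even dimensions, cosine on odd dimensions."""
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def layer_norm(x, a=1.0, b=0.0, eps=1e-6):
    """LN as defined above: normalize by mean and variance, then scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return a * (x - mu) / np.sqrt(var + eps) + b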
In order to improve the data feature extraction ability, each layer adopts a residual structure. The forward calculation is

$$Q = W_Q \cdot C_0, \qquad K = W_K \cdot C_0, \qquad V = W_V \cdot C_0$$

where Q, K, and V denote the query matrix, key matrix, and value matrix, respectively, and $W_Q$, $W_K$, and $W_V$ are learnable. The calculation of the encoder layer is

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q \cdot K^{T}}{\sqrt{d_k}}\right) V$$

The output of the encoder layer includes not only the encoded vector but also the key matrix K and the value matrix V. In the prediction stage, the encoded picture block vector is used as the query matrix Q against the output K and V to obtain the prediction result.
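The scaled dot-product attention above can be written as a short NumPy sketch (our own illustrative code, not the authors' implementation):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: Softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V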
Introducing an attention mechanism into the model can indeed improve the model's performance. Based on classical attention, several varieties of attention have been put forward, among which the most popular is the multi-head attention mechanism (MHA).
The classical self-attention mechanism can be regarded as a particular case of multi-head attention in which there is only one attention head.
Multi-head attention extracts features with multiple self-attention heads in parallel and combines the outputs of all attention heads as the final result. The multi-head attention mechanism does not introduce new parameters but divides the original Q, K, and V into several sub-parts. After splitting, each part is mapped to a different subspace of the high-dimensional space, and the weights are calculated so that the model assigns different attention scores to different regions; the multi-head attention architecture is shown in Figure 5. The calculation process of multi-head attention is as follows:
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \dots, \mathrm{head}_h)\, W^{O}, \qquad \mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)$$

where $W^{O}$, $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are learnable linear transformation matrices. In Figure 5, MatMul denotes matrix multiplication and Scale denotes multiplication by the scaling factor $1/\sqrt{d_k}$.
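A compact NumPy sketch of this multi-head computation follows; the projection matrices are passed in explicitly, and all names are illustrative rather than taken from the paper's code.

import numpy as np

def multi_head(Q, K, V, W_q, W_k, W_v, W_o, h=8):
    """Project Q/K/V, split them into h heads, apply scaled dot-product
    attention per head, then concatenate and project with W_o."""
    def split(x):                          # (T, D) -> (h, T, D/h)
        T, D = x.shape
        return x.reshape(T, h, D // h).transpose(1, 0, 2)
    q, k, v = split(Q @ W_q), split(K @ W_k), split(V @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # per-head softmax
    heads = weights @ v                    # (h, T, D/h)
    T = Q.shape[0]
    return heads.transpose(1, 0, 2).reshape(T, -1) @ W_o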

3.2. Decoder

The decoder's function in this paper is to translate the encoded vector into a LaTeX character sequence. It is composed of several layers with the same structure; the architecture of the decoder is shown in Figure 6. Each layer consists of a masked multi-head self-attention layer, an encoder-decoder (codec) attention layer, and a feed-forward layer. The layers are connected by residual structures, using softmax as the activation function. In training, the input of the decoder is the ground truth; the masked multi-head self-attention layer randomly masks 15% of the characters so that the model can learn the internal structure of LaTeX. In the prediction stage, the global semantic information produced by the encoder and the previous step's prediction output jointly generate the output LaTeX character.
Let the word vector set of the ground truth be $V = \{v_1, \dots, v_N\}$; when the predicted result is $y_i$, the masked multi-head attention layer only calculates attention over $\{v_1, \dots, v_{i-1}\}$. The codec attention layer's input includes two parts: the output of the masked multi-head self-attention layer, and the K and V matrices of the encoder output.
It can be seen that the input of the decoder includes not only the encoder's output key and value vectors but also the ground truth word vectors as the query vectors. The query, key, and value vectors are scaled, and dot-product attention is then calculated to obtain the output vector.
The feedforward neural network layer includes two sub-layers whose input is the output of the attention layer. The output of the feedforward layer is subjected to a linear transformation and a softmax function as the final output of the decoder.
The calculation process is

$$\mathrm{FFN}(Z) = (Z W_1 + b_1) W_2 + b_2$$

where $W_1$, $b_1$, $W_2$, and $b_2$ are all learnable parameters.
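The masking and the feed-forward computation can be sketched in NumPy as follows. The mask shown is the causal mask implied by attending only to $\{v_1, \dots, v_{i-1}\}$, and the FFN follows the paper's equation as written; note that a standard Transformer would insert a ReLU between the two affine maps.

import numpy as np

def causal_mask(T):
    """Mask added to the decoder's self-attention scores: position t may
    only attend to positions <= t (the ground-truth tokens to its left)."""
    return np.triu(np.full((T, T), -np.inf), k=1)

def ffn(Z, W1, b1, W2, b2):
    """Position-wise feed-forward layer, following the equation above."""
    return (Z @ W1 + b1) @ W2 + b2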

3.3. Loss Function

In order to verify the reliability of the model's predictions, loss functions of two granularities can be designed: sequence level and character level. Since a sequence-level loss function cannot cover the initial random strategy, the proposed model adopts a character-level loss function.
The character-level loss is based on maximum likelihood estimation (MLE). The dataset contains mappings from training pictures to LaTeX character sequences $\{\chi^i, Y^i\}_{i=1}^{N}$, where $\chi^i$ denotes a picture, $Y^i = \{y_1, \dots, y_T\}$ denotes a LaTeX sequence, N indicates the size of the dataset, and T denotes the length of a LaTeX sequence.
The purpose of model training is to find the parameters $\theta$ that maximize the probability of the correct characters, that is,

$$\hat{\theta}_{MLE} = \underset{\theta}{\operatorname{argmax}}\; L_{MLE}(\theta)$$

where

$$L_{MLE}(\theta) = \prod_{i=1}^{N} p\!\left(Y^{i} \mid \chi^{i}\right) = \prod_{i=1}^{N} \prod_{t=1}^{T} p\!\left(y_t^{i} \mid y_1^{i}, \dots, y_{t-1}^{i}, \chi^{i}\right)$$

which is equivalent to minimizing the cross-entropy loss function [36]:

$$L_{XENT}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} y_i \cdot \log \hat{y}_i$$
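As a small worked example, the character-level cross-entropy above can be computed as follows (NumPy sketch with illustrative names, where probs holds the decoder's per-step distributions):

import numpy as np

def xent_loss(probs, targets):
    """Mean negative log-likelihood of the ground-truth token at each step.
    probs:   (T, vocab_size) predicted distributions
    targets: (T,) ground-truth token ids"""
    picked = probs[np.arange(len(targets)), targets]
    return -np.mean(np.log(picked + 1e-12))   # small epsilon avoids log(0)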

4. Experiments

4.1. Preprocessed Data

Experimental tests were performed on the im2latex-100k dataset, which was built from various scientific and technical documents using regular expressions. A total of 103,356 real scientific formulas were extracted from more than 60,000 documents. The dataset was divided into 3 parts: the training set (83,883 equations), the test set (9319 equations), and the validation set (10,354 equations). Each formula includes the LaTeX code and a rendered PNG picture, i.e., label pairs mapping LaTeX code to images. The length of each LaTeX sequence ranges from 38 to 997 characters. There is no clear boundary between the extracted LaTeX characters, so it is necessary to insert spaces between characters. Formulas with structural errors cannot be rendered and need to be filtered out.
The data needs to be preprocessed uniformly to improve prediction accuracy and reduce the number of model parameters. For example, a_{b} and (a)_{{b}} render identically, and removing a layer of { } decreases the computation of the model. Another situation is that the multi-character command \psi and the single symbol ψ render identically; however, if the former form is kept, the model has to predict '\', 'p', 's', 'i' character by character, which undoubtedly increases the computation of the model. In this paper, a normalization algorithm removes redundant characters and replaces multi-character symbols with single characters. The original LaTeX can be converted into a symbol tree using a LaTeX parsing library; the classic LaTeX parsing library is KaTeX, which is written in JavaScript. After generating the symbol tree, the character-level LaTeX sequence is obtained by traversing the tree.
The algorithm execution flow is as follows. The input is an unprocessed LaTeX sequence, as in "Before Normalization", and the output is a post-processed LaTeX sequence, as in "After Normalization" in Figure 7. In the algorithm, latex_string denotes the post-processed result and tree denotes the formula tree produced by the KaTeX library. tree[i].type denotes the node type in the formula, which is either structuralCharacter or ordinaryCharacter. A structuralCharacter node is processed recursively; an ordinaryCharacter node is appended to the variable latex_string as part of the final result. The algorithm process is shown in Algorithm 1.
Algorithm 1 LaTeX sequence normalization
Require: RawLaTeX characters
Ensure: Post-processed LaTeX (latex_string)
1: var latex_string = ""
2: var tree = parserToTree(RawLaTeX)
3: function generateLaTeX(formulaTree)
4:    for i : 0 → Length(tree) do
5:       switch (tree[i].type)
6:          case (structuralCharacter)
7:             generateLaTeX(tree[i])
8:          case (ordinaryCharacter)
9:             latex_string += tree[i]
10:   end for
11: end function
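A runnable Python version of Algorithm 1 is sketched below. The node attributes (type, children, text) are assumed names for the parse-tree fields, not the actual KaTeX AST schema.

def normalize_latex(tree):
    """Depth-first traversal of the parse tree: recurse on structural
    nodes, emit ordinary characters, and join them with spaces."""
    latex_string = []

    def generate(nodes):
        for node in nodes:
            if node["type"] == "structuralCharacter":
                generate(node["children"])        # recurse into the sub-formula
            else:                                 # ordinaryCharacter
                latex_string.append(node["text"])

    generate(tree)
    return " ".join(latex_string)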
In this paper, YOLOv4 is used as the target detection model. YOLOv4 first divides the original image into N×N grids and introduces anchors to localize targets more precisely.
The backbone used in YOLOv4 is CSPDarknet53. In order to enhance the feature extraction ability, Spatial Pyramid Pooling (SPP) and a Path Aggregation Network (PANet) are added at the neck. The SPP contains pooling kernels at multiple scales of 1 × 1, 5 × 5, 9 × 9, and 13 × 13 to expand the receptive field of the network. PANet fuses the features of different layers to enrich the input information and obtain more accurate predictions. Suppose the input picture size is H×W, the downsampling stride is s, and the number of output categories is class_num. The output dimension of the vector is D = [H/s, W/s, 3 × (5 + class_num)], where 5 comprises the four coordinate values and one confidence score of the regression box. The YOLOv4 model recognition result is shown in Figure 8.
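The output shape can be checked with a one-line helper; the stride and class count below are illustrative values, not the paper's settings.

def yolo_output_shape(H, W, s, class_num):
    """D = [H/s, W/s, 3 * (5 + class_num)]: three anchors per grid cell,
    each with 4 box coordinates, 1 confidence score, and class scores."""
    return (H // s, W // s, 3 * (5 + class_num))

print(yolo_output_shape(416, 416, 32, 6))   # -> (13, 13, 33)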
There are several kinds of multi-line formulas in the im2latex-100k dataset, such as matrices with square brackets, matrices with parentheses, angle bracket formulas, curly bracket formulas, piecewise functions, and multi-line expressions; examples of these multi-line formulas are given in Table 1. The first step of multi-line formula recognition is to identify the formula type, whose main distinguishing feature is the symbols on both sides of the formula. The YOLO model detects whether the picture contains multi-line characters and their type; the formulas of each line are then separated and later combined into a complete LaTeX sequence. Special characters (<start> and <end>) are added at the beginning and end of each LaTeX sequence to ensure that the model can distinguish the start and end characters.
After adding the YOLO model, formula recognition first detects whether the picture contains multi-line identifiers. If a specific type of multi-line formula is present, the multi-line formula is divided into multiple single lines for recognition; if there is only one row, the VIT model is called directly to predict the result.
The difference between the segmented recognition result and the ground truth is that each member of the ground truth is wrapped in "{}", while the cells of the segmented recognition result are separated by \quad spaces. At the same time, multi-line formulas often contain fixed-format characters such as \begin{array} and \end{array}. Once the multi-line formula has been recognized, the multiple lines must be combined into one complete LaTeX sequence. The multi-line formula segmentation and recognition is shown in Figure 9.
By analyzing the LaTeX strings of multi-line formulas, it is found that they contain fixed structural parts. For example, a matrix usually contains \begin{bmatrix} and \end{bmatrix}. The result string is spliced with the fixed part by the algorithm to obtain the complete recognition result, as sketched below.
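The splicing step can be written as a short Python sketch; the environment name and the per-line strings are illustrative, and the environment is assumed to be chosen from the detected formula type.

def merge_lines(lines, env="bmatrix"):
    """Splice per-line recognition results into one LaTeX sequence,
    wrapping them in the environment implied by the detected type."""
    body = r" \\ ".join(lines)           # one recognized LaTeX string per line
    return rf"\begin{{{env}}} {body} \end{{{env}}}"

# merge_lines(["1 & 0", "0 & 1"]) -> "\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}"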

4.2. Settings

In this experiment, the batch size, initial learning rate, and number of epochs are set to 45, 0.0001, and 1000, respectively. The hardware environment includes an Intel i7-12700H CPU, 60 GB of memory, and a Tesla V100 graphics card with 60 GB of graphics memory. The storage space includes memory and hard disk of 60 GB and 100 GB, respectively. The time to iterate one epoch on the Tesla V100 is about 2 min.

4.3. Measurements

The formula recognition task can be regarded as a particular machine translation task. The input language here is the trajectory point sequence of the formula, and the target language is the LaTeX sequence. Therefore, the evaluation standard can use the same indicators of machine translation.
BLEU (Bilingual Evaluation Understudy) [37] is a text evaluation algorithm often used to evaluate the correspondence between a machine translation and a professional human translation. BLEU's guiding design idea is the degree of similarity between the machine translation and the human translation: the higher the similarity, the higher the score, so BLEU can be used as an evaluation index for machine translation quality. The BLEU n-gram precision is

$$p_n = \frac{\sum_i \sum_k \min\!\left(h_k(c_i),\; \max_{j \in m} h_k(s_{ij})\right)}{\sum_i \sum_k h_k(c_i)}$$

where $h_k(c_i)$ denotes the number of occurrences of the k-th n-gram in the predicted sequence $c_i$ and $h_k(s_{ij})$ denotes its number of occurrences in the j-th reference; the numerator thus counts each n-gram clipped by its maximum count in the standard answers.
The n-gram matching degree may change as the sentence gets shorter, which leads to a problem: the model may accurately predict only some characters of the LaTeX sequence, not all of them, and still achieve a high matching degree. To avoid this bias, BLEU introduces a Brevity Penalty into the final score:

$$BP = \begin{cases} 1, & \text{if } l_c > l_s \\ e^{\,1 - l_s / l_c}, & \text{if } l_c \le l_s \end{cases}$$

where $l_c$ denotes the length of the predicted LaTeX sequence and $l_s$ denotes the effective length of the ground truth; when there are multiple ground truths, the length closest to the predicted sequence is selected. When the length of the predicted sequence is greater than the length of the ground truth, the penalty coefficient is 1, which means there is no penalty; the penalty factor is calculated only when the machine translation is shorter than the ground truth.
Since the accuracy of each n-gram statistic decreases exponentially as the order increases, the geometric average is used to balance the effect of each order's statistic; it is then weighted and multiplied by the length penalty factor. The final evaluation formula is

$$BLEU = BP \times \exp\!\left(\sum_{n=1}^{N} W_n \log p_n\right)$$

where $W_n = 1/N$ and the upper limit N is 4, that is, at most 4-gram accuracy is counted.
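The following self-contained Python sketch mirrors this computation for a single candidate-reference pair (our illustrative code, with simple smoothing for empty n-gram counts):

import math
from collections import Counter

def bleu(candidate, reference, N=4):
    """Clipped n-gram precisions for n = 1..N, geometric mean, times BP."""
    def ngrams(seq, n):
        return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))
    log_p = 0.0
    for n in range(1, N + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = max(1, sum(cand.values()))
        log_p += math.log(max(clipped, 1e-12) / total)   # smooth zero matches
    lc, ls = len(candidate), len(reference)
    bp = 1.0 if lc > ls else math.exp(1 - ls / max(1, lc))
    return bp * math.exp(log_p / N)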
Another important indicator for measuring the effect of a formula recognition model is the minimum edit distance [38] (MED), proposed by the Russian scientist Vladimir Levenshtein in 1965. MED is usually used to calculate the similarity of two character strings. For two character strings $\psi_1$ and $\psi_2$, MED($\psi_1$, $\psi_2$) is defined as the minimum number of single-character edits that transform $\psi_1$ into $\psi_2$. Only three types of single-character edits are considered in this article: insertion, deletion, and substitution. The formula is expressed as

$$\mathrm{lev}_{a,b}(i, j) = \begin{cases} \max(i, j), & \text{if } \min(i, j) = 0 \\ \min \begin{cases} \mathrm{lev}_{a,b}(i-1, j) + 1 \\ \mathrm{lev}_{a,b}(i, j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1, j-1) + \mathbb{1}(a_i \ne b_j) \end{cases}, & \text{otherwise} \end{cases}$$

Here, $\mathrm{lev}_{a,b}(i, j)$ denotes the distance between the first i characters of a and the first j characters of b, and $\mathbb{1}(a_i \ne b_j)$ is 1 when the corresponding characters differ and 0 otherwise. When the length of a or b is 0, the number of edits needed to convert the empty string into the non-empty string is the length of the non-empty string. In the algorithm below, M and N denote the lengths of the input character sequences STR_A and STR_B, respectively, and MATRIX_ED is the array storing the intermediate and final results. The algorithm process is shown in Algorithm 2.
Algorithm 2 Calculate the minimum edit distance
Require: STR_A ≠ ∅, STR_B ≠ ∅
1: M ← Length(STR_A)
2: N ← Length(STR_B)
3: MATRIX_ED ← (M+1) × (N+1) array with MATRIX_ED[i, 0] = i and MATRIX_ED[0, j] = j
4: for i : 1 → M do
5:    for j : 1 → N do
6:       DIST_1 = MATRIX_ED[i−1, j] + 1
7:       DIST_2 = MATRIX_ED[i, j−1] + 1
8:       DIST_3 = MATRIX_ED[i−1, j−1] + (0 if STR_A[i] = STR_B[j], else 1)
9:       MATRIX_ED[i, j] = min(DIST_1, DIST_2, DIST_3)
10:   end for
11: end for
12: return MATRIX_ED[M, N]
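A direct Python implementation of Algorithm 2 is given below for reference:

def edit_distance(a, b):
    """Dynamic-programming Levenshtein distance: the minimum number of
    single-character insertions, deletions, and substitutions."""
    M, N = len(a), len(b)
    d = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(M + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(N + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[M][N]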
In addition to BLEU and MED, Exact Match (EM) was used to calculate the degree of match between the predicted LaTeX sequence and the ground truth. The specific method is to traverse the two sequences, up to the length of the shorter one, and compare whether the characters at each position are equal. For example, the result for a_{b} and a_b} is 0.5. In order to compare with other models, the normalized MED is used in this paper:

$$\overline{MED} = 1 - MED$$
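This position-wise comparison can be expressed as a short function (illustrative code reproducing the example above):

def exact_match(pred, truth):
    """Position-wise match rate over the shorter of the two sequences."""
    n = min(len(pred), len(truth))
    return sum(pred[i] == truth[i] for i in range(n)) / n

print(exact_match("a_{b}", "a_b}"))   # -> 0.5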
The comparison between this model and other models is shown in Table 2. The results show that BLEU is about 2% higher than similar models and MED is on par with similar models, but the EM performance is weaker.
In this paper, the model's generalization is also verified on a dataset of handwritten mathematical formulas. CROHME 2014 is a dataset of 10,846 handwritten formulas, each from a real scene. The handwritten formula recognition results of this model and other models are shown in Table 3; this model still performs well on handwritten mathematical formula recognition. Table 4 shows the overall performance of the model on im2latex-100k when no distinction is made between single-line and multi-line formulas. The experimental results show that, on the handwritten mathematical formula dataset CROHME 2014, BLEU, MED, and EM reach 54.29, 57.80, and 60.20 for single-line formulas and 55.39, 58.20, and 60.22 for multi-line formulas, respectively. On the im2latex-100k dataset, BLEU, MED, and EM reach 90.02, 90.34, and 70.24 for single-line formulas and 71.45, 73.55, and 65.27 for multi-line formulas.
In addition, on the im2latex-100k dataset, BLEU reaches 0.92 and Exact Match reaches 0.62 when no distinction is made between single-line and multi-line formulas.
In order to verify the effect of the model parameters on the results, we tested the model under different parameter settings. The experiments show that the main parameters affecting the model are the batch size and the learning rate. Figure 10 shows the influence of different values of these two parameters on the convergence of the model; the model achieves the best effect when the batch size is 45 and the learning rate is 0.0001.

5. Discussion and Implications

Formula recognition is an exciting research direction. The difficulty is that formula styles are changeable and the number of formulas is infinite: a new formula can be obtained by replacing a variable of an existing one. These characteristics mean that formula recognition cannot be realized by hand-written rules alone.
However, human understanding of formulas generalizes well. Having learned a particular type of formula such as $\frac{a}{b}$, a person can naturally understand $\frac{a+c}{bd}$ and other expressions of the same form. We speculate that this human ability lies in understanding the structure of the formula: replacing some part of the formula leaves its structure, as a human perceives it, unchanged.
In this paper, we abandon the hand-designed features of older methods and adopt deep learning to compress the trajectory points of the formula into high-dimensional semantic vectors, prompting the model to learn the tree-structured nature of the formula. As a result, the model recognizes formulas of the same structure but different forms well. We believe the main reasons for the good performance of the model are, first, that the end-to-end structure alleviates error accumulation and, second, that the attention mechanism keeps the model's current output aligned with the region it is currently attending to.
Our future work will introduce more information into the model to improve its accuracy, such as contextual information and spatial location information. We have noticed that others are already working along these lines, so we intend to improve our model further to increase its accuracy.

6. Conclusions

This paper proposes an end-to-end printed formula recognition method based on the attention mechanism. To address the low accuracy of formula OCR, the model adopts end-to-end training to alleviate error accumulation. The main innovations of this paper are as follows: first, the idea of machine translation is introduced into formula recognition, in particular using a Transformer as the encoder-decoder framework to improve the generalization and accuracy of the model; secondly, this paper proposes, for the first time, identifying the type of a multi-line formula by target detection and then dividing the multi-line formula into several single-line formulas. Compared with other models, the model in this paper achieves better performance. The experimental results show that the model generalizes well and is superior to traditional methods when dealing with formulas with complex structures.

Author Contributions

Conceptualization, M.Z.; methodology, M.Z.; investigation, M.C.; resources, G.L.; writing—original draft preparation, M.C.; writing—review and editing, M.C.; visualization, G.L.; project administration, M.L.; funding acquisition, M.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key R&D Plan of China (2022YFF0608000).

Data Availability Statement

The data that support the findings of this study are openly available in im2latex-100k at https://www.zenodo.org/record/56198#.Y0k85tpBw2w (accessed on 23 June 2022).

Acknowledgments

Throughout the writing of this dissertation I have received a great deal of support and assistance. I would first like to thank my supervisor, M.Z., whose expertise was invaluable in formulating the research questions and methodology. Your insightful feedback pushed me to sharpen my thinking and brought my work to a higher level. I would particularly like to acknowledge my team members, for their wonderful collaboration and patient support.

Conflicts of Interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

References

1. Suzuki, M.; Tamari, F.; Fukuda, R.; Uchida, S.; Kanahori, T. Infty: An integrated OCR system for mathematical documents. In Proceedings of the 2003 ACM Symposium on Document Engineering, Grenoble, France, 20–22 November 2003; pp. 95–104.
2. Ion, P.; Miner, R.; Buswell, S.; Devitt, A. Mathematical Markup Language (MathML) 1.0 Specification; World Wide Web Consortium (W3C): Cambridge, MA, USA, 1998.
3. Jaderberg, M.; Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep structured output learning for unconstrained text recognition. arXiv 2014, arXiv:1412.5903.
4. Vinyals, O.; Toshev, A.; Bengio, S.; Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3156–3164.
5. Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhudinov, R.; Zemel, R.; Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning, Lille, France, 6–11 July 2015; pp. 2048–2057.
6. Cheng, H.; Yu, R.; Tang, Y.; Fang, Y.; Cheng, T. Text Classification Model Enhanced by Unlabeled Data for LaTeX Formula. Appl. Sci. 2021, 11, 10536.
7. Zhong, W.; Yang, J.H.; Lin, J. Evaluating Token-Level and Passage-Level Dense Retrieval Models for Math Information Retrieval. arXiv 2022, arXiv:2203.11163.
8. Shi, B.; Bai, X.; Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 2298–2304.
9. Shi, B.; Yang, M.; Wang, X.; Lyu, P.; Yao, C.; Bai, X. Aster: An attentional scene text recognizer with flexible rectification. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 41, 2035–2048.
10. Luo, C.; Jin, L.; Sun, Z. Moran: A multi-object rectified attention network for scene text recognition. Pattern Recognit. 2019, 90, 109–118.
11. Anderson, R.H. Syntax-directed recognition of hand-printed two-dimensional mathematics. In Symposium on Interactive Systems for Experimental Applied Mathematics; Association for Computing Machinery: New York, NY, USA, 1967; pp. 436–459.
12. Deng, Y.; Kanervisto, A.; Rush, A.M. What you get is what you see: A visual markup decompiler. arXiv 2016, arXiv:1609.04938.
13. Karpathy, A.; Fei-Fei, L. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 3128–3137.
14. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.
15. Luong, M.T.; Pham, H.; Manning, C.D. Effective approaches to attention-based neural machine translation. arXiv 2015, arXiv:1508.04025.
16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30.
17. Okamoto, M.; Imai, H.; Takagi, K. Performance evaluation of a robust method for mathematical expression recognition. In Proceedings of the Sixth International Conference on Document Analysis and Recognition, Seattle, WA, USA, 10–13 September 2001; pp. 121–128.
18. Berman, B.P.; Fateman, R.J. Optical character recognition for typeset mathematics. In Proceedings of the International Symposium on Symbolic and Algebraic Computation, Oxford, UK, 20–22 July 1994; pp. 348–353.
19. Álvaro, F.; Sánchez, J.A. Comparing several techniques for offline recognition of printed mathematical symbols. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 1953–1956.
20. Zanibbi, R.; Blostein, D.; Cordy, J.R. Recognizing mathematical expressions using tree transformation. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 1455–1467.
21. Lee, H.J.; Wang, J.S. Design of a mathematical expression recognition system. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 2, p. 1084.
22. Twaakyondo, H.M.; Okamoto, M. Structure analysis and recognition of mathematical expressions. In Proceedings of the 3rd International Conference on Document Analysis and Recognition, Montreal, QC, Canada, 14–16 August 1995; Volume 1, pp. 430–437.
23. Suzuki, M.; Terada, Y.; Kanahori, T.; Yamaguchi, K. New Tools to Convert PDF Math Contents into Accessible e-Books Efficiently. In Assistive Technology; IOS Press: Washington, DC, USA, 2015; pp. 1060–1064.
24. Gao, L.; Yi, X.; Liao, Y.; Jiang, Z.; Yan, Z.; Tang, Z. A deep learning-based formula detection method for PDF documents. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 553–558.
25. Wu, J.W.; Yin, F.; Zhang, Y.M.; Zhang, X.Y.; Liu, C.L. Image-to-markup generation via paired adversarial learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases; Springer: Berlin/Heidelberg, Germany, 2018; pp. 18–34.
26. Deng, Y.; Kanervisto, A.; Ling, J.; Rush, A.M. Image-to-markup generation with coarse-to-fine attention. In Proceedings of the International Conference on Machine Learning, Sydney, NSW, Australia, 6–11 August 2017; pp. 980–989.
27. Zhang, J.; Du, J.; Dai, L. A GRU-based encoder-decoder approach with attention for online handwritten mathematical expression recognition. In Proceedings of the 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), Kyoto, Japan, 9–15 November 2017; Volume 1, pp. 902–907.
28. Zhang, J.; Du, J.; Dai, L. Track, attend, and parse (TAP): An end-to-end framework for online handwritten mathematical expression recognition. IEEE Trans. Multimed. 2018, 21, 221–233.
29. Wang, J.; Sun, Y.; Wang, S. Image to LaTeX with DenseNet encoder and joint attention. Procedia Comput. Sci. 2019, 147, 374–380.
30. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–27 July 2017; pp. 4700–4708.
31. Zhang, W.; Bai, Z.; Zhu, Y. An improved approach based on CNN-RNNs for mathematical expression recognition. In Proceedings of the 2019 4th International Conference on Multimedia Systems and Signal Processing, Guangzhou, China, 10–12 May 2019; pp. 57–61.
32. Peng, S.; Yuan, K.; Gao, L.; Tang, Z. MathBERT: A pre-trained model for mathematical formula understanding. arXiv 2021, arXiv:2105.00377.
33. Wu, J.W.; Yin, F.; Zhang, Y.M.; Zhang, X.Y.; Liu, C.L. Graph-to-graph: Towards accurate and interpretable online handwritten mathematical expression recognition. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 2925–2933.
34. Wang, Z.; Liu, J.C. Translating math formula images to LaTeX sequences using deep neural networks with sequence-level training. Int. J. Doc. Anal. Recognit. 2021, 24, 63–75.
35. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
36. Rubinstein, R. The cross-entropy method for combinatorial and continuous optimization. Methodol. Comput. Appl. Probab. 1999, 1, 127–190.
37. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318.
38. Chowdhury, S.D.; Bhattacharya, U.; Parui, S.K. Online handwriting recognition using Levenshtein distance metric. In Proceedings of the 2013 12th International Conference on Document Analysis and Recognition, Washington, DC, USA, 25–28 August 2013; pp. 79–83.
Figure 1. An example of LaTeX sequence generation from an image.
Figure 2. Overall architecture of this model.
Figure 3. Architecture of the encoder.
Figure 4. Positional embedding; (A) represents the flattened image blocks; (B) represents the position vector.
Figure 5. Scaled dot product attention and multi-head attention.
Figure 6. Decoder architecture diagram.
Figure 7. LaTeX sequence normalized output.
Figure 8. YOLO detection results.
Figure 9. Segmentation and identification results.
Figure 10. Parameter change curves.
Table 1. Multi-line formula table.

Formula Type | Example (rendered formula)
Matrix (square brackets) | $\begin{bmatrix} 1 & 1 & 0 \\ 2 & 3 & 3 \\ 3 & 3 & 0 \end{bmatrix}$
Matrix (parentheses) | $y = \begin{pmatrix} 0 & q_1 & 0 & 0 \\ 0 & 0 & q_2 & 0 \\ 0 & 0 & 0 & q_3 \\ q_4 & 0 & 0 & 0 \end{pmatrix}$
Angle bracket formula | $P_{n+1,m+1} = \langle x_0, x_1, \dots, x_m \mid x_0, x_1, \dots, x_n \rangle$
Curly bracket formula | $N_{\tilde{c}} = \begin{cases} \tilde{n}_1 + 1, & 1 \le i \le \tilde{n}_1 \\ c_i, & i > \tilde{n}_1 \end{cases}$
Piecewise function | $P_{min}(z) = \begin{cases} \frac{2 + z^4 e^{-z^2} - z^2}{8}, & \text{if } \nu = 0 \\ \frac{z^4 e^{-z^2}}{8}, & \text{if } \nu = 1 \end{cases}$
Multi-line expression | $a_1 = (0.77, 1.95),\; a_2 = (3.25, 1.95);\; b_1 = (1.32, 1.95),\; b_2 = (2.70, 1.95)$
Table 2. Comparison between the proposed model and other similar models on im2latex-100k.

Line | Model | BLEU | MED | Exact Match
Single Line | INFTY | 56.9 | 56.70 | 56.8
Single Line | WYGIWYS | 58.71 | 63.6 | 61.0
Single Line | DoubleAttention | 59.5 | 67.2 | 63.1
Single Line | DenseNet | 71.34 | 59.6 | -
Single Line | MI2LS | 73.53 | 78.33 | 63.8
Single Line | MathBERT | 86.0 | 81.61 | 73.77
Single Line | Our Model | 90.02 | 90.34 | 70.24
Multi-Line | INFTY | 45.45 | 50.32 | 15.70
Multi-Line | WYGIWYS | 53.77 | 57.51 | 45.31
Multi-Line | DoubleAttention | 58.17 | 54.32 | 32.12
Multi-Line | DenseNet | 61.41 | 63.34 | -
Multi-Line | MI2LS | 67.15 | 65.60 | 70.57
Multi-Line | MathBERT | 69.32 | 71.37 | 74.93
Multi-Line | Our Model | 71.45 | 73.55 | 65.27
Table 3. The effect of our model on the handwritten formula dataset (CROHME 2014). It can be seen from the table that our model outperforms other similar models.

Line | Model | BLEU | MED | Exact Match
Single Line | INFTY | 36.41 | 37.50 | 27.25
Single Line | WYGIWYS | 35.19 | 40.80 | 32.84
Single Line | DoubleAttention | 40.40 | 43.94 | 37.51
Single Line | DenseNet | 39.66 | 42.51 | -
Single Line | MI2LS | 43.00 | 46.78 | 32.09
Single Line | MathBERT | 50.41 | 47.94 | 53.6
Single Line | G2G | 54.46 | 52.05 | 55.28
Single Line | Our Model | 54.29 | 57.80 | 60.20
Multi-Line | INFTY | 46.15 | 32.0 | 15.27
Multi-Line | WYGIWYS | 47.46 | 42.45 | 45.46
Multi-Line | DoubleAttention | 49.49 | 51.3 | 47.68
Multi-Line | DenseNet | 52.13 | 55.72 | -
Multi-Line | MI2LS | 53.65 | 52.21 | 48.3
Multi-Line | MathBERT | 54.65 | 56.71 | 57.22
Multi-Line | G2G | 54.90 | 57.81 | 55.28
Multi-Line | Our Model | 55.39 | 58.20 | 60.22
Table 4. The overall performance of the model on im2latex-100k when no distinction is made between single-line and multi-line formulas.

Model | BLEU | MED | Exact Match
INFTY | 66.65 | 53.82 | 15.60
WYGIWYS | 87.73 | 87.60 | 77.46
DoubleAttention | 88.42 | 88.57 | 79.81
DenseNet | 88.25 | 91.57 | -
MI2LS | 90.28 | 91.90 | 82.33
MathBERT | 90.45 | 90.11 | 87.52
Our Model | 92.11 | 90.0 | 60.2
