### *3.3. Gated Recurrent Units (GRUs)*

The GRU, first introduced in 2014, is a gating mechanism for RNNs. Although more recent than the LSTM, it can be viewed as a simplified version of it: the GRU uses a single update gate to control both how much of the old state is forgotten and how the cell state is updated. The candidate activation $\tilde{h}_t$ is computed from the current input and the previous cell state as follows:

$$
\tilde{h}_t = \tanh\left(W_{xh} x_t + W_{hh}\left(r_t \odot h_{t-1}\right)\right), \tag{8}
$$

where $r_t$ is the reset gate, which has the same functional form as the update gate, with weight matrices of the same size. The reset gate is multiplied elementwise by the previous hidden state and controls how much of that state is used when computing the candidate activation; it can effectively reset the hidden value. The reset gate is computed from the previous hidden state $h_{t-1}$ and the current input $x_t$:

$$r_t = \sigma\left(W_{xr} x_t + W_{hr} h_{t-1}\right) \tag{9}$$

The current cell state, or activation, is a linear interpolation between the previous activation and the candidate activation:

$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t, \tag{10}$$

where $z_t$ is the update gate that balances the contributions of the previous hidden value and the new candidate to obtain the new hidden value:

$$z_t = \sigma\left(W_{hz} h_{t-1} + W_{xz} x_t\right) \tag{11}$$
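To make the data flow of Equations (8)–(11) concrete, the following NumPy sketch computes a single GRU time step. This is our illustration rather than the paper's code: biases are omitted, and the weight names (e.g., `W_xr`) and toy dimensions are our own.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, p):
    """One GRU time step following Equations (8)-(11); biases omitted."""
    # Reset gate (Eq. 9): controls how much of h_{t-1} is exposed.
    r_t = sigmoid(p["W_xr"] @ x_t + p["W_hr"] @ h_prev)
    # Update gate (Eq. 11): balances the old state against the candidate.
    z_t = sigmoid(p["W_xz"] @ x_t + p["W_hz"] @ h_prev)
    # Candidate activation (Eq. 8), computed on the reset-gated state.
    h_tilde = np.tanh(p["W_xh"] @ x_t + p["W_hh"] @ (r_t * h_prev))
    # New hidden state (Eq. 10): interpolation between old and candidate.
    return (1.0 - z_t) * h_prev + z_t * h_tilde

# Toy usage with 4 input features and 3 hidden units.
rng = np.random.default_rng(0)
p = {k: rng.normal(scale=0.1, size=(3, 4) if "x" in k else (3, 3))
     for k in ("W_xr", "W_hr", "W_xz", "W_hz", "W_xh", "W_hh")}
h = gru_step(rng.normal(size=4), np.zeros(3), p)
```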

### *3.4. Proposed Model*

In this section, we discuss our bidirectional model for recognizing NEs. Instead of a feedforward network, a Bi-LSTM/GRU network is adopted to capture the word sequence in both directions and improve prediction.

Problem Definition: Given a document D containing a sequence of sentences ($s_1, s_2, \ldots, s_m$), for an input sentence $X = (x_1, x_2, \ldots, x_n)$, the output is an $n \times k$ matrix $Z$, where $k$ is the number of entity labels. $Z_{ij}$ is the score of assigning tag $j$ to the $i$-th token $x_i$. $Y = (y_1, y_2, \ldots, y_n)$ is the predicted tag sequence.

First, the word and character embeddings are obtained for the input sentence. Then, an embedding attention layer combines the two features to obtain the best word representation. The embedded feature representation is fed into the encoder layer for processing. Finally, the result is obtained from the output layer. $x_i$ refers to the input word, whereas $y_i$ is the predicted tag for the $i$-th token in the sentence, where $1 \le i \le n$. The embedding layer maps tokens to dense, fixed-dimensional vector representations. The model is explained in detail in the following sections. Figure 1 shows the B-RNN setup with the LSTM/GRU network, and the main components are described below.

**Figure 1.** Main architecture of the network. Word embedding is provided to a Bi-LSTM/GRU. The forward unit represents the word Xt and its left context, whereas the backward unit represents the word Xt and its right context. Concatenating the two vectors yields a representation of the word Xt and its context, and this embedded representation is fed to the classification layer.

#### *A. Embedding Layer*

The word distributed representation introduced by [21] has replaced the traditional bag-of-words encoding and achieves excellent results on many NLP tasks. Distributed embeddings help the model generalize because each word is mapped to a point in a continuous vector space, so semantically similar words can have similar vector representations. However, using word embedding alone as the smallest unit of feature representation can lose fine-grained information. For languages with rich morphology, such as Arabic, we need to capture all morphological and orthographic information. Whereas word embedding encodes semantic and syntactic relationships between words, character embedding carries important morphological and shape information. Inspired by this integration, as in [22], we acquire sequence representations at both the character and word levels.

• Character Embedding Layer

Character sequence representations are helpful for morphologically rich languages and for dealing with the OOV problem in tasks such as POS tagging and language modeling [23] or dependency parsing [24]. The authors in [25] proposed CharCNN, a character-aware neural language model that learns character-level word representations using CNNs. We follow the same technique for generating the character embedding representation; implementation details can be found in [25]. A minimal sketch is given below.
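As a rough sketch of this idea (not the exact CharCNN architecture of [25]; all sizes below are hypothetical), a character-level word encoder in Keras can embed the characters of a word, convolve over them, and max-pool into a single vector per word:

```python
from tensorflow.keras import layers, models

MAX_WORD_LEN = 20   # characters per word (padded/truncated); illustrative
CHAR_VOCAB   = 100  # number of distinct characters; illustrative
CHAR_DIM     = 25   # character embedding dimension; illustrative
N_FILTERS    = 30   # CNN filters = size of the char-level word vector

char_in = layers.Input(shape=(MAX_WORD_LEN,), dtype="int32")
c = layers.Embedding(CHAR_VOCAB, CHAR_DIM)(char_in)   # (word_len, char_dim)
c = layers.Conv1D(N_FILTERS, kernel_size=3, padding="same",
                  activation="tanh")(c)               # character n-gram features
c = layers.GlobalMaxPooling1D()(c)                    # one vector per word
char_word_encoder = models.Model(char_in, c)
# Applied to every word of a sentence via layers.TimeDistributed(...).
```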

• Word Embedding

Word embedding represents words as vectors in a continuous space, capturing many syntactic and semantic relations among them. We treat the embeddings as fixed constants because this performs better than treating them as learnable model parameters [26]. We adopt the pre-trained word embedding AraVec 2.0 [27] to obtain the fixed word embedding of each token.
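In Keras, keeping pre-trained vectors fixed amounts to initializing an `Embedding` layer from the AraVec matrix and freezing it. The sketch below is illustrative; the vocabulary size and the random stand-in matrix are placeholders for the real AraVec 2.0 lookup:

```python
import numpy as np
from tensorflow.keras import layers, initializers

VOCAB_SIZE, EMB_DIM = 50000, 100  # vocabulary size illustrative; dim 100 per Section 4.3

# In practice each row would hold the AraVec 2.0 vector of one token;
# a random matrix stands in here.
embedding_matrix = np.random.normal(size=(VOCAB_SIZE, EMB_DIM))

word_embedding = layers.Embedding(
    VOCAB_SIZE, EMB_DIM,
    embeddings_initializer=initializers.Constant(embedding_matrix),
    trainable=False)  # fixed constants, following [26]
```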

#### *B. Embedding Attention Layer*

Word embedding treats the word as the smallest unit and disregards any morphological resemblance between different words, which leads to the OOV problem. By contrast, character embedding operates over the individual characters of each word and can therefore handle OOV words. However, research on character embedding is still at an early stage, and systems that work solely on characters are not superior to word-based systems on most tasks [28]. Character and word embeddings can therefore be integrated to exploit the advantages of both. Rather than simply concatenating them, we adopt an embedding attention layer that works as a gate mechanism, learns similar representations, and allows the model to determine how to consolidate the information for each word. After obtaining the character feature of each word, we compute the attention vector through a two-layer perceptron and combine the two levels of features by a weighted sum as follows:

$$z = \sigma\left(U_a \tanh\left(V_a x + W_a m\right)\right), \tag{12}$$

$$\hat{x} = z \odot x + (1 - z) \odot m, \tag{13}$$

where $U_a$, $V_a$, and $W_a$ are the weight matrices for computing the attention vector $z$, and $\sigma(\cdot)$ is the logistic sigmoid function with values between 0 and 1. $x$ and $m$ are the word- and character-level sequence representations, respectively. The vector $z$ has the same dimensions as $x$ and $m$ and acts as the weight between the two vectors. Accordingly, the model can dynamically decide how much information to use from each embedding (character or word).
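A minimal NumPy sketch of Equations (12) and (13) follows; the dimensions and the attention size are our own illustrative choices:

```python
import numpy as np

def embedding_attention(x, m, Va, Wa, Ua):
    """Gate the word vector x against the character vector m (Eqs. 12-13)."""
    z = 1.0 / (1.0 + np.exp(-(Ua @ np.tanh(Va @ x + Wa @ m))))  # Eq. (12)
    return z * x + (1.0 - z) * m                                # Eq. (13)

# Toy usage: embedding dim d = 100, attention size d_a = 50 (our choices).
rng = np.random.default_rng(1)
x, m = rng.normal(size=100), rng.normal(size=100)
Va, Wa = rng.normal(size=(50, 100)), rng.normal(size=(50, 100))
Ua = rng.normal(size=(100, 50))                # maps back to a gate of dim 100
x_hat = embedding_attention(x, m, Va, Wa, Ua)
```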

#### *C. Bidirectional Recurrent Neural Networks (B-RNNs)*

Despite its simplicity, the B-RNN is a powerful way to improve a neural network's ability to learn, especially for NLP problems. The need for a B-RNN is best seen in NLP tasks such as NER, where we try to extract NEs such as person, location, organization, date, and time.

We consider the three sentences below:

**سبأ مملكة عربية يمنية قديمة** "Saba is an ancient Yemeni kingdom."

**سبأ المتحدة للصناعات الكيماوية المحدودة** "Saba United Chemical Industries Ltd."

**سبأ بنت مبخوت الشهراني زعيم أكبر القبائل العربية** "Saba, the daughter of Mbkhout Shahrani, leader of the largest Arab tribes."

The word **"- سبأSaba"** in the three sentences will be tagged with something else in each sentence. This word represents a location in the first sentence, a company name in the second sentence, and a proper noun in the third sentence. The problem is that the word **"سبأ "**appears at the beginning of each sentence. This condition indicates that the RNN network will not detect any other word in the sentence before it has the opportunity to make a prediction on the word **"سبأ"**, resulting in a possible incorrect prediction. B-RNN solves this problem by traversing the sequence in both directions. The backward RNN will calculate <sup>←</sup> *hT* in reverse direction, starting from hT and then going backward until h1 for ease of prediction of the right label for the given NE. The idea of going backward helps accurately mark the NEs for the word **"سبأ "**that appears at the beginning of the sentence.

The bidirectional unit contains two LSTM/GRU chains: one propagating in the forward direction and the other in the backward direction. We concatenate the outputs of the two chains to form a joint representation of the word and its context:

$$h_t = \left[\overrightarrow{h}_T; \overleftarrow{h}_1\right], \tag{14}$$

where $\overrightarrow{h}_T$ is the final hidden state of the forward chain and $\overleftarrow{h}_1$ is the final hidden state of the backward chain.
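In Keras terms, the forward/backward pair and the concatenation of Equation (14) can be expressed with the `Bidirectional` wrapper. This is our own sketch; `return_sequences=True` gives a contextual vector for every token, as the tagging task requires:

```python
from tensorflow.keras import layers

def bidirectional_encoder(embedded, cell="lstm", units=200):
    """Encode an embedded sentence (batch, seq_len, emb_dim) in both directions."""
    rnn = layers.LSTM if cell == "lstm" else layers.GRU
    return layers.Bidirectional(
        rnn(units, return_sequences=True),  # one output per token
        merge_mode="concat")(embedded)      # [forward; backward] -> 2*units dims
```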

#### *D. Prediction Layer*

The joined vectors from the Bi-LSTM/GRU network are fed into a classification layer built from a feedforward neuron. A SoftMax function then normalizes the output, as given by the following equation:

$$y_t = \mathrm{softmax}\left(h_t\right). \tag{15}$$

For each tag type j, the probability of similar outputs can be calculated as follows:

$$P(l_t = j \mid \mu_t) = \frac{\exp\left(\mu_t W_j\right)}{\sum_{k=1}^{K} \exp\left(\mu_t W_k\right)}, \tag{16}$$

where $l_t$ is the tag (label) and $\mu_t$ is the concatenated vector at time step $t$. The highest-scoring tag at each word position is chosen. The entire network is trained via backpropagation, and the embedding vectors are updated through the back-propagated errors as well.
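A hedged Keras sketch of this layer: the same dense + Softmax classifier of Equations (15) and (16) is applied at every token position via `TimeDistributed`. `N_TAGS = 8` follows Section 4.1; everything else is illustrative:

```python
from tensorflow.keras import layers

N_TAGS = 8  # the eight labels described in Section 4.1

def prediction_layer(encoded):
    """Map each concatenated Bi-LSTM/GRU vector to a distribution over tags."""
    return layers.TimeDistributed(
        layers.Dense(N_TAGS, activation="softmax"))(encoded)
```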

#### **4. Experiment**

A series of extensive experiments was conducted to validate the methodology. The datasets used and the experimental setup are explained thoroughly below.

#### *4.1. Datasets*

To train and test our ANER system, we evaluated it on "ANERCorp," a dataset created by Benajiba from several online resources [29]. ANERCorp is a manually annotated corpus that is freely available for research purposes, split into a training corpus and a test corpus. A single annotator labeled the corpus to guarantee annotation consistency; it contains 4901 sentences with 150,286 tokens, with one token per line for easy parsing. Each word is tagged as a person, location, organization, or miscellaneous/other entity. ANERCorp is annotated with eight classes: B-PERS, beginning of a person's name; I-PERS, inside of a person's name; B-LOC, beginning of a location's name; B-ORG, beginning of an organization's name; I-ORG, inside of an organization's name; B-MISC, beginning of a miscellaneous entity; I-MISC, inside of a miscellaneous entity; and O, a word that is not an NE. The entity distribution is as follows: 39% person, 30.4% location, 20.6% organization, and the remaining 10% miscellaneous.

#### *4.2. Baseline*

Many approaches tackle the ANER problem. We selected several previous works and compared them with ours using the same dataset and evaluation metrics. The following works were selected as baselines:


#### *4.3. Setting*

An NVIDIA GeForce GTX 1080 Ti GPU (12 GB) and an Intel i7-6800K 3.4 GHz ×12 processor with 32 GB RAM were used to train the model. The system was built on Ubuntu and implemented in the Keras environment. For each token, the model was trained to predict one of the eight labels described in Section 4.1. The embedding dimension was fixed at 100, and the hidden state size was set to 200; the combination of the forward and backward LSTM therefore yields a 400-dimensional representation. Tanh was used as the hidden activation function, and its output was fed into a Softmax output layer to produce probabilities for each of the eight tags. Categorical cross-entropy was used as the objective function, and an L2-regularization component was added to the cost function for output tuning. To counter over-fitting, 50% dropout was applied to the inputs of the LSTM network and the Softmax layer. AdaGrad was used to optimize the network cost, the batch size was set to 128, and each network was trained for 30 epochs. We set the maximum sequence length to 100 so that all sequences have the same length: longer sequences are truncated, and shorter ones are padded with zeros. A sketch of this configuration appears below.
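Putting the stated hyperparameters together, a minimal Keras reconstruction of this setup might look as follows. It is a sketch under assumptions: the vocabulary size and the L2 strength are not reported, the embeddings are a plain lookup here, and the embedding attention layer is omitted for brevity:

```python
from tensorflow import keras
from tensorflow.keras import layers

MAX_LEN, EMB_DIM, HIDDEN, N_TAGS = 100, 100, 200, 8
VOCAB = 50000  # not reported in the paper; illustrative

inp = keras.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB, EMB_DIM)(inp)       # fixed AraVec vectors in the paper
x = layers.Dropout(0.5)(x)                      # 50% dropout on the LSTM inputs
x = layers.Bidirectional(
    layers.LSTM(HIDDEN, return_sequences=True,
                activation="tanh"))(x)          # 200 + 200 = 400 dimensions
x = layers.Dropout(0.5)(x)                      # 50% dropout before the Softmax layer
out = layers.TimeDistributed(
    layers.Dense(N_TAGS, activation="softmax",
                 kernel_regularizer=keras.regularizers.l2(1e-4)))(x)  # strength assumed

model = keras.Model(inp, out)
model.compile(optimizer=keras.optimizers.Adagrad(),
              loss="categorical_crossentropy")
# model.fit(X_train, Y_train, batch_size=128, epochs=30)
```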
