Article

Named Entity Recognition Using Conditional Random Fields

1
Department of Computer Science, University of Science and Technology, Bannu 28100, Pakistan
2
Department of Computer Science and Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
3
Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
4
Department of Data Science, University of the Punjab, Lahore 54000, Pakistan
*
Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(13), 6391; https://doi.org/10.3390/app12136391
Submission received: 7 May 2022 / Revised: 14 June 2022 / Accepted: 20 June 2022 / Published: 23 June 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts has been made at NER in Western and Asian languages. However, little effort has been made to develop techniques for the Urdu language, which is a prominent South Asian language with hundreds of millions of speakers across the globe. NER in Urdu is considered a hard problem owing to several reasons, including the paucity of large, annotated datasets; the lack of an accurate tokenizer; and the absence of capitalization in the Urdu language. To this end, this study proposed a conditional-random-field-based technique with both language-dependent and language-independent features, such as part-of-speech tags and context windows of words, respectively. As a second contribution, we developed an Urdu NER dataset (UNER-I) in which a large number of NE types were manually annotated. To evaluate the effectiveness of the proposed approach, as well as the usefulness of the dataset, experiments were performed using the dataset we developed and an existing dataset. The results showed that our proposed technique outperformed the baseline technique on both datasets, improving the F1 scores by 1.5% to 3%. Furthermore, the results demonstrated that the enhanced dataset was useful for learning and prediction in a supervised learning approach.

1. Introduction

Named entity recognition (NER) is also referred to as entity identification, entity chunking, and entity extraction. NER aims to identify all proper nouns from a given text and classify them into predefined categories, such as persons, locations, organizations, expressions of time, quantities, and monetary values. It was abundantly established that NER plays a pivotal role in several NLP tasks, including information extraction, co-reference resolution, relation extraction, question answering, and machine translation [1,2].
NER approaches have been reported and actively implemented for years [3]. The first NER challenge was offered during the 6th Message Understanding Conference in 1996 [4]. NER frameworks for English, as well as for other well-resourced languages, have been extensively established since that time. However, owing to the diversity and structural uniqueness of the Urdu language, Urdu NER (UNER) is still in its early stages of development. In morphologically rich languages, the number of words derived from a single root word is often considerable. Furthermore, in comparison with NER for other languages, the research in this field is substantially smaller, and the available resources are minimal [5,6]. UNER researchers have primarily used three methodologies. The first is rule-based and is founded on hand-constructed grammar rules [7]. The second is learning-based, which requires tagged samples and various features in order to perform well [8]. The third is a combination of learning-based and rule-based methods [9]. To achieve good results, such methods require language-specific expertise for the rule-based component and significant feature engineering for the learning-based component.
NER systems received the attention of researchers during the Message Understanding Conferences (MUCs) [4]. Subsequently, numerous NER methodologies and systems were developed [1,10]. A large majority of these systems were developed for Western languages, especially English, and are fairly accurate [11]. Furthermore, several frameworks were developed for other languages, such as Arabic and Persian, as well as for South Asian languages, such as Hindi and Bengali [12,13,14]. However, NER systems for the Urdu language are still in their infancy [15].
Typically, NER schemes rely on the manipulation of a corpus of extrinsic linguistic resources, e.g., annotated corpora, human-made dictionaries, and gazetteers, to enhance the accuracy of the system [16]. However, the Urdu language is deficient in linguistic resources. Furthermore, certain aspects of the Urdu language complicate NER tasks. For example, capitalization is a prominent feature that is used by NER systems for the Western languages, whereas there is no concept of capitalization in the Urdu language. Therefore, it is desired that novel features are developed that can be used for Urdu NER tasks.
To this end, this study made the following key contributions:
  • The provision of a named entity annotated dataset consisting of 2161 news sentences with 5283 entities. The news articles were obtained from the BBC Urdu website (https://www.bbc.com/urdu (accessed on 15 February 2022)), which is a valuable resource of Urdu text in digital format. The newly developed Urdu NER dataset was named the UNER-I dataset.
  • We proposed a conditional random field (CRF)-based approach for NER in Urdu. We also proposed the notion of a feature template and a novel set of features and feature functions. This includes language-dependent and language-independent features, such as part-of-speech tags and the context of words, respectively, to address the NER problem in Urdu.
  • Experimentation using the UNER-I dataset and its counterpart to demonstrate the usefulness of the UNER-I dataset, as well as the effectiveness of the proposed approach.
Roadmap: The structure of this paper is as follows: Section 2 presents an overview of the existing studies on Urdu NER. Section 3 focuses on the Urdu language’s key characteristics, as well as the challenges it poses for NER. An overview of our proposed approach is presented in Section 4. Section 5 details the dataset that we created; our dataset’s specification is also compared with that of its contemporaries. The details of the experiments performed to evaluate the effectiveness of the proposed approach are presented in Section 6. The results of the experiments are discussed in Section 7. Finally, Section 8 presents the conclusions drawn from our analysis and study, as well as some future lines of research.

2. Related Work

The NER research for English has a long history that dates back to the early 1990s [4]. Since then, several studies have been conducted to address the NER problem, which ranges from rule-based techniques [7,17] to purely supervised techniques [18]. Moreover, attempts were made to use hybrid approaches for NER in texts [1,19].
It was observed that NER techniques developed for particular domains were typically not equally effective for other domains [20]. Similarly, the techniques developed for NER in one language may not be equally effective for other languages. For instance, an NER system developed for English is not usable for Hindi or Chinese. Most NER research has been conducted for the Western languages, whereas little attention has been paid to Urdu NER. An overview of the three types of NER techniques discussed in the preceding paragraph is as follows.
The rule-based approaches, such as [7,13], use a set of rules or patterns (i.e., grammars) that are designed using linguistic knowledge. In this type of approach, different rules are designed linguistically for each class of named entity. These rules are then executed on the given text. Whenever the system finds some text for recognition, it first searches for the named entity and then compares it with the applicable rules. Once a rule is matched, the system fetches the classification and gives the required output. A promising aspect of systems formulated using the linguistic-rule-based approach is their improved accuracy [7,13,21]. However, rule-based methods generally lack robustness and maintainability [17]. This is due to several reasons: (a) they need to be constantly updated with new rules, such as those reflecting modifications in the corresponding domain; (b) writing rules for a specific task necessarily requires expertise in the corresponding language, along with experience in and knowledge of rule synthesis; (c) rules established for one language domain typically do not port to other domains; and (d) rule-based techniques take much longer to develop than other techniques.
Currently, the most dominant techniques for tackling the NER problem are machine learning (ML) approaches [22]. The basic idea behind supervised models is their capability to automatically learn rules from training data. The recent trend of analyzing large datasets with supervised learning approaches is particularly reflective of the general trend toward machine learning methodologies [23,24]. Supervised systems infer rules through pre-labeled input, referred to as training data, and conform to estimation methods, non-parametric, or kernel-based learning algorithms, as well as logic-based algorithms [23,25]. NER frameworks based on machine learning are often more customizable than rule-based techniques [26]. An ML technique could adjust to new contexts with little cost provided that training data is freely available [18]. If large, annotated corpora are not available, then semi-supervised machine learning approaches are more effective. Semi-supervised approaches are frequently built on a learning technique that is bootstrapped [27,28]. In this scheme, an initial seed model is first built by using a modest set of tokens with preset categories. This seed model is then used to classify all the tokens in the text. Classifications with high confidence are fed back into the training data in an iterative manner.
Hybrid methods, as the name suggests, are usually a hybrid of rule-based and machine-learning systems [29], with a trade-off of strengths and weaknesses of both approaches. In the literature, the hybrid approach is followed for NER to overcome the challenges presented by the two earlier approaches once applied on an individual basis, and as a consequence, accomplish finer functioning [24]. To the best of our knowledge, no study has been conducted that employs a purely hybrid approach for the Urdu NER task.
As a baseline model for this study, we used the works of Mukund, Srihari, and Peterson [15]. The authors of this baseline study first used CRF to tackle the Urdu NER task and then experimentally showed that the results could be improved by using hand-crafted rules, with improved F1 scores ranging from 68.89 percent to 74.67 percent for the worst test set and from 69.21 percent to 71.3 percent for the best.

3. Characteristics of Urdu Language

Urdu is a prominent South Asian language with several unique characteristics that play a notable role in the development of Urdu NER systems. For instance, Urdu is based on Arabic script, which flows from right to left, whereas Western languages flow from left to right. The differences between three prominent Western languages and the Urdu language are illustrated in Table 1 using an example sentence.
Another key characteristic of the Urdu language that makes it different from the Western languages is its high contextual sensitivity. For instance, in the Urdu language, a letter can be written in more than one shape, where the shape is determined by two factors: the position of the letter in the text and the surrounding letters. This indicates that a letter can take on a single form if it appears at the beginning of a word and another shape if it appears in the middle of a word. Similarly, a letter can have another shape if it appears at the end of a word. For further understanding, consider the example sentence presented in Table 2.
The table contains an example English sentence, its Urdu translation, and the Urdu letters used in the sentence in their original form. It can be observed from the table that although the sentence is composed of merely three words, the shapes of several letters (letters 2, 8, 9, 10, 12, and 14) are changed due to their position.
Furthermore, Urdu is highly inflected because it draws terms from Arabic and Persian, as well as Turkish and English [21,30,31,32,33,34]. As a result of its morphology, Urdu blends vocabulary from several other spoken languages with its own [21,35].
Besides the Urdu language characteristics discussed above, there are multiple other reasons that Urdu language processing (ULP) in general and NER in Urdu in particular require the special attention of the research community. For instance, Urdu is a free word order language, meaning that a single sentence can be structured in more than one way without any change in the meaning of the sentence. Therefore, in contrast to the Western languages, the position of a word in an Urdu sentence may not play a significant role in recognizing named entities [7]. Additionally, Urdu does not have capital letters, which is typically a key indicator of NE in the Western languages. Nevertheless, Urdu is widely acknowledged as a low-resource language with a scarcity of publicly available NER corpora, which has thwarted the advancement of supervised learning techniques for Urdu NER.
To summarize, the main characteristics that make the Urdu NER task more challenging and complex are [7]: the absence of capitalization, its free word order nature, borrowed vocabulary, affixes, nested entities, and a lack of standard linguistic resources. Therefore, the current study aimed to develop techniques and computational resources for named entity recognition in Urdu.

4. Urdu NER Using Conditional Random Fields

This study proposed a CRF-based approach for the NER problem in Urdu. The overall architecture of the proposed approach is shown in Figure 1. The proposed approach relies on the notion of novel feature templates, which take into consideration the context of each token for the annotation. In this section, first, the concept of a conditional random field (CRF) is presented, which is a widely used model for segmentation and labeling tasks in NLP. Subsequently, our proposed feature templates and their corresponding feature functions are presented. Finally, the encoding and decoding of the proposed approach are presented, which form the architecture of the Urdu NER system.

4.1. Conditional Random Field

A CRF is a probabilistic model that is widely acknowledged as a useful technique for the segmentation and labeling of several natural language processing tasks. As the name indicates, a CRF supports a conditional approach, and it structurally follows the characteristics of an undirected graphical model, where the nodes of the CRF graph represent class label sequences, which in our case are named entity types, such as person, organization, and location. A CRF classifier deduces a tag sequence T = t1, t2, …, tn that effectively encapsulates the conceptual nature of an observed pattern x, where x denotes a random observation sequence taken from a sentence. The CRF model involves picking the y that gives the maximum conditional likelihood P(y|x). Formally, it is defined in Equation (1).
P(y|x) = (1/Z(x)) · exp( Σ_t Σ_k λ_k · f_k(y_{t−1}, y_t, x) )   (1)
In this equation, λ_k denotes the weight attributed to feature k during the learning phase, and f_k(y_{t−1}, y_t, x) is a feature function whose value is either 0 or 1, i.e., a binary-valued feature. Z(x) is a normalization factor that is defined in Equation (2).
Z(x) = Σ_{y∈Y} exp( Σ_t Σ_k λ_k · f_k(y_{t−1}, y_t, x) )   (2)
When using a CRF for Urdu NER tasks, x denotes the sequence of tokens in a line of text and y denotes the phrase’s associated tag sequence.
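As an illustration of Equations (1) and (2), the following sketch computes P(y|x) by brute force over all candidate tag sequences. The feature functions, weights, and tokens are hypothetical toy values for exposition, not the features used in this study:

```python
import math
from itertools import product

def crf_prob(x, y, tags, feature_funcs, weights):
    """P(y|x) as in Equation (1): the exponentiated weighted feature sum,
    normalized by Z(x) from Equation (2), computed by brute-force enumeration."""
    def score(seq):
        total, prev = 0.0, None          # prev = None marks start of sequence
        for t, tag in enumerate(seq):
            for w, f in zip(weights, feature_funcs):
                total += w * f(prev, tag, x, t)
            prev = tag
        return math.exp(total)

    z = sum(score(list(seq)) for seq in product(tags, repeat=len(y)))  # Z(x)
    return score(y) / z

# Toy binary features (our own invention): f1 fires when a token ending in
# "خان" is tagged PER; f2 fires on the tag bigram (NOR, PER).
f1 = lambda prev, cur, x, t: 1.0 if cur == "PER" and x[t].endswith("خان") else 0.0
f2 = lambda prev, cur, x, t: 1.0 if prev == "NOR" and cur == "PER" else 0.0

x_toy = ["وزیر", "عمران خان"]
p = crf_prob(x_toy, ["NOR", "PER"], ["NOR", "PER"], [f1, f2], [1.5, 0.8])
```

Since Z(x) sums over every possible tag sequence, the probabilities of all sequences sum to one; real implementations compute this with dynamic programming rather than enumeration.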

4.2. Feature Templates

The effectiveness of machine learning techniques is heavily dependent upon the choice of appropriate features [36], that is, choosing the most informative and discriminating features may improve the performance of a machine technique, whereas a casual choice of features may impede the performance of the technique. Recognizing the importance of the choice of features, a plethora of studies were conducted to ascertain the most suitable feature set for a specific task. However, the set of features that are widely acknowledged as effective features for NER in Western languages is not applicable to Urdu NER due to three reasons. First, the Urdu language vocabulary, as well as its writing script, is entirely different from the Western languages. Therefore, the content-based features that play a prominent role in the identification of NE for the Western languages are not applicable to the Urdu language.
Second, several structural features, such as capitalization, that can be effectively used for discriminating NE from the other text in Western languages are not available in the Urdu language. Third, similar to the structural features, other features, such as part-of-speech tags, which play a prominent role in the identification of NE in Western languages, are not readily usable in the Urdu language due to the unavailability of accurate taggers.
To this end, as a first contribution, we proposed the notion of a feature template. A feature template (FT) is a tuple of four elements that includes three compulsory elements and one optional element. The four elements are: prefix, identifier, position, and tag. Formally, it is defined in Equation (3).
FT = (PR, ID, PS, TG)   (3)
where PR represents the type of prefix, which depends upon the number of words to be used as a feature. There are two types of prefixes: a single-token prefix and a two-token prefix. An element is a single-token prefix if the candidate token is composed of a single-word feature, commonly known as a unigram, whereas this element is a two-token prefix if the candidate token is composed of a feature with two words, commonly known as a bigram. Formally, PR is defined in Equation (4).
PR = {U, B}   (4)
ID is the identifier of the feature, which is used to distinguish between the types of feature templates. The value of the identifier is simply a serial number. Formally, it is defined in Equation (5).
ID = {1, 2, 3, …, n}   (5)
The third element in the tuple of the feature template is the word token at a position, which is represented by PS. The values of this feature are given by the lexical word at the x- and y-coordinates, where the values of x and y vary between −1 and 1. Formally, it is defined in Equation (6).
PS = CW(x, y), such that x ∈ {−1, 0, 1} and y ∈ {0, 1}   (6)
The possible set of POS tags included in this study is presented below.
TG = {PPSP, VBI, VBF, CC, JJ, PRP, AUXA, PU, SC, RB, PDM, CD, PRT, APNA, PRR, AUXT, PRD, Q, AUXM, SYM, NEG, AUXP, SCK, OD, FF, PRS, PRF, VALA, SCP, FR, PRE, INJ, QM}   (7)
The possible combinations of x and y are presented in Table 3.
The presence of PS in the feature tuple is useful to capture the word, as well as its contextual information, including the preceding word and the subsequent word. The fourth element of the tuple, TG, represents the part-of-speech (POS) tag of the word feature. The details of the feature templates are presented in Table 4. For the experimentation, eleven feature templates were generated by varying the values of the four elements.
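The template expansion described above can be sketched as follows. The function name, the CRF++-style U/B prefixes, and the boundary padding tokens are illustrative assumptions, not the exact implementation used in this study:

```python
def expand_templates(tokens, pos_tags, i):
    """Expand unigram templates over a [-1, 1] context window for the word
    column (y = 0) and the POS column (y = 1), plus one bigram template,
    mirroring the (PR, ID, PS, TG) scheme: PR is the U/B prefix, the serial
    counter plays the role of ID, and CW(x, y) is the PS coordinate."""
    feats = []
    cols = [tokens, pos_tags]                     # y = 0: words, y = 1: POS tags
    tid = 0
    for y, col in enumerate(cols):
        for x in (-1, 0, 1):                      # PS = CW(x, y)
            j = i + x
            if 0 <= j < len(col):
                val = col[j]
            else:                                 # hypothetical boundary padding
                val = "_BOS_" if j < 0 else "_EOS_"
            feats.append(f"U{tid}:CW[{x},{y}]={val}")
            tid += 1
    # one two-token (bigram, B-prefixed) template over previous and current word
    prev = tokens[i - 1] if i > 0 else "_BOS_"
    feats.append(f"B{tid}:CW[-1,0]/CW[0,0]={prev}/{tokens[i]}")
    return feats
```

For the middle token of a three-word sentence this yields six unigram features (three word-window positions times two columns) plus one bigram feature.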

4.3. Feature Functions

Feature functions play a key role in the training phase, as they determine the feature values according to the feature templates presented in Table 4. That is, the feature values used for learning are generated by giving the training data as input to the feature functions.
Both real-valued and binary-valued features can be used in CRFs; however, in our experiments, all features were binary-valued. The overall number of final features produced by the feature functions during the encoding process depends on three aspects: (a) the quantity of the training data, (b) the range of class labels, and (c) the number of independent strings derived from a particular template. As discussed in the preceding subsection, this study used 11 templates and the number of output classes was seven. Thus, in the case of unigram templates, the number of feature functions generated by the templates for a single token will be (1 × 7) × 11 = 77. Similarly, for a record with a total of 12 tokens, the number of features generated will be (12 × 7) × 11 = 924.
For a formal specification, consider a given word token (WT) that is to be used in a feature template. |WT| represents the number of words in the token, and four functions (δ, β, γ, and η) can be defined that determine the value of each element in the feature template. The definitions of the four functions are presented in Equations (8) to (11).
PR = δ(WT) = { U, if |WT| = 1;  B, if |WT| = 2 }   (8)
ID = β(WT) = N + 1   (9)
PS = γ(WT) = { CW(0, 0), current word;  CW(−1, 0), pre-context;  CW(1, 0), post-context }   (10)
TG = η(WT) = { CW(−1, 1), pre-context POS tag;  CW(0, 1), current word POS tag;  CW(1, 1), post-context POS tag }   (11)
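A minimal reading of the four functions in Equations (8) to (11) can be sketched as follows. The context-role labels used as dictionary keys ("pre", "current", "post") are our own naming, under the assumption that the word column is y = 0 and the POS column is y = 1:

```python
def delta(wt):
    """Equation (8): U for a single-word (unigram) template, B for two words."""
    return "U" if len(wt.split()) == 1 else "B"

def beta(n):
    """Equation (9): the next serial identifier, ID = N + 1."""
    return n + 1

def gamma(context):
    """Equation (10): word-column coordinates CW(x, 0) for each context role."""
    return {"pre": (-1, 0), "current": (0, 0), "post": (1, 0)}[context]

def eta(context):
    """Equation (11): POS-tag-column coordinates CW(x, 1) for each context role."""
    return {"pre": (-1, 1), "current": (0, 1), "post": (1, 1)}[context]
```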

4.4. Encoding

The main function of the encoding process is to recursively generate encoded feature values from the training sample by utilizing the various feature templates and to store them in the model file. In the encoding phase, the CRF model is trained: the content of the training file is processed record by record, with each sentence constituting a single record. An empty line separates the records in the training file. Every record is broken down into its tokens and their feature vectors. We used a matrix to represent the training data, where each row holds a single word and its corresponding features. The example in Table 5 shows the format of the training file and how the records were labeled. Here, the training data comprises one record, and each token is represented in a three-column layout: the first column holds the token, the second holds the corresponding feature value, which in our case was the POS tag, and the last column captures the type and part of a named entity. For the POS tags, the CLE Urdu Digest POS-tagged corpus (https://cle.org.pk/openmart/index.php?route=product/product&product_id=71 (accessed on 20 February 2022)) was used, which comprises 100,000 POS-tagged tokens. A POS tag for each word of the NE-tagged dataset was assigned through a maximum matching technique using this corpus. The first n − 1 columns were used in the training phase to generate the encoded model data, while the last column represents the target output label used for evaluating the test data.
“S” means that the entity consists of a single term, “B” represents the beginning term of an entity, “M” the middle term of an entity, “E” the ending term of an entity, and “NOR” a normal term that is not part of an entity.
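A sketch of reading this three-column record format and recovering entity spans from the S/B/M/E/NOR labels. The parser, helper names, and the two-record sample are illustrative, not the actual training file:

```python
def parse_training_file(text):
    """Parse the three-column training format described above: one
    'token POS label' line per token, records separated by blank lines."""
    records = []
    for block in text.strip().split("\n\n"):
        rows = [tuple(line.split()) for line in block.splitlines() if line.strip()]
        records.append(rows)
    return records

def entities_from_labels(record):
    """Recover (start, end) entity spans from the S/B/M/E/NOR label column:
    S is a one-token entity; B opens a span that E closes (M tokens sit inside)."""
    spans, start = [], None
    for i, (_token, _pos, label) in enumerate(record):
        if label == "S":
            spans.append((i, i))
        elif label == "B":
            start = i
        elif label == "E" and start is not None:
            spans.append((start, i))
            start = None
    return spans

# Hypothetical two-record sample in the described format.
sample = "عمران NNP B\nخان NNP E\nنے PSP NOR\n\nلاہور NNP S\nمیں PSP NOR"
records = parse_training_file(sample)
```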
Algorithm 1 describes the steps involved in the training process/encoding phase of the proposed Urdu NER system. This process generates the encoded model file.
Algorithm 1: Encoding of the CRF model.
1. CRF_NER_Encode (string [Feature File, Training Data, Mode = Train] Arguments)
 2. For (L = 0; L < Arguments.Length; L++)
 3.  Read Arguments
 4. End For
 5. If Mode = Train
 6.  Load Training Data
 7.  While (Training_Data.End == false)
 8.   Fetch line-by-line record of Training Data
 9.   For (J = 0; J < Feature File; J++)
 10.   Read feature file from {0}
 11.   Extract feature set
 12.   Remove features with frequency below threshold {0}
 13.   Store features in the model file to be used in the decoding phase
 14.  End For
 15. End While
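The feature extraction and frequency-threshold pruning at the heart of Algorithm 1 can be sketched as follows, assuming a caller-supplied `extract` function (the function names and threshold value are illustrative):

```python
from collections import Counter

def encode_features(records, extract, threshold=2):
    """Sketch of Algorithm 1's core loop: extract features for every token of
    every training record, then drop features whose frequency falls below the
    threshold, leaving the feature set that would be stored in the model file."""
    counts = Counter()
    for record in records:
        for i in range(len(record)):
            counts.update(extract(record, i))
    return {feat for feat, c in counts.items() if c >= threshold}

# Toy usage: a word-identity feature over two tiny records.
toy = [["خان", "نے", "خان"], ["خان"]]
feats = encode_features(toy, lambda rec, i: ["w=" + rec[i]], threshold=2)
```

Pruning rare features like this keeps the model file compact and reduces overfitting to features seen only once in training.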

4.5. Decoding

Algorithm 2 shows the steps involved in the testing process/decoding phase. During the decoding stage, the encoded model file created during the encoding stage was decoded using the template features by employing the steps in Algorithm 2 to produce NE labels for the test dataset. The data format for the training and test data was the same. The most significant difference was the absence of the class label column in the test dataset.
Algorithm 2: Decoding of the CRF model.
1. CRF_NER_Decode (string [Testing Data, Encoded_Model_File, Mode = Test, Results] Arguments)
 2. For (J = 0; J < Arguments.Length; J++)
 3.  Read Arguments
 4. End For
 5. If Mode = Test
 6.  Read model data file
 7.  While (Test_Data_File.End == false)
 8.   Fetch line-by-line record of Testing Data
 9.   While (Model_Data_File.End == false)
 10.   Create a decoder tagging object
 11.   Initialize the outcome
 12.   Use the CRFSharp wrapper to predict the tags of a given string
 13.   Output the raw result with its probability
 14.   Save the result in the results file
 15.  End While
 16. End While
 17. Set all of the input parameters to their default values
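The tag prediction performed by the CRFSharp wrapper conceptually corresponds to Viterbi decoding, which finds the highest-scoring tag sequence. A sketch follows; the scoring function shown is a toy stand-in for the learned weighted feature sum, not the trained model:

```python
def viterbi_decode(tokens, tags, score):
    """Recover the tag sequence with the highest total (log-space) score,
    where score(prev, cur, tokens, t) stands for sum_k lambda_k * f_k."""
    # best[tag] = (best score of any sequence ending in tag, that sequence)
    best = {tag: (score(None, tag, tokens, 0), [tag]) for tag in tags}
    for t in range(1, len(tokens)):
        new = {}
        for cur in tags:
            cands = [(best[p][0] + score(p, cur, tokens, t), best[p][1]) for p in tags]
            s, path = max(cands, key=lambda c: c[0])
            new[cur] = (s, path + [cur])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]

# Hypothetical scoring: reward PER on the token "خان" and the NOR -> PER bigram.
toy_score = lambda prev, cur, toks, t: (
    (2.0 if cur == "PER" and toks[t] == "خان" else 0.0)
    + (0.5 if prev == "NOR" and cur == "PER" else 0.0)
)
```

The dynamic program keeps only the best path into each tag at each position, so decoding is linear in sentence length rather than exponential.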

5. Urdu NER Datasets

Urdu is widely acknowledged as a low-resource language, as several computational resources are either not available or not usable [37]. To this end, we developed an Urdu NER corpus (UNER-I) (https://github.com/My-Khan/UNER-Dataset (accessed on 12 June 2022)) that is substantially larger than the baseline dataset. The baseline dataset was originally presented at the International Joint Conference on Natural Language Processing (IJCNLP), which has become part of the flagship NLP forums of the Association for Computational Linguistics (ACL). The IJCNLP took the initiative to build automated NER systems and defined a shared task for the five most widely spoken South Asian languages: Urdu, Bengali, Hindi, Telugu, and Oriya. The NER dataset for the Urdu language is thus referred to as the IJCNLP-Urdu dataset. The key reason for choosing this dataset is that it is among the few publicly available NER resources for Urdu. The second reason is that the number of NE types in the IJCNLP-Urdu dataset is larger than in its counterparts; the other datasets contain a large number of tagged named entities but annotate merely a few NE types, i.e., person and location. Consequently, an NE system developed using fewer NE types would not be adequate for several information extraction tasks or questions. Furthermore, IJCNLP-Urdu is used in the majority of the existing studies on Urdu NER. Below, an overview of the IJCNLP-Urdu dataset is presented, followed by the details of our developed UNER-I dataset.

5.1. The Baseline Dataset

IJCNLP-Urdu is considered the pioneer Urdu NER dataset in which named entities are manually annotated. A key feature of the dataset is that six diverse NE types are annotated, namely, person, location, organization, designation, date, and number. The presence of these multiple NE types makes it the most diverse Urdu NER dataset; this makes it adequate to answer a broad range of questions, in contrast with other datasets. The IJCNLP-Urdu dataset is composed of 40,408 manually annotated tokens [38] and it is freely available for academic and research purposes. The summary of the specification of the IJCNLP-Urdu dataset is presented in Table 6. It can be observed from the table that the dataset is composed of 1097 sentences containing 1115 named entities.
The detailed specification of the dataset is presented in Table 7. The first column of this table contains the NE types and the second column contains the count of named entities of each type. It can be observed from this table that the number of NEs was not balanced across types, e.g., the number of NEs of type location was at least 10-fold greater than the number of NEs of type organization. In the IJCNLP event, the participants were required to use the training data released by the organizers for learning and prediction.

5.2. The UNER-I Dataset

This section discusses our second contribution, i.e., the development of a large Urdu NER dataset, namely, UNER-I. The developed dataset is substantially superior to its counterparts across three key dimensions: NE count, token count, and diversity of text. In terms of the NE count, the number of named entities in IJCNLP-Urdu is merely 1115, whereas the newly developed UNER-I dataset included 5283 NEs, which was nearly a five-fold increase. For the second dimension, i.e., token count, the IJCNLP-Urdu dataset is composed of 48,673 tokens, whereas the newly developed UNER-I dataset was composed of 58,633 tokens, which was a substantial increase. Finally, in terms of the diversity of the text, the IJCNLP-Urdu dataset includes text from a single genre, whereas UNER-I included Urdu text from four genres: national news, international news, sports news, and art news. In particular, it included 50 international news, 60 national news, 40 sports news, and 50 art news articles. Additionally, to ensure the quality of the annotation, the random samples from the four domains were reviewed by two independent Urdu language experts.
Three annotators investigated the UNER-I dataset. The first two annotators separately labeled the named entity tags in each sentence, while the third annotator resolved any conflicts. Only the sentences in which the first two annotators disagreed were shown to the third annotator. The interface showed the tags that the first two annotators allocated and allowed the final tag decision to be made by the third annotator.
Three annotators manually tagged the UNER-I dataset. All were fluent Urdu speakers with a strong command of the language and expertise in the NER task. In the first stage, annotators 1 and 2 tagged a sample of 100 sentences, the inter-annotator agreement was estimated, and their annotations were examined in detail. In the second stage, the remainder of the dataset was annotated by annotators 1 and 2, and the inter-annotator agreement was calculated for the complete dataset. The third annotator annotated the sentences on which the first two annotators differed. The cumulative kappa [39] was 0.80 and the inter-annotator agreement was 89.1 percent, where kappa is a standard measure of interrater agreement that assesses the extent of agreement between two annotators.
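The kappa statistic mentioned above can be computed as follows; this is a generic sketch of Cohen's kappa over two annotators' tag sequences, not the exact script used in the annotation study:

```python
def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e the chance agreement expected from each annotator's
    marginal tag distribution."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    cats = set(labels_a) | set(labels_b)
    p_e = sum((labels_a.count(c) / n) * (labels_b.count(c) / n) for c in cats)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

Unlike raw percent agreement, kappa discounts the agreement two annotators would reach by chance alone, which is why both numbers are reported above.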
During the development process, all the entities were tagged from right to left and the text was stored as sentences. Typically, a “۔” sign was used to mark the sentence boundary, whereas in some cases, a “?” sign was also used to mark the sentence boundary. The entity boundary was marked with start and end tags. The tagged data was stored in plain-text files using UTF-8 encoding. Furthermore, the text in the files was organized sentence-wise because most machine learning models take input sentence by sentence, making NER a sequence-labeling problem.
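The sentence-boundary handling described above can be sketched as follows; treating the Urdu question mark "؟" as a boundary alongside "۔" and "?" is our own assumption:

```python
import re

def split_urdu_sentences(text):
    """Split Urdu text on the boundary marks described above: the Urdu full
    stop '۔' and, in some cases, a question mark ('?' or the Urdu '؟')."""
    parts = re.split(r"[۔?؟]", text)
    return [p.strip() for p in parts if p.strip()]
```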
A notable characteristic of UNER-I is that it is freely available for research purposes. Therefore, we contend that it will be a valuable resource for promoting NER research in Urdu, which is a low-resource language. The summary of the characteristics of our UNER-I dataset is presented in Table 8, whereas Table 9 contains the detailed statistics of the UNER-I dataset. In particular, Table 9 contains the distribution of NE and their respective genres.
Table 8 shows that our UNER-I dataset is substantially larger than the IJCNLP-Urdu dataset in terms of the number of words, sentences, and named entities. Furthermore, a comparison of Table 7 and Table 9 shows that the count for each entity type in UNER-I is several times higher than in the IJCNLP-Urdu dataset. Hence, the dataset we contributed is considerably richer than its existing counterpart.

6. Evaluation

This section discusses the details of the experiments performed to evaluate the effectiveness of our proposed CRF-based approach and to demonstrate the usefulness of our developed UNER-I dataset relative to the baseline IJCNLP-Urdu dataset. First, the datasets used for the experiments are presented; second, the evaluation measures are described; finally, the experimental setup is detailed.

6.1. Datasets

Experiments were performed using two datasets: the IJCNLP-Urdu dataset, which served as the baseline dataset, and the UNER-I dataset we developed. The details of both datasets are presented in the preceding section. The UNER-I dataset was chosen to demonstrate the usefulness of our developed resources, whereas the key reason for choosing IJCNLP-Urdu as the baseline was that its set of NE types is similar to that of UNER-I. The NE types tagged in the remaining datasets differ significantly from those in UNER-I; therefore, those datasets could not be used to fairly contextualize the usefulness of our developed dataset. Finally, both datasets are publicly available, which also motivated their selection for the experiments.

6.2. Evaluation Measures

The effectiveness of our proposed CRF-based approach was assessed using three widely used evaluation measures: precision, recall, and F1 score. Let N(T, A) denote the total number of actual NEs in the gold standard, N(T, C) the number of NEs correctly identified by a technique, and N(T, F) the number of NEs incorrectly identified by the technique. Precision is the ratio of the number of correctly recognized NEs to the total number of NEs identified by the technique, as formalized in Equation (12).
P = N(T, C) / (N(T, C) + N(T, F))
Recall is the ratio of the number of correctly identified NEs to the total number of actual NEs.
R = N(T, C) / N(T, A)
The F1 score is the harmonic mean of precision and recall.
F1 score = (2 × P × R) / (P + R)
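The three measures can be computed directly from the counts N(T, C), N(T, F), and N(T, A). The counts in this sketch are hypothetical, chosen only to illustrate the formulas.

```python
def precision(n_correct, n_false):
    """N(T, C) / (N(T, C) + N(T, F)): correct NEs over all NEs the system emitted."""
    return n_correct / (n_correct + n_false)

def recall(n_correct, n_actual):
    """N(T, C) / N(T, A): correct NEs over all NEs in the gold standard."""
    return n_correct / n_actual

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)

# Hypothetical counts: the system emitted 100 NEs (80 correct, 20 wrong)
# against 120 gold-standard NEs.
p = precision(80, 20)
r = recall(80, 120)
print(round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

Because the harmonic mean is dominated by the smaller of the two values, the F1 score penalizes a system that trades one measure heavily against the other.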

6.3. Experimental Setup

As discussed earlier, experiments were performed using our developed UNER-I dataset and the IJCNLP-Urdu dataset, where the latter served as the baseline dataset. Similarly, an existing CRF-based approach for Urdu NER [15] was used as the baseline approach. Given that we intended to evaluate the effectiveness of the feature template proposed in this study, we contend that the approach of [15] was the most appropriate choice of baseline. Note that the key difference between our proposed model and the baseline model is that the proposed approach relies on a novel set of feature templates that can be used effectively for NER in Urdu.
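The kind of feature template used here (a context window of words plus POS tags, cf. Table 4) can be sketched as a token-level feature function in the style used by common CRF toolkits. This is an assumed Python illustration with hypothetical feature names; the study itself used CRFSharp template files, not this code.

```python
def word2features(sent, i):
    """Features for token i of a sentence given as (word, pos_tag) pairs:
    the current, previous, and next words and their POS tags."""
    word, pos = sent[i]
    features = {
        "w[0]": word,
        "pos[0]": pos,
        # Word/POS conjunction, analogous to templates U7-U9 in Table 4.
        "w[0]|pos[0]": word + "|" + pos,
    }
    if i > 0:
        prev_word, prev_pos = sent[i - 1]
        features.update({"w[-1]": prev_word, "pos[-1]": prev_pos})
    else:
        features["BOS"] = True  # beginning of sentence
    if i < len(sent) - 1:
        next_word, next_pos = sent[i + 1]
        features.update({"w[+1]": next_word, "pos[+1]": next_pos})
    else:
        features["EOS"] = True  # end of sentence
    return features

# Tokens and POS tags taken from the Table 5 example.
sent = [("ریحام", "PNN"), ("پی", "PNN"), ("کے", "PSP")]
print(word2features(sent, 1))
```

A feature dictionary of this shape, one per token, is what a linear-chain CRF consumes; the context-window features are language-independent, while the POS features depend on an Urdu tagger.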
For the experimentation, the open-source CRFSharp package was used owing to the flexibility it offers compared with its counterparts. CRFSharp encodes its model parameters via L-BFGS and provides several substantial advantages over CRF++, such as fully parallel encoding and optimized memory utilization. When training on a large dataset with numerous tags, CRFSharp makes streamlined use of multi-core processors and memory in contrast with CRF++. Hence, in an environment with identical configurations, CRFSharp can encode more complex models at a lower cost. Another reason for this choice is that the package has been widely used for several sequence-labeling tasks in English as well as other languages.
The experiments were performed separately on the two datasets, i.e., one set of experiments used the IJCNLP-Urdu dataset and a separate set used the UNER-I dataset. Ten-fold cross-validation was performed: each dataset was divided into 10 parts, where 9 parts were used for training and the remaining part for testing, iterating so that each part served once as the test set. The results presented in this study are the micro average over the 10 folds.
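The 10-fold split can be sketched as follows. The round-robin fold assignment here is an assumption for illustration; the study does not specify how sentences were assigned to folds.

```python
def ten_fold_splits(n_sentences, k=10):
    """Yield (train_idx, test_idx) pairs: each fold serves as the test set
    once, and the remaining nine folds form the training set."""
    idx = list(range(n_sentences))
    folds = [idx[i::k] for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# e.g., for a 2161-sentence dataset such as UNER-I (Table 8)
splits = list(ten_fold_splits(2161))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

Every sentence appears in exactly one test fold, so micro-averaging the per-fold counts yields scores over the whole dataset.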

7. Results and Discussion

Table 10 presents the summary results of all the experiments using the two datasets, the baseline technique, and the proposed technique. A comparison of the results on the two datasets shows that the precision, recall, and F1 scores achieved by the baseline approach on the UNER-I dataset were higher than those on the IJCNLP-Urdu dataset; a similar pattern holds for our proposed approach. This higher performance of both techniques was due to the number of NEs annotated in the UNER-I dataset being significantly higher than in the IJCNLP-Urdu dataset, which offered better learning and prediction opportunities for the machine learning techniques.
From the comparison of the results achieved by the two techniques, it can be observed that the precision, recall, and F1 score achieved by our proposed approach were higher than those of the baseline technique. Given that the difference between the two approaches lies in the features used for learning and prediction, this shows that our proposed feature templates are a useful resource for NER in Urdu.
For a deeper analysis of the results achieved by the two techniques, a synthesis of the results was performed. It included generating precision, recall, and F1 scores for each genre of news and the values of each evaluation measure for each NE type. The following subsections discuss the results of the techniques across the news genres and NE types.

7.1. Results of News Genres

As discussed earlier, a key aspect in which the UNER-I dataset surpasses the IJCNLP-Urdu dataset is that our dataset includes news articles from four different genres, and the genre of each article is provided. In contrast, the IJCNLP-Urdu dataset includes news articles from merely a single genre. Therefore, the discussion in this section is limited to the UNER-I dataset, i.e., genre-wise results for the IJCNLP-Urdu dataset are not presented.
Table 11 presents the results of both techniques for all news genres in the UNER-I dataset. A comparison across the four genres shows that the results achieved by the proposed approach varied significantly. For instance, the F1 score for national news was 84.68, whereas the F1 score for art-related news was merely 68.09. A similar variation can be observed for the baseline approach. One possible reason for this significant difference in performance is the difference in the number of NEs in the news articles of each genre; that is, there were 1749 NEs in the national news articles but merely 662 NEs in the art news articles, which impeded the learning ability of both the proposed CRF-based approach and the baseline approach.

7.2. Results of NE Types

Further analysis of the performance of the CRF-based approach was performed to investigate the impact of the frequency of each NE type on the performance of the proposed approach. In contrast to the preceding section, the details of the NE types for both datasets, namely, IJCNLP-Urdu and UNER-I, are available. Therefore, the results of both datasets are discussed separately.
Table 12 presents the results for each NE type in the IJCNLP-Urdu dataset. It can be observed from the table that our proposed approach achieved better results for the high-frequency NE types than for the low-frequency ones. For instance, the two NE types with the highest frequencies were location and person, and these were also the two NE types with the highest F1 scores. Conversely, the scarcest NE type was organization, with merely 48 instances, and the proposed approach achieved its lowest F1 score for this type. Table 13 presents the results for each NE type in the UNER-I dataset. Similar to the results on the IJCNLP-Urdu dataset, the results on the UNER-I dataset indicate that the CRF-based approach surpassed the baseline approach by achieving higher precision, recall, and F1 scores.
As discussed earlier, the news in the IJCNLP-Urdu dataset was collected from a single genre, whereas the UNER-I dataset includes news from four genres. Therefore, the results of the UNER-I dataset were synthesized across genres and NE types; the synthesized results are presented in Table 14. It can be observed from this table that, for the national news, our proposed approach surpassed the baseline model for all NE types. For the news from the other three genres, our proposed approach surpassed the baseline model for a majority of the NE types and achieved comparable scores for the remaining types. In particular, for sports news, the proposed approach outperformed the baseline by higher margins for two NE types: organization and date. For international news, the F1 score achieved by our approach was substantially higher than the baseline for three NE types: designation, number, and time. Similarly, for the art news, only the date and number classes surpassed the baseline model by a high margin.

7.3. Error Analysis

In order to identify the causes of incorrect identification of NEs in Urdu text, an error analysis of the incorrectly recognized NEs was performed. This examination showed that, in the majority of cases, person NEs were confused with location and organization NEs, while organization NEs were confused with person and location NEs. A closer look at these conflicts revealed the underlying cause: in Urdu, the meaning of a word is often determined by the surrounding words, so a single word can acquire different meanings in different contexts. To illustrate this, consider the word جان (Jan), which can be used variously as a common noun, a proper noun, or an adjective. This polysemous behavior of many Urdu words makes the NER task very challenging.
For further illustration, consider the excerpt presented in Table 15. The table contains an actual excerpt from the IJCNLP-Urdu dataset in which the NEs were annotated, together with the NEs predicted by our proposed CRF-based approach. It can be observed that our approach confused the date and location NE types: in the original text, the tokens اکتوبر (October) and فروری (February) were marked as instances of the date NE type, but our proposed approach marked both tokens as locations. Similarly, the actual tag of the term ڈیرہغازیخان (Dera Ghazi Khan) is location, but at test time the same term was predicted as a number.

8. Conclusions

Machine learning algorithms are now widely used for building NER tools in practically all languages, including Urdu. Their widespread use owes to four attributes: (a) the ability to learn on their own, (b) high precision, (c) fast processing, and (d) comprehensive design. The availability of NE-labeled datasets is a basic requirement for training and testing ML techniques, yet Urdu is classified as a language with limited resources. We therefore contributed to Urdu language resources by constructing a new homogeneous NE-labeled dataset for the Urdu NER task, with a special focus on ML techniques. In comparison with the previous NE dataset drawn from multiple news sites, considerable effort was put into creating a new, large NE-tagged dataset; the scale of the UNER-I news dataset and its rich NE content together make it well suited to ML techniques. Moreover, in almost all machine learning approaches, constructing a good feature set is at least as important as the choice of learning model, so producing a good feature set is critical. Accordingly, in this study we offered a well-balanced feature set that included both language-dependent and language-independent features, and by comparing it with a baseline we demonstrated its influence on CRF performance. The newly developed dataset was used to build and evaluate a conditional-random-field-based Urdu NER system. This study showed that, by exploiting the CRF model with our proposed sparse and dense features (context-word windows and POS information), we could obtain better NER tagging results for Urdu than with the baseline features.
The experiments showed that the proposed CRF-based model performed better than the baseline approach on both datasets. In the national news domain of our UNER-I dataset, our proposed model surpassed the baseline by 2.36% in F-measure; in the sports news domain, the F-measure was 1.69% higher; in the international news domain, 1.91% higher; and in the art news domain, 0.52% higher. Similarly, for the IJCNLP-08 dataset, the precision, recall, and F-measure of our proposed model were 49.93%, 67.76%, and 54.02%, respectively, while those of the baseline model were 50.46%, 67.21%, and 53.49%, respectively. Additionally, we provided an analysis of the impact of the high variance in the cardinality of named entities. In the future, we aim to build on this NER work in numerous downstream NLP tasks.

Author Contributions

Data curation, T.A.; Formal analysis, K.S.; Software, A.B.; Supervision, A.D.; Visualization, H.F.; Writing—original draft, W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have influenced the work reported in this paper.

References

  1. Nadeau, D.; Sekine, S. A survey of named entity recognition and classification. Lingvisticae Investig. 2007, 30, 3–26. [Google Scholar] [CrossRef]
  2. Espla-Gomis, M.; Sánchez-Martínez, F.; Forcada, M.L. Using machine translation to provide target-language edit hints in computer aided translation based on translation memories. J. Artif. Intell. Res. 2015, 53, 169–222. [Google Scholar] [CrossRef] [Green Version]
  3. Yadav, V.; Bethard, S. A survey on recent advances in named entity recognition from deep learning models. arXiv 2019, arXiv:1910.11470. [Google Scholar]
  4. Sundheim, B.M. Overview of Results of the MUC-6 Evaluation. In Proceedings of the Sixth Message Understanding Conference, Vienna, VA, USA, 6–8 May 1996; pp. 423–442. [Google Scholar]
  5. Khattak, A.; Asghar, M.Z.; Saeed, A.; Hameed, I.A.; Hassan, S.A.; Ahmad, S. A survey on sentiment analysis in Urdu: A resource-poor language. Egypt. Inform. J. 2021, 22, 53–74. [Google Scholar] [CrossRef]
  6. Khan, I.U.; Khan, A.; Khan, W.; Su’ud, M.M.; Alam, M.M.; Subhan, F.; Asghar, M.Z. A review of Urdu sentiment analysis with multilingual perspective: A case of Urdu and roman Urdu language. Computers 2022, 11, 3. [Google Scholar] [CrossRef]
  7. Riaz, K. Rule-Based Named Entity Recognition in Urdu. In Proceedings of the 2010 Named Entities Workshop, Uppsala, Sweden, 16 July 2010; Association for Computational Linguistics: Minneapolis, MN, USA, 2010; pp. 126–135. [Google Scholar]
  8. Malik, M.K.; Sarwar, S.M. Urdu Named Entity Recognition and Classification System Using Conditional Random Field. Sci. Int. 2015, 5, 4473–4477. [Google Scholar]
  9. Saha, S.K.; Chatterji, S.; Dandapat, S.; Sarkar, S.; Mitra, P. A hybrid named entity recognition system for south and south east asian languages. In Proceedings of the IJCNLP-08 Workshop on Named Entity Recognition for South and South East Asian Languages; Asian Federation of Natural Language Processing: Taipei, Taiwan, 2008; pp. 83–88. [Google Scholar]
  10. Roberts, A.; Gaizauskas, R.J.; Hepple, M.; Guo, Y. Combining Terminology Resources and Statistical Methods for Entity Recognition: An Evaluation. In Proceedings of the Conference on Language Resources and Evaluation (LRE’08), Marrakech, Morocco, 26 May–1 July 2008; pp. 2974–2980. [Google Scholar]
  11. Tjong Kim Sang, E.F.; De Meulder, F. Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Stroudsburg, PA, USA, 31 May 2003; Volume 4, pp. 142–147. [Google Scholar]
  12. Shaalan, K.; Raza, H. NERA: Named entity recognition for Arabic. J. Am. Soc. Inf. Sci. Technol. 2009, 60, 1652–1663. [Google Scholar] [CrossRef]
  13. Singh, U.; Goyal, V.; Lehal, G.S. Named Entity Recognition System for Urdu. In Proceedings of the COLING, Mumbai, India, 8–15 December 2012; pp. 2507–2518. [Google Scholar]
  14. Ekbal, A.; Haque, R.; Bandyopadhyay, S. Named Entity Recognition in Bengali: A Conditional Random Field Approach. In Proceedings of the the International Joint Conference on Natural Language Processing (IJCNLP), Taipei, Taiwan, 27 November–1 December 2008; pp. 589–594. [Google Scholar]
  15. Mukund, S.; Srihari, R.; Peterson, E. An Information-Extraction System for Urdu—A Resource-Poor Language. ACM Trans. Asian Lang. Inf. Processing (TALIP) 2010, 9, 1–43. [Google Scholar] [CrossRef] [Green Version]
  16. Kazama, J.I.; Torisawa, K. Exploiting Wikipedia as External Knowledge for Named Entity Recognition. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007; pp. 698–707. [Google Scholar]
  17. Chiong, R.; Wei, W. Named Entity Recognition Using Hybrid Machine Learning Approach. In Proceedings of the 5th IEEE International Conference on Cognitive Informatics, Beijing, China, 17–19 June 2006; pp. 578–583. [Google Scholar]
  18. Shaalan, K. A survey of arabic named entity recognition and classification. Comput. Linguist. 2014, 40, 469–510. [Google Scholar] [CrossRef]
  19. Collins, M.; Singer, Y. Unsupervised Models for Named Entity Classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, College Park, MD, USA, 21–22 June 1999; pp. 100–110. [Google Scholar]
  20. Capstick, J.; Diagne, A.K.; Erbach, G.; Uszkoreit, H.; Leisenberg, A.; Leisenberg, M. A system for supporting cross-lingual information retrieval. Inf. Processing Manag. 2000, 36, 275–289. [Google Scholar] [CrossRef]
  21. Daud, A.; Khan, W.; Che, D. Urdu language processing: A survey. Artif. Intell. Rev. 2016, 47, 279–331. [Google Scholar] [CrossRef]
  22. Villa, S.; Stella, F. Learning Continuous Time Bayesian Networks in Non-stationary Domains. J. Artif. Intell. Res. (JAIR) 2016, 57, 1–37. [Google Scholar] [CrossRef]
  23. Khan, W.; Daud, A.; Nasir, J.A.; Amjad, T. A survey on the state-of-the-art machine learning models in the context of NLP. Kuwait J. Sci. 2016, 43, 66–84. [Google Scholar]
  24. Oudah, M.; Shaalan, K. NERA 2.0: Improving coverage and performance of rule-based named entity recognition for Arabic. Nat. Lang. Eng. 2016, 23, 441–472. [Google Scholar] [CrossRef]
  25. Silver, D.; Huang, A.; Maddison, C.J.; Guez, A.; Sifre, L.; Van Den Driessche, G.; Schrittwieser, J.; Antonoglou, I.; Panneershelvam, V.; Lanctot, M. Mastering the game of Go with deep neural networks and tree search. Nature 2016, 529, 484–489. [Google Scholar] [CrossRef]
  26. Haq, R.; Zhang, X.; Khan, W.; Feng, Z. Urdu Named Entity Recognition System Using Deep Learning Approaches. Comput. J. 2022. [Google Scholar] [CrossRef]
  27. Thenmalar, S.; Balaji, J.; Geetha, T. Semi-supervised Bootstrapping approach for Named Entity Recognition. arXiv 2015, arXiv:1511.06833. [Google Scholar]
  28. Dubba, K.S.; Cohn, A.G.; Hogg, D.C.; Bhatt, M.; Dylla, F. Learning relational event models from video. J. Artif. Intell. Res. 2015, 53, 41–90. [Google Scholar] [CrossRef]
  29. Oudah, M.; Shaalan, K.F. A Pipeline Arabic Named Entity Recognition Using a Hybrid Approach. In Proceedings of the COLING, Mumbai, India, December 2012; pp. 2159–2176. [Google Scholar]
  30. Hardie, A. Developing a Tagset for Automated Part-of-Speech Tagging in Urdu. In Corpus Linguistics; UCREL Technical Papers; Department of Linguistics, Lancaster University: Lancaster, UK, 2003. [Google Scholar]
  31. Anwar, W.; Wang, X.; Wang, X.-l. A Survey of Automatic Urdu Language Processing. In Proceedings of the International Conference on Machine Learning and Cybernetics, Dalian, China, 13–16 August 2006; pp. 4489–4494. [Google Scholar]
  32. Akram, Q.-u.-A.; Naseer, A.; Hussain, S. Assas-Band, an Affix-Exception-List Based Urdu Stemmer. In Proceedings of the 7th Workshop on Asian Language Resources, Suntec, Singapore, 6–7 August 2009; pp. 40–46. [Google Scholar]
  33. Ahmed, T.; Hautli, A. A first approach towards an Urdu WordNet. Linguist. Lit. Rev. 2011, 1, 1–14. [Google Scholar] [CrossRef]
  34. Adeeba, F.; Hussain, S. Experiences in Building the Urdu WordNet. In Proceedings of the 9th Workshop on Asian Language Resources Collocated with IJCNLP, Chiang Mai, Thailand, 12–13 November 2011; pp. 31–35. [Google Scholar]
  35. Anwar, W.; Wang, X.; Li, L.; Wang, X.-L. A Statistical Based Part of Speech Tagger for Urdu Language. In Proceedings of the International Conference on Machine Learning and Cybernetics, Hong Kong, China, 19–22 August 2007; pp. 3418–3424. [Google Scholar]
  36. Khan, W.; Daud, A.; Alotaibi, F.; Aljohani, N.; Arafat, S. Deep recurrent neural networks with word embeddings for Urdu named entity recognition. ETRI J. 2020, 42, 90–100. [Google Scholar] [CrossRef] [Green Version]
  37. Rasheed, I.; Banka, H.; Khan, H.M.; Daud, A. Building a text collection for Urdu information retrieval. ETRI J. 2021, 43, 856–868. [Google Scholar] [CrossRef]
  38. Hussain, S. Resources for Urdu Language Processing. In Proceedings of the IJCNLP, Hyderabad, India, 7–12 January 2008; pp. 99–100. [Google Scholar]
  39. Jakobsson, U.; Westergren, A. Statistical methods for assessing agreement for ordinal data. Scand. J. Caring Sci. 2005, 19, 427–431. [Google Scholar] [CrossRef] [PubMed] [Green Version]
Figure 1. Overall architecture of the proposed approach.
Table 1. Example of Urdu text.
Language | Example Sentence | Flow
English | Today’s newspaper | Left to right (→)
French | Le journal d’aujourd’hui | Left to right (→)
German | Die heutige Zeitung | Left to right (→)
Urdu | آج کا اخبار | Right to left (←)
Table 2. Examples of changes in shape.
Example: Riyadh Is the Capital
[Images of the Urdu script example omitted.]
Table 3. Possible combinations of x and y for the third element.
[x, y] | Description
[0, 0] | Position of the current word
[−1, 0] | Position of the preceding word
[1, 0] | Represents the position of the next lexical word
[−1, 1] | Represents the position of the POS tag of the preceding word
[0, 1] | Represents the position of the POS tag of the current word
[1, 1] | Represents the position of the POS tag of the next lexical word
Table 4. Details of the Feature Templates.
Feature Template | ID | First Element | Second Element | Description
<U1:%x[−1, 0]> | U1 | [−1, 0] | - | Prior vocabulary term
<U2:%x[0, 0]> | U2 | [0, 0] | - | Present vocabulary term
<U3:%x[1, 0]> | U3 | [1, 0] | - | Succeeding vocabulary term
<U4:%x[0, 1]> | U4 | [0, 1] | - | Syntactic tag of the current word
<U5:%x[−1, 1]> | U5 | [−1, 1] | - | Syntactic tag of the preceding lexical word
<U6:%x[1, 1]> | U6 | [1, 1] | - | Syntactic tag of the succeeding lexical word
<U7:%x[0, 0]% [0, 1]%> | U7 | [0, 0] | [0, 1] | Current word/current syntactic tag
<U8:%x[0, 0]%x[1, 1]> | U8 | [0, 0] | [1, 1] | Current word/next syntactic tag
<U9:%x[0, 0]%x[−1, 1]> | U9 | [0, 0] | [−1, 1] | Current word/previous syntactic tag
<B10:%x[0, 1]> | B10 | - | [0, 1] | Bigram of the current syntactic tag
<B11:%x[0, 0]> | B11 | - | [0, 0] | Bigram of the current word
Table 5. Training file format.
Token | POS Tag | Class Label
ریحام | PNN | S_PERSON
پی | PNN | B_ORGANIZATION
ٹی | PNN | M_ORGANIZATION
آئی | PNN | E_ORGANIZATION
کے | PSP | NOR
ٹکٹ | NN | NOR
پر | PSP | NOR
انتحابات | NN | NOR
بھی | PRT | NOR
نہیں | NEG | NOR
لڑیں | OOV | NOR
گی | AUXT | NOR
Table 6. Summary of the Specification of the IJCNLP-Urdu Dataset.
Total No. of Words | 40,408
Total No. of Sentences | 1097
Total No. of Named Entities | 1115
Table 7. Entity-Wise Statistics of the IJCNLP-Urdu Dataset.
NE Type | Count
Person | 277
Location | 490
Organization | 48
Designation | 69
Date | 123
Number | 123
Total | 1115
Table 8. Summary of the UNER-I Dataset.
Cumulative Amount of Words | 58,633
Overall Number of Sentences | 2161
Total Count of Identified Entities | 5283
Table 9. Detailed Statistics of the UNER-I Dataset.
Domain/Entity | Person | Location | Organization | Date | Time | Number | Designation | Total
Sports | 605 | 455 | 53 | 48 | 10 | 589 | 42 | 1802
National | 401 | 390 | 400 | 81 | 40 | 270 | 167 | 1749
International | 201 | 360 | 210 | 74 | 23 | 132 | 70 | 1070
Arts | 355 | 120 | 41 | 42 | 33 | 56 | 15 | 662
Total | 1562 | 1325 | 704 | 245 | 106 | 1047 | 294 | 5283
Table 10. Summary results of the two datasets.
Approach | IJCNLP-Urdu: Precision | Recall | F-Measure | UNER-I: Precision | Recall | F-Measure
Baseline | 50.46 | 67.21 | 53.49 | 76.5875 | 74.2225 | 73.1925
Proposed | 49.93 | 67.76 | 54.02 | 78.205 | 75.7875 | 74.8125
Table 11. Results of the UNER-I dataset across genres.
Genre | Baseline: Precision | Recall | F-Measure | Proposed: Precision | Recall | F-Measure
National | 87.14 | 81.01 | 82.32 | 88.21 | 84.05 | 84.68
Sports | 75.48 | 73.39 | 73.23 | 77.44 | 75.02 | 74.92
International | 73.45 | 71.66 | 69.65 | 75.84 | 72.62 | 71.56
Art | 70.28 | 70.83 | 67.57 | 71.33 | 71.46 | 68.09
Table 12. Entity-wise results of the IJCNLP-Urdu dataset.
NE Type | Baseline: Precision | Recall | F1 Score | Proposed: Precision | Recall | F1 Score
Person | 65.7 | 76.64 | 69.79 | 64.46 | 80.24 | 70.09
Location | 75.23 | 95.41 | 83.41 | 75.16 | 95.31 | 83.47
Organization | 40.47 | 51.81 | 38.32 | 38.89 | 51.18 | 38.78
Date | 39.45 | 58.26 | 42.77 | 40.67 | 59.12 | 45.05
Designation | 45.54 | 62.19 | 47.42 | 50.3 | 67.01 | 53.24
Number | 36.38 | 58.93 | 39.22 | 35.24 | 59.77 | 39.01
Table 13. Entity-wise results of the UNER-I dataset.
NE Type | Baseline: Precision | Recall | F1 Score | Proposed: Precision | Recall | F1 Score
Person | 81.94 | 94.82 | 87.16 | 82.84 | 95.14 | 87.94
Location | 84.76 | 86.40 | 85.17 | 84.49 | 86.32 | 84.84
Organization | 73.17 | 70.15 | 70.31 | 75.68 | 70.60 | 71.54
Date | 79.66 | 63.71 | 68.38 | 82.00 | 66.77 | 70.77
Designation | 83.18 | 80.28 | 80.403 | 86.163 | 80.12 | 81.50
Number | 72.99 | 77.46 | 72.66 | 76.16 | 81.10 | 76.33
Time | 55.64 | 46.59 | 46.42 | 55.35 | 50.08 | 48.72
Table 14. Results of the UNER-I dataset grouped by genre and NE type.
Genre | NE Type | Baseline: Precision | Recall | F1 Score | Proposed: Precision | Recall | F1 Score
National | Person | 83.38 | 91.96 | 87.08 | 83.64 | 92.79 | 87.66
National | Location | 85.86 | 88.36 | 86.86 | 89.44 | 89.87 | 89.4
National | Organization | 89.4 | 89 | 88.9 | 91.83 | 89.52 | 90.47
National | Date | 92.09 | 74.52 | 80.02 | 93.21 | 78.72 | 82.23
National | Designation | 78.98 | 73 | 75.29 | 78.78 | 73.74 | 75.61
National | Number | 85.19 | 83.41 | 83.56 | 87.34 | 87.12 | 86.73
National | Time | 95.04 | 66.79 | 74.53 | 93.25 | 76.57 | 80.67
Sports | Person | 94.22 | 98.9 | 96.37 | 94.04 | 98.82 | 96.23
Sports | Location | 95.46 | 97.05 | 96.17 | 95.7 | 96.7 | 96.1
Sports | Organization | 44.48 | 36.13 | 38.76 | 58.09 | 48.68 | 51.52
Sports | Date | 68.47 | 60.15 | 60.55 | 76.51 | 64.46 | 66.09
Sports | Designation | 96.94 | 95.74 | 95.64 | 96.67 | 92.34 | 93.29
Sports | Number | 88.76 | 96.2 | 92.21 | 87.53 | 94.57 | 90.77
Sports | Time | 40 | 29.54 | 32.92 | 33.57 | 29.54 | 30.42
International | Person | 69.09 | 90.28 | 76.67 | 71.38 | 90.81 | 78.65
International | Location | 82.23 | 86.3 | 83.84 | 80.22 | 86.54 | 82.75
International | Organization | 81.63 | 84.61 | 82.12 | 80.08 | 81.55 | 79.99
International | Date | 82.26 | 63.88 | 70.4 | 76.95 | 63.66 | 68.31
International | Designation | 73.62 | 72.12 | 70.28 | 83.04 | 74.29 | 75.62
International | Number | 76.19 | 75.33 | 73.43 | 82.15 | 78.7 | 78.38
International | Time | 49.15 | 29.11 | 30.83 | 57.06 | 32.82 | 37.23
Art | Person | 81.09 | 98.17 | 88.52 | 82.33 | 98.17 | 89.25
Art | Location | 75.51 | 73.89 | 73.81 | 72.61 | 72.19 | 71.11
Art | Organization | 77.17 | 70.86 | 71.49 | 72.74 | 62.66 | 64.19
Art | Date | 75.83 | 56.32 | 62.56 | 81.34 | 60.27 | 66.47
Art | Number | 41.82 | 54.9 | 41.44 | 47.64 | 64.02 | 49.44
Art | Time | 38.38 | 60.93 | 47.40 | 37.55 | 61.39 | 46.59
Table 15. Excerpt of the IJCNLP-Urdu dataset with actual and predicted NE tags.
Sentence 1: Token | Actual Tag | Predicted Tag || Sentence 2: Token | Actual Tag | Predicted Tag
یہ | NOR | NOR || ڈیرہغازیخان | <LOCATION> | <NUMBER>
ہولناک | NOR | NOR || کمسن | NOR | NOR
کھیل | NOR | NOR || سواروں | NOR | NOR
ہر | NOR | NOR || کی | NOR | NOR
سال | NOR | NOR || فراہمی | NOR | NOR
وسط | NOR | NOR || کے | NOR | NOR
اکتوبر | <DATE> | <LOCATION> || لئے | NOR | NOR
اور | NOR | NOR || بڑی | NOR | NOR
فروری | <DATE> | <LOCATION> || کی | NOR | NOR
میں | NOR | NOR || شکل | NOR | NOR
جنوبیپنجاب | <LOCATION> | <LOCATION> || اختیار | NOR | NOR
منعقد | NOR | NOR || کر | NOR | NOR
ہوتا | NOR | NOR || گیا | NOR | NOR
ہے | NOR | NOR || ہے | NOR | NOR
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

