An Arabic Dataset for Disease Named Entity Recognition with Multi-Annotation Schemes
Abstract
1. Summary
2. Data Description
2.1. Annotation Labels
2.2. Part-of-Speech Tags
2.3. Stopwords
2.4. Gazetteers
2.5. Lexical Marker Lists
2.6. Definiteness (Existence of ‘AL’)
3. Methods
3.1. Data Collection
3.2. Data Preprocessing
3.3. Data Annotation
Listing 1: Code snippet of the IOBES scheme script.

```python
LABEL = "label"  # assumed name of the column that holds the IO tags


def generate_IOBES(dataset):
    # make a new copy of the dataset so the original IO labels are kept
    new_dataset = dataset.copy()
    col = new_dataset.columns.get_loc(LABEL)
    last = len(dataset) - 1
    # loop over every record in the dataset
    for i in range(len(dataset)):
        if dataset.iloc[i][LABEL] != "I":
            continue
        # a missing neighbour (first or last row) counts as "O"
        prev_is_o = i == 0 or dataset.iloc[i - 1][LABEL] == "O"
        next_is_o = i == last or dataset.iloc[i + 1][LABEL] == "O"
        if prev_is_o and next_is_o:
            # the current token is a single-token entity
            new_dataset.iloc[i, col] = "S"
        elif prev_is_o:
            # the current token is the beginning of a multi-token entity
            new_dataset.iloc[i, col] = "B"
        elif next_is_o:
            # the current token is the end of a multi-token entity
            new_dataset.iloc[i, col] = "E"
    # return the newly generated dataset
    return new_dataset
```
3.4. Feature Engineering
Supplementary Materials
Author Contributions
Funding
Conflicts of Interest
Column Values | Description |
---|---|
d | Definite: The word is definite and the definite article ال is present. |
i | Indefinite: The definite article ال is not present. |
c | Construct/poss/idafa: The word is a genitive construct. |
ns | Not applicable: Words that cannot be categorized into any of the above cases, such as verbs, prepositions, and punctuation. |
Annotation Scheme | Number of Labels | Description |
---|---|---|
IO | 2 | I (Inside): Marks the word as a part of an entity. O (Outside): Marks the word as a non-entity. |
IOE | 3 | E (End): Marks the word as the end of an entity. I (Inside): Marks the word as a part of an entity. O (Outside): Marks the word as a non-entity. |
IOB | 3 | B (Beginning): Marks the word as the beginning of an entity. I (Inside): Marks the word as a part of an entity. O (Outside): Marks the word as a non-entity. |
BIES | 8 | B (Beginning): Marks the word as the beginning of an entity. I (Inside): Marks the word as a part of an entity. E (End): Marks the word as the end of an entity. S (Single): Marks the word as a single entity. BO (Beginning-Outside): Marks the word as the beginning of a non-entity sequence. IO (Inside-Outside): Marks the word as a part of a non-entity sequence. EO (End-Outside): Marks the word as the end of a non-entity sequence. SO (Single-Outside): Marks the word as a single non-entity word. |
IOBES | 5 | B (Beginning): Marks the word as the beginning of an entity. I (Inside): Marks the word as a part of an entity. E (End): Marks the word as the end of an entity. S (Single): Marks the word as a single entity. O (Outside): Marks the word as a non-entity. |
IE | 4 | I (Inside): Marks the word as a part of an entity. E (End): Marks the word as the end of an entity. IO (Inside-Outside): Marks the word as a part of a non-entity sequence. EO (End-Outside): Marks the word as the end of a non-entity sequence. |
BI | 4 | B (Beginning): Marks the word as the beginning of an entity. I (Inside): Marks the word as a part of an entity. BO (Beginning-Outside): Marks the word as the beginning of a non-entity sequence. IO (Inside-Outside): Marks the word as a part of a non-entity sequence. |
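The label definitions in the table above can be illustrated with a small sketch that derives each scheme from a plain sequence of IO labels. This is not the authors' script: the function name `convert_io`, the `schemes` lookup, and the sample `labels` list are hypothetical, and the sketch assumes that every unbroken run of `I` labels forms exactly one entity (IO labels alone cannot separate two adjacent entities).

```python
from itertools import groupby


def convert_io(labels, scheme):
    """Convert a list of IO labels ("I"/"O") into another scheme (illustrative)."""
    # For each scheme: which positional markers it uses, and whether it also
    # marks positions inside non-entity ("O") runs (BO, IO, EO, SO).
    schemes = {
        "IO": (set(), False),
        "IOB": ({"B"}, False),
        "IOE": ({"E"}, False),
        "IOBES": ({"B", "E", "S"}, False),
        "BI": ({"B"}, True),
        "IE": ({"E"}, True),
        "BIES": ({"B", "E", "S"}, True),
    }
    markers, mark_outside = schemes[scheme]
    out = []
    # groupby yields one group per run of identical consecutive labels
    for label, run in groupby(labels):
        n = len(list(run))
        for k in range(n):
            if n == 1 and "S" in markers:
                pos = "S"  # run of length one
            elif k == 0 and "B" in markers:
                pos = "B"  # first token of the run
            elif k == n - 1 and "E" in markers:
                pos = "E"  # last token of the run
            else:
                pos = "I"  # any other token of the run
            if label == "I":
                out.append(pos)
            elif mark_outside:
                out.append(pos + "O")  # e.g. "BO", "IO", "EO", "SO"
            else:
                out.append("O")
    return out


# Hypothetical sample: one three-token entity and one single-token entity.
labels = ["O", "I", "I", "I", "O", "O", "I", "O"]
print(convert_io(labels, "IOBES"))
# → ['O', 'B', 'I', 'E', 'O', 'O', 'S', 'O']
```

Working on runs rather than on individual neighbours keeps the boundary cases (first and last token of the sequence) out of the conversion logic, and the same loop covers the schemes that also tag non-entity sequences.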
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Alshammari, N.; Alanazi, S. An Arabic Dataset for Disease Named Entity Recognition with Multi-Annotation Schemes. Data 2020, 5, 60. https://doi.org/10.3390/data5030060