Article

Parallel-Based Corpus Annotation for Malay Health Documents

1 Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia, UKM, Bangi 43000, Selangor, Malaysia
2 Faculty of Industrial Technology, Universitas Pembangunan Nasional “Veteran” Yogyakarta, Yogyakarta 55283, Indonesia
3 Faculty of Engineering and Informatics, Universitas Multimedia Nusantara, Banten 15810, Indonesia
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(24), 13129; https://doi.org/10.3390/app132413129
Submission received: 14 October 2023 / Revised: 22 November 2023 / Accepted: 27 November 2023 / Published: 9 December 2023

Abstract

Named entity recognition (NER) is a crucial component of various natural language processing (NLP) applications, particularly in healthcare. It involves accurately identifying and extracting named entities such as medical terms, diseases, and drug names, which is essential for tasks like clinical text analysis, electronic health record management, and medical research. However, healthcare NER faces challenges, especially in Malay, for which specialized corpora are limited and no general corpus is available yet. To address this, the paper proposes a method for constructing an annotated corpus of Malay health documents. Because NER tools are highly language-dependent and few are available for Malay, the researchers leverage a parallel source that contains annotated entities in English. Additional credible Malay documents are incorporated as sources to enhance the development. The targeted health entities in this research include penyakit (diseases), simptom (symptoms), and rawatan (treatments). The primary objective is to facilitate the development of NER algorithms specifically tailored to the healthcare domain in the Malay language. The methodology encompasses data collection, preprocessing, annotation of text in both English and Malay, and corpus creation. The outcome of this research is the establishment of the Malay Health Document Annotated Corpus, which serves as a valuable resource for training and evaluating NLP models in the Malay language. Future research directions may focus on developing domain-specific NER models, exploring alternative algorithms, and enhancing performance. Overall, this research aims to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain.

1. Introduction

Named entity recognition (NER) is a crucial task in the field of natural language processing (NLP). It involves identifying and categorizing named entities in text, such as people, organizations, locations, dates, and other specific terms [1]. NER is highly valuable in various applications, such as information retrieval, data analysis, and decision support systems. For example, in healthcare, NER can be used to extract relevant medical terms, diseases, and drug names from clinical texts, facilitating clinical text analysis, electronic health record management, and medical research [2]. In information retrieval, NER helps improve search results by accurately identifying and categorizing entities mentioned in documents [3].
The availability of a standard Malay language corpus and machine learning algorithms can catalyze a new wave of Malay NLP research, particularly in ongoing research on NER, semantic analysis, information retrieval, sentiment analysis, and translations. These resources would enable researchers to develop more accurate and effective NER models specific to the Malay language, improving the overall quality of Malay NLP applications.
Currently, domain-specific applications primarily focus on the specific context and often do not extend to other languages with diverse morphological and syntactic structures [4]. Therefore, the development of a standard Malay language corpus and machine learning algorithms tailored to the language is essential. This would enable the expansion of NLP applications to encompass a wider range of domains and promote cross-linguistic research and development. By investing in creating a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, researchers can unlock the full potential of NER and other NLP tasks in the Malay language [5]. This will not only contribute to the growth of the field but also facilitate the development of innovative applications that cater to the unique linguistic characteristics of Malay.
The growth of health-related information in the Malay language necessitates the development of NLP tools and resources tailored to the Malay-speaking community. However, the existing NER tools primarily focus on basic entity types, such as person, organization, and location, and often do not support the Malay language. Moreover, it should be noted that the field of identifying syntax and semantics in the Malay language lacks the abundance of tools and resources that are readily available in English [6]. This scarcity poses a significant challenge in accurately performing named entity recognition (NER) tasks in Malay health documents. These challenges highlight the need for specialized NER models and resources specifically tailored to the Malay language and the health domain.
In addressing this challenge, leveraging parallel corpora, which consist of aligned texts in English and Malay, emerges as the most suitable solution. By utilizing parallel corpora, we can leverage the existing tools and resources for English NER and adapt them effectively to the Malay language, facilitating the identification of named entities in Malay health documents. This approach maximizes the available resources and enables the development of robust NER models specifically tailored to the Malay language.
The primary objective of this research is to address the challenges of healthcare NER in the Malay language by constructing an annotated corpus and facilitating the development of tailored NER algorithms for the healthcare domain. By creating a comprehensive annotated corpus and advancing machine learning algorithms for Malay NLP, this research aims to unlock the full potential of NER and other NLP tasks in the Malay language, ultimately improving information extraction, analysis, and understanding in the healthcare sector.
In this paper, the research on building an annotated corpus for Malay health documents is presented, focusing on named entity recognition (NER). The remainder of the paper is organized as follows. Section 2 covers the background and related work, providing an overview of the current state of NER research in the Malay language and discussing the limitations of existing resources and approaches. Section 3 describes the methodology employed to create the Malay Health Document Annotated Corpus. This includes data collection, preprocessing, annotation of both English and Malay text, and the process of combining annotated documents to create the corpus. In Section 4, the primary result of the study is presented, which is the creation of the Malay Health Document Annotated Corpus. Section 5 discusses the challenges in building the Malay Health Document Annotated Corpus. The importance of this corpus as a useful tool for training and testing NER models in the Malay language is elucidated, along with the wide range of biomedical concepts that have been correctly identified and labeled within the corpus. In the final section, the main findings of the research are summarized, and potential future directions for this work are discussed.

2. Background and Related Work

The development of named entity recognition (NER) systems for the Malay language has been the focus of numerous researchers. In this section, we will explore the current state of Malay NER research, including the domains that have been predominantly studied and the methodologies and results of various studies. Additionally, we will discuss the existing resources available for the Malay language and highlight the need for further research and resource development in diverse domains beyond crime and news.
Existing datasets for the health domain are readily available in several languages, such as English [7,8,9,10], Chinese [11,12], and Indonesian [13,14]. Herwando et al. [13] discussed the identification of medical entities using the conditional random field (CRF) approach, which aims to extract health information, such as on diseases, symptoms, treatments, and medicines, from online health forum discussions in Indonesian. However, although Indonesian and Malay have much in common, they are two different languages. For Malay NER, the available resources focus primarily on domains like crime and news rather than health. It is important to note that NER systems designed for the public domain may not yield optimal performance when applied to health documents due to the domain-specific terminology and language variations. The specialized vocabulary and linguistic nuances present in the health domain require tailored NER models and resources to accurately extract and classify named entities in Malay health documents.
Malay, being an agglutinative language, poses challenges for named entity recognition due to its unique morphological structures [15]. For example, the formation of words via affixation and compounding can result in variations in word forms and make it difficult to accurately identify named entities. Additionally, the presence of linguistic nuances, such as honorifics and honorific markers, further adds to the complexity of recognizing named entities in Malay. These challenges highlight the need for specialized NER models and resources that can effectively handle the intricacies of the Malay language in the health domain.
Efforts should be made to curate and annotate large-scale health corpora in Malay, encompassing a wide range of health-related topics and sources. This would enable the development and evaluation of specialized NER models tailored to the unique characteristics of the Malay language and the specific domain of health. By expanding the available resources for Malay NER in the health domain and promoting research on health literacy, we can enhance the accuracy and applicability of NER systems, ultimately improving information extraction, analysis, and understanding in the healthcare sector. This, in turn, can contribute to advancements in healthcare services, research, and the overall well-being of Malay speakers.
As noted above, the current landscape of Malay NER research is predominantly limited to the domains of crime and news. The remainder of this section provides an overview of several studies on Malay NER, examining their methodologies, outcomes, and the existing state of resources available for the Malay language. By exploring these studies, we aim to gain insights into the progress, challenges, and opportunities in the field and to highlight the need for further research and resource development in diverse domains beyond crime and news.
Several studies have been conducted on Malay NER, each employing different methodologies and achieving varying levels of success. For example, Saad et al. [16] created a crime news corpus and manually annotated entities, achieving a recall value of 78.67%, a precision of 71.11%, and an F-measure of 74.7%. Nadia et al. [17] proposed a rules-based Malay NER system that achieved a recall value of 92.13%, a precision value of 90.23%, and an F-score of 91.05%. Salleh et al. [18] combined the fuzzy c-means and K-nearest neighbors algorithm methods, resulting in an overall success rate of 95.24% for entity recognition in Malay crime data. Ulanganathan et al. [19] developed the Mi-NER system using a probabilistic approach with a linear-chain CRF machine learning technique. Finally, Sazali et al. [20] extracted nouns from Malay classical documents with a 77.61% chance of identifying a noun, while Alfred et al. [21] employed a rule-based approach with manually created dictionaries and achieved a recall of 94.44%, precision of 85%, and an F-score of 89.47%.
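The precision, recall, and F-measure figures quoted above can be related with a short calculation. The sketch below assumes that the cited studies report the balanced F-measure (the harmonic mean of precision and recall); the function name `f_measure` is introduced here for illustration only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Inputs and output share the same unit (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as quoted in the text, in percent.
saad = f_measure(precision=71.11, recall=78.67)    # Saad et al. [16]
alfred = f_measure(precision=85.0, recall=94.44)   # Alfred et al. [21]

print(round(saad, 2), round(alfred, 2))
```

Under that assumption, the computed values agree with the 74.7% and 89.47% reported for these two studies.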
While several studies have been conducted on Malay NER, it is important to note that the current state of resources for the Malay language is still limited. For example, the lack of an annotated corpus remains a challenge for developing a reliable Malay NER system. Additionally, the availability of completed Malay noun lists or dictionaries is limited, requiring manual review by language experts. These limitations highlight the need for further research and resource development in the field of Malay NER.
While the current landscape of Malay NER research has primarily focused on crime and news resources, there is a need for research in diverse domains. By expanding the scope of research, we can develop more comprehensive Malay NER systems that can be applied to a wide range of applications and industries. This will require the development of resources specific to these domains and the exploration of new methodologies and techniques.
Developing Malay NER systems specifically for the health domain is crucial for improving information extraction, analysis, and understanding in the healthcare sector. Accurately extracting and classifying named entities in Malay health documents can enhance the accuracy of NER systems and ultimately contribute to advancements in healthcare services and research. By developing comprehensive and domain-specific resources for Malay NER in the health domain, we can ensure that NER systems are tailored to the unique characteristics of the Malay language and the specific terminology and nuances present in the health domain.
The lack of annotated text corpora in Malay named entity recognition (NER) is a significant challenge in developing supervised learning algorithms. This lack of resources highlights the broader issue of limited Malay natural language processing (NLP) resources, including the absence of credible Malay NER systems [22]. A comprehensive NER corpus is crucial for training and evaluating NER models, enabling researchers and practitioners to develop more effective algorithms and gain a deeper understanding of named entities in the Malay language [23].
The scarcity of annotated text in Malay NER hinders progress in developing accurate and robust NER systems for the language. The lack of credible Malay NER systems further emphasizes the need for comprehensive NLP resources tailored to the Malay language. A high-quality NER corpus would facilitate the development of more effective algorithms and enable researchers and practitioners to gain deeper insights into the characteristics and patterns of named entities in Malay [24].
Future research should prioritize expanding the range of domains covered by Malay NER systems and developing more comprehensive and established resources for the Malay language. Addressing these challenges is crucial to making NLP technology accessible and beneficial to the Malay-speaking community. Advancements in Malay NER can significantly improve the language processing capabilities in Malay and empower researchers, practitioners, and users to leverage the potential of NLP technology.
Furthermore, the development of domain-specific Malay NER systems, such as in finance, law, or education, would broaden the applicability and impact of NLP in the Malay language. These domain-specific systems can cater to the specific needs and requirements of different sectors, enabling more targeted and accurate information extraction and analysis.
In addition to expanding the domains covered, future research should also prioritize the development and establishment of robust and comprehensive resources for Malay NER. This includes the creation of high-quality annotated corpora, lexicons, and rule-based systems that capture the unique linguistic characteristics and entities in the Malay language. By addressing these challenges and investing in the development of Malay NER systems and resources, we can unlock the full potential of NLP technology for the Malay-speaking community. This will not only improve language processing capabilities but also open doors for various applications, such as information retrieval, text summarization, and knowledge extraction, ultimately benefiting both researchers and users in the advancement and utilization of NLP in Malay.

3. Methodology

The main function of a research methodology is to ensure that the research is conducted systematically, consistently, and objectively. The creation of the Annotated Malay Health Document Corpus consists of several stages, as illustrated in Figure 1. These stages encompass data collection via web scraping, annotation of the English text, annotation of the Malay text, and finally the creation of the corpus. Each of these stages contributes to the overall process of developing a valuable resource for Malay health document analysis and research.

3.1. Data Collection

Health information is widely available via various sources, including online articles and social media. Each of these sources has a different writing style, and their information bears varying levels of availability and reliability. The unstructured text used as our data source comes from web pages on health-themed websites, from which we extracted data by web scraping. Our health text data were mainly sourced from the MyHealth portal, an online platform maintained by the Malaysian Ministry of Health and active in 2022.
This methodology allowed for the collection of substantial data from unorganized textual sources, thereby facilitating subsequent examination and annotation. The MyHealth portal plays a pivotal role in the healthcare system of Malaysia, with the objective of facilitating its transition toward a more comprehensive, interconnected, and digitally enabled service. It aims to offer healthcare information that is comprehensive, easily understandable, and of superior quality. By using data from the MyHealth portal, our study benefits from the wealth of health-related information available on this platform.
For this study, we selected about 100 articles and documents from the MyHealth portal, as shown in Table 1. These were analyzed and annotated to create a substantial corpus. Once the data were collected, they were prepared for further analysis and annotation. Irrelevant information such as advertisements, unrelated images, author biographies, reference or support group information, and final reviews was carefully removed so that only content directly related to health topics was retained. Formatting inconsistencies in the original documents, such as variations in font size, style, and line spacing, were also normalized to ensure uniformity across documents, making the data easier to analyze and process.
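The cleaning step described above can be sketched with Python's standard `html.parser` module. This is a minimal illustration, not the pipeline used in the study: the set of tags treated as non-content (`SKIP_TAGS`), the class and function names, and the sample page are all assumptions, since the text does not describe the portal's markup.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as non-content (menus, ads, footers).
# The real portal markup is not described in the text.
SKIP_TAGS = {"script", "style", "nav", "aside", "footer", "figure"}


class ArticleTextExtractor(HTMLParser):
    """Collects the text of <p> elements that are outside skipped regions."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a skipped region
        self.in_paragraph = False
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Collapse runs of whitespace so formatting differences vanish.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def extract_paragraphs(html: str) -> list[str]:
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Made-up page used only to exercise the sketch.
sample = """
<html><body>
<nav><p>Home | Contact</p></nav>
<p>Demam   denggi ialah <b>penyakit</b> berjangkit.</p>
<aside><p>Iklan</p></aside>
<footer><p>Last reviewed</p></footer>
</body></html>
"""
cleaned = extract_paragraphs(sample)
print(cleaned)
```

Only the body paragraph survives; the navigation, advertisement, and footer text are dropped, and inline styling and irregular spacing are normalized away.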
In this research project, we gathered a robust corpus consisting of approximately 3952 health-related sentences in the Malay language and roughly 3728 corresponding sentences in English. The large corpus size is essential for conducting thorough analysis and annotation, as it provides a diverse range of data for examination. With a substantial corpus, we can draw more reliable conclusions and insights from our study. Examples of the sentences in both Malay and English can be viewed in Table 2. This table is illustrative of the variety and complexity of sentences that were included in our data collection effort.
English was selected as the reference language for our dataset because of its extensive use in health-related NLP research, which has produced a robust framework of tools and methods. Using English as the standard allows us to maintain consistency and precision when comparing and analyzing the two languages, and to apply research and methodologies already established for English in our cross-linguistic investigation.
The Malay and English collections exhibit an equivalent quantity of documents, although a discernible discrepancy is observed in the number of sentences. The Malay language corpus exhibits a greater quantity of sentences in comparison to the English corpus. The main reason behind this disparity lies in the structural and linguistic differences between the two languages. Often, a single English sentence can expand into multiple sentences in Malay to convey the same meaning. This is due to the nuanced complexities inherent in the translation process between English and Malay. As Malay has unique syntactic and semantic properties, it often requires more sentences to capture the same information contained in a single English sentence. This linguistic phenomenon is illustrated in the first two rows of Table 2. This crucial observation underscores the challenges and intricacies involved in cross-lingual studies, particularly when developing natural language processing algorithms that accurately capture the subtleties of different languages. It also highlights the importance of developing tailored methodologies that take into account the specific linguistic features and structures of the target language.

3.2. Annotation of English Text

The existing entity recognition algorithms, such as the Stanford CoreNLP tools [25], predominantly classify basic entity types like person, organization, and location. These established tools, while effective in their own right, lack comprehensive support for the Malay language. This poses a significant challenge for our project since the primary objective is to develop a customized named entity recognition (NER) and relation extraction system tailored to Malay.
Considering this, we resolved to create a tailored annotation schema that would cater to the unique needs of the Malay language. This approach ensures that our annotated text corpus can serve as a training and evaluation resource for custom NER and relation extraction algorithms.
For the English texts within our corpus, we employed biomedical NER tools such as BioYODIE NER. This powerful tool enabled us to efficiently identify named entities such as disease, symptoms, care, and others [25]. This identification process is critical as it facilitates the comprehensive mapping of each text’s entity landscape, providing valuable data for subsequent processing and analysis (Table 3).
To enhance the breadth of our entity identification, we additionally employed the Stanza i2b2 and NCBI-Disease tools [26]. These resources were instrumental in identifying other biomedical entities, including categories like problem, treatment, test, and disease. The inclusion of these tools in our entity recognition process ensures broader coverage, enabling us to capture a more diverse set of entities within the corpus (Table 4 and Table 5).
Via the judicious use of these tools, we were able to create a comprehensive annotated corpus that encompasses a wide range of entity categories. This enriched corpus serves as a valuable resource for training and evaluating our custom NER and relation extraction algorithms, bringing us one step closer to achieving our project objectives. By tailoring our approach to suit the unique linguistic context of Malay, we aim to drive significant advancements in the field of Malay language processing.

3.3. Annotation of Malay Text

In the process of creating annotated Malay texts and documents, we leverage reference annotations derived from English texts. More specifically, these are annotated English texts that have been processed using the BioYODIE tools [26], which are designed to provide entity annotations for diseases, symptoms, and care. In addition to this, we also draw upon references from annotated English texts that have been processed using the Stanza and NCBI tools [26], which specialize in providing entity annotations for diseases.
In order to ensure the accurate identification and labeling of biomedical-named entities within our corpus, we consult additional resources such as the Malay Wikipedia [27] and the dictionary from Dewan Bahasa [28]. These additional sources provide valuable insights into the specific linguistic and terminological nuances of the Malay language.
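The idea of using English annotations as a reference for the Malay text can be sketched as a lexicon-based projection. This is an illustration under stated assumptions, not the procedure used in the study: the label mapping `LABEL_MAP`, the bilingual lexicon `LEXICON`, the function `project_entities`, and the sample sentence are hypothetical. The term pairs for "cancer" and "chest pain" are taken from examples elsewhere in this paper.

```python
import re

# Assumed mapping from English tool labels to the Malay corpus labels.
LABEL_MAP = {"disease": "penyakit", "symptom": "simptom", "treatment": "rawatan"}

# Hypothetical bilingual lexicon; in practice this role would be played by
# resources such as a dictionary or manually verified term lists.
LEXICON = {
    "cancer": ["kanser", "barah"],
    "chest pain": ["sakit dada"],
}


def project_entities(english_entities, malay_sentence):
    """Project (term, label) pairs from English onto a Malay sentence.

    Returns a sorted list of (start, end, surface_text, malay_label).
    Assumes lowercasing does not change string length (true for ASCII text).
    """
    found = []
    lowered = malay_sentence.lower()
    for term, label in english_entities:
        malay_label = LABEL_MAP.get(label)
        if malay_label is None:
            continue  # label outside the target schema
        for candidate in LEXICON.get(term.lower(), []):
            pattern = r"\b" + re.escape(candidate) + r"\b"
            for m in re.finditer(pattern, lowered):
                surface = malay_sentence[m.start():m.end()]
                found.append((m.start(), m.end(), surface, malay_label))
    return sorted(found)


# Made-up parallel example: entities found in the English sentence,
# and the corresponding Malay sentence.
english_entities = [("chest pain", "symptom"), ("cancer", "disease")]
malay_sentence = "Sakit dada boleh menjadi simptom kanser paru-paru."
projected = project_entities(english_entities, malay_sentence)
for start, end, text, label in projected:
    print(start, end, text, label)
```

A projection like this only proposes candidate spans; the manual consultation of reference sources described above would still be needed to confirm each label.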
The primary aim of annotating Malay texts and documents is to identify named entities such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These annotations serve as an invaluable asset in the process of training and evaluating natural language processing (NLP) models tailored to the Malay language. Upon the completion of the annotation process, we are left with a comprehensive corpus of annotated Malay texts. Representative examples of these annotated texts can be found in Table 6. This rigorous process of annotation serves to guarantee the accurate identification and classification of biomedical-named entities within the Malay language, thus paving the way for the development of highly effective NLP models designed specifically for the Malay language.

4. Corpus Malay Health Document

The Malay Health Document Annotated Corpus, a detailed collection of annotated health documents, is a crucial asset for researchers and practitioners focusing on the Malay language. It facilitates the training and evaluation of named entity recognition (NER) models specifically crafted for Malay. These models excel in accurately extracting pertinent information from Malay health documents, benefiting medical research, clinical text analysis, and electronic health record management.
Moreover, this corpus plays a pivotal role in advancing various natural language processing (NLP) technologies in healthcare, such as natural language understanding, sentiment analysis, and text classification. It covers a wide array of health-related entities, including diseases, symptoms, and treatments, thus thoroughly representing the healthcare sector, encompassing medical, pharmaceutical, and clinical research areas. The utilization of this corpus not only enhances the effectiveness of NER models in discerning and retrieving valuable data from Malay-language health documents but also aids in expanding the scope and efficiency of NLP technologies within the healthcare field. This amplifies their applicability and utility in diverse scenarios like medical research and clinical text analysis (see Figure 2).
The primary result of this research is the creation of the Malay Health Document Annotated Corpus, which is derived from both English and Malay health documents. The corpus contains a diverse set of accurately labeled health-named entities, such as penyakit (diseases), simptom (symptoms), and rawatan (treatments). These entities can be seen in Table 7, which provides descriptions and examples for each entity type.
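For readers who want to train a sequence labeler on entities like these, one common encoding is token-level BIO tagging. The text does not state the storage format of the corpus, so the sketch below is an assumption for illustration only; the function `to_bio`, the token list, and the spans are hypothetical.

```python
def to_bio(tokens, spans):
    """Convert token-index spans to BIO tags.

    tokens: list of str
    spans:  list of (start_token, end_token_exclusive, label), non-overlapping
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return list(zip(tokens, tags))


# Made-up example using the three entity types of the corpus.
tokens = ["Sakit", "dada", "boleh", "menjadi", "simptom", "kanser", "."]
spans = [(0, 2, "simptom"), (5, 6, "penyakit")]
tagged = to_bio(tokens, spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```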
The development of the Malay Health Document Annotated Corpus significantly contributes to the growing body of NLP resources for the Malay language. By providing a comprehensive annotated corpus, researchers are enabled to develop and evaluate NER models that can accurately analyze Malay health documents. This ultimately leads to better health outcomes for Malay speakers. Furthermore, the annotated corpus serves as a starting point for future research in Malay NLP, particularly in the health domain, opening up opportunities for advancements in this field.
By enabling the development of more accurate NER models for Malay health documents, the Malay Health Document Annotated Corpus can contribute to the creation of innovative healthcare technologies. These technologies can automate the analysis and interpretation of health information, leading to faster diagnosis, more personalized treatment plans, and improved patient outcomes.
Unlike existing NLP resources for the Malay language that focus on general text or news articles, the Malay Health Document Annotated Corpus specifically targets the healthcare domain. This makes it a specialized resource that captures the unique vocabulary, terminology, and entities found in health documents. By focusing on this specific domain, the corpus provides researchers and practitioners with a more accurate and tailored resource for developing healthcare-related NLP technologies.

5. Discussion (Challenges)

Several challenges emerged while building the Malay Health Document Annotated Corpus: synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and polysemous terms.

5.1. Synonyms in Malay Annotations

Malay contains many synonyms: terms that represent the same concept but have different lexical realizations. For example, “barah” and “kanser” both refer to the concept of “cancer” in Malay. Capturing these synonyms in the annotated corpus requires careful attention to ensure that their identical meaning is retained.
This task is not trivial, as it directly influences the efficacy of the subsequent training and evaluation of NLP models. Machine learning models rely on a clear, consistent representation of the data to learn effectively. If a model perceives “barah” and “kanser” as distinct entities, it may fail to generalize appropriately, leading to potential misclassifications in unseen data or new contexts.
This can have significant consequences in NLP tasks such as sentiment analysis, text classification, or information retrieval, where accurate representation and understanding of the data are crucial for reliable results. Misclassifications can lead to incorrect interpretations, biased predictions, or inaccurate information retrieval, undermining the effectiveness and trustworthiness of NLP models.
Additionally, the intricacy of handling synonyms extends beyond mere identification. The model must also consider the context and co-occurrence of these terms within the textual data. It is important to note that even though synonyms refer to the same concept, their usage might differ based on the context. For example, one term may be more prevalent in formal writing, while the other is commonly used in daily conversations or specific regions.
Moreover, it is also essential to acknowledge the cultural and linguistic nuances associated with these synonyms. Some terms might carry different connotations or emotional valences despite referring to the same concept, which further emphasizes the need for nuanced understanding and handling of these terms during the annotation and model training process [29].
To address these challenges, advanced NLP techniques, such as word embeddings or contextual models, might be deployed. These techniques can capture the semantic similarity between different words and help the model understand that “barah” and “kanser” refer to the same concept. Furthermore, domain expertise and a careful annotation process play a crucial role in ensuring the consistency and accuracy of the data representation.
Overall, the presence of synonyms complicates both annotation and model training. However, these problems can be mitigated through careful annotation and the use of advanced NLP techniques, which helps produce NLP models that are robust and context-aware.
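As a rough illustration of how synonyms might be reconciled during annotation, the sketch below maps each surface form to a single concept label before comparison. It is not the procedure used to build the corpus: the table `SYNONYM_TO_CONCEPT`, the label `CANCER`, and both function names are hypothetical, and only the pair “barah” and “kanser” comes from the text above.

```python
# Hypothetical synonym table: each surface form maps to one canonical concept label.
# Only "barah" and "kanser" come from the text; the label "CANCER" is illustrative.
SYNONYM_TO_CONCEPT = {
    "barah": "CANCER",
    "kanser": "CANCER",
}


def normalize_mention(mention: str) -> str:
    """Return the canonical concept for a mention, or the cleaned mention itself."""
    key = mention.strip().lower()
    return SYNONYM_TO_CONCEPT.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """True when two mentions normalize to the same concept label."""
    return normalize_mention(a) == normalize_mention(b)


if __name__ == "__main__":
    print(same_concept("Barah", "kanser"))  # both map to the same label
    print(same_concept("barah", "loya"))    # "loya" is not in the table
```

A lookup table like this only covers synonyms that annotators have already listed; the embedding-based techniques mentioned above would be needed to propose pairs that are not yet in the table.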

5.2. Ambiguous Entity Categorization

There are certain words or phrases that can serve as entities for multiple category types, presenting a complex issue in named entity recognition (NER). For instance, consider the phrase “sakit dada” (chest pain), which could be perceived as an entity within either the disease or symptom categories. This duality generates a demand for context-specific interpretation by the NER system. If “sakit dada” appears within a disease diagnostic context, the NER should classify it within the disease entity category. Alternatively, if the phrase is cited in the description of symptoms, the NER should allocate it to the symptom entity category.
In many scenarios, the NER system needs to analyze the broader context, taking into account related words in the sentence or document, to determine the most appropriate entity category. In essence, this applies the principles of co-reference resolution and word sense disambiguation to clarify semantic relationships and meanings.
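The context-dependent decision described above can be sketched with a simple cue-word heuristic. This is only an illustration under stated assumptions, not the method of any NER system discussed in this paper: the cue lists `SYMPTOM_CUES` and `DISEASE_CUES`, the function `categorize`, and its default label are hypothetical, and the second and third example sentences are invented for the demonstration (the first is taken from Table 2).

```python
import re

# Hypothetical cue words; a real system would learn such signals from annotated data.
SYMPTOM_CUES = {"gejala", "simptom"}
DISEASE_CUES = {"penyakit", "diagnosis"}


def categorize(entity: str, sentence: str, default: str = "Simptom") -> str:
    """Pick a category for an ambiguous entity from cue words in its sentence."""
    entity_tokens = set(re.findall(r"\w+", entity.lower()))
    # Look only at the words surrounding the entity, not the entity itself.
    context = set(re.findall(r"\w+", sentence.lower())) - entity_tokens
    symptom_hits = len(context & SYMPTOM_CUES)
    disease_hits = len(context & DISEASE_CUES)
    if symptom_hits > disease_hits:
        return "Simptom"
    if disease_hits > symptom_hits:
        return "Penyakit"
    return default  # no evidence either way


if __name__ == "__main__":
    examples = [
        "Gejala yang paling biasa adalah sakit dada yang teruk.",  # from Table 2
        "Sakit dada ialah penyakit yang serius.",                  # invented example
        "Dia mengalami sakit dada.",                               # invented example
    ]
    for text in examples:
        print(categorize("sakit dada", text), "<-", text)
```

Counting cue words ignores word order and negation, so a heuristic of this kind is best read as a baseline that shows why sentence context matters, not as a substitute for a trained model.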
This context-sensitive entity categorization presents significant challenges in developing an accurate and reliable NER system. The complexity is amplified when dealing with the medical domain, given the vast range of terminologies and their potential overlap between categories. Furthermore, the NER system must also factor in the linguistic and cultural nuances that can influence the interpretation of certain words or phrases. Consequently, handling such ambiguities requires sophisticated models with robust context-understanding capabilities, well-crafted feature sets, and effective training methods. These requirements underscore the need for high-quality, annotated training data like the Malay Health Document Annotated Corpus.
However, even with these resources, achieving a high level of accuracy in ambiguous entity categorization remains a demanding task. This is a significant area of research focus, with potential solutions exploring advanced techniques like deep learning and complex NLP models, as well as inter-disciplinary approaches integrating linguistics, medical knowledge, and computational methodologies.

5.3. Co-Reference in Translation

The next challenge lies in handling co-reference when translating from English to Malay. Co-reference refers to the use of words or phrases that point to the same concept or entity within a sentence or text. For example, in the sentences shown in Table 8, the English text repeats “abdominal pain”, whereas the Malay translation uses the pronoun “ia” (it) to refer back to “sakit perut”. Handling co-reference correctly is crucial in the translation process, as it aids in maintaining consistency and clarity.
Co-references can become significantly complex, especially within lengthy and nuanced texts. For instance, a document might initially mention “sakit perut” and subsequently use pronouns like “ia” in other parts of the text to refer to “sakit perut”. In such scenarios, the NER system must be adept enough to recognize that “ia” is indeed referring to the initial mention of “sakit perut”.
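The “ia” example can be made concrete with a deliberately naive sketch that links each pronoun to the most recently mentioned entity. It is an illustration only: the pronoun list `PRONOUNS`, the function `resolve_pronouns`, and the nearest-antecedent rule are assumptions, and practical co-reference resolution needs far richer evidence. The two sentences are taken from Table 8.

```python
import re

# Assumed pronoun list; both forms appear in the Malay text of Table 8.
PRONOUNS = {"ia", "ianya"}


def resolve_pronouns(sentences, entities):
    """Link each pronoun to the nearest preceding entity mention.

    Returns a list of (sentence_index, pronoun, entity) tuples.
    """
    links = []
    last_entity = None
    for idx, sentence in enumerate(sentences):
        text = sentence.lower()
        events = []  # (character offset, kind, value)
        for entity in entities:
            pattern = r"\b" + re.escape(entity.lower()) + r"\b"
            for match in re.finditer(pattern, text):
                events.append((match.start(), "entity", entity))
        for match in re.finditer(r"\b\w+\b", text):
            if match.group() in PRONOUNS:
                events.append((match.start(), "pronoun", match.group()))
        # Walk through mentions in reading order.
        for _, kind, value in sorted(events):
            if kind == "entity":
                last_entity = value
            elif last_entity is not None:
                links.append((idx, value, last_entity))
    return links


if __name__ == "__main__":
    sentences = [
        "Sakit perut atau sakit pada abdomen merupakan perkara biasa.",
        "Ia bukan sejenis penyakit tetapi merupakan gejala bagi penyakit lain.",
    ]
    print(resolve_pronouns(sentences, ["sakit perut"]))
```

The nearest-antecedent rule fails as soon as several candidate entities precede the pronoun, which is exactly the situation in long documents described above.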
Effectively leveraging co-reference in the translation process necessitates a deep understanding of the structure and semantics of both languages. The system must recognize and maintain co-reference throughout the translation process while ensuring that the final translation remains accurate and comprehensible [30]. This requires advanced techniques in natural language processing and machine learning, as well as a good understanding of both languages’ cultural and social contexts.
Furthermore, in many instances, the source and target languages might have different co-reference rules and conventions. For example, Malay might have different ways of referring to entities or concepts compared to English. Thus, the system must be capable of adapting the co-reference from the source language to the target language in a natural and accurate manner. This is often challenging and necessitates ongoing research and development.
Addressing these challenges requires careful consideration and the development of methodologies that account for synonyms, resolve entity categorization ambiguities, and accurately handle co-reference during translation. By addressing these challenges, the Malay Health Document Annotated Corpus can be further refined and serve as a valuable resource for training and evaluating NLP models in the Malay language.

5.4. Polysemous Terms

In some cases, the challenge lies in what are known as “multiple translations” or “polysemous terms”. This refers to situations where a single word or phrase can hold multiple meanings or translations in another language, particularly within specialized contexts like medicine or technology. For instance, “shortness of breath” and “breathlessness” are two English medical terms that both signify difficulty breathing, and both share the same Malay translation, “sesak nafas”.
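The many-to-one relationship in this example can be detected mechanically once a term list exists. The sketch below inverts a small English-to-Malay mapping and reports Malay terms that have more than one English source. The dictionary `EN_TO_MS` contains only the pair from the text; the function names are hypothetical and the approach is an assumption, not part of the corpus construction pipeline.

```python
from collections import defaultdict

# English-to-Malay term pairs; both entries come from the example in the text.
EN_TO_MS = {
    "shortness of breath": "sesak nafas",
    "breathlessness": "sesak nafas",
}


def invert(mapping):
    """Group source terms by their shared target translation."""
    inverted = defaultdict(list)
    for source, target in mapping.items():
        inverted[target].append(source)
    return {target: sorted(sources) for target, sources in inverted.items()}


def ambiguous_targets(mapping):
    """Return only the targets that more than one source term maps to."""
    return {t: s for t, s in invert(mapping).items() if len(s) > 1}


if __name__ == "__main__":
    for malay, english_terms in ambiguous_targets(EN_TO_MS).items():
        print(malay, "<-", ", ".join(english_terms))
```

Flagging such terms during annotation tells the annotator that projecting a label from Malay back to English needs a context check, because the Malay form alone does not identify which English term was used.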
This can complicate the selection of the appropriate translation, especially when context sensitivity is a requirement. Context plays a vital role in determining the best translation for such polysemous terms, and this challenge increases when the context is intricate or subject to individual interpretation. This is a common issue in machine translation, and solutions usually employ deep learning models that can consider broader contextual information to better understand and determine an accurate translation.
Additionally, such polysemous terms also pose a significant challenge to the named entity recognition (NER) systems since the same word or phrase might be classified under different categories based on its different meanings. This introduces the necessity for advanced models that can effectively discern the semantic boundaries of such terms within the given context.
Moreover, this issue further accentuates the importance of domain-specific knowledge. In the example of “shortness of breath” and “breathlessness”, having knowledge about medical terminologies can guide the translation process more accurately. It highlights the requirement for a multidisciplinary approach, incorporating subject matter expertise in conjunction with computational methodologies, to effectively handle multiple translations and polysemous terms [31].
Lastly, this challenge also calls attention to the value of extensive, high-quality, and well-annotated corpora, like the Malay Health Document Annotated Corpus. They serve as critical resources for training machine translation and NER systems, enabling them to better understand and handle the complexities of multiple translations and polysemous terms.

6. Conclusions and Future Work

This research has developed the Malay Health Document Annotated Corpus, which is a crucial resource for training and evaluating named entity recognition (NER) models for the Malay language. By meticulously identifying and labeling biomedical named entities, this corpus significantly enhances the suite of NLP resources available for Malay. It has the potential to improve health outcomes for Malay speakers by enabling the development and evaluation of NER models that can efficiently analyze Malay health documents.
The research conducted using the Malay Health Document Annotated Corpus has shown promising results in the development of an NER model for the Malay language. The model, trained using supervised machine learning techniques like the conditional random field algorithm, has demonstrated the ability to accurately identify and extract biomedical entities from Malay health documents. The evaluation of the model using standard measures such as precision, recall, and the F1-score has provided insights into its effectiveness. These findings highlight the potential of NLP technologies in the health sector for the Malay language.
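For reference, the three evaluation measures named above can be computed at the entity level as in the following sketch. It assumes exact matching of (text, label) pairs, which is one common convention; the matching criterion actually used in the evaluation is not stated here, and the gold and predicted sets below are invented solely to exercise the function.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level precision, recall, and F1 under exact (text, label) matching."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Invented example: three gold entities, two predictions, one of them correct.
    gold = {
        ("infarksi miokardium", "Penyakit"),
        ("sakit dada", "Simptom"),
        ("angioplasti kecemasan", "Rawatan"),
    }
    predicted = {
        ("infarksi miokardium", "Penyakit"),
        ("sakit dada", "Penyakit"),  # right span, wrong category
    }
    p, r, f = precision_recall_f1(gold, predicted)
    print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Under exact matching, a prediction with the right span but the wrong category counts as an error for both precision and recall, which makes the ambiguous categorization issue in Section 5.2 directly visible in the scores.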
Several challenges such as synonyms in Malay annotations, ambiguous entity categorization, co-reference in translation, and the handling of polysemous terms have been identified as key areas for future research. Addressing these issues will not only enhance the quality of the corpus but also significantly contribute to the advancement of natural language processing technologies. Focusing on these areas promises to improve the accuracy and utility of NLP models, particularly in the context of the Malay language, thereby elevating the overall effectiveness of language processing applications.
Furthermore, future research could delve into the development of domain-specific NER models customized for other sectors such as finance, law, or education. This would substantially broaden the spectrum of NLP resources available for the Malay language. Researchers could also investigate the use of different machine learning algorithms, advanced deep learning techniques that can learn from large amounts of data, or methods that leverage knowledge from related tasks to enhance the performance of NER models. These avenues have the potential to augment the performance of NER models tailored to the Malay language, thereby expanding the reach and potential of NLP within the Malay-speaking world.

Author Contributions

Conceptualization, H.; methodology, H.; software, H.; writing—original draft preparation, H.; writing—review and editing, H., S.S., L.Q.Z. and A.F.N.; supervision, S.S. and L.Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were analyzed in this study. This data can be found here: http://www.myhealth.gov.my/.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goyal, A.; Kumar, M.; Gupta, V. Recent named entity recognition and classification techniques: A systematic review. Comput. Sci. Rev. 2018, 29, 21–43.
  2. Raza, S.; Reji, D.J.; Shajan, F.; Bashir, S.R. Large-Scale Application of Named Entity Recognition to Biomedicine and Epidemiology. PLOS Digital Health 2022, 1, e0000152.
  3. Patil, N.; Patil, A.; Pawar, B.V. Named Entity Recognition using Conditional Random Fields. In Proceedings of the International Conference on Computational Intelligence and Data Science (ICCIDS 2019), Gurgaon, India, 6–7 September 2019.
  4. Morsidi, F.; Sulaiman, S.; Suliana, S.; Siti, A.M.; Rohaizah, A.W. Malay Named Entity Recognition: A Review. J. ICT Educ. JICTIE 2016, 2, 1–14.
  5. Salleh, M.S.; Asmai, S.A.; Basiron, H.; Ahmad, S. A Malay Named Entity Recognition Using Conditional Random Fields. In Proceedings of the International Conference on Information and Communication Technology (ICoICT), Melaka, Malaysia, 17–19 May 2017.
  6. Mohd Noor, N.; Sulaiman, J.; Noah, S.A. Malay Name Entity Recognition Using Limited Resources. Adv. Sci. Lett. 2016, 22, 2968–2971.
  7. Ramachandran, R.; Arutchelvan, K. Named entity recognition on biomedical literature documents using a hybrid-based approach. J. Ambient. Intell. Humaniz. Comput. 2021, 1–10.
  8. Wei, H.; Gao, M.; Zhou, A.; Chen, F.; Qu, W.; Wang, C.; Lu, M. Named entity recognition from biomedical texts using a fusion attention-based BiLSTM-CRF. IEEE Access 2019, 7, 73627–73636.
  9. Bhasuran, B.; Murugesan, G.; Abdulkadhar, S.; Natarajan, J. Stacked Ensemble Combined with Fuzzy Matching for Biomedical Named Entity Recognition of Diseases. J. Biomed. Inform. 2016, 64, 1–9.
  10. Keretna, S.; Lim, C.P.; Creighton, D. A Hybrid Model for Named Entity Recognition Using Unstructured Medical Text. In Proceedings of the International Conference on Systems Engineering (SOSE), Glenelg, SA, Australia, 9–13 June 2014.
  11. Wang, C.; Wang, H.; Zhuang, H.; Li, W.; Han, S.; Zhang, H.; Zhuang, L. Chinese medical-named entity recognition based on a multi-granularity semantic dictionary and multimodal tree. J. Biomed. Inform. 2020, 111, 103583.
  12. Li, L.; Zhao, J.; Hou, L.; Zhai, Y.; Shi, J.; Cui, F. An attention-based deep learning model for clinical named entity recognition of Chinese electronic medical records. BMC Med. Inform. Decis. Mak. 2019, 19, 235.
  13. Herwando, R.; Jiwanggi, M.A.; Adriani, M. Medical entity recognition using a conditional random field (CRF). In Proceedings of the 2017 International Workshop on Big Data and Information Security (IWBIS), Jakarta, Indonesia, 23–24 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 57–62.
  14. Suwarningsih, W.; Supriana, I.; Purwarianti, A. ImNER Indonesian Medical Named Entity Recognition. In Proceedings of the 2nd International Conference on Technology, Informatics, Management, Engineering, and Environment, Bandung, Indonesia, 19–21 August 2017; pp. 184–188.
  15. Mohamed, H.; Omar, N.; Aziz, M.J.A. Malay Part of Speech Tagger: A Comparative Study on Tagging Tools. Asia-Pac. J. Inf. Technol. Multimed. 2015, 4, 11–23.
  16. Saad, S.; Mansor, M.K. Named entity recognition approach for Malay crime news retrieval. Gema Online J. Lang. Stud. 2018, 18, 216–235.
  17. Nadia, U.; Omar, N. Malay named entity recognition using a rule-based approach. Asia-Pac. J. Inf. Technol. Multimed. 2019, 8, 37–47.
  18. Salleh, M.S.; Asmai, S.A.; Basiron, H.; Ahmad, S. Named Entity Recognition using the Fuzzy C-Means Clustering Method for Malay Textual Data Analysis. J. Telecommun. Electron. Comput. Eng. JTEC 2018, 10, 121–126.
  19. Ulanganathan, T.; Ebrahim, A.; Xian BC, M.; Bouzekri, K.; Mahmud, R.; Hoe, O.H. Benchmarking Mi-NER: Malay entity recognition engine. In Proceedings of the 9th International Conference on Information, Process, and Knowledge Management, Nice, France, 19–23 March 2017; pp. 52–58.
  20. Sazali, S.S.; Rahman, N.A.; Bakar, Z.A. Information extraction: Evaluating named entity recognition from classical Malay documents. In Proceedings of the 2016 Third International Conference on Information Retrieval and Knowledge Management (CAMP), Malacca, Malaysia, 23–24 August 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 48–53.
  21. Alfred, R.; Leong, L.C.; On, C.K.; Anthony, P. Malay Named Entity Recognition Based on a Rule-Based Approach. Int. J. Mach. Learn. Comput. 2014, 4, 300–306.
  22. Lan, T.S.; Logeswaran, R. Challenges and developments in Malay natural language processing. J. Crit. Rev. 2020, 7, 61–65.
  23. Salah, R.E.; Zakaria, L.Q.B. Building the classical Arabic entity recognition corpus (CANERCorpus). In Proceedings of the 2018 Fourth International Conference on Information Retrieval and Knowledge Management (CAMP), Kota Kinabalu, Malaysia, 26–28 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 1–8.
  24. Fu, Y.; Lin, N.; Yang, Z.; Jiang, S. An open-source dataset and a multi-task model for malay named entity recognition. arXiv 2021, arXiv:2109.01293.
  25. Kraljevic, Z.; Searle, T.; Shek, A.; Roguski, L.; Noor, K.; Bean, D.; Dobson, R.J. Multi-domain clinical natural language processing with MedCAT: The medical concept annotation toolkit. Artif. Intell. Med. 2021, 117, 102083.
  26. Kühnel, L.; Fluck, J. We are not ready yet: Limitations of state-of-the-art disease named entity recognizers. J. Biomed. Semant. 2022, 13, 26.
  27. Wikipedia Bahasa Melayu. 2022. Available online: https://ms.wikipedia.org/ (accessed on 23 December 2022).
  28. Portal Rasmi Pusat Rujukan Persuratan Melayu. 2022. Available online: https://prpm.dbp.gov.my/ (accessed on 19 December 2022).
  29. Sharifian, F. Cultural linguistics: The state of the art. Adv. Cult. Linguist. 2017, 1–28.
  30. Brack, A.; Müller, D.U.; Hoppe, A.; Ewerth, R. Co-reference resolution in research papers from multiple domains. In Proceedings of the Advances in Information Retrieval: 43rd European Conference on IR Research, ECIR 2021, Virtual Event, 28 March–1 April 2021; Proceedings, Part I 43. Springer International Publishing: Berlin/Heidelberg, Germany, 2021; pp. 79–97.
  31. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. Adv. Neural Inf. Process. Syst. 2013, 3111–3119.
Figure 1. Framework for corpus creation.
Figure 2. Example of tagging.
Table 1. Statistics of the manually annotated parallel corpus.
| | Malay | English |
| --- | --- | --- |
| Documents | 100 | 100 |
| Sentences | 3952 | 3728 |
| Words | 74,183 | 77,076 |
Table 2. Some examples of text in English and Malay.
| English Text | Malay Text |
| --- | --- |
| A heart attack (myocardial infarction) is usually caused by a blood clot, which stops the blood flowing to a part of the heart muscle. | Infarksi miokardium akut (serangan jantung akut) biasanya disebabkan oleh darah beku. Darah beku ini yang menyebabkan pengaliran darah terhenti ke sebahagian daripada otot jantung. |
| What causes a myocardial infarction (MI)? The most common cause of a myocardial infarction (MI) is a blood clot (thrombosis) that forms inside a coronary artery or one of its branches. | Apa yang menyebabkan infarksi miokardium? Penyebab utama infarksi miokardium adalah darah beku (trombosis) yang terbentuk di dalam arteri koronari utama, atau salah satu daripada cabang-cabangnya. |
| What are the symptoms of a myocardial infarction? The most common symptom is severe chest pain, which often feels like a heavy pressure feeling on the chest. | Apakah gejala-gejala infarksi miokardium? Gejala yang paling biasa adalah sakit dada yang teruk, yang sering terasa seperti tekanan berat di dada. |
| However, some people have only a mild discomfort in their chest or feel like having indigestion or heartburn. | Walau bagaimanapun, sesetengah orang hanya mempunyai ketidakselesaan ringan di dada mereka atau merasa seperti senak atau pedih ulu hati. |
| Some people collapse and die suddenly. This is not very common. | Kadangkala infarksi myokardium boleh menyebabkan kematian mengejut tetapi keadaan ini jarang berlaku. |
| What is the treatment for a myocardial infarction? There are two treatments that can restore blood flow back through the blocked artery: Emergency angioplasty. | Apakah Rawatan untuk infarksi miokardium? Terdapat dua rawatan yang boleh memulihkan aliran darah yang tersumbat: Angioplasti kecemasan. |
| Ideally this is the best treatment if it is available and can be done within a few hours of symptoms starting. | Ini adalah rawatan yang terbaik jika ia boleh didapati dan boleh dilakukan dalam beberapa jam gejala bermula. |
| An injection of a clot-busting medicine is an alternative to emergency angioplasty. | Suntikan ubat cair darah adalah alternatif kepada angioplasti kecemasan. |
Table 3. Example results for annotated English text (using the BioYODIE tool).
English TextBioYODIE
DiseaseSymptomCare
A heart attack (myocardial infarction) is usually caused by a blood clot, which stops the blood flowing to a part of the heart muscle.heart attack
myocardial infarction
blood clot
blood
What causes a myocardial infarction (MI)? The most common cause of a myocardial infarction (MI) is a blood clot (thrombosis) that forms inside a coronary artery, or one of its branches.myocardial infarction (MI)
myocardial infarction (MI)
blood clot
(thrombosis)
What are the symptoms of a myocardial infarction? The most common symptom is severe chest pain, which often feels like a heavy pressure feeling on the chest.myocardial infarction
chest pain
Symptoms
chest pain,
However, some people have only a mild discomfort in their chest or feel like having indigestion or heartburn.discomfort
indigestion
heartburn
discomfort
indigestion
heartburn
Some people collapse and die suddenly. This is not very common.Collapse
What is the treatment for a myocardial infarction? There are two treatments that can restore blood flow back through the blocked artery: Emergency angioplasty.myocardial infarction
blood
blocked artery Emergency
treatment
treatment
angioplasty
Ideally this is the best treatment if it is available and can be done within a few hours of symptoms starting.Symptomssymptomstreatment
An injection of a clot-busting medicine? is an alternative to emergency angioplasty.clot
emergency
injection
angioplasty
Table 4. Example results for annotated English text (using the NCBI-Disease tool).
| English Text | NCBI-Disease: Disease |
| --- | --- |
| A heart attack (myocardial infarction) is usually caused by a blood clot, which stops the blood flowing to a part of the heart muscle. | heart attack; myocardial infarction |
| What causes a myocardial infarction (MI)? The most common cause of a myocardial infarction (MI) is a blood clot (thrombosis) that forms inside a coronary artery or one of its branches. | myocardial infarction; MI; myocardial infarction; MI; thrombosis |
| What are the symptoms of a myocardial infarction? The most common symptom is severe chest pain, which often feels like a heavy pressure feeling on the chest. | myocardial infarction; chest pain |
| However, some people have only a mild discomfort in their chest or feel like having indigestion or heartburn. | indigestion; heartburn |
| Some people collapse and die suddenly. This is not very common. | |
| What is the treatment for a myocardial infarction? There are two treatments that can restore blood flow back through the blocked artery: Emergency angioplasty. | myocardial infarction |
| Ideally this is the best treatment if it is available and can be done within a few hours of symptoms starting. | |
| An injection of a clot-busting medicine is an alternative to emergency angioplasty. | |
Table 5. Example results for annotated English text (using the Stanza i2b2 tool).
English TextStanza—i2b2
ProblemTestTreatment
A heart attack (myocardial infarction) is usually caused by a blood clot, which stops the blood flowing to a part of the heart muscle.A heart attack
myocardial infarction
a blood clots.
the blood flowing.
What causes a myocardial infarction (MI)? The most common cause of a myocardial infarction (MI) is a blood clot (thrombosis) that forms inside a coronary artery, or one of its branches.a myocardial infarction
MI
a myocardial infarction
MI
a blood clots.
thrombosis
What are the symptoms of a myocardial infarction? The most common symptom is severe chest pain, which often feels like a heavy pressure feeling on the chest.the symptoms
a myocardial infarction
severe chest pain
a heavy pressure feeling on the chest.
However, some people have only a mild discomfort in their chest or feel like having indigestion or heartburn.a mild discomfort in their chest
indigestion
heartburn
Some people collapse and die suddenly. This is not very common.Some people collapse
What is the treatment for a myocardial infarction? There are two treatments that can restore blood flow back through the blocked artery: Emergency angioplasty.a myocardial infarctionblood flowthe treatment
two treatments
Emergency angioplasty
Ideally this is the best treatment if it is available and can be done within a few hours of symptoms starting.symptoms
An injection of a clot-busting medicine? is an alternative to emergency angioplasty. a clot-busting medicine emergency angioplasty
Table 6. Example results for annotated Malay text.
Malay TextManually Annotated
PenyakitSimptomRawatan
Infarksi miokardium akut (serangan jantung akut) biasanya disebabkan oleh darah beku.Infarksi miokardium akut
serangan jantung akut
darah beku
Darah beku ini yang menyebabkan pengaliran darah terhenti ke sebahagian daripada otot jantung.Darah beku
Apa yang menyebabkan infarksi miokardium? Penyebab utama infarksi miokardium adalah darah beku (trombosis) yang terbentuk di dalam arteri koronari utama, atau salah satu daripada cabang-cabangnya.infarksi miokardium
infarksi miokardium
darah beku (trombosis)
Apakah gejala—gejala infarksi miokardium? Gejala yang paling biasa adalah sakit dada yang teruk, yang sering terasa seperti tekanan berat di dada.infarksi miokardium
sakit dada
sakit dada
Walau bagaimanapun, sesetengah orang hanya mempunyai ketidakselesaan ringan di dada mereka atau merasa seperti senak atau pedih ulu hati.Ketidakselesaan pedih ulu hatiKetidakselesaan
Senak
pedih ulu hati
Kadangkala infarksi myokardium boleh menyebabkan kematian mengejut tetapi keadaan ini jarang berlaku.
Apakah Rawatan untuk infarksi miokardium? Terdapat dua rawatan yang boleh memulihkan aliran darah yang tersumbat: Angioplasti kecemasan.infarksi miokardium Angioplasti kecemasan
Ini adalah rawatan yang terbaik jika ia boleh didapati dan boleh dilakukan dalam beberapa jam gejala bermula.
Suntikan ubat cair darah adalah alternatif kepada angioplasti kecemasan. Suntikan
Ubat
Angioplasti kecemasan
Table 7. Descriptions and examples for each entity type. Examples are translated from Malay.
| Entity | Quantity | Examples |
| --- | --- | --- |
| Penyakit (Disease) | 1431 | Barah otak, Angin ahmar, Kanser prostat, Batu kering, Kencing manis |
| Simptom (Symptom) | 186 | Loya, Sembelit, Pening, Muntah, Kejang otot |
| Rawatan (Treatment) | 639 | Vaksinasi, Terapi laser, Kemoterapi, Senaman abdomen, Pembedahan |
Table 8. Examples of the sentences translated to Malay using co-reference.
| English Text | Malay Text |
| --- | --- |
| Stomachache or abdominal pain is a common complaint. Abdominal pain by itself is not a disease but is a symptom of a variety of underlying disorders. Causes of Abdominal pain: Psychogenic causes - The Abdominal pain occurs in non-diseased organs. Pain is believed to arise from stress, anxiety, and depression. | Sakit perut atau sakit pada abdomen merupakan perkara biasa. Ia bukan sejenis penyakit tetapi merupakan gejala bagi penyakit lain. Sebab-sebab Sakit Abdomen Sebab-sebab psychogenic Sakit yang dialami tetapi tidak disebabkan oleh sebarang penyakit. Ianya timbul mungkin akibat dari tekanan perasaan, kebimbangan dan kemurungan. |

Hafsah; Saad, S.; Zakaria, L.Q.; Naswir, A.F. Parallel-Based Corpus Annotation for Malay Health Documents. Appl. Sci. 2023, 13, 13129. https://doi.org/10.3390/app132413129