Article

Enhanced Precision in Chinese Medical Text Mining Using the ALBERT+Bi-LSTM+CRF Model

1
Laboratory for Medical Imaging Informatics, Shanghai Institute of Technical Physics, Chinese Academy of Science, Shanghai 200083, China
2
University of Chinese Academy of Sciences, Beijing 100049, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2024, 14(17), 7999; https://doi.org/10.3390/app14177999
Submission received: 14 July 2024 / Revised: 31 August 2024 / Accepted: 3 September 2024 / Published: 7 September 2024
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

Medical texts are rich in specialized knowledge and medical information. As the medical and healthcare sectors become increasingly digitized, these texts must be effectively harnessed to derive insights and patterns, and great attention is being directed to this emerging research area. Generally, natural language processing (NLP) algorithms are employed to extract comprehensive information from unstructured medical texts, aiming to construct a graphical database for medical knowledge. One outstanding need is to reduce model size while maintaining the precision of the BERT algorithm. To this end, a carefully designed algorithm, called ALBERT+Bi-LSTM+CRF, is introduced, attaining both enhanced efficiency and scalability. In entity extraction, the constructed algorithm achieves 91.8%, 92.5%, and 94.3% for the F-score, precision, and recall, respectively. It also achieves remarkable outcomes in relation extraction, with 88.3%, 88.1%, and 88.4% for the F-score, precision, and recall, respectively. These results further underscore its practicality in the graphical construction of medical knowledge.

1. Introduction

Both internet technology and digitization have altered many areas of human life, and the medical and health sectors are no exception. Smart healthcare is no longer just a slogan; it aims to optimize the management and operational processes of hospitals and health institutions, substantially enhancing the quality of medical services for the public [1]. The medical field generates vast amounts of text data written in natural language, such as medical records, test reports, textbooks, encyclopedias, academic journals, and clinical guidelines. These contain rich, comprehensive, and interrelated professional knowledge and medical practices [2]. More effective utilization of these texts provides invaluable auxiliary guidance for medical doctors when diagnoses and treatments are administered [3]. Hence, developing methods to efficiently process medical texts and derive key insights and patterns has become a primary research focus [4].
For example, Named Entity Recognition (NER), one of the key approaches for processing medical texts, aims to accurately extract medical entities such as disease names, symptom descriptions, and administered treatments. A more effective NER process, in turn, improves the subsequent extraction of crucial relationships and the analysis of information. NER methods mainly fall into three types: rule-based and dictionary-based methods, statistical methods, and combinations of the two [5]. Although rule-based and dictionary-based approaches have been employed for a long time, their application scope is relatively constrained by their dependence on knowledge bases. Statistical methods, which analyze and learn from manually annotated corpora, have become the mainstream direction of current research.
The concept of the Semantic Web and the use of ontology models were advocated to formally express the implicit semantics of data, and knowledge graphs built on them have proven themselves in several applications [6]. Subsequently, formal models of the Semantic Web emerged, such as the Resource Description Framework (RDF) and the Web Ontology Language (OWL), laying a solid foundation for the development of knowledge graphs [7]. As research has deepened, the knowledge graph method has progressed: it not only inherits and develops Semantic Web technology and standards but also represents a substantial breakthrough in information representation and comprehension. Google Inc. officially launched the Knowledge Graph in 2012 [8], enabling computers to better comprehend human communication patterns and thereby provide more intelligent and personalized services [9].
The wave of digitalization in the medical field is transforming medical care unprecedentedly. This new paradigm also brings a wide range of technologically enabled optimizations to hospital management and operations, significantly enhancing the quality of medical services [10]. Even though vast numbers of textual records are rich in medical knowledge and practice, they cannot be used directly to their full potential; their contents must first be processed further to extract useful information. Specifically, one reason they cannot be used directly is that they lack a unified standard structure and instead have diversified and complex structures [11]. Another requirement is to reduce model size while maintaining the precision of the BERT algorithm, which is well designed and provides strong outcomes. However, some technical issues accompany these efforts.
Beyond these, the literature provides alternatives in the direction of deep learning methods. For example, Lan et al. [12] proposed an effective fusion method based on graph neural networks (GNNs) for Chinese medical text classification. Dong and Zheng [13] suggested an algorithm that combines texts and graphs to repurpose drugs. Guo et al. [14] proposed a machine knowledge graph based on an enhanced transformer. Lin et al. [15] introduced a multi-model approach to detect disease–disease relations. Meanwhile, text-based approaches are used broadly to determine associations between several aspects of medical texts. For example, even though some technical issues remain, BERT is widely used on medical texts to derive insights and patterns. Some up-to-date references can be found in [16,17,18].
In this paper, an algorithm is proposed to construct a graph database of medical texts, by employing ALBERT+Bi-LSTM+CRF to resolve the issues outlined. An in-depth analysis and training on medical texts were conducted to gain key insights and extract patterns, using advanced statistical learning methods [19]. The proposed algorithm improves the efficiency of medical text processing and provides medical doctors with more accurate auxiliary diagnostic and treatment suggestions, thus contributing to advancing the overall quality of medical services.
Figure 1 depicts the steps taken. Firstly, annotated medical texts are obtained from public datasets and manual labeling. Then, these texts are preprocessed to remove special symbols and to check the accuracy of the annotation format. Next, the whole dataset is split into three subsets for training, validation, and testing, respectively. The ALBERT+Bi-LSTM+CRF algorithm is used for training. After the algorithm is evaluated and adjusted, large numbers of medical texts or medical guidelines can be fed into it as inputs to generate predictions. Afterward, a structured entity set and relationship set are obtained. These data are organized and fused into triplets after duplicates are removed. Finally, the knowledge graph database is constructed.
This study applied the knowledge graph concept in the medical field by employing the formal expression of the Semantic Web. In this way, the implicit semantics in the data are more effectively captured and understood, which provides strong support for information retrieval and personalized medical services. This study also employed statistical methods to conduct comprehensive training on various medical texts, aiming to derive effective insights and patterns from the texts and construct a structured and logically rigorous database for knowledge graphs. According to the findings, significant academic value and application prospects are demonstrated.
The structure of this article is as follows: Section 2 presents the core concepts, construction stage, and architecture of the knowledge graph. Section 3 addresses model training and presents the strengths and limitations of the ALBERT algorithm, while the Bi-LSTM+CRF algorithm is proposed to enhance the precision and help extract key texts effectively. Section 4 presents the experimentation and optimization phases. Critical steps such as real medical text annotation, data preprocessing, label definition, and extraction of the entity database are outlined, and the results and analyses are presented. Section 5 summarizes the work conducted. The proposed algorithm, ALBERT+Bi-LSTM+CRF, and its further potential advancement opportunities are discussed regarding the scale of the dataset, the introduction of medical image data, and the label structure of the knowledge graph.

2. Construction of a Knowledge Graph

2.1. The Core Concepts of a Knowledge Graph

As technology has advanced, knowledge graphs have been widely applied in various fields, such as natural language processing, recommendation systems, and intelligent question and answer (Q&A) systems [20]. A knowledge graph typically consists of nodes (entities) and edges (relationships), forming a directed graph. Each node represents an entity, and the edges represent the relationships between entities. Building a knowledge graph is a complex process that involves data collection, data cleansing, integration, and standardization. Automated methods for constructing knowledge graphs include information extraction, entity recognition, and relationship extraction. As the scale of knowledge graphs expands, it has become a challenge to effectively store, retrieve, and update them. In addition, the quality and accuracy of knowledge graphs present concerns for researchers [21].
The knowledge graph, an advanced semantic network, aims to formally delineate complex entities and their relations in real settings and has become a tool of substantial value for representing large-scale knowledge bases. Its basic unit is a triplet (G = (E, R, S)) that practically presents structured knowledge (S) formed by entities (E) joined through relationships (R) [22]. Entities are the cornerstone of the knowledge graph, and each entity is represented by a globally unique identifier (ID), connected to other entities through multiple associations.
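The triplet unit G = (E, R, S) can be sketched in a few lines of code. This is an illustrative toy, not the paper's implementation; the entity and relation names are invented for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    head: str      # an entity from E (subject)
    relation: str  # a relation from R
    tail: str      # an entity from E (object)

# A tiny knowledge graph is simply a set of such structured facts (S).
graph = {
    Triplet("pneumonia", "symptom_of", "fever"),
    Triplet("pneumonia", "examined_by", "chest CT"),
}

# Entities are the union of heads and tails; relations label the edges.
entities = {t.head for t in graph} | {t.tail for t in graph}
relations = {t.relation for t in graph}
```

In a production system each entity would additionally carry the globally unique identifier mentioned above, rather than being keyed by its surface name.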

2.2. Architectural Design of a Knowledge Graph

The architectural design of the knowledge graph reflects a rigorously logical and flexible system.
(1) Logical Structure: The knowledge graph is logically split into two layers, namely, the data and schema. The data layer carries rich factual data. Each fact precisely stores a basic unit of knowledge. The schema layer, functioning as the skeleton of the knowledge graph, standardizes and structures facts in the data layer through an ontology library, thus ensuring the precision and consistency of knowledge [23].
(2) System Architecture: The construction of a knowledge graph is a continuously repetitive and evolving process. The construction approaches are mainly split into two approaches, namely, top-down and bottom-up. The top-down approach emphasizes the definition of the ontology and data schema of the knowledge graph and gradually plugging in entities. This approach is suitable for construction based on the available structured knowledge bases. On the other hand, the bottom-up approach focuses more on deriving entities from open-linked data, picking high-quality entities based on confidence levels, and adding them to the knowledge base to form a full ontology model. In the current research setting and practice, the bottom-up construction approaches are popularly applied due to their flexibility and adaptability [24]. This manuscript will also follow such an approach, deriving invaluable insights and patterns through the comprehensive processing of open-access data to construct a structured and rich knowledge graph.
Figure 2 depicts the whole construction procedure and fully demonstrates the transformation of the knowledge graph from raw data to the final knowledge base.

3. Model Training

When a knowledge graph is constructed, a systematic approach is adopted to process unstructured medical texts and extract entities and related data. Firstly, a deep analysis of texts using NLP is conducted, and then key entities such as diseases, body parts, and examination items are identified and extracted, which are the fundamental elements for constructing a medical knowledge graph. To recognize the correspondence between entities, label categories are assigned to each entity in the training process. Thus, the accuracy and reliability of knowledge graphs are ensured. Finally, these accurately labeled entities and relationships are integrated into a database, thus constructing structured and information-rich knowledge graphs. Therefore, the proposed algorithm not only advances the reliability of the database but also lays a solid foundation for subsequent data analysis and applications.
The training model consists of three parts: ALBERT, Bi-LSTM, and CRF. Figure 3 depicts the model structure. The text is first embedded by ALBERT to extract features, providing a hidden-state representation of the sequence after multiple embedding layers are run. Then, the Bi-LSTM layer captures the long-distance dependencies in the sequence from the output of ALBERT. Finally, the CRF layer performs sequence labeling and outputs the results in the BIO format. To categorize named entities, each token is assigned a label: B-X, I-X, or O. B-X signifies that a given word belongs to class X and is positioned at the start of the entity. I-X denotes that the word belongs to class X but appears inside the entity's span rather than at its start. O signifies that a word is not associated with any of the categories.
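The BIO scheme described above can be made concrete with a small decoder that turns a tag sequence back into entity spans. This is a minimal sketch of the labeling convention, not the paper's code; the example sentence and the SYMPTOM label are invented.

```python
def bio_to_spans(tokens, tags):
    """Decode a BIO tag sequence into (label, text) entity spans.

    B-X starts an entity of class X, I-X continues it, and O is outside.
    """
    spans, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], tok)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current = (current[0], current[1] + tok)
        else:  # O, or an I- tag that does not continue the open entity
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans

# Character-level example, as used for Chinese text:
tokens = list("患者发热三天")
tags = ["O", "O", "B-SYMPTOM", "I-SYMPTOM", "O", "O"]
print(bio_to_spans(tokens, tags))  # [('SYMPTOM', '发热')]
```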

3.1. Choice of the ALBERT Algorithm

When a knowledge graph is structured, both entities and related data are derived from unstructured texts. Then, accurate label categories are assigned, which is a pivotal task in order to formulate a more reliable database and knowledge graphs. In this study, we chose ALBERT as the primary text-processing algorithm. Before delving into ALBERT, BERT is reviewed.
Bidirectional Encoder Representations from Transformers (BERT), launched by Google in 2018, is a large-scale pre-trained language algorithm based on a bidirectional transformer [25]. Since it has the capability of powerfully extracting text information, substantial performance enhancements in multiple NLP tasks are attained with pioneering outcomes. BERT’s pre-training strategy permits it to master large-scale data and then convey the learned knowledge to various NLP tasks, attaining a strong performance through fine-tuning.
Nevertheless, although an excellent performance is attained, BERT requires relatively large computational resources. BERT masks and predicts only about 15% of the words in a text, resulting in a relatively slow convergence speed at the training stage. Additionally, BERT may run out of memory on GPUs with limited capacity, which restricts its implementation in resource-constrained environments.
To overcome the constraints of BERT and fully unleash its potential, this manuscript introduces A Lite BERT (ALBERT), which is a streamlined version of BERT by Google. Communication overheads in distributed training are effectively reduced by decreasing the parameter numbers. ALBERT implements two algorithms, namely, factorized embedding parametrization and cross-layer parameter sharing [26], to decrease the number of parameters. In addition, ALBERT introduces a new self-supervised learning objective, called Sentence Order Prediction (SOP), to substitute the Next Sentence Prediction (NSP) task in BERT, capturing sentence coherence more accurately. These innovations enable ALBERT to maintain a higher performance level while decreasing computational resource requirements.
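The saving from factorized embedding parametrization is simple arithmetic: instead of a vocabulary-by-hidden embedding matrix of size V × H, ALBERT projects through a small embedding size E, costing V × E + E × H. The sizes below (V = 30,000, H = 768, E = 128) are the common BERT-base/ALBERT defaults, used here for illustration; they are not taken from this paper.

```python
def embedding_params(vocab, hidden, factor=None):
    """Embedding parameter count, with or without factorization.

    Without a factor: V * H (BERT-style, embedding tied to hidden size).
    With a factor E:  V * E + E * H (ALBERT-style factorization).
    """
    if factor is None:
        return vocab * hidden
    return vocab * factor + factor * hidden

V, H, E = 30000, 768, 128
bert_like = embedding_params(V, H)       # 23,040,000 parameters
albert_like = embedding_params(V, H, E)  # 3,938,304 parameters
print(f"reduction: {bert_like / albert_like:.1f}x")
```

Because V dominates both terms, the saving grows with vocabulary size, which is one reason the technique reduces communication overheads in distributed training.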

3.2. The Integration of Bi-LSTM with CRF

To improve the accuracy and precision of the proposed algorithm, the power of ALBERT is integrated with a Bi-LSTM+CRF module, improving the effectiveness of key text extraction. Experimental results indicate that ALBERT+Bi-LSTM+CRF significantly outperforms the case where only ALBERT is used at the training stage; thus, not only is model accuracy improved but so is the overall efficacy of information extraction from unstructured texts.
The Bidirectional Long Short-Term Memory (Bi-LSTM) network excels at capturing complex sequential attributes but may not fully exploit the constraint relations between annotations when labeling tasks are handled. The Conditional Random Field (CRF), on the other hand, is adept at modeling transition probabilities between labels but depends on sequential attributes as inputs. When Bi-LSTM is combined with CRF, Bi-LSTM is utilized to derive features, which are fed into the CRF layer to model the transition probabilities between labels. This combination permits the proposed algorithm to master complex sequential attributes while considering the constraint relationships between labels. Hence, the accuracy of sequence labeling is enhanced.
Bi-LSTM+CRF involves three main consecutive steps: Firstly, the Bi-LSTM layer processes the input sequence, producing high-dimensional attribute representations for each time point. Secondly, the output of Bi-LSTM is transformed into the required format for the CRF layer. Finally, the CRF layer employs the attributes provided by Bi-LSTM and the mastered transition probability model to label the sequence. This combined strategy has succeeded in achieving substantial gains in outcomes when multiple NLP tasks are run, particularly benchmark outcomes when processing sequences with long-distance dependencies.
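The third step above can be illustrated in isolation: a toy Viterbi decoder, in plain Python, that finds the best label path given per-position emission scores (the role played by the Bi-LSTM output) and label-to-label transition scores (the role played by the CRF). The label set, scores, and the forbidden O-to-I transition below are invented for the example, not taken from the paper's trained model.

```python
def viterbi_decode(emissions, transitions):
    """Find the highest-scoring label sequence.

    emissions:   T x L matrix, score of label j at position t
    transitions: L x L matrix, score of moving from label i to label j
    """
    L = len(emissions[0])
    score = list(emissions[0])  # best score of any path ending in each label
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(L):
            best_i = max(range(L), key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score, back = new_score, back + [pointers]
    best = max(range(L), key=lambda l: score[l])
    path = [best]
    for pointers in reversed(back):  # backtrack from the best final label
        path.append(pointers[path[-1]])
    return list(reversed(path))

# Labels: 0=O, 1=B, 2=I. Forbidding O -> I is exactly the kind of labeling
# constraint the CRF layer adds on top of per-position Bi-LSTM scores.
NEG = -1e9
transitions = [[0, 0, NEG],  # from O
               [0, 0, 0],    # from B
               [0, 0, 0]]    # from I
emissions = [[0.1, 2.0, 0.0],  # position 0 prefers B
             [0.2, 0.0, 1.5],  # position 1 prefers I
             [3.0, 0.0, 0.5]]  # position 2 prefers O
print(viterbi_decode(emissions, transitions))  # [1, 2, 0]
```

A framework CRF layer additionally learns the transition matrix during training; only the decoding arithmetic is shown here.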

4. Experimental Process

4.1. Dataset

A single public dataset is insufficient to meet the needs of comprehensive experiments when the complexity of entity and relation extractions is under consideration. Thus, multiple datasets and real hospital text data were incorporated in the validation stage.

4.1.1. The Dataset for Entity Extraction

The CCKS2020 dataset was picked for entity extraction. It was developed as part of the 6th China Conference on Knowledge Graph and Semantic Computing and comprises 1450 cases covering 6 entity label categories, containing a total of approximately 22,000 entities [27]. The dataset was split into 13,346, 4562, and 4538 data samples for the training, validation, and test sets, respectively.

4.1.2. The Dataset for Relation Extraction

The Chinese Medical Information Extraction dataset (CMeIE-V2) was picked for relation extraction, which contains nearly 75,000 triplet data, covering 28,000 disease statements and 53 triplet categories [28]. The data were split into 14,339, 3585, and 4482 data samples for the training, test, and validation sets, respectively. The whole dataset contains pediatric training corpora for 518 pediatric diseases and 109 common diseases.

4.1.3. Annotation of Real Medical Texts

To validate the efficacy of the proposed algorithm in real scenarios, real data from a hospital were obtained, along with 200 manually annotated pathological diagnoses and imaging report results. However, due to the relatively limited number of observations, this part of the research was mainly employed to validate the experimental process. The outcomes have not yet reached an optimized level and are therefore not included in this article.

4.2. Data Preprocessing and Label Definition

4.2.1. The Definition of Entity Labels

For entity extraction tasks, medical entity annotation rules were referred to and combined with actual needs to delineate 6 major categories of entity labels, namely, location, surgery, disease, examination, medication, and project indicators. The original text was preprocessed by splitting sentences, by using punctuation marks and the BIO annotation rules, which were utilized to label each character in the text in order to produce texts that met the training requirements. The specific data are shown in Table 1.
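The preprocessing described above (sentence splitting on punctuation, then character-level BIO tagging) can be sketched as follows. The regular expression, label name, and example spans are illustrative assumptions; the paper's exact splitting rules may differ.

```python
import re

def split_sentences(text):
    """Split raw Chinese text on terminal punctuation, keeping the mark."""
    parts = re.split(r"(?<=[。！？；])", text)
    return [p for p in parts if p]

def char_bio_labels(sentence, entities):
    """Tag each character with BIO labels given (start, end, label) spans."""
    tags = ["O"] * len(sentence)
    for start, end, label in entities:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return list(zip(sentence, tags))

text = "患者主诉头痛。建议头颅CT检查。"
sents = split_sentences(text)
# The span (4, 6, "SYM") marks the characters "头痛" in the first sentence.
labeled = char_bio_labels(sents[0], [(4, 6, "SYM")])
```

The output pairs each character with its tag, which matches the one-character-per-line format typically required for training.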

4.2.2. The Definition of Relation Labels

For relation extraction tasks, some categories in the original dataset had too little data for an effective analysis. To address this, the original 53 categories were summarized, integrated, and ultimately merged into 19 final categories. Key information, such as treatment methods, was prioritized when data were integrated, while medical history and environmental factors were appropriately compressed. The specific data are shown in Table 2.

4.3. Results and Analysis

4.3.1. Results

Experiments were conducted in two distinct forms. However, the primary focus was on extracting entities. Table 3 presents a detailed analysis of the experimental results.
To demonstrate the effectiveness of the proposed algorithm, well-known algorithms were used in comparison based on the CCKS2020 dataset [29,30]. We observed that the ALBERT+Bi-LSTM+CRF algorithm achieved satisfactory results in the F1-score, accuracy, and recall metrics. Notably, when compared with the BERT+Bi-LSTM+CRF algorithm, it significantly reduced the number of model parameters and the algorithm’s complexity, yet still maintained a high performance level.
We also analyzed the impact of incorporating the Bi-LSTM+CRF module on the model’s performance, and we found that the ALBERT algorithm could swiftly converge and attain stable outcomes in a short training cycle. However, the addition of more training cycles did not bring improved outcomes.
Since the medical field imposes strict requirements for accurate data, Bi-LSTM and CRF were introduced to construct a more refined algorithm. As shown in Figure 4, although the proposed algorithm performed relatively poorly in the early stages of training, it produced a better outcome within an acceptable time frame when the training cycle was appropriately extended. Specifically, the F1-score, precision, and recall of the ALBERT+Bi-LSTM+CRF algorithm reached 91.8%, 92.5%, and 94.3%, respectively. Compared with ALBERT alone, the F1-score, precision, and recall increased by 1.3%, 5.3%, and 4.7%, respectively.
This outcome indicates that key entities can be effectively extracted from medical texts. Thus, robust support for the construction of medical knowledge graphs is provided. Also, clinical implementations can be combined with advanced deep learning architectures and sequence labeling techniques.
Table 4 presents the experimentation’s results in relation extraction.
Once the algorithm was constructed, triplets were extracted from the text by using the constructed entity extraction and relationship extraction algorithms. Table 5 summarizes the test results, which were mainly derived from public datasets, real hospital texts, and consultation guides. After entities and relations are successfully extracted, a detailed screening and reorganization of the data is conducted. First, duplicate nodes are removed to ensure the data's consistency. Then, synonyms, i.e., entity pairs labeled as equivalent, are detected and merged to decrease redundancy. Finally, 46,599 unique nodes and 149,641 relationships were obtained.
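The screening step (duplicate removal, then synonym merging) can be sketched as follows. This is an illustrative toy, not the paper's pipeline; the triplets and synonym map are invented examples.

```python
def deduplicate(triplets, synonyms):
    """Drop duplicate triplets and collapse synonym pairs onto one name.

    synonyms is a list of (canonical, variant) entity pairs; every variant
    is rewritten to its canonical name before the set removes duplicates.
    """
    canon = {variant: canonical for canonical, variant in synonyms}
    norm = lambda e: canon.get(e, e)
    return {(norm(h), r, norm(t)) for h, r, t in triplets}

triplets = [
    ("感冒", "症状", "发热"),
    ("感冒", "症状", "发热"),          # exact duplicate
    ("上呼吸道感染", "症状", "发热"),  # synonym of 感冒
]
merged = deduplicate(triplets, synonyms=[("感冒", "上呼吸道感染")])
print(len(merged))  # 1
```

The resulting unique triplets are what would then be loaded into the graph database.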
They were combined and plugged into the Neo4j database, which provides powerful support through its efficient data retrieval and visualization capabilities as a representative of graph databases. The graphical medical knowledge represented by a node-edge diagram has storage and display functions. Figure 5 depicts the eventual graph database of medical knowledge that we produced, which presents comprehensive information resources and powerful analysis tools for research and applications in the medical field. In this specific knowledge graph, all entity and relationship descriptions are conducted in Chinese.
Figure 6 depicts graphical medical knowledge stored and displayed through the Neo4j database. The nodes in the graph denote entities such as diseases, symptoms, and medications, while the edges designate the association types between these entities, such as causal, similar, and so on. Thus, a clearer comprehension of connections and patterns within the medical knowledge database can be grasped, and powerful support for clinical diagnosis and treatment through representation is attained.

4.3.2. Error Analysis of Cases

To conduct a detailed analysis of the errors, we randomly selected 100 test entries, which included 978 entity instances and 489 relationship instances. We identified approximately 196 error cases, comprising 87 entity and 109 relationship errors, accounting for about 44% and 56% of the errors, respectively.
We categorized the entity and relationship data into several types based on different error causes. Figure 7 illustrates the specific distributions of these errors.
As the figure shows, the errors can generally be summarized as follows. For entity extraction, there are several types of errors:
(1) The original entity is not extracted or just partially extracted: This is the most common type of error in entity extraction and is often encountered with more complex entities that have complete sentence structures, including punctuation marks or conjunctions, for example, “and.” The model tends to break down these entities into smaller units. Due to the low absolute number of such complex entities and the difficulty in finding similarities among them, the learning performance of features in the model is suboptimal.
(2) Misclassification: The entity is correctly extracted but is assigned to the wrong category. This error often occurs between two-entity categories that have some similarities, such as “Symptom” and “Pathological Staging,” which both describe parts of a disease, leading to misclassification when processing the data. However, clarifying the definition of each category during annotation should help reduce these occurrences.
(3) Extraction of superfluous entity information: Technically, an error has occurred. Yet, when the errors are reviewed, some of these are acceptable. For example, the model extracts “liver function” as “liver function test,” which does not change the information contained in the entity and does not affect the extraction process of the subsequent relationship. Therefore, it may be possible to incorporate them into the knowledge graph through some semantic similarity calculations, since such similarities are difficult to define under a category.
For relationship extraction, there are several types of errors:
(1) Relationship errors due to entity errors: This is the most common type of error in relationship extraction. Since relationship extraction is performed after entity extraction, the accuracy of entity extraction directly affects its outcome; errors or omissions in entity extraction propagate directly into relationship extraction. A second source of error involves particular relationship categories, for instance, “Synonyms” and “Correlation.” After two entities A and B are connected with the relationship “Synonyms” or “Correlation,” the model tends to consider these two entities equivalent and attributes the other relationship information of entity A equally to entity B. This behavior is generally acceptable for the “Synonyms” category, but it causes significant problems for the “Correlation” category. Unlike synonyms, the two entities associated with the “Correlation” category cannot replace each other; they may simply be two different entities that are likely to appear together. Therefore, we believe that the definition of the “Correlation” category needs further refinement.
(2) Misclassification: The entities are correctly extracted but are assigned to the wrong relationship category. This error often occurs between two relationship categories that have some similarities, such as “Epidemiological” and “Sociological,” which both describe parts of the causes of disease outbreaks, leading to misclassification when processing the data.
(3) Relationships not extracted: The entities are correctly extracted, but the corresponding relationships are not identified based on the entities. This type of error is relatively rare and may require more case analysis to be conducted.

5. Conclusions and Future Prospects

5.1. Discussion

Huge and varied datasets in the medical field bring several challenges and opportunities. The information, data, and links in medical texts can be effectively integrated by constructing a base of medical semantic knowledge. Then, a transformation process can be conducted to generate structured knowledge that is easy to derive insights, patterns, and relations from, so a tool that helps support better assessments and comprehension can be attained. Not only are unstructured data in medical texts then organized but also smart medical technology can be better harnessed [31].
In this study, we implemented the proposed ALBERT+Bi-LSTM+CRF algorithm to conduct comprehensive learning tasks on medical texts, successfully attaining high-precision methods for entity and relation extraction. A graphical database for medical text knowledge was constructed based on the triplet representation of information. Thus, structured data support for auxiliary guidance in medical treatment decision-making was produced. The implementation of ALBERT ensures precision while substantially decreasing the complexity and memory requirements during training, enabling flexible training and updating of models within hospitals. The addition of Bi-LSTM+CRF further advances precision, which is of the utmost significance in the medical field, where rigorous and precise outcomes are pursued.
After entities and relations are successfully extracted, a detailed screening and reorganization of the data are conducted. First, duplicate nodes are removed to ensure data consistency. Then, synonyms, which are labeled entity pairs, are detected and merged to decrease redundancy.
When entities are extracted, the constructed algorithm achieves 91.8%, 92.5%, and 94.3% for the F-score, precision, and recall, respectively. The proposed algorithm also achieves remarkable outcomes in extracting relations, with 88.3%, 88.1%, and 88.4% for the F-score, precision, and recall, respectively.

5.2. Conclusions

Despite the outstanding achievements, there is still room for further progress. First, training has relied mainly on public datasets. The proposed algorithm still needs to incorporate medical clinical texts derived from real cases, so that it can be closely aligned with clinical practices and more accurately reflect hospitals' diagnosis and treatment processes. The algorithm has been verified on real, manually annotated, small-scale medical texts [32]. The scale of the implemented data needs to be further increased to produce a more practical and accurate algorithm. Therefore, future development work will focus on this direction.
Second, current knowledge graphs primarily focus on text data, but the medical field also possesses abundant medical image data [33]; thus, a multi-modal knowledge graph should be constructed. Image data often reflect a patient's physical condition more intuitively and in greater detail than text reports. Incorporating medical images into the construction of knowledge graphs and deeply mining the information within them will therefore greatly enrich graph content and improve reasoning capabilities.
Finally, the labeling structure of the knowledge graph needs further refinement [34]. This research focuses only on extracting basic entity categories, such as diseases. However, medical applications require more comprehensive extraction, such as the symptoms and medications associated with a patient's condition, which is a far more complex setting. For example, to describe a patient's disease course more completely, the development of symptoms must be tracked along a temporal dimension, and multi-dimensional datasets corresponding to the patient's demographic and medical attributes should be implemented. A more complete graphical knowledge database spanning these dimensions should therefore be constructed.
The developed algorithm will be applied to real-world scenarios. However, the efficiency of natural language text annotation needs to be improved, and models must be trained on large numbers of real medical texts. We have therefore begun to incorporate large language models (LLMs) into the knowledge graph construction process. Leveraging the strong semantic analysis and generation capabilities of LLMs, entity and relation triplets can be extracted efficiently from natural language texts, supplemented by manual review to ensure data accuracy. We are also advancing collaborations with multiple hospitals to identify practical applications for knowledge graphs. In addition, an ongoing study aims to use the knowledge graph to predict the likelihood of complications in patients undergoing percutaneous lung ablation: by collecting patients' basic information, laboratory and examination reports, and ablation surgery records, a knowledge graph for lung ablation is being constructed, and its structural features are being analyzed to predict each patient's probability of developing surgical complications.
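One way the LLM-assisted step could work in practice is to prompt the model to emit one "(head | relation | tail)" triplet per line and then parse the reply into candidates queued for manual review. The response format and the example reply below are assumptions for illustration, not the system's actual interface or output.

```python
# Hypothetical parser for LLM-generated triplets, assuming the model is
# prompted to answer with one "(head | relation | tail)" line per fact.
# Lines that do not match the expected shape are simply skipped, so
# malformed generations never enter the graph without review.
import re

TRIPLET = re.compile(r"\(\s*([^|()]+?)\s*\|\s*([^|()]+?)\s*\|\s*([^|()]+?)\s*\)")

def parse_llm_triplets(reply):
    """Extract (head, relation, tail) tuples from a model reply."""
    return [m.groups() for m in TRIPLET.finditer(reply)]

reply = """(肺癌 | Symptom | 咳嗽)
(肺癌 | Medication | 吉非替尼)
this malformed line is ignored"""
print(parse_llm_triplets(reply))
# → [('肺癌', 'Symptom', '咳嗽'), ('肺癌', 'Medication', '吉非替尼')]
```

Constraining the output to a rigid, machine-checkable format keeps the manual review focused on factual correctness rather than on repairing free-form text.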
Future research will focus on the following directions:
(1) Expand the data scale to generate a more feasible algorithm that has practicality and accuracy in real cases.
(2) Construct a multi-modal knowledge graph by integrating medical image data to enrich the content of the knowledge graph and improve its reasoning capabilities.
(3) Improve the labeling structure of knowledge graphs to achieve more comprehensive extraction, such as the symptoms and medications related to patients’ conditions.
(4) Investigate the integration of NLP with medical image recognition by constructing a cross-modal medical knowledge graph to provide more comprehensive and accurate support for medical decision-making.
(5) Combine the knowledge graph with AI algorithms to develop intelligent systems that automatically recommend medical diagnoses and treatment protocols, further improving the efficiency and quality of medical services.

Author Contributions

Conceptualization, T.F., Y.Y. and L.Z.; methodology, T.F., Y.Y. and L.Z.; software, T.F. and Y.Y.; validation, T.F.; formal analysis, T.F.; investigation, T.F., Y.Y. and L.Z.; resources, T.F. and Y.Y.; data curation, T.F.; writing—original draft preparation, T.F.; writing—review and editing, T.F. and Y.Y.; visualization, T.F. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Medical data used to support the findings of this study were supplied by the China Conference on Knowledge Graph and Semantic Computing (CCKS) and the Chinese Biomedical Language Understanding Evaluation (CBLUE): https://tianchi.aliyun.com/cblue/ (accessed on 2 July 2024).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cai, Z.X.; Cai, Y.F. Clinical application and technology of smart medical. J. Med. Inform. 2021, 42, 48–53. (In Chinese) [Google Scholar]
  2. Li, C.L.; Zhao, C.; Si, Q.; Yan, M.L.; Li, Y.T.; Zhang, X. Development status and the future of smart medical treatment. Life Sci. Instrum. 2021, 19, 4–13. (In Chinese) [Google Scholar]
  3. Zhang, H.; Zong, Y.; Chang, B.B.; Sui, Z.F.; Zan, H.Y.; Zhang, K.L. Medical entity annotation specification for medical text processing. In Proceedings of the Chinese National Conference on Computational Linguistics, Haikou, China, 30 October–1 November 2020. (In Chinese). [Google Scholar]
  4. Wang, H.C.; Zhao, T.J. Research and development of biomedical text mining. J. Chin. Inf. Process. 2008, 22, 89–98. (In Chinese) [Google Scholar]
  5. Sun, Z.; Wang, H.L. Overview of the advance of the research on named entity recognition. Data Anal. Knowl. Discov. 2010, 6, 42–47. (In Chinese) [Google Scholar]
  6. Berners-Lee, T.; Hendler, J.; Lassila, O. The Semantic Web. Scientific American Magazine. Available online: https://www.scientificamerican.com/article/the-semantic-web/ (accessed on 1 May 2001).
  7. Sheth, A.; Thirunarayan, K. Semantics Empowered Web 3.0: Managing Enterprise, Social, Sensor, and Cloud-Based Data and Service for Advanced Applications; Morgan and Claypool: San Rafael, CA, USA, 2013. [Google Scholar]
  8. Amit, S. Introducing the Knowledge Graph. Official Blog of Google. Available online: http://googleblog.blogspot.pt/2012/05/introducing-knowledge-graph-things-not.html (accessed on 2 January 2015).
  9. Zhao, Y.H.; Liu, L.; Wang, H.L.; Han, H.Y.; Pei, D.M. Survey of knowledge graph recommendation system research. J. Front. Comput. Sci. Technol. 2023, 17, 771–791. (In Chinese) [Google Scholar]
  10. Mi, Z.H.; Qian, A.B. Research status and trend of smart healthcare: A literature review. Chin. Gen. Pract. 2019, 22, 366–370. (In Chinese) [Google Scholar]
  11. Wang, Z.C.; Feng, J.Y. Application of a digital health system based on the Internet of Things in China. Chin. Med. Devices 2022, 37, 174–179. (In Chinese) [Google Scholar]
  12. Lan, G.; Hu, M.T.; Li, Y.; Zhang, Y.Z. Contrastive knowledge integrated graph neural networks for Chinese medical text classification. Eng. Appl. Artif. Intell. 2023, 122, 106057. [Google Scholar] [CrossRef]
  13. Dong, X.L.; Zheng, W.F. Emerging technologies for drug repurposing: Harnessing the potential of text and graph embedding approaches. Artif. Intell. Chem. 2024, 2, 100060. [Google Scholar] [CrossRef]
  14. Guo, L.; Li, X.L.; Yan, F.; Lu, Y.Q.; Shen, W.P. A method for constructing a machining knowledge graph using an improved transformer. Expert Syst. Appl. 2024, 237, 121448. [Google Scholar] [CrossRef]
  15. Lin, Y.C.; Lu, K.M.; Yu, S.; Cai, T.X.; Zitnik, M. Multimodal learning on graphs for disease relation extraction. J. Biomed. Inform. 2023, 143, 104415. [Google Scholar] [CrossRef]
  16. Sung, Y.W.; Park, D.S.; Kim, C.G. A study of BERT-based classification performance of text-based health counseling data. Comput. Model. Eng. Sci. 2022, 135, 795–808. [Google Scholar]
  17. Wang, Y.M.; Zhang, J.D.; Yang, Z.Y.; Wang, B.; Jin, J.Y.; Liu, Y.T. Improving extractive summarization with semantic enhancement through topic-injection based BERT model. Inf. Process. Manag. 2024, 61, 103677. [Google Scholar] [CrossRef]
  18. Shiney, J.; Raghuveera, T. COVID-based question criticality prediction with domain adaptive BERT embeddings. Eng. Appl. Artif. Intell. 2024, 132, 107913. [Google Scholar]
  19. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Available online: http://arxiv.org/abs/1810.04805 (accessed on 11 October 2018).
  20. Wu, Z.M.; Liang, J.; Zhang, Z.A.; Lei, J.B. Exploration of text matching methods in Chinese disease Q&A systems: A method using ensemble based on BERT and boosted tree models. J. Biomed. Inform. 2021, 115, 103683. [Google Scholar]
  21. Yang, P.R.; Wang, H.J.; Huang, Y.Z.; Yang, S.; Zhang, Y.; Huang, L.; Zhang, Y.S.; Wang, G.X.; Yang, S.Z.; He, L.; et al. LMKG: A large-scale and multi-source medical knowledge graph for intelligent medicine applications. Knowl. Based Syst. 2024, 284, 111323. [Google Scholar] [CrossRef]
  22. Xu, Z.L.; Sheng, Y.P.; He, L.R.; Wang, Y.F. Review on knowledge graph techniques. J. Univ. Electron. Sci. Technol. China 2016, 45, 589–606. (In Chinese) [Google Scholar]
  23. Liu, Q.; Li, Y.; Duan, H.; Liu, Y.; Qin, Z.G. Knowledge graph construction techniques. J. Comput. Res. Dev. 2016, 53, 582–600. (In Chinese) [Google Scholar]
  24. Huang, M.; Xu, G.J.; Li, H.L. Construction of personalized learning service system based on deep learning and knowledge graph. Appl. Math. Nonlinear Sci. 2024, 9. [Google Scholar] [CrossRef]
  25. Lan, Z.Z.; Chen, M.D.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A lite BERT for self-supervised learning of language representations. In Proceedings of the International Conference on Learning Representations, Virtual, 26 April–1 May 2020. [Google Scholar]
  26. Tu, Y.; Chi, M. E-Business. Digital Empowerment for an Intelligent Future; Springer: Cham, Switzerland, 2023. [Google Scholar]
  27. China Medical Knowledge Graph Research Association. CCKS2020 dataset. In Proceedings of the 8th Chinese Conference on Natural Language Processing and Chinese Computing, Dunhuang, China, 9–14 October 2019; pp. 100–110. [Google Scholar]
  28. Guan, P.; Zan, H.; Zhou, X.; Xu, H.; Zhang, K. CMeIE: Construction and evaluation of Chinese medical information extraction dataset. In Natural Language Processing and Chinese Computing, 9th CCF International Conference, Zhengzhou, China, 14–18 October 2020; Springer International Publishing: Berlin/Heidelberg, Germany, 2020. [Google Scholar]
  29. Li, Z.M.; Yun, H.Y.; Wang, Y.Z. Medical named entity recognition based on BERT with multi-feature fusion. J. Qingdao Univ. (Nat. Sci. Ed.) 2021, 34, 23–29. (In Chinese) [Google Scholar]
  30. Gao, W.C.; Zheng, X.H.; Zhao, S.S. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF. J. Phys. Conf. Ser. 2021, 1848, 012083. [Google Scholar] [CrossRef]
  31. Hou, M.W.; Wei, R.; Lu, L.; Lan, X.; Cai, H.W. A survey of knowledge graph research and its application in the medical field. Comput. Res. Dev. 2018, 55, 2587–2599. (In Chinese) [Google Scholar]
  32. Li, M.D.; Zhang, P.; Li, G.L.; Jiang, W.; Li, K.; Cai, P.Q. Study on Chinese medical named entity recognition algorithm. J. Med. Inform. 2022, 43, 45–51. (In Chinese) [Google Scholar]
  33. Tan, L.; E, H.H.; Kuang, Z.M.; Song, M.N.; Liu, Y.; Chen, Z.Y.; Xie, X.X.; Li, J.D.; Fan, J.W.; Wang, Q.C.; et al. Construction technologies and research development of medical knowledge graph. Appl. Res. Comput. 2021, 7, 80–104. (In Chinese) [Google Scholar]
  34. Huang, H.X.; Wang, X.Y.; Gu, Z.W.; Liu, J.; Zang, Y.N.; Sun, X. Research on construction technology and development status of the medical knowledge graph. Comput. Eng. Appl. 2023, 59, 33–48. (In Chinese) [Google Scholar]
Figure 1. The extraction of key information from medical texts.
Figure 2. The architecture of the knowledge graph.
Figure 3. The training process of ALBERT+Bi-LSTM+CRF.
Figure 4. Performance comparison of ALBERT and ALBERT+Bi-LSTM+CRF.
Figure 5. A display of a graphical database of medical knowledge.
Figure 6. A localized scheme of the medical knowledge graph.
Figure 7. Error types for entity and relationship.
Table 1. The distribution of the entity extraction in the datasets.

Entity Type    Train Set               Validation Set          Test Set
               Entries  Percentage/%   Entries  Percentage/%   Entries  Percentage/%
Body           6061     45.4           1965     43.1           1990     43.8
Operation      606      4.5            216      4.7            217      4.8
Disease        3032     22.7           1069     23.4           932      20.5
Examination    860      6.4            282      6.2            252      5.6
Drug           2006     15.0           790      17.3           863      19.0
Item           776      5.8            214      4.7            313      6.9
Table 2. The distribution of the relation extractions in the datasets.

Relation Type         Train Set               Validation Set          Test Set
                      Entries  Percentage/%   Entries  Percentage/%   Entries  Percentage/%
Epidemiological       3895     4.5            1002     4.6            1043     4.8
Prognosis             388      0.1            110      0.5            107      0.5
Sociological          6649     7.6            1720     7.9            1756     8.1
Synonyms              4848     5.6            1144     5.2            1084     5.0
Prevention            2479     2.9            540      2.5            567      2.6
Post-treatment        147      0.1            37       0.1            55       0.3
Imaging               2238     2.6            505      2.3            521      2.4
Medication            6690     7.7            1547     7.1            1488     6.9
Symptom               22,373   25.7           5678     26.0           5621     25.9
Pathological staging  4304     4.9            1182     5.4            1145     5.3
Stage                 1216     1.4            318      1.5            309      1.4
Complications         5187     6.0            1324     6.1            1297     6.0
Other inspections     1576     1.8            394      1.8            408      1.9
Operation             1681     1.9            404      1.8            391      1.8
Diagnosis             2900     3.3            754      3.4            776      3.6
Body part             4376     5.0            1130     5.2            1106     5.3
Other treatments      2848     3.3            715      3.3            703      3.2
Laboratory            3210     3.7            851      3.9            844      3.9
Correlation           9971     11.5           2529     11.6           2487     11.5
Table 3. Experimental results for entity extraction.

Model                 F1-Score/%   Precision/%   Recall/%
Word2vec-BiLSTM-CRF   75.1         73.2          72.8
Bi-LSTM+CRF           88.6         88.2          88.4
BERT+Bi-LSTM+CRF      90.1         94.9          92.3
ALBERT                90.5         87.2          89.6
ALBERT+Bi-LSTM+CRF    91.8         92.5          94.3
Table 4. The results of relation extraction.

Model                 F/%    Precision/%   Recall/%
ALBERT+Bi-LSTM+CRF    88.3   88.1          88.4
Table 5. Data overview of the knowledge graph.

Data Source                          Entries   Number of Triplets
Public datasets                      14,335    64,757
Real hospital data                   13,580    172,354
Diagnosis and treatment guidelines   357       1010
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Fang, T.; Yang, Y.; Zhou, L. Enhanced Precision in Chinese Medical Text Mining Using the ALBERT+Bi-LSTM+CRF Model. Appl. Sci. 2024, 14, 7999. https://doi.org/10.3390/app14177999
