3.1. Data Source Classification
Since different categories of medical data contain different medical knowledge, the value of medical big data can be realized only by fusing multiple medical data sources. For example, Wu et al. [10] proposed fusing multiple data resources to construct knowledge graphs and thereby improve their practical application value. Therefore, in this paper, three medical data resources, i.e., medical dictionaries, electronic medical records, and online medical communities, were selected to improve the practical application value of medical knowledge graphs. To effectively enhance the richness of the medical knowledge graph, this paper compiled statistics on the characteristics of the selected data sources and formulated a data source selection strategy for constructing multi-data source medical knowledge graphs. The statistical results and specific characteristics are described as follows.
- (a) Medical Dictionary
The medical dictionary category mainly comprises existing medical dictionary resources, such as the International Classification of Diseases manual (ICD-11). These resources are highly professional and are among the most important data sources of the medical knowledge graph.
The International Classification of Diseases (ICD) is a manual for classifying diseases published by the World Health Organization (WHO). ICD-11 (http://www.medsci.cn/sci/icd-10.asp, accessed on 10 December 2022) is the 11th revision of ICD, and China has participated in the revision, research, and development process of ICD-11. Figure 2 shows some examples of disease codes from the ICD-11 disease classification manual. The first column is the disease name, and the second column is the disease code, which is used to uniquely identify a disease entity. ICD-11 is the first revision to have an all-electronic version, which is easy to use and reduces error rates. A total of 55,000 codes were included in this revision, far more than the 14,400 codes of ICD-10. At present, the Chinese translation provided by the State Hospital Administration is the ICD-11 Concise Code List, with a total of 32,198 entries.
- (b) Chinese Electronic Medical Record
A Chinese electronic medical record is a clinician's record of a patient's diagnosis and treatment process, mainly including the discharge summary and various treatment records. It contains many professional medical terms and a high density of entity vocabulary, and it is an important data source for constructing the medical knowledge graph.
With the support of local hospitals, we obtained 2000 Chinese electronic medical records; the statistical results are shown in Table 1. The main entity types in the outpatient records and consultation records are body parts, symptoms, and examinations; the main entity types in the medical history records and discharge records are body parts, symptoms, and examinations.
- (c) Online Medical Community
The advantages of online medical communities as an emerging source of medical data are as follows:
- (a) There are plentiful medical data resources available for mining;
- (b) The medical data originate from the real situations of users;
- (c) The data are timelier and are updated faster.
In addition, online medical data are publicly available, contain no private data, and are large in volume. Although their quality is lower than that of the first two types of medical data, it is improving day by day, and they serve as an important complementary data source for the medical knowledge graph.
3.2. Data Strategy
3.2.1. Data Source Selection Strategy
The data source selection strategy developed in this paper first took the quality of the data sources as a precondition to ensure the accuracy of the medical data. Then, using the statistical results of each data source as the applicability criterion, the characteristics of the different data sources were exploited to maximize their value. Specific implementation details are described below:
- (a) Firstly, because publicly available medical dictionaries carry professional expertise, they were used as the base data to provide standards for the medical knowledge graph, for example, standards for disease entity names and coding rules;
- (b) Secondly, clinical experience data have higher authority and validity, so electronic medical records were used as the core data of the medical database to help improve the accuracy of intelligent diagnosis; they provide, for example, detailed symptom narratives, clinical manifestations, drug effects, and treatments;
- (c) Finally, medical community data sources can not only greatly enrich the diversity of the medical database but also keep medical knowledge timely through the Web. However, considering the quality of these data, they were used as supplementary data to the medical database, covering, for example, dietary issues, the usage, dosage, and precautions of drugs, medical science, etc.
Considering the openness and availability of data, we chose publicly available medical dictionaries, online medical communities, and a certain number of electronic medical records obtained from local hospitals as the medical database in this paper. Among them, the electronic medical records were desensitized through manual processing, contain no private user data, and were used only for academic research.
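The prioritization described in this subsection can be sketched as a small configuration. This is only an illustrative sketch: the names `SourceTier`, `DataSource`, `SOURCES`, and `by_priority` are hypothetical and do not come from the paper, which specifies the three roles (base, core, supplementary) but not any implementation.

```python
from dataclasses import dataclass
from enum import IntEnum


class SourceTier(IntEnum):
    """Role of a data source; a lower value means higher authority (assumed encoding)."""
    BASE = 1           # medical dictionaries, e.g., ICD-11
    CORE = 2           # desensitized electronic medical records
    SUPPLEMENTARY = 3  # online medical communities


@dataclass(frozen=True)
class DataSource:
    name: str         # hypothetical identifier for the source
    tier: SourceTier  # role assigned by the selection strategy
    provides: str     # kind of knowledge the source contributes


SOURCES = [
    DataSource("online_medical_community", SourceTier.SUPPLEMENTARY,
               "dietary issues, drug usage, dosage, and precautions"),
    DataSource("medical_dictionary", SourceTier.BASE,
               "standard disease entity names and coding rules"),
    DataSource("electronic_medical_record", SourceTier.CORE,
               "symptom narratives, clinical manifestations, treatments"),
]


def by_priority(sources):
    """Return the sources ordered from most to least authoritative."""
    return sorted(sources, key=lambda source: source.tier)


ordered_names = [source.name for source in by_priority(SOURCES)]
```

Encoding the tier as an ordered value makes it straightforward to consult the same ranking later when entities from different sources conflict.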
3.2.2. Two-Step Data Fusion Strategy
Data fusion is a key step in constructing knowledge graphs from multiple data sources. Wu Yunbing et al. first constructed domain ontology libraries and used rules such as similarity detection and conflict resolution to fuse the ontology libraries of multiple domains into a global ontology library. They targeted the construction of multi-data source knowledge graphs for generic domains and gave a standardized process covering data acquisition, data processing, data fusion, and knowledge graph construction, but it could not be directly used in the medical domain [10]. By researching the literature in the medical field, we found that disease is the most critical entity: doctors diagnose diseases through symptoms, treatments are selected for different diseases, tests are performed to accurately determine diseases, and so on. Other medical entities are connected to each other through disease entities. We concluded that the disease entity has the characteristic of a “transportation hub” among the many predefined medical entities. Based on this characteristic, this paper proposed a two-step data fusion strategy for medical data.
In addition, the target of data fusion is the integration of data and knowledge from different data sources [11]. The data fusion methods used in multi-data source knowledge graphs target only the data entities themselves, ignoring fusion at the level of triples of the form “entity1-relationship type-entity2”. In medical data, there are many different types of disease entities, and it is easy to mistakenly fuse two disease expressions that are similar but have different meanings into one entity. To avoid such errors, this paper prioritized the different data sources according to their characteristics in the data source selection strategy. Moreover, in entity alignment, the medical dictionary was used as the alignment standard to avoid fusing different entities into the same concept.
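A minimal sketch of dictionary-based entity alignment is given below. The paper does not specify the matching procedure, so exact matching after normalization plus a curated synonym list is an assumption made here for illustration; the names `DICTIONARY`, `SYNONYMS`, `normalize`, and `align` are hypothetical, and the codes are placeholders rather than real ICD-11 codes.

```python
import unicodedata

# Hypothetical dictionary excerpt: standard disease name -> code.
# The codes are placeholders, not real ICD-11 codes.
DICTIONARY = {
    "type 1 diabetes mellitus": "CODE-001",
    "type 2 diabetes mellitus": "CODE-002",
}

# Hypothetical curated synonyms: surface form -> standard name.
SYNONYMS = {
    "type 2 diabetes": "type 2 diabetes mellitus",
    "t2dm": "type 2 diabetes mellitus",
}


def normalize(mention: str) -> str:
    """Unify Unicode width and letter case, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", mention).casefold()
    return " ".join(text.split())


def align(mention: str):
    """Return the standard name for a mention, or None if it is unmatched.

    Only exact matches against the dictionary or the synonym list are
    accepted, so similar-looking names are never merged by similarity alone.
    """
    key = normalize(mention)
    key = SYNONYMS.get(key, key)
    return key if key in DICTIONARY else None
```

Under this assumption, "type 1 diabetes mellitus" and "type 2 diabetes mellitus" stay distinct despite their high string similarity, and a mention that matches neither the dictionary nor the synonym list is left unaligned rather than forced onto the nearest entry.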
Commonly used data fusion methods process medical entities acquired from different data sources by different methods through the same entity alignment method, which fails to resolve entity conflicts effectively. The two-step data fusion strategy proposed in this paper added the type of data source (basic data, core data, and supplementary data) as a consideration to ensure the correctness of the entity alignment operation when entity conflicts are encountered. As a result, the triplet data “entity1-relationship type-entity2” were processed twice. Finally, according to the “hub” characteristic of disease entities in medical data, medical entities from different data sources were fused according to medical relationship type and data source classification (basic data, core data, and supplementary data), based on disease entity names, to obtain the final triplet data.
The two-step data fusion strategy first adopted the most authoritative data source, the medical dictionary, as the alignment standard in the entity alignment process, avoiding the error of fusing similar disease entities into the same entity. Then, according to the “hub” characteristic of disease entities in the medical field, the medical data from the three data sources were fused according to their respective functions, and the medical entities from different data sources were linked through disease entities into a “treatment plan” in the form of knowledge. With these two characteristics, we improved the effectiveness of data fusion in terms of entity fusion and the effectiveness of knowledge fusion in terms of triplet fusion.
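The two steps can be illustrated with the following sketch. It rests on several assumptions that the paper does not state: triples are represented as `(head, relation, tail, tier)` tuples with the disease as the head entity, conflicts are resolved by keeping the most authoritative tier that asserted a fact, and unaligned mentions are skipped. The names `RANK`, `fuse`, and `STANDARD`, as well as the toy data, are hypothetical.

```python
from collections import defaultdict

# Source tiers from the selection strategy; a lower rank wins conflicts (assumed).
RANK = {"base": 0, "core": 1, "supplementary": 2}


def fuse(triples, align):
    """Fuse (head, relation, tail, tier) tuples around disease entities.

    Step 1: map each disease mention to its standard name with `align`;
            unmatched mentions are skipped rather than guessed.
    Step 2: group by (disease, relation) and keep each tail once,
            recording the most authoritative tier that asserted it.
    """
    fused = defaultdict(dict)
    for head, relation, tail, tier in triples:
        disease = align(head)
        if disease is None:
            continue
        tails = fused[(disease, relation)]
        if tail not in tails or RANK[tier] < RANK[tails[tail]]:
            tails[tail] = tier
    return {key: dict(value) for key, value in fused.items()}


# Toy alignment table standing in for the medical dictionary (hypothetical).
STANDARD = {"influenza": "influenza", "flu": "influenza"}

triples = [
    ("Flu", "has_symptom", "fever", "supplementary"),
    ("influenza", "has_symptom", "fever", "core"),
    ("influenza", "has_symptom", "cough", "core"),
    ("flu", "dietary_advice", "drink fluids", "supplementary"),
    ("avian cold", "has_symptom", "fever", "supplementary"),
]

graph = fuse(triples, lambda mention: STANDARD.get(mention.strip().lower()))
```

In this toy run, the two "fever" assertions collapse into a single fact attributed to the core data, the dietary advice from the community source is attached to the same disease hub, and the unaligned mention "avian cold" is left out instead of being merged into a similar entity.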