1. Introduction
There are approximately 19.3 million new cancer cases diagnosed worldwide every year. Gastric cancer has the fifth highest incidence and the fourth highest mortality among cancers, and about two-thirds of all cases are found in East Asia and Southeast Asia [1,2]. Achieving precision medicine for gastric cancer will hinge on the viability of big data analytics and AI models rooted in gastric cancer clinical data. While AI is widely used in biomedical science [3,4], medical data analysis models built for other diseases, and algorithms adapted to other conditions, are ineffective in the field of tumors, which hinders the practical benefits of tumor clinical analytics. This is because the granularity of clinical records differs across diseases [5]. Tumor histopathological data as well as molecular and genetic data represent pivotal features with high information density and clinical value [6,7]. However, these two types of data are generally not involved in other diseases. Histopathological information mainly comprises clinical descriptions, including imaging results, tumor site, tumor stage, differentiation, cellular composition, pathological type, and final diagnosis. Molecular data comprise marker expressions, genetic mutations, genomic features, and molecular classifications. Through pathological examination and molecular detection, in combination with other clinical data, tumor and tumor microenvironment characteristics can be comprehensively described and an exact diagnosis can be made, all of which underpin clinical decision-making. Moreover, precision medicine claims to be able to provide personalized therapy for every patient, fueled by clinical genetic testing that builds on advances in cancer molecular genetics and genomics [8].
Even now, cancer continues to pose a great threat to human health. Inter-patient heterogeneity represents a great obstacle to cancer therapy. Conceivably, because the population of cancer patients is enormous, there are always some patients who resemble one another, and such historically similar patients may shed light on treatments for future patients. However, how to define and evaluate patient similarity remains controversial [9,10]. Patient similarity calculation, which assesses the similarity between patients by mathematical calculation over their multi-modal, heterogeneous metrics, seems to be a solution. In general, the first step in patient similarity calculation is determining a multi-modal data processing and integration strategy. The second step is to define a similarity metric to calculate the distance or similarity score among patients in a systematic and consistent manner. The third step is to establish a patient similarity network (PSN) and carry out cluster analysis and clinical feature analysis in the PSN system. Finally, a patient to be evaluated is embedded in the PSN, and the group of patients most similar to the patient of interest is defined based on the patient’s similarity score [11].
There have been some explorations into patient similarity calculations in human diseases [9,12]. They have generally used patient demographic information, diagnosis, treatment, prescription drugs, laboratory test data, and physiological monitoring data extracted from electronic medical records (EMRs). At present, some patient similarity calculations only use numerical variables as parameters to calculate Euclidean distance. This strategy presumes that all variables are continuous, which is not well suited to categorical variables [13,14,15]. Some use the International Classification of Diseases (ICD) hierarchical coding to calculate the distance between the parent node and each child node for disease diagnoses and then evaluate the similarity [16,17], while some orchestrate medical record information into a medical knowledge graph and convert the medical entity relations into vector space, which can be used to calculate the Euclidean distance, Mahalanobis distance, or cosine similarity [18,19]. Encoding/embedding conversion has an obvious defect: the information must first be converted into another system, such as ICD coding or a knowledge graph, so the calculation is indirect and introduces additional influencing factors that eventually affect the accuracy of the results. The major drawback of patient similarity research is that it has struggled to incorporate diverse clinical data types into a unified model.
AI and deep learning have demonstrated usefulness in patient similarity analysis [20,21]. For example, disease characteristics are often mathematically represented as vectors or matrices, and neural networks are subsequently employed to learn similarities and cluster patients. However, models derived from neural networks are usually highly specialized [22]. Most patient similarity models based on supervised or semi-supervised algorithms depend on pre-labeled training data and require the extraction of parameters and their exact weights. Although such a model performs well on an experimental dataset, its generalizability is weak. Even for the same disease, transferring an algorithmic model is difficult when the data metrics are different. This is a drawback of the supervised method [23]. Additionally, when defining and measuring patient similarity, data labeling is laborious and susceptible to subjective factors. In the era of biomedical big data [24], knowledge and decisions are derived from population data rather than from the clinician’s experience, so manual labeling is disfavored. These two drawbacks have diminished the clinical application value of deep learning, as represented by supervised neural networks, in patient similarity assessment. The self-organizing map, an unsupervised neural network, may be a promising alternative.
Patient similarity among gastric cancer patients is relatively understudied. One study developed a gastric cancer (GC) subtype classification model that integrates multi-omics fusion data and patient similarity networks via a residual graph convolutional network. However, this method was limited to handling numerical variables [25].
Our review suggests that identifying similar cases from a large pool of historical cancer patients, a process known as patient similarity analysis, holds great promise in clinical big data analysis. Nevertheless, the methods for performing patient similarity analysis are still in their infancy. The challenges are twofold: clinical data typically consist of both numerical and categorical variables, accompanied by a significant number of missing values, which demands an efficient data processing approach. Furthermore, the similarity of tumor patients remains a rough estimate, and high-quality labeled data are scarce. Fortunately, unsupervised learning techniques such as K-means and hierarchical clustering are well-suited to handle unlabeled data. To address these challenges, we developed a pipeline that leverages one-hot encoding and K-means clustering to construct a cancer-specific PSN. Then, the PSN was validated using survival endpoints or other indicators to ensure clinical validity. Our ultimate goal is to utilize the derived cPSN to facilitate patient stratification, uncover clinical characteristics, provide personalized treatment recommendations, and inform healthcare management.
2. Methods
2.1. Data Collection and Preprocessing
Multiple types of clinical data from 1000 patients with surgical GCs were collected from the department of gastrointestinal surgery, Shanghai Changhai Hospital. Clinical information was extracted from EMRs and medical examination reports, and the data were then preprocessed to ensure consistency in formatting. Clinical descriptions were summarized into keywords; for example, surgical procedures were classified as laparotomy or laparoscopy. In terms of histopathological data, our dataset contained mesenteric vein/portal vein involvement, qualitative description of surgical margin status, tumor stage, tumor differentiation, etc. (Table 1). In terms of molecular genetic data, the dataset contained gene mutations derived from clinical genetic testing, gene expression, and immunohistochemical data. Notably, each gene mutation and each tumor marker/gene expression level can be considered as an independent variable. Missing data were marked with NA.
2.2. Encoding
In the encoding process, categorical variables were directly coded, numerical variables and clinical qualitative descriptions were converted into categorical variables, and each categorical state of each variable was recorded as a one-hot feature. Suppose that there are M observation indices (variables) in a set of samples, denoted as $X_1, X_2, \ldots, X_M$, and that each observation index $X_i$ has $n_i$ different classification states, denoted as $s_{i,1}, s_{i,2}, \ldots, s_{i,n_i}$; altogether, we would obtain $\sum_{i=1}^{M} n_i$ one-hot features. Continuous values were transformed into discrete values by equivalent partitioning. Preferably, for numerical variables, the values in a set of samples were divided into 4 parts according to the quartile method so that 4 categories were formed. For clinical qualitative descriptions, N states were formed into N categories.
A missing value was regarded as an independent one-hot coding type in the observation index of clinical data, and there was no need to fill in null values.
The one-hot encoding method was used to integrate multi-modal medical data. Subsequently, the heterogeneous data of patients were transformed into a feature embedding matrix.
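The encoding described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the toy cohort, the column names (`age`, `differentiation`, `cea`), and the helper `one_hot_encode` are all hypothetical. It shows quartile binning of numerical variables, direct coding of categorical variables, and missing values kept as their own "NA" state rather than imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy cohort; column names and values are illustrative only.
records = pd.DataFrame({
    "age": [39, 45, 52, 57, 61, 68, 73, 80],                  # numerical
    "differentiation": ["low", "moderate", None, "high",
                        "low", "moderate", "high", "low"],    # categorical, one missing
    "cea": [2.1, np.nan, 5.4, 3.3, 8.9, 1.2, 4.7, 6.0],        # numerical, one missing
})

def one_hot_encode(df, numeric_cols, n_bins=4):
    """Quartile-bin numerical columns, then one-hot encode every column.

    Missing values are not imputed; they become an explicit "NA" state.
    """
    coded = pd.DataFrame(index=df.index)
    for col in df.columns:
        values = df[col]
        if col in numeric_cols:
            # Equal-frequency partitioning into quartiles Q1..Q4 (NaN stays NaN).
            labels = [f"Q{i + 1}" for i in range(n_bins)]
            values = pd.qcut(values, q=n_bins, labels=labels)
        coded[col] = values.astype(object).where(values.notna(), "NA")
    # One column per (variable, state) pair.
    return pd.get_dummies(coded, dtype=int)

features = one_hot_encode(records, numeric_cols=["age", "cea"])
print(features.shape)
print(list(features.columns))
```

Each row of `features` contains exactly one 1 per original variable, so the number of columns equals the total number of states across all variables, including the "NA" states.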
Through distance calculation (Euclidean distance in this study), the feature embedding matrix was organized into a PSN. Preferably, the t-distributed stochastic neighbor embedding (t-SNE) method can be used to visualize the high-dimensional network in a two-dimensional or three-dimensional display.
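As a small illustration of the distance step, the sketch below computes pairwise Euclidean distances on a hypothetical one-hot matrix; the matrix `X` and the helper `euclidean_distance_matrix` are assumptions made for this example. With one-hot features, the squared Euclidean distance between two patients equals twice the number of variables in which their states differ.

```python
import numpy as np

# Hypothetical one-hot matrix: 4 patients, 3 variables with 2 states each,
# stored as column pairs (0,1), (2,3), (4,5).
X = np.array([
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
], dtype=float)

def euclidean_distance_matrix(features):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

D = euclidean_distance_matrix(X)
print(np.round(D, 3))
```

Patients 0 and 1 have identical profiles, so their distance is zero, while patients 0 and 3 differ in all three variables and are the farthest apart.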
2.3. Subgrouping
K-means clustering, an unsupervised learning algorithm, was conducted for the patient similarity analysis to divide all patients into K clusters. K is a hyperparameter that is set between 2 and 10. The elbow method or gap statistic method was used to evaluate the effect of clustering for each selected K. Data encoding and clustering analyses were conducted using scikit-learn packages in Python 3.10.
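A minimal sketch of the K scan with scikit-learn is given below. The synthetic binary matrix stands in for the one-hot feature matrix and is purely hypothetical, as are the random seed and the `n_init` setting; the within-cluster sum of squares (`inertia_`) is the quantity usually inspected for the elbow method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical binary matrix standing in for the one-hot feature matrix:
# 3 latent patient profiles, 50 patients each, with about 10% of entries flipped.
profiles = rng.integers(0, 2, size=(3, 20))
X = np.repeat(profiles, 50, axis=0)
flip = rng.random(X.shape) < 0.10
X = np.where(flip, 1 - X, X).astype(float)

# Fit K-means for K = 2..10 and record the within-cluster sum of squares.
inertia = {}
for k in range(2, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertia[k] = model.inertia_

for k, value in inertia.items():
    print(k, round(value, 1))
```

The elbow is read off as the K beyond which the inertia stops dropping sharply; this sketch only prints the curve and leaves that judgment to the reader.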
2.4. Survival Analysis
The Kaplan–Meier method was used for clinical endpoint correlation analysis. The log-rank test was used to assess the statistical differences in overall survival (OS) between the groups of patients obtained by clustering. PSNs with or without clinical implications were distinguished based on the statistical significance of the p-values: if the p-value was less than 0.05, we considered the constructed PSN to be correlated with a clinically meaningful endpoint, namely a cPSN. Survival analysis was conducted in R 4.0.3 using the survival and survminer packages.
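The study ran this analysis in R; purely for illustration, the following Python sketch implements the Kaplan-Meier estimator and a two-group log-rank test from their textbook definitions. The function names and the toy follow-up data are hypothetical, and the two-group form is a simplification of the multi-cluster comparison described above.

```python
import math
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the survival estimate at each."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)            # patients still under observation
        deaths = np.sum((time == t) & event)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        survival.append(s)
    return event_times, np.array(survival)

def logrank_two_groups(time, event, group):
    """Two-group log-rank test; group is an array of 0/1 labels."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    in_g1 = np.asarray(group) == 1
    observed_minus_expected, variance = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & in_g1).sum()
        d = ((time == t) & event).sum()
        d1 = ((time == t) & event & in_g1).sum()
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of the number of events in group 1.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    stat = observed_minus_expected ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# Hypothetical follow-up times (months) and event flags (1 = death, 0 = censored).
time = [5, 8, 12, 12, 20, 24, 2, 3, 4, 6, 7, 9]
event = [1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1]
group = [1] * 6 + [0] * 6

km_times, km_surv = kaplan_meier(time[:6], event[:6])
stat, p_value = logrank_two_groups(time, event, group)
print(km_times, np.round(km_surv, 3))
print(round(float(stat), 3), round(p_value, 4))
```

Under the criterion stated above, a clustering would be labeled a cPSN when the log-rank p-value falls below 0.05.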
2.5. Statistical Analysis
Statistical analyses were conducted using the chi-square (χ²) test in SPSS Statistics 20. Briefly, we created a contingency table to display the frequency of each classification (clustering group, patient age, cancer differentiation, or tumor stage). We examined the distribution of multiple categorical variables (dMMR, EGFR-IHC, ERBB2-IHC, p53-IHC) simultaneously and used the test to determine whether there was a significant association between these variables and subgroup, patient age, cancer differentiation, and tumor stage. A p-value of less than 0.01 was considered statistically significant.
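The study performed this test in SPSS; as a rough equivalent, the sketch below runs a chi-square test of independence with SciPy on a contingency table. The counts are hypothetical and are not taken from the study's tables.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows are five clusters, columns are marker negative / positive.
table = np.array([
    [120, 80],
    [150, 50],
    [90, 110],
    [60, 140],
    [130, 70],
])

# Returns the statistic, the p-value, the degrees of freedom, and the
# expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"chi2={chi2:.2f}, dof={dof}, p={p_value:.3g}")
print("significant at the 0.01 level:", p_value < 0.01)
```

The degrees of freedom equal (rows - 1) × (columns - 1), so a table of five clusters by two marker states has four.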
3. Results
We collected multiple types of clinical data from 1000 patients with surgical GCs. In this study, the heterogeneous medical data we dealt with included demographic data, histopathological data, molecular and genetic data, laboratory tests, and the surgical paradigm narrative. The types of data contained numerical variables, binary variables, categorical variables, and clinical qualitative descriptions (Table 1).
Categorical data representation has advantages in capturing data from clinical records [26]. Numerical data are precise continuous values, but the information they carry does not necessarily have to be represented this way. Given that continuous values within a certain range could be considered to have similar clinical significance, and to improve the generalizability of the model, we transformed the continuous values into discrete values using equivalent partitioning. In this case, categorical variables were directly coded in the encoding process, and numerical variables and clinical qualitative descriptions were first converted into categorical variables. In order to integrate multi-modal medical data, we encoded the feature parameters of each patient using the one-hot encoding method. A total of 143 one-hot encoding values were identified from 37 variables, as a result of each categorical state of each variable being recorded as a one-hot feature. Subsequently, the heterogeneous data of patients were transformed into a feature embedding matrix.
Through feature coding, patient embedding, and distance calculation, all patient data were orchestrated to form a PSN, which is an M-dimensional network, where M is the sum of observation parameters. The PSN reflects the similarity distance between patients (Figure 1). Each point in the high-dimensional PSN represents a patient. We then conducted cluster analysis. The 1000 surgical GC patients were divided into 2 to 11 clusters via the K-means algorithm. Using the elbow method, five clusters were found to provide the best clustering performance. Each cluster represents a group of similar patients sharing some clinical characteristics (Figure 2), which is the underlying basis for treatment recommendations for a given patient who is clustered into a specific group.
We performed a correlation analysis of a clinically meaningful endpoint to evaluate the clinical validity of the clustering. OS, which serves as the gold standard of oncological clinical endpoints [27], was investigated to assess the validity and clinical relevance of the constructed PSN. When the patients in our cohort were divided into five clusters, the OS differences between clusters were statistically significant (log-rank test, p < 0.0001, Figure 3A). In addition to distinguishing this clinical endpoint, our clustering could also suggest specific gene mutations and genomic features in various subgroups (Table 2). Our strategy achieved an excellent performance that was superior to that using traditional classifications such as patient age, cancer differentiation, and tumor stage (Figure 3B–D, Table 2). Notably, ERBB2-IHC is related to differentiation, with the proportion of ERBB2-IHC positivity in patients with high, moderate-high, moderate, moderate-low, and low differentiation being 0.462, 0.556, 0.450, 0.322, and 0.219, respectively. This is the sole demonstration of an association between conventional classification approaches and genomic molecular features (Table 2).
Cluster_2 has the longest survival. Most patients in Cluster_2 are negative for nerve invasion, tumor thrombus, cancerous nodes, and regional nodal involvement. The patients are mainly in pathological stage I and stage II, with some scattered across other stages. All of these clinical indicators support a better prognosis. Interestingly, we found that lower distal GCs are more common than upper proximal GCs, shedding light on the debates [28,29] (Figure 2). Cluster_5 has the worst prognosis among all subgroups. The patients are mainly in pathological stage III, with the majority having upper tumor locations. Cluster_1 contains 71.6% of the patients classified as Mx, meaning that distant metastasis cannot be determined. TP53 mutations are predominantly found in Cluster_3 and Cluster_4, in accordance with their dMMR characteristics. The proportion of EGFR and ERBB2 expression is significantly lower in Cluster_3 and Cluster_4 (Figure 2).
4. Discussion
The present research provides a similarity calculation method for tumor patients based on one-hot encoding and unsupervised clustering. According to their clinical features, a cohort of tumor patients was embedded in a high-dimensional space and then clustered into several groups based on commonalities. We then assessed whether these different groups of patients were clinically distinct. Because death is the primary event of interest in cancer patients, we carried out a correlation analysis of OS on the clustered patients. The log-rank test was used to examine whether the OS distributions of the clusters were distinguishable, which ensures the clinical significance and practical value of the established PSN. For example, cancer stage is conventionally used to stratify patients [30]. However, patients with different stages were often clustered into the same subset in our model. Furthermore, patients in the same subset tend to have similar survival prognoses, as well as potentially similar clinical characteristics and responses to treatment.
Clinical data resources include electronic medical records, imaging examinations, laboratory tests, and genetic and cellular analyses. Determining how to integrate highly heterogeneous patient data is vital in patient similarity analysis. We adopted the “early integration strategy”, which constructs a unified model for all types of data, in contrast to the “late integration strategy”, which calculates distances for each data type and requires searching for a corresponding appropriate model [31]. Note that the “early integration strategy” ignores the correlation between parameters. The Mahalanobis distance calculation, which weights multivariate parameters using a covariance matrix, may compensate for this shortcoming.
Data encoding was performed following the integration. We used the one-hot encoding strategy as it is concise and robust in clinical data management. It can efficiently code any clinical data, and the data processing ability is superior. As the parameters with clinical meaning and data accessibility are limited, dimensionality does not need to be considered. Although undesirable, missing data frequently occur in real-world healthcare scenarios when the values of variables are not measured or are unavailable for a patient. The usual practice is to fill in missing values with estimated values, which may underrepresent the real state, thus rendering them unsuitable for further analysis. The present study regarded missing values as independent one-hot coding types without filling in null values, which may reduce value bias and avoid the classification error caused by filling methods.
Besides multi-modal medical data integration and encoding, data labeling in the field of tumor patient similarity lacks standards. Doctors’ annotations often rely on limited information and are usually based on heuristic judgements. These labeling processes are subjective or otherwise uncertain. However, AI algorithms and machine learning usually require a large amount of labeled data. This creates a great gap between the available manual labeling and the accurate labeling required for training algorithms. Unsupervised clustering, independent of any labeling data, efficiently classifies patients into subgroups. Thus, machine learning can then be used to uncover clinical characteristics or data features underlying the subgroups. Essentially, the constructed cPSN should authentically restore the similarity of patients in the real world, linking to prognostic assessment, personal treatment, and health management.
While a consensus on which machine learning algorithm performs better with specific data types in the context of precision medicine is still lacking [32], the present study performed K-means unsupervised clustering and evaluated K using a statistical algorithm to obtain the optimal K, and the whole process was unsupervised without human intervention. The present study uniformly adopted one-hot encoding for multi-modal, highly heterogeneous clinical data, which is flexibly compatible with the evolution of clinical data and with changes in observation status caused by different medical institutions, doctors, and medical development stages. The data processing method provides an extensibility mechanism for adding more parameters. In the future, when synthetic data are expected to replace real data in medical big data analytics [33], machine learning algorithms can be used to accelerate clinical trials, which is a subsequent mission of patient similarity analysis.
Our cPSN holds great clinical value in the context of cancer care management. This big data analytics approach can elucidate subtle clinical conundrums, including the disparities in prognosis between distal and proximal gastric cancer patients and the prognostic differences associated with ERBB2 expression in gastric cancer, as exemplified by our research findings. To evaluate a target patient, the group of patients most similar to that patient is identified in the cPSN using the K-nearest neighbor algorithm based on distance calculation, and the range and fineness of similar patients are selected by adjusting the K value. Then, therapeutic insights can be acquired from similar patients to help prognostic evaluation. Thus, population-based clinical information obtained by searching similar patient cases can be used to propose treatment and management strategies, which would promote the development of big-data-based precision medicine.
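The nearest-neighbor lookup described above might look like the following sketch. The cohort matrix, the target vector, and the helper `most_similar_patients` are hypothetical; the target is assumed to be encoded with the same one-hot scheme as the cohort.

```python
import numpy as np

# Hypothetical one-hot cohort (4 patients) and a new patient in the same encoding.
cohort = np.array([
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
], dtype=float)
target = np.array([1, 0, 1, 0, 0, 1], dtype=float)

def most_similar_patients(cohort, target, k=3):
    """Return indices and Euclidean distances of the k nearest cohort patients."""
    dist = np.linalg.norm(cohort - target, axis=1)
    order = np.argsort(dist, kind="stable")[:k]   # ties keep cohort order
    return order, dist[order]

idx, dist = most_similar_patients(cohort, target, k=3)
print(idx, np.round(dist, 3))
```

Raising `k` widens the set of reference patients, while lowering it restricts the comparison to the closest matches, which corresponds to adjusting the range and fineness of similar patients.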
Altogether, we developed an easy-to-perform, clinically interpretable, generalizable, and universal method to conduct cancer patient similarity analysis. Our cPSN could create paths from clinical data to insight, and from information to decision. With an emphasis on clinical utility and usability, clinical investigators can use the cPSN to find insights and conduct clinical research. Clinicians can use the cPSN to inform patient stratification, recommend treatments, deliver personalized patient care, and improve population health management.
This study has several limitations. Firstly, although one-hot encoding is a robust method that can handle missing values, too many null values in the dataset affect the accuracy of the results. Future studies should use records that are as complete as possible. Secondly, the data we used were baseline data that depicted the patients’ features before surgery, without considering treatment information. That may make sense because a causal connection exists between baseline data and treatment programs. However, if treatment data are available, we may be able to discover whether patient outcomes are more influenced by baseline characteristics or treatment regimen/treatment sequence through patient similarity analysis. Thirdly, our model ignored the correlations among the parameters of selected features, leading to potential redundancy. We recognize that parameter redundancy and the absence of a weight matrix are limitations of the cPSN. These require solutions, especially under the framework of unsupervised learning.