## *2.1. Study Design and Study Database*

This study was a secondary analysis of longitudinal data retrieved from the NHIRD in Taiwan. The National Health Insurance (NHI) program was launched on 1 March 1995 and, by 2017, provided healthcare service coverage to more than 99% of the population [20]. The NHIRD includes medical reimbursement records for outpatient and inpatient healthcare services, hospital or clinic visits, dental service visits and traditional Chinese medicine service visits. All reimbursement records for diagnoses and medical-related procedures are coded according to the International Classification of Diseases, Clinical Modification (ICD-CM), ninth and tenth revisions (ICD-9-CM and ICD-10-CM, respectively; the latter after 1 January 2016 [21]), and according to a procedure coding system for all medical service claims.

## *2.2. Ethics Statement*

This study was approved by the Institutional Review Board of the School of Nursing, National Taipei University of Nursing and Health Sciences (approval number: IRB# CN-IRB-2011-063), on 23 October 2011. Personal information in the NHIRD was encrypted and protected by the National Health Insurance Administration in Taiwan using a complex double encryption procedure. In addition, because the present study was a secondary data analysis, written informed consent was not required from the selected patients. This study was also registered at the Open Science Framework (OSF, reference osf.io/fkhm8 (accessed on 15 March 2021)).

#### *2.3. Study Population and Possible Risk Factors Selection*

The ICD-9-CM codes used to define patients with depression were 296.2X–296.3X, 300.4 and 311.X, and the ICD-9-CM codes used to define patients with anxiety were 300.XX, 291.89 and 292.89. In Taiwan, cancer patients who are suspected of having depression or anxiety are referred by their oncologists to psychiatrists, and this referral is recorded as the first National Health Insurance (NHI) outpatient visit. After the referral, the patients undergo psychological tests administered by clinical psychologists and are assessed again by psychiatrists to determine whether they need anti-depressant or anti-anxiety medications; this is recorded as the second NHI psychiatric visit. After a period of time, the diagnosis needs to be confirmed once more by psychiatrists. Therefore, confirming that a cancer patient has depression or anxiety usually requires at least three outpatient visits and the prescription of anti-depressant or anti-anxiety drugs.

In this study, young lung cancer patients who were aged 20–39 years and who were newly diagnosed with lung cancer (ICD-9-CM code = 162.XX) between 1 January 2001 and 31 December 2007 were retrieved from the NHIRD. Young lung cancer patients who died or withdrew from the NHI program during the study period were excluded. Young patients with lung cancer who had been diagnosed with baseline psychiatric diseases, such as depressive disorder (ICD-9-CM codes: 296.2X–296.3X, 300.4 and 311.X), anxiety states (ICD-9-CM codes: 300.XX, 291.89 and 292.89), bipolar disorders (ICD-9-CM codes: 296.0, 296.1, 296.4, 296.5, 296.6, 296.7, 296.8, 296.80 and 296.89), or alcohol-induced mental disorders (ICD-9-CM codes: V113, 9800, 2650, 2651, 3575, 4255, 3050, 291, 303 and 571.0–571.3) between 1 January and 31 December in 2001 were also excluded. To avoid selecting false-positive patients with depression and anxiety, only young lung cancer patients with at least three consecutive corresponding diagnoses were eligible to be coded as having depression and anxiety.
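The case definition above can be illustrated with a short Python sketch. It is not the authors' MySQL implementation. It assumes that each patient's diagnosis codes are available as dotted ICD-9-CM strings in visit order, and it reads "consecutive" as an unbroken run of visits carrying a matching code. The function names are hypothetical.

```python
def is_depression(code: str) -> bool:
    """ICD-9-CM 296.2X-296.3X, 300.4 or 311.X (dotted string format assumed)."""
    major, _, minor = code.partition(".")
    if major == "296":
        return minor[:1] in {"2", "3"}
    return major == "311" or code == "300.4"


def is_anxiety(code: str) -> bool:
    """ICD-9-CM 300.XX, 291.89 or 292.89, exactly as listed in the text.

    Note: as listed, 300.4 satisfies both definitions; how such overlap was
    resolved is not stated in the text, so it is left unresolved here.
    """
    return code.partition(".")[0] == "300" or code in {"291.89", "292.89"}


def has_consecutive_run(codes, matches, min_run=3):
    """True if `codes` (in visit order) contains `min_run` matching codes in a row."""
    run = 0
    for code in codes:
        run = run + 1 if matches(code) else 0
        if run >= min_run:
            return True
    return False


# Hypothetical visit history for one patient (lung cancer, then three
# psychiatric visits coded as depressive disorder).
visits = ["162.9", "296.22", "296.22", "296.23"]
print(has_consecutive_run(visits, is_depression))
```

Requiring an unbroken run rather than any three matching visits is the stricter reading and is an assumption of this sketch.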

The possible risk factors associated with depression and anxiety among lung cancer patients were determined based on two studies: Park et al. [19], who investigated whether hypertension, diabetes mellitus, history of tuberculosis, liver disease (liver cancer and liver cirrhosis), end-stage renal disease, coronary artery disease (including heart failure), stroke (ischemic stroke and hemorrhagic stroke) and chronic obstructive pulmonary disease (COPD) are risk factors associated with anxiety and depression after surgical treatment for lung cancer; and Clarke and Currie [20], who considered heart disease, stroke, cancer, diabetes mellitus, rheumatoid arthritis and asthma as possible risk factors associated with depression and anxiety in cancer patients. Therefore, in this study, we took into account diabetes mellitus (DM), hypertension, asthma, liver cirrhosis, COPD, autoimmune diseases (including rheumatoid arthritis, systemic lupus erythematosus and aplastic anemia), cerebral diseases (including ischemic stroke, hemorrhagic stroke and transient ischemic attack (TIA)), heart failure, hepatitis B virus (HBV), renal diseases and osteoporosis.

#### *2.4. Combining Multiple Correspondence Analysis and the K-Means Clustering Algorithm with v-Fold Cross-Validation (MCA–k-Means Clustering Algorithm)*

The raw data matrix was first transformed into a matrix consisting solely of index variables (i.e., encoded as 0 or 1) through multiple correspondence analysis (MCA) [21,22], which served as the data preprocessing procedure for the raw data matrix. The index variables indicate the levels of all of the categorical variables in this study. The MCA then converted all index variables into multi-dimensional Euclidean coordinates. The resulting coordinate matrix could be considered a high-dimensional dataset that could be passed to the subsequent clustering algorithm. To determine the optimal clustering in this high-dimensional dataset, the *k*-means clustering algorithm with *v*-fold cross-validation was applied. The algorithm is described in detail below:

2.4.1. Step 1. Multiple Correspondence Analysis

Let **M**<sub>I×K</sub> be the raw data matrix with I subjects and K categorical variables.

	- If a categorical variable is binary, then place it in the Burt matrix as an original variable matrix.
	- If a categorical variable has more than two levels (i.e., J<sub>k</sub> > 2 levels), then convert this variable into index variables (containing only 0 and 1); this forms an I × J<sub>k</sub> indicator matrix in which each column contains an index variable coded with 0 or 1.
	- Place all index variable columns together to form the indicator matrix **X**<sub>I×J</sub>.
	- Calculate the Burt matrix as (**X**<sub>I×J</sub>)<sup>T</sup>**X**<sub>I×J</sub>.
	- The grand total of the table (N) is obtained and the probability matrix is defined as Z = N<sup>−1</sup>**X**.
	- Define r as the vector of the row totals of Z (i.e., r = Z1, where 1 is a vector of ones) and define c as the vector of the column totals of Z (i.e., c = Z<sup>T</sup>1). Then, D<sub>c</sub> = diag{c} and D<sub>r</sub> = diag{r}.
	- Calculate the Euclidean coordinates by using a singular value decomposition method as follows:

$$D_r^{-\frac{1}{2}} \left( Z - r c^T \right) D_c^{-\frac{1}{2}} = P \Delta Q^T$$

where P and Q are the matrices of left and right singular vectors, and Δ and Λ = Δ<sup>2</sup> are the diagonal matrix of singular values and the matrix containing eigenvalues, respectively. Therefore, the row and column coordinate matrices (**F** and **G**, respectively) are calculated as follows:

$$\mathbf{F} = D_r^{-\frac{1}{2}} P \boldsymbol{\Delta}$$

$$\mathbf{G} = D_c^{-\frac{1}{2}} Q \boldsymbol{\Delta}$$

	- The inertia value is calculated based on a Pearson chi-squared (*χ*<sup>2</sup>) value from the rows and columns to identify their coordinate centers as follows:

$$d_r = \operatorname{diag} \left\{ \mathbf{F}\mathbf{F}^T \right\} \text{ and } d_c = \operatorname{diag} \left\{ \mathbf{G}\mathbf{G}^T \right\}.$$

	- If a subset of **F** or **G** is selected, then the inertia values for the row and column coordinates are calculated as:

$$\mathrm{Inertia}_r = \frac{\operatorname{diag}\left\{\mathbf{F}' {\mathbf{F}'}^{T}\right\}}{N} \text{ and } \mathrm{Inertia}_c = \frac{\operatorname{diag}\left\{\mathbf{G}' {\mathbf{G}'}^{T}\right\}}{N},$$

where **F**′ and **G**′ are subsets of **F** and **G**, respectively.
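The steps of Step 1 can be sketched in NumPy as follows. This is a minimal illustration, not the STATISTICA implementation used in the study. It assumes that the raw matrix holds one categorical label per cell, and for simplicity it expands every variable, including binary ones, into one 0/1 column per level. The function names `indicator_matrix`, `burt_matrix` and `mca` are hypothetical.

```python
import numpy as np


def indicator_matrix(M):
    """Expand each categorical column of M (I x K) into 0/1 columns, giving X (I x J)."""
    blocks = []
    for k in range(M.shape[1]):
        levels = np.unique(M[:, k])
        blocks.append((M[:, k][:, None] == levels[None, :]).astype(float))
    return np.hstack(blocks)


def burt_matrix(X):
    """Burt matrix X^T X (J x J): cross-tabulation of all pairs of levels."""
    return X.T @ X


def mca(X):
    """Row and column coordinates (F, G) and singular values of the indicator matrix."""
    N = X.sum()                       # grand total of the table
    Z = X / N                         # probability matrix
    r = Z.sum(axis=1)                 # row totals, r = Z 1
    c = Z.sum(axis=0)                 # column totals, c = Z^T 1
    # D_r^{-1/2} (Z - r c^T) D_c^{-1/2}, written element-wise
    S = (Z - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    P, delta, Qt = np.linalg.svd(S, full_matrices=False)
    F = (P / np.sqrt(r)[:, None]) * delta      # F = D_r^{-1/2} P Delta
    G = (Qt.T / np.sqrt(c)[:, None]) * delta   # G = D_c^{-1/2} Q Delta
    return F, G, delta


# Hypothetical toy data: 4 subjects, 2 categorical variables.
M = np.array([["a", "x"], ["b", "y"], ["a", "y"], ["c", "x"]])
X = indicator_matrix(M)
F, G, delta = mca(X)
d_r = (F ** 2).sum(axis=1)            # diag{F F^T}
d_c = (G ** 2).sum(axis=1)            # diag{G G^T}
print(X.shape, F.shape, G.shape)
```

The element-wise division by the square root of the outer product of r and c is equivalent to the pre- and post-multiplication by the two diagonal matrices and avoids building them explicitly.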

2.4.2. Step 2. K-Means Clustering with v-Fold Cross-Validation

The *k*-means clustering algorithm with *v*-fold cross-validation was applied to analyze the **F** and **G** that were obtained from the MCA [23,24]. The algorithm is as follows:

	- (a) Divide **F** or **G** into *v* folds (denoted F<sub>i</sub> or G<sub>i</sub>, i = 1, ..., *v*); in this study, we set *v* = 5;
	- (b) For i = 1 to *v*, take F<sub>i</sub> or G<sub>i</sub> as the testing set and {**F**}\F<sub>i</sub> or {**G**}\G<sub>i</sub> as the training set;
	- (c) Within each cluster of the training set, compute the mean Euclidean distances (called the clustering costs in this study), set these as the new cluster centers and replace the cluster centers of the previous step;
	- (d) Compute the mean Euclidean distances of each index variable (i.e., each level of the categorical variables) in the testing set from the new cluster centers derived from the training set;
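One possible reading of steps (a) to (d) is sketched below in NumPy. It is an illustration under stated assumptions, not the STATISTICA procedure: the rows of `points` stand for the rows of **F** or **G**, a plain Lloyd-style *k*-means is fitted on each training set, and the clustering cost is taken as the mean Euclidean distance from each testing point to its nearest training center. The farthest-point initialization and the function names are choices made for this sketch only.

```python
import numpy as np


def pairwise_dist(a, b):
    """Euclidean distances between every row of a and every row of b."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2)


def kmeans_centers(points, k, n_iter=100):
    """Lloyd-style k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        d = pairwise_dist(points, np.array(centers)).min(axis=1)
        centers.append(points[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = pairwise_dist(points, centers).argmin(axis=1)
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers


def cv_clustering_cost(points, k, v=5, seed=0):
    """Mean distance of held-out points to the nearest training center, averaged over v folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(points)), v)
    costs = []
    for i in range(v):
        test = points[folds[i]]
        train = points[np.concatenate([folds[j] for j in range(v) if j != i])]
        centers = kmeans_centers(train, k)
        costs.append(pairwise_dist(test, centers).min(axis=1).mean())
    return float(np.mean(costs))


# Hypothetical coordinates: two well-separated groups of nine points each.
base = np.array([[dx, dy] for dx in (0.0, 0.1, 0.2) for dy in (0.0, 0.1, 0.2)])
points = np.vstack([base, base + 5.0])
for k in (1, 2, 3):
    print(k, cv_clustering_cost(points, k))
```

Comparing the cross-validated cost across candidate values of *k* is one common way to use such a procedure; the rule the study applied to pick the final number of clusters is not stated in the steps above.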

MySQL was used for the selection, linkage, processing and cleaning of the dataset from the NHIRD. The algorithm proposed in this study was implemented with STATISTICA Data Miner ver. 10.0 (StatSoft, Inc., Tulsa, OK, USA).

## **3. Results**

In the present study, 1022 young lung cancer patients aged 20–39 years were studied and their demographic information is shown in Table 1. The study sample comprised 520 male (50.9%) and 502 female patients (49.1%); 154 of the patients were aged 20–29 years old (15.1%) and 868 patients were aged 30–39 years old (84.9%).


**Table 1.** Demographic information of the study sample (*n* = 1022).
