

The coordinate matrices **F** and **G**, which were derived from the multiple correspondence analysis (MCA) and expressed in Euclidean coordinates, were clustered using the k-means algorithm combined with *v*-fold cross-validation. Figure 1 shows the clustering cost for each candidate number of clusters. On the basis of the clustering cost, there was no statistically significant difference between using five clusters and using six clusters. Based on the principle of parsimony, the optimum number of clusters was therefore determined to be five. Table 2 presents the resulting five clusters: anxiety was clustered with osteoporosis, and depression was clustered with the lack of diabetes mellitus (DM), Charlson comorbidity index (CCI) = 0 and female sex.
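
The selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy two-dimensional points stand in for the MCA coordinates, the clustering cost is assumed to be the mean distance from each held-out point to its nearest centroid, and the parsimony rule in the hypothetical `choose_k` (smallest number of clusters whose cost lies within a relative tolerance of the best cost) is a stand-in for the significance comparison reported in Figure 1.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's algorithm; returns the k fitted centroids."""
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        new_centroids = []
        for j, group in enumerate(groups):
            if group:
                new_centroids.append(tuple(sum(c) / len(group) for c in zip(*group)))
            else:
                new_centroids.append(centroids[j])  # an empty cluster keeps its centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids

def cv_cost(points, k, v=5, seed=0):
    """Assumed cost: mean distance from held-out points to the nearest
    centroid fitted on the remaining v - 1 folds."""
    rng = random.Random(seed)
    idx = list(range(len(points)))
    rng.shuffle(idx)
    folds = [idx[i::v] for i in range(v)]
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [points[i] for i in idx if i not in held]
        centroids = kmeans(train, k, rng)
        total += sum(min(sq_dist(points[i], c) for c in centroids) ** 0.5
                     for i in held_out)
    return total / len(points)

def choose_k(costs, tol=0.05):
    """Hypothetical parsimony rule: the smallest k whose cost is within
    a relative tolerance of the lowest cost."""
    best = min(costs.values())
    return min(k for k, c in costs.items() if c <= best * (1 + tol))

# Toy data (not study data): three well-separated groups of 20 points each.
rng = random.Random(1)
centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
toy_points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
              for cx, cy in centers for _ in range(20)]
costs = {k: cv_cost(toy_points, k) for k in range(1, 7)}
print(costs, choose_k(costs))
```

With this rule, two candidate solutions whose costs differ only marginally resolve to the one with fewer clusters, which mirrors the choice of five clusters over six.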

**Figure 1.** Cluster costs of different numbers of clusters resulting from k-means clustering combined with *v*-fold cross-validation.



Note: DM = Diabetes mellitus; CCI = Charlson comorbidity index; COPD = Chronic obstructive pulmonary disease; HBV = Hepatitis B virus.

In addition, a comparative analysis was performed using a multiple logistic regression model, which is the most widely used method for investigating risk factors associated with diseases. Table 3a shows the results of the score tests of each independent variable for both dependent variables, depression and anxiety. No independent variable was statistically significant for either dependent variable, indicating that a stepwise variable selection strategy (forward or backward) could not identify any statistically significant predictors. Furthermore, Table 3b shows the results of the multiple logistic regression models (without variable selection procedures), which likewise yielded no statistically significant predictors (apart from the constant terms for both dependent variables).

**Table 3.** (**a**) Score test results for each variable of the logistic regression model. (**b**) Results of multiple logistic regression models for depression and anxiety.



Note: S.E. = Standard Error; DM = Diabetes mellitus; CCI = Charlson comorbidity index; COPD = Chronic obstructive pulmonary disease; HBV = Hepatitis B virus.

#### **4. Discussion**

This study aimed to develop a novel algorithm for identifying risk factors for anxiety and depression in young lung cancer patients aged 20–39 years by using a population-based database, the National Health Insurance Research Database (NHIRD) in Taiwan. Anxiety and depression in this population are regarded as rare events, and very few methods have been proposed to address this problem. The algorithm proposed in this study integrates *v*-fold cross-validation into MCA–k-means clustering to determine risk factors associated with rare events.

In the score tests and the traditional multiple logistic regression analysis, which is a widely used method for determining risk factors associated with diseases (see Table 3), none of the risk factors was statistically significantly associated with either anxiety or depression in young patients with lung cancer. Moreover, some parameter estimates were very unreliable because their standard errors were large (even larger than the parameter estimates themselves). In Table 3a, for the depression outcome variable, the estimates for CCI = 1 vs. CCI = 0, DM, asthma, liver cirrhosis, autoimmune diseases, cerebral diseases, heart failure, renal diseases and osteoporosis were unreliable and exhibited extremely low odds ratios (ORs); for the anxiety outcome variable, the estimates for CCI ≥ 2 vs. CCI = 0, DM, hypertension, asthma, liver cirrhosis, chronic obstructive pulmonary disease (COPD), autoimmune diseases, cerebral diseases, heart failure, hepatitis B virus (HBV) and renal diseases were likewise unreliable and exhibited extremely low ORs, or an extremely high OR in the case of asthma. Previous studies have indicated that parameter estimation methods such as maximum likelihood estimation provide biased or inestimable estimates for rare events [26,27]. According to King and Zeng (2001) [28], logistic regression can sharply underestimate the probability of rare events. Some methods have been proposed to resolve these problems, but there is still no optimal method and no consensus on how best to estimate logistic regression coefficients for rare event data. In this study, not only the dependent variables (depression and anxiety) but also the independent variables were rare events, which may have produced many zeros in the data and biased the estimation of the standard errors. The novel algorithm proposed in this study can be considered a good approach for resolving rare event problems.

In addition, the results differ from those of studies using self-reported questionnaires or inventories. For example, Yan et al. [17] used binary logistic regression analysis and found that the risk factors for both anxiety and depression were lack of surgery and age, whereas binary logistic regression did not identify any statistically significant risk factors in this study. The difference may result from different operational definitions of depression and anxiety. Studies using self-reported questionnaires and studies using ICD-9-CM codes based on psychiatrist-confirmed diagnoses make different contributions to clinical practice. Studies that measure depression and anxiety with self-reported questionnaires or inventories are more likely to identify factors associated with self-perceived symptoms of depression and anxiety, which patients may find easier to express and for which behavioral interventions may be suggested, such as exercise, focus group consultation or health-promoting lifestyle adjustment. By contrast, the current study used ICD-9-CM codes for depression and anxiety confirmed by psychiatrists; young patients with lung cancer identified in this way need not only behavioral interventions but also prescriptions of antidepressant or anti-anxiety drugs, or psychiatric hospitalization.
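
The instability of the logistic regression estimates discussed in this section can be illustrated for the simplest case of a single binary risk factor, where the maximum likelihood coefficient equals the log odds ratio of the 2 × 2 table and its Wald standard error is the square root of the summed reciprocal cell counts. The sketch below is only an illustration: the function name `log_odds_ratio` and the cell counts are hypothetical and are not taken from Table 3, and the optional 0.5 continuity correction (Haldane-Anscombe) is one commonly cited remedy, not a method used in this study.

```python
import math

def log_odds_ratio(a, b, c, d, correction=0.0):
    """Log odds ratio and its Wald standard error for a 2x2 table.

    a: exposed cases,   b: exposed non-cases,
    c: unexposed cases, d: unexposed non-cases.
    `correction` is added to every cell (0.5 gives the Haldane-Anscombe estimate).
    """
    a, b, c, d = (x + correction for x in (a, b, c, d))
    if min(a, b, c, d) <= 0:
        raise ValueError("zero cell: the maximum likelihood estimate is not finite")
    estimate = math.log((a * d) / (b * c))
    std_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return estimate, std_error

# Hypothetical sparse table: only one exposed patient has the outcome.
estimate, std_error = log_odds_ratio(1, 40, 30, 2000)
print(f"log OR = {estimate:.3f}, SE = {std_error:.3f}")

# With an empty cell the estimate is not finite unless a correction is applied.
corrected = log_odds_ratio(0, 41, 31, 2000, correction=0.5)
```

Because the smallest cell dominates the sum of reciprocals, a single sparse cell is enough to make the standard error exceed the estimate, which is the pattern described for several comorbidities above.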

The MCA–k-means clustering algorithm proposed in this study has three advantages: (1) it adopts a clustering-based method to determine risk factors associated with rare events, which may avoid the parameter estimation problems encountered with conventional logistic regression models; (2) it can take two or more dependent variables into account simultaneously, which is especially useful for easily confused diseases such as anxiety and depression in this study, whereas a logistic regression model deals with only one dependent variable at a time; and (3) it determines the optimum number of clusters by using *v*-fold cross-validation, in which the data are partitioned into *v* folds so that all observations are used for both training and validation and each observation is used for validation exactly once, which can help determine the optimum number of clusters with less influence from rare event data, such as the dataset used in this study.
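
The property named in advantage (3), that every observation serves in validation exactly once, can be checked with a short sketch. The generator `v_fold_splits` and the sample size of 23 are hypothetical choices for illustration and do not reflect the study dataset.

```python
import random
from collections import Counter

def v_fold_splits(n_obs, v, seed=0):
    """Yield (training_indices, validation_indices) for each of the v folds."""
    indices = list(range(n_obs))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::v] for i in range(v)]  # v disjoint folds of near-equal size
    for i, validation in enumerate(folds):
        training = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield training, validation

# Count how often each observation appears in a validation fold.
validation_counts = Counter()
for training, validation in v_fold_splits(n_obs=23, v=5):
    assert not set(training) & set(validation)  # the two sets never overlap
    validation_counts.update(validation)
```

Each observation is also part of the training set in the other *v* − 1 rounds, so no record is discarded, which matters when the events of interest are rare.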

Regarding the final clustering results (see Table 2), anxiety was clustered with osteoporosis, and depression was clustered with the lack of DM, CCI = 0 and female sex in young patients with lung cancer. These factors were optimally clustered with anxiety and depression. The results are consistent with other studies indicating that patients with anxiety and osteoporosis encounter more complications than those in several other disease groups [29–31], and they suggest that young patients with lung cancer and osteoporosis are at a high risk for the onset of anxiety. Previously published studies have also shown that female cancer patients are at significantly higher risk of depression than male patients [29,32,33]; the clustering results of this study likewise indicate that young female lung cancer patients are at a higher risk of the onset of depression.

This study had several limitations. First, although the National Health Insurance (NHI) program in Taiwan covers more than 98% of the Taiwanese population [34–36], the NHIRD does not provide information about some potential confounding factors, such as smoking, alcohol consumption, exercise habits, diet and lifestyle, which may also influence the risk of anxiety and depression. Second, some young lung cancer patients who experience anxiety and depression may not consult psychiatrists; they usually express their concerns about their cancer to their oncologists, who may overlook their anxiety and depression symptoms. Such patients may instead seek religious help or isolate themselves from other people or medical professionals; therefore, the number of patients with anxiety and depression may be underestimated. Third, because the young patients with lung cancer enrolled in this study were primarily of Chinese or Han ethnicity, the results derived from the novel algorithm proposed here require further examination and validation before they can be generalized to other ethnicities. Furthermore, according to Lu et al. (2019) [37], in recent decades the overall incidence of lung cancer initially increased and then gradually decreased. The surgical rate and radiotherapy rate for lung cancer showed a general downward trend, while the chemotherapy rate showed a significantly increasing trend [30]. Although the five-year relative survival rate has increased over the years, it has remained very low for the last 20 years [31]. Therefore, this study, which used a nationwide database from 2001 to 2007, can still provide useful findings for clinicians.

#### **5. Conclusions**

The novel MCA–k-means clustering algorithm in this study successfully identified risk factors associated with anxiety and depression, which are considered rare events in young patients with lung cancer. The findings suggest that psychiatrists should be involved at the early stage of the initial diagnosis of lung cancer in young patients and should provide adequate prescriptions of antipsychotic medications for these patients.

**Author Contributions:** Drafting of the article: Y.-W.F.; critical revision of the article for important intellectual content: Y.-W.F. and C.-Y.L.; final approval of the article: C.-Y.L.; statistical expertise: C.-Y.L. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by an industry–academia collaboration grant (grant number: DSLPA-PC-107-003).

**Institutional Review Board Statement:** The study was conducted according to the guidelines of the Declaration of Helsinki and it was approved by the Institutional Review Board of the School of Nursing, National Taipei University of Nursing and Health Sciences (approval number: IRB# CN-IRB-2011-063).

**Informed Consent Statement:** Patient consent was waived because the encryption and protection of the personal information from the NHIRD were performed by the National Health Insurance Administration in Taiwan by using a complex double-encryption procedure. As this present study was a secondary data analysis, written informed consent forms were not required from the recruited or selected patients.

**Data Availability Statement:** The study dataset (NHIRD) is not publicly archived; access requires an application to the Bureau of National Health Insurance in Taiwan. The application website is https://www.nhi.gov.tw (accessed on 10 December 2012).

**Acknowledgments:** The authors of this study are very grateful to the National Health Insurance Administration for providing the National Health Insurance claim database and to the Health Data Value-Added Center of the Ministry of Health and Welfare of Taiwan for maintaining the National Health Insurance Research Database (NHIRD).

**Conflicts of Interest:** The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

#### **References**

