Identifying the Predictors of Patient-Centered Communication by Machine Learning Methods

Wu, Shuo; Zhang, Xiaomei; Chen, Pianzhou; Lai, Heng; Wu, Yingchun; Shia, Ben-Chang; Chen, Ming-Chih; Ye, Linglong; Qin, Lei

doi:10.3390/pr10122484


by Shuo Wu ¹, Xiaomei Zhang ², Pianzhou Chen ³, Heng Lai ², Yingchun Wu ², Ben-Chang Shia ⁴,⁵, Ming-Chih Chen ⁴,⁵, Linglong Ye ⁶,* and Lei Qin ²,⁷,*

¹ China National Tobacco Corporation, Beijing 100045, China
² School of Statistics, University of International Business and Economics, Beijing 100029, China
³ School of New Media, Peking University, Beijing 100091, China
⁴ Graduate Institute of Business Administration, College of Management, Fu Jen Catholic University, New Taipei City 24205, Taiwan
⁵ Artificial Intelligence Development Center, Fu Jen Catholic University, New Taipei City 24205, Taiwan
⁶ School of Public Affairs, Xiamen University, Xiamen 361005, China
⁷ Dong Fureng Institute of Economic and Social Development, Wuhan University, Wuhan 430072, China
* Authors to whom correspondence should be addressed.

Processes 2022, 10(12), 2484; https://doi.org/10.3390/pr10122484

Submission received: 2 November 2022 / Revised: 20 November 2022 / Accepted: 21 November 2022 / Published: 23 November 2022

(This article belongs to the Section Advanced Digital and Other Processes)


Abstract

Patient-centered communication (PCC) quality is critical to increasing the quality of patient-centered care. Based on the nationally representative data of the Health Information National Trends Survey (HINTS) 2019–2020 (N = 4593), this study combined four machine learning methods, namely, Generalized Linear Models (GLM), Random Forests, Deep Neural Networks (Deep Learning), and Gradient Boosting Machines (GBM), to identify important PCC predictors through variable importance metrics. Fifteen variables were identified as important predictors, involving multiple dimensions, such as individual sociodemographic characteristics, health-related factors, and individual living habits. Among them were four novel, potentially associated variables, including an individual’s level of verbal expression and exercise habits, which significantly impacted respondents’ perceived PCC quality. This study revealed the value of combining feature selection with machine learning approaches to identify broad variables that could enhance PCC prediction and clinical decision-making, influence future PCC prediction research, and improve patient-centered care. In the future, other easy-to-interpret models can be combined with this approach to further investigate the direction and mechanism of the influence of important predictors on PCC.

Keywords: patient-centered communication; machine learning; HINTS; predictors

1. Introduction

Patient-Centered Communication (PCC) has been one of the most widely debated subjects in healthcare over the past few decades and has important implications in the promotion of a harmonious doctor-patient relationship and the improvement of health care. Quality PCC was originally defined by the Institute of Medicine as a model that aims to obtain the necessary diagnostic and treatment information relevant to medical care in addition to the wishes, needs, and preferences of the patient. The reason for this is to make clinical decisions consistent with the patient’s values and to enhance the understanding and consensus between doctors and patients. PCC is not only a quality of an individual practitioner but also of the entire health system [1].

According to research, PCC has a clear positive impact on healthcare, reducing disease symptoms and improving clinical outcomes in cancer treatment [2,3,4]. PCC is also essential for patient care, medical education, clinician licensure, and quality assessment [5]. Evidence suggests that patient-centered care improves disease outcomes and quality of life, alleviates medical conflict, and is critical for addressing racial, ethnic, and socioeconomic disparities in health care and health outcomes [6,7]. Therefore, the identification of important PCC predictors is crucial.

Existing research has linked PCC to multi-dimensional variables, such as individual sociodemographic characteristics, health status, and attention to health problems. For example, patients who prefer PCC tend to be younger and more educated [8]. Racial or ethnic minorities are less likely to participate in PCC with providers due to a lack of emotional communication, which may influence whether providers use a patient-centered approach, putting patients at risk for persistent health conditions [9]. In addition, studies showed that those with strong self-efficacy in caring for their health as well as those with good overall health reported better PCC from providers [10,11]. However, few variables have been investigated in the existing literature, and the selection of variables is prone to some degree of subjectivity, so it is difficult to investigate the variables of interest comprehensively and objectively.

Machine learning methods are widely used in the medical and health field for drug discovery, disease prediction, and diagnosis [12]; however, few studies have applied machine learning to large sets of variables in PCC research. Existing studies on PCC typically use traditional statistical methods, such as interaction analysis and linear regression, to model a limited set of easily measurable variables to predict PCC [13,14,15]. These simple, cost-effective types of models are often preferred in many settings, including population-wide screening or diagnosis in resource-limited settings [16]. However, as big data has evolved in the medical field, the cost of data collection has decreased, and the scale of data has increased [17]. Although machine learning methods are more complex than traditional statistical methods, they have certain performance advantages on large-scale data [18]. Furthermore, traditional statistical analysis methods and machine learning methods should be complementary [19]. In the literature relevant to the research question of this paper, machine learning methods are increasingly used to identify important predictors of the variables under study [20,21,22,23]. After variable screening or feature extraction, classification or prediction tasks can be performed with high precision and small errors, which can support the timely diagnosis of the prodromal stage of related diseases and other important problems and provide a reference for early intervention and prevention [24]. Therefore, this study combined machine learning methods with large-scale data to identify important PCC predictors.

The present analysis aimed to gain a clearer picture of sociodemographic, healthcare access, and health status variables, their impact on the quality of patient-healthcare provider communication, and to identify significant predictors, based on a national sample. Therefore, based on the extensive data variable set of the Health Information National Trends Survey (HINTS), four machine learning methods were employed to select the important factors for predicting PCC from a set of characteristic variables based on variable importance measures. This study combined the strengths of feature selection, machine learning, and extensive datasets to provide support for more comprehensive predictor identification and prediction of PCC.

2. Materials and Methods

2.1. Data Source

Data from the National Cancer Institute’s 2019–2020 Health Information National Trends Survey (HINTS) were collected. HINTS regularly collects nationally representative data on the American public’s knowledge, attitudes, and use of cancer and health-related information. This study analyzed pooled data from cycles 3 and 4 of HINTS 5. This survey provides an opportunity to examine the perceived PCC levels on a population-level basis. The dataset contains metric variables for PCC and extensive demographic survey data to help identify sociodemographic, lifestyle, and health-related factors associated with respondents’ perceived PCC. The administration of HINTS is approved by the Westat Inc. Institutional Review Board and exempted by the Office of Human Research at the National Institutes of Health. HINTS also offered additional useful information about the survey design and allowed estimates for individual countries.

2.2. Statistical Analysis

The combined data of HINTS 5 Cycles 3 and 4 were used to perform descriptive statistical analysis on relevant variables. The original variable set of HINTS, which includes a very large number of variables, was screened using the t-test for binary and continuous independent variables and the F-test for multi-category variables. The significance threshold was set to 0.1, and the variables with a p-value less than 0.1 were included in the candidate set. In this way, eighteen candidate variables were obtained for modeling. Some of the filtered variables were further processed in this paper (see Supplementary Material Table S1 for details). Four machine learning methods were employed to identify significant predictors of PCC: Generalized Linear Models (GLM), Random Forests, Deep Neural Networks (Deep Learning), and Gradient Boosting Machines (GBM). Variable importance measures were used to identify important predictors, and the metrics corresponding to each algorithm are described in the introduction of each method. Furthermore, the fit and prediction performance of each algorithm were evaluated using the Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Root Mean Squared Logarithmic Error (RMSLE) values under five-fold cross-validation. See Figure 1 for the computational framework of this study. Feature selection consisted of two steps: preliminary screening and identification of important predictors by machine learning. All statistical analyses were performed in R version 4.1.2. R is a commonly used programming language created by statisticians Ross Ihaka and Robert Gentleman. The official R software environment is free, open-source software that is part of the GNU project and is provided under the GNU General Public License.

2.3. Measures

2.3.1. Patient-Centered Communication

The focus variable in this paper is PCC, which is described by seven items in the HINTS. Participants were asked: “The following questions are about your communication with all doctors, nurses, or other health professionals you saw during the past 12 months. How often did they do each of the following: (a) Give you the chance to ask all the health-related questions you had; (b) Give the attention you needed to your feelings and emotions; (c) Involve you in decisions about your health care as much as you wanted; (d) Make sure you understood the things you needed to do to take care of your health; (e) Explain things in a way you could understand; (f) Spend enough time with you; (g) Help you deal with feelings of uncertainty about your health or health care.” Response options included: (1) Always, (2) Usually, (3) Sometimes, and (4) Never. These questions were addressed only to participants who had seen a doctor, nurse, or other health professional in the past 12 months. HINTS created a composite PCC scale based on these items, with values ranging from 0 to 100 and higher scores indicating more positive communication with healthcare providers.
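As a rough illustration of how a 0 to 100 composite could be derived from the seven items, the sketch below reverse-codes each response, rescales it linearly to 0 to 100, and averages across items. The exact HINTS scoring procedure is not described in this paper, so the function name `pcc_score` and the rescaling rule are assumptions made for illustration only.

```python
from statistics import mean

def pcc_score(responses):
    """Hypothetical composite: map 1 (Always)..4 (Never) onto 100..0 and average."""
    if len(responses) != 7:
        raise ValueError("expected answers to the seven PCC items (a)-(g)")
    if any(r not in (1, 2, 3, 4) for r in responses):
        raise ValueError("each response must be coded 1, 2, 3, or 4")
    # Reverse-code so that higher is better, then stretch 0..3 onto 0..100.
    rescaled = [(4 - r) / 3 * 100 for r in responses]
    return mean(rescaled)
```

Under this assumed rule, a respondent answering “Always” to every item scores 100 and one answering “Never” to every item scores 0.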

2.3.2. Demographic Variables and Other Related Variables

Based on the combined data of HINTS 5 Cycles 3 and 4, a total of 143 initial variables were obtained. Some of the obtained variables had the same meaning. In general, this initial set of variables involved sociodemographic characteristics, such as Age, Gender, Education, and Race; variables related to personal health status, such as GeneralHealth, EverHadCancer, Deaf, and OwnAbilityTakeCareHealth; and personal living habits variables, such as UseInternet, DrinkDaysPerWeek, and WeeklyMinutesModerateExercise. The variables were sorted according to the questionnaire section. Because of the large number of variables, they are not listed individually here. Except for PCC, all retained variables were questions answered by all participants; items addressed only to a subgroup, for example the women-only question “How long ago did you have your most recent Pap test to check for cervical cancer”, were not considered in this paper.

2.4. Methods

The four machine learning algorithms GLM, Random Forests, Deep Learning, and GBM were applied to analyze the factors affecting PCC. All four methods provide an importance measure for each input variable, which can be used to evaluate the importance of the variable. For the specific variable importance indicators, see the introduction of each method. The implementation from the R package “h2o” was used for all models. Using grid search and five-fold cross-validation, the optimal hyperparameters of each machine learning model were selected as those yielding the smallest RMSE.
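The selection procedure described above (grid search scored by five-fold cross-validated RMSE) can be sketched in plain Python as follows. This is not the “h2o” implementation used in the study; the helper names, the contiguous unshuffled folds, and the toy model `fit_shrunken_mean` are all assumptions introduced only to make the sketch runnable.

```python
import itertools
import math

def k_fold_indices(n, k=5):
    """Split indices 0..n-1 into k contiguous folds of near-equal size (no shuffling)."""
    if n < k:
        raise ValueError("need at least one observation per fold")
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cv_rmse(fit, xs, ys, params, k=5):
    """Mean validation RMSE of fit(train_x, train_y, **params) over k folds."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        predict = fit([xs[i] for i in train], [ys[i] for i in train], **params)
        squared_errors = [(predict(xs[i]) - ys[i]) ** 2 for i in fold]
        scores.append(math.sqrt(sum(squared_errors) / len(squared_errors)))
    return sum(scores) / k

def grid_search(fit, xs, ys, grid, k=5):
    """Return (best_params, best_rmse) over the Cartesian product of the grid."""
    names = sorted(grid)
    best = None
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = cv_rmse(fit, xs, ys, params, k)
        if best is None or score < best[1]:
            best = (params, score)
    return best

def fit_shrunken_mean(train_x, train_y, shrink):
    """Toy model (illustrative only): always predict a shrunken training mean."""
    level = shrink * sum(train_y) / len(train_y)
    return lambda x: level
```

A call such as `grid_search(fit_shrunken_mean, xs, ys, {"shrink": [0.5, 1.0]})` evaluates every grid point on the same folds and keeps the combination with the smallest average RMSE.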

2.4.1. GLM

The GLM model constructed in this paper adopts the regularization method to solve the problem of overfitting that may occur. Regularization can reduce the variance of prediction errors and deal with correlated predictors by introducing penalty items L_{1} and L_{2} in the form of

\lambda \left( \alpha \| \beta \|_{1} + \frac{1 - \alpha}{2} \| \beta \|_{2}^{2} \right)

during model building. The combination of L_{1} and L_{2} penalties in this algorithm can be parameterized by α and λ. Here, α is calculated by a grid search in the (0, 1) interval, which controls the elastic net penalty distribution between the norms of L_{1} and L_{2}, and the penalty strength is controlled by the parameter λ. Therefore, the best regularized model was constructed by performing an automatic λ search on every value of α set in the (0, 1) interval using the grid search. The final model had a regularization parameter α of 0.65. Variable importance is measured according to the “absolute value of normalization coefficient”.
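To make the role of α and λ in the elastic net penalty concrete, the following minimal sketch evaluates the penalty term for a given coefficient vector. The function name `elastic_net_penalty` is hypothetical, and the sketch covers only the penalty itself, not the model fitting performed by “h2o”.

```python
def elastic_net_penalty(beta, alpha, lam):
    """Return lam * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    l1_norm = sum(abs(b) for b in beta)
    l2_norm_squared = sum(b * b for b in beta)
    return lam * (alpha * l1_norm + (1.0 - alpha) / 2.0 * l2_norm_squared)
```

With α = 1 the expression reduces to a pure L_{1} (lasso) penalty and with α = 0 to a pure L_{2} (ridge) penalty; a value such as the reported α = 0.65 weights the L_{1} part more heavily.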

2.4.2. Random Forests

Random forest is an ensemble learning method that integrates many decision trees into a forest and predicts the final result by majority voting of all trees (or, for a continuous outcome, by averaging their predictions) to ensure the stability of the model. Breiman (2000) combined classification trees into random forests, which improved the prediction accuracy without significantly increasing the amount of computation [25]. Random forest is insensitive to multi-collinearity, robust to missing and unbalanced data, and is a powerful classification and regression tool. In this paper, the three hyperparameters “ntrees”, “max_depth”, and “mtries” of the random forest in the “h2o” package were tuned. Hyperparameter “ntrees” represents the number of trees, set in the range of 10 to 50; “max_depth” represents the specified maximum tree depth, set in the range of 2 to 12; and “mtries” represents the number of variables selected as split candidates at each node of each tree, set in the range of 5 to 30. Variable importance was measured by the “mean decrease gini” indicator.

2.4.3. Deep Learning

The back-propagation algorithm used in the deep learning model is based on gradient descent and is a popular supervised learning algorithm for training feed-forward neural networks. The neural network consists of an input layer, one or more hidden layers, and an output layer. Each neuron in the first layer receives the input vector and computes its activation level as a weighted sum; an activation function is then applied to the activation level to obtain the result. These results are fed to the next layer of neurons. This procedure is continued until the last layer (i.e., the output layer) calculates the result, which is the output vector of the neural network.
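The forward pass described above can be illustrated with a small pure-Python sketch. The bias terms, the function names, and the tiny example network are assumptions for illustration; dropout and back-propagation training are deliberately left out.

```python
def rectifier(z):
    """Rectified linear activation: max(0, z)."""
    return max(0.0, z)

def dense_layer(inputs, weights, biases, activation):
    """One layer: weighted sum (plus bias) per neuron, then the activation function."""
    outputs = []
    for neuron_weights, bias in zip(weights, biases):
        level = sum(w * x for w, x in zip(neuron_weights, inputs)) + bias
        outputs.append(activation(level))
    return outputs

def forward(inputs, layers):
    """Feed inputs through layers, a list of (weights, biases, activation) tuples."""
    signal = inputs
    for weights, biases, activation in layers:
        signal = dense_layer(signal, weights, biases, activation)
    return signal
```

Each entry of `layers` describes one layer, so a network with two hidden layers and one output layer is simply a list of three tuples, with the output of each layer passed on as the input of the next.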

The number of hidden layers and the number of nodes per hidden layer are hyperparameters in the “h2o” package. The number of hidden layers in this paper was set to two or three, the number of nodes in the first two layers was set to 100~200 and 50~100, respectively, and five nodes were used if there was a third layer. The most suitable rectifier with dropout (dropout ratio is 0.5 by default) was selected as the activation function in the deep learning model. The variable importance from the first two layers of the network was calculated using the weight-based Gedeon method, and the top ten variables according to importance were selected among them. In this paper, the final deep learning model consisted of two hidden layers, the first with 200 nodes and the second with 100 nodes.

2.4.4. GBM

GBM is a type of boosting algorithm, which is a machine learning technique for regression and classification problems. This method generates predictive models in the form of an ensemble of weak predictive models, usually decision trees. The basic idea is that multiple weak learners are generated serially, and each weak learner is fitted to the negative gradient of the loss function of the model accumulated so far, so that adding the weak learner moves the accumulated model in the direction that reduces the loss. Different weights are used to linearly combine the base learners so that the better learners can be reused. In this paper, the GBM model in the “h2o” package was used, which builds regression trees sequentially on all features of the dataset in a fully distributed manner, with the construction of each individual tree parallelized. Three hyperparameters, “ntrees”, “max_depth”, and “rate”, were tuned. Hyperparameter “ntrees” indicates the number of trees and was set from 10 to 50, “max_depth” indicates the specified maximum tree depth and was set from 2 to 12, and “rate” represents the specified learning rate, which was set between 0.01 and 0.10. The optimal final parameters were 50 trees, a maximum depth of 5, and a learning rate of 0.10. Variable importance was assessed according to whether the variable was selected for a split during tree construction and how much the squared error changed as a result.
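The serial fitting of weak learners to the negative gradient can be sketched as follows. This is a simplified illustration and not the “h2o” GBM: it assumes squared-error loss (for which the negative gradient is the residual), a single feature, and depth-one trees (stumps) in place of trees with a tuned “max_depth”. The arguments `ntrees` and `rate` mirror the hyperparameters named above, while the function names are hypothetical.

```python
def fit_stump(xs, targets):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    thresholds = sorted(set(xs))[:-1]
    if not thresholds:
        raise ValueError("need at least two distinct feature values")
    best = None
    for threshold in thresholds:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, ntrees=50, rate=0.10):
    """Fit stumps serially to residuals, the negative gradient of 0.5 * squared error."""
    base = sum(ys) / len(ys)  # initial model: the mean of the outcome
    stumps = []
    predictions = [base] * len(ys)
    for _ in range(ntrees):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Shrink each new learner by the learning rate before adding it.
        predictions = [p + rate * stump(x) for p, x in zip(predictions, xs)]
    return lambda x: base + rate * sum(s(x) for s in stumps)

def mse(model, xs, ys):
    """Mean squared error of a fitted model on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)
```

In this sketch, a smaller `rate` makes each tree contribute less, so more trees are needed to reach the same training error, which is the trade-off a grid over “ntrees” and “rate” explores.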

2.4.5. Evaluation Indicators

The model evaluation indicators used in this paper were MAE, RMSE, RMSLE, and MAPE, which are commonly used for prediction problems. Equations (1)–(4) are the calculation formulas for each indicator, where n represents the sample size, and y_{i} and \hat{y}_{i} represent the actual value and predicted value, respectively. MAE measures the mean absolute error and RMSE the square root of the mean squared error between the predicted and true values, and both are appropriate for cases where the error is relatively obvious. RMSLE is a variant of RMSE computed on log-transformed values, so that it reflects the ratio of predicted to actual values rather than their difference; it is primarily used when outliers in a dataset are particularly large. MAPE is the average of the absolute percentage error of each entry in a dataset. The smaller the value of each index, the better the fitting effect.

MAE = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_{i} - y_{i} \right|    (1)

RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \hat{y}_{i} - y_{i} \right)^{2}}    (2)

RMSLE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \log(\hat{y}_{i} + 1) - \log(y_{i} + 1) \right)^{2}}    (3)

MAPE = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{y}_{i} - y_{i}}{y_{i}} \right|    (4)
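Equations (1)–(4) translate directly into code. The following sketch is one possible plain-Python rendering; the function names are chosen here for illustration, and the natural logarithm is assumed in the RMSLE.

```python
import math

def mae(actual, predicted):
    """Equation (1): mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Equation (2): root mean square error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def rmsle(actual, predicted):
    """Equation (3): root mean squared logarithmic error (natural log assumed)."""
    squared = [(math.log(p + 1) - math.log(a + 1)) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(actual))

def mape(actual, predicted):
    """Equation (4): mean absolute percentage error; undefined if any actual value is 0."""
    return sum(abs((p - a) / a) for a, p in zip(actual, predicted)) / len(actual)
```

Note that Equation (4) divides by y_{i}, so it cannot be evaluated for observations whose actual value is exactly 0; how such cases are handled is not stated in the text.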

3. Results

After removing missing data, the combined dataset from HINTS Cycles 3 and 4 yielded a sample of 4593 respondents. The mean value of the dependent variable PCC was 80.59, and the standard deviation was 20.9988. Overall, the level of communication quality between patients and medical staff was good.

In the combined dataset, 143 variables were present in addition to PCC. Because the dependent variable PCC is continuous, the t-test was performed on the continuous and binary variables among the 143 variables in the analysis of variable significance, and the F-test was performed on the multi-categorical variables. Eighteen variables were selected for regression analysis based on significance (p < 0.1), including twelve categorical variables and six continuous variables.

Table 1 provides the descriptive statistics of the remaining eighteen variables after feature selection for the sociodemographic and health-related characteristics of all participants, as well as the results of the significance tests for all independent variables. Overall, 94.27% of individuals are confident that they can access advice or information about cancer when needed, and 72.81% trust information about cancer provided by government health agencies, but more than half do not trust information from charitable organizations. Nearly half of the people in the sample (45.55%) consulted a doctor or healthcare provider first when they needed cancer information. Furthermore, among people who own electronic devices (97.24%), 80.84% of them do not suffer from diabetes or hyperglycemia, and there is little difference in whether individuals have psychological distress (50.99% vs. 49.01%). In addition, 68.28% believed that everything could cause cancer, and 74.51% believed that the quality of medical services they received in the past 12 months was low. In terms of numerical variables, the average weight of the individuals was 181.2077 pounds, they sat for about 6.9375 h per day, and the average age was 54.7037. On average, individuals did at least 173.5256 min of moderate-intensity exercise per week, drank alcohol 3.5785 times per week, and had a mean BMI of 28.5081, which is outside the normal range and is considered overweight.

Following missing value removal and the recoding of some categorical variables into two categories, the eighteen independent variables were re-tested for significance. Among them, six continuous variables (Weight, AverageTimeSitting, Age, BMI, WeeklyMinutesModerateExercise, and AvgDrinksPerWeek) remained highly significant (p < 0.0001), while some categorical variables changed their significance levels after binary classification. Five variables (CancerConfidentGetHealthInf, CancerTrustGov, StrongNeedCancerInfo, HaveDevice_Cat, and EverythingCauseCancer) had test p-values greater than 0.1 for their association with the dependent variable. To investigate as many variables as possible within the relevant scope, all eighteen variables were nevertheless introduced into the models as candidate variables to further identify important variables.

Table 2 shows the top ten important predictors of PCC in the regression analysis performed by the four algorithms. A total of fifteen important predictors were screened out by the four algorithms, including individual sociodemographic characteristics, health-related factors, and living habits. Among them, the variables QualityCare and Weight were identified as important predictors by all four algorithms, and the variable QualityCare had the highest variable importance index value in each algorithm. The variables CancerTrustCharities, EverythingCauseCancer, HealthIns_Other, AverageTimeSitting, WeeklyMinutesModerateExercise, StrongNeedCancerInfo, and AvgDrinksPerWeek also showed high importance and were identified as important predictors by three of the four algorithms.
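The way the per-algorithm top-ten lists are pooled into one predictor set, and then grouped by how many algorithms agree on each variable, can be sketched as below. The function names are hypothetical, and any lists passed in would have to come from the fitted models; no values from Table 2 are reproduced here.

```python
from collections import Counter

def selection_counts(top_lists):
    """Count, for each variable, how many algorithms placed it in their top list."""
    counts = Counter()
    for variables in top_lists.values():
        counts.update(set(variables))  # set() guards against duplicates within a list
    return counts

def selected_by_at_least(top_lists, minimum):
    """Variables appearing in the top lists of at least `minimum` algorithms, sorted by name."""
    counts = selection_counts(top_lists)
    return sorted(name for name, n in counts.items() if n >= minimum)
```

Here `top_lists` is a dictionary mapping an algorithm label to its list of top-ranked variable names; `selected_by_at_least(top_lists, 1)` gives the pooled set of predictors, and raising `minimum` to the number of algorithms keeps only variables on which all of them agree.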

Table 3 evaluates the performance of the four machine learning methods using commonly used regression model performance evaluation metrics (MAE, RMSE, RMSLE, and MAPE). The values of each index are the five-fold cross-validation results obtained by constructing a model based on the top ten important predictors identified by each algorithm. Overall, the prediction effect of the Random Forest model was the best, with the metric values of MAE, RMSE, RMSLE, and MAPE being 14.8905, 18.4192, 0.3701, and 0.2537, respectively.

4. Discussion

This paper comprehensively examined the impact of individual sociodemographic characteristics, living habits, health status, and variables that reflect the attention of individuals to health-related content in terms of PCC. Four machine learning methods were used to screen for significant predictors by variable importance measures. The study of the current status of PCC and exploration of the influencing factors that affect its degree is critical for patient-centered care. It is beneficial to mobilize the enthusiasm of patients to participate in nursing and treatment to better meet the psychological expectations and feelings of the patients [26,27].

A total of fifteen significant predictors related to patient communication were derived by four machine learning approaches, based on the extensive variable set of the HINTS database. The results showed that the significant predictors identified relate to various aspects of the individual. Socio-demographic characteristics reflect the basic characteristics of individuals, which often affect the living habits, behavior patterns, and communication attitudes of an individual. Although the conclusions on the relationship between various sociodemographic characteristics and PCC are inconsistent [28,29,30], sociodemographic characteristics are an important aspect affecting PCC, which is consistent with the findings of this study.

From the perspective of sociodemographic characteristics, the important predictors of PCC screened in this paper include age, weight, and BMI among other indicators. Age is accompanied by personal growth and life experience, which can change personal characteristics and affect personal communication ability. The perception of medical interaction varies with age, so age can have an impact on PCC [30,31]. Some studies suggest that, on average, obese patients may also receive less patient-centered care than non-obese patients, and that both weight and BMI have an impact on the quality of patient communication [32,33,34]. However, unlike most studies, common indicators such as gender, income, and marital status are missing from the sociodemographic characteristics screened in the present paper. A potential reason is that, unlike the subjective selection of variables in previous studies, the variables selected in this paper are extensive and objectively selected based on statistical methods. Furthermore, the relationship between sociodemographic characteristics and PCC may be complex. Thus, more research is needed in this area.

Consistent with previous literature, health-related predictors associated with PCC include mental health status (QualityCare), cancer-related perceptions (the belief that anything can cause cancer), the way health information is queried (StrongNeedCancerInfo), and whether people believe the cancer information published by the relevant agencies (CancerTrustCharities) [35,36,37,38]. QualityCare had the highest relative importance value in all four machine learning methods in this paper. An important relationship is present between the quality of care an individual receives and PCC, as patients who receive better care can be motivated to communicate with their healthcare providers [39].

As novel findings, the study found that a high degree of linguistic isolation affects PCC, possibly because limited spoken English can impede communication between patients and healthcare personnel. Exercise time, drinking frequency, and sitting time also affect PCC. This may be because, on the one hand, personal living habits can affect a person’s attitude to life, which in turn affects personality and therefore communication habits. On the other hand, personal living habits are closely related to physical health. Bad living habits can lead to chronic diseases, which in turn will have an impact on PCC. These findings suggest that not only common factors, such as individual sociodemographic characteristics and health status, but also individual living habits should be further considered when conducting PCC prediction. Patient-centered care should also take full account of the health, hygiene, exercise, and other multi-dimensional conditions of the patient.

Briefly, our contributions can be summarized in the following points. First, after reviewing existing research on the PCC problem, to the best of our knowledge, this is the first study introducing machine learning methods to PCC predictor identification. Second, the variable selection method is relatively objective. Previous studies on PCC issues have subjectively introduced predictors, and the research conclusions have a certain degree of subjectivity. In this paper, four machine learning methods are used to select variables objectively according to variable importance indicators. Third, there is a wide range of variable choices. Comprehensive experiments on the HINTS database demonstrate that our adopted methods can effectively discover important predictors in large-scale variables regarding patients without the intervention of human experts. Large-scale variable sets can not only guarantee the objectivity of variable selection to a certain extent but also facilitate the discovery of some novel variable relationships and impact patterns.

A known limitation of non-linear and ensemble machine learning algorithms is their poor interpretability. While the predictors of PCC were identified here, it was difficult to explain the direction and magnitude of the influence of each variable. In addition, the machine learning algorithms used in this paper did not achieve strong prediction performance and need to be further tuned. Therefore, the predictors identified in this paper should be evaluated in conjunction with other well-interpretable models or clinical evidence. Further efforts need to combine the important predictors obtained in this paper with other easy-to-interpret models to analyze the direction and mechanism of influence of the newly obtained predictors on PCC. On this basis, a decision support system can be constructed to provide support for doctors or clinical staff. Understanding the sociodemographic and health-related factors that predict PCC can help to achieve more accurate predictions of PCC problems and enhance the quality of communication between doctors and patients to support medical staff in providing higher-quality medical services to patients.

5. Conclusions

Based on the National Cancer Institute’s 2019–2020 Health Information National Trends Survey (HINTS) database, four machine learning methods were hereby used to identify important predictors of PCC from a wide range of data sets based on variable importance measures. A total of fifteen significant predictors were obtained, involving multiple dimensions, such as personal sociodemographic characteristics, living habits, and concerns about health problems. Notably, this paper identified four novel potentially relevant variables, an individual’s level of verbal expression, exercise habits, etc., which significantly impacted respondents’ perceived PCC quality. Understanding the sociodemographic and health-related factors that predict PCC can help researchers make high-precision predictions of PCC problems and also provide references for improving the level of communication between medical staff and patients to provide optimal care.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/pr10122484/s1, Table S1: Variables extracted from the HINTS database for use in this study.

Author Contributions

Methodology, S.W. and L.Q.; software, H.L., Y.W.; formal analysis, X.Z. and P.C.; writing—original draft preparation, X.Z., P.C. and H.L.; writing—review and editing, B.-C.S., M.-C.C. and L.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Youth Project of National Social Science Fund of China (grant 21CTJ008).

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Westat Institutional Review Board of the U.S. National Institutes of Health Office of Human Subjects Research Protections.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found here: https://hints.cancer.gov/data/download-data.aspx (accessed on 20 November 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature

PCC	Patient-Centered Communication
HINTS	Health Information National Trends Survey
GLM	Generalized Linear Models
GBM	Gradient Boosting Machine
MAE	Mean Absolute Error
RMSE	Root Mean Square Error
RMSLE	Root Mean Squared Logarithmic Error
MAPE	Mean Absolute Percentage Error

Figure 1. Machine learning framework.

Table 1. Distribution of the variable characteristics in the HINTS database.

Categorical Variables	N	%	p-Value
HIGHSPANLI			0.0957
Yes	368	8.01
No	4225	91.99
HISP_HH			0.0858
Yes	727	15.83
No	3866	84.17
CancerConfidentGetHealthInf			0.8343
High_level	4330	94.27
Low_level	263	5.73
CancerTrustGov			0.1925
High_level	3344	72.81
Low_level	1249	27.19
CancerTrustCharities			0.0055
High_level	2048	44.59
Low_level	2545	55.41
StrongNeedCancerInfo			0.9378
Doctor or health care provider	2092	45.55
Internet	1817	39.56
Other-Specify	684	14.89
HaveDevice_Cat			0.8391
Yes	4466	97.24
No	127	2.77
HealthIns_Other			0.0326
Yes	422	9.19
No	4171	90.81
MedConditions_Diabetes			0.0706
Yes	880	19.16
No	3713	80.84
Psychological_Distress			0.0907
Yes	2342	50.99
No	2251	49.01
EverythingCauseCancer			0.1261
Agree	3136	68.28
Disagree	1457	31.72
QualityCare			<0.0001
High_level	1171	25.50
Low_level	3422	74.51
Numeric Variables	Mean	SD	p-value
Weight	181.2077	45.2559	<0.0001
AverageTimeSitting	6.9375	3.7328	<0.0001
Age	54.7037	16.5867	<0.0001
BMI	28.5081	6.4164	<0.0001
WeeklyMinutesModerateExercise	173.5256	311.4200	<0.0001
AvgDrinksPerWeek	3.5785	7.4883	<0.0001
PCCScale	80.5900	20.9988	——

Table 2. Predictors of patient-centered communication built using GLM, Random Forests, Deep Learning, and GBM.

GLM		Random Forests		Deep Learning		GBM
Predictor	Importance	Predictor	Importance	Predictor	Importance	Predictor	Importance
QualityCare	100.00	QualityCare	100.00	QualityCare	100.00	QualityCare	100.00
CancerTrustCharities	7.95	BMI	3.26	AvgDrinksPerWeek	44.19	BMI	11.87
EverythingCauseCancer	4.00	Age	2.19	HaveDevice_Cat	43.43	Weight	7.40
Psychological_Distress	3.70	WeeklyMinutesModerateExercise	1.94	Psychological_Distress	43.36	Age	6.28
HealthIns_Other	3.43	Weight	1.86	Weight	42.92	WeeklyMinutesModerateExercise	5.85
HISP_HH	3.20	AverageTimeSitting	1.54	HealthIns_Other	42.75	AverageTimeSitting	5.43
AverageTimeSitting	2.84	AvgDrinksPerWeek	0.93	EverythingCauseCancer	42.67	AvgDrinksPerWeek	2.36
WeeklyMinutesModerateExercise	1.81	CancerTrustCharities	0.77	StrongNeedCancerInfo	41.86	CancerTrustCharities	2.26
Weight	1.70	HIGHSPANLI	0.54	HISP_HH	40.71	StrongNeedCancerInfo	1.94
StrongNeedCancerInfo	1.47	HealthIns_Other	0.33	BMI	40.60	EverythingCauseCancer	1.74
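The four rankings in Table 2 can be compared with a short script. The sketch below transcribes the top-10 lists from the table, counts how many models list each predictor, and computes a mean rank. Taking the union of the four lists gives fifteen distinct predictors, which agrees with the count reported in the text; whether the authors derived their fifteen in exactly this way is an assumption, and the mean-rank summary is an illustrative choice rather than a measure used in the paper.

```python
from collections import Counter

# Top-10 predictors per model, transcribed from Table 2 (list order = rank).
top10 = {
    "GLM": ["QualityCare", "CancerTrustCharities", "EverythingCauseCancer",
            "Psychological_Distress", "HealthIns_Other", "HISP_HH",
            "AverageTimeSitting", "WeeklyMinutesModerateExercise", "Weight",
            "StrongNeedCancerInfo"],
    "Random Forests": ["QualityCare", "BMI", "Age",
                       "WeeklyMinutesModerateExercise", "Weight",
                       "AverageTimeSitting", "AvgDrinksPerWeek",
                       "CancerTrustCharities", "HIGHSPANLI", "HealthIns_Other"],
    "Deep Learning": ["QualityCare", "AvgDrinksPerWeek", "HaveDevice_Cat",
                      "Psychological_Distress", "Weight", "HealthIns_Other",
                      "EverythingCauseCancer", "StrongNeedCancerInfo",
                      "HISP_HH", "BMI"],
    "GBM": ["QualityCare", "BMI", "Weight", "Age",
            "WeeklyMinutesModerateExercise", "AverageTimeSitting",
            "AvgDrinksPerWeek", "CancerTrustCharities",
            "StrongNeedCancerInfo", "EverythingCauseCancer"],
}

# Number of top-10 lists in which each predictor appears.
counts = Counter(p for plist in top10.values() for p in plist)
union = sorted(counts)
in_all_models = sorted(p for p, c in counts.items() if c == len(top10))

# Mean rank over the models that list the predictor (1 = most important).
mean_rank = {
    p: sum(plist.index(p) + 1 for plist in top10.values() if p in plist) / counts[p]
    for p in counts
}

print(len(union), "distinct predictors")
print("listed by all four models:", in_all_models)
for p in sorted(counts, key=lambda p: (-counts[p], mean_rank[p])):
    print(f"{p:32s} models={counts[p]} mean_rank={mean_rank[p]:.2f}")
```

Importance scores are scaled within each model (the top predictor is set to 100), so the raw values are not comparable across columns. Counting appearances and ranks avoids mixing those scales.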

Table 3. Regression metrics for each machine learning method.

Criterion	GLM	Random Forests	Deep Learning	GBM
MAE	14.9083	14.8905	14.9805	14.9805
RMSE	18.3934	18.4192	18.4556	18.5557
RMSLE	0.3718	0.3701	0.3725	0.3710
MAPE	0.2664	0.2537	0.2716	0.2408
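For readers who want to reproduce the criteria in Table 3 on their own predictions, the following sketch implements the four metrics using their common textbook definitions. These definitions are assumptions and may differ in detail from the ones the authors used; in particular, RMSLE is written here with log(1 + x), and MAPE is returned as a fraction because the values in Table 3 appear to be on that scale. The example numbers are made up.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Square Error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (log1p variant; values must be > -1)."""
    return math.sqrt(
        sum((math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred))
        / len(y_true)
    )


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error as a fraction; undefined if any y_true is 0."""
    if any(t == 0 for t in y_true):
        raise ValueError("MAPE is undefined when a true value is 0")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up scores on a 0-100 scale, for illustration only.
y_true = [80.0, 100.0, 50.0]
y_pred = [70.0, 90.0, 60.0]

for name, fn in [("MAE", mae), ("RMSE", rmse), ("RMSLE", rmsle), ("MAPE", mape)]:
    print(f"{name}: {fn(y_true, y_pred):.4f}")
```

MAE and RMSE are in the units of the outcome, while RMSLE and MAPE are relative measures, which is why the two pairs of rows in Table 3 sit on very different scales.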

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
