**Evaluation of Transfer Learning with Deep Convolutional Neural Networks for Screening Osteoporosis in Dental Panoramic Radiographs**

### **Ki-Sun Lee 1,2,3,\*, Seok-Ki Jung 4, Jae-Jun Ryu 5, Sang-Wan Shin 6,7 and Jinwook Choi 1,8,\***


Received: 25 December 2019; Accepted: 30 January 2020; Published: 1 February 2020

**Abstract:** Dental panoramic radiographs (DPRs) provide information that can potentially be used to evaluate bone density changes through a textural and morphological feature analysis of the mandible. This study aims to evaluate the discriminating performance of deep convolutional neural networks (CNNs), employed with various transfer learning strategies, on the classification of specific features of osteoporosis in DPRs. For objective labeling, we collected a dataset containing 680 images from different patients who underwent both skeletal bone mineral density and digital panoramic radiographic examinations at the Korea University Ansan Hospital between 2009 and 2018. Four study groups were used to evaluate the impact of various transfer learning strategies on deep CNN models as follows: a basic CNN model with three convolutional layers (CNN3), the visual geometry group deep CNN model (VGG16), a transfer learning model from VGG16 (VGG16-TR), and a fine-tuned transfer learning model (VGG16-TR-FT). The best performing model achieved an overall area under the receiver operating characteristic curve of 0.858. In this study, transfer learning and fine-tuning improved the performance of a deep CNN for screening osteoporosis in DPR images. In addition, using the gradient-weighted class activation mapping technique, a visual interpretation of the best performing deep CNN model indicated that the model relied on image features in the lower left and right borders of the mandible. This result suggests that deep learning-based assessment of DPR images could be useful and reliable in the automated screening of osteoporosis patients.

**Keywords:** osteoporosis screening; artificial intelligence; convolutional neural networks; dental panoramic radiographs

#### **1. Introduction**

Osteoporosis is a systemic disease characterized by low bone mineral density (BMD) and micro-architectural deterioration of bone structure, leading to compromised bone strength and, consequently, an increased risk of fracture [1]. Hip, spine, and wrist fractures caused by osteoporosis often lead to disorders that reduce the quality of life of the patient and, in severe cases, increase the risk of mortality [2,3]. With rapid population aging and increasing life expectancy, osteoporosis is increasingly becoming a global public health issue; it has been estimated that more than 200 million people are suffering from osteoporosis [4]. According to recent statistics from the International Osteoporosis Foundation, approximately one in three women over the age of 50 will experience osteoporotic fractures, as will one in five men over the age of 50 [4–7]. Moreover, it is expected that more people will be affected by osteoporosis in the future and, consequently, that the rate of osteoporotic fractures will increase [8]. This is because the disease initially develops without any symptoms, remains undiagnosed due to scarce symptomatology, and its first manifestation is often a low-energy fracture of the long bones or vertebrae [9].

Generally, osteoporosis is diagnosed by evaluating bone mineral density (BMD) measurements (expressed as a T-score) using dual-energy X-ray absorptiometry (DXA), which is considered the reference-standard examination for BMD assessment [10,11]. However, this technique is complex and expensive, and its availability is limited for population-wide diagnosis [12]. Recently, digital images of dental panoramic radiographs (DPRs) have been evaluated as cost-effective and important image data for osteoporosis screening. This is because panoramic radiography is widely used in dental care for elderly patients with increased life expectancy, and a number of studies have demonstrated the feasibility of BMD estimation and osteoporosis screening using panoramic radiography [13–23].

However, previous approaches primarily relied on manually categorized feature indexes [13–23], such as the gonion index, mandibular cortical index, mental index, and panoramic mandibular index, and on traditional machine learning (ML) classifiers, such as support vector machines (SVMs) [22] and fuzzy classifiers [23], for screening osteoporosis. Although these handcrafted feature indexes provided sufficient evidence for assisting osteoporosis screening using panoramic radiographs, such discriminating features are of a low order and do not fully characterize the heterogeneous patterns in radiographic images. In addition, most previous studies required tedious manual operations, such as extensive preprocessing, image normalization, and region of interest (ROI) segmentation, which can significantly affect the repeatability of the classification method.

In the last few years, deep learning algorithms, particularly deep convolutional neural network (CNN) architectures, have been widely recognized as a reliable approach to learning classification features directly from original medical images [24,25]. As opposed to ML approaches that rely on explicitly defined features, deep CNNs are a class of deep neural networks that can learn high-dimensional features to maximize the network's ability to discriminate abnormalities among images [26]. Many different CNN architectures have been designed to perform image classification and recognition. Each of these architectures differs in specific aspects, including the number and size of layers, the connections between these layers, and the overall network depth. Because different network architectures are best suited for different problems, and it is difficult to know in advance which architecture is the right choice for a given task, empirical examination is often recognized as the best way to make these decisions [27].

Although deep CNNs have been recognized as efficient tools for image classification, they require a large amount of training data, which can be difficult to obtain for medical radiographic images. When the target dataset is significantly smaller than the base dataset, transfer learning is considered a powerful technique for training deep CNNs without overfitting [28,29]. The general process of transfer learning with pretrained models consists of two steps: first, the first n layers of a base network pretrained on a large general dataset are copied to the first n layers of a target network; second, the remaining layers of the target network are randomly initialized and trained on a small local dataset for the target task [28]. On the basis of transfer learning techniques, several studies have reported state-of-the-art results in both general image classification [30–32] and medical image classification [33–36]. However, few studies have developed and evaluated transfer learning-based deep CNN models for predicting osteoporosis in DPRs.

The aim of this study is to develop and evaluate deep learning approaches for screening osteoporosis with DPR images. Using panoramic radiograph images classified according to the BMD value (T-score), this study evaluated several different CNN models in terms of osteoporosis discriminating accuracy. In addition, we quantitatively evaluated the effect of transfer learning and fine-tuning of a deep CNN model on classification performance.

#### **2. Patients and Methods**

#### *2.1. Patients*

The study was done on a total of 680 panoramic radiograph images from 680 different patients who visited the Korea University Ansan Hospital. The patients simultaneously underwent skeletal BMD examinations and digital panoramic radiography evaluations within four months, between 2009 and 2018. The subjects were classified into a non-osteoporosis group (T-score ≥ −2.5) or osteoporosis group (T-score < −2.5), according to the World Health Organization criteria [37], into which 380 and 300 subjects were assigned, respectively. The dataset was divided into training and test sets as follows: The radiographs were selected randomly, and 136 radiographs (20% of the total), 68 each from the osteoporosis and non-osteoporosis groups, were set aside as a test set. This ensured that the testing data set only contained images of novel radiographs that had not been encountered by the model during training. The remaining 544 radiographs were used for the training and validation set. This study protocol was approved by the institutional review board of the Korea University Ansan Hospital (no. 2019AS0126).
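The split described above can be sketched in a few lines of Python. The helper below is a hypothetical illustration (not the authors' code), using the study's group sizes of 300 osteoporosis and 380 non-osteoporosis radiographs and a fixed seed for reproducibility.

```python
import random

def split_test_set(osteo_ids, non_osteo_ids, n_per_class=68, seed=42):
    """Randomly hold out n_per_class radiographs from each group as the test set;
    everything else becomes the training/validation pool."""
    rng = random.Random(seed)
    test = rng.sample(osteo_ids, n_per_class) + rng.sample(non_osteo_ids, n_per_class)
    held_out = set(test)
    train = [i for i in osteo_ids + non_osteo_ids if i not in held_out]
    return train, test

# Study group sizes: 300 osteoporosis, 380 non-osteoporosis patients.
train_ids, test_ids = split_test_set(list(range(300)), list(range(300, 680)))
# len(test_ids) == 136 (20% of 680), len(train_ids) == 544
```

Holding out an equal number of images per class, as the study does, keeps the test set balanced even though the full dataset is not.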

#### *2.2. Data Preprocessing*

The dimensions of the collected dental X-ray images varied from 1348 to 2820 pixels in width and 685 to 1348 pixels in height. For consistency of image preprocessing, the images were downsampled to a uniform size of 1200 × 630 pixels, using bilinear interpolation. The final ROI was restricted to the lower part of the mandible, below the teeth-containing alveolar bone, for an image size of 700 × 140 pixels (Figure 1). This ROI encompassed most of the regions used in previous studies [13–23], which applied various classification techniques by indexing the image feature characteristics of small, specific regions of the mandible in detail. By setting the ROI to include most of the mandible instead of a specific area, this study could evaluate, through explainable deep learning techniques, the area that plays the most distinctive role in osteoporosis classification.
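The two preprocessing steps above (bilinear downsampling and a fixed-size ROI crop) can be sketched in NumPy. This is an illustrative re-implementation, not the authors' pipeline, and the ROI offsets in the example are hypothetical, since the paper specifies only the final 700 × 140 size.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Downsample a 2-D grayscale image with bilinear interpolation."""
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0, x0 = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y1, x1 = np.minimum(y0 + 1, in_h - 1), np.minimum(x0 + 1, in_w - 1)
    wy, wx = (ys - y0)[:, None], (xs - x0)[None, :]
    a, b = img[np.ix_(y0, x0)], img[np.ix_(y0, x1)]
    c, d = img[np.ix_(y1, x0)], img[np.ix_(y1, x1)]
    return (a * (1 - wx) + b * wx) * (1 - wy) + (c * (1 - wx) + d * wx) * wy

def crop_roi(img, top, left, height=140, width=700):
    """Crop the fixed-size mandibular ROI from the resized radiograph."""
    return img[top:top + height, left:left + width]

resized = bilinear_resize(np.random.rand(685, 1348), 630, 1200)
roi = crop_roi(resized, top=430, left=250)  # ROI offsets are illustrative only
```

In practice the ROI position would be set per image so that it sits below the alveolar bone, as in Figure 1.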

**Figure 1.** Image preprocessing for this study. The original DPRs were downsampled, and the ROI is restricted to the mandibular region below the teeth (region inside the bounding box). DPR, dental panoramic radiograph; ROI, region of interest.

#### *2.3. Convolutional Neural Networks*

This study employed four CNN study groups as follows: a basic CNN model with three convolutional layers (CNN3), a visual geometry group deep CNN model with no pre-trained weights (VGG16), a transfer learning model from VGG16 with pre-trained weights (VGG16-TR), and a transfer learning and fine-tuning model from VGG16 with pre-trained weights (VGG16-TR-FT). These four CNN models (CNN3, VGG16, VGG16-TR, and VGG16-TR-FT) are depicted in the block diagram in Figure 2.

**Figure 2.** Schematic diagrams of the four convolutional neural networks (CNN) architectures evaluated in this study.

The VGG16 [31] architecture was chosen because it has been widely adopted and recognized as state-of-the-art in both general and medical image classification tasks [24]. Additionally, it has been trained on large-scale datasets, so that a transfer learning approach could be adopted for large-scale image recognition [38]. For the VGG16 architecture, the following three experimental groups were evaluated: the native group (VGG16), the transfer learning group (VGG16-TR), and the transfer learning with fine-tuning group (VGG16-TR-FT). In the native version, model weights were randomly initialized, and training was conducted using only the DPR data described in this study. In the transfer learning version, model weights were fixed based on pre-training with a general image dataset, except for the final fully connected layers, which were randomly initialized. In the transfer learning with fine-tuning version, model weights were likewise initialized based on pre-training on a general image dataset, except that some of the last convolutional blocks were unfrozen so that their weights were updated in each training step. In this study, the two transfer learning models (VGG16-TR and VGG16-TR-FT) employed pre-trained weights from the ImageNet database [38]. ImageNet is a large-scale image dataset covering thousands of different object categories, used to train and evaluate image classification models.
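The three VGG16 variants can be sketched with Keras, the library used in this study. This is a minimal illustration under stated assumptions, not the authors' published code: the input shape, the 256-unit dense head, and the choice to unfreeze only the `block5` layers in the fine-tuned variant are hypothetical, since the paper says only that "some of the last blocks" were unfrozen.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_vgg16_variant(mode, input_shape=(140, 700, 3), weights="imagenet"):
    """Sketch of the VGG16 study groups: 'scratch' (VGG16), 'transfer'
    (VGG16-TR), and 'fine_tune' (VGG16-TR-FT)."""
    if mode == "scratch":
        weights = None                              # random initialization
    base = tf.keras.applications.VGG16(weights=weights, include_top=False,
                                       input_shape=input_shape)
    if mode == "transfer":
        base.trainable = False                      # freeze every convolutional block
    elif mode == "fine_tune":
        for layer in base.layers:                   # unfreeze only the last block
            layer.trainable = layer.name.startswith("block5")
    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),       # new, randomly initialized head
        layers.Dense(1, activation="sigmoid"),      # binary osteoporosis output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

Passing `weights=None` skips the ImageNet download, which is convenient for structural checks; the actual transfer learning groups would load the pre-trained ImageNet weights.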

#### *2.4. Model Training*

The 544 images selected as the training dataset were randomly divided into five folds. This was done to perform 5-fold cross-validation to evaluate model training while avoiding overfitting and bias [39]. Within each fold, the dataset was partitioned into independent training and validation sets, using an 80:20 split. The validation set was a fold completely independent of the other training folds, and it was used to monitor training status. After each training run was completed, a different independent fold was used as the validation set and the previous validation fold was returned to the training set. An overview of the 5-fold cross-validation performed in this study is presented in Figure 3.
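The fold rotation described above can be sketched as follows; this is a hypothetical pure-Python helper, not the authors' code.

```python
import random

def five_fold_splits(image_ids, n_folds=5, seed=0):
    """Yield (train, validation) splits: each fold serves once as the
    validation set while the remaining folds form the training set."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    folds = [ids[i::n_folds] for i in range(n_folds)]   # n roughly equal folds
    for k in range(n_folds):
        val = folds[k]
        train = [x for j, fold in enumerate(folds) if j != k for x in fold]
        yield train, val

# One model is trained per split, e.g. on the 544 training images:
for train_ids, val_ids in five_fold_splits(range(544)):
    pass  # build a fresh model, train on train_ids, monitor on val_ids
```

Every image thus appears in exactly one validation fold, so each training run is monitored on data it has not seen.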

**Figure 3.** The overview of the performed 5-fold cross validation in this study.

This process was repeated for each architecture (CNN3, VGG16, VGG16-TR, and VGG16-TR-FT). All models were trained and evaluated on a 64-bit Windows 10 operating system, with 64 GB memory and an NVIDIA Quadro P4000 GPU. Building, training, validation, and prediction of deep learning models were performed using the Keras [40] library and TensorFlow [41] backend engine.

#### *2.5. Performance Evaluation*

The evaluation of the screening performance of the CNN models was performed with the independent test dataset in each cross-validation fold. To comprehensively evaluate the screening performance on the test dataset, the accuracy, sensitivity, specificity, receiver operating characteristic (ROC) curve, and precision recall (PR) curve were calculated. The accuracy, sensitivity, and specificity score can be calculated as follows:

$$\text{accuracy} = \frac{\text{TP} + \text{TN}}{\text{TP} + \text{TN} + \text{FN} + \text{FP}}$$

$$\text{sensitivity} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$

$$\text{specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$

TP (true positive) and FN (false negative) denote the numbers of osteoporosis images predicted correctly and incorrectly, respectively; similarly, TN (true negative) and FP (false positive) denote the numbers of non-osteoporosis images predicted correctly and incorrectly, respectively. The area under the ROC curve (AUC) was also calculated.
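The formulas above translate directly into code. The sketch below is a hypothetical helper, written for clarity rather than speed; it also computes the AUC via its rank-sum (Mann-Whitney) formulation rather than by integrating the ROC curve.

```python
def screening_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity from binary labels
    (1 = osteoporosis, 0 = non-osteoporosis), following the formulas above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / (tp + tn + fn + fp)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

def roc_auc(y_true, y_score):
    """AUC of the ROC curve via the rank-sum (Mann-Whitney U) identity:
    the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# e.g. screening_metrics([1, 1, 0, 0], [1, 0, 0, 1]) -> (0.5, 0.5, 0.5)
```

In practice, library routines such as those in scikit-learn compute the same quantities, but the explicit version makes the definitions above concrete.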

#### *2.6. Visualizing Model Decisions*

Deep learning models have often been referred to as non-interpretable black boxes because it is difficult to know the process by which they make predictions. To examine the decision-making process of the model, and to determine which features are most important for screening osteoporosis in DPR images, this study employed the gradient-weighted class activation mapping (Grad-CAM) technique [42] to highlight the regions most significant for screening osteoporosis in DPR images.
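Grad-CAM admits a compact implementation in the Keras/TensorFlow stack used in this study. The sketch below is a minimal, hypothetical version (not the authors' code): it weights the chosen convolutional layer's feature maps by the spatially averaged gradients of the class score and keeps only the positive evidence. For the VGG16-based models, `conv_layer_name` would typically be the last convolutional layer (e.g., `block5_conv3`).

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, conv_layer_name):
    """Minimal Grad-CAM sketch: weight the chosen convolutional layer's
    feature maps by the gradient of the predicted score, keep positive
    contributions, and normalize to [0, 1] for overlaying on the image."""
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output,
                                 model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]                          # osteoporosis class score
    grads = tape.gradient(score, conv_out)
    weights = tf.reduce_mean(grads, axis=(1, 2))     # global-average-pool gradients
    cam = tf.reduce_sum(conv_out * weights[:, tf.newaxis, tf.newaxis, :], axis=-1)
    cam = tf.nn.relu(cam)[0].numpy()                 # keep positive evidence only
    return cam / (cam.max() + 1e-8)
```

The returned heatmap has the spatial resolution of the chosen convolutional layer; it would be upsampled to the input size before being superimposed on the radiograph, as in Figures 5 and 6.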

#### **3. Results**

#### *3.1. Baseline Clinical and Demographic Characteristics of the Subjects*

The patients comprised 565 females and 115 males, with an age range of 27 to 90 years (mean age of 63.0 years). There were 380 patients (mean age 58.5) without osteoporosis (T-score ≥ −2.5) and 300 patients (mean age 68.6) with osteoporosis (T-score < −2.5). The clinical characteristics of the DPR dataset used in this study are summarized in Table 1.

**Table 1.** Clinical and demographic characteristics of the dental panorama radiographs (DPRs) dataset in this study.


#### *3.2. Prediction Performance*

The CNN models of this study were trained using a cross-entropy loss function on the selected training image dataset. The screening performances of the four CNN models tested in this study are displayed in Table 2. The transfer learning and fine-tuning VGG16 model with pre-trained weights (VGG16-TR-FT) achieved the top performance, with the highest AUC of 0.858 (95% CI 0.850 to 0.865), sensitivity of 0.900 (95% CI 0.881 to 0.919), specificity of 0.815 (95% CI 0.783 to 0.847), and accuracy of 0.840 (95% CI 0.822 to 0.857). The screening performances of the model that applied transfer learning without fine-tuning (VGG16-TR) and the model trained from scratch without pre-trained weights (VGG16) were slightly lower. The arbitrarily established model with three convolutional layers (CNN3) achieved the lowest performance, with an AUC of 0.667 (95% CI 0.626 to 0.708), sensitivity of 0.684 (95% CI 0.480 to 0.889), specificity of 0.649 (95% CI 0.486 to 0.813), and accuracy of 0.660 (95% CI 0.594 to 0.725).

**Table 2.** Osteoporosis screening accuracy of convolutional neural network models in this research.


Figure 4 shows the ROC curves of all tested models. The VGG16-TR-FT model achieved the highest AUC of 0.86, while the CNN3 model achieved the lowest AUC of 0.61. Figure 5 shows the PR curves of the tested CNN models; the VGG16-TR-FT model likewise achieved the highest area under the PR curve of 0.86, while the CNN3 model achieved the lowest of 0.61.

**Figure 4.** Mean ROC curves of each CNN model for screening osteoporosis on DPR images in this study.

**Figure 5.** Original and Grad-CAM sample images correctly predicted by the best-performing deep CNN model (VGG16-TR-FT) for DPR image-based osteoporosis screening. Below each original sample image, a Grad-CAM image is superimposed over the original image. The bright red in each Grad-CAM image indicates the region that has the greatest impact on screening osteoporosis patients.

#### *3.3. Visualizing Model Decisions*

Figures 5 and 6 illustrate case examples of predictions by the best-performing VGG16-TR-FT model as compared with the ground truth. Each case example employed the Grad-CAM technique to provide a visual interpretation of which areas affected the deep CNN's classification. In the case of correctly screened osteoporosis (Figure 5A), the region showing the weak lower border of the mandibular cortical bone and the less dense, spongy bone texture at its periphery was extracted as the main image feature for classification. In correctly screened cases of no osteoporosis (Figure 5B), the region showing the strong lower border of the mandibular cortical bone and the dense texture around its periphery was extracted as the main image feature for classification. However, in the incorrectly screened cases, i.e., the non-osteoporosis case predicted as osteoporosis (Figure 6A) or the osteoporosis case predicted as non-osteoporosis (Figure 6B), the central region of the mandible or the ghost images of the hyoid bone were extracted as the main image features.

**Figure 6.** Original and Grad-CAM sample images incorrectly predicted by the best-performing deep CNN model (VGG16-TR-FT) for DPR image-based osteoporosis screening. Below each original sample image, a Grad-CAM image is superimposed over the original image. The bright red in each Grad-CAM image indicates the region that has the greatest impact on screening osteoporosis patients.

#### **4. Discussion**

Although DPRs are commonly performed for the evaluation of the dentition and adjacent structures of the jaw, some computer-aided diagnosis (CAD) systems based on DPRs have been suggested for screening systemic diseases, such as osteoporosis and carotid artery calcification [13–23,43]. However, the approaches of most previous studies are only valid when image features are accurately extracted using sophisticated, manual image preprocessing algorithms or techniques. If a DPR image is imported from an unfamiliar environment or unexpected noise is added to the image, the prediction can easily be distorted. Neural network algorithms can resolve this problem: all the knowledge necessary for diagnosis is established only from the given training image data, without complicated or sophisticated image preprocessing. In recent years, a cutting-edge neural network technology, called deep learning, has been applied to medical imaging analysis and has shown a level of performance that is equal to or better than that of a clinician. As mentioned above, most previous CAD system studies, which used manual or sophisticated image preprocessing and machine learning algorithms for the screening of osteoporosis based on DPRs, presented variable diagnostic performances in terms of sensitivity and specificity [13–23]. Recently, a deep learning-based osteoporosis prescreening study reporting a very high AUC (0.9763 to 0.9991) and accuracy (92.5% to 98.5%) was published [44]. However, in that study, osteoporosis labeling was performed subjectively by dental specialists, rather than by the BMD score (T-score), which is the gold standard for diagnosing osteoporosis. In addition, the study did not visually interpret the decisions of the trained CNN model, and, because it used five arbitrarily established convolutional layers, the reproducibility of the deep CNN model is limited.

The first major finding of the present study was that applying appropriate transfer learning and fine-tuning techniques to pre-trained deep CNN architectures achieved a DPR-based osteoporosis screening performance equivalent to that of previous studies, even with a small image dataset, without complex image preprocessing or narrow image ROI settings. According to Table 2 and Figure 4, the CNN3 group, having only three arbitrarily established convolutional layers, showed the lowest true-positive screening performance and accuracy among the experimental groups. On the basis of these results, it can be inferred that a CNN model with a small number of convolutional layers may be limited in learning the true data distribution from a small dataset.

Comparing the models that used pre-trained weights (VGG16-TR and VGG16-TR-FT) to the one that did not (VGG16) also revealed that deep CNNs initialized with large-scale pre-trained weights outperformed those learned directly from small-scale data, with AUC improvements of 7% to 11%. Thus, when only a small-scale image dataset is available, this study suggests that transfer learning on deep CNN models with pre-trained weights can be an efficient solution for the classification of medical images, instead of learning a deep neural network from scratch.

Moreover, as shown in Table 2 and Figure 7, the results of this study also indicated an improvement in screening performance when fine-tuning some convolutional blocks in the deep CNN layers. In general, a deep CNN model learned from pre-trained deep neural networks on a large natural image dataset can be used to classify common images but cannot be well utilized for specific medical image classification tasks (Figure 8A). However, according to a previous study that described the effects and mechanisms of fine-tuning on deep CNNs [45], when certain convolutional blocks of a deep CNN model are fine-tuned, the model can be further specialized for specific classification tasks (Figure 8B). More specifically, the earlier layers of a deep CNN contain generic features that should be useful for many classification tasks, but later layers progressively contain features more specialized to the details of the classes contained in the original dataset (i.e., the large natural image dataset on which the deep CNN was originally trained). Using this property, when the parameters of the early layers are preserved and the parameters of the later layers are updated during training on a new dataset, the deep CNN model can be effectively used in new classification tasks. In conclusion, fine-tuning reuses the parameters learned from a previous training of the network on a large dataset and then adjusts the parameters of the later layers on the new dataset, improving the performance and accuracy in the new classification task. Consistent with the previous study, the fine-tuning technique, which freezes the weight parameters of some initial convolutional blocks of the VGG16 model and then updates the weight parameters of the later convolutional blocks (Figure 8B), showed higher performance than the other experimental groups. The conceptual diagram of this fine-tuning technique can be seen in Figure 8.

**Figure 7.** Comparison of Grad-CAM images from the other groups against original images showing true positives and true negatives in the best performing VGG16-TR-FT group.

**Figure 8.** The conceptual diagram of the fine-tuning technique in the transfer learning of a deep CNN.

The second major result of this study was the identification of the areas where image feature differences occurred when screening osteoporosis in DPR images, using the Grad-CAM technique. To understand and visualize the decisions of the deep CNN models, samples of correctly and incorrectly screened examples were reviewed (Figures 5 and 6). For additional insight into model decisions, the Grad-CAM technique was used to identify the areas of the input images that had the greatest impact on model classification. According to this review, the model does seem to identify the feature characteristics of osteoporosis in DPR images (e.g., cortical bone thinning). According to the Grad-CAM evaluation of this study, the DPR-based screening performance for osteoporosis was high when the image features were located in the middle region of the left and right sides of the mandibular lower border. This region is also consistent with the regions used to discriminate osteoporosis using DPR images in most previous studies [13–23], although the measurement algorithms differed. This indicates that most osteoporosis patients have characteristic image features, on DPR images, at the lower border of the cortical bone of the mandible. However, image quality issues, such as blurring, low contrast, and ghost images of adjacent objects, can cause incorrect predictions. When the image features were located in the center region of the mandible, or when ghost images of the hyoid bone were in the ROI, the accuracy was reduced. Therefore, to improve the deep CNN-based screening performance for osteoporosis in DPR images, it is suggested that the ROI be limited to the area around the middle of the left and right sides of the lower border of the mandible.

#### **5. Conclusions**

This study demonstrates the usefulness of transfer learning and fine-tuning with a deep CNN for the screening of osteoporosis in DPR images when the training dataset is limited. We applied various transfer learning techniques to the pre-trained VGG16 network for the discrimination of osteoporosis using a DPR image dataset labeled based on the T-score. The experimental results showed that transfer learning with pre-trained weights and fine-tuning achieved the highest overall accuracy of 84%. These results suggest that the combination of appropriate deep CNN architectures and transfer learning techniques effectively resolves the issue of a small training image set and that DPR images have potential for osteoporosis prescreening. In addition, using the Grad-CAM technique, this study provided a deep learning-based visual explanation of the areas where the image feature differences occurred. This confirmed previous osteoporosis screening studies using DPR images that set the ROI at the middle of the left and right sides of the lower border of the mandible. Given the increasing burden of osteoporosis on the global healthcare system as our population ages, and the proliferation of dental panoramic imaging devices, the results presented in this study suggest that deep learning-based image analysis of DPRs could serve an important role in cost-effective prescreening for patients unaware of osteoporosis. To further improve screening performance, future research is needed using different deep CNN architectures and deep learning techniques, larger and more rigorously labeled image datasets, and automated techniques for configuring a more limited ROI.

**Author Contributions:** Conceptualization, K.-S.L., J.-J.R. and S.-W.S.; Data curation, K.-S.L. and S.-K.J.; Formal analysis, K.-S.L.; Funding acquisition, K.-S.L.; Investigation, K.-S.L.; Methodology, K.-S.L.; Project administration, K.-S.L. and J.C.; Software, K.-S.L.; Supervision, J.C.; Validation, S.-K.J. and J.C.; Visualization, K.-S.L.; Writing—original draft, K.-S.L.; Writing—review & editing, K.-S.L. and J.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2019R1I1A1A01062961).

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Effect of an Electronic Medical Record-Based Screening System on a Rapid Response System: 8-Years' Experience of a Single Center Cohort**

#### **Se Hee Lee, Chae-Man Lim, Younsuck Koh, Sang-Bum Hong and Jin Won Huh \***

Department of Pulmonary and Critical Care Medicine, Asan Medical Center, University of Ulsan College of Medicine, 88, Olympic-ro 43-gil, Songpa-gu, Seoul 05505, Korea; celestia7@gmail.com (S.H.L.); cmlim@amc.seoul.kr (C.-M.L.); yskoh@amc.seoul.kr (Y.K.); sbhong@amc.seoul.kr (S.-B.H.) **\*** Correspondence: jwhuh@amc.seoul.kr; Tel.: +82-2-3010-3985

Received: 30 December 2019; Accepted: 21 January 2020; Published: 1 February 2020

**Abstract:** An electronic medical record (EMR)-based screening system has been developed as a trigger system for a rapid response team (RRT) that traditionally used direct calling. We compared event characteristics, intensive care unit (ICU) admission, and 28-day mortality following RRT activation of the two trigger systems. A total of 10,026 events were classified into four groups according to the activation time (i.e., daytime or on-call time) and the triggering type (i.e., calling or screening). Among surgical patients, the ICU admission was lowest for the on-call screening group (26.2%). Compared to the on-call screening group, the on-call calling group and daytime calling group showed higher ICU admission with an odds ratio (OR) of 2.07 (95% CI 1.50–2.84, *p* < 0.001) and OR of 2.68 (95% CI 1.91–3.77, *p* < 0.001), respectively. The 28-day mortality was lowest for the on-call screening group (8.7%). Compared to the on-call screening group, on-call calling (OR 1.88, 95% CI 1.20–2.95, *p* = 0.006) and daytime calling (OR 1.89, 95% CI 1.17–3.05, *p* < 0.001) showed higher 28-day mortality. The EMR-based screening system might be useful in detecting at-risk surgical patients, particularly during on-call time. The clinical usefulness of an EMR-based screening system can vary depending on patients' characteristics.

**Keywords:** clinical deterioration; early medical intervention; electronic health records; hospital rapid response team; intensive care units; medical records system; computerized

#### **1. Introduction**

Rapid response teams (RRTs) were widely deployed in the early 2000s to promptly detect deteriorating patients outside critical care and to provide appropriate advanced critical care early on [1]. RRTs are activated by calls from medical staff based on the calling criteria and clinical concern. Increasing the RRT dose could improve patient outcomes [2,3]. However, previous research indicates that only 30% of at-risk patients who satisfied the calling criteria received critical care from RRTs [1]. In addition, diurnal variation affects RRT activation and clinical outcomes. RRT calls frequently occur during the day, and diurnal variation in RRT utilization influences hospital mortality depending on the time of the call [4,5]. Delayed RRT activation occurs more frequently between midnight and 8:00 am and is associated with increased hospital mortality [6]. Infrequent activation during the early morning hours is followed by a spike in mortality at 7:00 am [7]. These findings suggest that delayed recognition of at-risk patients and suboptimal RRT utilization by caregivers at night result in poor patient outcomes.

Abundant clinical data derived from electronic medical records (EMR) can be utilized to improve not only health care quality but also point-of-care management by detecting clinical deterioration early. The 24-h accessibility of the EMR is beneficial in that automatic EMR monitoring by RRTs tends to reinforce the screening of at-risk patients. Currently, vital signs and certain laboratory data in the EMR are used as criteria for detecting deteriorating patients, functioning as an additional afferent limb of the RRT [8,9]. However, results of EMR-based RRT systems have been mixed [9–11]. In our hospital, an RRT with dual-triggering afferent limbs, which utilizes both direct calling from bedside doctors or nurses and 24-h EMR screening criteria, was introduced in 2008. We adopted a single-parameter EMR screening system. Previous research that assessed clinical outcomes in the first two years after deployment of the dual-triggering system reported that EMR screening resulted in lower intensive care unit (ICU) admission rates, but only surgically ill patients had reduced 28-day mortality rates [12]. This study aimed to analyze the event characteristics and clinical outcomes of RRT activations according to trigger type and activation time using the 8-year RRT cohort.

#### **2. Methods**

#### *2.1. Study Populations*

The study protocol was approved by the Institutional Review Board (2016-0857) of Asan Medical Center. Due to the retrospective nature, informed consent was not required, and patients' data were used anonymously. This study was conducted at an academic tertiary care hospital with approximately 2400 adult beds. All adult patients in general wards who received treatment from the RRT were eligible. RRT operated for 24 h a day, 7 days a week during the study period. As the purpose of the study was to compare clinical characteristics and effectiveness of two triggering systems in early detection and management of at-risk patients, RRT events which were categorized as cardiopulmonary cerebral resuscitation (CPCR), post-CPCR care, educational purpose, procedure assistance, and counseling for end-of-life were excluded. Patients who requested to be listed as "do not resuscitate" (DNR) were also excluded. If the patient had more than one event in the same admission period, only the first event was included for analysis to eliminate redundancy.

Based on the duty hours of training residents, daytime on weekdays was defined as 7:00 am to 5:59 pm, and on-call time as 6:00 pm to 6:59 am on the following day. Daytime on weekends or holidays was defined as 7:00 am to 11:59 am, and on-call time as 12:00 pm to 6:59 am on the following day. Nursing staff work in three shifts regardless of weekday or weekend: day shift (6:30 am to 2:30 pm), evening shift (2:30 pm to 10:30 pm), and night shift (10:30 pm to 6:30 am on the following day). We divided activation events into four groups by trigger type (calling vs. screening) and activation time (daytime vs. on-call time).
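The time-window definitions above can be sketched as a small classifier; the function name and interface are illustrative, not part of the study's software:

```python
from datetime import datetime

def classify_rrt_event(ts: datetime, trigger: str, weekend_or_holiday: bool) -> str:
    """Assign an RRT activation to one of the four study groups.

    Weekday daytime is 7:00 am to 5:59 pm; weekend/holiday daytime is
    7:00 am to 11:59 am; every other hour is on-call time. ``trigger``
    is either "calling" or "screening".
    """
    end_hour = 12 if weekend_or_holiday else 18
    daytime = 7 <= ts.hour < end_hour
    return ("daytime " if daytime else "on-call ") + trigger

# A weekday screening alert at 11:30 pm falls in the on-call window:
classify_rrt_event(datetime(2012, 3, 14, 23, 30), "screening", False)
```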

At this hospital, the RRT not only provides advanced critical care but is also actively involved in monitoring and assessing at-risk patients throughout the day. Direct calling from ward physicians and nurses activates the RRT. Additionally, the EMR-based screening system is utilized, which automatically activates the RRT when the pre-defined criteria based on the vital signs and laboratory measurements of the patients' medical records are met. Details of the criteria are shown in Table S1.

#### *2.2. Study Variables and Outcomes*

Data were routinely collected for patients' demographics, illness type (medical or surgical), RRT activation time and date (weekday or weekend), trigger parameter for activation, therapeutic intervention during the event (e.g., intubation, ventilator, high flow nasal cannula (HFNC), bilevel positive airway pressure (BiPAP), advanced cardiovascular life support (ACLS)), the outcome of the RRT intervention (e.g., ICU transfer vs. ward stay), and 28-day mortality following the event. If the alarm was triggered by both calling and screening, the first trigger was recorded. Patients' vital signs at the time of the event (e.g., systolic/diastolic blood pressure (BP), pulse rate (PR), respiratory rate (RR), body temperature (BT), and mental status) were also collected to risk-stratify patients. We calculated a modified early warning score (MEWS), which has been validated for both medical and surgical in-patients [13,14], and used this score for adjustment. MEWS is the sum of scores for five parameters: systolic BP, PR, RR, BT, and mental status. Each parameter and its pre-assigned score are described in Table S2.
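To make the scoring concrete, a MEWS total can be computed as the sum of five sub-scores. This is a sketch only: the cut-off bands below follow the commonly published MEWS scheme and are assumptions on our part, not necessarily the study's own assignments in Table S2.

```python
def mews(sbp: float, pr: float, rr: float, bt: float, mental: str) -> int:
    """Modified early warning score: the sum of five sub-scores.

    Assumption: the cut-off bands are the commonly published MEWS bands;
    the study's actual assignments are in Table S2 (not reproduced here).
    ``mental`` uses the AVPU scale: alert / voice / pain / unresponsive.
    """
    def band(value, cuts, scores):
        # scores[i] applies when value < cuts[i]; the last score otherwise
        for cut, s in zip(cuts, scores):
            if value < cut:
                return s
        return scores[-1]

    return (band(sbp, (71, 81, 101, 200), (3, 2, 1, 0, 2))
            + band(pr, (41, 51, 101, 111, 130), (2, 1, 0, 1, 2, 3))
            + band(rr, (9, 15, 21, 30), (2, 0, 1, 2, 3))
            + band(bt, (35.0, 38.5), (2, 0, 2))
            + {"alert": 0, "voice": 1, "pain": 2, "unresponsive": 3}[mental])

# A hypotensive, tachycardic, tachypneic patient responding to voice:
mews(sbp=88, pr=118, rr=26, bt=36.5, mental="voice")  # 1 + 2 + 2 + 0 + 1 = 6
```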

The number of RRT events per year and the number of RRT activations per clock hour were analyzed to identify the RRT activation pattern. RRT events per each clock hour were classified as per screening and calling which were further divided into doctor-calling, and nurse-calling. All events were categorized into four different groups: daytime calling, daytime screening, on-call calling, and on-call screening. The primary outcome considered was ICU admission after RRT activation. Twenty-eight day mortality following RRT activation was also assessed for the four groups.

#### *2.3. Statistical Methods*

Annual RRT activations and the number of RRT events for each clock hour are presented graphically. Differences between two groups (i.e., calling vs. screening in daytime, and calling vs. screening in on-call time) were tested using a Chi-square test for categorical variables, and the independent t-test for continuous variables. ICU admission and 28-day mortality were compared between the on-call screening group and the other three individual groups and presented as an odds ratio (OR) with a 95% confidence interval using a multivariate logistic regression model adjusting for age, sex, MEWS, weekend, and activation coding after univariate analysis. The risk factors for ICU admission and 28-day mortality were identified following univariate logistic analysis and statistically significant variables were further applied for the multivariate logistic regression. All statistical analyses were performed using SPSS software (version 24.0; IBM Corp., Armonk, NY).
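The adjusted odds ratios with Wald 95% confidence intervals reported in the Results were computed in SPSS; as a transparent sketch of the same computation, a logistic model can be fitted from scratch. The function name and the simulated data are illustrative.

```python
import numpy as np

def logistic_or_ci(X, y):
    """Fit a logistic regression by Newton-Raphson and return, for each
    covariate (column of X), a row [OR, 95% lower, 95% upper] with Wald
    limits. A from-scratch sketch of the adjusted-OR computation; the
    study itself used SPSS."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])          # prepend an intercept column
    beta = np.zeros(p + 1)
    for _ in range(50):                            # Newton-Raphson iterations
        mu = 1.0 / (1.0 + np.exp(-(Xd @ beta)))    # fitted probabilities
        H = Xd.T @ (Xd * (mu * (1.0 - mu))[:, None])  # Fisher information
        step = np.linalg.solve(H, Xd.T @ (y - mu))
        beta += step
        if np.max(np.abs(step)) < 1e-10:
            break
    se = np.sqrt(np.diag(np.linalg.inv(H)))        # Wald standard errors
    lo, hi = beta - 1.96 * se, beta + 1.96 * se
    return np.exp(np.column_stack([beta, lo, hi]))[1:]  # drop intercept row

# Simulated cohort: a binary exposure with true OR = exp(0.7), about 2.0
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=20000).astype(float)
y = (rng.random(20000) < 1 / (1 + np.exp(-(-1.0 + 0.7 * x)))).astype(float)
logistic_or_ci(x[:, None], y)   # one row: [OR, 95% lower, 95% upper]
```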

#### **3. Results**

From 1 January 2009 to 31 December 2016, 15,641 RRT events were identified (Figure 1); of these, 10,026 events were included for analysis. All events were classified according to activation time and trigger type. A total of 6293 events occurred during on-call time and 54.3% (*n* = 3419) were activated by screening rather than calling. Figure 2 represents the number of RRT activation events per year over the 8-year period. The number of RRT triggers by calling did not vary significantly between each year but activation by screening increased steadily from 2009. This increase was more prominent during on-call time.

**Figure 1.** Schematic flow chart of the study. Given that our rapid response team (RRT) performs a multifunctional role, only events related to early detection and management of at-risk patients were included. Among 15,641 eligible events, 10,026 were analyzed to describe the pattern of RRT activations. For the clinical outcome analysis, 9736 events were included after excluding 290 events with unavailable MEWS. CPCR, cardiopulmonary cerebral resuscitation; DNR, do not resuscitate; RRT, rapid response team; MEWS, modified early warning score.

**Figure 2.** The number of RRT events per year since 2009. Overall RRT events increased from 1055 in 2009 to 1627 in 2016. The total number of RRT activations by screening in 2016 was 2.63-fold higher than that in 2009. Data are presented as number of events.

The number of events for each clock hour is illustrated in Figure 3. In total, 4771 events (47.6%) were call-triggered (doctor calling and nurse calling). The number of activations by nurse calling was relatively stable across clock hours compared with that by doctor calling. RRT contacts were most frequent from midnight to 00:59 am (*n* = 952, 9.5%). The proportion of activations by screening was higher during on-call time, particularly from midnight to 00:59 am, when routine vital sign checks are performed by night-duty nurses.

**Figure 3.** RRT frequency by clock hour. Among 10,026 events, 4771 (47.6%) were triggered by calling and 5255 (52.4%) by screening. RRT contacts were most frequent from midnight to 00:59 am (*n* = 952, 9.5%). The next most frequent activation hours were 6:00 pm, 9:00 pm, and 8:00 am, in that order. Data are presented as number of events.

Patients' baseline characteristics, illness type, and MEWS are presented in Table 1. Approximately half of the patients in the screening group had a solid malignancy (daytime 50.3% vs. 35.3%, *p* < 0.001; on-call time 50.2% vs. 41.5%, *p* < 0.001). A greater proportion of patients in the screening group than in the calling group had a hematologic malignancy (daytime 19.3% vs. 13.4%, *p* < 0.001; on-call time 16.7% vs. 11.6%, *p* < 0.001). In contrast, the proportion of patients with chronic lung disease, cardiovascular disease, or neurologic disease was higher in the calling group. Among surgical patients, more at-risk patients were activated by calling than by screening (daytime 18.9% vs. 10.7%, *p* < 0.001; on-call time 19.6% vs. 12.9%, *p* < 0.001). MEWS was available in 9736 events and was significantly higher in the calling group than in the screening group (daytime 4.54 vs. 4.30, *p* = 0.031; on-call 4.57 vs. 4.37, *p* = 0.0017). Activation coding and type of intervention are presented in Table S3.


**Table 1.** Baseline characteristics of included events.

Among continuous variables, age is presented as median (interquartile range) and MEWS as mean ± SD. Categorical variables are presented as No. (%). MEWS was available in 9736 patients. \* *p*-value < 0.05, † *p*-value < 0.01, ‡ *p*-value < 0.001. A Chi-square test was used to compare daytime calling with daytime screening, and likewise on-call calling with on-call screening. MEWS = modified early warning score.

The overall ICU admission was 28.9% and 28-day mortality was 30% among 9736 patients. As more patients in the screening group had a malignancy (Table 1), a subgroup analysis was conducted to compare the clinical outcomes between patients with cancer and those without cancer (Table 2). Among patients with an underlying malignancy, the overall ICU admission rate was 21.5% and the on-call screening group displayed the lowest ICU admission (14.9%). The overall 28-day mortality was 40.6% and mortality was lower for the on-call calling group compared to the on-call screening group (OR 0.84, 95% CI 0.72–0.98). Among patients without cancer, overall ICU admission and 28-day mortality were 36.1% and 21.4%, respectively. ICU admission was lowest for the on-call screening group (22.3%) as was 28-day mortality (18.6%). Similar to patients with cancer, the daytime screening group had higher mortality compared to on-call screening group (OR 1.48, 95% CI 1.14–1.93).

**Table 2.** Clinical outcomes among patients with cancer and patients without cancer.
Data are presented as odds ratio (OR) with 95% confidence interval (CI). ICU admission and 28-day mortality were analyzed using a multivariate logistic regression model adjusting for age, sex, MEWS, weekend, and activation code. MEWS, weekend, and activation code were variables finally selected for the regression model for ICU admission in cancer patients. For 28-day mortality in cancer patients, sex, MEWS, weekend, and activation code were adopted variables for the regression model. Among patients without cancer, age, MEWS, and activation code were variables adopted for ICU admission and 28-day mortality. For patients with cancer: overall (*n* = 4980); on-call screening (*n* = 1971); daytime screening (*n* = 1131); on-call calling (*n* = 1188); daytime calling (*n* = 690). For patients without cancer: overall (*n* = 3256); On-call screening (*n* = 961); Daytime screening (*n* = 472); On-call calling (*n* = 1063); Daytime calling (*n* = 760).

Among patients with surgical illnesses (Table 3), overall ICU admission and 28-day mortality were 37.6% and 13.4%, respectively. The on-call screening group was significantly associated with lower ICU admission (26.2%); daytime screening had an ICU admission of 28.4% (OR 1.06, 95% CI 0.70–1.59, *p* = 0.794), on-call calling 42.4% (OR 2.07, 95% CI 1.50–2.84, *p* < 0.001), and daytime calling 46.1% (OR 2.68, 95% CI 1.91–3.77, *p* < 0.001). The on-call screening group was also associated with lower 28-day mortality (8.7%); the 28-day mortality for daytime screening was 12.9% (OR 1.44, 95% CI 0.82–2.51, *p* = 0.203), on-call calling 15.5% (OR 1.88, 95% CI 1.20–2.95, *p* = 0.006), and daytime calling 15.9% (OR 1.89, 95% CI 1.17–3.05, *p* = 0.0009).


**Table 3.** Clinical outcomes among patients with surgical illness.

Data are presented as odds ratio with 95% confidence interval (CI). ICU admission and 28-day mortality were analyzed using a multivariate logistic regression model adjusting for age, sex, MEWS, weekend, and activation code. Sex, MEWS, weekend, and activation code were variables finally selected for the regression model for ICU admission. For 28-day mortality, sex, MEWS, and activation code were adopted variables for the regression model. Overall (*n* = 1500); on-call screening (*n* = 412); daytime screening (*n* = 201); on-call calling (*n* = 528); daytime calling (*n* = 359).

The risk factors associated with ICU transfer and 28-day mortality following RRT activation are shown in Table S4. On-call time patients were less likely to be transferred to the ICU (OR 0.76, 95% CI: 0.68–0.84, *p* < 0.001) and had a lower mortality rate than daytime patients (OR 0.85, 95% CI: 0.78–0.94, *p* < 0.001). Compared to calling, RRT activation by screening was associated with a lower ICU transfer rate (OR 0.51, 95% CI: 0.46–0.57, *p* < 0.001) but a higher 28-day mortality rate (OR 1.19, 95% CI: 1.08–1.33).

#### **4. Discussion**

We evaluated the effects of an EMR-based screening system using the RRT cohort over an 8-year period following the deployment of a dual-triggering system. In addition to the calling system, the EMR-based screening system was implemented to increase the sensitivity of detecting at-risk patients. Between 1 January 2009 and 31 December 2016, a total of 10,026 events were included. The number of RRT activations by the screening system continually increased over the course of data collection. The higher proportion of the screening group (especially during on-call time) compared to the other study groups can be explained by two factors: (1) experience gained by the RRT over the study years might have increased the sensitivity of the screening system in detecting at-risk patients; and (2) the smaller number of doctors on duty compared to daytime might decrease on-call calling, eventually increasing the burden of RRT work during on-call time.

Among patients without cancer and surgical patients, the on-call screening group had lower ICU admission and lower 28-day mortality than the other groups, possibly due to early detection through EMR screening during on-call time. However, this positive effect of the screening system was not observed among at-risk patients with cancer. Although more patients in the on-call calling group were transferred to the ICU and had higher MEWS than those in the on-call screening group, 28-day mortality was significantly lower in the on-call calling group. The higher ICU admission rate in the calling group can be explained by the fact that activation by calling is (1) more likely to be associated with acute medical events and (2) more likely to reflect greater motivation on the part of attending physicians to treat the patient. In contrast, the daytime screening group had lower MEWS and lower ICU admission but higher 28-day mortality. This result suggests that various factors affect mortality in medically ill at-risk patients; a DNR agreement following RRT activation or consideration of end-of-life care might be closely related to 28-day mortality.

The dose-response effect of the RRT has been well described in many reports [15–17]: increasing RRT dose was associated with a dose-related reduction in cardiac arrests, which were most common overnight when the RRT dose was lowest. Because the trigger threshold under traditional calling criteria may vary with the experience or concern of ward physicians and nurses, achieving an optimal RRT dose is important. Real-time monitoring by an experienced RRT could increase the sensitivity of detecting clinical deterioration and improve clinical outcomes. As our results indicate, triggering frequency itself depends largely on the interval of vital sign measurements. Because ward vital signs are typically recorded by nurses at 8-h or 4-h intervals, activation frequency was higher during the regular vital sign check-up hours. Therefore, unless clinicians order frequent vital sign check-ups or laboratory tests for possible at-risk patients, indicators of deterioration may continue to be missed. Automatic continuous monitoring could overcome this limitation [18].

We used single parameters, including laboratory data such as lactate levels and arterial blood gas analysis, as the triggers for the EMR-based system. Previous research indicates varying degrees of sensitivity and accuracy among multiple aggregate weighted scoring systems (AWSS) for predicting ICU transfer, cardiac arrest, and mortality [19,20]. However, current AWSS depend mainly on vital signs, without assessing other characteristics of at-risk patients. Therefore, a modified scoring system that includes vital signs, laboratory data, and the characteristics of the patient population is necessary to increase the sensitivity and specificity of the screening system.

There are several limitations in this study. First, this is a single-center study and data were analyzed retrospectively. As our center is an academic tertiary hospital, the patients' disease severity is generally higher than in non-tertiary hospitals; thus, our results may not generalize well to non-teaching hospitals or hospitals with different patient populations. Nevertheless, the use of a screening system during on-call time seems to be associated with an improved 28-day mortality rate, particularly among surgically ill patients and patients without malignancy. As our hospital adopted a dual-triggering system at the time of the launch of RRT, for ethical and patient safety concerns, a prospective randomized trial comparing calling system and screening system based on the duty hours was not possible. A prospective multicenter study is required to evaluate the efficacy of the screening system more accurately. Second, the proportion of patients who agreed on a plan for end-of-life care following RRT activation was not considered. There are other factors affecting primary outcomes aside from the management of RRT, such as end-of-life care, spontaneous decisions by attending physicians, will of family caregivers, and the availability of ICU beds at the time of the RRT visit. Therefore, these results should be interpreted with careful consideration of multiple clinical factors, not by the RRT intervention alone.

#### **5. Conclusions**

Deployment of an EMR-based screening system offers additional improvement in detecting and managing at-risk patients, particularly during on-call time. However, the clinical effectiveness of this system can vary depending on patients' characteristics. The deployment of a modified screening system reflecting the physiologic parameters, laboratory measurements, and underlying diseases of the patient population at each hospital would maximize the beneficial role of the RRT in point of care management.

**Supplementary Materials:** The following are available online at http://www.mdpi.com/2077-0383/9/2/383/s1, Table S1: Criteria for medical alert team activation, Table S2: Modified early warning score (MEWS), Table S3: Baseline patients and event characteristics, Table S4: The multivariate analysis for the risk factors associated with ICU admission and 28-day mortality after RRT activation.

**Author Contributions:** S.H.L. contributed to analyzing data and writing of the manuscript. C.-M.L., Y.K., and S.-B.H. contributed in planning the study, clinical work, and approval of the submitted article. J.W.H. conceived the study; contributed in clinical work, integrity, and accuracy of data; preparation and approval of the submitted article. All authors have read and agreed to the published version of the manuscript.

**Funding:** This study was supported by a grant of the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (grant number: HI15C1106).

**Acknowledgments:** The authors thank the nurses of rapid response team for their contribution to this work.

**Conflicts of Interest:** The authors have no potential conflicts of interest to disclose.

#### **Abbreviations**

ACLS: Advanced cardiovascular life support; AWSS: Aggregate weighted scoring systems; BiPAP: Bilevel positive airway pressure; CI: Confidence interval; CPCR: Cardiopulmonary cerebral resuscitation; DNR: Do not resuscitate; EMR: Electronic medical record; HFNC: High flow nasal cannula; ICU: Intensive care unit; MEWS: Modified early warning score; RRT: Rapid response team

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Ensemble Learning to Improve the Prediction of Fetal Macrosomia and Large-for-Gestational Age**

### **Shangyuan Ye 1, Hui Zhang 2, Fuyan Shi 3, Jing Guo 4, Suzhen Wang 3,\* and Bo Zhang 5,\***


Received: 28 December 2019; Accepted: 28 January 2020; Published: 31 January 2020

**Abstract:** Background: The objective of this study was to investigate the use of ensemble methods to improve the prediction of fetal macrosomia and large for gestational age from prenatal ultrasound imaging measurements. Methods: We evaluated and compared the prediction accuracies of nonlinear and quadratic mixed-effects models coupled with 26 different empirical formulas for estimating fetal weights in predicting large fetuses at birth. The data for the investigation were taken from the Successive Small-for-Gestational-Age-Births study. Ensemble methods, a class of machine learning techniques, were used to improve the prediction accuracies by combining the individual models and empirical formulas. Results: The prediction accuracy of individual statistical models and empirical formulas varied considerably in predicting macrosomia but varied less in predicting large for gestational age. Two ensemble methods, voting and stacking, with model selection, can combine the strengths of individual models and formulas and can improve the prediction accuracy. Conclusions: Ensemble learning can improve the prediction of fetal macrosomia and large for gestational age and have the potential to assist obstetricians in clinical decisions.

**Keywords:** macrosomia; large for gestational age; machine learning; ensemble methods; prediction; sensitivity; specificity

#### **1. Introduction**

Excessive fetal growth poses risks to maternal and infant well-being [1]. The term "macrosomia" describes a fetus with a birth weight of more than 4000 g, regardless of gestational age [1,2]. Macrosomia is sometimes confused with "large for gestational age" (LGA), which describes an infant with a birth weight at or above the 90th percentile for gestational age [1]. Infants with either macrosomia or LGA, and their mothers, are at increased risk of perinatal morbidity and mortality [2,3]. Therefore, accurately predicting and diagnosing these conditions has been a major goal of obstetric investigators, for the purpose of early intervention or targeted perinatal care to reduce these risks.

Measurements taken from prenatal ultrasound imaging are the primary quantitative resources for predicting birth weights and diagnosing macrosomia or LGA. Zhang et al. [4] took the empirical formula given by Hadlock et al. [5] for estimating fetal weights and implemented a joint mixed-effects model to predict macrosomia and LGA. This procedure was a two-step supervised learning process. In the first step, an empirical formula was chosen to derive estimated fetal weights (EFWs) from sonographic ultrasound measurements at each gestational time point at which the measurements were taken and recorded. In the second step, a joint mixed-effects model with latent subject-specific random effects was fitted: one component was a quadratic mixed-effects model to derive the predicted birth weights (PBWs), and another was a probit mixed-effects model, from which the classification of macrosomia or LGA was determined by comparing the PBWs with a pre-specified threshold [6]. However, previous literature has noted that there are 26 candidate empirical formulas for estimating fetal weights [7–9], and some authors have argued that nonlinear mixed-effects models are more appropriate for modeling growth curves than linear or quadratic mixed-effects models [10]. Selecting one particular EFW empirical formula and a quadratic mixed-effects model, as in Zhang et al. [4], increases the uncertainty in predicting macrosomia or LGA, because the empirical formulas and statistical models behave quite differently.

In this article, we investigate the use of ensemble methods to aggregate prediction results given by different EFW empirical formulas and the statistical models. The goal of this practice is to improve the prediction in macrosomia or LGA. With ensemble methods, it is not required to select any specific statistical model or empirical formulas. Instead, the prediction capability of each combination of the empirical formulas and statistical models is combined and aggregated to generate a learning procedure that gives the best prediction performance.

#### **2. Methods**

#### *2.1. Data*

The Successive Small-for-Gestational-Age Births study (SGA study) was funded by the National Institute of Child Health and Human Development (NICHD), in the National Institutes of Health of the USA in 1983. The study was conducted concurrently by the University of Bergen in Norway, the University of Uppsala in Sweden, and the University of Alabama, USA, from 1984 to 1985 [11]. The initial goal of the SGA study was to characterize the different types of intra-uterine growth restriction and to assess the associated risk factors.

To demonstrate the strengths of ensemble methods in predicting fetal macrosomia or LGA, we took the Scandinavian data in the SGA study that were collected in Norway and Sweden. The Scandinavian SGA data were collected from 1 January 1986 through 31 March 1988 from nulliparous (parity 1) and primiparous (parity 2) Caucasian pregnant women, enrolled before the 20th gestational week, who had a singleton pregnancy and spoke one of the Scandinavian languages. A total of 6354 women were recruited, and 632 were excluded according to the exclusion criteria (n = 432) or due to absence of a first prenatal visit (n = 200). The remaining 5722 patients were split into three subgroups. A random sample of 561 patients was first selected. Then, a "high-risk" group of 1384 patients was identified from the remaining patients on the basis of five small-for-gestational-age (SGA) risk factors: a previous infant with a birth weight below 2750 g, maternal cigarette smoking at conception, pre-pregnancy weight below 50 kg, a previous perinatal death, and the presence of chronic maternal disease (chronic renal disease, hypertension, or heart disease). The remaining 3777 patients were considered the "low-risk" group. Patients from the random sample and the "high-risk" group were eligible for a detailed follow-up study, during which the women were examined at approximately the 17th, 25th, 33rd, and 37th weeks of gestation. At each visit, an ultrasound examination was performed, and demographic and medical information was collected. By the end of the Scandinavian SGA study, only 1945 women had completed the follow-up study.

In our study, we restricted the analysis to the 1115 women in the Scandinavian SGA study who had all four ultrasound examinations and complete covariate information (maternal age, pre-pregnancy body weight and height, previous disease history, and smoking history). The ultrasound examination records contained three fetal measurements: biparietal diameter (BPD), middle abdominal diameter (MAD), and femur length (FL).

#### *2.2. An Ensemble Learning Procedure of Predicting Macrosomia and LGA*

Here, we developed a four-step ensemble learning procedure to predict macrosomia or LGA from prenatal ultrasound measurements. The output of the learning procedure is the binary classification of either macrosomia or LGA. The prediction is conducted based on a set of input features. The primary input features are the sonographic measurements (BPDs, MADs, and FLs) collected at the 17th, 25th, 33rd, and 37th weeks of gestation in the Scandinavian SGA study, as well as gestational age at delivery. Other features include maternal age, pre-pregnancy body mass index, parity, smoking status, existing diabetes, and gestational diabetes. Figure 1 shows a diagram delineating the four steps of the ensemble learning procedure. Step 1 takes the sonographic measurements for each gestational week and each empirical formula in Table A1 to obtain EFWs at each gestational time point. Step 2 fits either a nonlinear mixed-effects model or a quadratic mixed-effects model to predict the corresponding PBWs at birth from the EFWs obtained in Step 1. Step 3 uses the PBWs to derive the classification of macrosomia or LGA with a specified threshold. Step 4 is the ensemble learning step, in which prediction outputs from the various empirical formulas and the nonlinear and quadratic mixed-effects models are combined to generate the ensemble prediction results.
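Step 4 can be illustrated with two minimal combiners, hard voting and stacking. The least-squares meta-learner below is a deliberately simple stand-in assumption, not the meta-classifier the authors actually used:

```python
import numpy as np

def majority_vote(preds):
    """Hard voting: ``preds`` is an (m, n) 0/1 array holding n predictions
    from each of m base learners (formula-model combinations); returns
    the majority class for each of the n cases."""
    return (preds.mean(axis=0) >= 0.5).astype(int)

def stack(preds_train, y_train, preds_test):
    """Stacking: fit a meta-learner on the base learners' training
    predictions, then apply it to their test predictions. The
    meta-learner here is a least-squares linear scorer thresholded
    at 0.5, a simple stand-in for a real meta-classifier."""
    A = np.column_stack([np.ones(preds_train.shape[1]), preds_train.T]).astype(float)
    w, *_ = np.linalg.lstsq(A, y_train.astype(float), rcond=None)
    B = np.column_stack([np.ones(preds_test.shape[1]), preds_test.T]).astype(float)
    return (B @ w >= 0.5).astype(int)

# Three base learners vote on three cases; the majority wins each case:
majority_vote(np.array([[1, 0, 1],
                        [1, 0, 0],
                        [0, 0, 1]]))  # -> array([1, 0, 1])
```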

#### *2.3. Estimated Fetal Weights with 26 Empirical Formulas*

Melamed et al. [7] summarized 26 different empirical sonographic formulas (see Appendix A, Table A1) that can be used to estimate fetal weights from sonographic ultrasound measurements. Our ensemble learning procedure considered all 26 formulas. Abdominal circumference (AC) in these formulas can be calculated as π × MAD (3.1416 × MAD), and head circumference (HC) can be derived by the formula introduced in [12].
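As a worked example of the EFW step, the sketch below derives AC from MAD as described and applies one published Hadlock (1985) BPD/AC/FL formula. The coefficients are quoted from the literature and should be checked against Table A1; the function name is ours.

```python
import math

def efw_hadlock(bpd_cm: float, mad_cm: float, fl_cm: float) -> float:
    """Estimated fetal weight in grams from BPD, MAD, and FL (all in cm).

    AC is derived from MAD as described in the text (AC = pi * MAD).
    The coefficients are one published Hadlock (1985) BPD/AC/FL model,
    used here for illustration; Table A1 is the authoritative list.
    """
    ac = math.pi * mad_cm                       # abdominal circumference, cm
    log10_efw = (1.335 - 0.0034 * ac * fl_cm
                 + 0.0316 * bpd_cm + 0.0457 * ac + 0.1623 * fl_cm)
    return 10.0 ** log10_efw

# Term-range measurements give a weight of roughly 3.2 kg:
efw_hadlock(bpd_cm=9.4, mad_cm=10.5, fl_cm=7.3)
```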

#### *2.4. Mixed-E*ff*ects Models for Predicting PBWs and Deriving the Classification of Macrosomia or LGA*

We built a nonlinear three-parameter mixed-effects logistic model with latent random effects to predict PBWs and derive the classification of macrosomia or LGA [10,13]. The three-parameter mixed-effects logistic model for the $i$th fetus ($i = 1, \cdots, n$ indexing the study subjects) at gestational time $t_{ij}$ ($j = 1, 2, 3, 4$ indexing the times when the gestational ultrasound measurements were recorded, $j = 5$ indexing the time of birth) is as follows:

$$y_{ij} = \frac{\phi_{1i}}{1 + \exp\left[-\left(t_{ij} - \phi_{2i}\right)/\phi_{3i}\right]} + \epsilon_{ij},$$

where $y_{ij}$ is the EFW obtained from one of the 26 formulas in Table A1 at gestational time $t_{ij}$ for $j = 1, 2, 3, 4$, and $y_{ij}$ is the birth weight when $j = 5$; $\phi_i = (\phi_{1i}, \phi_{2i}, \phi_{3i})$ are model parameters, in which $\phi_{1i}$ indicates the amplitude, $\phi_{2i}$ the smoothness, and $\phi_{3i}$ the stretch; and $\epsilon_{ij}$ are within-subject random errors. For the nonlinear mixed-effects model, each parameter $\phi_{ki}$ ($k = 1, 2, 3$ indexing the parameter) can be further modeled by a linear representation:

$$\phi_{ki} = X_{ij}'\beta_k + b_{ki},$$

where $X_{ij}$ is a vector of fixed-effects covariates, consisting of maternal age, pre-pregnancy body mass index, previous disease history (including diabetes, cardiac disease, high blood pressure, renal disorders, and other diseases), and smoking during pregnancy; and $\beta_k$ denotes the corresponding regression parameters. The first element of $X_{ij}$ equals 1 for the intercept. In our learning procedure, the covariates were included in the linear term $\phi_{1i} = X_{ij}'\beta_1 + b_{1i}$, while $\phi_{2i}$ and $\phi_{3i}$ were specified as $\phi_{2i} = \beta_2 + b_{2i}$ and $\phi_{3i} = \beta_3 + b_{3i}$. We also considered the nonlinear three-parameter mixed-effects logistic model without any covariates in $\phi_{1i}$, such that $\phi_{1i} = \beta_1 + b_{1i}$. The random effects $b_i = (b_{1i}, b_{2i}, b_{3i})$ represent the latent individual variations that are not explained by the covariates. We assumed the random effects $b_i$ independently follow a multivariate normal distribution with mean vector 0 and variance-covariance matrix $\Sigma$, where $\Sigma$ was assumed to be positive definite and unstructured. Further, we assumed the following heteroscedastic model [10] for the within-subject random errors $\epsilon_{ij}$:

$$\epsilon_{ij} \sim \mathcal{N}\left(0, \left[\frac{\phi_{1i}}{1 + \exp\left[-\left(t_{ij} - \phi_{2i}\right)/\phi_{3i}\right]}\right]^{2\delta}\right).$$
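Setting the random effects aside, the fixed-effects part of the three-parameter logistic curve can be fitted to a single fetus's EFW series by nonlinear least squares. This is a minimal sketch, not the authors' estimation procedure (which fits all subjects jointly, with latent random effects and the heteroscedastic error model); the measurement values are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic3(t, phi1, phi2, phi3):
    """Three-parameter logistic growth curve for fetal weight (grams):
    phi1 = amplitude, phi2 = smoothness, phi3 = stretch."""
    return phi1 / (1.0 + np.exp(-(t - phi2) / phi3))

# Hypothetical EFWs (grams) at gestational days ~ weeks 17, 25, 33, 37.
t_obs = np.array([119.0, 175.0, 231.0, 259.0])
w_obs = np.array([200.0, 800.0, 2100.0, 3000.0])

# Fit the fixed-effects curve to one fetus (random effects ignored).
params, _ = curve_fit(logistic3, t_obs, w_obs,
                      p0=[4000.0, 230.0, 30.0], maxfev=10000)

# Predicted birth weight at, say, 280 days of gestation.
pbw = logistic3(280.0, *params)
print(round(float(pbw)))
```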

**Figure 1.** An ensemble learning procedure to predict macrosomia or LGA from prenatal ultrasound measurements.

We compared the above nonlinear mixed-effects models with the following quadratic mixed-effects model implemented in Zhang et al. [4]:

$$y_{ij} = X_{ij}'\beta + \theta_1 t_{ij} + \theta_2 t_{ij}^2 + b_{i0} + b_{i1} t_{ij} + b_{i2} t_{ij}^2 + \epsilon_{ij}.$$

The configuration of $X_{ij}$, $\beta$, $b_i$, and $\epsilon_{ij}$ is identical to that in the nonlinear mixed-effects model, and $\theta = (\theta_1, \theta_2)$ are the fixed-effects parameters of $t_{ij}$ and $t_{ij}^2$. For the quadratic mixed-effects model, we also considered the version without any covariates for predicting the birth weights and deriving the classification of macrosomia or LGA.
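As a minimal sketch of the quadratic trend (ignoring the covariates $X_{ij}'\beta$ and the random effects $b_{i0} + b_{i1}t_{ij} + b_{i2}t_{ij}^2$), an ordinary least-squares quadratic can be fitted to one fetus's EFW series and extrapolated to the delivery time; the values are illustrative.

```python
import numpy as np

# Hypothetical EFWs (grams) for one fetus at four gestational days.
t_obs = np.array([119.0, 175.0, 231.0, 259.0])
w_obs = np.array([200.0, 800.0, 2100.0, 3000.0])

# Fit y = c0 + theta1*t + theta2*t^2 by ordinary least squares
# (np.polyfit returns coefficients from the highest degree down).
theta2, theta1, c0 = np.polyfit(t_obs, w_obs, deg=2)

# Predicted birth weight at 280 days of gestation.
pbw = c0 + theta1 * 280.0 + theta2 * 280.0**2
print(round(float(pbw)))
```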

#### *2.5. Ensemble Learning Methods*

Here, we propose to apply ensemble methods [14] to combine the prediction results generated from the 26 individual EFW empirical formulas and from the nonlinear and quadratic mixed-effects models. One appealing property of ensemble methods is that they combine the classification strengths of individual models without overfitting the data. Two types of ensemble algorithms, majority voting and stacking, are considered. Majority voting is one of the most fundamental ensemble methods for classification [14]. For a binary classification problem, the final classification of majority voting is the class that receives more than half of the votes from the individual learning models. Because the individual learning models can be correlated, it is necessary to select which individual learning models are combined in majority voting [15–17]. Here, we implemented the least absolute shrinkage and selection operator (LASSO) [18], the smoothly clipped absolute deviation (SCAD) penalty [19], and the minimax concave penalty (MCP) [20] to select individual learning models.
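Majority voting over binary base-learner labels can be sketched as follows; the selection step via penalized regression is omitted here (scikit-learn offers an L1/LASSO-type penalty, but SCAD and MCP require specialized packages).

```python
import numpy as np

def majority_vote(predictions: np.ndarray) -> np.ndarray:
    """Combine binary class labels from M base learners.

    predictions: array of shape (M, n) with entries in {0, 1};
    each row holds one base learner's labels for n subjects.
    Returns the class receiving more than half of the M votes.
    """
    votes = predictions.sum(axis=0)
    return (votes > predictions.shape[0] / 2).astype(int)

# Three base learners, four subjects (labels are illustrative only).
preds = np.array([[1, 0, 1, 0],
                  [1, 0, 0, 0],
                  [0, 1, 1, 0]])
print(majority_vote(preds).tolist())  # → [1, 0, 1, 0]
```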

The stacking method [21,22] is one of the best-known meta-learning methods. It combines the prediction results from several different individual learning models, called "first-level learners", by another learning model, called the "second-level learner" or "meta-learner". Van der Laan et al. [23] proposed training the meta-learner by a unified cross-validation algorithm, or super learner, and proved its oracle properties. The unified cross-validation algorithm is summarized as follows. For a *K*-fold cross-validation procedure (*K* = 10 here), the training dataset is split into *K* equal-sized groups, stratified by the response variable. Then, letting the *k*th group be the validation data, the remaining data are used to train the first-level learners, and the prediction values on the validation data are collected as the covariates of the meta-learner. By repeating the above procedure on every fold of data, a complete dataset, called the "level-one data", is generated, which, along with the original response variable, is used for training the meta-learner. Several reports [21,22,24] proposed using logistic regression with a positivity constraint on the regression coefficients as the meta-learner for classification, whereas Van der Laan et al. [23] used linear regression as the meta-learner for regression models. Debray et al. [24] suggested performing model selection on the meta-learner, so we applied the LASSO, SCAD, and MCP to conduct model selection on the meta-learner.
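The cross-validated construction of the meta-learner's training data can be sketched with scikit-learn's `cross_val_predict`, which collects exactly the out-of-fold predictions described above. The two base learners and the simulated data are stand-ins for the 104 mixed-effects models; the LASSO/SCAD/MCP selection on the meta-learner is omitted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# First-level learners (stand-ins for the individual mixed-effects models).
base_learners = [LogisticRegression(max_iter=1000),
                 DecisionTreeClassifier(max_depth=3, random_state=0)]

# Build the meta-learner's training data: 10-fold out-of-fold
# predicted probabilities, one column per first-level learner.
level_one = np.column_stack([
    cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
    for clf in base_learners
])

# Second-level learner (meta-learner) trained on the out-of-fold data.
meta = LogisticRegression(max_iter=1000).fit(level_one, y)
print(round(meta.score(level_one, y), 2))
```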

#### *2.6. Evaluation of Prediction Performance*

In our investigation, the original dataset was randomly divided into a training dataset (70%, n = 781) and a testing dataset (30%, n = 334), stratified by the presence of macrosomia or LGA. For an individual mixed-effects model with a particular EFW empirical formula, the entire training dataset, along with the ultrasound measures and demographic information in the testing dataset, was used to fit the model. The birth time, birthweight, and the true macrosomia or LGA status in the testing dataset were used to evaluate the prediction performance. For ensemble methods, the ensemble learner was trained by the training dataset and tested by the testing dataset.

The prediction accuracy of each individual mixed-effects model with a particular EFW empirical formula, as well as the ensemble learner, was assessed by the areas under the receiver operating characteristic curve (AUC), sensitivity, specificity, positive predictive value (+PV), negative predictive value (−PV), positive likelihood ratio (+LR), negative likelihood ratio (−LR), and Youden's index (sensitivity + specificity − 1) for predicting both macrosomia and LGA.
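All of the threshold-based metrics above follow directly from a 2 × 2 confusion matrix (the AUC additionally requires continuous scores and is omitted here). A small sketch with illustrative counts:

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Screening metrics from a 2x2 confusion matrix."""
    sens = tp / (tp + fn)        # sensitivity (recall)
    spec = tn / (tn + fp)        # specificity
    ppv = tp / (tp + fp)         # positive predictive value (+PV)
    npv = tn / (tn + fn)         # negative predictive value (-PV)
    plr = sens / (1 - spec)      # positive likelihood ratio (+LR)
    nlr = (1 - sens) / spec      # negative likelihood ratio (-LR)
    youden = sens + spec - 1     # Youden's index
    return {"sens": sens, "spec": spec, "+PV": ppv, "-PV": npv,
            "+LR": plr, "-LR": nlr, "youden": youden}

# Example: 40 true positives, 10 false positives, 20 false negatives,
# 130 true negatives (counts are illustrative only).
m = classification_metrics(tp=40, fp=10, fn=20, tn=130)
print(round(m["sens"], 3), round(m["spec"], 3), round(m["youden"], 3))
# → 0.667 0.929 0.595
```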

#### **3. Results**

Table 1 shows the baseline characteristics of our study subjects (n = 1115). The mean maternal age of the study subjects was 28 years (standard deviation or SD: 4 years), and the average height and weight before pregnancy were 166 centimeters (SD: 6 centimeters) and 59 kg (SD: 10 kg), respectively. Twenty-one percent of the women had a history of SGA births, and about half of them smoked at enrollment. Only a few had high blood pressure, cardiac disease, diabetes, or renal disorders, but 15% had other types of diseases. The mean gestational age at birth was 280 days (SD: 8 days), and the mean birthweight was 3562 g (SD: 478 g). Seventeen percent of infants had macrosomia, 11% were LGA at birth, and, among the LGA infants, 9 (7%) were not macrosomic. The current study sample was used as the reference population for defining LGA [11].


**Table 1.** Baseline characteristics of study subjects.

#### *3.1. Prediction Performance of Individual Models and Empirical Formulas in Macrosomia*

We predicted macrosomia or LGA with the nonlinear and quadratic mixed-effects models described above, considering those models both with and without the covariates. Thus, four mixed-effects models combined with 26 empirical formulas for EFWs, 104 learning models in total, were fitted. The prediction performance of the individual learning models in predicting macrosomia is reported in Appendix A Tables A2–A5. For the three-parameter mixed-effects logistic models, the prediction performance varied with different EFW empirical formulas, and adding covariates into the models did not improve their prediction performance (see Appendix A Tables A2 and A3). The three-parameter mixed-effects logistic models, either with or without covariates, predicted all birth weights to be under 4000 g when combined with empirical formula 3 in Table A1. The two nonlinear mixed-effects models combined with empirical formula 11 in Table A1 gave the highest Youden's index of 0.670, and the lowest Youden's index of 0.050 was obtained with empirical formula 18 in Table A1 without covariates. The best sensitivities of 0.857 and 0.839 were obtained from the models with and without covariates, respectively, when coupled with empirical formula 7 in Table A1. The AUCs for these models ranged from 0.871 to 0.910. The prediction performance of the quadratic mixed-effects models also varied among the different empirical formulas, and adding covariates did not improve their prediction performance either (see Appendix A Tables A4 and A5). The quadratic mixed-effects models combined with empirical formula 12 in Table A1 gave the highest Youden's index of 0.663, whereas the lowest Youden's index of 0.310 was obtained with empirical formula 3 in Table A1 without covariates. The best sensitivity of 0.982 was obtained from the models with and without covariates when coupled with empirical formulas 7 and 11 in Table A1. The AUCs for these models ranged from 0.857 to 0.905.

#### *3.2. Prediction Performance of Individual Models and Empirical Formulas in LGA*

LGA is defined as a newborn with a birthweight greater than the 90th percentile for gestational age. Here, given the input of sonographic ultrasound measures, both the nonlinear and quadratic mixed-effects models can generate PBWs for each fetus at any time point, regardless of its actual birth time. A newborn was classified as LGA if his or her PBW was above the 90th percentile of the PBWs of all other fetuses, assuming the fetuses were born at the same time as that newborn. The prediction performance of the nonlinear and quadratic mixed-effects models is reported in Appendix A Tables A6–A9. The prediction performance showed small variation among the different EFW empirical formulas and among these models. The Youden's indexes ranged from 0.354 to 0.526, the sensitivities ranged from 0.424 to 0.576, the specificities ranged from 0.930 to 0.950, and the AUCs ranged from 0.863 to 0.894.
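The percentile rule above can be sketched as follows; `classify_lga` is a hypothetical helper name and the cohort PBWs are simulated.

```python
import numpy as np

def classify_lga(pbw_newborn: float, pbws_at_same_time: np.ndarray) -> bool:
    """Flag a newborn as LGA if its predicted birth weight exceeds the
    90th percentile of the PBWs of all fetuses, evaluated as if they
    were born at the same time as that newborn."""
    return bool(pbw_newborn > np.percentile(pbws_at_same_time, 90))

# Illustrative PBWs (grams) for a cohort evaluated at one birth time.
rng = np.random.RandomState(1)
cohort = rng.normal(3500.0, 450.0, size=1000)
print(classify_lga(4500.0, cohort), classify_lga(3400.0, cohort))  # → True False
```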

#### *3.3. Prediction Performance of Ensemble Methods*

Two ensemble methods, the voting and stacking methods, were applied to combine the 104 learning models for the prediction of macrosomia. The voting method was implemented using two approaches: voting from all learning models without selection and voting from the selected learning models. To select among the learning models, a penalized logistic regression was run with either a LASSO, SCAD, or MCP penalty. When the stacking method was implemented, we fit a linear regression model as our meta-learner, either without variable selection or with one of three variable selection methods (LASSO, SCAD, or MCP). The prediction performance of the ensemble methods for the prediction of macrosomia is summarized in Table 2. The voting and stacking methods with the SCAD selection yielded the best Youden's indexes of 0.681 and 0.688, respectively, which were higher than the Youden's indexes generated by any of the individual learning models in Tables A2–A5. The voting method with the SCAD and MCP selection and the stacking method with the LASSO, SCAD, and MCP selection outperformed most of the individual learning models listed in Tables A2–A5. The voting method with the SCAD or MCP model selection generated AUCs of 0.932 and 0.924, respectively, higher than the AUCs generated by any of the individual learning models in Tables A2–A5.

We also applied the voting and stacking methods to combine the 104 learning models for the prediction of LGA. The voting method was implemented as described for the prediction of macrosomia. When the stacking method was implemented, we fit, as our meta-learner, a logistic regression with positivity constraints [22] without further selection on the first-level learners, as well as logistic regressions with the LASSO, SCAD, and MCP selection on the first-level learners. The results are summarized in Table 3. These results showed that the best predictions were supplied by the voting method with the MCP selection, with a Youden's index of 0.537 and a sensitivity of 0.636, both higher than those of any of the individual learning models in Tables A6–A9.



**Table 2.** Prediction performance of macrosomia by the ensemble methods.

**Table 3.** Prediction performance of large for gestational age by the ensemble methods.


Areas under the receiver operating characteristic curve (AUC), confidence interval (CI), positive likelihood ratio (+LR), negative likelihood ratio (−LR), positive predictive value (+PV), negative predictive value (−PV).

#### **4. Discussion and Conclusions**

We proposed using ensemble methods to combine the strengths of nonlinear and quadratic mixed-effects models and 26 empirical formulas of EFWs to predict macrosomia and LGA of newborns from sonographic ultrasound measurements. The prediction performance of the ensemble methods was studied with data from the Scandinavia SGA study. We showed that the prediction performance varied among the empirical formulas and mixed-effects models. The three-parameter mixed-effects logistic model combined with empirical formula 11 in Table A1 gave the best prediction results in predicting macrosomia. In predicting LGA, the prediction performance also varied, and the best prediction results were obtained from the quadratic mixed-effects model combined with empirical formula 6 in Table A1. These results showed that it was difficult to select any single statistical learning model combined with only one empirical formula of EFWs to predict macrosomia or LGA from sonographic ultrasound measurements.

We subsequently proposed applying ensemble methods to aggregate the prediction results from the mixed-effects models and empirical formulas of EFWs, to achieve better prediction performance. Our investigation showed that, with the aid of either the SCAD or MCP for model selection, both the stacking and voting methods improved the prediction accuracy in predicting macrosomia over that of the individual statistical models and empirical formulas. The voting method with the MCP for model selection predicted LGA more accurately than the individual statistical models and empirical formulas. In this study, the ensemble prediction algorithms were created from all the prenatal ultrasound measures and birth weights. However, the algorithms can be used to make predictions of macrosomia or LGA with ultrasound measures only from the first or second trimester, although the ultrasound measures collected in the third trimester or before birth can substantially improve the prediction accuracy. Our current study is a feasibility study that demonstrates that ensemble methods can be integrated with ultrasound examinations to assist obstetricians in diagnosing whether a pregnant woman will give birth to a large infant, and to further guide clinical interventions for the condition. However, measurable clinical benefits are unclear until they are demonstrated in prospective clinical studies.

Our current study has several limitations. The proportion of macrosomia or LGA infants in the Scandinavia SGA dataset is small (15% macrosomia and 11% LGA of n = 1115 infants). This may largely influence the prediction accuracy of the individual models and empirical formulas. However, using machine learning methods, specifically ensemble methods, can accommodate such imbalanced data and improve prediction accuracy [23]. In addition, the initial objective of the SGA study was to characterize intra-uterine growth restriction and assess the associated risk factors of SGA; the study was not designed for studying macrosomia or LGA. As a consequence, the risk factors associated with macrosomia or LGA were not thoroughly collected, which may be why, in our study, adding covariates into the mixed-effects models brought no benefit. Lastly, the SGA study was conducted in the 1980s with out-of-date sonographic ultrasound examination technologies, so the prediction models developed in this study may not directly apply to ultrasound measures from the most recent state-of-the-art ultrasound technologies. However, the ensemble methods can still be applied to aggregate any available ultrasound prediction models. Also, the subjects in this study were Caucasians from Europe who were mostly not obese (11.6% overweight and 2.1% obese in the study). New studies and data on a diverse population should substantially improve the prediction of macrosomia and LGA among both whites and minorities, as well as among the overweight and obese population.

**Author Contributions:** S.Y., H.Z., F.S., and B.Z. conceived of the presented methodology. S.Y., F.S., and B.Z. verified the analytical methods and conducted data analysis. S.Y., F.S., and B.Z. took the lead in writing the manuscript. H.Z., S.W., and J.G. provided critical feedback, helped shape the research and analysis, and helped in manuscript writing. All authors approved the manuscript contents. All authors have read and agreed to the published version of the manuscript.

**Funding:** Dr. Suzhen Wang's research was partially supported by the National Natural Science Foundation of China (No. 81872719), the National Bureau of Statistics Foundation Project (No. 2018LY79), the Natural Science Foundation of Shandong Province (No. ZR201807090257), and the Poverty Alleviation Fund Project of Weifang Medical University (No. FP1801001). Dr. Fuyan Shi's research was partially supported by the National Natural Science Foundation of China (No. 81803337), the Shandong Provincial Government Fund for Overseas Study (No. 27, 2019, Lu-Jiao), the Shandong Science and Technology Development Plan Project (No. 2015 WS0067), and the Weifang Medical University Doctoral Foundation Project (No. 2017BSQD51). Dr. Hui Zhang receives support from the Robert H. Lurie Comprehensive Cancer Center of Northwestern University in Chicago, IL, which is supported in part by an NCI Cancer Center Support Grant #P30CA060553, and the Mesulam Center for Cognitive Neurology and Alzheimer's Disease of Northwestern University, which is supported in part by an NIA Center Support Grant #P30AG13854.

**Conflicts of Interest:** The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

#### **Appendix A**

**Table A1.** Empirical formulas used for estimating fetal weight from sonographic ultrasound measurements (Melamed et al. [5]).


Abdominal circumference (AC), femur length (FL), biparietal diameter (BPD), and head circumference (HC) are expressed in centimeters, and EFW is expressed in grams, unless stated otherwise. This table is identical to Table 1 in Melamed et al. [5].


See Table A1 for specifications of empirical formulas. Areas under the receiver operating characteristic curve (AUC), confidence interval (CI), positive likelihood ratio (+LR), negative likelihood ratios (−LR), positive predictive values (+PV), negative predictive values (−PV).





**Table A5.** Prediction performance of the quadratic mixed-effects model with covariates when combined with the 26 estimated fetal weight empirical formulas \*.


**\*** See Table A1 for specifications of empirical formulas. Areas under the receiver operating characteristic curve (AUC), confidence interval (CI), positive likelihood ratio (+LR), negative likelihood ratios (−LR), positive predictive values (+PV), negative predictive values (−PV).








**Table A9.** Prediction performance of the quadratic mixed-effects model with covariates when combined with the 26 estimated fetal weight empirical formulas \*.

**\*** See Table A1 for specifications of empirical formulas. Areas under the receiver operating characteristic curve (AUC), confidence interval (CI), positive likelihood ratio (+LR), negative likelihood ratios (−LR), positive predictive values (+PV), negative predictive values (−PV).

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

### *Article* **Seasonal Variation in Physical Activity among Preoperative Patients with Lung Cancer Determined Using a Wearable Device**

**Sunga Kong 1,2,†, Hye Yun Park 3,†, Danbee Kang 1,4, Jae Kyung Lee 2, Genehee Lee 2, O Jung Kwon 3, Young Mog Shim 5, Jae Ill Zo 5,\* and Juhee Cho 1,4,6,\***


Received: 22 December 2019; Accepted: 22 January 2020; Published: 27 January 2020

**Abstract:** We aim to examine how season and temperature level affect physical activity, measured using a wearable device, among patients scheduled to undergo surgical resection of lung cancer. Physical activity (PA) data from the wearable device were analyzed by season for 555 preoperative lung cancer patients from the CATCH-LUNG cohort study. The seasons were divided into spring, summer, autumn, and winter using the study enrollment date before surgery. The overall mean (SD) age was 61.1 (8.9) years, and the mean (SD) daily steps in spring, summer, autumn, and winter were 11,438 (5922), 11,147 (5065), 10,404 (4403), and 8548 (4293), respectively. In the fully-adjusted models, patients in the winter season had 27.04% fewer daily steps (95% CI = −36.68%, −15.93%) and 35.22% less time spent performing moderate to vigorous physical activity (MVPA) compared to patients in the spring. The proportion of participants taking over 8000 daily steps and the duration of MVPA were significantly lower in the winter than in the spring. In particular, daily steps had a negative linear association with wind chill temperature in patients who lived in Seoul. In conclusion, PA was significantly lower in the winter, and this difference was more pronounced in patients who had low cardiorespiratory function.

**Keywords:** lung cancer; physical activity; season; preoperative; wearable

#### **1. Introduction**

Lung cancer is the leading cause of cancer-related death worldwide, contributing to 1.6 million deaths annually [1]. Surgical resection remains the best curative treatment option in patients with early-stage non-small cell lung cancer (NSCLC), and a patient's preoperative status is important to assess the feasibility of undergoing surgical lung resection under general anesthesia. In particular, patients with poor pulmonary function and cardiorespiratory fitness are considered inoperable due to increased morbidity and mortality after surgical resection [2]. Both cardiopulmonary fitness and functional capacity are widely recognized as strong predictors of postoperative complications, specifically mortality and long-term survival, in NSCLC [3].

Cardiorespiratory fitness (CRF) and functional capacity are affected by physical activity (PA) [4]. Numerous studies have conducted PA or exercise programs to improve physical fitness and function before thoracic surgery [2,5]. Adopting and sustaining a more physically active lifestyle has been shown to reduce the risk of complications and mortality and to improve health-related quality of life in patients who underwent thoracic surgery [6]. However, promoting long-term PA has been challenging due to various factors such as lack of motivation, limited access to facilities for physical activities, and inclement weather [7]. Studies have reported that seasonal weather conditions can promote or deter PA [8]. Furthermore, most studies were conducted with a small number of participants (<50) and only included limited populations such as children [9] or the elderly [10]. In addition, few quantitative assessments have focused on seasonal variation in PA among preoperative lung cancer patients. Studies have attempted to measure daily PA quantitatively using simple and inexpensive devices during the perioperative periods of lung cancer surgery [11–13]. They found that daily walk distance predicted maximum oxygen consumption per minute in patients undergoing lung resection [11], and that the time and quality of patients' daily ambulatory activity decreased during the first postoperative month [12]. Thus, our study aims to use a wearable device (Fitbit) to examine how season and temperature level affect PA among patients who are scheduled to undergo surgical resection for lung cancer.

#### **2. Methods**

#### *2.1. Subjects and Data Sources*

Patients with lung cancer in this study were selected from the Coordinate Approach to Cancer patients' Health for Lung Cancer (CATCH-LUNG) cohort of preoperative lung cancer patients, enrolled between March 2016 and October 2018 at the Samsung Medical Center in Seoul, Korea. Inclusion criteria for the CATCH-LUNG cohort were (1) patients who were expected to undergo curative lung cancer surgery for suspected or histologically confirmed NSCLC, (2) patients who were able to walk and maintain a normal life with an Eastern Cooperative Oncology Group Performance Status (ECOG PS) ≤1, and (3) patients who understood the purpose of this study and agreed to participate. Exclusion criteria were (1) patients who had undergone neoadjuvant treatment before surgery, (2) patients free of NSCLC after pathological exams, (3) patients whose surgery was canceled, or (4) patients who withdrew consent before baseline data collection. Among patients who met these criteria, we further excluded 2 patients with pathologically confirmed stage IV cancer after surgery and 63 patients who lacked Fitbit data (28 patients wore their Fitbit for less than one day, and 35 patients had hardware or software failures of the Fitbit). The final study sample included 555 patients. The study protocol was approved by the Institutional Review Board of Samsung Medical Center (no. 2015-11-025). Written informed consent was obtained from all participants.

#### *2.2. Grouping and Weather Data Collection*

The main exposure variable was the season, which was divided into spring (March to May), summer (June to August), autumn (September to November), and winter (December to February) using the study enrollment date before surgery. For preoperative patients living in the metropolitan Seoul area, the weather-related factor of wind chill temperature was obtained from the Korea Meteorological Administration (https://data.kma.go.kr).
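The month-to-season grouping can be expressed as a small helper; `season_of` is a hypothetical name.

```python
from datetime import date

def season_of(enrollment: date) -> str:
    """Map a study enrollment date to the study's season grouping."""
    m = enrollment.month
    if 3 <= m <= 5:
        return "spring"
    if 6 <= m <= 8:
        return "summer"
    if 9 <= m <= 11:
        return "autumn"
    return "winter"  # December to February

print(season_of(date(2017, 1, 15)), season_of(date(2017, 10, 2)))  # → winter autumn
```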

#### *2.3. Physical Activity and Other Variables*

The main outcome was PA, which was assessed using a reliable wearable activity tracker [14]. The Flex tracker (Fitbit, San Francisco, CA, USA) was used to quantify the PA intensity, time, activity type, and steps per day. We asked participants to wear a Fitbit activity tracker 24 h per day for 7 consecutive days. We determined that patients did not wear the Fitbit if there was no movement (0 steps) for more than 4 consecutive hours during daytime (9:00 a.m. to 4:00 p.m.). Physical activity time, frequency, and intensity data were automatically measured and saved by an internal sensor. We then calculated activity level by combining the Fitbit data with age, gender, height, and weight data recorded at registration. Activities were classified into four categories: (1) sleeping or sedentary activity (1 MET), (2) light physical activity (1~2.9 METs), (3) moderate physical activity (3~5.9 METs), and (4) vigorous physical activity (more than 6 METs). The calculated values were averaged to define daily activity level.
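The four-category MET classification can be sketched as below; how the boundaries at exactly 1, 3, and 6 METs are assigned is an assumption, since the text gives only the ranges.

```python
def activity_category(mets: float) -> str:
    """Classify activity intensity by metabolic equivalents (METs),
    following the four categories described in the text."""
    if mets <= 1.0:
        return "sleeping or sedentary"
    if mets < 3.0:
        return "light"
    if mets < 6.0:
        return "moderate"
    return "vigorous"

print(activity_category(1.0), activity_category(4.5), activity_category(7.0))
# → sleeping or sedentary moderate vigorous
```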

The tracker was worn on the wrist. To reduce bias, the device had no screen so that patients could not see their recorded level of PA. In addition, there was no specific education or guidelines for patients regarding PA and patients were recommended to maintain PA prior to surgery, as usual.

CRF was measured using the 6-min walk test (6MWT), which was performed according to ATS guidelines [15]. Each participant was asked to walk (not run) back and forth along the corridor as far as possible for 6 min and was given standardized verbal encouragement every minute. The test has been widely used for preoperative and postoperative evaluations of CRF. In some clinical situations, the 6MWT provides a better index of the patient's ability to perform daily activities compared to peak oxygen uptake [16]. In addition, the 6MWT is a sub-maximal test of CRF, as opposed to cardiopulmonary exercise testing (CPET). While it is well known that the incremental shuttle walk test (ISWT) has a higher correlation with CPET than 6MWT, we could not use the ISWT because it was not available in Korea. In fact, 6MWT has been widely used in real-world clinics for cardiopulmonary evaluation in patients with chronic obstructive pulmonary disease (COPD) [17]. To obtain quality data, trained researchers provided study participants detailed instructions about 6MWT, and asked patients to do a pilot walk (for 15~20 s) before the actual test.

Spirometry and DLco measurements were performed using a Vmax 22 respiratory analyzer (SensorMedics, OH, USA) according to the American Thoracic Society/European Respiratory Society criteria [18,19]. Absolute values of forced expiratory volume in 1 s (FEV1), forced vital capacity (FVC), and DLco were obtained, and the percentages of predicted values (% predicted) for FEV1, FVC, and DLco were calculated using a representative Korean sample [20,21] as the reference. Sociodemographic and behavioral information, including age, smoking status, and comorbidities, was recorded before surgery using a questionnaire. Information regarding pathological stage and pulmonary function was collected after surgery.

#### *2.4. Statistical Analysis*

Continuous and categorical variables were compared among the seasons using analysis of variance (ANOVA) and the χ² test, respectively.

For the main analyses, we used linear regression to compare the daily number of steps and the moderate-to-vigorous physical activity (MVPA) time by season. Since steps per day and MVPA minutes were markedly right-skewed (*p*-values of the Shapiro–Wilk and Shapiro–Francia tests for normality were <0.001), we used the log-transformed daily number of steps and MVPA time as the outcomes. The average difference (as a percent difference with a 95% confidence interval (CI)) was estimated comparing patients enrolled in the summer, autumn, or winter to patients enrolled in the spring.
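Because the outcomes were log-transformed, a regression coefficient β on a season indicator converts to a percent difference relative to the reference season via 100·(e^β − 1). A minimal sketch of this back-transformation (the function name and the example coefficient are illustrative, not taken from the paper's models):

```python
import math


def percent_difference(beta: float) -> float:
    """Convert a coefficient from a log-outcome regression into the
    percent difference relative to the reference category."""
    return 100.0 * (math.exp(beta) - 1.0)


# A log-scale coefficient of about -0.315 corresponds to roughly a
# 27% reduction, similar in magnitude to the winter-vs-spring
# step difference reported in the Results.
winter_effect = percent_difference(-0.315)
```

The same transformation applies to the CI limits, which is why the reported intervals (e.g., −36.68% to −15.93%) are asymmetric around the point estimate.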

Using the daily number of steps and MVPA duration, we developed a dichotomized outcome to evaluate the proportion of participants who were physically active. Being physically active was defined as either taking more than 8000 steps per day or performing more than 60 min of MVPA per day. We chose 8000 steps per day based on a reference indicating that adults usually take about 5000 steps per day through daily activities such as household errands, walking, or shopping [22]. We then added 3000 steps per day to account for 30 min of MVPA (10 min of MVPA corresponds to around 1000 steps [23,24]). From this, we equated 8000 steps to 60 min of MVPA.
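The dichotomization and the step arithmetic above can be sketched as follows (the names are illustrative, not from the study's code):

```python
# Cut-off construction described in the text: ~5000 baseline steps/day
# plus ~1000 steps per 10 min of MVPA, applied to a 30-min bout.
BASELINE_STEPS = 5000
STEPS_PER_10MIN_MVPA = 1000
STEP_CUTOFF = BASELINE_STEPS + 3 * STEPS_PER_10MIN_MVPA  # 8000 steps/day


def is_physically_active(daily_steps: float, mvpa_minutes: float) -> bool:
    """Active = more than 8000 steps/day OR more than 60 min/day of MVPA."""
    return daily_steps > STEP_CUTOFF or mvpa_minutes > 60
```

A patient exceeding either criterion counts as active, so a low step count can be offset by a long MVPA bout and vice versa.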

Logistic regression was conducted to compare the odds of being physically active by season with adjustments made for age, sex, smoking status, FEV1% pred, any pulmonary comorbidities (COPD, asthma, or ILD), and any extra-pulmonary comorbidities (cardiovascular diseases or diabetes mellitus). In addition, we performed stratified analyses to evaluate differences in PA associated with the season in prespecified subgroups of age (<65 vs. ≥65 years) [25] and CRF (<500 vs. ≥500 m in 6MWD) [26].

To determine the factors underlying the seasonal differences in PA, we conducted an additional analysis restricted to patients (*n* = 85) living in the Seoul metropolitan area, to minimize bias from differences in local environment and topography across Korea. To assess the association between weather and daily steps, we modeled wind chill temperature as a continuous variable using restricted cubic splines with knots at the 5th, 35th, 65th, and 95th percentiles of the sample distribution, providing a flexible estimate of the dose–response relationship between weather factors and daily steps.
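A restricted cubic spline basis can be parameterized in several equivalent ways; the paper does not specify which, so the sketch below assumes Harrell's truncated-power form, which is linear beyond the boundary knots, with knots at the stated percentiles (all data here are synthetic):

```python
import numpy as np


def rcs_basis(x: np.ndarray, knots: np.ndarray) -> np.ndarray:
    """Restricted cubic spline basis with k knots: returns design
    columns [x, s_1, ..., s_{k-2}], linear beyond the boundary knots."""
    t = knots
    k = len(t)
    d = t[k - 1] - t[k - 2]

    def pos3(u):
        # truncated cubic: (u)_+^3
        return np.where(u > 0, u ** 3, 0.0)

    cols = [x]
    for j in range(k - 2):
        s = (pos3(x - t[j])
             - pos3(x - t[k - 2]) * (t[k - 1] - t[j]) / d
             + pos3(x - t[k - 1]) * (t[k - 2] - t[j]) / d)
        cols.append(s)
    return np.column_stack(cols)


# Knots at the 5th, 35th, 65th, and 95th percentiles, as in the text.
rng = np.random.default_rng(0)
temp = rng.normal(10, 8, 500)                # synthetic wind chill values
knots = np.percentile(temp, [5, 35, 65, 95])
X = rcs_basis(temp, knots)                   # regression design columns
```

With four knots this yields three design columns, so the fitted curve costs only two extra degrees of freedom over a straight line while remaining flexible in the interior.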

For all analyses, a *p*-value of <0.05 was considered statistically significant. All analyses were performed using STATA software, version 14 (Stata Corp LP, College Station, TX, USA).

#### **3. Results**

The characteristics of the 555 patients are described in Table 1. The mean (SD) age and daily steps of the study participants were 61.1 (8.9) years and 10,603 (5200) steps (range, 425–32,143), respectively. Fitbit data were collected from 30.8% (*n* = 171), 32.1% (*n* = 178), 17.8% (*n* = 99), and 19.3% (*n* = 107) of the study subjects in the spring, summer, autumn, and winter seasons, respectively. Although patients in the winter season tended to be younger and have better pulmonary function than patients in the other seasons, clinical characteristics, including sex, BMI, smoking status, comorbidities, pathologic stage, and cardiorespiratory fitness, did not differ across the seasons (Table 1).


**Table 1.** Characteristics of study participants by season (*n* = 555).



Values are presented as either n (%) or mean (SD). COPD, chronic obstructive pulmonary disease; ILD, interstitial lung disease; FVC, forced vital capacity; FEV1, forced expiratory volume in 1 s; DLco, diffusing capacity of carbon monoxide.

The mean (SD) daily steps were 11,438 (5922), 11,147 (5065), 10,404 (4403), and 8548 (4293) in patients who participated in this study in the spring, summer, autumn, and winter seasons, respectively. The proportion of patients who had 8000 or more steps per day was 71%, 72%, 71%, and 54% in participants who received surgery in the spring, summer, autumn, and winter seasons, respectively (Figure 1).

**Figure 1.** Mean daily steps and the proportion of participants who had more than 8000 steps in each season.

In the fully-adjusted models, patients in the winter season had a daily step count that was 27.04% lower (95% CI = −36.68%, −15.93%) than that of patients in the spring season. In a comparison of mean (SD) MVPA time, patients in the winter season had the lowest MVPA (spring, 60.3 (57.2) min/d; summer, 58.5 (46.4) min/d; autumn, 51.0 (35.6) min/d; winter, 35.0 (36.3) min/d). In the fully-adjusted models, compared to patients in the spring season, MVPA time was significantly shorter, by 35.22% (95% CI = −49.18%, −17.43%), for patients in the winter season. The number of steps and the duration of MVPA were significantly lower in the winter season than in the spring season, irrespective of age and 6MWD (Table 2).


**Table 2.** Differences in physical activity by season.

Models were adjusted for age, sex, smoking status, FEV1% pred, any pulmonary comorbidities, and any extra-pulmonary comorbidities. Pulmonary comorbidities include chronic obstructive pulmonary disease, asthma, or interstitial lung disease, and extra-pulmonary comorbidities include cardiovascular diseases or diabetes mellitus. 6MWD, 6-min walk distance. MVPA, moderate-to-vigorous physical activity.

In the fully-adjusted models, the OR for taking 8000 or more steps per day was 0.46 (95% CI 0.28, 0.77) for patients in the winter compared to those in the spring. The proportion of patients who performed 60 min or more of MVPA per day showed a similar seasonal pattern (spring, 30.4%; summer, 42.7%; autumn, 32.3%; winter, 18.7%). In the fully-adjusted models, the OR for MVPA ≥ 60 min/day was 0.52 (95% CI = 0.29, 0.94) for patients in the winter compared to those in the spring. In particular, among patients with low CRF (6MWD < 500 m), the OR was 0.30 (95% CI 0.10, 0.87) for winter compared to spring (Table 3).


**Table 3.** Odds ratios (95% CI) for physical activity by season.

Models were adjusted for age, sex, smoking status, FEV1% pred, any pulmonary comorbidities, and any extra-pulmonary comorbidities. Pulmonary comorbidities include chronic obstructive pulmonary disease, asthma, or interstitial lung disease, and extra-pulmonary comorbidities include cardiovascular diseases or diabetes mellitus. 6MWD, 6-min walk distance. MVPA, moderate-to-vigorous physical activity.

In the spline regression models, the relationship between wind chill temperature and daily steps was linear below a wind chill temperature of 25 °C, with lower daily steps corresponding to colder temperatures (Figure 2). The association between winter and the decline in PA was consistent across all the subgroups.

**Figure 2.** Mean daily steps (95% CI) by wind chill temperature. The curves represent daily steps (solid line) and their 95% confidence intervals (dashed lines) based on restricted cubic splines for wind chill temperature with knots at the 5th, 35th, 65th, and 95th percentiles of the sample distribution. The reference value (diamond dot) was set at the 90th percentile.

#### **4. Discussion**

We found that preoperative PA was significantly lower in the winter season compared to the other seasons, irrespective of age and CRF. In particular, the winter reduction in MVPA was more pronounced in patients with low CRF, although this effect was not statistically significant. In terms of weather factors, wind chill temperature was inversely associated with PA. These results demonstrate that the PA of preoperative lung cancer patients is significantly affected by season and temperature.

Our study showed that daily steps were lower by 27% in the winter season compared with the spring season. These findings are consistent with previous studies showing an association between daily PA and season [8,27–30]. Those studies observed that PA decreased during the winter season in healthy adults [8] and elderly subjects [29]. This phenomenon was especially pronounced in patients with lung and heart diseases: in COPD patients, the daily step count decreased by 43.3 steps/day per °C in the winter season [27], and in heart failure patients, there was a significant seasonal variation in activity between summer and winter, ranging from 13.78% to 20.69% [30]. In addition, the odds of walking more than 8000 steps or performing more than 60 min of MVPA per day in winter were around 0.5 compared to those in spring. This means that only about half of the patients in the winter season achieved levels of PA similar to those of patients in the spring. It would therefore be important for clinicians and researchers to evaluate PA in different seasons, especially in winter, and to provide more tailored PA education depending on the season. In fact, the cold and snow of winter weather can significantly reduce participation in outdoor activity. In particular, influenza during the winter season is an important contributor to the winter disease burden among older adults and patients with lung disease [31]. Furthermore, slippery roads and a high risk of falling can contribute to the decline in PA [32]. When outdoor activities are complicated by weather conditions, indoor activity equipment such as treadmills and stationary bicycles offers a helpful alternative. A recent study found that preoperative exercise was effective in reducing postoperative complications and the length of hospital stay in patients with lung cancer [6], suggesting the importance of maintaining PA before lung cancer surgery. Another study found that pulmonary rehabilitation programs using treadmills and bicycles were a valid preoperative strategy to improve physical performance in patients with both NSCLC and COPD [33]. It would be worthwhile to develop a PA intervention program for winter and evaluate its impact on postoperative pulmonary complications.

We found that the winter decline in the proportion of patients achieving MVPA ≥ 60 min/day was more pronounced in patients with low CRF (6MWD < 500 m) than in those with high CRF (6MWD ≥ 500 m). Lower exercise tolerance, as assessed by the 6MWD, is associated with poorer postoperative outcomes after lung resection [34]. However, in a previous study, a 7-day intensive preoperative program of PA combined with inspiratory muscle training increased the 6MWD in lung cancer patients [35]. Increasing PA has positive effects on CRF and muscle power, and thus on health-related physical function, leading to improved postoperative recovery in early-stage NSCLC patients. In previous studies, structured and planned preoperative PA reduced postoperative complications (by 45% to 67%) and the length of hospital stay (by 4–5 days) in patients with lung cancer [3,6], suggesting that preoperative PA is an effective strategy for lung cancer patients who will undergo surgical resection. Thus, a strategic PA program should be recommended to maintain and promote PA even in the winter season, especially for low-CRF patients and older adults who are candidates for lung cancer surgery.

However, seasons are a crude measure when it comes to understanding the effects of weather on PA [36]. Previous studies have examined several season-related factors affecting daily activities. In a healthy population, there was a 2.9% decrease in steps per day for every 10 °C drop in temperature [37]. In addition to temperature, day length and daylight hours account for 73% of the monthly differences in daily activity level, and these three parameters are independent predictors of daily activities [38]. To analyze the climate conditions affecting PA in our study, we conducted a subgroup analysis of the relationship between climate conditions and activity for the 85 participants living in the capital city of South Korea. We found that wind chill temperature significantly affected the steps per day of preoperative patients. In the spline regression models, the relationship between wind chill temperature and daily steps was linear below a wind chill temperature of 25 °C, with lower daily steps corresponding to colder temperatures. This is similar to previous findings [30,39]. The wind chill temperature used in our study is likely to be more closely related to patients' PA behavior than absolute temperature, as it is a quantitative measure of the heat or cold that the human body feels, calculated from air temperature and wind velocity [40]. While it is not possible to change the weather, a better understanding of how weather influences PA may help in developing strategies to ameliorate the impact of adverse weather conditions on future PA interventions in low-CRF patients and older adults.
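The text cites [40] for the wind chill definition but does not reproduce the formula; assuming the widely used JAG/TI wind chill index (adopted by the U.S. National Weather Service and Environment Canada, with air temperature in °C and wind speed in km/h measured at 10 m), a minimal sketch is:

```python
def wind_chill_c(temp_c: float, wind_kmh: float) -> float:
    """JAG/TI wind chill index in degrees Celsius. The formula is
    intended for air temperatures <= 10 C and wind speeds > 4.8 km/h."""
    v = wind_kmh ** 0.16
    return 13.12 + 0.6215 * temp_c - 11.37 * v + 0.3965 * temp_c * v
```

For example, an air temperature of 0 °C with a 20 km/h wind yields a wind chill of roughly −5 °C, illustrating how wind amplifies perceived cold relative to the measured air temperature.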

There are several limitations to this study. Firstly, this was a cross-sectional study, and we were not able to observe changes in PA within the same patient group across different seasons. In addition, due to the enrollment periods, a limited number of patients were observed in the winter season. These factors limited our ability to estimate the magnitude of seasonal variation in PA. Despite these limitations, the seasonal differences in PA were robust: patients in the winter season were younger and had better lung function than patients in the other seasons, yet they still showed lower levels of PA. Secondly, patients who were motivated to perform PA might have been more likely to participate in the study. The mean daily step counts in this study were higher than those reported in earlier studies [11]. This might be due to our recruitment of patients with ECOG 0 or 1, who would be healthier and more physically active than general cancer patients. However, the mean 6MWD was similar to that reported in a previous study [26], suggesting that our participants had CRF similar to that of other lung cancer patients. Thirdly, the activity tracker counts only steps and the time spent taking them; it does not distinguish whether steps were taken indoors or outdoors. Therefore, we cannot conclusively attribute the lower winter activity to decreased outdoor PA. Lastly, the results of this study might not be generalizable, as it was conducted with patients at a single tertiary cancer center in Seoul, Korea. Additional studies are necessary to confirm these findings in different populations and settings.

#### **5. Conclusions**

In this study, we found that season and temperature affect PA among preoperative lung cancer patients. Patients were much less physically active in the winter season than in the other seasons, and the winter decline was more pronounced in patients with low cardiorespiratory fitness. Health professionals need to be aware of these seasonal differences and recommend indoor physical activities that preoperative patients with lung cancer can perform in winter.

**Author Contributions:** Conceptualization: S.K., H.Y.P., J.I.Z., and J.C.; study design: S.K., H.Y.P., J.I.Z., and J.C.; data collection: S.K., J.K.L., G.L., O.J.K., Y.M.S., and J.I.Z.; data and statistical analysis: S.K., D.K., and J.C.; drafting the manuscript: S.K., H.Y.P., J.I.Z., and J.C.; revising the manuscript: S.K., H.Y.P., D.K., J.K.L., G.L., O.J.K., Y.M.S., J.I.Z., and J.C. All authors have read and agreed to the published version of the manuscript.

**Funding:** This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) [No. 2015R1C1A2A01055805 and 2017R1A2B2006435].

**Conflicts of Interest:** The authors declare no conflict of interest.

#### **References**


© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

